Bayesian Language Detection

Excellent – I managed to write a Bayesian filter today from scratch (well, something that vaguely resembles and works like one). The application is a fairly common one – you pass it a slab of text and it detects which language the content is written in.

This is going to be used in the background of Blogwise, for blogs where I can’t autodetect the language from metadata, and will silently run on new blog submissions.

The application is relatively straightforward. I feed it a load of text content known to be in a specific language. It ranks the words by frequency and links the resulting chart to the language. When it’s done it has a frequency chart for each language. So, for example, the top words in English are the, of, and; the top in French are de, la, le; in Spanish it’s de, la, que; and so on.
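To sketch the training step in Python (not the actual Blogwise code – the tokenisation, the top-100 cutoff and all names here are my own simplification), each sample text becomes a chart of its most common words with their relative frequencies:

```python
from collections import Counter
import re

def build_profile(text, top_n=100):
    """Build a language 'frequency chart': the top_n most common
    words in the training text, each with its relative frequency."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    # Keep only the most common words, stored as word -> frequency.
    return {word: n / total for word, n in counts.most_common(top_n)}

# One profile per language, built from known-language training text.
profiles = {
    "english": build_profile("the cat sat on the mat and the dog ..."),
    "french":  build_profile("le chat et le chien de la maison ..."),
}
```

Run over real training text, the top entries of each profile come out as the familiar function words (the, of, and for English; de, la, le for French).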

You’ll notice that Spanish and French share two of the same top words – it’s an art to get the checking algorithm tweaked to produce accurate results. The algorithm takes many words into consideration, and particularly looks for words that exist or are popular in one language but are rare in another.
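The checking step might be sketched as naive-Bayes-style scoring (again an illustration with made-up toy profiles, not the real algorithm): sum the log frequency of each word under each language, and give words absent from a profile a tiny floor probability. That floor is what makes a word popular in one language but missing from another pull the scores sharply apart:

```python
import math
import re

# Toy frequency charts (word -> relative frequency); a real run
# would build these from training samples.
PROFILES = {
    "english": {"the": 0.07, "of": 0.04, "and": 0.03},
    "french":  {"de": 0.06, "la": 0.04, "le": 0.04},
    "spanish": {"de": 0.06, "la": 0.04, "que": 0.03},
}

def classify(text, profiles=PROFILES, floor=1e-6):
    """Score the text against each language by summing log word
    frequencies; unseen words fall back to a small floor, heavily
    penalising languages where the text's words are rare."""
    words = re.findall(r"\w+", text.lower())
    scores = {
        lang: sum(math.log(profile.get(w, floor)) for w in words)
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get)
```

The gap between the best score and the runner-up also gives a natural confidence measure – which is how the UK/US English case can come out as a “close second place”.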

The sample data is rather thin, but already I’m seeing promising results. For each of the six languages tested (French, Spanish, German, Italian and English (UK & US)), I provided five samples. I then tested them with two further samples.

The results were very promising (even with such a small test group) – not only was the system correct on all but one case, it was making enough of a distinction between languages for the confidence to be very high indeed.

The only error was an incorrect assessment of a UK English blog as US English, but there’s hardly a massive difference there. US English was a close second place in its assessment, and I’m probably going to drop regional differences anyway (UK or US can be confirmed from the country – and does the reader really care?).

So this is good news, and all in the name of progress for Blogwise! Not bad for a few hours’ work.

Spammiest Blog Domains

Of the blog domains from which over 100 blogs have been submitted to Blogwise, one comes out top of the spam list: 15.35% of its blogs have been rejected as spam, compared with 5% on Blogspot (which is usually cited as a fairly spammy domain).

It’s an interesting stat – but one where I’ve been very keen to emphasise how the results should be treated. It’s not a recommendation for or against any particular domain, nor are the stats used to prejudice against blogs that come from a ‘spammy’ host.

Search Optimisations

I’m still hacking away at the Blogwise search, trying to improve things. My latest move seems to have cut cached request times by up to half (and sliced 0.5 to 0.6 seconds off the uncached search results). As ever, Grabperf shows the details.

The latest speed increase is because I’ve rewritten the aggregator as an HTTP server of its own (previously it was run from mini_httpd). DBI and all the other modules are loaded in the server itself, so individual requests can be served immediately.
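The original is presumably Perl (given DBI), but the idea translates to a short Python sketch: do the expensive module loading and connection setup once, when the server process starts, so each request only does its own small bit of work. Everything below (sqlite3 standing in for the real database, the handler name) is illustrative, not the actual aggregator:

```python
import sqlite3  # stand-in for DBI; imported once at startup, not per request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Expensive setup happens once, when the server starts.
# check_same_thread=False lets the serving thread reuse the connection.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE hits (path TEXT)")

class AggregatorHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Each request reuses the already-loaded modules and the open
        # connection, so it can be answered immediately.
        db.execute("INSERT INTO hits VALUES (?)", (self.path,))
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok\n")

    def log_message(self, *args):
        pass  # keep the demo quiet

# To run standalone:
# HTTPServer(("", 8080), AggregatorHandler).serve_forever()
```

Contrast this with a CGI-style setup under mini_httpd, where every hit pays for interpreter startup, module loading and a fresh database connection before any real work begins.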

In fact, the aggregator is clocking in at about 0.09 seconds for cached results – the 0.5+ seconds you see on Grabperf reflect other bits, such as page rendering & transmission, DNS lookups and TCP connection (Grabperf’s servers are in the States).
At some point, if anybody is interested, I’ll write up how I’m doing this in detail.