Bayesian Language Detection

Excellent – I managed to write a Bayesian filter from scratch today (well, something that vaguely resembles one and works like one). The application is a fairly common one: you pass it a slab of text and it detects which language the content is written in.

This is going to be used in the background of Blogwise, for blogs where I can’t autodetect the language from metadata, and will silently run on new blog submissions.

The application is relatively straightforward. I feed it a load of text content known to be in a specific language. It ranks the words by frequency and links them to that language. When it’s done, it has a frequency chart for each language. So, for example, the top words in English are the, of, and; the top in French are de, la, le; in Spanish they’re de, la, que; and so on.
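As a rough illustration of the training step described above, here is a minimal Python sketch. It is not the code running behind Blogwise: the function names (tokenize, build_frequency_tables), the short language labels, and the tiny sample sentences are all made up for the example, and the tokenizer simply treats any run of letters as a word.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of letters only (accented letters included).
    return re.findall(r"[^\W\d_]+", text.lower())


def build_frequency_tables(samples):
    """Turn (language, text) pairs into one word-frequency Counter per language."""
    tables = {}
    for language, text in samples:
        tables.setdefault(language, Counter()).update(tokenize(text))
    return tables


# Hypothetical toy samples; real training text would be far longer.
samples = [
    ("en", "the cat sat on the mat and the dog sat by the door"),
    ("fr", "le chat de la maison et le chien de la rue"),
]
tables = build_frequency_tables(samples)
print(tables["en"].most_common(1))  # [('the', 4)]
```

Counter.most_common gives the ranked chart directly, so the top words per language fall out of the tables without any extra sorting code.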

You’ll notice that Spanish and French share two of their top three words, so it’s an art to get the checking algorithm tweaked to produce accurate results. The algorithm takes many words into consideration, and particularly looks for words that exist or are popular in one language but are rare in another.
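One common way to get this behaviour is a naive Bayes score with add-one smoothing, sketched below in Python. This is an assumption about how such a checker could work, not a description of the actual algorithm: the names (train, log_scores, detect), the equal weighting of languages, and the toy word lists are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of letters only.
    return re.findall(r"[^\W\d_]+", text.lower())


def train(samples):
    """Build one word-count table per language from (language, text) pairs."""
    counts = {}
    for language, text in samples:
        counts.setdefault(language, Counter()).update(tokenize(text))
    return counts


def log_scores(counts, text):
    """Sum the log-probability of each word under each language's table."""
    vocabulary_size = len(set().union(*counts.values()))
    words = tokenize(text)
    scores = {}
    for language, table in counts.items():
        total = sum(table.values())
        # Add-one smoothing: an unseen word is penalised, not given zero probability.
        scores[language] = sum(
            math.log((table[word] + 1) / (total + vocabulary_size)) for word in words
        )
    return scores


def detect(counts, text):
    """Return the language with the highest score (all languages weighted equally)."""
    scores = log_scores(counts, text)
    return max(scores, key=scores.get)


# Hypothetical toy training data: a handful of common words per language.
counts = train([
    ("fr", "de la le et les des que est une pour dans"),
    ("es", "de la que el en y los del se las por"),
])
print(detect(counts, "que el perro"))      # es
print(detect(counts, "le chat est dans"))  # fr
```

Words that are equally common in both languages, such as de and la here, add the same amount to every score and so cancel out. The decision ends up resting on words like el or le that are frequent in one language and absent from the other.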

The sample data is rather thin, but already I’m seeing promising results. For the six languages tested (French, Spanish, German, Italian, UK English and US English), I provided five samples to learn from. I then tested them with two further samples.

The results were very promising (even with such a small test group) – not only was the system correct in all but one case, it was also making enough of a distinction between languages for the confidence to be very high indeed.
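For a sense of what a confidence figure can look like, the Python sketch below turns per-language log scores into shares that add up to 1. How the real system measures confidence isn't spelled out here, so treat this as one plausible approach. The function name confidences is hypothetical, and the scores are invented numbers with placeholder labels, not output from the tests above.

```python
import math


def confidences(log_scores):
    """Normalise log scores into shares that sum to 1 (a softmax)."""
    # Subtract the best score first so that exp() cannot underflow to zero for every entry.
    peak = max(log_scores.values())
    weights = {lang: math.exp(score - peak) for lang, score in log_scores.items()}
    total = sum(weights.values())
    return {lang: weight / total for lang, weight in weights.items()}


# Invented scores: two near-identical candidates and one distant one.
example = {"lang_a": -41.2, "lang_b": -41.9, "lang_c": -58.0}
shares = confidences(example)
best = max(shares, key=shares.get)
print(best, round(shares[best], 2))
```

When two candidates score almost the same, the winner's share stays well short of 1, which is one way a "close second place" would show up.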

The only error was an incorrect assessment of a UK English blog as US English, but there’s hardly a massive difference. US English was a close second place in its assessment, and I’m probably going to drop regional differences anyway (UK or US can be confirmed from the country – and does the reader really care?)

So this is good news, and all in the name of progress for Blogwise! Not bad for a few hours’ work.

4 thoughts on “Bayesian Language Detection”

  1. Very interesting project!

    As for the difference between UK English and US English, is there that much need for distinction? As an American expat who has adopted mostly British spelling in my day-to-day writing, couldn’t it be assumed that many more hybridise their writing as well? Although one could split hairs and draw boundaries and code rules regarding Canadian English and Australian English, I doubt it would matter which side of the Atlantic the nuances originate from.

    It is another matter entirely when considering Portuguese v Spanish. Although several spellings are similar and even identical, the languages are intrinsically different, whereas in English, if I spell it ‘color’ or ‘colour’ no one will really care.

    Great work Sven — I look forward to seeing more in your blog about this! 😀

  2. The original intention of splitting en-UK and en-US was to use the language as one of the factors when determining the country, e.g. if it’s en-UK there’s more chance of a UK blog (although, of course, they could be an expat!)

    The system works fine for distinctive languages (I haven’t tried Portuguese vs. Spanish yet, or any of the Far-East languages), but it’s not sophisticated enough to detect regional differences, and because of the results and the well-made points you highlighted I’m not going to try.

    I hadn’t realised these kinds of articles would be very interesting to anybody else – if you do enjoy them I’ll definitely write more, thanks!

  3. I’m sort of a language geek. I can only speak my native tongue well, but the very structure and presentation of language interests me. Apparently, so do geeky ways of dissecting it. 😉
