General Theory

It would be nice to have some software which recognized languages. For example, given some greetings, I want to be able to produce something like this:

Greeting        English   German   Czech   French
Good morning      99.8%     0.1%    0.1%     0.0%
Guten Morgen       0.4%    98.2%    0.2%     1.2%
Dobre jitro        0.3%     0.0%   99.5%     0.2%
Bonjour           11.8%     0.0%    7.4%    80.7%

The actual motivation for this is not to parse greetings, but rather to see which geocaches in places like Prague have English descriptions. Accordingly I can't just compile a list of common greetings and map them onto the relevant languages.

In fact I'd like to have a model of language. That is, something which, given a string of characters, will return the probability that the string comes from that language.

Moreover, I'd like to have multiple models, each corresponding to a different language: then I can ask 'How likely are these data, given that they're in French?', or '... in English?', and so on.

Formally, the language model tells us the likelihood of the data given our assumptions i.e. p(D|L). From this Bayes' theorem lets us calculate the probability of each language given the data:

p(L|D) = p(D|L) p(L) / p(D)

If we assume, a priori, that all languages are equally likely then p(L) is the same for each of them, and p(D) doesn't depend on the language at all. Accordingly,

p(L|D) ∝ p(D|L).
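
In other words, to compare the candidate languages we only need a likelihood for each. To turn those likelihoods into percentages like the ones in the table at the top, just normalize over the languages under consideration (still assuming a uniform prior):

p(L|D) = p(D|L) / Σ p(D|L'),

where the sum runs over all the candidate languages L'.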

A toy language model

Recall that the language model must be able to award a number to each and every string saying how likely it is. For example, in English we'd expect 'the cat sat on the mat' to get a higher score than 'dfkdsfnksd'.

Happily it isn't necessary to actually understand a language to write a model for it. Rather we can just look at lots of English text and infer a model from it: we can think of this as teaching the model about English by getting it to study a corpus of English sentences.

It's important to realize that there isn't one 'right' model, and that some models might be very sophisticated. For example a clever model might award a higher score to 'the cat sat on the mat' than 'the cat sat on the dog'.

Our resources are much more modest, so we'll do something very much simpler. In fact I'm interested in the simplest model which has a reasonable chance of working.

Two preliminary points. First, regard the data as a sequence of characters:

D = { d1, d2, d3, ... }.

Second, we can always expand joint probability distributions: p(A,B) = p(A) p(B|A). Applying this twice:

p(d1, d2, d3, ...) = p(d1) p(d2|d1) p(d3, ...|d1,d2)

What does this mean ? Well, the first term p(d1) is just the probability of the first letter. We might build our model by looking at the start of lots of English texts and counting how often we saw each letter.

The next term p(d2|d1) is more complicated: it's the probability of the second letter given a particular first letter. For example, if the first letter is 'T' then it's quite likely to be followed by 'H'. Again this is something we could estimate by looking at lots of texts.

We could go on like this, but at some point we'd find ourselves asking what we'd expect to see after T,H,E,C,A,T,S,A,T,O,N,T,H,E,M,A. This approach just isn't feasible: with 26 letters there are 26^16 ≈ 4 × 10^22 possible sixteen-letter contexts, so quite apart from the computational burden it would be impossible to find enough data to get reasonably representative samples for strings this long.

Rather than regarding the string of characters blindly, we have to start breaking it up.

The simplest way to do this is to just ignore the conditioning on previous letters:

p(d2|d1) ≅ p(d2).

This is equivalent to making the assumption that all letters are independent: the probability of the string is just the product of the probabilities of each letter in the string.

p(D) = ∏i p(di).

Now our training process is much easier: once we've seen enough text to estimate the probabilities of each character there's little point in doing much more.
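
As a concrete illustration, here's a minimal sketch of such a unigram model in Perl. It's purely illustrative (the file name corpus.txt and the restriction to ASCII letters are my own simplifications), not the code in the tarball discussed below:

#! /usr/bin/perl

use strict;
use warnings;

# Count single characters in a training corpus.
my %count;
my $total = 0;

open my $fh, '<', 'corpus.txt' or die "corpus.txt: $!";
while (my $line = <$fh>)
  {
    $line = lc $line;              # ignore case
    $line =~ s/[^a-z]//g;          # keep only letters (see 'Spaces' below)
    foreach my $c (split //, $line)
      {
        $count{$c}++;
        $total++;
      }
  }
close $fh;

# Under the independence assumption the probability of a string is
# just the product of the single-character probabilities.  Unseen
# characters get probability zero: see 'Unseen tokens' below.
sub string_prob
  {
    my $s = lc shift;
    $s =~ s/[^a-z]//g;
    my $p = 1;
    $p *= ($count{$_} || 0) / $total foreach split //, $s;
    return $p;
  }

print string_prob('the cat sat on the mat'), "\n";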

In practice, this model doesn't do too badly, but it relies rather heavily on different languages having different letter distributions. In some very unscientific tests, this failed to discriminate between short strings of English and French.

Perhaps we've just gone a bit too far: instead of completely ignoring the context of each character, let's keep the single previous character.

Formally, expand the joint probability distribution again but this time go a step further:

p(d1, d2, d3, d4, ...) = p(d1) p(d2|d1) p(d3|d1,d2) p(d4, ...|d1,d2,d3)

Now approximate:

p(d3|d1,d2) ≅ p(d3|d2),
p(d4|d1,d2,d3) ≅ p(d4|d3),
...

and thus,

p(D) = p(d1) ∏i>1 p(di|di-1).

This is a simple model for language which uses the probability of seeing each pair of letters. Normally we'd call this a bi-gram model. We'll obviously need a lot more data to train it properly than we needed for the uni-gram version above, but it should still be tractable.
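
To make the training step concrete, here's a rough Perl sketch of how the pair counts might be collected and turned into conditional probabilities. Again it's only an illustration (the tiny hard-coded 'corpus' is obviously a stand-in for real training text), not the compile script from the tarball:

#! /usr/bin/perl

use strict;
use warnings;

# A tiny stand-in for real training text.
my $text = lc 'The cat sat on the mat';
$text =~ s/[^a-z]//g;

# Count adjacent pairs of characters: %pair counts 'x followed by y',
# and %first counts 'x followed by anything'.
my (%pair, %first);
foreach my $i (0 .. length($text) - 2)
  {
    my ($x, $y) = (substr($text, $i, 1), substr($text, $i + 1, 1));
    $pair{$x}{$y}++;
    $first{$x}++;
  }

# Estimate p(y|x) as count(x followed by y) / count(x followed by anything).
sub bigram_prob
  {
    my ($x, $y) = @_;
    return 0 unless $first{$x} && $pair{$x}{$y};
    return $pair{$x}{$y} / $first{$x};
  }

printf "p(h|t) = %.2f\n", bigram_prob('t', 'h');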

Technical details

Spaces

It's obvious that spaces in prose are helpful and contain information: justtrytoreadthisforexample! However, deciding how to encode this information is more bother than I want to think about so I'll simply throw away any non-alphabetic characters from my strings.

Similarly, I'll ignore the case of the characters.

Obviously it's important to do this both when estimating the probabilities to build the model, and when scoring a particular string with the model.
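
In Perl the clean-up can be a couple of lines. This is just a sketch of the idea, not necessarily exactly what MultiLM.pm does; \PL matches anything which isn't a Unicode letter, so accented characters survive:

# Lower-case everything and drop anything which isn't a letter.
my $raw = 'Good morning, world!';
(my $clean = lc $raw) =~ s/\PL//g;
print "$clean\n";                    # prints 'goodmorningworld'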

Unseen tokens

Key to the model are the probabilities p(di|di-1), and we plan to estimate those by just looking at some training corpus. Of course it's perfectly possible that when we're working with real data we'll encounter a letter pair which never appeared in the training data. For example, we might train the model on English prose then present it with a bit of Chinese.

One possibility is to simply assign a zero probability to unseen letter pairs, but that seems a trifle harsh. Instead it's probably better to just assign some low probability, so the possibility is deprecated rather than ruled out completely.

To do the job properly we should assign a probability to every possible bigram, but there are an awful lot of those: Unicode defines about 100,000 characters, so there are about 10^10 bigrams. If we assumed each unseen bigram was equally likely then each would have a probability of somewhat less than 10^-10.

By contrast, in my tiny English training set of about 3 million letter pairs, only 642 of the 676 possible bigrams were seen. Dealing with unseen combinations properly should surely give them a significantly higher probability than 10^-10.

In practice, if we've seen N letter pairs then assigning a probability of 1/N seems to work well enough: for the training set above that's roughly 3 × 10^-7. In principle we should renormalize the distribution after we've extended it, i.e. rescale so that the probabilities sum to one again, but in practice I don't bother: after all, expanding the distribution on the fly is a little bit dodgy.
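
As a sketch of how that fallback might look in code (the hash names are carried over from the bigram sketch above, and $n stands for the total number of pairs seen; all of these names are illustrative):

# Fall back to 1/$n for pairs which were never seen in training.
sub smoothed_prob
  {
    my ($pair, $first, $n, $x, $y) = @_;

    return $pair->{$x}{$y} / $first->{$x}
        if $first->{$x} && $pair->{$x}{$y};

    return 1 / $n;    # deprecated, but not ruled out completely
  }

# e.g. smoothed_prob(\%pair, \%first, $n_pairs, 'q', 'z');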

Another approach we might use is to synthesize the bigram probabilities from the unigram statistics: again, I'm not pursuing that here.

logprob

Multiplying together lots of numbers, all of which are at most one and some of which might be close to zero, is always fraught with numerical problems: the product soon underflows to zero. In practice, then, it's much easier to work with logs and just add them up.
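
A tiny demonstration of the problem and the fix (the numbers are illustrative only):

#! /usr/bin/perl

use strict;
use warnings;

# Multiplying 200 probabilities of 0.001 underflows to zero...
my $p = 1;
$p *= 1e-3 for 1 .. 200;
print "product:  $p\n";            # prints 0

# ...but summing the logs is perfectly stable.
my $lp = 0;
$lp += log(1e-3) for 1 .. 200;
printf "log prob: %.1f\n", $lp;    # about -1381.6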

Sample data

Happily the Internet is full of good prose in different languages. Project Gutenberg [1] is a fine source.

Although there aren't many texts on Gutenberg in, say, Czech, these less common languages often have a significantly different alphabet which compensates.

Software

It's easy to write some software to implement this, but I've done it for you. This tarball [2] contains both a pre-trained model (MultiLM.pm) and the software to generate a new model from your own training data (compile).

Further documentation is available in those files:

$ wget http://www.mjoldfield.com/atelier/2010/06/toy2l/toy2l_0.1.tar.gz
$ tar xzvf toy2l_0.1.tar.gz
$ cd toy2l-0.1
$ ./hello
$ perldoc MultiLM.pm
$ perldoc compile

Here's an example which generated the data used in the table at the top of this article:

#! /usr/bin/perl

use strict;
use warnings;

use MultiLM;
use YAML;

# MultiLM.pm ships with pre-trained data, so new gives us a ready-made model.
my $lm = MultiLM->new;

# Score each greeting against the languages in the model and
# dump the resulting probabilities as YAML.
foreach my $txt ('Good morning', 'Bonjour', 'Guten Morgen', 'Dobre jitro')
  {
    my $h = $lm->prob_language($txt);
    print "$txt\n", Dump($h), "\n";
  }

Training data

Given that we're only interested in the frequencies of letter pairs, I don't expect the training data to matter much, but if you're interested I used: