Maximum Likelihood Distance (MLD) uses a naive Bayes classifier to determine the likelihood that a particular test corpus was generated by some training corpus. This is similar to language identification work done in the early 90s (see for example Dunning 1994), but instead of taking the maximum of all the likelihoods, MLD keeps the likelihood values themselves and compares them, which provides an additional level of detail. The setup also differs: language identification compares a single test corpus to multiple training corpora, one for each language, whereas MLD measures distance between multiple test corpora and a single training corpus.
MLD is quite simple to compute once you have made the naive Bayes assumption. First count how often each type occurs in the training corpus and convert the counts to relative frequencies. Then multiply together the probabilities of all the tokens in the test corpus (in practice, sum their logarithms to avoid underflow). You will want to smooth, since any type unseen in training would otherwise drive the whole product to zero, and to normalise for corpus length, since longer test corpora will always generate lower probabilities. I used bigrams of phones, but some more complicated combination of bigrams and trigrams would probably be beneficial.
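The steps above can be sketched in Python roughly as follows. This is only an illustration under assumptions the text does not specify: add-alpha (Laplace) smoothing, and length normalisation by averaging the negative log-probability per bigram. The names bigrams, train, mld and n_types are made up for the example, and n_types (the number of possible bigram types) has to be supplied by the caller.

```python
import math
from collections import Counter


def bigrams(phones):
    """Return the adjacent pairs in a sequence of phones."""
    return list(zip(phones, phones[1:]))


def train(phones):
    """Count how often each bigram type occurs in the training corpus."""
    return Counter(bigrams(phones))


def mld(test_phones, counts, n_types, alpha=1.0):
    """Average negative log-probability per test bigram under the training counts.

    Assumptions (not from the original text): add-alpha smoothing over
    n_types possible bigram types, and normalisation by the number of
    test bigrams. Lower values mean the test corpus looks more like the
    training corpus.
    """
    test = bigrams(test_phones)
    if not test:
        raise ValueError("test corpus needs at least two phones")
    total = sum(counts.values())
    log_prob = 0.0
    for bg in test:
        # Smoothed relative frequency; unseen bigrams get a small non-zero mass.
        p = (counts[bg] + alpha) / (total + alpha * n_types)
        # Summing logs is the same as multiplying probabilities, without underflow.
        log_prob += math.log(p)
    return -log_prob / len(test)


if __name__ == "__main__":
    # Toy data: single characters stand in for phones.
    model = train(list("abababab"))
    n_types = 2 * 2  # two phones, so four possible bigrams
    print(mld(list("abab"), model, n_types))
    print(mld(list("bbaa"), model, n_types))
```

In the toy run, the first test corpus shares its bigrams with the training corpus and so gets the smaller value; the second contains two bigrams never seen in training and ends up further away.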
Note: MLD is probably not technically a distance measure, just a dissimilarity measure, since I don't think it satisfies all the properties of a distance measure (symmetry and the triangle inequality, for instance). I haven't checked, though.
