Slightly better results with a bit of thresholding, which removes a lot of Danish false positives. (Because Danish was tested first, it artificially picked up many lines that weren’t really matched by any of the models.)
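
For the curious, here is a minimal sketch of what such a threshold can look like. It is not our actual code: the function name `classify`, the cutoff of 0.5, and the assumption that the threshold is applied to a normalized score in [0, 1] are all made up for illustration.

```python
def classify(scores, threshold=0.5):
    """Pick the best-scoring language, or None if nothing is convincing.

    scores: dict mapping language code -> normalized score in [0, 1]
            (hypothetical input format, in the order the models were tested).
    threshold: hypothetical cutoff; below it the line is left unlabeled.
    """
    if not scores:
        return None
    # max() returns the first maximal key, so on ties the language that
    # was tested first wins. Without a threshold, a line that no model
    # matches would therefore be handed to the first language.
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


# A line that none of the models really matches stays unlabeled
# instead of defaulting to the first language in the dict.
print(classify({"da": 0.0, "en": 0.0, "de": 0.0}))
print(classify({"da": 0.1, "en": 0.9, "de": 0.2}))
```

The point is only that a reject option stops whichever language comes first from collecting all the leftovers.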

Also note that this is more like a leave-one-out test error than a true error on an independent sample. Still, it looks good enough.

Since people have been asking on Twitter: we’re using a character n-gram model and predict the language with the highest likelihood. We also added some normalization by computing percentiles of the likelihood (that is, we replace p(x) with the percentage of items whose likelihood is smaller than or equal to p(x)).
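
A rough sketch of the idea in Python, under several assumptions that are not from our actual code: trigrams, add-one smoothing, a length-normalized log-likelihood, and the training lines themselves as the reference set for the percentiles. The names `CharNgramModel`, `ngrams`, and `predict` are made up for this example, and the model scores n-grams by smoothed relative frequency, which is a simplification.

```python
import math
from bisect import bisect_right
from collections import Counter


def ngrams(text, n=3):
    """Return the character n-grams of text, padded with spaces."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Smoothed n-gram frequency model for one language (illustrative only)."""

    def __init__(self, lines, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha
        self.counts = Counter(g for line in lines for g in ngrams(line, n))
        self.total = sum(self.counts.values())
        # One extra vocabulary slot stands in for all unseen n-grams.
        self.vocab = len(self.counts) + 1
        # Sorted scores of the training lines, used as the reference
        # distribution when converting a score into a percentile.
        self.ref = sorted(self.loglik(line) for line in lines)

    def loglik(self, text):
        """Average log-probability per n-gram (assumed length normalization)."""
        grams = ngrams(text, self.n)
        denom = self.total + self.alpha * self.vocab
        ll = sum(math.log((self.counts[g] + self.alpha) / denom) for g in grams)
        return ll / len(grams)

    def percentile(self, text):
        """Fraction of reference lines whose score is <= the score of text."""
        return bisect_right(self.ref, self.loglik(text)) / len(self.ref)


def predict(models, text):
    """Return the language whose model gives text the highest percentile."""
    return max(models, key=lambda lang: models[lang].percentile(text))


models = {
    "en": CharNgramModel(["the cat sat on the mat", "this is the house",
                          "what is that thing"]),
    "de": CharNgramModel(["der hund ist im haus", "das ist nicht gut",
                          "ich bin nicht hier"]),
}
print(predict(models, "the cat sat on the mat"))
```

The percentile step puts the models on a common scale: raw likelihoods from differently sized training sets are not directly comparable, but "this line scores at least as well as 80% of the lines this model has seen" is.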

Development blog for TWIMPACT
beta.twimpact.com

Members
Mikio Braun
Leo Jugel

twitter.com/twimpact