Language detection, part 3: This time with liblinear, trained one-against-the-rest with a balanced number of negative examples per class. The numbers are very similar to what we got from the n-gram model.
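To make the training setup a bit more concrete, here is a minimal sketch of how the one-against-the-rest training sets could be built. It assumes that "balanced" means each class gets as many negative examples as it has positives, drawn at random from all other classes; the function name `balanced_one_vs_rest` and the `(text, label)` input format are made up for illustration and are not taken from our actual code.

```python
import random
from collections import defaultdict


def balanced_one_vs_rest(examples, seed=0):
    """Build one binary training set per class.

    examples: list of (text, label) pairs.
    Returns {label: [(text, +1 or -1), ...]}.
    Assumption: negatives are downsampled to match the number of positives.
    """
    rng = random.Random(seed)

    # Group the texts by their language label.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)

    tasks = {}
    for label, positives in by_label.items():
        # Every text from any other class is a negative candidate.
        rest = [t for other, texts in by_label.items()
                if other != label for t in texts]
        # Draw as many negatives as there are positives (or fewer if
        # the rest pool is smaller than that).
        k = min(len(positives), len(rest))
        negatives = rng.sample(rest, k)
        tasks[label] = ([(t, 1) for t in positives]
                        + [(t, -1) for t in negatives])
    return tasks
```

Each of the resulting binary sets would then be handed to a linear classifier such as liblinear, and at prediction time the class whose classifier gives the highest score wins.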

From first tests, it looks like this works much better than n-grams on entirely new data. The problem with n-grams seems to be that the whole text needs to match the model well, so when a text contains entirely new words, you get a very low score. The SVM, on the other hand, focuses on what seems relevant to a class, and is less sensitive to unseen data (as long as it’s irrelevant).

This brings us directly back to the discussion on discriminative vs. generative models.
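A toy sketch of the difference, under stated assumptions: the generative side is a character trigram model with add-one smoothing (the `vocab_size` of 10000 is an arbitrary placeholder), and the discriminative side is a tiny linear model trained with plain hinge-loss subgradient steps, standing in for liblinear rather than reproducing it. All function names here are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    # Pad with spaces so that word boundaries become n-grams too.
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_ngram_model(texts, n=3):
    counts = Counter(g for t in texts for g in char_ngrams(t, n))
    return counts, sum(counts.values())


def ngram_log_score(model, text, n=3, vocab_size=10000):
    counts, total = model
    # Generative score: every n-gram in the text contributes a
    # log-probability, and unseen n-grams contribute a very negative one.
    return sum(math.log((counts[g] + 1) / (total + vocab_size))
               for g in char_ngrams(text, n))


def linear_score(weights, text, n=3):
    # Discriminative score: n-grams without a learned weight add zero.
    return sum(weights.get(g, 0.0) for g in char_ngrams(text, n))


def train_linear(examples, epochs=20, lr=0.1, n=3):
    """examples: list of (text, y) with y in {+1, -1}.

    Unregularized hinge-loss subgradient steps, a stand-in for a linear SVM.
    """
    weights = {}
    for _ in range(epochs):
        for text, y in examples:
            if y * linear_score(weights, text, n) < 1:  # margin violated
                for g in char_ngrams(text, n):
                    weights[g] = weights.get(g, 0.0) + lr * y
    return weights


english = ["the cat sat on the mat", "this is the house"]
german = ["die katze sitzt auf der matte", "das ist das haus"]

model_en = train_ngram_model(english)
weights_en = train_linear([(t, 1) for t in english]
                          + [(t, -1) for t in german])

known = "the cat"
mixed = "the cat xqzvk"  # same text plus a made-up, never-seen word

print(ngram_log_score(model_en, known), ngram_log_score(model_en, mixed))
print(linear_score(weights_en, known), linear_score(weights_en, mixed))
```

In this sketch the n-gram score drops when the made-up word is appended, because each of its trigrams adds another penalty, while the linear score stays the same, because those trigrams never received a weight. That is only the mechanism in miniature; a trained model may well have seen some of the n-grams of a new word, in which case the linear score would shift too.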

Development blog for TWIMPACT
beta.twimpact.com

Members
Mikio Braun
Leo Jugel

twitter.com/twimpact
