Language detection, part 3: This time with liblinear, trained one-against-the rest with a balanced number of negative examples per class. Numbers are very similar to what we got from the n-gram model.
From first tests, it looks that this works much better than n-grams on entirely new data. The problem with n-grams seems to be that the whole text needs to match the model well. However, when you have entirely new words, you get a very low score. The SVM on the other hand focuses on what seems relevant to a class, and is less sensitive to unseen data (as long as it’s irrelevant).
This brings us directly back to the discussion on discriminative vs. generative models.