I'm searching for a simple algorithm or an open source PHP library that can estimate whether a text is mainly written in a specific language. I found the following answer relating to Python, which probably points in the right direction, but something that works out of the box for PHP would be ideal.
Of course something like an n-gram estimator wouldn't be too hard to implement, but it requires a reference database as well.
The actual problem to solve is as follows. I run a WordPress blog that is currently flooded with spam. The blog is in German, and virtually all trackback spam is in English. My idea is to immediately mark as spam any trackback that appears to be English. However, I cannot use marker words, because I do not want to flag a comment just because of a typo or an English quotation.
My solution:
Using the answers to this question, I implemented a solution that detects German by a simple stopword ratio: any comment containing a link must consist of at least 25% German stopwords. You can still post something like "cool article", which contains no stopwords at all, but if you include a link, you should bother to write proper German.
Unfortunately, the German stopword list from NLTK is inaccurate; it contains words that do not exist in German. So I used the Snowball list instead. With the Perl regexp optimizer I condensed the entire list into a single regexp and count the stopwords using preg_match_all(). The whole filter is 25 lines, about a third the size of the Perl code that generates the regexp from the list. A sketch of the check is below. Let's see how it performs in the wild.
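
For illustration, here is a minimal sketch of that check in PHP. The stopword pattern is a hypothetical, shortened stand-in for the regexp generated from the full Snowball list, and the function names and the 25% threshold parameter are just example choices:

    <?php
    // Minimal sketch of the stopword-ratio filter (illustrative names only).
    // The real filter uses a single regexp condensed from the full Snowball
    // German stopword list; this pattern holds just a handful of stopwords.
    function is_probably_german(string $text, float $threshold = 0.25): bool
    {
        $stopwords = '/\b(und|oder|aber|nicht|der|die|das|ein|eine|ist|sind|mit|von|zu|auf|für)\b/iu';

        // Total number of words in the comment.
        $total = preg_match_all('/\pL+/u', $text, $m);
        if ($total === false || $total === 0) {
            return false;
        }

        // Number of German stopwords among them.
        $hits = preg_match_all($stopwords, $text, $m);

        return ($hits / $total) >= $threshold;
    }

    function looks_like_trackback_spam(string $comment): bool
    {
        // Only comments that carry a link have to pass the language check.
        $has_link = (bool) preg_match('#https?://#i', $comment);

        return $has_link && !is_probably_german($comment);
    }

    // English trackback with a link: flagged.
    var_dump(looks_like_trackback_spam('Great post, check out http://example.com'));
    // Short German comment with a link: passes.
    var_dump(looks_like_trackback_spam('Das ist ein guter Artikel mit vielen Informationen: http://example.com'));
    // No link, so no language check is required.
    var_dump(looks_like_trackback_spam('cool article'));

An English trackback with a link scores close to 0% German stopwords and gets flagged, while even a short German comment with a link easily clears the threshold.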
Thanks for your help.