
I'm searching for a simple algorithm or an open-source PHP library that can estimate whether a text is mainly written in a specific language. I found the following answer relating to Python, which probably points in the right direction, but something that works out of the box for PHP would be a charm.

Of course, something like an n-gram estimator wouldn't be too hard to implement, but it would require a reference database as well.

The actual problem to solve is as follows. I run a WordPress blog that is currently flooded with spam. The blog is in German, and virtually all trackback spam is in English. My idea is to immediately mark as spam any trackback that appears to be English. However, I cannot use marker words, because I do not want to flag typos or citations as spam.

My solution:

Using the answers to this question, I implemented a solution that detects German by a simple stopword ratio. Any comment containing a link must consist of at least 25% German stopwords. So you can still post something like "cool article", which contains no stopwords at all, but if you include a link, you should bother to write proper language.

Unfortunately, the stopword list from NLTK is incorrect: it contains words that do not exist in German. So I used the Snowball list instead. Using the Perl regexp optimizer, I condensed the entire list into a single regexp and count the stopwords with preg_match_all(). The whole filter is 25 lines, a third of the length of the Perl code that produces the regexp from the list. Let's see how it performs in the wild.
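
For anyone curious, a minimal sketch of that filter in PHP could look like the block below. The stopword alternation here is just a handful of placeholder words for illustration; the real filter would use the full Snowball list condensed into a single regexp, and whether the comment contains a link is assumed to be determined elsewhere.

```php
<?php
// Sketch of the stopword-ratio filter described above.
// The alternation below holds only a few German stopwords for
// illustration; the real filter would condense the entire Snowball
// list into a single regexp.
function looks_german(string $comment, bool $hasLink): bool
{
    // Comments without a link are not subjected to the language check.
    if (!$hasLink) {
        return true;
    }

    // Split on anything that is not a letter or digit (Unicode-aware).
    $words = preg_split('/[^\p{L}\p{N}]+/u', $comment, -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return false;
    }

    // Abbreviated stand-in for the condensed Snowball stopword regexp.
    $stopwords = '/\b(und|oder|aber|nicht|das|die|der|ein|eine|ich|ist)\b/iu';
    $hits = (int) preg_match_all($stopwords, $comment);

    // Require at least 25% German stopwords, as in the solution above.
    return ($hits / count($words)) >= 0.25;
}
```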

Thanks for your help.


2 Answers


I agree with @Thomas that what you are looking for is a spam classifier rather than a language-detection algorithm. Nonetheless, I think this language-detection solution is simple enough and works out of the box, as you want. Basically, if you count the number of stopwords from different languages and select the language with the highest count in the document, you have a simple yet very effective language classifier.

Now, the best part is that you barely need to write any code, as you can use standard stopword lists and text-processing packages like nltk to handle the work. Here is an example of how to implement it from scratch with Python and nltk.
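
Since the question asks for PHP, a minimal sketch of the same stopword-counting idea in PHP might look like this. The per-language lists here are tiny placeholders; in practice you would load complete stopword lists, e.g. from Snowball.

```php
<?php
// Sketch of the stopword-counting classifier in PHP (the linked
// example uses Python/nltk). For each candidate language, count how
// many tokens of the text appear in that language's stopword list and
// pick the language with the highest count.
function detect_language(string $text): string
{
    // Placeholder lists; real lists (e.g. Snowball) are much longer.
    $stopwords = [
        'german'  => ['und', 'oder', 'nicht', 'das', 'die', 'der', 'ist'],
        'english' => ['and', 'or', 'not', 'the', 'a', 'of', 'is'],
    ];

    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text),
                        -1, PREG_SPLIT_NO_EMPTY);

    $best = 'unknown';
    $bestCount = 0;
    foreach ($stopwords as $language => $list) {
        // array_intersect keeps every token that occurs in the list,
        // so repeated stopwords are counted each time they appear.
        $count = count(array_intersect($words, $list));
        if ($count > $bestCount) {
            $bestCount = $count;
            $best = $language;
        }
    }
    return $best;
}

// echo detect_language('Das ist ein sehr cooler Artikel'); // "german"
```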

I hope this helps.

answered 2013-06-13T19:33:37.280

If all you want to do is recognize English, then there's a very easy hack. If you just check the letters in a post, English is one of the only languages that will be written entirely in the pure-ASCII range. It's hacky, but I believe it's a decent simplification of an otherwise very difficult problem.
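
For illustration, a minimal PHP version of this check could look like the following; it simply tests whether the post contains any byte outside the 7-bit ASCII range (German text almost always contains umlauts or ß).

```php
<?php
// Heuristic: a post with no bytes outside the ASCII range is likely
// English; German text of any length usually contains ä, ö, ü or ß.
// Note this can misfire on short German snippets without umlauts.
function is_probably_english(string $text): bool
{
    // Match any byte in the 0x80-0xFF range (i.e. non-ASCII).
    return preg_match('/[\x80-\xFF]/', $text) === 0;
}
```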

My guess on efficacy, from some quick back-of-the-envelope calculations on a couple of French and German blogs, would be ~85%. That isn't foolproof, but it's pretty good for the simplicity of it, I would think.

answered 2013-06-13T19:27:05.110