language-agnostic - Determining the language of a website's content

Question

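For one of my applications, I need to determine the language of a website by fetching its content.

I'd like to hear your views on writing a language detection script. Which methods would you use? Which scripting language? And so on.

So far I have written some PHP code that uses several methods:

Determine the language from the Content-Language meta tag
Determine the language by fetching the title, description, and keywords and comparing them against word lists for English, Dutch, German, etc.
Determine the language from the html lang tag
Determine the language by fetching all page content (splitting the words into an array) and comparing it against the word lists with array_search (the language array with the most matches is the content language).
Determine the language from the language header

I currently take these steps to determine the language, in exactly this order. If one method succeeds in determining the language, I skip the remaining functions.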
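
The "try each method in order and stop at the first success" flow described above can be sketched as follows. This is only an illustration in Python rather than the asker's PHP; the names DeclaredLanguageParser, from_meta_tag, from_html_attribute, and detect_language are hypothetical, and only the two declared-language checks are shown.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional


class DeclaredLanguageParser(HTMLParser):
    """Collects the language declared in <html lang> and in the Content-Language meta tag."""

    def __init__(self) -> None:
        super().__init__()
        self.html_lang: Optional[str] = None
        self.meta_lang: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "html" and attributes.get("lang"):
            self.html_lang = attributes["lang"]
        elif tag == "meta" and (attributes.get("http-equiv") or "").lower() == "content-language":
            self.meta_lang = attributes.get("content") or None


def _primary_subtag(value: Optional[str]) -> Optional[str]:
    # "en-US" becomes "en"; for a list such as "en, fr" only the first entry is used.
    if not value:
        return None
    first = value.split(",")[0].strip().replace("_", "-")
    return first.split("-")[0].lower() or None


def _parse(html: str) -> DeclaredLanguageParser:
    parser = DeclaredLanguageParser()
    parser.feed(html)
    parser.close()
    return parser


def from_meta_tag(html: str) -> Optional[str]:
    return _primary_subtag(_parse(html).meta_lang)


def from_html_attribute(html: str) -> Optional[str]:
    return _primary_subtag(_parse(html).html_lang)


def detect_language(html: str, detectors: List[Callable[[str], Optional[str]]]) -> Optional[str]:
    # Run the detectors in priority order and stop at the first one that gives an answer.
    for detector in detectors:
        language = detector(html)
        if language is not None:
            return language
    return None
```

Word-list based detectors can be appended to the same list, so the order of the list is the priority order.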

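This approach works, but it is not always accurate. Can anyone tell me more about things I could check? Perhaps other ways of checking the language (I don't want to use an API).

(In the end I need to store these languages in a MySQL database.)

Looking forward to hearing some suggestions!

Thanks in advance.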

Nick

score 0 · Accepted Answer

That depends on how long your text is.

First of all, parse the HTML and extract only the text.
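
A minimal sketch of that extraction step, using only Python's standard html.parser module. The names TextExtractor and extract_text are made up for this example, and skipping script, style, and noscript content is an assumption about what should count as page text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside script, style, and noscript."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def extract_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flushes any text still buffered by the parser
    return extractor.text()
```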

If it is long, you can use a cheap method that looks only at stopwords. Get a list of stopwords for each language and count how many of them appear in your text. The NLTK corpus (Python) has a good list of stopwords, and NLTK also provides functions for tokenizing sentences and words.

import nltk

# Map every language NLTK ships stopwords for to its set of stopwords.
STOPWORDS_DICT = {lang: set(nltk.corpus.stopwords.words(lang))
                  for lang in nltk.corpus.stopwords.fileids()}

def get_language(text):
    """Return the language whose stopwords overlap most with the words in text."""
    words = set(nltk.wordpunct_tokenize(text.lower()))
    return max(STOPWORDS_DICT, key=lambda lang: len(words & STOPWORDS_DICT[lang]))

lang = get_language('This is my test text')

More explanation at http://www.algorithm.co.il/blogs/programming/python/cheap-language-detection-nltk/
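
Since the question mentions that the result is not always accurate, one possible variation is to return no answer at all when the stopword evidence is weak or two languages score about the same. The sketch below uses only the standard library; the tiny word lists and the min_share and min_margin thresholds are assumptions picked for illustration, not tuned values, and real stopword lists (such as NLTK's) are much longer.

```python
import re
from typing import Dict, Optional, Set

# Tiny illustrative stopword samples (an assumption for this sketch).
SAMPLE_STOPWORDS: Dict[str, Set[str]] = {
    "english": {"the", "and", "is", "of", "to", "in", "that", "this", "it", "for"},
    "dutch": {"de", "het", "een", "en", "van", "is", "dat", "op", "niet", "voor"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu", "mit", "für"},
}


def score_languages(text: str, stopwords: Dict[str, Set[str]] = SAMPLE_STOPWORDS) -> Dict[str, float]:
    """Return, per language, the share of words in text that are stopwords of that language."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters, Unicode-aware
    if not words:
        return {lang: 0.0 for lang in stopwords}
    return {lang: sum(word in known for word in words) / len(words)
            for lang, known in stopwords.items()}


def guess_language(text: str, min_share: float = 0.1, min_margin: float = 0.05) -> Optional[str]:
    """Return the best-scoring language, or None when the evidence is too weak or too close."""
    ranked = sorted(score_languages(text).items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    if best_score < min_share or best_score - runner_up_score < min_margin:
        return None
    return best
```

A None result can then be treated as "unknown" and passed on to the next method in the chain instead of storing a doubtful guess.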

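If you go with Python + NLTK, don't forget to download the NLTK corpus data after installing.

import nltk
nltk.download('stopwords')  # the stopwords corpus is enough for the code above; nltk.download() with no arguments opens the interactive downloader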
