python - Removing non-English words from a sentence in Python

Question

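I wrote code that sends a query to Google and returns the results. I extract the snippets (summaries) from these results for further processing. However, sometimes non-English words that I don't want show up in these snippets. For example: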

/\u02b0w\u025bn w\u025bn unstressed \u02b0w\u0259n w\u0259n/

I only want the word "unstressed" from this sentence. How can I do that? Thanks

score 4 · Accepted Answer

PyEnchant might be a simple option for you. I don't know how fast it is, but you can do something like this:

>>> import enchant
>>> d = enchant.Dict("en_US")
>>> d.check("Hello")
True
>>> d.check("Helo")
False
>>>
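
To apply that check to a whole snippet, something along these lines might work. It is only a sketch: the helper names `make_enchant_checker` and `filter_words` are made up for illustration, the punctuation set that gets stripped is a guess, and it assumes PyEnchant and an `en_US` dictionary are installed. The checker is passed in as a plain function, so any other word test can be swapped in.

```python
def make_enchant_checker(lang="en_US"):
    """Return PyEnchant's check method for the given language.

    Imported lazily so the filter below can be used (and tested)
    without PyEnchant installed.
    """
    import enchant
    return enchant.Dict(lang).check


def filter_words(text, check):
    """Keep only the whitespace-separated words for which check(word) is true."""
    kept = []
    for token in text.split():
        # Strip surrounding punctuation such as the slashes in the example.
        word = token.strip("/.,;:!?\"'()")
        # Skip empty strings: Dict.check() rejects them.
        if word and check(word):
            kept.append(word)
    return " ".join(kept)
```

Usage would be `filter_words(snippet, make_enchant_checker())`. Note that a spell-checker dictionary may accept some very short tokens or single letters, so the output probably still needs a sanity check.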

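A tutorial can be found here. It also has an option to return suggestions, which you could query again or something. In addition, you could check whether your results are latin-1 (is_utf8() exists, I don't know whether is_latin-1() exists as well), or perhaps use something like Enca, which detects the encoding of text files based on knowledge of their language.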
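
The latin-1 idea can be tried in plain Python without any extra library, assuming the snippet is already a decoded `str`. This is a rough sketch, not a language detector: `fits_latin1` and `latin1_words` are names invented here, and the test only says whether every character of a word exists in latin-1. It happens to drop the phonetic symbols in the question's example, but it would keep any non-English word written with ordinary Western European letters.

```python
def fits_latin1(word):
    """Return True if every character of word can be encoded as latin-1."""
    try:
        word.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def latin1_words(text):
    """Return the words of text that contain only latin-1 characters."""
    kept = []
    for token in text.split():
        word = token.strip("/")  # drop the slashes around the transcription
        if word and fits_latin1(word):
            kept.append(word)
    return kept
```

This works as a cheap first pass before a dictionary lookup, since it needs no word list at all.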

score 1 · Accepted Answer

You can compare the words you receive with a dictionary of English words, for example /usr/share/dict/words on a BSD system.

I would guess that Google's results are, for the most part, grammatically correct, but if not, you might have to look into stemming in order to match against your dictionary.
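
A minimal sketch of the word-list comparison, assuming the file has one word per line and is readable as UTF-8 (both are assumptions about your system's file). The function names are invented for illustration. Matching is case-insensitive, and no stemming is done, so inflected forms that are missing from the list would be dropped.

```python
def load_word_list(path="/usr/share/dict/words"):
    """Read a one-word-per-line file into a set of lowercase words."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def keep_listed_words(sentence, words):
    """Keep only the words of sentence that appear in the word set."""
    kept = []
    for token in sentence.split():
        # Strip surrounding punctuation before the lookup.
        word = token.strip("/.,;:!?\"'()")
        if word.lower() in words:
            kept.append(word)
    return " ".join(kept)
```

Loading the list once into a set keeps each lookup cheap, which matters if you process many snippets.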

score 1 · Accepted Answer

You can use PyWordNet. That is a Python interface to WordNet. Just split your sentence on whitespace and check whether each word is in the dictionary.
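
Here is a sketch of that split-and-check approach. It swaps in NLTK's WordNet interface (`nltk.corpus.wordnet`) rather than PyWordNet, and it assumes NLTK and its `wordnet` corpus are installed. WordNet only covers nouns, verbs, adjectives and adverbs, so function words like "the" would be thrown away; the small `FUNCTION_WORDS` set below is a hand-picked, incomplete workaround. All names in the block are made up for illustration.

```python
# Hand-picked and deliberately short; extend as needed (assumption, not from WordNet).
FUNCTION_WORDS = frozenset({"a", "an", "the", "and", "or", "but", "of", "to", "for", "with"})


def wordnet_lookup(word):
    """Return True if WordNet has at least one synset for word.

    NLTK is imported lazily so the filter can run with another lookup.
    """
    from nltk.corpus import wordnet
    return bool(wordnet.synsets(word))


def keep_known_words(sentence, lookup=wordnet_lookup):
    """Split on whitespace and keep words that are function words or pass lookup."""
    kept = []
    for token in sentence.split():
        word = token.strip(".,;:!?\"'()/")
        if not word:
            continue
        lowered = word.lower()
        if lowered in FUNCTION_WORDS or lookup(lowered):
            kept.append(word)
    return " ".join(kept)
```

Calling `keep_known_words(snippet)` uses WordNet by default. Passing a different `lookup` function lets you try the same filter against another dictionary.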
