python - Polyglot does not detect multiple languages

Question

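I am testing the polyglot package in Python to detect the languages in mixed-language documents.

I do not expect it to give the most accurate predictions, but so far the package has never returned anything other than a single language as the answer, even for texts that contain 2 or 3 languages.

The texts I use are about 20 words long on average, for example: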

text = 'Je travaillais en France. Je suis tres heureux. I work in London. I grew up in Manchester.'

I always get something like the following, with no multi-language answer:

Prediction is reliable: True
Language 1: name: English     code: en       confidence:  98.0 read bytes:   682
Language 2: name: un          code: un       confidence:   0.0 read bytes:     0
Language 3: name: un          code: un       confidence:   0.0 read bytes:     0

This is far from the example in the documentation:

> China (simplified Chinese: 中国; traditional Chinese: 中國),
> 
> name: English     code: en       confidence:  71.0 read bytes:   887
> name: Chinese     code: zh_Hant  confidence:  11.0 read bytes:  1755
> name: un          code: un       confidence:   0.0 read bytes:     0

To be fair, when I run the detector on the Chinese-English example above, I do get a mixed-language answer.

The code is as follows:

from polyglot.detect import Detector

text = 'Je travaillais en France. Je suis tres heureux. I work in London. I grew up in Manchester.'

answer = Detector(text)

print(answer)

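Why is this happening?

P.S.

In addition, polyglot performs very poorly when detecting the language of a single word, even a very common one. For example, for the word quantita (Italian), it returns English.

I understand that many of these packages succeed mainly on large texts, but it is surprising that they cannot even catch such simple words.

TextBlob seems to work well on single words too, but you can send it only a very limited number of requests (probably because, in both cases, it relies on the Google API).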

score 0 · Accepted Answer

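I think Polyglot detects the language by reading the characters used in the text. The examples you mentioned above are all written in English (Latin) characters. Whether the language is French, Italian, Spanish, transliterated Chinese, and so on, it will be detected as English because it is written using the English character set.

Therefore, Polyglot only works well for languages written in non-Latin characters, such as Greek, Russian, Arabic, or Chinese.

That is why, in the case below, you also get Chinese, but with low confidence: there are few Chinese characters and many more Latin characters: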

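> China (simplified Chinese: 中国; traditional Chinese: 中國),
> 
> name: English     code: en       confidence:  71.0 read bytes:   887
> name: Chinese     code: zh_Hant  confidence:  11.0 read bytes:  1755
> name: un          code: un       confidence:   0.0 read bytes:     0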
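
To make the character-set argument concrete, here is a minimal sketch that counts the letters of each sample by rough script. It uses only the standard library and is not how polyglot works internally. The helper name `script_profile` and its Latin/CJK/other buckets are assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def script_profile(text):
    """Count alphabetic characters by rough script, based on Unicode names."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits and punctuation
        name = unicodedata.name(ch, "")
        if name.startswith("CJK"):
            counts["CJK"] += 1
        elif name.startswith("LATIN"):
            counts["Latin"] += 1
        else:
            counts["other"] += 1
    return counts


# The two sample texts from the question.
mixed_latin = ('Je travaillais en France. Je suis tres heureux. '
               'I work in London. I grew up in Manchester.')
mixed_script = 'China (simplified Chinese: 中国; traditional Chinese: 中國),'

print(script_profile(mixed_latin))
print(script_profile(mixed_script))
```

The French/English sample contains only Latin letters, so a purely character-based signal has nothing to separate the two languages. The documentation sample contains 4 CJK characters against 40 Latin letters, which fits the low share reported for Chinese.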
