java - Language detection not working as expected

Question

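I am using the https://code.google.com/p/language-detection Java library to detect the language of a given text. The profiles I use are the ones provided with the library. However, the results sometimes differ from what I expect. Could something be wrong in the code, or should I regenerate the profiles?

I tried with "ld.detect("en");" both commented out and uncommented. Does whitespace affect language detection?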

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // LanguageDetect is not shown here; init() is given the directory of the bundled profiles.
    LanguageDetect ld = new LanguageDetect();
    ld.init("C:\\James\\languageTest\\profiles");
    //ld.detect("en");

    // Read the test phrases one per line and print the detected language for each.
    // try-with-resources closes the reader automatically, even on failure.
    try (BufferedReader br = new BufferedReader(new FileReader("C:\\James\\failcases.txt"))) {
        String textCurrentLine;
        while ((textCurrentLine = br.readLine()) != null) {
            System.out.println(ld.detect(textCurrentLine));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

Here is what I get for a few phrases:

Communication - en
Timing - tl
none - it
user - it
No - pt
Yes - fr
user - no
generated - da
Diagnostic - it
not supported - en
supported - en
Bus Speed - en
Protocol - it

score 1 · Accepted Answer

As the library's FAQ states:

Can langdetect handle short texts?

This library requires that a detection text has some length, almost 10-20 words over.

It may return a wrong language for very short text with 1-10 words.

You are trying it on one-word or two-word texts. That is not the use case this library was built for, so you are going to get wrong results.
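
One way to act on this is to skip detection when the input is too short. The sketch below is only an illustration: `ShortTextGuard`, `MIN_WORDS`, and `detectIfLongEnough` are hypothetical names, the threshold of 10 words is an assumption taken from the lower end of the FAQ's range, and the `detector` argument is a stand-in for whatever call wraps the library.

```java
import java.util.Optional;
import java.util.function.Function;

final class ShortTextGuard {
    // Assumed threshold, taken from the FAQ's "10-20 words" guidance; tune as needed.
    static final int MIN_WORDS = 10;

    /** Counts whitespace-separated tokens; null or blank input counts as zero words. */
    static int wordCount(String text) {
        String trimmed = text == null ? "" : text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /**
     * Runs the detector only when the text has at least MIN_WORDS words.
     * The detector parameter is a hypothetical stand-in for the library call.
     */
    static Optional<String> detectIfLongEnough(String text, Function<String, String> detector) {
        if (wordCount(text) < MIN_WORDS) {
            return Optional.empty(); // too short to trust a statistical guess
        }
        return Optional.ofNullable(detector.apply(text));
    }

    public static void main(String[] args) {
        // Placeholder detector that always answers "en"; it is not the real library.
        Function<String, String> fakeDetector = t -> "en";
        System.out.println(detectIfLongEnough("Timing", fakeDetector));
        System.out.println(detectIfLongEnough(
            "This sentence is long enough to be passed on to the language detector",
            fakeDetector));
    }
}
```

With a guard like this, inputs such as "Timing" or "Bus Speed" would be reported as undetermined instead of being assigned an arbitrary language.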

For single words without context, you can try to match them against dictionaries of the languages you are targeting.
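
A minimal sketch of that dictionary idea follows. It is not part of the language-detection library: `DictionaryLookup` and `candidateLanguages` are hypothetical names, and the tiny word lists are made up for illustration. A real setup would load full word lists for each target language.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class DictionaryLookup {
    // Toy word lists for illustration only (assumption); replace with real dictionaries.
    static final Map<String, Set<String>> DICTIONARIES = new LinkedHashMap<>();
    static {
        DICTIONARIES.put("en", Set.of("yes", "no", "user", "none", "timing", "protocol"));
        DICTIONARIES.put("fr", Set.of("oui", "non", "utilisateur", "protocole"));
        DICTIONARIES.put("it", Set.of("no", "utente", "protocollo"));
    }

    /** Returns the sorted codes of every language whose dictionary contains the word. */
    static Set<String> candidateLanguages(String word) {
        String key = word.trim().toLowerCase(Locale.ROOT);
        Set<String> matches = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : DICTIONARIES.entrySet()) {
            if (entry.getValue().contains(key)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(candidateLanguages("Protocol"));
        System.out.println(candidateLanguages("No"));
    }
}
```

The lookup returns a set instead of a single language because a lone word can belong to several languages at once ("no" is in both the English and Italian lists here), so the caller has to decide how to break ties.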
