java - Why does Maven give me different UTF-8 characters than Eclipse (test runs in Eclipse, fails in Maven)?

Question

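My current project involves parsing natural language. One test reads text from a file, removes certain characters, and tokenizes the text into individual words. The test then compares the number of unique words. In Eclipse this test is green; in Maven I get more words than expected. Comparing the word lists, I see the following additional words: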

acquirer⊙s
card⊙s
institution⊙s
issuer⊙s
provider⊙s
psam⊙s
⊜from⊝
⊜slot⊝
⊜to⊝

Looking at the source text, it contains the following characters, which should be filtered out: “ ” '

This works in Eclipse but not in Maven. I am using UTF-8. The files appear to be encoded correctly, and in the Maven POM I specify the following:

<properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <org.apache.lucene.version>3.6.0</org.apache.lucene.version>
</properties>

Edit: here is the code that reads the file (which, according to Eclipse, is encoded as UTF-8).

        BufferedReader reader = new BufferedReader(
                new FileReader(this.file));
        String line = "";
        while ((line = reader.readLine()) != null) {
            // the csv contains a text and a classification
            String[] reqCatType = line.split(";");
            String reqText = reqCatType[0].trim();
            String reqCategory = reqCatType[1].trim();
            // the tokenizer also removes unwanted characters:
            String[] sentence = this.filter.filterStopWords(this.tokenizer
                    .tokenize(reqText));
            // we use this data to train a machine learning algorithm
            this.dataSet.learn(sentence, reqCategory);
        }
        reader.close();

Edit: the following information may be useful for analyzing the problem:

mvn -v
Apache Maven 3.0.3 (r1075438; 2011-02-28 09:31:09-0800)
Maven home: /usr/share/maven
Java version: 1.6.0_33, vendor: Apple Inc.
Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
Default locale: en_US, platform encoding: MacRoman
OS name: "mac os x", version: "10.6.8", arch: "x86_64", family: "mac"

score 4 · Accepted Answer

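So your data file is in UTF-8. The Eclipse encoding setting on that file has no effect here, because it is the running Java program that decides how the bytes are interpreted.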

FileReader always uses the platform default encoding, which is generally a bad idea. Eclipse is probably setting the "platform default" for you, while Maven is not.
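To see why the extra words appear, here is a minimal sketch of what happens when UTF-8 bytes are decoded with a different charset. The sample word `acquirer’s` and the class and method names are made up for illustration, and ISO-8859-1 stands in for MacRoman only because it is guaranteed to exist on every JVM. It assumes Java 7 or later for `StandardCharsets`; on Java 6, `Charset.forName("UTF-8")` would be the equivalent.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodingMismatchDemo {

    // Encodes the text as UTF-8 and decodes the bytes with the given charset,
    // mimicking a reader that assumes the wrong encoding.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, charset);
    }

    public static void main(String[] args) {
        // Hypothetical sample: a word containing U+2019 (right single quotation mark).
        String word = "acquirer\u2019s";

        // This is the charset FileReader would silently use on this JVM.
        System.out.println("platform default: " + Charset.defaultCharset());

        String right = decodeUtf8BytesAs(word, StandardCharsets.UTF_8);
        String wrong = decodeUtf8BytesAs(word, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(word)); // true
        // The 3-byte UTF-8 sequence for U+2019 becomes 3 separate characters.
        System.out.println(wrong.length());     // 12 instead of 10
    }
}
```

A filter that looks for the single character `’` will not match the three garbage characters, so the word survives tokenization as a "new" unique word.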

Fix your code to specify the encoding explicitly.

See the FileReader JavaDoc:

To specify these values yourself, construct an InputStreamReader on a FileInputStream.
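Following that advice, the reading code could look roughly like the sketch below. The class name `Utf8LineReader` and the `readLines` helper are hypothetical and not part of the original project. It assumes Java 7 or later (try-with-resources, `StandardCharsets`); on the Java 6 shown in the question, `new InputStreamReader(new FileInputStream(path), "UTF-8")` with an explicit `close()` would do the same job.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Utf8LineReader {

    // Reads all lines of the file, always decoding the bytes as UTF-8,
    // independent of the platform default encoding.
    static List<String> readLines(String path) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Utf8LineReader <path-to-csv>
        for (String line : readLines(args[0])) {
            System.out.println(line);
        }
    }
}
```

In the question's loop, the only change needed is to replace `new FileReader(this.file)` with `new InputStreamReader(new FileInputStream(this.file), ...)`. `FileInputStream` accepts either a `File` or a path `String`, so this works whichever type `this.file` is.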
