java - Java: How to determine the correct charset encoding of a stream

Question

What is the best way to programmatically determine the correct charset encoding of an input stream or file?

I have tried the following:

File in =  new File(args[0]);
InputStreamReader r = new InputStreamReader(new FileInputStream(in));
System.out.println(r.getEncoding());

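However, on a file that I know is encoded in ISO8859_1, the code above yields ASCII, which is incorrect and does not let me render the file's content back to the console correctly.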

score 105 · Accepted Answer

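You cannot determine the encoding of an arbitrary byte stream. That is the nature of encodings. An encoding is a mapping between byte values and their representation, so every encoding "could" be the right one.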

The getEncoding() method returns the encoding that was set up for the stream (read the JavaDoc). It will not guess the encoding for you.
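To make this concrete, here is a minimal sketch (the class name EncodingEcho and its method are made up for illustration). The same bytes are wrapped in readers configured with different charsets, and each reader just reports the charset it was given, possibly under its historical name.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the question's code.
class EncodingEcho {

    // Returns the encoding name reported by a reader that was told to use `charset`.
    static String reportedEncoding(byte[] data, Charset charset) {
        // The reader wraps an in-memory stream, so nothing needs to be closed here.
        InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(data), charset);
        // getEncoding() echoes the configured charset; it never inspects the bytes.
        return reader.getEncoding();
    }
}
```

Calling reportedEncoding with the same byte array and two different charsets gives two different answers, which shows that the result says nothing about the data itself.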

Some streams tell you which encoding was used to create them: XML, HTML. But an arbitrary byte stream does not.

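Anyway, you could try to guess the encoding yourself if you have to. Every language has a characteristic frequency for each character. In English the character e appears very often, but ê appears very seldom. In an ISO-8859-1 stream there are usually no 0x00 bytes, but a UTF-16 stream has a lot of them.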
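The 0x00 observation can be turned into a very rough heuristic. The sketch below is only an illustration: the class name NulByteHeuristic and the 10% threshold are assumptions, not values taken from any library.

```java
// Hypothetical helper illustrating the 0x00 heuristic; the 10% threshold is an
// arbitrary assumption chosen for this example.
class NulByteHeuristic {

    // Fraction of bytes in `data` that are 0x00 (0.0 for empty input).
    static double nulRatio(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        int nuls = 0;
        for (byte b : data) {
            if (b == 0) {
                nuls++;
            }
        }
        return (double) nuls / data.length;
    }

    // Many NUL bytes hint at UTF-16 (or UTF-32, or binary data) rather than
    // a single-byte encoding such as ISO-8859-1. This is a guess, not a proof.
    static boolean looksLikeUtf16(byte[] data) {
        return nulRatio(data) > 0.10;
    }
}
```

For mostly Latin text, UTF-16 stores a zero byte for every character, so the ratio is close to 0.5. For text in other scripts the ratio can be much lower, which is one reason a check like this can only ever be a hint.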

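Or: you could ask the user. I have seen applications that show you a snippet of the file in different encodings and ask you to select the "correct" one.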

score 77 · Accepted Answer

I have used this library, similar to jchardet, for detecting the encoding in Java: https://github.com/albfernandez/juniversalchardet

score 40 · Accepted Answer

Check this out: http://site.icu-project.org/ (icu4j). They have libraries for detecting the charset from an IOStream. It could be as simple as this:

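import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.nio.charset.UnsupportedCharsetException;

// setText(InputStream) needs a stream that supports mark/reset, hence the BufferedInputStream.
BufferedInputStream bis = new BufferedInputStream(input);
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
CharsetMatch cm = cd.detect();

if (cm != null) {
    reader = cm.getReader();   // reader that decodes the stream with the detected charset
    charset = cm.getName();    // name of the detected charset
} else {
    // UnsupportedCharsetException has no no-arg constructor; it takes a charset name.
    throw new UnsupportedCharsetException("unknown");
}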

score 31 · Accepted Answer

Here are my favorites:

TikaEncodingDetector

Dependency:

<dependency>
  <groupId>org.apache.any23</groupId>
  <artifactId>apache-any23-encoding</artifactId>
  <version>1.1</version>
</dependency>

Sample:

public static Charset guessCharset(InputStream is) throws IOException {
  return Charset.forName(new TikaEncodingDetector().guessEncoding(is));    
}

GuessEncoding

Dependency:

<dependency>
  <groupId>org.codehaus.guessencoding</groupId>
  <artifactId>guessencoding</artifactId>
  <version>1.4</version>
  <type>jar</type>
</dependency>

Sample:

  public static Charset guessCharset2(File file) throws IOException {
    return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
  }

score 14 · Accepted Answer

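You can certainly validate the file for a particular charset by decoding it with a CharsetDecoder and watching out for "malformed-input" or "unmappable-character" errors. Of course, this only tells you if a charset is wrong; it doesn't tell you if it is correct. For that, you need a basis of comparison to evaluate the decoded results, e.g. do you know beforehand whether the characters are restricted to some subset, or whether the text adheres to some strict format? The bottom line is that charset detection is guesswork without any guarantees.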
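A minimal sketch of that validation, using only java.nio.charset, might look like this. The class name CharsetValidator is made up for illustration, and a result of true only means the charset was not ruled out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper: checks whether bytes decode cleanly under a charset.
class CharsetValidator {

    // Returns false if `data` definitely is not valid in `charset`.
    // Returns true if decoding succeeds, which does NOT prove the charset is right.
    static boolean canDecode(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   // Report errors instead of silently substituting U+FFFD.
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Note that ISO-8859-1 assigns a character to every byte value, so this check passes for any input with that charset. It is most useful for ruling out stricter encodings such as UTF-8 or US-ASCII.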

score 14 · Accepted Answer

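Which library to use?

At the time of this writing, three libraries emerge: GuessEncoding, ICU4j and juniversalchardet.

I don't include Apache Any23 because it uses ICU4j 3.4 under the hood.

How to tell which one has detected the right charset (or as close as possible)?

It's impossible to certify the charset detected by each of the libraries above. However, it's possible to ask them in turn and score the returned responses.

How to score the returned responses?

Each response can be assigned one point. The more points a response has, the more confidence the detected charset has. This is a simple scoring method. You can elaborate others.

Is there any sample code?

Here is a full snippet implementing the strategy described in the previous lines.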

public static String guessEncoding(InputStream input) throws IOException {
    // Load input data
    long count = 0;
    int n = 0, EOF = -1;
    byte[] buffer = new byte[4096];
    ByteArrayOutputStream output = new ByteArrayOutputStream();

    while ((EOF != (n = input.read(buffer))) && (count <= Integer.MAX_VALUE)) {
        output.write(buffer, 0, n);
        count += n;
    }
    
    if (count > Integer.MAX_VALUE) {
        throw new RuntimeException("Inputstream too large.");
    }

    byte[] data = output.toByteArray();

    // Detect encoding
    Map<String, int[]> encodingsScores = new HashMap<>();

    // * GuessEncoding
    updateEncodingsScores(encodingsScores, new CharsetToolkit(data).guessEncoding().displayName());

    // * ICU4j
    CharsetDetector charsetDetector = new CharsetDetector();
    charsetDetector.setText(data);
    charsetDetector.enableInputFilter(true);
    CharsetMatch cm = charsetDetector.detect();
    if (cm != null) {
        updateEncodingsScores(encodingsScores, cm.getName());
    }

    // * juniversalchardet
    UniversalDetector universalDetector = new UniversalDetector(null);
    universalDetector.handleData(data, 0, data.length);
    universalDetector.dataEnd();
    String encodingName = universalDetector.getDetectedCharset();
    if (encodingName != null) {
        updateEncodingsScores(encodingsScores, encodingName);
    }

    // Find winning encoding
    Map.Entry<String, int[]> maxEntry = null;
    for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
        if (maxEntry == null || (e.getValue()[0] > maxEntry.getValue()[0])) {
            maxEntry = e;
        }
    }

    String winningEncoding = maxEntry.getKey();
    //dumpEncodingsScores(encodingsScores);
    return winningEncoding;
}

private static void updateEncodingsScores(Map<String, int[]> encodingsScores, String encoding) {
    String encodingName = encoding.toLowerCase();
    int[] encodingScore = encodingsScores.get(encodingName);

    if (encodingScore == null) {
        encodingsScores.put(encodingName, new int[] { 1 });
    } else {
        encodingScore[0]++;
    }
}    

private static void dumpEncodingsScores(Map<String, int[]> encodingsScores) {
    System.out.println(toString(encodingsScores));
}

private static String toString(Map<String, int[]> encodingsScores) {
    String GLUE = ", ";
    StringBuilder sb = new StringBuilder();

    for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
        sb.append(e.getKey() + ":" + e.getValue()[0] + GLUE);
    }
    int len = sb.length();
    sb.delete(len - GLUE.length(), len);

    return "{ " + sb.toString() + " }";
}

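Improvement: The guessEncoding method reads the input stream entirely. For large input streams this can be a concern. All these libraries would read the whole input stream, which means detecting the charset can take a long time.

It's possible to limit the initial data loading to a few bytes and perform the charset detection on those few bytes only.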
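One way to cap the amount of data loaded is sketched below. The class name PrefixReader and the idea of passing the limit as a parameter are assumptions made for this example. The IOException is wrapped in an UncheckedIOException only to keep the sketch short.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical helper: reads at most maxBytes from the start of a stream.
class PrefixReader {

    // Returns up to maxBytes bytes; fewer if the stream ends earlier.
    static byte[] readPrefix(InputStream input, int maxBytes) {
        byte[] buffer = new byte[maxBytes];
        int total = 0;
        try {
            // A single read() may return fewer bytes than requested, so loop
            // until the buffer is full or the stream is exhausted.
            while (total < maxBytes) {
                int n = input.read(buffer, total, maxBytes - total);
                if (n < 0) {
                    break; // end of stream
                }
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(buffer, total);
    }
}
```

The returned array could then be handed to the detectors in place of the full data array. A prefix may cut a multi-byte character in half at the end, and a short sample gives statistical detectors less to work with, so the limit is a trade-off between speed and accuracy.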

score 9 · Accepted Answer

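As far as I know, there is no general library in this context that suits all types of problems. So, for each problem you should test the existing libraries and select the one that best satisfies your problem's constraints, but often none of them is appropriate. In those cases you can write your own encoding detector, as I did.

I have written a meta Java tool for detecting the charset encoding of HTML web pages, using IBM ICU4j and Mozilla JCharDet as built-in components. Here you can find my tool; please read the README section first. You can also find some basic concepts of this problem in my paper and in its references.

Below are some helpful remarks from my experience with this work: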

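Charset detection is not a foolproof process, because it is essentially based on statistical data and what actually happens is guessing, not detecting
icu4j is the main tool in this context by IBM, imho
Both TikaEncodingDetector and Lucene-ICU4j use icu4j, and their accuracy had no meaningful difference from icu4j in my tests (at most 1%, as I remember)
icu4j is much more general than jchardet; icu4j is just a bit biased towards IBM family encodings, while jchardet is strongly biased towards utf-8
Due to the widespread use of UTF-8 in the HTML world, jchardet is a better choice than icu4j overall, but it is not the best choice!
icu4j is great for East Asian specific encodings like EUC-KR, EUC-JP, SHIFT_JIS, BIG5 and the GB family encodings
Both icu4j and jchardet fail when dealing with HTML pages in the Windows-1251 and Windows-1256 encodings. Windows-1251 aka cp1251 is widely used for Cyrillic-based languages like Russian, and Windows-1256 aka cp1256 is widely used for Arabic
Almost all encoding detection tools use statistical methods, so the accuracy of the output strongly depends on the size and the contents of the input
Some encodings are essentially the same with only partial differences, so in some cases the guessed or detected encoding may be false but at the same time be true! This is the case for Windows-1252 and ISO-8859-1. (Refer to the last paragraph of section 5.2 of my paper.)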
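The remark about Windows-1252 and ISO-8859-1 can be illustrated with a small sketch. The class name Latin1VsCp1252 is made up, and the example assumes the JVM provides the windows-1252 charset, which is common but is not one of the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;

// Hypothetical demo: decodes a single byte under a named charset.
class Latin1VsCp1252 {

    // Assumes charsetName is available on this JVM; Charset.forName throws
    // UnsupportedCharsetException otherwise.
    static String decodeByte(int value, String charsetName) {
        return new String(new byte[] {(byte) value}, Charset.forName(charsetName));
    }
}
```

Most bytes decode identically under both charsets, for instance 0x41 is "A" and 0xE9 is "é" in each. The two differ only in the range 0x80 to 0x9F: byte 0x80 is the euro sign in Windows-1252 but the control character U+0080 in ISO-8859-1. A file that contains no bytes in that range is valid and identical in both, so either answer from a detector is right.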

score 8 · Accepted Answer

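The libraries above are simple BOM detectors, which of course only work if there is a BOM at the beginning of the file. Take a look at http://jchardet.sourceforge.net/ which does scan the text.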

score 5 · Accepted Answer

I found a nice third-party library which can detect the actual encoding: http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding

I didn't test it extensively, but it seems to work.

score 5 · Accepted Answer

If you use ICU4J (http://icu-project.org/apiref/icu4j/)

Here is my code:

String charset = "ISO-8859-1"; // Default charset, put whatever you want

/*
 * Read the whole file into a byte array.
 * Files.readAllBytes (java.nio.file.Files) reads to end of file and closes it,
 * whereas a single FileInputStream.read(byte[]) call may return fewer bytes
 * than requested and leaves the stream open.
 */
byte[] data = Files.readAllBytes(file.toPath());

CharsetDetector detector = new CharsetDetector();
detector.setText(data);

CharsetMatch cm = detector.detect();

if (cm != null) {
    int confidence = cm.getConfidence();
    System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
    //Here you have the encode name and the confidence
    //In my case if the confidence is > 50 I return the encode, else I return the default value
    if (confidence > 50) {
        charset = cm.getName();
    }
}

Remember to add all the try-catch blocks it needs.

I hope this works for you.

score 4 · Accepted Answer

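If you don't know the encoding of your data, it is not so easy to determine, but you could try to use a library to guess it. Also, there is a similar question.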

score 2 · Accepted Answer

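For ISO8859_1 files, there is no easy way to distinguish them from ASCII. For Unicode files, however, one can generally detect this based on the first few bytes of the file.

UTF-8 and UTF-16 files can include a Byte Order Mark (BOM) at the very beginning of the file (it is optional, and many UTF-8 files have none). The BOM is a zero-width non-breaking space.

Unfortunately, for historical reasons, Java does not detect this automatically. Programs like Notepad will check the BOM and use the appropriate encoding. Using unix or Cygwin, you can check the BOM with the file command. For example: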

$ file sample2.sql 
sample2.sql: Unicode text, UTF-16, big-endian
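A BOM check like the one the file command performs can be sketched in Java as follows. The class name BomSniffer is made up for illustration, and the sketch deliberately covers only the UTF-8 and UTF-16 marks mentioned above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: inspects the first bytes of a file for a BOM.
class BomSniffer {

    // Returns the charset indicated by a leading BOM, or null if none is found.
    // Limitation: a UTF-32LE BOM (FF FE 00 00) would be reported as UTF-16LE.
    static Charset detectBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: the encoding is still unknown
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare the byte as an unsigned value.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A null result says nothing about the encoding, since files without a BOM are common. When a BOM is found, the caller also has to skip it before handing the rest of the bytes to a decoder.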

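For Java, I suggest you check out this code, which will detect the common file formats and select the correct encoding: How to read a file and automatically specify the correct encoding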

score 1 · Accepted Answer

An alternative to TikaEncodingDetector is to use Tika AutoDetectReader.

Charset charset = new AutoDetectReader(new FileInputStream(file)).getCharset();

score 0 · Accepted Answer

A good strategy to handle this is to use a method that auto-detects the input charset.

I use org.xml.sax.InputSource in Java 11 to solve it:

...    
import org.xml.sax.InputSource;
...

InputSource inputSource = new InputSource(inputStream);
inputStreamReader = new InputStreamReader(
    inputSource.getByteStream(), inputSource.getEncoding()
  );

Input sample:

<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...
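For XML input like the sample above, the encoding named in the XML declaration can also be read with the JDK's StAX API. This is a hedged alternative sketch, not the approach of the answer above: the class name XmlDeclaredEncoding is made up, and the XMLStreamException is wrapped only to keep the example short. Note that InputSource.getEncoding() returns the encoding that was set on the InputSource, or null if none was supplied, so it is worth checking for null before passing it on.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: reads the encoding named in the XML declaration.
class XmlDeclaredEncoding {

    // Returns the declared encoding, or null if the document declares none.
    static String declaredEncoding(InputStream in) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                // The value of encoding="..." in <?xml ...?>, if present.
                return reader.getCharacterEncodingScheme();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML prolog", e);
        }
    }
}
```

This only works for XML, and it trusts the declaration, which may itself be wrong if the file was re-encoded without updating its header.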

score -1 · Accepted Answer

In plain Java:

final String[] encodings = { "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16" };

List<String> lines;

for (String encoding : encodings) {
    try {
        lines = Files.readAllLines(path, Charset.forName(encoding));
        for (String line : lines) {
            // do something...
        }
        break;
    } catch (IOException ioe) {
        System.out.println(encoding + " failed, trying next.");
    }
}

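This approach will try the encodings one by one until one works or we run out of them. Note that ISO-8859-1 assigns a character to every byte value, so decoding with it never fails and the entries listed after it are never reached. (BTW, my encodings list has only those items because they are the charset implementations required on every Java platform, https://docs.oracle.com/javase/9/docs/api/java/nio/charset/Charset.html)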

score -12 · Accepted Answer

Can you pick the appropriate charset in the constructor:

new InputStreamReader(new FileInputStream(in), "ISO8859_1");
