java - How to identify different encodings without using a BOM?

Question

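I have a file watcher that grabs content from a growing file encoded as UTF-16LE. The first chunk of data written to it has a BOM, which I was using to distinguish it from UTF-8 (the encoding most of my incoming files use). I catch the BOM and re-encode to UTF-8 so my parser doesn't freak out. The problem is that, since it is a growing file, not every chunk of data has the BOM in it.

Here's my question: without prepending the BOM bytes to each set of data I have (I don't control the source), can I just look for the null bytes that are inherent in UTF-16 (\000) and use those as my identifier instead of the BOM? Will this cause me headaches down the road?

My architecture involves a Ruby web application that logs the received data to a temporary file, which my parser, written in Java, then picks up.

Right now my identification/re-encoding code looks like this: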

  // guess encoding if utf-16 then
  // convert to UTF-8 first
  try {
    FileInputStream fis = new FileInputStream(args[args.length-1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);

    if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
      String asString = new String(contents, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF8");
      FileOutputStream fos = new FileOutputStream(args[args.length-1]);
      fos.write(newBytes);
      fos.close();
    }

    fis.close();
  } catch(Exception e) {
    e.printStackTrace();
  }

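Update

I want to support things like the euro sign, em dashes and other such characters. I modified the code above to look like this, and it seems to pass all my tests for those characters: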

  // guess encoding if utf-16 then
  // convert to UTF-8 first
  try {
    FileInputStream fis = new FileInputStream(args[args.length-1]);
    byte[] contents = new byte[fis.available()];
    fis.read(contents, 0, contents.length);
    byte[] real = null;

    int found = 0;

    // if found a BOM then skip out of here... we just need to convert it
    if ( (contents[0] == (byte)0xFF) && (contents[1] == (byte)0xFE) ) {
      found = 3;
      real = contents;

    // no BOM detected but still could be UTF-16
    } else {

      // count null bytes in the first 10 bytes (fewer if the input is shorter)
      for(int cnt=0; cnt < Math.min(10, contents.length); cnt++) {
        if(contents[cnt] == (byte)0x00) { found++; }
      }

      // tack on BOM and copy over new array (done once, after counting)
      real = new byte[contents.length+2];
      real[0] = (byte)0xFF;
      real[1] = (byte)0xFE;
      System.arraycopy(contents, 0, real, 2, contents.length);

    }

    if(found >= 2) {
      String asString = new String(real, "UTF-16");
      byte[] newBytes = asString.getBytes("UTF8");
      FileOutputStream fos = new FileOutputStream(args[args.length-1]);
      fos.write(newBytes);
      fos.close();
    }

    fis.close();
  } catch(Exception e) {
    e.printStackTrace();
  }

What do you all think?

score 6 · Accepted Answer

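In general, you cannot identify the character encoding of a data stream with 100% accuracy. The best you can do is try to decode using a limited set of expected encodings, and then apply some heuristics to the decoded result to see whether it "looks like" text in the expected language. (But any heuristic will give false positives and false negatives for some data streams.) Alternatively, put a human in the loop to decide which decoding makes the most sense.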
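The "try a limited set of encodings, then apply a heuristic" idea could be sketched roughly as follows. This is only an illustration: the class name `EncodingGuesser`, the candidate list (UTF-8, then UTF-16LE) and the `looksLikeText` rule are assumptions chosen for this example, and the heuristic is deliberately crude, so it will misjudge some inputs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

final class EncodingGuesser {
    // Assumed candidate list: the encodings this application expects, in order of preference.
    static final List<Charset> CANDIDATES =
            List.of(StandardCharsets.UTF_8, StandardCharsets.UTF_16LE);

    /** Returns the first candidate that decodes the bytes strictly and passes the heuristic. */
    static Optional<Charset> guess(byte[] data) {
        for (Charset cs : CANDIDATES) {
            try {
                // REPORT makes the decoder throw instead of silently substituting U+FFFD.
                String text = cs.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data))
                        .toString();
                if (looksLikeText(text)) {
                    return Optional.of(cs);
                }
            } catch (CharacterCodingException e) {
                // Not valid in this encoding; fall through to the next candidate.
            }
        }
        return Optional.empty();
    }

    /** Crude heuristic (an assumption): reject control characters other than whitespace. */
    static boolean looksLikeText(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isISOControl(c) && !Character.isWhitespace(c)) {
                return false;
            }
        }
        return true;
    }
}
```

Note that ASCII-only UTF-16LE data is also valid UTF-8 (NUL is a legal UTF-8 byte), so strict decoding alone cannot separate the two; in this sketch it is the heuristic that rejects the NUL-laden UTF-8 reading.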

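A better solution is to redesign your protocol so that whatever supplies the data must also supply the encoding scheme used for it. (And if you can't, blame whoever is responsible for designing or implementing the system that can't give you an encoding scheme!)

EDIT: from your comments on the question, the data files are being delivered via HTTP. In that case, you should arrange for your HTTP server to capture the "Content-Type" header of the POST requests that deliver the data, extract the character set / encoding from the header, and save it in a way and place that your file parser can deal with.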
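As a rough illustration of the extraction step, the sketch below pulls the `charset` parameter out of a Content-Type header value. It assumes the web application has saved the raw header value somewhere the parser can read it; the class name `ContentTypeCharset`, the method `fromHeader` and the fallback parameter are hypothetical, and the parsing is simplified (it does not handle every quoting rule that HTTP allows).

```java
import java.nio.charset.Charset;
import java.util.Locale;

final class ContentTypeCharset {
    /** Extracts the charset parameter from a Content-Type value, or returns the fallback. */
    static Charset fromHeader(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type (e.g. "text/plain"); parameters follow it.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

For example, `fromHeader("text/plain; charset=UTF-16LE", StandardCharsets.UTF_8)` yields the UTF-16LE charset, while a header with no `charset` parameter yields the fallback.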

score 0 · Accepted Answer

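This will cause you headaches, no question about it. You can check for alternating zero bytes in the simple cases (ASCII-only text in UTF-16, either byte order), but as soon as you start getting a stream of characters above the 0x7f code point, that method becomes useless.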
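A small sketch of why the zero-byte check is fragile. The class and method names are made up for this example, and only the little-endian case is checked, where the high byte of each code unit sits at the odd index. ASCII text passes, but the euro sign (U+20AC) and the em dash (U+2014) mentioned in the question have non-zero high bytes and fail.

```java
import java.nio.charset.StandardCharsets;

final class NullByteHeuristic {
    /** True if every odd-indexed byte (the high byte in UTF-16LE) is zero. */
    static boolean looksLikeUtf16LeAscii(byte[] data) {
        if (data.length < 2 || data.length % 2 != 0) {
            return false;
        }
        for (int i = 1; i < data.length; i += 2) {
            if (data[i] != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] ascii = "price".getBytes(StandardCharsets.UTF_16LE);
        // Euro sign and em dash: bytes AC 20 14 20, so no zero bytes at all.
        byte[] symbols = "\u20AC\u2014".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(looksLikeUtf16LeAscii(ascii));   // prints true
        System.out.println(looksLikeUtf16LeAscii(symbols)); // prints false
    }
}
```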

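If you have the file handle, your best bet is to save the current file pointer, seek to the start, read the BOM, and then seek back to the original position.

Either that, or remember the BOM somehow.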
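The save/seek/read/restore sequence might look like the sketch below, assuming the parser can open the growing file through `java.io.RandomAccessFile`. The class name `BomPeek` and its method are hypothetical, and only the UTF-16LE BOM (FF FE) from the question is checked.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

final class BomPeek {
    /** Returns true if the file starts with FF FE, leaving the file pointer where it was. */
    static boolean startsWithUtf16LeBom(RandomAccessFile file) throws IOException {
        long saved = file.getFilePointer(); // remember the current read position
        try {
            if (file.length() < 2) {
                return false;
            }
            file.seek(0);
            int b0 = file.read();
            int b1 = file.read();
            return b0 == 0xFF && b1 == 0xFE;
        } finally {
            file.seek(saved); // restore the position even if reading fails
        }
    }
}
```

The result could then be cached per file, which covers the "remember the BOM somehow" alternative without re-reading the start of the file for every new chunk.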

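Relying on the data content is a bad idea unless you are absolutely certain that the range of characters will be restricted for all inputs.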

score 0 · Accepted Answer

This question contains a few options for character detection which don't appear to require a BOM.

My project is currently using jCharDet but I might need to look at some of the other options listed there as jCharDet is not 100% reliable.
