java - Refactoring automatic detection of a file's encoding

Question

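I need to check the encoding of files. This code works, but it is a bit long. How can I refactor this logic? Perhaps there is a different approach that would serve this goal?

Code: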

class CharsetDetector implements Checker {

    Charset detectCharset(File currentFile, String[] charsets) {
        Charset charset = null;

        for (String charsetName : charsets) {
            charset = detectCharset(currentFile, Charset.forName(charsetName));
            if (charset != null) {
                break;
            }
        }

        return charset;
    }

    private Charset detectCharset(File currentFile, Charset charset) {
        try {
            BufferedInputStream input = new BufferedInputStream(
                    new FileInputStream(currentFile));

            CharsetDecoder decoder = charset.newDecoder();
            decoder.reset();

            byte[] buffer = new byte[512];
            boolean identified = false;
            while ((input.read(buffer) != -1) && (!identified)) {
                identified = identify(buffer, decoder);
            }

            input.close();

            if (identified) {
                return charset;
            } else {
                return null;
            }

        } catch (Exception e) {
            return null;
        }
    }

    private boolean identify(byte[] bytes, CharsetDecoder decoder) {
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
        } catch (CharacterCodingException e) {
            return false;
        }
        return true;
    }

    @Override
    public boolean check(File fileChack) {
        if (charsetDetector(fileChack)) {
            return true;
        }
        return false;
    }

    private boolean charsetDetector(File currentFile) {
        String[] charsetsToBeTested = { "UTF-8", "windows-1253", "ISO-8859-7" };

        CharsetDetector charsetDetector = new CharsetDetector();
        Charset charset = charsetDetector.detectCharset(currentFile,
                charsetsToBeTested);

        if (charset != null) {
            try {
                InputStreamReader reader = new InputStreamReader(
                        new FileInputStream(currentFile), charset);

                @SuppressWarnings("unused")
                int valueReaders = 0;
                while ((valueReaders = reader.read()) != -1) {
                    return true;
                }

                reader.close();
            } catch (FileNotFoundException exc) {
                System.out.println("File not found!");
                exc.printStackTrace();
            } catch (IOException exc) {
                exc.printStackTrace();
            }
        } else {
            System.out.println("Unrecognized charset.");
            return false;
        }

        return true;
    }
}

Questions:

How can this program logic be refactored?
What other ways are there to detect the encoding (such as UTF-16 sequences, etc.)?

score 5 · Accepted Answer

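The best way to refactor this code is to bring in a third-party library that does the character detection for you, since it will probably do a better job and will make your code smaller. See this question for several options.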

score 3 · Accepted Answer

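As has already been pointed out, you cannot "know" or "detect" the encoding of a file. Complete accuracy requires that you be told, because there is almost always a byte sequence that is ambiguous with respect to several character encodings.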

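You will find more discussion of detecting UTF-8 vs. ISO8859-1 in this SO question. The essential answer is to check every byte sequence in the file to verify its compatibility with the expected encoding. For the UTF-8 byte-encoding rules, see http://en.wikipedia.org/wiki/UTF-8.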
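As a rough illustration of "check every byte sequence for compatibility", here is a minimal sketch that leans on the JDK's own `CharsetDecoder` with error reporting enabled. The class and method names (`StrictDecodeCheck`, `isCompatible`) are made up for this example. It decodes the whole byte array at once, on the assumption that the file fits in memory, which also avoids multi-byte sequences being split across buffer boundaries.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, not part of the original answer.
final class StrictDecodeCheck {

    /** Returns true if the bytes decode in the given charset without any error. */
    static boolean isCompatible(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   // Fail instead of silently substituting a replacement character.
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Note that a `true` result only means the bytes are compatible with that charset. Single-byte charsets such as ISO-8859-1 accept any byte sequence, which is exactly the ambiguity described above.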

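In particular, there is a very interesting paper on detecting character encodings/sets, http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html, whose authors claim extremely high accuracy (for a guess!). The price is a very complex detection system, complete with knowledge about character frequencies in different languages, which will not fit in the 30 lines the OP hints at as the desired code size. Apparently this detection algorithm is built into Mozilla, so you can likely find and extract it.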

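We chose a much simpler scheme: a) trust the character set you are told, if you are told one; b) if not, check for a BOM and trust what it says if one is present; otherwise sniff for pure 7-bit ASCII, then UTF-8, or ISO 8859, in that order. You can build an ugly routine that does this in one pass over the file.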
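The scheme above might be sketched roughly as follows. This is not the code we actually used: the names (`SimpleCharsetGuess`, `guess`, `fromBom`) are invented for illustration, it makes several passes over an in-memory byte array rather than a single pass, it only recognises UTF-8 and UTF-16 BOMs, and it assumes ISO-8859-1 as the final fallback since "ISO 8859" does not name a specific part.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the "trust, then BOM, then sniff" scheme.
final class SimpleCharsetGuess {

    static Charset guess(byte[] data, Charset declared) {
        if (declared != null) return declared;          // a) trust what you were told
        Charset bom = fromBom(data);                    // b) otherwise trust a BOM if present
        if (bom != null) return bom;
        if (isPureAscii(data)) return StandardCharsets.US_ASCII;
        if (decodes(data, StandardCharsets.UTF_8)) return StandardCharsets.UTF_8;
        return StandardCharsets.ISO_8859_1;             // assumed fallback; accepts any bytes
    }

    /** Recognises UTF-8 and UTF-16 byte order marks only (UTF-32 is not handled). */
    static Charset fromBom(byte[] d) {
        if (d.length >= 3 && (d[0] & 0xFF) == 0xEF && (d[1] & 0xFF) == 0xBB
                && (d[2] & 0xFF) == 0xBF) return StandardCharsets.UTF_8;
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFE && (d[1] & 0xFF) == 0xFF)
            return StandardCharsets.UTF_16BE;
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFF && (d[1] & 0xFF) == 0xFE)
            return StandardCharsets.UTF_16LE;
        return null;
    }

    /** True if no byte has its high bit set. */
    static boolean isPureAscii(byte[] d) {
        for (byte b : d) {
            if (b < 0) return false;
        }
        return true;
    }

    /** True if the bytes decode without error; new decoders report errors by default. */
    static boolean decodes(byte[] d, Charset cs) {
        try {
            cs.newDecoder().decode(ByteBuffer.wrap(d));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

The order matters: pure ASCII is also valid UTF-8, and anything at all is valid ISO-8859-1, so the most restrictive tests have to come first.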

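(I think the problem will get worse over time. Unicode gets a new revision every year, with really subtle differences in which code points are valid. To do this right, you need to check each code point for validity. If we're lucky, they are all backward compatible.)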

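[Edit: The OP seems to be having trouble coding this in Java. Our solution and the sketch on the other page are not written in Java, so I can't just copy and paste an answer. I'll draft a Java version here based on his code; it has not been compiled or tested. YMMV]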

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Java version of the character-sniffing test on the other page.
// This only checks for a UTF-8-compatible bit-pattern layout.
// A tighter test (what we actually did) would check for valid UTF-8 code points.
// Returns the length in bytes of the sequence starting at buf_index,
// or 0 if the bytes there cannot be UTF-8.
static int UTF8size(byte[] buffer, int buf_index) {
    int first_character = buffer[buf_index];

    // This first-character test might be faster as a switch statement
    if ((first_character & 0x80) == 0) return 1; // ASCII subset character, fast path
    else if ((first_character & 0xF8) == 0xF0) { // start of 4-byte sequence
        if (buf_index + 3 >= buffer.length) return 0;
        if (((buffer[buf_index + 1] & 0xC0) == 0x80)
         && ((buffer[buf_index + 2] & 0xC0) == 0x80)
         && ((buffer[buf_index + 3] & 0xC0) == 0x80))
            return 4;
    }
    else if ((first_character & 0xF0) == 0xE0) { // start of 3-byte sequence
        if (buf_index + 2 >= buffer.length) return 0;
        if (((buffer[buf_index + 1] & 0xC0) == 0x80)
         && ((buffer[buf_index + 2] & 0xC0) == 0x80))
            return 3;
    }
    else if ((first_character & 0xE0) == 0xC0) { // start of 2-byte sequence
        if (buf_index + 1 >= buffer.length) return 0;
        if ((buffer[buf_index + 1] & 0xC0) == 0x80)
            return 2;
    }
    return 0; // stray continuation byte, invalid lead byte, or bad continuation
}

public static boolean isUTF8(File file) {
    if (null == file) {
        throw new IllegalArgumentException("input file can't be null");
    }
    if (file.isDirectory()) {
        throw new IllegalArgumentException("input file refers to a directory");
    }

    // read the whole input file into memory
    byte[] buffer;
    try {
        buffer = Files.readAllBytes(file.toPath());
    } catch (IOException e) {
        throw new IllegalArgumentException("Can't read input file, error = " + e.getLocalizedMessage());
    }

    int buf_index = 0;
    while (buf_index < buffer.length) {
        int step = UTF8size(buffer, buf_index);
        if (step == 0) return false; // definitely not a UTF-8 file
        buf_index += step;
    }

    return true; // appears to be a UTF-8 file
}
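For the "tighter test" mentioned in the comments above, one possible sketch is shown below. It is an illustration with invented names (`StrictUtf8`, `sequenceLength`, `isWellFormed`), not the code we actually ran. On top of the bit-pattern layout, it rejects overlong encodings, surrogate code points (U+D800 to U+DFFF) and values above U+10FFFF. It does not check whether a code point is actually assigned in any particular Unicode version.

```java
// Hypothetical stricter variant of the bit-pattern check.
final class StrictUtf8 {

    /** Returns the length of the well-formed UTF-8 sequence at index i, or 0 if invalid. */
    static int sequenceLength(byte[] b, int i) {
        int b0 = b[i] & 0xFF;
        if (b0 < 0x80) return 1;                 // ASCII fast path

        int len, min, cp;
        if ((b0 & 0xE0) == 0xC0)      { len = 2; min = 0x80;    cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; min = 0x800;   cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; min = 0x10000; cp = b0 & 0x07; }
        else return 0;                           // stray continuation or invalid lead byte

        if (i + len > b.length) return 0;        // truncated sequence
        for (int k = 1; k < len; k++) {
            int c = b[i + k] & 0xFF;
            if ((c & 0xC0) != 0x80) return 0;    // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return 0;                  // overlong encoding
        if (cp > 0x10FFFF) return 0;             // beyond the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; // surrogates are not valid in UTF-8
        return len;
    }

    /** True if the whole array is a sequence of well-formed UTF-8 code points. */
    static boolean isWellFormed(byte[] b) {
        int i = 0;
        while (i < b.length) {
            int step = sequenceLength(b, i);
            if (step == 0) return false;
            i += step;
        }
        return true;
    }
}
```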
