java - OutOfMemoryError when detecting UTF-8 encoding

Question

This class should check currentFile and detect its encoding. If the result is UTF-8, it returns true.

After running it, the output is java.lang.OutOfMemoryError: Java heap space.

To read the data I use the JDK 7 method Files.readAllBytes(path).

Code:

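import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Checker is the asker's own interface, defined elsewhere
class EncodingsCheck implements Checker {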

    @Override
    public boolean check(File currentFile) {
        return isUTF8(currentFile);
    }

    public static boolean isUTF8(File file) {
        // validate input
        if (null == file) {
            throw new IllegalArgumentException("input file can't be null");
        }
        if (file.isDirectory()) {
            throw new IllegalArgumentException(
                    "input file refers to a directory");
        }

        // read input file
        byte[] buffer;
        try {
            buffer = readUTFHeaderBytes(file);
        } catch (IOException e) {
            throw new IllegalArgumentException(
                    "Can't read input file, error = " + e.getLocalizedMessage());
        }

        if (0 == (buffer[0] & 0x80)) {
            return true; // ASCII subset character, fast path
        } else if (0xF0 == (buffer[0] & 0xF8)) { // start of 4-byte sequence
            if (buffer.length < 4) { // compare the length, not the byte value
                return false;
            }
            if ((0x80 == (buffer[1] & 0xC0)) && (0x80 == (buffer[2] & 0xC0))
                    && (0x80 == (buffer[3] & 0xC0)))
                return true;
        } else if (0xE0 == (buffer[0] & 0xF0)) { // start of 3-byte sequence
            if (buffer.length < 3) { // compare the length, not the byte value
                return false;
            }
            if ((0x80 == (buffer[1] & 0xC0)) && (0x80 == (buffer[2] & 0xC0))) {
                return true;
            }
        } else if (0xC0 == (buffer[0] & 0xE0)) { // start of 2-byte sequence
            if (buffer.length < 2) { // compare the length, not the byte value
                return false;
            }
            if (0x80 == (buffer[1] & 0xC0)) {
                return true;
            }
        }

        return false;
    }

    private static byte[] readUTFHeaderBytes(File input) throws IOException {
        // read data
        Path path = Paths.get(input.getAbsolutePath());
        byte[] data = Files.readAllBytes(path);
        return data;
    }
}

Questions:

How can I fix this?
How can UTF-16 be checked in the same way (do we need to worry about it, or is it just useless trouble)?

score 2 · Accepted Answer

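You don't need to read the whole file. Files.readAllBytes loads the entire file into the heap, while the check only looks at the first four bytes.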

private static byte[] readUTFHeaderBytes(File input) throws IOException {
    // try-with-resources (JDK 7+) closes the stream even if read() throws
    try (FileInputStream fileInputStream = new FileInputStream(input)) {
        byte[] firstBytes = new byte[4];
        int count = fileInputStream.read(firstBytes);
        if (count < 4) {
            // count is -1 for an empty file and 1..3 for a very short one
            throw new IOException("File is shorter than 4 bytes");
        }
        return firstBytes;
    }
}
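
The header check only inspects the first few bytes, so it cannot say whether the rest of the file is well-formed UTF-8. If that matters, one option is to stream the file through a strict decoder in fixed-size chunks, which keeps memory use constant regardless of file size. The following is a minimal sketch under that assumption; the class name Utf8StreamCheck and the method isValidUtf8 are hypothetical and not part of the original code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: validates UTF-8 by streaming, never holding the whole file.
final class Utf8StreamCheck {

    static boolean isValidUtf8(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            return isValidUtf8(in);
        }
    }

    static boolean isValidUtf8(InputStream in) throws IOException {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        Reader reader = new InputStreamReader(in, decoder);
        char[] chunk = new char[8192];
        try {
            while (reader.read(chunk) != -1) {
                // decoded characters are discarded; only validity matters here
            }
            return true;
        } catch (CharacterCodingException e) {
            return false; // malformed or truncated byte sequence
        }
    }
}
```

Note that a pure ASCII file also passes this check, since ASCII is a subset of UTF-8, and that a successful decode shows the bytes are valid UTF-8, not that the author intended that encoding.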

To detect other UTF encodings, compare the start of the file against these byte order mark (BOM) patterns:

Bytes          Encoding form
00 00 FE FF    UTF-32, big-endian
FF FE 00 00    UTF-32, little-endian
FE FF          UTF-16, big-endian
FF FE          UTF-16, little-endian
EF BB BF       UTF-8
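
The patterns above can be matched against the header bytes with a small lookup, sketched below. The names BomSniffer, detectBom and startsWith are hypothetical, and the returned strings are just labels chosen for this example. The order of the tests matters: the UTF-32 little-endian mark begins with the same two bytes as the UTF-16 little-endian mark, so the longer patterns are tried first.

```java
// Hypothetical helper: maps a leading byte order mark to an encoding label.
final class BomSniffer {

    /** Returns the encoding named by a leading BOM, or null if none matches. */
    static String detectBom(byte[] head) {
        // Test the 4-byte marks first: FF FE 00 00 also starts with FF FE
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // mask to 0..255 because Java bytes are signed
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A null result only means that no BOM is present. Many UTF-8 files are written without one, so a BOM check complements the byte-pattern test in isUTF8 and does not replace it.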
