java - If I read a file byte by byte, shouldn't its content stay unchanged?

Question

Why does the following code change "öäüß"? (I am using it to split a large file into several smaller ones...)

InputStream is = new BufferedInputStream(new FileInputStream(file));
File newFile;
BufferedWriter bw;
newFile = new File(filePathBase + "." + String.valueOf(files.size() + 1) + fileExtension);
files.add(newFile);
bw = new BufferedWriter(new FileWriter(newFile));
try {
    byte[] c = new byte[1024];
    int lineCount = 0;
    int readChars = 0;
    while ( ( readChars = is.read(c) ) != -1 )
        for ( int i=0; i<readChars; i++ ) {
            bw.write(c[i]);
            if ( c[i] == '\n' )
                if ( ++lineCount % linesPerFile == 0 ) {
                    bw.close();
                    newFile = new File(filePathBase + "." + String.valueOf(files.size() + 1) + fileExtension);
                    files.add(newFile);
                    bw = new BufferedWriter(new FileWriter(newFile));
                }
        }
} finally {
    bw.close();
    is.close();
}

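My understanding of character encodings is that as long as I keep every byte the same, everything should stay the same. Why does this code change the bytes?

Thanks in advance!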

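==================== Solution =====================

The mistake was that FileWriter interprets what it is given as characters and should not be used to output raw bytes; thanks to @meriton and @Jonathan Rosen. Simply switching everything to BufferedOutputStream did not work for me either, because BufferedOutputStream was too slow! I ended up improving my file splitting and copying code to use a larger read buffer and to call write() only when necessary...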

File newFile = new File(filePathBase + "." + String.valueOf(files.size() + 1) + fileExtension);
files.add(newFile);
InputStream iS = new BufferedInputStream(new FileInputStream(file));
OutputStream oS = new FileOutputStream(newFile); // BufferedOutputStream wrapper toooo slow!
try {
    byte[] c;
    if ( linesPerFile > 65536 )
        c = new byte[65536];
    else
        c = new byte[1024];
    int lineCount = 0;
    int readChars = 0;
    while ( ( readChars = iS.read(c) ) != -1 ) {
        int from = 0;
        for ( int idx=0; idx<readChars; idx++ )
            if ( c[idx] == '\n' && ++lineCount % linesPerFile == 0 ) {
                oS.write(c, from, idx+1 - from);
                oS.close();
                from = idx+1;
                newFile = new File(filePathBase + "." + String.valueOf(files.size() + 1) + fileExtension);
                files.add(newFile);
                oS = new FileOutputStream(newFile);
            }
        oS.write(c, from, readChars - from);
    }
} finally {
    iS.close();
    oS.close();
}
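
For readers who want to check the byte-preserving behaviour without touching the file system, here is a minimal sketch of the same splitting loop written against plain streams. The class name `ByteSplitSketch`, the methods `split` and `splitInMemory`, the `Supplier<OutputStream>` parameter and the 8192-byte buffer are all assumptions made for this illustration and are not part of the code above. Like the original, it opens a new (possibly empty) part right after a boundary line.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class ByteSplitSketch {

    // Copies 'in' into a sequence of sinks, switching to a new sink after every
    // 'linesPerPart' occurrences of '\n'. Bytes are never decoded into characters.
    static void split(InputStream in, int linesPerPart, Supplier<OutputStream> nextSink)
            throws IOException {
        byte[] buf = new byte[8192]; // buffer size is an arbitrary choice for this sketch
        OutputStream out = nextSink.get();
        try {
            int lineCount = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                int from = 0;
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n' && ++lineCount % linesPerPart == 0) {
                        out.write(buf, from, i + 1 - from); // flush up to and including '\n'
                        out.close();
                        from = i + 1;
                        out = nextSink.get();
                    }
                }
                out.write(buf, from, n - from); // remainder of this buffer
            }
        } finally {
            out.close();
        }
    }

    // Hypothetical in-memory helper so the result can be inspected directly.
    static List<byte[]> splitInMemory(byte[] data, int linesPerPart) {
        List<ByteArrayOutputStream> sinks = new ArrayList<>();
        try {
            split(new ByteArrayInputStream(data), linesPerPart, () -> {
                ByteArrayOutputStream sink = new ByteArrayOutputStream();
                sinks.add(sink);
                return sink;
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<byte[]> parts = new ArrayList<>();
        for (ByteArrayOutputStream sink : sinks) {
            parts.add(sink.toByteArray());
        }
        return parts;
    }

    public static void main(String[] args) {
        byte[] data = "\u00f6\n\u00e4\n\u00fc\n\u00df\nx".getBytes(StandardCharsets.UTF_8);
        for (byte[] part : splitInMemory(data, 2)) {
            System.out.println(part.length + " bytes");
        }
    }
}
```

Concatenating the parts in order should give back the input byte for byte, whatever encoding the input uses, because only the byte value of '\n' is ever inspected.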

score 4 · Accepted Answer

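An InputStream reads bytes and an OutputStream writes them. A Reader reads characters and a Writer writes them.

You read with an InputStream and write with a FileWriter. That is, you read bytes but write characters. Specifically,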

bw.write(c[i]);

calls the method

public void write(int c) throws IOException

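whose Javadoc says:

Writes a single character. The character to be written is contained in the 16 low-order bits of the given integer value; the 16 high-order bits are ignored.

That is, the byte is implicitly converted to an int, which is then reinterpreted as a Unicode code point and written to the file using the platform default encoding (because you did not specify which encoding the FileWriter should use).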
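
The effect can be reproduced in memory. The sketch below pushes the UTF-8 bytes of ö through Writer.write(int) and through OutputStream.write(int) and prints both results as hex. The class name `WriterVsStreamDemo` and its helper methods are made up for this illustration, and UTF-8 is fixed explicitly so that the result does not depend on the platform default, unlike the FileWriter in the question. Note that a Java byte is signed, so values of 0x80 and above are sign-extended when widened to int, which changes the 16 low-order bits the Writer looks at.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriterVsStreamDemo {

    // Sends each byte through Writer.write(int), as bw.write(c[i]) does.
    static byte[] copyThroughWriter(byte[] input) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            for (byte b : input) {
                w.write(b); // byte -> int (sign-extended) -> char -> encoded bytes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Sends each byte through OutputStream.write(int), which keeps the low 8 bits.
    static byte[] copyThroughStream(byte[] input) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        for (byte b : input) {
            sink.write(b);
        }
        return sink.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] original = "\u00f6".getBytes(StandardCharsets.UTF_8); // ö
        System.out.println("original:   " + hex(original));
        System.out.println("via Writer: " + hex(copyThroughWriter(original)));
        System.out.println("via Stream: " + hex(copyThroughStream(original)));
    }
}
```

Only the stream copy should match the original bytes. With a different charset on the Writer, the damaged output would look different again, which is why the symptom varies between machines.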

score 1 · Accepted Answer

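You are reading bytes and writing characters. The line bw.write(c[i]); assumes that every byte is a character, but that is not necessarily true of the input file; it depends on the encoding used. Encodings such as UTF-8 may use 2 or more bytes per character, and you are converting each byte individually. For example, in UTF-8, ö is encoded as 2 bytes, hex c3 b6. When you process them individually, you may see the first character as Ã.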
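
The ö example can be checked with a short sketch. It encodes ö as UTF-8 and then decodes each byte on its own as ISO-8859-1, an encoding chosen here only because it maps every single byte to one character; that choice, the class name `PerByteDecodingDemo` and the helper `decodeEachByteAsLatin1` are assumptions for illustration. What the code in the question actually produces also depends on the platform default encoding.

```java
import java.nio.charset.StandardCharsets;

public class PerByteDecodingDemo {

    // Decodes every byte separately as ISO-8859-1 (one byte = one character).
    static String decodeEachByteAsLatin1(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(new String(new byte[] { b }, StandardCharsets.ISO_8859_1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u00f6".getBytes(StandardCharsets.UTF_8); // ö
        // The two bytes that make up ö in UTF-8
        System.out.printf("%02x %02x%n", utf8[0] & 0xff, utf8[1] & 0xff);
        // Treating each byte as its own character yields two characters
        System.out.println(decodeEachByteAsLatin1(utf8));
        // Decoding both bytes together as UTF-8 restores the single character
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

Decoded byte by byte, c3 becomes Ã (U+00C3) and b6 becomes ¶ (U+00B6), so one character turns into two.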

score 0 · Accepted Answer

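Try debugging your while condition ( readChars = is.read(c) ) != -1: it goes into an infinite loop, so bw.close(); is never called and the file remains open. If you try to do something with the file in the meantime, it gets corrupted and you get undesired results.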
