java - Convert an ANSI file with German characters to UTF8

Question

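I downloaded some plain-text files from a German website, but I am not sure what encoding they use. The files have no byte order mark. I am using a parser that assumes the files are UTF8-encoded, so it does not handle some accented characters correctly (those that fall in the byte range > 127).

I would like to convert the files to UTF8, but I am not sure whether I need to know the source encoding to do this correctly.

Other people handle these files by opening them manually in Windows Notepad and re-saving them as UTF8. That process preserves the accented characters, so if possible I would like to automate the conversion without resorting to Windows Notepad.

How does Windows Notepad know how to convert the file to UTF8 correctly?
How should I convert the file to UTF8 (in Java 6)?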

score 2 · Accepted Answer

In Java 7, read the text as "Windows-1252", which is Windows Latin-1.

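import java.nio.file.*;

Path oldPath = Paths.get("C:/Temp/old.txt");
Path newPath = Paths.get("C:/Temp/new.txt");
// Read the raw bytes and decode them as Windows-1252.
byte[] bytes = Files.readAllBytes(oldPath);
// Prepend a BOM (U+FEFF) so that Notepad recognizes the file as UTF-8.
String content = "\uFEFF" + new String(bytes, "Windows-1252");
bytes = content.getBytes("UTF-8");
// With no options given, Files.write creates the file or truncates an existing one.
Files.write(newPath, bytes);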

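This takes the bytes and interprets them as Windows Latin-1. As for Notepad, the trick is this: Notepad recognizes the encoding by a leading BOM (byte order mark) character, U+FEFF, a zero-width no-break space that UTF-8 normally does not need.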

The string is then encoded as UTF-8.
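To make the BOM step concrete: when U+FEFF is encoded as UTF-8 it becomes the three bytes EF BB BF at the start of the file. The sketch below checks for that prefix. The class and method names (`BomDemo`, `startsWithUtf8Bom`) are illustrative only and are not part of the answer's code.

```java
import java.nio.charset.Charset;

final class BomDemo {
    static final Charset UTF8 = Charset.forName("UTF-8");

    /** Returns true if the data starts with the UTF-8 encoded BOM, EF BB BF. */
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of its own encoding.
        String text = "Gr\u00FC\u00DFe";
        System.out.println(startsWithUtf8Bom(("\uFEFF" + text).getBytes(UTF8))); // true
        System.out.println(startsWithUtf8Bom(text.getBytes(UTF8)));              // false
    }
}
```

Whether to write the BOM depends on the consumer: Notepad uses it, but some UTF-8 parsers treat it as an unexpected leading character, so it may be worth leaving out if the downstream parser complains.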

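Windows-1252 is ISO-8859-1 (plain Latin-1), except that it places some extra printable characters, such as typographic quotation marks, in the range 0x80 - 0x9F, where ISO-8859-1 has control characters.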
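The difference between the two charsets can be seen by decoding the same byte both ways. This is a small sketch; the class and helper names (`Cp1252VsLatin1`, `decode`) are made up for illustration. Byte 0x84 is the German low opening quotation mark in Windows-1252, while 0xFC (ü) is the same in both charsets.

```java
import java.nio.charset.Charset;

final class Cp1252VsLatin1 {
    /** Decodes the bytes with the named charset. */
    static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] quote = { (byte) 0x84 };
        byte[] umlaut = { (byte) 0xFC };
        // The two charsets disagree on 0x84 ...
        System.out.printf("%04X%n", (int) decode(quote, "windows-1252").charAt(0)); // 201E
        System.out.printf("%04X%n", (int) decode(quote, "ISO-8859-1").charAt(0));   // 0084
        // ... and agree on 0xFC.
        System.out.printf("%04X%n", (int) decode(umlaut, "windows-1252").charAt(0)); // 00FC
        System.out.printf("%04X%n", (int) decode(umlaut, "ISO-8859-1").charAt(0));   // 00FC
    }
}
```

For text that contains only umlauts and ß, either charset gives the same result; the choice matters once the files contain typographic quotes, dashes, or the euro sign.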

In Java 6:

import java.io.*;

File oldPath = new File("C:/Temp/old.txt");
File newPath = new File("C:/Temp/new.txt");
long longLength = oldPath.length();
if (longLength > Integer.MAX_VALUE) {
    throw new IllegalArgumentException("File too large: " + oldPath.getPath());
}
byte[] bytes = new byte[(int) longLength];
// readFully keeps reading until the array is full; a single read() may return fewer bytes.
DataInputStream in = new DataInputStream(new FileInputStream(oldPath));
try {
    in.readFully(bytes);
} finally {
    in.close();
}

// Decode as Windows-1252, prepend the BOM, and re-encode as UTF-8.
String content = "\uFEFF" + new String(bytes, "Windows-1252");
bytes = content.getBytes("UTF-8");

OutputStream out = new FileOutputStream(newPath);
try {
    out.write(bytes);
} finally {
    out.close();
}
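For files too large to hold in memory, the same conversion can be done as a stream. The following is a sketch of an alternative, not the code from the answer: it assumes the source really is Windows-1252, uses only Java 6 APIs, and the names `StreamingConverter`, `convert`, and `writeBom` are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

final class StreamingConverter {
    /** Copies source to target, decoding Windows-1252 and encoding UTF-8 chunk by chunk. */
    static void convert(File source, File target, boolean writeBom) throws IOException {
        Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(source), "windows-1252"));
        try {
            Writer out = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(target), "UTF-8"));
            try {
                if (writeBom) {
                    out.write('\uFEFF'); // optional BOM for Notepad
                }
                char[] buffer = new char[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
            } finally {
                out.close();
            }
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        convert(new File("C:/Temp/old.txt"), new File("C:/Temp/new.txt"), true);
    }
}
```

Because the reader and writer do the decoding and encoding, no file-size check is needed and memory use stays bounded by the buffer size.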
