java - Reading content including the euro sign from a FileItem with code page 1252

Question

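The setup of my problem is as follows:

In a client/server architecture that communicates via web services, I receive CSV files from the client on the server side. The API gives me an org.apache.commons.fileupload.FileItem.

The code pages allowed for these files are code page 850 and code page 1252.

Everything works fine; the only problem is the euro sign (€). With code page 1252, my code does not handle the euro sign correctly. Instead of it, I see the character with Unicode U+00A4: ¤ when I print it to the console in Eclipse.

Currently I use the following code. It is spread over several classes; I have extracted the relevant lines.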

byte[] inputData = call.getImportDatei().get();

// the following method works correctly
// it returns Charset.forName("CP850") or Charset.forName("CP1252")
final Charset charset = retrieveCharset(inputData);

char[] stringContents;
final StringBuffer sb = new StringBuffer();

final String s = new String(inputData, charset.name());

// here I see the problem with the euro sign already
// the following code shouldn't be the problem

// here some special characters are converted, but this doesn't affect the problem, so I removed those lines
stringContents = s.toCharArray();
for(final char c : stringContents){
  sb.append(c);
}
final Reader stringReader = new StringReader(sb.toString());


// org.supercsv.io.CsvListReader
CsvListReader reader = new CsvListReader(stringReader, CsvPreference.EXCEL_NORTH_EUROPE_PREFERENCE);
// now this reader is used to read the CSV content...

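I have tried different things:

FileItem.getInputStream()

I used FileItem.getInputStream() to obtain the byte[], but the result is the same.

FileItem.getString()

When I use FileItem.getString(), it works perfectly with code page 1252: the euro sign is read correctly. I see it when I print it to the console in Eclipse. But with code page 850, many special characters are wrong.

FileItem.getString(String encoding)

So my idea was to use FileItem.getString(String encoding). But every string I tried in order to tell it to use code page 1252 produced no exception, only wrong results.

For example, getString(Charset.forName("CP1252").name()) results in a question mark instead of the euro sign.

How do I specify the encoding when using org.apache.commons.fileupload.FileItem?

Or is this the wrong approach?

Thanks in advance for your help!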

score 1 · Accepted Answer

I see it when I print it to the console in Eclipse. But with code page 850, many special characters are wrong.

You're being misled by focusing too much on the results presented by the Eclipse console. The underlying data is correct, but Eclipse presents it wrongly. On Windows, the console is by default configured to use cp1252 to present the characters printed by System.out.println(). Characters which were originally decoded with a different charset would therefore not be presented correctly.
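One way to take the console out of the picture is to print code points instead of glyphs. The following is only a minimal sketch: the class and method names are made up for illustration, and the single byte 0x80 stands in for the uploaded file content. If the decoded string holds U+20AC, the euro sign was decoded correctly, whatever the console shows.

```java
import java.nio.charset.Charset;

public class EuroCodePointCheck {

    // Decodes the raw bytes with the given charset and renders every
    // resulting character as a "U+XXXX" code point, separated by spaces.
    static String toCodePoints(byte[] data, Charset charset) {
        String decoded = new String(data, charset);
        StringBuilder sb = new StringBuilder();
        decoded.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // 0x80 is the euro sign in windows-1252 (cp1252); sample data only.
        byte[] cp1252Euro = { (byte) 0x80 };
        System.out.println(toCodePoints(cp1252Euro, Charset.forName("windows-1252")));
        // prints "U+20AC"
    }
}
```

Because the output is plain ASCII, it looks the same under any console encoding, which makes it a reasonable first check before changing IDE settings.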

You'd better reconfigure the Eclipse console to use UTF-8 to present those characters. UTF-8 covers every single character the world is aware of. You can do that by setting the Window > Preferences > General > Workspace > Text file encoding property to UTF-8.

Then, given that you're apparently using FileItem from Apache Commons FileUpload, you could obtain the FileItem content as properly encoded Reader in a much simpler way as follows:

byte[] content = fileItem.get();
Charset charset = retrieveCharset(content); // No idea what you're doing there, but kudos that it's returning the right charset.
Reader reader = new InputStreamReader(new ByteArrayInputStream(content), charset);
// ...
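For anyone who wants to try the Reader approach outside the web application, here is a self-contained sketch. It assumes the charset has already been determined (the question's retrieveCharset() is not reproduced here), uses a hand-built byte array in place of fileItem.get(), and reads lines with a plain BufferedReader rather than a CSV library. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class DecodedCsvLines {

    // Wraps the raw bytes in a Reader that decodes them with the given
    // charset, then collects the content line by line.
    static List<String> readLines(byte[] content, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(content), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Sample data: "10;" + 0x80, newline, "20;" + 0x80.
        // 0x80 is the euro sign in windows-1252.
        byte[] content = { '1', '0', ';', (byte) 0x80, '\n', '2', '0', ';', (byte) 0x80 };
        List<String> lines = readLines(content, Charset.forName("windows-1252"));
        System.out.println(lines.size());                        // prints 2
        System.out.println(lines.get(0).equals("10;\u20AC"));    // prints true
    }
}
```

The same Reader could be handed to a CSV parser instead of being read line by line; the decoding step does not change.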

Note that, when you intend to write this CSV afterwards to a character-based output other than System.out.println(), such as a FileWriter, don't forget to explicitly set the charset to UTF-8 as well! You can do that with OutputStreamWriter. Otherwise, the platform default encoding will still be used, which is cp1252 on Windows.
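A minimal sketch of writing with an explicit charset might look like this. The class and method names are invented for the example, and a ByteArrayOutputStream stands in for the real target so the resulting bytes can be inspected; in practice you would pass a file or response output stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8CsvWriter {

    // Writes the text to the given stream as UTF-8, independent of the
    // platform default encoding. Closing the writer also closes the stream.
    static void write(OutputStream out, String csvContent) throws IOException {
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(csvContent);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(out, "\u20AC"); // a single euro sign
        // In UTF-8 the euro sign is encoded as the three bytes E2 82 AC.
        for (byte b : out.toByteArray()) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

Whoever reads the file later then has to decode it as UTF-8 too; the charset used for writing is independent of the cp850 or cp1252 charset the upload was decoded with.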
