Without knowing the file's encoding in advance, the best you can do is guess what that encoding is.

I suggest you read this great article: http://www.joelonsoftware.com/articles/Unicode.html

It's fun to read and I personally found some valuable information/clarifications in there.

But the main takeaway from the article is this:

It does not make sense to have a string without knowing what encoding it uses.


Leaving theory aside, I know that in practice it is sometimes impossible to ask users what encoding the file they just submitted/uploaded uses.

So, again, the best you can do is guess.

I have dealt with this problem a few times in my career, and every time I managed to find a good-enough encoding-guessing algorithm, depending on the nature of the system being developed.

The best thing you can do is grab as many sample files as you can, manually analyze their encodings and see if you can find patterns, such as:

  • all users submit files encoded in UTF-8, except users A and B who use ISO-8859-1
  • if a file contains a certain byte sequence, it's very likely that it's encoding A, otherwise use default encoding B
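
As a rough illustration of the second kind of pattern, here is a minimal Python sketch. It is not a general-purpose detector: the function name `guess_encoding`, the BOM table and the `iso-8859-1` fallback are all assumptions made up for this example, and you should replace them with whatever your own sample files suggest.

```python
import codecs

# Hypothetical BOM table for this sketch. UTF-32 is checked before UTF-16
# because the UTF-32 LE BOM starts with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Guess the encoding of raw file bytes; this is only a heuristic."""
    # 1. A byte order mark is the strongest hint available.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # 2. Text in a legacy 8-bit encoding that contains non-ASCII characters
    #    rarely happens to be valid UTF-8, so a clean strict decode is
    #    good evidence for UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 3. Assumed default; pick the one your own samples point to.
        return fallback
```

You would then decode with something like `text = data.decode(guess_encoding(data))`. A per-user rule such as the first bullet could be layered on top as a simple lookup from user to encoding that is consulted before falling back to this guess.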
answered 2013-01-29T08:22:26.703