php - Why do two words with the same encoding look different in htmlentities?

Question

score 3 · Accepted Answer

Your first question was: How can it be that two identical words with the same encoding (UTF-8) are nevertheless different?

In this case, the two strings do not actually share an encoding. The first variable holds "real" UTF-8. In the second, the Greek characters are not stored as UTF-8 at all: the string is plain ASCII, and each non-ASCII (Greek) character is written as a CER (Character Entity Reference).

A web browser and some overly friendly "WYSIWYG" editors will render these strings identically, but the binary representations of the actual strings (which is what the computer compares) are different. This is why the equality test fails, even though the strings appear to be the same on visual inspection in a browser or editor.
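To make the difference concrete, here is a minimal sketch that compares the raw bytes of both forms. The sample word and the variable names are assumptions chosen for illustration; they are not the strings from the question.

```php
<?php
// Hypothetical sample: the same three Greek letters in two representations.
$utf8 = "\u{3b1}\u{3b2}\u{3b3}";      // real UTF-8, two bytes per letter
$cer  = "&alpha;&beta;&gamma;";       // pure ASCII using character entity references

// The comparison works on bytes, not on what a browser would render.
var_dump($utf8 === $cer);

// Dump the underlying bytes of each string as hex.
echo bin2hex($utf8), "\n";
echo bin2hex($cer), "\n";

// Decoding the entities yields the same bytes as the UTF-8 string.
$decoded = html_entity_decode($cer, ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump($decoded === $utf8);
```

The first hex dump is six bytes long, while the second is the ASCII spelling of the entities themselves, so the two strings can never compare equal until one of them is converted.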

I don't think you can rely on mb_detect_encoding to detect the encoding in such cases. A string that uses CERs for its non-ASCII characters is pure ASCII, and pure ASCII is also valid UTF-8, so there is nothing for the function to tell apart.
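A short sketch of why detection cannot help here, assuming the mbstring extension is loaded. The entity-matching regular expression is a rough heuristic of my own, not a standard check.

```php
<?php
// Hypothetical input: Greek letters written as character entity references.
$cer = "&alpha;&beta;&gamma;";

// Every byte is below 0x80, so the string is valid as ASCII and as UTF-8 alike.
$validAscii = mb_check_encoding($cer, 'ASCII');
$validUtf8  = mb_check_encoding($cer, 'UTF-8');
var_dump($validAscii, $validUtf8);

// Crude heuristic (an assumption): look for entity syntax instead of
// asking for the encoding. Matches named, decimal and hex references.
$looksLikeEntities = preg_match('/&(?:#\d+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i', $cer) === 1;
var_dump($looksLikeEntities);
```

Since both validity checks succeed, the encoding alone says nothing about whether entities are present; that has to be checked, or simply decoded, separately.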

Your second question was: How could I fix this problem?

Before you can compare strings that may be encoded differently, you need to convert them to canonical form (Wikipedia: Canonicalization) so that their binary representations are identical.

Here is how I've solved it: I've implemented a handy function named utf8_normalize that converts just about any common character representation (in my case: CER, NER, ISO-8859-1 and CP-1252) to canonical UTF-8 before comparing strings. Which representations you handle must to some extent be determined by what is "popular" in the type of environment your software will operate in, but as long as you make sure that your strings are in canonical form before comparing them, it will work.
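The original utf8_normalize is not shown, so the following is only a rough sketch of what such a function could look like. The name utf8_normalize_sketch, the choice to treat any invalid UTF-8 as CP-1252, and the use of the HTML5 entity table are all assumptions; it also relies on the mbstring extension.

```php
<?php
// Hypothetical helper, not the author's implementation.
function utf8_normalize_sketch(string $s): string
{
    // Assumption: input that is not valid UTF-8 is a single-byte legacy
    // encoding. CP-1252 agrees with ISO-8859-1 on the printable Latin-1
    // range, so one conversion is used for both.
    if (!mb_check_encoding($s, 'UTF-8')) {
        $s = mb_convert_encoding($s, 'UTF-8', 'Windows-1252');
    }

    // Decode named (&alpha;) and numeric (&#945;, &#x3b1;) references.
    return html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$alpha = "\u{3b1}";
var_dump(utf8_normalize_sketch("&alpha;") === $alpha);
var_dump(utf8_normalize_sketch("&#945;") === $alpha);
var_dump(utf8_normalize_sketch("\xE9") === "\u{e9}"); // a lone Latin-1 byte
```

Note that decoding entities also turns &amp;amp; and &amp;lt; back into literal characters, so a function like this is suitable for comparison only, not for producing output that goes straight back into HTML.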

As noted in the comment below from the OP (phpheini), there also exists the PHP Normalizer class, which may do a better job of normalization than a home-grown function.
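For reference, a small sketch of the Normalizer class, assuming the intl extension is installed. Normalizer handles Unicode normalization forms (for example, composed versus decomposed accents); it does not decode entity references, so those would still need a separate step such as html_entity_decode.

```php
<?php
// Normalizer ships with the intl extension; stop quietly if it is missing.
if (!class_exists('Normalizer')) {
    fwrite(STDERR, "The intl extension is required for this example.\n");
    exit(0);
}

// Hypothetical sample: "é" as one code point vs. "e" plus a combining accent.
$composed   = "\u{e9}";
$decomposed = "e\u{301}";

// The two render identically but differ byte for byte.
var_dump($composed === $decomposed);

// Normalizing both to form C gives them the same binary representation.
$a = Normalizer::normalize($composed, Normalizer::FORM_C);
$b = Normalizer::normalize($decomposed, Normalizer::FORM_C);
var_dump($a === $b);
```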
