What you have here are different Unicode normalization forms. Some characters can be written as a base character combined with a diacritic or other mark, and the result of that combination may also exist as a single, precomposed character. E.g.:

ਸ਼ GURMUKHI LETTER SHA (U+0A36)
ਸ GURMUKHI LETTER SA  (U+0A38)
 ਼ GURMUKHI SIGN NUKTA (U+0A3C)
ਸ +  ਼ (U+0A38 + U+0A3C) equivalent to ਸ਼ U+0A36

(I'm not actually sure if the GURMUKHI SIGN NUKTA is the correct combining dot here, since I don't know Gurmukhi, but you get the idea.)

For storage and comparison, you should decide on one form or the other, since it's often impossible to predict which form the input will arrive in. You do this using the Unicode normalization process, which converts between the two forms. In PHP you do this with the Normalizer class.
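To make the effect of normalization concrete, here is a minimal sketch in Python (not PHP) using the standard `unicodedata` module. The helper name `normalize_text` is made up for this illustration; in PHP the corresponding call would be `Normalizer::normalize()` with one of its form constants such as `Normalizer::FORM_C`.

```python
import unicodedata

# Two spellings of the same letter, GURMUKHI LETTER SHA.
precomposed = "\u0A36"      # single code point
combined = "\u0A38\u0A3C"   # GURMUKHI LETTER SA followed by GURMUKHI SIGN NUKTA


def normalize_text(text: str, form: str = "NFC") -> str:
    """Hypothetical helper: bring text into one agreed normalization form."""
    return unicodedata.normalize(form, text)


# Without normalization the two spellings are different code point sequences.
raw_equal = precomposed == combined

# After normalizing both sides to the same form they compare equal.
normalized_equal = normalize_text(precomposed) == normalize_text(combined)

print(raw_equal, normalized_equal)
```

Which form you pick matters less than applying the same one to everything you store and everything you compare against it.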

i need to search with md5 because when i do it in a normalized form, it considers the letter with and without the dot same..

Your second problem is that you're inventing an overcomplicated solution to a problem that collations already solve. The database uses collation rules for "fuzzy" matching, i.e. to treat "matinee" and "matinée" the same, or in your case "ਸ਼" and "ਸ". You set the default collation on the column, but you can also override it at query time:

SELECT ... WHERE foo = 'bar' COLLATE utf8_bin;

If you want absolute matches, use the utf8_bin collation or another equivalent _bin (binary) collation for your chosen encoding.
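As a rough application-side analogy for what a binary collation gives you, the Python sketch below normalizes both strings and then compares their UTF-8 bytes. The helper name `binary_equal` is hypothetical, and this only illustrates the idea; it is not how the database implements `utf8_bin`.

```python
import unicodedata


def binary_equal(a: str, b: str) -> bool:
    """Hypothetical helper: normalize to NFC, then compare the UTF-8 bytes."""
    a_bytes = unicodedata.normalize("NFC", a).encode("utf-8")
    b_bytes = unicodedata.normalize("NFC", b).encode("utf-8")
    return a_bytes == b_bytes


sha = "\u0A36"             # GURMUKHI LETTER SHA (with the dot)
sa = "\u0A38"              # GURMUKHI LETTER SA (without the dot)
sa_nukta = "\u0A38\u0A3C"  # SA + SIGN NUKTA, the combined spelling of SHA

# Same letter in two spellings: equal once normalized.
print(binary_equal(sha, sa_nukta))

# Different letters: still distinct, with no MD5 workaround needed.
print(binary_equal(sha, sa))
```

Under these assumptions, normalization only merges different spellings of the same letter. The letters with and without the dot stay distinct as long as the comparison itself is exact.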

answered 2013-07-21T09:28:46.787