I have a name, "Göran", and I want to convert it to "Goran", which means I need to unaccent the word. But what I have tried does not seem to unaccent every word.

This is the code I use to unaccent:

private function Unaccent($string)
{
    return preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
}

Where it does not work (incorrect matches), meaning it does not produce the expected result shown on the right-hand side:

JÃŒrgen => Juergen
InÚs => Ines

Where it works (correct matches):

Göran => Goran
Jørgen Ole => Jorgen
Jérôme => Jerome

What could be the reason? How can I fix it? Do you have a better approach that handles all cases?

2 Answers

This may be what you are looking for:

How to convert special characters to normal characters?

But use 'utf-8' instead.

$text = iconv('utf-8', 'ascii//TRANSLIT', $text);

http://us2.php.net/manual/en/function.iconv.php
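
The one-liner above can be wrapped as a minimal sketch. The function name `unaccentWithIconv` and the locale names are assumptions for illustration, not part of the original answer. What `//TRANSLIT` produces depends on the iconv implementation and on the active locale (glibc consults `LC_CTYPE`), so the exact replacement characters are not guaranteed.

```php
<?php
// Hypothetical wrapper around the iconv() call suggested above.
function unaccentWithIconv(string $text): string
{
    // Assumption: one of these UTF-8 locales is installed. glibc's iconv
    // uses LC_CTYPE to pick its transliteration rules; if neither locale
    // exists, setlocale() returns false and the current locale stays active.
    setlocale(LC_CTYPE, 'en_US.UTF-8', 'C.UTF-8');

    // iconv() returns false (and raises a notice) on input it cannot convert.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    // Fall back to the untouched input rather than returning false.
    return $ascii === false ? $text : $ascii;
}

// "G\xC3\xB6ran" is "Göran" written as explicit UTF-8 bytes.
echo unaccentWithIconv("G\xC3\xB6ran"), "\n";
```

This only helps when the input really is well-formed UTF-8. Names that are already garbled stay garbled.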

Answered 2012-10-11T06:21:09.147

Short answer

You have two problems:

First: these names are not accented. They are badly formatted.

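It seems you have UTF-8 data but are processing it as ISO-8859-1. This happens, for example, if you tell your editor to use ISO-8859-1 and then copy-paste the text into a browser text area that uses UTF-8. You then save the badly formatted names in the database. I have seen many problems like this caused by copy-paste.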

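Once the names are correctly formatted, you can tackle the second problem: unaccenting. There is already a question that deals with this: How to convert special characters to normal characters?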

Long answer (focusing only on the badly formatted accented letters)

Why do you have GÃ¶ran when you want Göran?

Let's begin with Unicode: The letter ö is in Unicode LATIN SMALL LETTER O WITH DIAERESIS. Its Unicode code point is F6 hexadecimal or, respectively, 246 decimal. See this link to the Unicode database.

In ISO-8859-1 code points from 0 to 255 are left as is. The small letter o with diaeresis is saved as only one byte: 246.

UTF-8 and ISO-8859-1 treat the code points 0 to 127 (aka ASCII) the same. They are left as is and saved as only one byte. They differ in the treatment of the code points 128 to 255. UTF-8 can encode the whole Unicode code point set, while ISO-8859-1 can only cope with the first 256 code points.

So, what does UTF-8 do with code points from 128 upwards? There is a staggered set of encoding schemes that use more bytes as the code points get bigger. For code points up to 2047, two bytes suffice. They are encoded like this: (see this bit schema)

xxx xxxx xxxx => 110xxxxx 10xxxxxx

Let's encode small letter o with diaeresis in UTF-8. The two-byte form carries 11 payload bits: 000 1111 0110. The first five go into the lead byte and the last six into the continuation byte, giving 11000011 10110110. This is nice.
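
The bit schema can be checked with a short PHP sketch. The function name `utf8EncodeTwoByte` is made up for illustration, and it only covers the two-byte range (code points 128 to 2047).

```php
<?php
// Hypothetical helper: encode one code point in the two-byte UTF-8 range.
function utf8EncodeTwoByte(int $codePoint): string
{
    if ($codePoint < 0x80 || $codePoint > 0x7FF) {
        throw new InvalidArgumentException('Only code points 128..2047 use two bytes.');
    }

    // Lead byte: 110xxxxx, carrying the upper five payload bits.
    $lead = 0xC0 | ($codePoint >> 6);

    // Continuation byte: 10xxxxxx, carrying the lower six payload bits.
    $continuation = 0x80 | ($codePoint & 0x3F);

    return chr($lead) . chr($continuation);
}

// U+00F6, LATIN SMALL LETTER O WITH DIAERESIS
echo bin2hex(utf8EncodeTwoByte(0xF6)), "\n"; // prints "c3b6"
```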

However, these two bytes can be misunderstood as two valid (!) ISO-8859-1 bytes. What are 11000011 (C3 hex) and 10110110 (B6 hex)? Let's consult an ISO-8859-1 table. C3 is capital A with tilde (Ã), and B6 is the paragraph sign (¶). Both characters are valid, and no software can detect this misunderstanding just by looking at the bits.
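
The misreading can be reproduced in PHP as a small sketch. The helper name `readAsLatin1` is hypothetical. It treats every input byte as one ISO-8859-1 character and re-encodes the result as UTF-8 so that it can be displayed.

```php
<?php
// Hypothetical helper: interpret raw bytes as ISO-8859-1 text and
// return that text encoded as UTF-8.
function readAsLatin1(string $bytes): string
{
    return iconv('ISO-8859-1', 'UTF-8', $bytes);
}

$oUmlaut = "\xC3\xB6";            // "ö" as UTF-8 bytes
$misread = readAsLatin1($oUmlaut);

// Two characters come out: U+00C3 (Ã) and U+00B6 (¶).
echo bin2hex($misread), "\n";     // prints "c383c2b6"
```

No error is raised at any point, which is why the mistake goes unnoticed.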

It definitely takes a person who knows what names look like. GÃ¶ran is just not a name. There is an uppercase letter smack in the middle of the name, and the paragraph sign is not a letter at all. Sadly, this misunderstanding does not stop here. Because all characters are valid, they can be copy-pasted and re-rendered. In this process the misunderstanding can be repeated. Let's do this with Göran. We already misunderstood it once and got a badly formatted GÃ¶ran. Capital A with tilde and the paragraph sign each encode to two bytes in UTF-8 (!), and those four bytes are then read as four characters of gobbledygook, something like GÃƒÂ¶ran.
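
The repeated misunderstanding, and one way to reverse it, can be sketched as follows. Both helper names are hypothetical, and the repair assumes you already know how many times the text was misread, since the bytes alone do not reveal it.

```php
<?php
// Hypothetical helper: misread UTF-8 bytes as ISO-8859-1 and store the
// result as UTF-8 again (one round of the mistake described above).
function misreadAsLatin1(string $utf8): string
{
    return iconv('ISO-8859-1', 'UTF-8', $utf8);
}

// Hypothetical helper: undo up to $rounds rounds of that mistake.
function undoMisreading(string $text, int $rounds): string
{
    for ($i = 0; $i < $rounds; $i++) {
        $fixed = @iconv('UTF-8', 'ISO-8859-1', $text);
        if ($fixed === false) {
            break; // not reversible this way, keep what we have
        }
        $text = $fixed;
    }
    return $text;
}

$good  = "G\xC3\xB6ran";           // "Göran" as UTF-8 bytes
$once  = misreadAsLatin1($good);   // ö has become two characters
$twice = misreadAsLatin1($once);   // and now four

echo strlen($good), ' ', strlen($once), ' ', strlen($twice), "\n"; // prints "6 8 12"
var_dump(undoMisreading($twice, 2) === $good);                     // bool(true)
```

Applying `undoMisreading` to text that was never garbled would damage it, so in mixed data each value has to be checked individually.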

Poor Jürgen! The umlaut ü got mistreated twice and we have JÃŒrgen.

We have a terrible mess with the umlauts here. It's even possible that the OP got this data as is from his customer. This happened to me once: I got mixed data: well formatted, badly formatted once, twice and thrice in the same file. It's extremely frustrating.

Answered 2012-10-11T06:22:20.273