
I have some text files that contain non-ASCII characters. I want to remove those characters but keep the formatting characters.

I tried:

$description = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $description);

However, this seems to strip newlines and other formatting, and it also has problems with some of the Hebrew, turning this:

משפטים נוספים מהמומחה.נסוותהנו!חג חנוכה שמח **************************************** חדש - האפליקציה היחידה שאומרת לך מה מצב הסוללה שלך ** New in version 1.1 - expert talks!!!*

into this:

1.4 :", ..."" ..."" 50 ..." ., . !****************************************** - ** New in version 1.1 - expert talks!!!*


2 Answers


That's not replacing non-ASCII characters. ASCII characters are in the range 0-127 (\x00-\x7F), and your pattern also removes \x00-\x1F, which are ASCII control characters such as tab and newline. So basically what you're trying to do is write a regex to convert one character set to another, which is a lot harder than just stripping out some of the characters.
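To see why the newlines disappear, here is a minimal check of the pattern from the question against a plain ASCII string. The sample text and variable names are made up for illustration.

```php
<?php
// The pattern from the question, applied byte by byte (no /u flag).
$pattern = '/[\x00-\x1F\x80-\xFF]/';

// Illustrative sample: plain ASCII with a newline and a tab.
$sample = "line one\n\tline two";
$stripped = preg_replace($pattern, '', $sample);

// "\n" (0x0A) and "\t" (0x09) both fall inside \x00-\x1F, so they are removed.
var_dump($stripped === "line oneline two"); // bool(true)
```

So the formatting is lost before any non-ASCII character is even involved.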

As for what you want to do, I think you want the iconv function. You'll need to know the input encoding, but once you do, you can tell it to ignore non-representable characters:

$text = iconv('UTF-8', 'ASCII//IGNORE', $text);

You could also use ISO-8859-1, or any other target character set you want.
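A slightly more defensive sketch of the same idea follows. It assumes the input is UTF-8, and `to_ascii` is a hypothetical helper name, not a built-in. How `//IGNORE` behaves depends on the system's iconv implementation: some builds raise a notice and return false, and others may substitute a placeholder character instead of dropping the character. The sketch therefore falls back to a byte-wise regex when iconv() fails.

```php
<?php
// Hypothetical helper: tries iconv() first, then falls back to a byte-wise regex.
function to_ascii(string $text): string
{
    if (function_exists('iconv')) {
        // @ hides the E_NOTICE that some builds raise even with //IGNORE.
        $converted = @iconv('UTF-8', 'ASCII//IGNORE', $text);
        if ($converted !== false) {
            return $converted;
        }
    }
    // Every byte of a multi-byte UTF-8 sequence is >= 0x80, so removing those
    // bytes drops whole characters and leaves ASCII (including "\n") intact.
    return (string) preg_replace('/[\x80-\xFF]/', '', $text);
}

// Illustrative input: "café", a newline, then "menu".
$out = to_ascii("caf\u{E9}\nmenu");
var_dump($out);
```

Either path keeps newlines and tabs, since those are ordinary ASCII characters.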

answered 2010-08-23T16:54:05.403

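What you are doing won't work because you are treating a UTF-8 string as if it were a single-byte encoding. You are actually removing parts of characters. You must add the u flag to the regex to activate UTF-8 mode.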

Since you only want to keep control characters and the other ASCII-range characters, you have to replace every other character with ''. So:

$description = preg_replace('/[^\x{0000}-\x{007F}]/u', '', $description);

For your input, this gives:

. !********************* - * New in version 1.1 - expert talks!!!*
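As a rough, self-contained check of that pattern: `strip_non_ascii` is a hypothetical wrapper name and the sample string is invented. One thing to watch for is that with the u flag, preg_replace() returns null when the subject is not valid UTF-8, so the result is worth checking.

```php
<?php
// Hypothetical wrapper around the pattern shown above.
function strip_non_ascii(string $text): ?string
{
    // With /u, PCRE works on code points, so each multi-byte character
    // is removed as a whole. Returns null if $text is not valid UTF-8.
    return preg_replace('/[^\x{0000}-\x{007F}]/u', '', $text);
}

// Illustrative input: a Hebrew word, then a newline, a tab, and ASCII text.
$input = "\u{5E9}\u{5DC}\u{5D5}\u{5DD}\n\tnew in 1.1 !!!";

var_dump(strip_non_ascii($input) === "\n\tnew in 1.1 !!!"); // bool(true)
var_dump(strip_non_ascii("\xFF") === null);                 // bool(true)
```

The newline and tab survive because they are inside the \x{0000}-\x{007F} range that the negated class keeps.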
answered 2010-08-23T17:10:38.417