php - How do I remove bogus non-ASCII characters but keep spaces and newlines?

Question

I have some text files that contain non-ASCII characters. I want to remove them but keep the formatting characters.

I tried

$description = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $description);

However, this seems to strip out newlines and other formatting, and it also has problems with some Hebrew text, turning this

משפטים נוספים מהמומחה。נסוותהנו！חג חנוכה שמח **************************************** חדש - האפליקציה היחידה שאומרת לך מה מצב הסוללה שלך ** New in version 1.1 - Expert talks!!!*

into this

1.4 :", ..."" ..."" 50 ..." 。, . ！****************************************** - ** New in version 1.1 - Expert talks!!!*

score 3 · Accepted Answer

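That's not replacing non-ASCII characters... ASCII characters are inside the range 0-127, and your pattern also strips \x00-\x1F, which are ASCII control characters (tab, line feed, and carriage return among them), so that's where your newlines go. So basically what you're trying to do is write a regex to convert one character set to another, which is a lot harder than just replacing out some of the characters...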
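To see why the formatting vanishes, here is a small sketch that runs the pattern from the question over a string containing a tab, a carriage return, and a line feed. The sample string and the narrower `$controlsOnly` pattern are illustrative assumptions, not something from the question.

```php
<?php
// Pattern from the question: the \x00-\x1F range covers tab (\x09),
// line feed (\x0A) and carriage return (\x0D), so they are deleted too.
$original = '/[\x00-\x1F\x80-\xFF]/';

// Illustrative alternative (an assumption, not the asker's code):
// remove control bytes except tab, LF and CR, and leave bytes >= 0x80 alone.
$controlsOnly = '/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/';

// Sample with CR, LF, tab and a stray BEL (\x07) control byte.
$sample = "line one\r\n\tline two\x07";

$withOriginal = preg_replace($original, '', $sample);
$withControlsOnly = preg_replace($controlsOnly, '', $sample);

var_dump($withOriginal === "line oneline two");               // line breaks and tab are gone
var_dump($withControlsOnly === "line one\r\n\tline two");     // only the BEL byte is gone
```

Because the second pattern never touches bytes above 0x7F, UTF-8 text such as Hebrew would pass through it unchanged. Whether that is desirable depends on which characters count as bogus in the input.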

As for what you want to do, I think you want the iconv function... You'll need to know the input encoding, but once you do you can then tell it to ignore non-representable characters:

$text = iconv('UTF-8', 'ASCII//IGNORE', $text);

You could also use ISO-8859-1, or any other target character set you want.
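A minimal sketch of that call wrapped in a helper might look like the following. The name `to_ascii` and the sample string are hypothetical. The sketch also assumes that `//IGNORE` handling can differ between iconv implementations, so it checks for a `false` return instead of trusting the result.

```php
<?php
/**
 * Hypothetical helper: convert $text to ASCII, dropping what cannot be represented.
 *
 * @return string|false The converted string, or false if iconv() failed.
 */
function to_ascii(string $text, string $from = 'UTF-8')
{
    // iconv() returns false on failure; @ silences the notice it may emit.
    return @iconv($from, 'ASCII//IGNORE', $text);
}

// "café" followed by a newline and a second line.
$sample = "caf\u{00E9}\nsecond line";

$result = to_ascii($sample);
if ($result === false) {
    echo "iconv could not convert the input on this platform\n";
} else {
    // Line breaks are plain ASCII, so a successful conversion keeps them.
    echo $result, "\n";
}
```

Since the input encoding has to be known up front, passing the wrong `$from` value would give misleading output rather than an error in many cases.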

score 1 · Accepted Answer

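What you're doing won't work because you're treating a UTF-8 string as a single-byte encoding. You're actually removing parts of characters. You have to add the u flag to the regex to activate UTF-8 mode.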

Since you only want to keep control characters and the other ASCII-range characters, you have to replace every other character with ''. So:

$description = preg_replace('/[^\x{0000}-\x{007F}]/u', '', $description);

For your input, this gives:

. ！********************* - * New in version 1.1 - Expert talks!!!*
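As a rough sketch, the u-flag pattern can be wrapped in a function that also handles the case where the input is not valid UTF-8, since `preg_replace` returns `null` for such subjects in UTF-8 mode. The function name `strip_non_ascii` and the sample string are made up for illustration.

```php
<?php
/**
 * Hypothetical helper: drop every code point outside U+0000..U+007F.
 * Returns null when $text is not valid UTF-8.
 */
function strip_non_ascii(string $text): ?string
{
    // The u flag makes PCRE match whole UTF-8 characters instead of bytes.
    $result = preg_replace('/[^\x{0000}-\x{007F}]/u', '', $text);

    // With the u flag, an invalid UTF-8 subject makes preg_replace() return null.
    return $result;
}

// Hebrew word (U+05D7 U+05D3 U+05E9) followed by ASCII text and a newline.
$sample = "\u{05D7}\u{05D3}\u{05E9} - new\nsecond line";

var_dump(strip_non_ascii($sample) === " - new\nsecond line"); // newline is kept
var_dump(strip_non_ascii("\xFF") === null);                   // lone 0xFF is not valid UTF-8
```

Control characters, including line breaks, fall inside the kept range here, so this removes only the non-ASCII characters and leaves the formatting in place.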
