python - Convert unknown characters to ASCII

Question

In a text file I'm processing, I have characters like ��. Not sure what they are.

I'm wondering how to remove/convert these characters.

I have tried to convert the text to ASCII using .encode('ascii', 'ignore'). Python told me the character is not within range(128).

I have also tried unicodedata, unicodedata.normalize('NFKD', text).encode('ascii','ignore'), with the same error

Anyone help?

Thanks!

score 7 · Accepted Answer

You can always take a Unicode string and use the code you showed:

my_ascii = my_uni_string.encode('ascii', 'ignore')

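If this gives you an error, then you didn't actually have a Unicode string to begin with. In that case, you have a byte string. You need to know what encoding it uses, and then you can turn it into a Unicode string: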

my_uni_string = my_byte_string.decode('utf8')

(assuming your encoding is UTF-8).
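
To make the two steps concrete, here is a minimal Python 3 sketch of the same idea: decode the bytes first, then encode to ASCII. The sample bytes and variable names are made up for illustration, and UTF-8 is only an assumption about what the file contains.

```python
# Hypothetical sample standing in for bytes read from the file (e.g. open(path, "rb").read()).
raw = "na\u00efve caf\u00e9".encode("utf-8")

# Step 1: byte string -> Unicode string. This only works if the guessed encoding is right.
text = raw.decode("utf-8")

# Step 2: Unicode string -> ASCII bytes, silently dropping anything outside ASCII.
ascii_bytes = text.encode("ascii", "ignore")

print(ascii_bytes)  # b'nave caf'
```

In Python 2, calling .encode('ascii', 'ignore') directly on a byte string first decodes it implicitly as ASCII, which is where the "not in range(128)" error comes from. In Python 3, bytes objects have no .encode method at all, so the mix-up shows up as an AttributeError instead.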

This split between byte strings and Unicode strings can be confusing. My presentation, Pragmatic Unicode, or, How Do I Stop The Pain?, can help you keep it all straight.

score 1 · Accepted Answer

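It's not perfect (especially for shorter strings), but the chardet library would be of use here.

To have chardet figure out the encoding and then decode to Unicode, you could: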

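import chardet

# some_string must be a byte string; chardet guesses its encoding
encoding = chardet.detect(some_string)['encoding']
# .decode() works on byte strings in both Python 2 and 3 (unicode() is Python 2 only)
unicode_string = some_string.decode(encoding)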

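Of course, you still won't be able to encode the characters as ASCII if they fall outside the ASCII range, unless you are willing to drop or replace them.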
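
As a rough illustration of what "outside the ASCII range" means in practice, the Python 3 sketch below compares the standard error handlers, plus the NFKD approach from the question. The sample text is invented; which option is appropriate depends on what the output is used for.

```python
import unicodedata

text = "caf\u00e9 \u2603"  # hypothetical sample: "café" followed by a snowman character

# Strict encoding (the default) raises on the first non-ASCII character.
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

dropped = text.encode("ascii", "ignore")    # non-ASCII characters are removed
replaced = text.encode("ascii", "replace")  # each non-ASCII character becomes '?'

# NFKD splits "é" into "e" plus a combining accent, so 'ignore' drops only the accent.
# Characters with no ASCII base, like the snowman, are still removed.
folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore")

print(dropped, replaced, folded)  # b'caf ' b'caf? ?' b'cafe '
```

All three variants lose information, so it is usually worth keeping the Unicode text around and converting to ASCII only at the point where ASCII is strictly required.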
