python - Identify garbage Unicode strings using Python

Question

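My script reads data from a CSV file. The file can contain multiple strings of English or non-English words.

Sometimes the file contains garbage strings. I want to identify those strings, skip them, and process the others.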

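import codecs
import csv

doc = codecs.open(input_text_file, "rb", "utf_8_sig")
fob = csv.DictReader(doc)
for row, entry in enumerate(fob):
    # row is the row index; entry is the dict for that CSV row
    if is_valid_unicode_str(entry['Name']):
        process_further(entry)  # placeholder for the real processing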

def is_valid_unicode_str(value):
    try:
        ...  # placeholder: the check that should raise on garbage strings
        return True
    except UnicodeEncodeError:
        return False

CSV input:

"Name"
"Ã¨Â¢â€¹Ã¨Â¢âdcx€¹Ã¤Â¸Å½Ã¦Å“â€¹Ã¥Ââ€¹Ã¤Â»Â¬Ã§â€ÂµÃ¥ÂÂÃ¥â€¢â€"
"元大寶來證券"
"John Dove"

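I want to write the function is_valid_unicode_str(), which should identify garbage strings so that only valid strings are processed.

I tried using decode, but it did not fail when decoding the garbage string: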

value.decode('utf8')

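The expected output is that only the Chinese and English strings are processed.

Could you guide me on how to implement a function that filters out everything except valid Unicode strings?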

score 8 · Accepted Answer

(ftfy developer here)

I've figured out that the text is likely to be '袋袋与朋友们电子商'. I had to guess at the characters 友, 子, and 商, because some unprintable characters are missing from the string in your question. When guessing, I picked the most common character from the small number of possibilities. And I don't know where the "dcx" goes or why it's there.

Google Translate is not very helpful here but it seems to mean something about e-commerce.

So here's everything that happened to your text:

It was encoded as UTF-8 and decoded incorrectly as sloppy-windows-1252, twice
It had the letters "dcx" inserted into the middle of a UTF-8 sequence
Characters that don't exist in windows-1252 -- with byte values 81, 8d, 8f, 90, and 9d -- were removed
A non-breaking space (byte value a0) was removed from the end
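
The first step in the list above can be reproduced with the standard library alone. This is only a sketch: `sloppy_cp1252_decode` and `mangle` are hypothetical helper names, and the "sloppy" behaviour is an assumption that imitates the description here (the five bytes with no windows-1252 character are kept as control characters with the same code point) rather than ftfy's own implementation.

```python
# Bytes that have no character assigned in windows-1252.
CP1252_GAPS = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def sloppy_cp1252_decode(data):
    """Decode bytes as windows-1252, but map the five unassigned bytes
    to the control characters with the same code point (assumed behaviour)."""
    return "".join(
        chr(b) if b in CP1252_GAPS else bytes([b]).decode("cp1252")
        for b in data
    )


def mangle(text, rounds=1):
    """Encode as UTF-8 and wrongly decode as windows-1252, `rounds` times."""
    for _ in range(rounds):
        text = sloppy_cp1252_decode(text.encode("utf-8"))
    return text


print(mangle("袋"))            # è¢‹
print(mangle("袋", rounds=2))  # Ã¨Â¢â€¹
```

The two-round result matches the start of the garbage string in the question. For 友, the UTF-8 bytes include 8f, so the mangled text contains an unprintable control character, which is the kind of character that went missing from the posted string.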

If just the first problem had happened, ftfy.fix_text_encoding would be able to fix it. It's possible that the remaining problems just happened while you were trying to get the string onto Stack Overflow.
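
For the case where only the first problem happened, the repair amounts to running the mistake backwards. The sketch below uses only the standard library; `sloppy_cp1252_encode` and `undo_mojibake` are hypothetical names, and this is a simplified illustration of the idea, not what ftfy does internally.

```python
# Code points that stand in for the five bytes windows-1252 leaves unassigned.
CP1252_GAPS = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def sloppy_cp1252_encode(text):
    """Turn each character back into the single byte it was decoded from.

    Raises UnicodeEncodeError if a character could not have come from
    windows-1252 at all (for example a real Chinese character).
    """
    out = bytearray()
    for ch in text:
        if ord(ch) in CP1252_GAPS:
            out.append(ord(ch))          # control character kept as its byte
        else:
            out += ch.encode("cp1252")   # ordinary windows-1252 character
    return bytes(out)


def undo_mojibake(text, rounds=1):
    """Reverse `rounds` passes of 'UTF-8 bytes decoded as windows-1252'."""
    for _ in range(rounds):
        text = sloppy_cp1252_encode(text).decode("utf-8")
    return text


print(undo_mojibake("Ã¨Â¢â€¹", rounds=2))  # 袋
```

This strict reversal raises an exception as soon as the byte sequence is no longer valid UTF-8, so it is not expected to work on a string that has had characters removed or inserted, as the one in the question has.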

So here's my recommendation:

Find out who keeps decoding the data incorrectly as sloppy-windows-1252, and get them to decode it as UTF-8 instead.
If you end up with a string like this again, try ftfy.fix_text_encoding on it.
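
If the goal is only to skip suspicious rows, as in the question's is_valid_unicode_str(), a rough heuristic is to look for character pairs that are typical of UTF-8 read as windows-1252: a character from the UTF-8 lead-byte range followed by one from the continuation-byte range. This is a sketch under assumptions, not ftfy's detection logic; the threshold `max_suspicious_pairs` is a hypothetical parameter, and the check can misfire on legitimate text (for example "à" followed by a non-breaking space) or miss very short garbage.

```python
import re

CP1252_GAPS = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def _chars_for_bytes(start, stop):
    """Characters that windows-1252 (with the gaps filled) gives for a byte range."""
    return "".join(
        chr(b) if b in CP1252_GAPS else bytes([b]).decode("cp1252")
        for b in range(start, stop)
    )


# Bytes C2-F4 start a multi-byte UTF-8 sequence; bytes 80-BF continue one.
_LEAD = _chars_for_bytes(0xC2, 0xF5)
_CONTINUATION = _chars_for_bytes(0x80, 0xC0)
_MOJIBAKE_PAIR = re.compile(
    "[" + re.escape(_LEAD) + "][" + re.escape(_CONTINUATION) + "]"
)


def is_valid_unicode_str(value, max_suspicious_pairs=1):
    """Return False when the text contains several mojibake-looking pairs."""
    return len(_MOJIBAKE_PAIR.findall(value)) <= max_suspicious_pairs


for name in ["Ã¨Â¢â€¹Ã¨Â¢", "元大寶來證券", "John Dove"]:
    print(name, is_valid_unicode_str(name))
```

With this check the first sample is rejected and the other two are accepted. Fixing the decoding at the source remains the better option; a filter like this only hides the symptom.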

score 3 · Accepted Answer


Answered 2015-03-16T09:51:46.997
