3

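My script reads data from a CSV file. The file can contain multiple strings of English or non-English words.

Sometimes the text file contains garbage strings. I want to identify those strings, skip them, and process the rest.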

import codecs
import csv

def is_valid_unicode_str(value):
    try:
        # TODO: a check that raises UnicodeEncodeError for garbage strings
        return True
    except UnicodeEncodeError:
        return False

doc = codecs.open(input_text_file, "rb", "utf_8_sig")
fob = csv.DictReader(doc)
for row in fob:
    if is_valid_unicode_str(row['Name']):
        process_further(row)  # placeholder for the actual processing

CSV input:

"Name"
"袋è¢âdcx€¹Ã¤Â¸Å½Ã¦Å“‹å‹们çâ€ÂµÃ¥Â­Âå•â€"
"元大寶來證券"
"John Dove"

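I want to write a function is_valid_unicode_str() that identifies garbage strings, so that only valid strings are processed.

I tried using decode, but it does not fail when decoding the garbage strings: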

value.decode('utf8')

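The expected output is that only the Chinese and English strings are processed.

Could you guide me on how to implement a function that filters for valid Unicode strings?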

4

2 Answers

8

(ftfy developer here)

I've figured out that the text is likely to be '袋袋与朋友们电子商'. I had to guess at the characters 友, 子, and 商, because some unprintable characters are missing from the string in your question. When guessing, I picked the most common character from the small number of possibilities. I don't know where the "dcx" goes or why it's there.

Google Translate is not very helpful here, but the text seems to mean something about e-commerce.

So here's everything that happened to your text:

  1. It was encoded as UTF-8 and decoded incorrectly as sloppy-windows-1252, twice
  2. It had the letters "dcx" inserted into the middle of a UTF-8 sequence
  3. Characters that don't exist in windows-1252 -- with byte values 0x81, 0x8d, 0x8f, 0x90, and 0x9d -- were removed
  4. A non-breaking space (byte value 0xa0) was removed from the end
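
To make steps 1 and 3 concrete, here is a rough stdlib-only sketch. It is not ftfy's code: sloppy_decode and sloppy_encode are names made up for this example, and they only approximate a "sloppy" windows-1252 codec by falling back to Latin-1 for the five bytes that windows-1252 leaves undefined. It uses the guessed text from above, so treat it as an illustration under that assumption.

```python
# Sketch only: emulates a "sloppy" windows-1252 codec with the stdlib.
# The helper names are hypothetical; they are not ftfy functions.

def sloppy_decode(data: bytes) -> str:
    """Decode as windows-1252; bytes it leaves undefined fall back to Latin-1."""
    chars = []
    for b in data:
        byte = bytes([b])
        try:
            chars.append(byte.decode("cp1252"))
        except UnicodeDecodeError:
            # 0x81, 0x8d, 0x8f, 0x90, 0x9d have no windows-1252 meaning
            chars.append(byte.decode("latin-1"))
    return "".join(chars)


def sloppy_encode(text: str) -> bytes:
    """Inverse of sloppy_decode for any string that sloppy_decode produced."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            out += ch.encode("latin-1")
    return bytes(out)


original = "袋袋与朋友们电子商"  # the guessed text, an assumption

# Step 1: UTF-8 bytes misread as windows-1252, twice.
once = sloppy_decode(original.encode("utf-8"))
twice = sloppy_decode(once.encode("utf-8"))

# Step 3: drop the characters that windows-1252 cannot represent.
undefined = {"\x81", "\x8d", "\x8f", "\x90", "\x9d"}
damaged = "".join(ch for ch in twice if ch not in undefined)

# Undoing step 1 works on the intact string...
restored = sloppy_encode(sloppy_encode(twice).decode("utf-8")).decode("utf-8")

# ...but fails once characters have been removed.
try:
    sloppy_encode(sloppy_encode(damaged).decode("utf-8")).decode("utf-8")
    damaged_is_recoverable = True
except UnicodeError:
    damaged_is_recoverable = False
```

The round trip only succeeds while every byte survives. Once the undefined characters are dropped, the inner UTF-8 sequences are broken, which is why some characters had to be guessed.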

If just the first problem had happened, ftfy.fix_text_encoding would be able to fix it. It's possible that the remaining problems just happened while you were trying to get the string onto Stack Overflow.

So here's my recommendation:

  • Find out who keeps decoding the data incorrectly as sloppy-windows-1252, and get them to decode it as UTF-8 instead.
  • If you end up with a string like this again, try ftfy.fix_text_encoding on it.
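
If you want a rough is_valid_unicode_str-style filter without installing anything, the sketch below shows the basic idea: a string is suspicious if it can be turned back into bytes via windows-1252 and those bytes happen to be valid UTF-8. This is an assumption-laden heuristic, not how ftfy decides, and looks_like_mojibake and encode_sloppy_cp1252 are hypothetical names. It only catches intact mojibake; a string that was damaged further, like the sample in the question, would slip through.

```python
# Sketch only: a crude mojibake detector built on the stdlib (Python 3.7+).
# Function names are hypothetical and this is not ftfy's heuristic.

def encode_sloppy_cp1252(text: str) -> bytes:
    """Encode as windows-1252, using Latin-1 for characters it lacks.

    Raises UnicodeEncodeError for characters outside both encodings,
    for example CJK characters.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            out += ch.encode("latin-1")
    return bytes(out)


def looks_like_mojibake(text: str) -> bool:
    """Return True if text is probably UTF-8 misread as windows-1252."""
    if text.isascii():
        return False  # plain ASCII reads the same in both encodings
    try:
        encode_sloppy_cp1252(text).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return False  # not reversible, so probably genuine text
    return True


names = ["元大寶來證券", "John Dove", "袋袋".encode("utf-8").decode("cp1252")]
valid_names = [name for name in names if not looks_like_mojibake(name)]
```

Here valid_names keeps the genuine Chinese and English names and drops the misdecoded one. Ordinary accented text such as "café" is not flagged, because its windows-1252 bytes are not valid UTF-8.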
answered 2015-03-16T20:53:35.987
3
answered 2015-03-16T09:51:46.997