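I am trying to use the ftfy Python package to fix Unicode errors in a CSV file, but it fails on lines that contain \xa0.
I don't understand why this happens or how to fix it properly.
Here is an example that triggers the problem: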
>>> import ftfy
>>> txt = 'LinkÃ¶pings Universitet,\xa0LiU'
>>> ftfy.explain_unicode(txt)
U+004C L [Lu] LATIN CAPITAL LETTER L
U+0069 i [Ll] LATIN SMALL LETTER I
U+006E n [Ll] LATIN SMALL LETTER N
U+006B k [Ll] LATIN SMALL LETTER K
U+00C3 Ã [Lu] LATIN CAPITAL LETTER A WITH TILDE
U+00B6 ¶ [Po] PILCROW SIGN
U+0070 p [Ll] LATIN SMALL LETTER P
U+0069 i [Ll] LATIN SMALL LETTER I
U+006E n [Ll] LATIN SMALL LETTER N
U+0067 g [Ll] LATIN SMALL LETTER G
U+0073 s [Ll] LATIN SMALL LETTER S
U+0020 [Zs] SPACE
U+0055 U [Lu] LATIN CAPITAL LETTER U
U+006E n [Ll] LATIN SMALL LETTER N
U+0069 i [Ll] LATIN SMALL LETTER I
U+0076 v [Ll] LATIN SMALL LETTER V
U+0065 e [Ll] LATIN SMALL LETTER E
U+0072 r [Ll] LATIN SMALL LETTER R
U+0073 s [Ll] LATIN SMALL LETTER S
U+0069 i [Ll] LATIN SMALL LETTER I
U+0074 t [Ll] LATIN SMALL LETTER T
U+0065 e [Ll] LATIN SMALL LETTER E
U+0074 t [Ll] LATIN SMALL LETTER T
U+002C , [Po] COMMA
U+00A0 \xa0 [Zs] NO-BREAK SPACE
U+004C L [Lu] LATIN CAPITAL LETTER L
U+0069 i [Ll] LATIN SMALL LETTER I
U+0055 U [Lu] LATIN CAPITAL LETTER U
>>> print(ftfy.fix_text(txt))
LinkÃ¶pings Universitet, LiU
Testing on a substring that does not contain \xa0 works fine:
>>> print(ftfy.fix_text(txt[:24]))
Linköpings Universitet,
Replacing \xa0 with a regular space also works:
>>> print(ftfy.fix_text(txt.replace('\xa0',' ')))
Linköpings Universitet, LiU
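The same pattern can apparently be reproduced without ftfy, using only the standard library. This is a rough sketch, not a description of how ftfy works internally: it assumes the mojibake comes from UTF-8 bytes that were decoded as Latin-1, and `undo_latin1_mojibake` is a hypothetical helper name, not part of ftfy.

```python
def undo_latin1_mojibake(text):
    """Try to reverse 'UTF-8 read as Latin-1' mojibake.

    Hypothetical helper for illustration; returns the input unchanged
    if the round trip is not possible.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Same string as above, written with escapes so the code points are explicit.
txt = 'Link\u00c3\u00b6pings Universitet,\u00a0LiU'

# Full string: the lone byte 0xA0 is not valid UTF-8 on its own,
# so decoding fails and the text comes back unchanged.
print(undo_latin1_mojibake(txt) == txt)

# Substring without the no-break space: the round trip succeeds.
print(undo_latin1_mojibake(txt[:24]))

# No-break space replaced by a plain space: the round trip succeeds too.
print(undo_latin1_mojibake(txt.replace('\xa0', ' ')))
```

If that assumption holds, the \xa0 on its own is what makes the whole string undecodable, which would match the behaviour seen above.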
I'm not sure whether this is the right way to solve the problem, or whether it is safe to use without missing other cases.