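I'm having some trouble reading general files into a program I've written. The problem I currently have is that the PDFs are based on some sort of mutated UTF-8, including a BOM, which throws a wrench into my whole operation. In my application I use the Snowball stemming algorithm, which requires ASCII input. There are a number of topics about resolving UTF-8 errors, but none of them deal with feeding the result into the Snowball algorithm, or take into account that ASCII is what I want as the end result. The file I'm currently using is a Notepad file saved with the standard ANSI encoding. The specific error message I get is this: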
  File "C:\Users\svictoroff\Desktop\Alleyoop\Python_Scripts\Keywords.py", line 38, in Map_Sentence_To_Keywords
    Word = Word.encode('ascii', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 0: ordinal not in range(128)
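My understanding was that, in Python, including the 'ignore' argument would simply pass over any non-ASCII characters it encountered, so that I would bypass any BOM or special characters, but clearly that is not the case. The actual code being called is here: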
def Map_Sentence_To_Keywords(Sentence, Keywords):
    '''Takes in a sentence and a list of Keywords, returns a tuple where the
    first element is the sentence, and the second element is a set of
    all keywords appearing in the sentence. Uses Snowball algorithm'''
    Equivalence = stem.SnowballStemmer('english')
    Found = []
    Sentence = re.sub(r'^(\W*?)(.*)(\n?)$', r'\2', Sentence)
    Words = Sentence.split()
    for Word in Words:
        Word = Word.lower().strip()
        Word = Word.encode('ascii', 'ignore')
        Word = Equivalence.stem(Word)
    return (Sentence, Found)
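
A note on the traceback: the call is `encode`, yet the exception is a `UnicodeDecodeError`. In Python 2, calling `.encode()` on a byte `str` first decodes it implicitly with the default ASCII codec, and the 'ignore' handler applies only to the encode step that follows. The sketch below spells out that implicit step so it can be run on Python 2 or 3. The byte value is taken from the traceback; the names `raw` and `message` are illustrative only.

```python
# Minimal reproduction of the implicit decode step.
# The byte 0x96 comes from the traceback; `raw` and `message` are illustrative names.
raw = b'\x96'

try:
    # Python 2 performs this decode behind the scenes before
    # str.encode('ascii', 'ignore'), using the 'strict' handler, not 'ignore'.
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

If this reading is right, the input reaching `Word.encode` is still a byte string rather than Unicode text, so the error handler never gets a chance to run.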
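By including a general non-greedy non-word-character regex removal at the front of the string, I also hoped to strip out the troublesome characters, but that didn't work either. I've tried a number of encodings other than ASCII, and a strict base64 encoding works, but it is far from ideal for my application. Any ideas on how to fix this in an automated way?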
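
One automated approach, sketched under assumptions: decode the raw bytes with the encoding the file was actually saved in, then encode to ASCII with 'ignore'. The sketch assumes the Notepad "ANSI" file is Windows-1252 (`cp1252`), as it typically is on Western-locale Windows, where byte 0x96 is an en dash. `to_ascii` and its parameters are hypothetical names, not part of Snowball or any stemming library.

```python
def to_ascii(raw_bytes, source_encoding='cp1252'):
    """Decode raw bytes with the assumed source encoding, then drop non-ASCII characters.

    Returns an ASCII-only byte string (a plain `str` on Python 2).
    """
    # 'replace' turns bytes that are undefined in the source encoding into U+FFFD
    # instead of raising; the ASCII encode below then drops them.
    text = raw_bytes.decode(source_encoding, 'replace')
    # Here 'ignore' behaves as hoped, because the input is already Unicode text.
    return text.encode('ascii', 'ignore')


print(to_ascii(b'\x96 Solve linear equations'))
```

The result could then be lowercased and passed to the stemmer. If the files turn out to be UTF-8 with a BOM instead, passing `'utf-8-sig'` as `source_encoding` should strip the BOM during decoding.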
The initial decode of Element fails, but the Unicode error is returned when it is actually passed to the encoder.
for Element in Curriculum_Elements:
    Element = Element.decode('utf-8-sig')
    print Element
    Curriculum_Tuples.append(Map_Sentence_To_Keywords(Element, Keywords))

def scraping(File):
    '''Takes in txt file of curriculum, removes all newlines and returns that occur \
    after a lowercase character, then splits at all remaining newlines'''
    Curriculum_Elements = []
    Document = open(File, 'rb').read()
    Document = re.sub(r'(?<=[a-zA-Z,])\r?\n', ' ', Document)
    Curriculum_Elements = Document.split('\r\n')
    return Curriculum_Elements
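
A possible variant of `scraping` that decodes once, at the file boundary, so every element is already Unicode and no later `.decode()` call is needed. This is only a sketch, assuming the file is `cp1252`; `scraping_decoded` and its `encoding` parameter are hypothetical names. `io.open` is used because it accepts an `encoding` argument on both Python 2.6+ and Python 3.

```python
import io
import re


def scraping_decoded(path, encoding='cp1252'):
    """Read a curriculum text file as Unicode and split it into elements.

    Mirrors the regex and split of the original `scraping` function, but the
    text is decoded while reading (assumed encoding: cp1252).
    """
    # newline='' keeps '\r\n' untranslated, so the original regex and split still apply.
    with io.open(path, 'r', encoding=encoding, errors='replace', newline='') as handle:
        document = handle.read()
    # Join lines that were wrapped after a letter or a comma.
    document = re.sub(r'(?<=[a-zA-Z,])\r?\n', ' ', document)
    return document.split('\r\n')
```

Each returned element could then go straight to `.encode('ascii', 'ignore')` without an intermediate decode.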
for Element in Curriculum_Elements:
    Element = unicode(Element, 'utf-8-sig', 'ignore')
    print Element

Warning (from warnings module):
  File "C:\Python27\lib\encodings\utf_8_sig.py", line 19
    if input[:3] == codecs.BOM_UTF8:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal