
I'm having some trouble reading general files into a program I'm making. The problem I'm currently running into is that the pdf is based on some mutated form of utf-8, including a BOM, which throws a wrench into my whole operation. Within my application I use the Snowball stemming algorithm, which requires ascii input. There are many topics on resolving utf-8 errors, but none of them involve feeding the text into the Snowball algorithm, or consider the fact that ascii is the end result I want. Currently the file I am using is a Notepad file saved with the standard ANSI encoding. The specific error message I get is this:

File "C:\Users\svictoroff\Desktop\Alleyoop\Python_Scripts\Keywords.py", line 38, in Map_Sentence_To_Keywords
    Word = Word.encode('ascii', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 0: ordinal not in range(128)

My understanding was that in python, including the ignore argument would simply pass over any non-ascii characters it encountered, and that this way I would bypass any BOM or special characters, but clearly that is not the case. The actual code being called is here:

import re
from nltk import stem   # assuming the stem module used here is NLTK's

def Map_Sentence_To_Keywords(Sentence, Keywords):
    '''Takes in a sentence and a list of Keywords, returns a tuple where the
    first element is the sentence, and the second element is a set of
    all keywords appearing in the sentence. Uses Snowball algorithm'''
    Equivalence = stem.SnowballStemmer('english')
    Found = []
    Sentence = re.sub(r'^(\W*?)(.*)(\n?)$', r'\2', Sentence)
    Words = Sentence.split()
    for Word in Words:
        Word = Word.lower().strip()
        Word = Word.encode('ascii', 'ignore')
        Word = Equivalence.stem(Word)
        Found.append(Word)
    return (Sentence, Found)
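For what it's worth, the stemmer itself behaves fine on plain ascii input, so the problem really does seem to be the bytes going in. A quick sanity check (again assuming the stem module is NLTK's):

from nltk import stem

Equivalence = stem.SnowballStemmer('english')
print Equivalence.stem('running')   # prints: run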

By including the non-greedy non-word-character regex removal at the front of the string, I had also hoped to strip out the troublesome characters, but that is not the case either. I have tried a number of other encodings besides ascii, and a strict base64 encoding does work, but it is very far from ideal for my application. Any ideas on how to fix this in an automated way?

The initial decode of Element fails, but then a unicode error comes back when it is actually passed to the encoder:

for Element in Curriculum_Elements:
    try:
        Element = Element.decode('utf-8-sig')
    except:
        print Element
    Curriculum_Tuples.append(Map_Sentence_To_Keywords(Element, Keywords))

def scraping(File):
    '''Takes in a txt file of curriculum, removes all newlines and carriage
    returns that occur after a letter or a comma, then splits at all
    remaining newlines'''
    Curriculum_Elements = []
    Document = open(File, 'rb').read()
    Document = re.sub(r'(?<=[a-zA-Z,])\r?\n', ' ', Document)
    Curriculum_Elements = Document.split('\r\n')
    return Curriculum_Elements

The code shown above is what generates the curriculum elements.
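To spell out what that re.sub is doing, here is its effect on a made-up two-line sample (note that the regex joins after any letter or comma, and the final '\r\n' leaves a trailing empty element):

import re

Sample = 'Unit one covers,\r\nstemming basics.\r\nUnit two covers parsing.\r\n'
Joined = re.sub(r'(?<=[a-zA-Z,])\r?\n', ' ', Sample)
# -> 'Unit one covers, stemming basics.\r\nUnit two covers parsing.\r\n'
print Joined.split('\r\n')
# -> ['Unit one covers, stemming basics.', 'Unit two covers parsing.', '']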

for Element in Curriculum_Elements:
    try:
        Element = unicode(Element, 'utf-8-sig', 'ignore')
    except:
        print Element

This type-conversion hackaround does work, but the conversion back to ascii is a bit shaky. It returns this error:

Warning (from warnings module):
  File "C:\Python27\lib\encodings\utf_8_sig.py", line 19
    if input[:3] == codecs.BOM_UTF8:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
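As far as I can tell, the warning itself can be reproduced by comparing a unicode string against the raw BOM bytes, which appears to be what utf_8_sig.py is doing internally when it gets handed something that has already been decoded:

import codecs

u'\ufeff' == codecs.BOM_UTF8   # unicode vs. the byte string '\xef\xbb\xbf'
# UnicodeWarning: Unicode equal comparison failed to convert both arguments
# to Unicode - interpreting them as being unequal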

1 Answer


Try decoding the UTF-8 input into a unicode string first, then encode that into ASCII (ignoring non-ASCII). It really doesn't make sense to encode a string that's already encoded.

input = file.read()   # Replace with your file input code...
input = input.decode('utf-8-sig')   # '-sig' handles BOM

# Now isinstance(input, unicode) is True

# ...
Sentence = Sentence.encode('ascii', 'ignore')
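As an aside, this is also why the 'ignore' flag never helped: in Python 2, calling .encode on a byte string first decodes it with the default ascii codec, and the error handler only applies to the encode step. A minimal repro, with 0x96 standing in for any non-ascii byte:

Word = 'dash \x96 here'          # a plain byte string containing 0x96
Word.encode('ascii', 'ignore')   # the implicit Word.decode('ascii') runs first...
# UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 5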

After the edits, I see that you were already attempting to decode the strings before encoding them in ASCII. But it seems the decoding was happening too late, after the file's contents had already been manipulated. This can cause problems, since not every UTF-8 byte is a character (some characters take several bytes to encode). Imagine an encoding that transforms any string into a sequence of a's and b's. You wouldn't want to manipulate it before decoding it, because you'd see a's and b's everywhere even if there weren't any in the unencoded string -- the same problem arises with UTF-8, albeit much more subtly, because most bytes really are characters.
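To make that concrete, here is a throwaway example (made-up string) of a byte-level operation landing in the middle of a multi-byte character:

Raw = u'caf\xe9 menu'.encode('utf-8')   # -> 'caf\xc3\xa9 menu'; the accented e is two bytes
Piece = Raw[:4]                         # 'caf\xc3' -- sliced mid-character
Piece.decode('utf-8')                   # UnicodeDecodeError: unexpected end of data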

So, decode once, before you do anything else:

def scraping(File):
    '''Takes in a txt file of curriculum, removes all newlines and carriage
    returns that occur after a letter or a comma, then splits at all
    remaining newlines'''
    Curriculum_Elements = []
    Document = open(File, 'rb').read().decode('utf-8-sig')
    Document = re.sub(r'(?<=[a-zA-Z,])\r?\n', ' ', Document)
    Curriculum_Elements = Document.split('\r\n')
    return Curriculum_Elements

# ...

for Element in Curriculum_Elements:
    Curriculum_Tuples.append(Map_Sentence_To_Keywords(Element, Keywords))

Your original Map_Sentence_To_Keywords function should work without modification, though I would suggest encoding to ASCII before splitting, just to improve efficiency/readability.
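For instance, something along these lines (an untested sketch, same imports as before):

def Map_Sentence_To_Keywords(Sentence, Keywords):
    '''Same as before, but encodes the whole sentence to ascii once
    instead of once per word.'''
    Equivalence = stem.SnowballStemmer('english')
    Sentence = re.sub(r'^(\W*?)(.*)(\n?)$', r'\2', Sentence)
    Ascii = Sentence.encode('ascii', 'ignore')
    Found = [Equivalence.stem(Word.lower().strip()) for Word in Ascii.split()]
    return (Sentence, Found)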

answered 2012-06-06T14:58:03.520