python - Python UnicodeEncodeError with pre-decoded UTF-8

Question

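I am trying to parse a batch of log files packed in tar.gz archives (up to 4 GiB). The source files come from RedHat 5.8 Server and SunOS 5.10 systems and have to be processed on Windows XP.

I iterate over the tar.gz files, read each file, decode its content as UTF-8, and parse it with regular expressions before further processing.

When I write out the processed data together with the raw data read from the tar.gz, I get the following error: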

Traceback (most recent call last):
  File "C:\WoMMaxX\lt_automation\Tools\LogParser.py", line 375, in <module>
    p.analyze_longtails()
  File "C:\WoMMaxX\lt_automation\Tools\LogParser.py", line 196, in analyze_longtails
    oFile.write(entries[key]['source'] + '\n')
  File "C:\Python\3.2\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 24835-24836: character maps to <undefined>

This is the part where I read and parse the log files:

import os
import tarfile
from datetime import datetime

def getSalesSoaplogEntries(perfid=None):
    # `parser` is an object defined elsewhere in the script.
    for tfile in parser.salestarfiles:
        path = os.path.join(parser.logpath, tfile)
        if os.path.isfile(path):
            if tarfile.is_tarfile(path):
                tar = tarfile.open(path, 'r:gz')
                for tarMember in tar.getmembers():
                    if 'salescomponent-soap.log' in tarMember.name:
                        tarMemberFile = tar.extractfile(tarMember)
                        # Bytes that are not valid UTF-8 become lone surrogates.
                        content = tarMemberFile.read().decode('UTF-8', 'surrogateescape')

                        for m in parser.soaplogregex.finditer(content):
                            entry = {}
                            # The year is taken from the current date.
                            entry['time'] = datetime(
                                datetime.now().year, int(m.group('month')), int(m.group('day')),
                                int(m.group('hour')), int(m.group('minute')), int(m.group('second')),
                                int(m.group('millis')) * 1000)
                            entry['perfid'] = m.group('perfid')
                            entry['direction'] = m.group('direction')
                            entry['payload'] = m.group('payload')
                            entry['file'] = tarMember.name
                            entry['source'] = m.group(0)
                            sm = parser.soaplogmethodregex.match(entry['payload'])
                            if sm:
                                entry['method'] = sm.group('method')

                                if parser.starttime <= entry['time'] <= parser.endtime:
                                    if perfid and entry['perfid'] == perfid:
                                        yield entry
                    # Clear tarfile's cached member list.
                    tar.members = []

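This is the part where I write out the processed information together with the raw data (an aggregation of all log entries belonging to one specific process):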

if entries:
    time = perfentry['time']
    filename = time.isoformat('-').replace(':','').replace('-','') + 'longtail_' + perfentry['perfid'] + '.txt'
    # No encoding is given, so the platform default is used.
    with open(os.path.join(parser.logpath, filename), 'w') as oFile:
        oFile.write(perfentry['source'] + '\n')
        oFile.write('------\n')
        for key in sorted(entries.keys()):
            oFile.write('------\n')
            oFile.write(entries[key]['source'] + '\n') #<-- here it is failing

I don't understand why reading the files as UTF-8 appears to work, yet writing them out as UTF-8 is not possible. What am I doing wrong?

score 1 · Accepted Answer

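Your output file uses the operating system's default encoding, which is not UTF-8 (the traceback shows cp1252). Use codecs.open instead of open and specify encoding='utf-8'.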

oFile = codecs.open(os.path.join(parser.logpath,filename), 'w', encoding='utf-8')

See http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data
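
As a hedged sketch of the fix: in Python 3 the built-in open() also accepts encoding and errors arguments, so codecs.open is not strictly required. The example below uses a made-up sample string and a temporary file path (both are assumptions for illustration, not taken from the question). It also passes errors='surrogateescape' when writing, on the assumption that the logs may contain bytes that are not valid UTF-8. The decode call in the question turns such bytes into lone surrogates, and the UTF-8 codec refuses to encode those under its default strict error handling.

```python
import os
import tempfile

# Hypothetical sample data: valid UTF-8 text followed by one stray byte
# (0xFF) that is not valid UTF-8.
raw = "Grüße, 日本語 ".encode("utf-8") + b"\xff"

# Same decoding as in the question: the undecodable byte becomes the
# lone surrogate U+DCFF instead of raising an error.
text = raw.decode("utf-8", "surrogateescape")

# cp1252 (the default in the traceback) cannot represent every character,
# which reproduces the kind of failure seen in the question.
try:
    text.encode("cp1252")
except UnicodeEncodeError as exc:
    print("cp1252 failed:", exc.reason)

# Hypothetical output location; the question writes into parser.logpath.
path = os.path.join(tempfile.mkdtemp(), "longtail_example.txt")

# Explicit encoding plus the same error handler used for decoding, so the
# escaped bytes are written back unchanged. newline="\n" avoids newline
# translation on Windows.
with open(path, "w", encoding="utf-8", errors="surrogateescape", newline="\n") as out_file:
    out_file.write(text + "\n")

# Read the file back as bytes to check the round trip.
with open(path, "rb") as in_file:
    written = in_file.read()
```

If the logs are known to be clean UTF-8, encoding='utf-8' alone should be enough. The traceback does not reveal which characters cp1252 rejected, so it may be worth inspecting the text around the reported position.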
