python - Reading/writing files with umlauts in Python (HTML to txt)

Question

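I know this has been asked several times, but I think I am doing everything right and it still doesn't work, so before I go clinically insane I'll make a post. This is the code (it is supposed to convert an HTML file to a txt file and leave out certain lines):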

fid = codecs.open(htmlFile, "r", encoding = "utf-8")
if not fid:
    return
htmlText = fid.read()
fid.close()

stripped = strip_tags(unicode(htmlText))   ### strip html tags (this is not the prob)
lines = stripped.split('\n')
out = []

for line in lines: # just some stuff i want to leave out of the output
    if len(line) < 6:
        continue
    if '*' in line or '(' in line or '@' in line or ':' in line:
        continue
    out.append(line)

result=  '\n'.join(out)
base, ext = os.path.splitext(htmlFile)
outfile = base + '.txt'

fid = codecs.open(outfile, "w", encoding = 'utf-8')
fid.write(result)
fid.close()

Thanks!

score 0 · Accepted Answer

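You haven't specified the problem, so this is a complete guess.

What does your strip_tags() function return? Does it return a unicode object or a byte string? If it is the latter, it will probably cause decoding problems when you try to write it to the file. For example, if strip_tags() returns a utf-8 encoded byte string: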

>>> import codecs
>>> s = u'This is \xe4 test\nHere is \xe4nother line.'
>>> print s
This is ä test
Here is änother line.

>>> s_utf8 = s.encode('utf-8')
>>> f=codecs.open('test', 'w', encoding='utf8')
>>> f.write(s)    # no problem with this... s is unicode, but
>>> f.write(s_utf8)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib64/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)

If this is what you are seeing, you need to make sure that you pass unicode to fid.write(result), which probably means ensuring that strip_tags() returns unicode.
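
One way to guarantee that only unicode reaches the write call is to normalize the value first. The following is a minimal sketch, not the asker's code: it is written for Python 3 (where str plays the role of Python 2's unicode), ensure_text is a hypothetical helper name, and it assumes that any byte string it receives is UTF-8 encoded.

```python
import io
import os
import tempfile


def ensure_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string.

    Hypothetical helper; assumes byte strings use the given encoding.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Simulate a strip_tags() that returned UTF-8 encoded bytes.
raw = u"This is \xe4 test".encode("utf-8")
text = ensure_text(raw)

path = os.path.join(tempfile.mkdtemp(), "test.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)  # text is already decoded, so no implicit ASCII decode happens

with io.open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

Decoding at the boundary like this means the rest of the script only ever handles text, and the file object does the single encode step on the way out.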

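Also, a few other things I noticed in passing:

codecs.open() raises an IOError exception if the file cannot be opened. It does not return None, so the if not fid: test will not help. You need to use try/except, ideally together with with: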

try:
    with codecs.open(htmlFile, "r", encoding = "utf-8") as fid:
        htmlText = fid.read()
except IOError as e:
    # handle error
    print e

Also, the data you read from a file opened via codecs.open() is automatically converted to unicode, so calling unicode(htmlText) has no effect.
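
Both points can be checked with a short script. This is only a sketch under a few assumptions: it targets Python 3, it uses io.open (which also takes an encoding argument) in place of codecs.open, and read_text and the sample file names are made up for the illustration.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Return the decoded contents of path, or None if it cannot be opened."""
    try:
        with io.open(path, "r", encoding=encoding) as fid:
            return fid.read()  # already decoded text, no extra conversion needed
    except IOError as e:  # on Python 3, IOError is an alias of OSError
        print(e)
        return None


tmp_dir = tempfile.mkdtemp()
sample = os.path.join(tmp_dir, "sample.html")
with io.open(sample, "w", encoding="utf-8") as f:
    f.write(u"<p>\xe4nother line</p>")

html_text = read_text(sample)
# A missing file raises inside open(); it never yields a falsy file object.
missing = read_text(os.path.join(tmp_dir, "does_not_exist.html"))
```

Returning None on failure is just one possible policy; re-raising or logging may suit a real script better.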

score 0 · Accepted Answer

Not sure, but by doing

'\n'.join(out)

with a non-unicode string (a plain old bytes string), you may be falling back to some non-UTF-8 codec. Try:

u'\n'.join(out)

to make sure you are using unicode objects everywhere.
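
Applied to the filtering loop from the question, keeping everything unicode could look like the sketch below. It assumes Python 3.3 or later (where the u prefix is accepted again); filter_lines is a hypothetical name, and the sample input is invented for the illustration.

```python
def filter_lines(text):
    """Drop short lines and lines containing any of the excluded characters."""
    out = []
    for line in text.split(u"\n"):
        if len(line) < 6:
            continue
        if any(ch in line for ch in u"*(@:"):
            continue
        out.append(line)
    # Joining with a unicode separator keeps the result unicode as well.
    return u"\n".join(out)


stripped = (
    u"This is \xe4 test\n"
    u"short\n"
    u"mail@example.com here\n"
    u"Here is \xe4nother line."
)
result = filter_lines(stripped)
```

With every literal and the separator as unicode, no step depends on an implicit bytes-to-text conversion.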
