python-3.x - How to keep only Big5 characters in a text file

Question

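I found some text on a Taiwanese website. I stripped the HTML and kept only the part I need in a txt file. The contents of the txt file display correctly in Firefox/Chrome. With Python 3, if I do f = open(text_file).read(), I get this error: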

'utf-8' codec can't decode byte 0xa1 in position 29: invalid start byte

ETA: I'm on Ubuntu, so I'm happy with any solution in Python or in the terminal!

If I do f = codecs.open(os.path.join(path, 'my_text.txt'), 'r', encoding='Big5') and then call read(), I get the following message:

'big5' codec can't decode byte 0xf9 in position 1724: illegal multibyte sequence

I only need the Chinese characters. How can I keep only those that are encoded as Big5? That would get rid of the error, wouldn't it?

score 1 · Accepted Answer

with open(filename, encoding='utf-8', errors='replace') as file:
    text = file.read()

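Your file probably uses some other character encoding, or even (if the code that saved the text is buggy) a mixture of several character encodings.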
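One way to narrow down which encoding the file uses is to read it as raw bytes and see which candidate codecs decode it without raising. The following is a minimal sketch: the function name `decodable_encodings` and the candidate list are assumptions chosen for illustration, not something from the question. `cp950` and `big5hkscs` are included because they are Big5 supersets shipped with Python, and a lead byte such as 0xf9 may belong to one of their extension ranges (this is a guess about this particular file).

```python
def decodable_encodings(raw, candidates=("utf-8", "big5", "cp950", "big5hkscs")):
    """Return the candidate encodings that decode `raw` without an error.

    `raw` is a bytes object, e.g. the result of open(path, "rb").read().
    The candidate list is only a guess; extend it as needed.
    """
    ok = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec rejects at least one byte sequence
        ok.append(name)
    return ok


# Small self-contained demonstration with bytes known to be Big5.
sample = "中文".encode("big5")
print(decodable_encodings(sample))
```

A codec that decodes without error is only a candidate: several encodings can accept the same bytes and produce different text, so the decoded output still needs a visual check.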

You can check which encoding your browser uses, e.g. in Chrome: "More tools -> Encoding".
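If the browser does report Big5 and the goal is simply to drop whatever cannot be decoded, the same `errors` mechanism can be applied with the Big5 codec instead of UTF-8. This is a sketch under that assumption; the helper name `keep_decodable` is hypothetical. `errors="ignore"` silently discards undecodable bytes, whereas `errors="replace"` would leave a U+FFFD marker in their place.

```python
def keep_decodable(raw, encoding="big5"):
    """Decode `raw` bytes, silently dropping bytes the codec cannot decode."""
    return raw.decode(encoding, errors="ignore")


# Demonstration: a stray 0xff byte between two valid Big5 characters.
damaged = "中".encode("big5") + b"\xff" + "文".encode("big5")
print(keep_decodable(damaged))
```

For a file on disk, open(filename, encoding='big5', errors='ignore') gives the same effect without handling the bytes manually. Dropping bytes loses information, so it is worth trying 'cp950' or 'big5hkscs' first in case the file decodes cleanly with one of them.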
