python - Reading Hebrew in Python 3.2

Question

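I have a conversation log in Hebrew (a .txt file). I open the file with f=open("./WhatsApp.txt",'r',encoding="cp037"). Every line in the file contains a date and some text (for example: 14/01/13 12:10:52: דני נרדפייטרס: איילת יא רעה). I then defined a=f.readlines(), and this is where the problem shows up: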

>>> a[0] 'Õ]×\x91\x94\x07\x90\x91\x07\x91\x93\x80\x91\x16\x9a\x91\x90\x9a\x91\x04\x9a\x80SØ¡\x8b\x99\x04\x16\x80\x95\x90\x05\x16\x96\x94\x05\x91\x90\x98\x04SØÐ\x9a\x80PmPsPrPxPy\x80PzPmPjPpPrPyPnP¡\x80PmPrPn\x80PlPÆPnPxPyPqPrPnP¡\x80PnPæP°\x80PÆPzPnPpPlPnP¡\n'

I tried to decode this (I want the dates, and pulling them out of this string is hard), so I ran codecs.decode(a[0],"cp037") and got:

`Traceback (most recent call last):
  File "<pyshell#37>", line 1, in <module>
    codecs.decode(a[0],"cp037")
  File "C:\Python32\lib\encodings\cp037.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
TypeError: 'str' does not support the buffer interface`

Why does this happen? How can I read the file in a way that lets me split each line into a date and the text?

score 1 · Accepted Answer

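You opened the file in text mode with an encoding, so its contents have already been decoded to str. You don't need to decode them again: codecs.decode with a charmap codec such as cp037 expects bytes, which is why passing it a str raises the TypeError.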
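To make the bytes/str distinction concrete, here is a minimal sketch. The helper name `decode_once` and the sample string are made up for illustration; they are not from the question.

```python
# bytes are what is stored on disk; str is what a text-mode open() returns
raw = "14/01/13 12:10:52: שלום".encode("utf-8")   # bytes
text = raw.decode("utf-8")                          # str, already decoded


def decode_once(value, encoding="utf-8"):
    """Hypothetical helper: decode only if value is still bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # a str has nothing left to decode


print(type(raw).__name__, type(text).__name__)
print(decode_once(raw) == decode_once(text))
```

In other words, decoding is a one-way step from bytes to str, and text-mode open() has already taken it for you.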

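That said, the text does not look correctly decoded, which suggests the file was probably not in cp037 to begin with. Try opening it in binary mode and show us what the file looks like.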
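A rough sketch of what "open it in binary mode" could look like. The function name `sniff_encoding` and the temporary sample file are assumptions standing in for the real WhatsApp.txt. As a side note, the leading 'Õ]×' in the question's output appears to be what the bytes EF BB BF (the UTF-8 byte order mark) turn into when read as cp037, so checking for that mark is a reasonable first test.

```python
import codecs
import os
import tempfile


def sniff_encoding(path, n=32):
    """Return (first n raw bytes, whether they start with the UTF-8 BOM)."""
    with open(path, "rb") as f:  # binary mode: nothing is decoded
        head = f.read(n)
    return head, head.startswith(codecs.BOM_UTF8)


# Hypothetical sample file, written with a BOM so the sketch is runnable
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8-sig") as f:
    f.write("14/01/13 12:10:52: שלום\n")

head, has_bom = sniff_encoding(sample_path)
print(head)     # raw bytes exactly as stored on disk
print(has_bom)
```

If the raw bytes show readable ASCII digits and slashes for the date, the file is in an ASCII-compatible encoding, which rules out an EBCDIC code page like cp037.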

(In fact, I tried UTF-8 and it worked. The file is UTF-8, so simply changing cp037 to 'UTF-8' does the job.)
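For the second half of the question (splitting each line into a date and the text), something along these lines may work. It is only a sketch under several assumptions: the function names `parse_line` and `read_log` are made up, the timestamp is taken to be day/month/year, every line is assumed to look like the example in the question, and 'utf-8-sig' is used instead of plain 'utf-8' so that a leading byte order mark, if present, is dropped.

```python
from datetime import datetime


def parse_line(line):
    """Split 'dd/mm/yy HH:MM:SS: sender: message' into its three parts."""
    # The first ": " (colon + space) comes right after the timestamp,
    # because the colons inside the time are not followed by a space.
    stamp, _, rest = line.rstrip("\n").partition(": ")
    when = datetime.strptime(stamp, "%d/%m/%y %H:%M:%S")  # assumes day first
    sender, _, message = rest.partition(": ")
    return when, sender, message


def read_log(path):
    """Read a UTF-8 log file and parse every non-empty line."""
    with open(path, "r", encoding="utf-8-sig") as f:
        return [parse_line(line) for line in f if line.strip()]


when, sender, message = parse_line("14/01/13 12:10:52: דני נרדפייטרס: איילת יא רעה\n")
print(when.isoformat(), sender, message)
```

Lines that do not start with a timestamp (for example the continuation of a multi-line message) would make strptime raise ValueError, so a real log may need extra handling for those.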
