python-3.x - Python 3: readlines() indexing problem?

Question

Python 3.1.2 (r312:79147, Nov  9 2010, 09:41:54)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> open("/home/madsc13ntist/test_file.txt", "r").readlines()[6]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.1/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 2230: unexpected code byte

But...

Python 2.4.3 (#1, Sep  8 2010, 11:37:47)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> open("/home/madsc13ntist/test_file.txt", "r").readlines()[6]
'2010-06-14 21:14:43 613 xxx.xxx.xxx.xxx 200 TCP_NC_MISS 4198 635 GET http www.thelegendssportscomplex.com 80 /thumbnails/t/sponsors/145x138/007.gif - - - DIRECT www.thelegendssportscomplex.com image/gif http://www.thelegendssportscomplex.com/ "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; InfoPath.1; MS-RTC LM 8)" OBSERVED "Sports/Recreation" - xxx.xxx.xxx.xxx xxx.xxx.xxx.xxx\r\n'

Does anyone know why .readlines()[6] fails in Python 3 but works in 2.4?

Also... I thought 0xAE was ®

score 0 · Accepted Answer

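It has been about two years since this question was asked, so you probably know the reason by now. Basically, Python 3 strings are Unicode strings. To turn the bytes on disk into that abstract text, Python needs to be told which encoding the file uses.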

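Python 2 strings are really byte sequences, and Python 2 is happy to read whatever bytes it finds in the file. A few characters are interpreted (newline, tab, ...), but the rest are left untouched.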

Python 3's open() is similar to Python 2's codecs.open().
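The difference can be sketched with a few bytes instead of the real log file. The sample data below is made up; the only assumption carried over from the question is that the file contains a byte 0xAE, which is ® in ISO 8859-1 but is not a valid way to start a character in UTF-8 (there, ® is the two-byte sequence 0xC2 0xAE).

```python
# Made-up sample standing in for one line of the log file.
raw = b"Registered\xae trademark\r\n"

# Python 2 style: the bytes are handed back untouched, so nothing can fail.
print(raw)

# Python 3 text mode has to decode first; UTF-8 rejects a lone 0xAE.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print("utf-8 failed:", exc.reason)

# ISO 8859-1 maps every byte to the code point with the same value,
# so 0xAE becomes U+00AE (the registered sign).
text = raw.decode("iso8859-1")
print(text)
```

Opening the file with mode "rb" in Python 3 gives the same kind of bytes object as raw, which is the closest match to what Python 2.4 returned.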

... and it is about time to close the question by accepting one of the answers.

score 0 · Accepted Answer

From the documentation of the open() function:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

Always specify the encoding when reading a file:

open("/home/madsc13ntist/test_file.txt", "r",encoding='iso8859-1').readlines()[6]

Want to ignore decoding errors? Set errors='ignore'. The default value of errors is None, which behaves the same as 'strict'.
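As a rough illustration of the errors argument, the sketch below writes a small temporary file (a hypothetical stand-in for the log file, with one stray 0xAE byte) and reads it back as UTF-8 under three handlers. The helper name read_lines is made up for this example.

```python
import os
import tempfile

# Hypothetical stand-in for the log file: the second line holds a stray 0xAE.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"first line\nAcme\xae Corp\n")
    path = tmp.name

def read_lines(errors):
    # Decode as UTF-8 and apply the given error handler.
    with open(path, "r", encoding="utf-8", errors=errors) as f:
        return f.readlines()

ignored = read_lines("ignore")    # the undecodable byte is silently dropped
replaced = read_lines("replace")  # the undecodable byte becomes U+FFFD

try:
    read_lines("strict")          # same behaviour as errors=None
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

os.remove(path)
print(ignored[1], replaced[1], strict_failed)
```

Note that 'ignore' loses data without any warning, so passing the correct encoding is usually the safer fix when the encoding is known.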

score 0 · Accepted Answer

From the Python wiki:

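A UnicodeDecodeError normally happens when decoding a str string from a certain encoding. Since encodings map only a limited number of str strings to Unicode characters, an illegal sequence of str characters will cause the encoding-specific decode() to fail.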

It looks as though your file's encoding is different from what you think it is.
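One way to check that suspicion is to look at the byte the decoder complained about. The helper below is only a sketch: first_bad_byte is a made-up name, and the sample bytes are invented rather than taken from the real file. It relies on the start attribute that UnicodeDecodeError carries.

```python
def first_bad_byte(data, encoding="utf-8", context=10):
    """Return (offset, byte value, surrounding bytes), or None if data decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        lo = max(exc.start - context, 0)
        return exc.start, data[exc.start], data[lo:exc.start + context]
    return None

# Invented sample; for a real file, data would come from open(path, "rb").read().
sample = b"GET /logo.gif Acme\xae Corp 200\r\n"
result = first_bad_byte(sample)
print(result)
```

Seeing the bytes around the reported offset usually makes it easier to guess the real encoding, for example a ® sign in text that was saved as ISO 8859-1.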
