python - Beautiful Soup and UnicodeDecodeError

Question

I am trying to scrape a page, but I get a UnicodeDecodeError. Here is my code:

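import urllib2
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

def soup_def(link):
    req = urllib2.Request(link, headers={'User-Agent': "Magic Browser"})
    usock = urllib2.urlopen(req)
    # Charset as declared in the Content-Type response header
    encoding = usock.headers.getparam('charset')
    page = usock.read().decode(encoding)
    usock.close()
    soup = BeautifulSoup(page)
    return soup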

soup = soup_def("http://www.geekbuying.com/item/Ainol-Novo-10-Hero-II-Quad-Core--Tablet-PC-10-1-inch-IPS-1280-800-1GB-RAM-16GB-ROM-Android-4-1--HDMI-313618.html")

And the error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 284: invalid start byte

I looked at other reports of the same error, but I could not work out a solution from them.

score 2 · Accepted Answer

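Another possibility is that you are trying to parse a hidden file (this is very common on a Mac).

Add a simple if statement so that you only create BeautifulSoup objects for files that are actually HTML: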

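import os
from bs4 import BeautifulSoup

# folderPath is the directory to scan, defined elsewhere
for root, dirs, files in os.walk(folderPath, topdown=True):
    for fileName in files:
        if fileName.endswith(".html"):  # skips hidden files such as .DS_Store
            soup = BeautifulSoup(open(os.path.join(root, fileName)).read(), 'lxml')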
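The filter above can also be sketched as a small helper that skips dot-files explicitly and returns the matching paths. `find_html_files` is a hypothetical name, not something from the answer, and the sketch is written for Python 3 with the standard library only.

```python
import os

def find_html_files(folder_path):
    """Return sorted paths of non-hidden .html files under folder_path."""
    matches = []
    for root, dirs, files in os.walk(folder_path, topdown=True):
        # Prune hidden directories in place so os.walk does not descend into them.
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for file_name in files:
            # Skip hidden files such as .DS_Store and anything that is not HTML.
            if file_name.startswith(".") or not file_name.lower().endswith(".html"):
                continue
            matches.append(os.path.join(root, file_name))
    return sorted(matches)
```

Each returned path can then be opened with `open(path, "rb")` and the bytes passed to BeautifulSoup, which accepts bytes and tries to detect the encoding itself.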

score 0 · Accepted Answer

Here is what I found on Wikipedia about the byte 0xff: it is part of the UTF-16 byte order mark (BOM).

UTF-16
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.
If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text display that expects the text to be ISO-8859-1.
If the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a text display that expects the text to be ISO-8859-1.
Programs expecting UTF-8 may show these or error indicators, depending on how they handle UTF-8 encoding errors. In all cases they will probably display the rest of the file as garbage (a UTF-16 text containing ASCII only will be fairly readable).
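The byte sequences in the quoted passage are easy to check with the standard library. A minimal sketch in Python 3 syntax follows; the `<html>` sample payload is made up for illustration and is not taken from the page in the question.

```python
import codecs

# U+FEFF encoded in each byte order gives the two BOM sequences quoted above.
bom_be = u"\ufeff".encode("utf-16-be")   # b'\xfe\xff'
bom_le = u"\ufeff".encode("utf-16-le")   # b'\xff\xfe'
assert bom_be == codecs.BOM_UTF16_BE
assert bom_le == codecs.BOM_UTF16_LE

# Hypothetical sample: a little-endian UTF-16 payload with a BOM in front.
data = codecs.BOM_UTF16_LE + u"<html>".encode("utf-16-le")

try:
    data.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # 0xff can never start a UTF-8 sequence, hence "invalid start byte".
    error_message = str(exc)

# The generic "utf-16" codec reads the BOM and picks the byte order itself.
decoded = data.decode("utf-16")
```

Here `error_message` carries the same "invalid start byte" wording as the traceback in the question, while `decoded` holds the text without the BOM.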

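So I have two ideas here:

(1) The content may need to be treated as UTF-16 rather than UTF-8.

(2) The error occurs because you are trying to print the whole soup to the screen. In that case it comes down to whether your IDE (Eclipse/PyCharm) is smart enough to display those Unicode characters.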
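Idea (1) can be checked without guessing by looking at the first bytes of the response before decoding. The sketch below assumes the raw bytes from `usock.read()` and the charset from the response header are passed in; `decode_page` is a hypothetical helper name, and falling back to `errors="replace"` is one possible choice rather than the only one. It is written for Python 3.

```python
import codecs

def decode_page(raw, declared=None):
    """Decode raw bytes, preferring a BOM over the declared charset."""
    # A BOM at the start is stronger evidence than the HTTP header.
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    try:
        return raw.decode(declared or "utf-8")
    except (UnicodeDecodeError, LookupError):
        # Declared charset is wrong or unknown: keep going, but mark
        # undecodable bytes with U+FFFD instead of raising.
        return raw.decode("utf-8", errors="replace")
```

If the result is full of replacement characters, the bytes are probably not text in any of the tried encodings, which is worth knowing before handing them to BeautifulSoup.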

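If I were you, I would move on without printing the whole soup and collect only the piece you want. See whether you run into a problem at that step. If there is no problem there, it does not matter that the whole soup cannot be printed to the screen.

If you really want to print the soup to the screen, try: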

print soup.prettify(encoding='utf-16')
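If the failure really comes from the console rather than from decoding, another option is to make the text safe for whatever encoding the terminal reports. This is a sketch with the standard library only, in Python 3 syntax; `console_safe` is a hypothetical helper, not a Beautiful Soup function.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the console cannot encode replaced by '?'."""
    # Fall back to ASCII when stdout reports no encoding (e.g. when piped).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)
```

Something like `print(console_safe(soup.prettify()))` would then print what the console can show and a `?` for everything else, assuming `prettify()` is called without an `encoding` argument so that it returns text rather than bytes.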
