python - Encoding in scraped data using Python

Question

I want to scrape the content of a website with Python. The text looks like this:

Apple’s stock continued to dominate the news over the weekend, with Barron’s placing it on the top of its favorite 2013 stock list.

But it prints with the wrong result:

Apple âs stock continued to dominate the news over the weekend, with Barronâs placing it on the top of its favorite 2013 stock list.

The symbol "’" is not displayed correctly. Here is my code:

    #-*- coding: utf-8 -*-

    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    import urllib
    from lxml import *
    import urllib
    import lxml.html as HTML

    url = "http://www.forbes.com/sites/panosmourdoukoutas/2012/12/09/apple-tops-barrons- 10-favorite-stocks-for-2013/?partner=yahootix"
    sock = urllib.urlopen(url)
    htmlSource = sock.read()
    sock.close()

    root = HTML.document_fromstring(htmlSource)
    contents = ' '.join([x.strip() for x in root.xpath("//div[@class='body']/descendant::text()")])

    print contents

    f = open('C:/Users/yinyao/Desktop/Python Code/data.txt','w')
    f.write(contents)
    f.close()

But after setting the default encoding, print no longer works. Why? What should I do? I am on Windows, where the default encoding is gbk.

score 1 · Accepted Answer

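First, make sure you have read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".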

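Second, always use unicode internally. Decode early, encode late: when you scrape a website, decode it to unicode and handle it as unicode inside your script. Otherwise your code will break at random points, for example because it meets an unexpected character in a comment on some Chinese web page. Only when you later pass the text on somewhere (for example, to a writable stream) should you encode it, preferably as 'utf-8'.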
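The "decode early, encode late" rule can be sketched as follows. This is a minimal illustration written for Python 3 (the question's code is Python 2), and it assumes the page is served as UTF-8; the sample bytes and the temporary output path are stand-ins invented for the example, not part of the original script.

```python
import io
import os
import tempfile

# Stand-in for the bytes a server would send (assumed to be UTF-8).
# U+2019 (the curly apostrophe) is the three bytes E2 80 99 in UTF-8.
raw = "Apple\u2019s stock".encode("utf-8")

# Decoding with the wrong codec reproduces the kind of garbling in the
# question: each of the three bytes becomes its own character.
garbled = raw.decode("latin-1")

# Decode early: turn bytes into text once, right after reading them.
text = raw.decode("utf-8")

# Work on text only in the middle of the script.
cleaned = text.strip()

# Encode late: name the encoding at the output boundary and let the
# file object do the encoding, instead of relying on a system default
# such as gbk.
path = os.path.join(tempfile.gettempdir(), "data.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(cleaned)
```

Passing an explicit `encoding` when opening the file is the design choice that matters here: the script's output then no longer depends on the console or locale settings of the machine it runs on.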

Third, use BeautifulSoup 4.
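As a rough sketch of what that could look like, the snippet below hands raw bytes to BeautifulSoup 4 and lets it work out the encoding. It is written for Python 3 and assumes the third-party `beautifulsoup4` package is installed; the HTML sample is a hypothetical stand-in for the downloaded page, with a `div` of class `body` chosen to mirror the XPath in the question.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical stand-in for the bytes returned by reading the page.
# The page declares its encoding in a <meta> tag.
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><div class="body"><p>Apple\u2019s stock</p>'
    '<p>Barron\u2019s list</p></div></body></html>'
).encode("utf-8")

# Give BeautifulSoup the undecoded bytes so it can detect the encoding
# itself; everything it returns afterwards is already unicode text.
soup = BeautifulSoup(html_bytes, "html.parser")

# Collect the text nodes under the div, stripped of surrounding whitespace.
body = soup.find("div", class_="body")
contents = " ".join(body.stripped_strings)
```

From this point `contents` is ordinary text, so the only place an encoding needs to be named is where it is finally written out.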
