python - lxml unicode output problem

Question

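I'm new to python and lxml, so please bear with me. I'm now stuck on what appears to be a unicode issue. I tried .encode and BeautifulSoup's UnicodeDammit with no luck. I searched the forums and the web, but with my lack of python skills I couldn't apply the suggested solutions to my particular code. Any help is appreciated, thanks.

Code: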

import requests
import lxml.html

sourceUrl = "http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty.htm"

sourceHtml = requests.get(sourceUrl)

htmlTree = lxml.html.fromstring(sourceHtml.text)

for stockCodes in htmlTree.xpath('''/html/body/printfriendly/table/tr/td/table/tr/td/table/tr/table/tr/td'''):
    string = stockCodes.text
    print string

Error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)

score 0 · Accepted Answer

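When I run your code like this, python lx.py, I don't get an error message. But when I redirect stdout to a file, python lx.py > output.txt, the error occurs (when stdout is not a terminal, Python 2 falls back to the ASCII codec for printing unicode). So try this: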

# -*- coding: utf-8 -*-
import requests
import lxml.html
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

This switches the default encoding from ASCII to UTF-8, which the Python runtime then uses whenever it has to convert implicitly between byte strings and unicode.
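As an alternative that avoids changing the process-wide default, each string could be encoded explicitly before it is written. The following is only a minimal sketch, assuming UTF-8 output is wanted: write_line is a hypothetical helper (not part of lxml or requests), and io.BytesIO stands in for stdout redirected to a file.

```python
# -*- coding: utf-8 -*-
import io


def write_line(stream, text, encoding="utf-8"):
    """Hypothetical helper: encode text explicitly and write it to a binary stream."""
    if text is None:  # an lxml element's .text is None when it has no text
        text = u""
    stream.write(text.encode(encoding) + b"\n")


# io.BytesIO is an assumption standing in for stdout redirected to a file.
buf = io.BytesIO()
write_line(buf, u"\xa0" + u"00001")  # non-breaking space, as in the error above
write_line(buf, None)                # empty cell
output = buf.getvalue()
```

In the loop from the question, the same idea would amount to encoding stockCodes.text to UTF-8 before printing it, with a guard for None.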

score 0 · Accepted Answer

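In requests, the content attribute returns the raw bytes, while the text attribute returns the body already decoded to unicode. You could also try sourceHtml.text.encode('utf-8') or sourceHtml.text.encode('ascii'), but I'm fairly sure the latter will raise the same exception.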
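The difference between the two encode calls can be checked without any network access. This is a small sketch in which the literal below is an assumed stand-in for sourceHtml.text; the 'replace' error handler is shown as one lossy option and is not something the answer itself proposes.

```python
# -*- coding: utf-8 -*-
# Assumed stand-in for sourceHtml.text: a non-breaking space plus a stock code.
text = u"\xa0" + u"0001"

# UTF-8 can represent every unicode character, so this always succeeds.
utf8_bytes = text.encode("utf-8")

# ASCII cannot represent u'\xa0', so this raises the error from the question.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# A lossy variant: unencodable characters become '?'.
ascii_lossy = text.encode("ascii", "replace")
```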
