python - Problem scraping data with BeautifulSoup

Question

I wrote the following trial code to retrieve the titles of legislative acts from the European Parliament.

import urllib2
from BeautifulSoup import BeautifulSoup

search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-%.4d&language=EN"

for number in xrange(1,10):   
    url = search_url % number
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page)
    title = soup.findAll("title")
    print title

However, whenever I run it, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 20, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 70: ordinal not in range(128)

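I've narrowed it down to BeautifulSoup not being able to read the fourth document in the loop. Can anyone explain to me what I'm doing wrong?

Kind regards

Thomas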

score 4 · Accepted Answer

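BeautifulSoup works in Unicode, so it is not responsible for that encoding error. More likely, your problem is with the print statement: your standard output seems to be in ASCII (i.e., sys.stdout.encoding is 'ascii' or absent), so you would indeed get such errors when trying to print a string containing non-ASCII characters.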

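What's your OS? How is your console (AKA terminal) set up (e.g., if on Windows, which "code page")? Did you set PYTHONIOENCODING in the environment to control sys.stdout.encoding, or are you just hoping the encoding will be picked up automatically?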
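To check which of these cases applies, something like the following may help. It is only a minimal sketch: the helper name can_encode is made up for illustration, and the only fact taken from the question is the character u'\u2013' (an en dash) named in the traceback.

```python
import sys

EN_DASH = u"\u2013"  # the character named in the traceback


def can_encode(text, encoding):
    """Hypothetical helper: True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# What the interpreter believes the console encoding is.
# This can be None, for example when output is piped on Python 2.
print("stdout encoding: %r" % (getattr(sys.stdout, "encoding", None),))

# The en dash has no ASCII representation, but UTF-8 can represent it.
print("ascii can encode en dash: %s" % can_encode(EN_DASH, "ascii"))
print("utf-8 can encode en dash: %s" % can_encode(EN_DASH, "utf-8"))
```

If the first line reports 'ascii' (or 'ANSI_X3.4-1968', or None), printing the fourth title is expected to fail in the way the traceback shows.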

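On my Mac, where the encoding is detected correctly, running your code (unchanged except for also printing the number alongside each title, for clarity) works fine and shows: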

$ python ebs.py 
1 [<title>REPORT Report on the proposal for a Council regulation temporarily suspending autonomous Common Customs Tariff duties on imports of certain industrial products into the autonomous regions of Madeira and the Azores - A7-0001/2010</title>]
2 [<title>REPORT Report on the proposal for a Council directive concerning mutual assistance for the recovery of claims relating to taxes, duties and other measures - A7-0002/2010</title>]
3 [<title>REPORT Report on the proposal for a regulation of the European Parliament and of the Council amending Council Regulation (EC) No 1085/2006 of 17 July 2006 establishing an Instrument for Pre-Accession Assistance (IPA) - A7-0003/2010</title>]
4 [<title>REPORT on equality between women and men in the European Union – 2009 - A7-0004/2010</title>]
5 [<title>REPORT Report on the proposal for a Council decision on the conclusion by the European Community of the Convention on the International Recovery of Child Support and Other Forms of Family Maintenance - A7-0005/2010</title>]
6 [<title>REPORT on the proposal for a Council directive on administrative cooperation in the field of taxation - A7-0006/2010</title>]
7 [<title>REPORT Report on promoting good governance in tax matters - A7-0007/2010</title>]
8 [<title>REPORT Report on the proposal for a Council Directive amending Directive 2006/112/EC as regards an optional and temporary application of the reverse charge mechanism in relation to supplies of certain goods and services susceptible to fraud - A7-0008/2010</title>]
9 [<title>REPORT Recommendation on the proposal for a Council decision concerning the conclusion, on behalf of the European Community, of the Additional Protocol to the Cooperation Agreement for the Protection of the Coasts and Waters of the North-East Atlantic against Pollution - A7-0009/2010</title>]
$

score 1 · Accepted Answer

Replacing

print title

with

for t in title:
    print(t)

or

print('\n'.join(t.string for t in title))

works. I'm not entirely sure why print <somelist> sometimes works and sometimes doesn't.
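If the console really is limited to ASCII, another possible workaround (not the one proposed in this answer) is to replace the characters that cannot be encoded before printing. This is a rough sketch: safe_text is a hypothetical helper, and the title string is copied from the output shown in the other answer.

```python
def safe_text(text, encoding="ascii"):
    """Hypothetical helper: round-trip `text` through `encoding`,
    replacing characters it cannot represent with '?'."""
    return text.encode(encoding, "replace").decode(encoding)


title = u"REPORT on equality between women and men in the European Union \u2013 2009 - A7-0004/2010"

# The en dash is replaced by '?', so the result is safe on an ASCII console.
print(safe_text(title))
```

This loses the original character, so it is only suitable for display; for storing the titles, writing them out with a real Unicode encoding is the better choice.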

score 0 · Accepted Answer

If you want to print the titles to a file, you need to specify an encoding that can represent the non-ASCII character; UTF-8 should work fine. To do this, add:

import codecs
out = codecs.open('titles.txt', 'w', 'utf8')

at the top of the script, and print to the file:

print >> out, title
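The same idea can be sketched with io.open, which accepts an encoding argument on both Python 2 and Python 3. The function name write_titles and the temporary directory are assumptions made so the example is self-contained; the title text is taken from the output shown in the first answer.

```python
import io
import os
import tempfile


def write_titles(path, titles):
    """Hypothetical helper: write one title per line, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as out:
        for t in titles:
            out.write(t + u"\n")


titles = [
    u"REPORT on equality between women and men in the European Union \u2013 2009 - A7-0004/2010",
]

# A temporary directory is used here only to avoid cluttering the working directory.
path = os.path.join(tempfile.mkdtemp(), "titles.txt")
write_titles(path, titles)

# Read the file back with the same encoding to confirm the en dash survived.
with io.open(path, "r", encoding="utf-8") as f:
    content = f.read()
```

Because the encoding is stated explicitly when the file is opened, the result does not depend on what sys.stdout.encoding happens to be.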
