python - Replacing urllib2 with Requests when using lxml

Question

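I'm trying to replace urllib2 with requests in this code, which simply extracts some information from a page. I'm not 100% sure how I should move between the libraries. Below is what I have so far, along with the error it produces. What am I doing wrong?

Code: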

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests, sys
from lxml import etree
# import urllib2

# UTF8
reload(sys)
sys.setdefaultencoding("utf-8")

# url = 'http://countrycode.org/Germany'
# opener = urllib2.build_opener()
# opener.addheaders = [('User-agent', 'USERAGENT')]
r = requests.get('http://countrycode.org/Germany')
response = r.text
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)

countryCodeXpath = '//*[@id="main_table_blue_2"]/tr[3]/td[2]'
countryCode = tree.xpath(countryCodeXpath)
destCountryCode = countryCode[0].text

print destCountryCode

Error:

Traceback (most recent call last):
  File "/home/ubuntu/test.py", line 16, in <module>
    tree = etree.parse(response, htmlparser)
  File "lxml.etree.pyx", line 3196, in lxml.etree.parse (src/lxml/lxml.etree.c:64039)
  File "parser.pxi", line 1549, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:91262)
  File "parser.pxi", line 1578, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:91546)
  File "parser.pxi", line 1478, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:90613)
  File "parser.pxi", line 1025, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:87527)
  File "parser.pxi", line 565, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:83101)
  File "parser.pxi", line 656, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:84083)
  File "parser.pxi", line 594, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:83379)
IOError: Error reading file '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<SNIP>

score 3 · Accepted Answer

In addition to abamert's answer, you might be able to fix this using the raw response:

response = requests.get(url, stream=True)
response.raw.decode_content = True  # have urllib3 undo gzip/deflate before lxml reads the stream
tree = etree.parse(response.raw, htmlparser)

See Raw Response Content in the Requests package documentation.

This way Requests should not read all data into the text attribute but keep the raw response as a file-like object which should be readable by etree.parse().
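The underlying cause of the IOError is that etree.parse() treats a plain string as a filename or URL, so it needs a file-like object instead. Here is a minimal offline sketch of that idea in Python 3 syntax. It assumes io.BytesIO as a stand-in for response.raw, and the sample markup and the parse_file_like helper are made up for illustration:

```python
import io
from lxml import etree

# Hypothetical sample bytes standing in for what a server would send back.
SAMPLE_HTML = (
    b"<html><body><table id='codes'>"
    b"<tr><td>Germany</td><td>49</td></tr>"
    b"</table></body></html>"
)


def parse_file_like(stream):
    """Parse HTML from a file-like object, which is what etree.parse() expects."""
    return etree.parse(stream, etree.HTMLParser())


# io.BytesIO plays the role of response.raw here.
tree = parse_file_like(io.BytesIO(SAMPLE_HTML))
cells = tree.xpath('//table[@id="codes"]//td')
print([cell.text for cell in cells])
```

With a real request, response.raw would take the place of the BytesIO object.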

score 1 · Accepted Answer

Answered 2013-09-25T20:01:03.453

score 1 · Accepted Answer

Requests returns the response body as a string, so first parse that string into an HTML tree:

import requests
from lxml import html

response = requests.get('http://your.url')
parsed_body = html.fromstring(response.text)

Source: http://jakeaustwick.me/python-web-scraping-resource/
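Putting this together with the XPath from the question, a small offline sketch (Python 3 syntax) might look like the following. The sample page and the first_text helper are hypothetical. In real use the string would come from requests.get(...).text, and the live page may well be structured differently:

```python
from lxml import html

# Made-up markup that mimics the table the question's XPath targets.
SAMPLE_PAGE = """<html><body><table id="main_table_blue_2">
<tr><td>Country</td><td>Germany</td></tr>
<tr><td>ISO</td><td>DE</td></tr>
<tr><td>Country code</td><td>49</td></tr>
</table></body></html>"""


def first_text(page_source, xpath_expr):
    """Return the text of the first node matching xpath_expr, or None if nothing matches."""
    root = html.fromstring(page_source)
    matches = root.xpath(xpath_expr)
    return matches[0].text if matches else None


code = first_text(SAMPLE_PAGE, '//*[@id="main_table_blue_2"]/tr[3]/td[2]')
print(code)
```

Returning None for an empty match list avoids the IndexError that countryCode[0] would raise if the page layout changes.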
