9

I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError I get.

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True,recover=True,encoding=encd)
tree = etree.parse(response, parser)

But I get an error:

tree   = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)
4

3 Answers

45

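I ran into a similar problem, and it turns out it has nothing to do with encodings. What's happening is this: lxml is throwing you a totally unrelated error. In this case, the error is that the .parse function expects a filename or URL, not a string containing the contents itself. However, when it tries to print out the error, it chokes on non-ASCII characters and shows that completely confusing error message. It is highly unfortunate, and other people have commented on this issue here: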

https://mailman-mail5.webfaction.com/pipermail/lxml/2009-February/004393.html
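To see where a message like that comes from, here is a minimal sketch in Python 3 of the generic mechanism only: decoding UTF-8 bytes with the ascii codec. It is not lxml's actual code path, and `raw` and `describe_decode_failure` are hypothetical names made up for the illustration.

```python
# Hypothetical sample: UTF-8 bytes containing a Polish character (0xc5 0x9b)
raw = b'<title>Wiadomo\xc5\x9bci</title>'


def describe_decode_failure(data, codec="ascii"):
    """Return the decode error message for data, or None if decoding succeeds."""
    try:
        data.decode(codec)
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# The ascii codec rejects byte 0xc5, while utf-8 decodes the same bytes cleanly.
ascii_problem = describe_decode_failure(raw)
utf8_problem = describe_decode_failure(raw, "utf-8")
```

The byte 0xc5 is simply the first half of a multi-byte UTF-8 character, so the data itself is fine. Only the implicit ASCII step fails.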

Luckily, your solution is very simple. Just replace .parse with .fromstring and you should be good to go:

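# Python 2 (urllib2); chardet and lxml are third-party packages
import urllib2
import chardet
from lxml import etree

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')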
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True,recover=True,encoding=encd)

## lxml Y U NO MAKE SENSE!!!
tree = etree.fromstring(response, parser)

Just tested it on my machine and it worked fine. Hope it helps!
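For anyone on Python 3, here is a minimal sketch of the same idea without the network call. `FEED` is a hypothetical stand-in for the downloaded bytes, and the helper names are made up. As an assumption for portability, it falls back to the standard library's xml.etree.ElementTree when lxml is not installed, since both expose fromstring and parse for this purpose. If you prefer .parse, wrapping the bytes in a file-like object also works.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: the stdlib module is close enough for this illustration
    import xml.etree.ElementTree as etree

# Hypothetical stand-in for the bytes returned by response.read()
FEED = (
    u'<?xml version="1.0" encoding="utf-8"?>\n'
    u'<rss><channel><title>Wiadomo\u015bci</title></channel></rss>'
).encode("utf-8")


def title_via_fromstring(data):
    """Parse in-memory bytes directly; fromstring returns the root element."""
    root = etree.fromstring(data)
    return root.findtext("channel/title")


def title_via_parse(data):
    """Give parse a file-like object instead of the raw contents."""
    tree = etree.parse(io.BytesIO(data))
    return tree.getroot().findtext("channel/title")
```

Both helpers return the same title. The point is only that .parse wants something it can open or read from, while .fromstring takes the contents themselves.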

answered 2012-01-18T21:49:56.887
4

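It is often easier to load the string yourself and sort it out for the lxml library first, then call fromstring on it, rather than rely on the lxml.etree.parse() function and its difficult-to-manage encoding options.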

This particular RSS file begins with an encoding declaration, so everything should just work:

<?xml version="1.0" encoding="utf-8"?>

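The following code shows some of the variations you can apply to make etree parse different encodings. You can also ask it to write out a different encoding, which will then appear in the XML declaration of the output.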

import lxml.etree
import urllib2

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request).read()
print [response]
        # ['<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\xc5\x9bci...']

uresponse = response.decode("utf8")
print [uresponse]    
        # [u'<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\u015bci...']

tree = lxml.etree.fromstring(response)
res = lxml.etree.tostring(tree)
print [res]
        # ['<feed xmlns="http://www.w3.org/2005/Atom">\n<title>Wiadomo&#347;ci...']

lres = lxml.etree.tostring(tree, encoding="latin1")
print [lres]
        # ["<?xml version='1.0' encoding='latin1'?>\n<feed xmlns=...<title>Wiadomo&#347;ci...']


# works because the 38 character encoding declaration is sliced off
print lxml.etree.fromstring(uresponse[38:])   

# throws ValueError(u'Unicode strings with encoding declaration are not supported.',)
print lxml.etree.fromstring(uresponse)

The code can be tried here: http://scraperwiki.com/scrapers/lxml_and_encoding_declarations/edit/#
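The hard-coded slice of 38 characters only fits this exact declaration. A slightly more general sketch (Python 3, standard library only) strips whatever leading XML declaration is present. `strip_xml_declaration` and `sample` are hypothetical names, and the regex is a simplification rather than a full XML prolog parser.

```python
import re

# Matches a leading declaration such as <?xml version="1.0" encoding="utf-8"?>
# The \s after "xml" keeps other processing instructions like <?xml-stylesheet ...?>.
_XML_DECL = re.compile(r'^\s*<\?xml\s[^>]*\?>\s*')


def strip_xml_declaration(text):
    """Return text without its leading XML declaration, if there is one."""
    return _XML_DECL.sub('', text, count=1)


# Hypothetical already-decoded feed contents
sample = (
    u'<?xml version="1.0" encoding="utf-8"?>\n'
    u'<feed><title>Wiadomo\u015bci</title></feed>'
)
stripped = strip_xml_declaration(sample)
```

The result can then be handed to lxml.etree.fromstring as a unicode string. The other option is to skip decoding and pass the original bytes, so that lxml reads the declaration itself.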

answered 2011-05-04T10:36:24.460
0

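You should probably only try to define the character encoding as a last resort, since it is already clear what the encoding is from the XML prolog (if not from the HTTP headers). In any case, there is no need to pass an encoding to etree.XMLParser unless you want to override the declared encoding; so get rid of the encoding parameter and it should work.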
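If you want to check what the prolog actually declares before deciding whether to override anything, a rough sketch in Python 3 could look like this. `declared_encoding` is a hypothetical helper, the regex is a simplification, and it ignores byte order marks, which a real XML parser also considers.

```python
import re

# Captures the encoding name from a leading XML declaration in raw bytes
_DECL_ENCODING = re.compile(
    br'^\s*<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data, default="utf-8"):
    """Return the lower-cased encoding named in the XML declaration, or default."""
    match = _DECL_ENCODING.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # Without a declaration (and without a BOM), XML is read as UTF-8
    return default
```

For the feed in the question this would report utf-8, which is why guessing with chardet and passing encoding= to the parser should not be necessary.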

Edit: okay, the problem actually seems to be with lxml. For whatever reason, the following works:

from lxml import etree

parser = etree.XMLParser(ns_clean=True, recover=True)
etree.parse('http://wiadomosci.onet.pl/kraj/rss.xml', parser)
answered 2011-04-28T00:12:28.897