
I want to parse an XML feed from Baidu (GB2312-encoded): http://news.baidu.com/n?cmd=1&class=civilnews&tn=rss

I always get this error:

xml.parsers.expat.ExpatError: not well-formed (invalid token): line 3, column 8

If I switch the URL to the Google feed, http://news.google.com/news?cf=all&ned=us&hl=en&topic=b&output=rss, it works. Any suggestions?

import xml.etree.ElementTree as etree
from urllib import urlopen  # Python 2; on Python 3 this lives in urllib.request

URL = "http://news.baidu.com/n?cmd=1&class=civilnews&tn=rss"
#URL = "http://news.google.com/news?cf=all&ned=us&hl=en&topic=b&output=rss"

def get_feeds():
    # Raises ExpatError with the Baidu URL; parses fine with the Google URL.
    tree = etree.parse(urlopen(URL))

if __name__ == '__main__':
    get_feeds()

1 Answer


Use the excellent feedparser library; it has no problems parsing that URL:

>>> import feedparser
>>> feed = feedparser.parse('http://news.baidu.com/n?cmd=1&class=civilnews&tn=rss')
>>> print feed['feed']['title']
百度国内焦点新闻
>>> len(feed['entries'])
20
>>> print feed['entries'][0]['title']
强台风“天兔”正逐渐接近台湾陆地
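
If you would rather stay with ElementTree, a likely explanation for the error is that the expat parser behind it does not handle multi-byte encodings such as GB2312, so it trips on the first Chinese character. A minimal sketch of one workaround, assuming Python 3, is to decode the bytes yourself, drop the XML declaration, and hand the parser UTF-8 instead. The helper name `parse_legacy_xml` and the in-memory sample document are hypothetical and only for illustration; they are not part of the original question or of any library.

```python
import re
import xml.etree.ElementTree as etree


def parse_legacy_xml(raw, source_encoding="gb2312"):
    """Parse XML bytes in a legacy multi-byte encoding by transcoding to UTF-8.

    `parse_legacy_xml` is a hypothetical helper, not a library function.
    """
    # Decode with Python's own codec; undecodable bytes become U+FFFD
    # instead of raising. "gb18030" is a superset of GB2312 and may be a
    # safer choice if the feed contains characters outside GB2312.
    text = raw.decode(source_encoding, "replace")
    # Remove the XML declaration so its encoding="gb2312" label no longer
    # applies; without a declaration the parser assumes UTF-8.
    text = re.sub(r"^\s*<\?xml[^>]*\?>", "", text, count=1)
    return etree.fromstring(text.encode("utf-8"))


# Small stand-in for the real feed, so the example runs without network access.
sample = (
    '<?xml version="1.0" encoding="gb2312"?>\n'
    "<rss><channel><title>百度国内焦点新闻</title></channel></rss>"
).encode("gb2312")

root = parse_legacy_xml(sample)
title = root.findtext("channel/title")
print(title)
```

With the real feed you would pass the bytes returned by `urlopen(URL).read()` in place of `sample`. feedparser remains the simpler option, since it takes care of encoding detection for you.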
Answered 2013-09-20T22:14:50.930