
I'm using the feedparser library in Python to retrieve news from a local newspaper (my intention is to do natural language processing over this corpus), and I'd like to be able to retrieve many past entries from the RSS feed.

I'm not very familiar with the technical side of RSS, but I think this should be possible (I can see that, for example, Google Reader and Feedly can do this "on demand" as I move the scroll bar).

When I do the following:

import feedparser

url = 'http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml'
feed = feedparser.parse(url)
for post in feed.entries:
    # Each entry exposes the item's fields (title, link, ...) as attributes.
    title = post.title

I only get a dozen or so entries. I was thinking of hundreds. Maybe all the entries from the last month, if possible. Is it possible to do this using only feedparser?

I intend to take only the links to the news items from the RSS feed, and parse the full pages with BeautifulSoup to get the text I want (a sketch of that approach follows). An alternative solution would be a crawler that follows all the local links on a page to collect lots of news items, but I'd like to avoid that for now.
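For reference, a minimal sketch of that feed-plus-page-scraping idea, assuming the requests and BeautifulSoup (bs4) packages are installed; the 'article-body' class name is a placeholder you'd replace after inspecting the newspaper's actual HTML:

import feedparser
import requests
from bs4 import BeautifulSoup

url = 'http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml'
feed = feedparser.parse(url)

for post in feed.entries:
    # The feed itself only carries the link and a summary; fetch the full page.
    html = requests.get(post.link).text
    soup = BeautifulSoup(html, 'html.parser')
    # 'article-body' is a hypothetical class name: inspect the site's markup
    # to find the element that actually wraps the story text.
    body = soup.find('div', {'class': 'article-body'})
    if body is not None:
        text = body.get_text(separator=' ', strip=True)
        print(post.title, len(text))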

--

One solution that has come up is to use the Google Reader RSS cache:

http://www.google.com/reader/atom/feed/http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml?n=1000

But to access this I have to be logged in to Google Reader. Does anyone know how I can do that from Python? (I really don't know a thing about the web; I usually only mess with numerical calculus.)
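For the record, here is a rough sketch of how that login worked at the time, assuming Google's ClientLogin flow (both ClientLogin and Google Reader have since been shut down, so this is historical and untested; the credentials are placeholders):

from urllib.parse import urlencode
from urllib.request import Request, urlopen

# ClientLogin took form-encoded credentials; 'reader' was Reader's
# service name and 'source' an arbitrary client identifier.
params = urlencode({
    'Email': 'you@example.com',       # placeholder
    'Passwd': 'your-password',        # placeholder
    'service': 'reader',
    'accountType': 'GOOGLE',
    'source': 'feed-archive-script',
}).encode('ascii')
response = urlopen('https://www.google.com/accounts/ClientLogin',
                   params).read().decode('ascii')

# The response was key=value lines; 'Auth' was the token Reader expected.
token = dict(line.split('=', 1) for line in response.splitlines())['Auth']

request = Request(
    'http://www.google.com/reader/atom/feed/'
    'http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml?n=1000',
    headers={'Authorization': 'GoogleLogin auth=' + token})
atom = urlopen(request).read()  # an Atom document that feedparser can parse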


2 Answers


You're only getting a dozen or so entries because that's all the feed contains. If you want historical data, you will have to find a feed/database of said data.

Check out this ReadWriteWeb article for some resources on finding open data on the web.

Note that Feedparser has nothing to do with this, despite what your title suggests. Feedparser parses what you give it; it can't find historic data unless you find that data and pass it in yourself. It is simply a parser. Hope that clears things up! :)
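To illustrate that point: feedparser.parse accepts a URL, a local file path, or a raw string, and simply returns whatever entries that one document holds, nothing more. A quick sketch, assuming a hypothetical locally saved copy of the feed:

import feedparser

# feedparser never fetches more than the one document you point it at.
raw = open('saved_feed.xml').read()  # hypothetical saved copy of the feed
feed = feedparser.parse(raw)
print(len(feed.entries))  # just however many items that document contains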

Answered 2009-11-04T20:02:51.713

To expand on Bartek's answer: You could also start storing all of the entries in the feed that you've already seen, and build up your own historical archive of the feed's content. This would delay your ability to start using it as a corpus (because you'd have to do this for a month to build up a collection of a month's worth of entries), but you wouldn't be dependent on anyone else for the data.
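A minimal sketch of that idea, assuming entries are deduplicated by their id (falling back to the link) and kept in a local JSON file; the file name is arbitrary, and you'd run this periodically, e.g. from cron:

import json
import os

import feedparser

FEED_URL = 'http://feeds.folha.uol.com.br/folha/emcimadahora/rss091.xml'
ARCHIVE = 'feed_archive.json'  # hypothetical local store

# Load whatever has been collected on previous runs.
if os.path.exists(ARCHIVE):
    with open(ARCHIVE) as f:
        archive = json.load(f)
else:
    archive = {}

feed = feedparser.parse(FEED_URL)
for entry in feed.entries:
    # entry.id is the stable identifier when the feed provides one;
    # fall back to the link otherwise.
    key = entry.get('id', entry.link)
    if key not in archive:
        archive[key] = {
            'title': entry.title,
            'link': entry.link,
            'published': entry.get('published', ''),
        }

with open(ARCHIVE, 'w') as f:
    json.dump(archive, f)

After a month of such runs you'd have roughly a month's worth of entries, which is exactly the delay mentioned above.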

I may be mistaken, but I'm pretty sure that's how Google Reader can go back in time: They have each feed's past entries stored somewhere.

Answered 2009-11-04T20:13:56.057