python - How to "clean" all entries in a feedparser feed

Question

I backed up my blog in Google's XML format. It is quite long. So far, I have done this:

>>> import feedparser
>>> blogxml = feedparser.parse('blog.xml')
>>> type(blogxml)
<class 'feedparser.FeedParserDict'>

In the book I am reading, the author does this:

>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
>>> llog['feed']['title']
u'Language Log'
>>> len(llog.entries)
15
>>> post = llog.entries[2]
>>> post.title
u"He's My BF"
>>> content = post.content[0].value
>>> content[:70]
u'<p>Today I was chatting with three of our visiting graduate students f'
>>> nltk.word_tokenize(nltk.html_clean(content))

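That works for me one entry at a time. As you can see, I already have a way to clean HTML using NLTK. What I really want is to grab all of the entries, strip the HTML from them (I already know how to do that and am not asking how; please read the question carefully), and end up with them as a plain-text string. This is about using feedparser correctly. Is there an easy way to do that?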

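Update:

As it turns out, I still haven't found an easy way to do this. Because of my ineptitude with Python, I was forced to do something a bit ugly.

This is what I thought I would do: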

import feedparser
import nltk

blog = feedparser.parse('myblog.xml')

with open('myblog','w') as outfile:
    for itemnumber in range(0, len(blog.entries)):
        conts = blog.entries[itemnumber].content
        cleanconts = nltk.word_tokenize(nltk.html_clean(conts))
        outfile.write(cleanconts)

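So, thank you very much @Rob Cowie, but your version (which looks great) did not work. I feel bad for not pointing that out sooner and accepting the answer, but I haven't had much time to work on this project. What I put below is all I could get working, but I'll leave this question open in case someone has something more elegant.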

import feedparser
import sys

blog = feedparser.parse('myblog.xml')
sys.stdout = open('blog','w')

for itemnumber in range(0, len(blog.entries)):
    print blog.entries[itemnumber].content

sys.stdout.close()

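Then I pressed CTRL-D to exit the interpreter, because I didn't know how to close the open file without closing Python's standard output. Then I re-entered the interpreter, opened the file, read it, and cleaned the HTML from there. (nltk.html_clean is a typo in the online version of the NLTK book itself, by the way... it is actually nltk.clean_html.) What I ended up with is almost, but not quite, plain text.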

score 1 · Accepted Answer

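import feedparser
import nltk

llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")

with open('myblog.txt', 'w') as outfile:
    for entry in llog.entries:
        ## Do your processing here
        content = entry.content[0].value
        # clean_html is the real name (html_clean is a typo in the NLTK book);
        # NLTK 3 no longer implements clean_html, so this needs an older NLTK
        tokens = nltk.word_tokenize(nltk.clean_html(content))
        # word_tokenize returns a list, so join it into a string before writing
        outfile.write(' '.join(tokens))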

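Fundamentally, you need to open a file, iterate over the entries (feed.entries), process each entry as required, and write an appropriate representation to the file.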
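If nltk.clean_html is not available in your NLTK version, the "process each entry" step can be sketched with the standard library alone. This is a minimal sketch, not a full HTML cleaner: strip_html and entries_to_text are hypothetical helper names, the sample entries only imitate the entry.content[0].value shape that feedparser exposes, and text inside script or style tags would be kept as-is.

```python
from html.parser import HTMLParser
from types import SimpleNamespace


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the fragment with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()  # flushes any text still buffered by the parser
    return " ".join("".join(parser.parts).split())


def entries_to_text(entries):
    """Yield one plain-text string per entry that has a content block."""
    for entry in entries:
        blocks = getattr(entry, "content", None)
        if not blocks:
            continue  # some feeds only provide a summary, not content
        yield strip_html(blocks[0].value)


# Stand-in entries shaped like feedparser's (entry.content[0].value).
# With a real feed you would pass feedparser.parse('myblog.xml').entries.
sample_entries = [
    SimpleNamespace(content=[SimpleNamespace(value="<p>Hello <b>world</b></p>")]),
    SimpleNamespace(content=[SimpleNamespace(value="<p>Tom &amp; Jerry</p>")]),
]
plain_posts = list(entries_to_text(sample_entries))
```

Joining plain_posts with whatever delimiter you prefer gives the single plain-text string asked for in the question.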

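I make no assumptions about how you want to delimit the post content in the text file. This snippet also doesn't write the post title or any metadata to the file.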
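If you do want titles and a delimiter, one possible layout is sketched below. The SEPARATOR string and the write_posts helper are illustrative assumptions, not anything feedparser prescribes, and the sample entries merely imitate the title and content[0].value attributes of real entries. The body is written as raw HTML here, so cleaning would still happen before or after this step.

```python
import io
from types import SimpleNamespace

SEPARATOR = "\n" + "-" * 40 + "\n"  # hypothetical delimiter between posts


def write_posts(entries, outfile, separator=SEPARATOR):
    """Write each entry's title and raw content, separated by `separator`."""
    count = 0
    for entry in entries:
        if count:
            outfile.write(separator)  # only between posts, not before the first
        title = getattr(entry, "title", "(untitled)")
        body = entry.content[0].value  # still HTML at this point
        outfile.write(title + "\n\n" + body + "\n")
        count += 1
    return count


sample_entries = [
    SimpleNamespace(title="First post", content=[SimpleNamespace(value="<p>One</p>")]),
    SimpleNamespace(title="Second post", content=[SimpleNamespace(value="<p>Two</p>")]),
]

buffer = io.StringIO()  # stands in for open('myblog.txt', 'w', encoding='utf-8')
written = write_posts(sample_entries, buffer)
output = buffer.getvalue()
```

Writing to an explicit file object, ideally opened in a with block, also means sys.stdout is never rebound, so the file can be closed without leaving the interpreter.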
