python - How to scrape the text inside div(id="bodyContent") of any Wikipedia article. I am using Python with BeautifulSoup and nltk

Question

page=nltk.clean_html(soup.findAll('div',id="bodyContent"))

When I try to run this code, it shows:

Traceback (most recent call last):
  File "C:\Python27\wiki3.py", line 36, in <module>
    page=nltk.clean_html(soup.findAll('div',id="bodyContent"))
  File "C:\Python27\lib\site-packages\nltk-2.0.4-py2.7.egg\nltk\util.py", line 340, in clean_html
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
AttributeError: 'ResultSet' object has no attribute 'strip'

score 1 · Accepted Answer

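You are passing clean_html an iterable of BeautifulSoup objects (the ResultSet that findAll returns) rather than a string (which is what clean_html expects).

Assuming you want a list of cleaned strings, one per div, do the following: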

page = [nltk.clean_html(str(d)) for d in soup.findAll('div',id="bodyContent")]

Or, converting each element to a string before it reaches clean_html:

page = map(nltk.clean_html, map(str, soup.findAll('div',id="bodyContent")))
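The failure and the fix described in this answer can be reproduced without BeautifulSoup or NLTK installed. The sketch below is only an illustration: `clean_html_sketch` is a hypothetical, simplified stand-in for `nltk.clean_html` (it reuses the regex from the traceback), and `matches` is a plain list standing in for the `ResultSet` that `findAll` returns.

```python
import re


def clean_html_sketch(html):
    """Hypothetical, simplified stand-in for nltk.clean_html; expects one string."""
    # .strip() exists on strings, not on list-like result sets
    cleaned = re.sub(r"(?is)<(script|style).*?>.*?(</\1>)", "", html.strip())
    cleaned = re.sub(r"(?s)<.*?>", " ", cleaned)  # replace remaining tags with spaces
    return re.sub(r"\s+", " ", cleaned).strip()   # collapse runs of whitespace


# Assumption: a plain list plays the role of the ResultSet returned by findAll
matches = ['<div id="bodyContent"><p>Hello <b>world</b></p></div>']

error_message = None
try:
    clean_html_sketch(matches)  # passing the whole collection fails
except AttributeError as exc:
    error_message = str(exc)

# Converting each element to a string first works
texts = [clean_html_sketch(str(m)) for m in matches]
```

Here `error_message` reports the missing `strip` attribute, just like the traceback in the question, while `texts` holds one cleaned string per matched element.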

score 0 · Accepted Answer

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
import nltk
import re
import codecs

article = "Maratha Empire"
article = urllib.quote(article)

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # Wikipedia needs this

resource = opener.open("http://en.wikipedia.org/wiki/" + article)
data = resource.read()

soup = BeautifulSoup(data)

# Join every text node found inside the bodyContent div
for node in soup.findAll('div', id="bodyContent"):
    page = ''.join(node.findAll(text=True))

f = codecs.open("wikiscrap2", "w", "utf-8-sig")
f.write(page)
f.close()

At least with this code I can retrieve the contents of a Wikipedia page from the bodyContent div.
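For readers on Python 3, where the old `BeautifulSoup` 3 and `urllib2` modules are not available, the same idea (collect every text node inside the `bodyContent` div) can be sketched with the standard library alone. This swaps in `html.parser` instead of BeautifulSoup, so it is a different technique from the answers above. `DivTextExtractor`, `extract_div_text`, and the `sample` markup are hypothetical names and data made up for this illustration.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Hypothetical helper: collect text inside any <div> with the given id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0    # <div> nesting depth while inside the target div
        self.skip = 0     # greater than 0 while inside <script> or <style>
        self.chunks = []  # text nodes collected so far

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
            elif tag in ("script", "style"):
                self.skip += 1
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "div":
            self.depth -= 1
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.depth and not self.skip:
            self.chunks.append(data)


def extract_div_text(html, div_id="bodyContent"):
    """Return the whitespace-normalized text found inside div#div_id."""
    parser = DivTextExtractor(div_id)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Made-up markup standing in for a downloaded article page
sample = """<html><body>
<div id="header">Nav</div>
<div id="bodyContent">
  <p>Article <b>text</b></p>
  <script>var x = 1;</script>
  <div class="note">inner note</div>
</div>
<div id="footer">Foot</div>
</body></html>"""

page_text = extract_div_text(sample)
```

In this sketch `page_text` contains only the text from inside the target div, with the script body left out. A real page would first have to be downloaded and decoded to a string before being passed to `extract_div_text`.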
