python - Extracting HTML data with BeautifulSoup and Python

Question

My HTML text consists of many instances of the following structure:

<DOC>
<DOCNO> XXX-2222 </DOCNO>
<FILEID>AP-NR-02-12-88 2344EST</FILEID>
<HEAD>Reports Former Saigon Officials Released from Re-education Camp</HEAD>
<TEXT>
Lots of text here
</TEXT>
</DOC>

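What I need to do is index each structure, with its DocNo, headline, and text, so that I can analyze it later (tokenization, etc.).

I was thinking of using BeautifulSoup, and this is the code I have so far: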

soup = BeautifulSoup (file("AP880212.html").read()) 
num = soup.findAll('docno')

But this only gives me results in the following format:

<docno> AP880212-0166 </docno>, <docno> AP880212-0167 </docno>, <docno> AP880212-0168 </docno>, <docno> AP880212-0169 </docno>, <docno> AP880212-0170 </docno>

How do I extract the numbers inside the tags, and link them to the headlines and texts?

Thank you very much,

Sasha

score 2 · Accepted Answer

To get the contents of a tag:

docnos = soup.findAll('docno')
for docno in docnos:
    # contents[0] is the first child of the tag, here its text node
    print(docno.contents[0])
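As a minimal sketch of the same idea with the surrounding whitespace removed, the snippet below uses `get_text(strip=True)` from bs4. It assumes the `bs4` package is installed and uses Python's built-in `html.parser`, which lowercases tag names; the sample input is trimmed down from the question, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample input; the DOCNO values come from the question's output.
html = """<DOC>
<DOCNO> AP880212-0166 </DOCNO>
</DOC>
<DOC>
<DOCNO> AP880212-0167 </DOCNO>
</DOC>"""

# html.parser lowercases tag names, so search for "docno".
soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) returns the tag's text without leading/trailing whitespace.
docnos = [tag.get_text(strip=True) for tag in soup.find_all("docno")]
print(docnos)  # ['AP880212-0166', 'AP880212-0167']
```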

score 1 · Accepted Answer

Something like this:

html = """<DOC>
<DOCNO> XXX-2222 </DOCNO>
<FILEID>AP-NR-02-12-88 2344EST</FILEID>
<HEAD>Reports Former Saigon Officials Released from Re-education Camp</HEAD>
<TEXT>
Lots of text here
</TEXT>
</DOC>
"""

import bs4

d = {}

soup = bs4.BeautifulSoup(html, features="xml")
docs = soup.findAll("DOC")
for doc in docs:
    d[doc.DOCNO.getText()] = (doc.HEAD.getText(), doc.TEXT.getText())

print(d)
# Output (Python 3):
# {' XXX-2222 ':
#    ('Reports Former Saigon Officials Released from Re-education Camp',
#     '\nLots of text here\n')}

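Note that I pass features="xml" to the constructor. This is because your input contains many non-standard HTML tags. You may also want to call .strip() on the text before saving it to the dictionary, so that it is not so sensitive to whitespace (unless, of course, that is your intent).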
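Here is a hedged sketch of the `.strip()` suggestion. Note that it swaps the answer's `features="xml"` for Python's built-in `html.parser`, purely so the example runs without lxml installed; that parser lowercases tag names, so the lookups use `"doc"`, `"docno"`, and so on. The dictionary name `index` is illustrative, not something from the original answer.

```python
from bs4 import BeautifulSoup

html = """<DOC>
<DOCNO> XXX-2222 </DOCNO>
<FILEID>AP-NR-02-12-88 2344EST</FILEID>
<HEAD>Reports Former Saigon Officials Released from Re-education Camp</HEAD>
<TEXT>
Lots of text here
</TEXT>
</DOC>
"""

# Assumption: html.parser instead of the answer's XML parser (no lxml needed).
soup = BeautifulSoup(html, "html.parser")

index = {}
for doc in soup.find_all("doc"):
    docno = doc.find("docno").get_text().strip()
    head = doc.find("head").get_text().strip()
    # Use find("text") here: doc.text is bs4's all-text property,
    # not the <text> child element.
    body = doc.find("text").get_text().strip()
    index[docno] = (head, body)

print(index)
```

With the values stripped, the key becomes `'XXX-2222'` rather than `' XXX-2222 '`, which makes later lookups less error-prone.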

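Update:

If the same file contains multiple DOC elements and features="xml" limits you to just one, it is probably because the XML parser expects exactly one root element.

For example, if you wrap the entire input XML in a single root element, it should work: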

<XMLROOT>
    <!-- Existing XML (e.g. list of DOC elements) -->
</XMLROOT>

So you can either do this in your file or, as I would suggest, do it programmatically before passing the input text to BeautifulSoup:

root_element_name = "XMLROOT"  # this can be anything
rooted_html = "<{0}>\n{1}\n</{0}>".format(root_element_name, html)
soup = bs4.BeautifulSoup(rooted_html, features="xml")
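Putting the pieces together, the following is a sketch of the root-wrapping approach applied to two DOC elements. It assumes both `bs4` and `lxml` are installed (bs4 needs lxml for `features="xml"`). The DOCNO values are taken from the question's output, while the headlines and bodies are placeholder text invented for the example.

```python
import bs4

# Two DOC elements with no common root; headlines and bodies are placeholders.
html = """<DOC>
<DOCNO> AP880212-0166 </DOCNO>
<HEAD>First headline</HEAD>
<TEXT>
First body
</TEXT>
</DOC>
<DOC>
<DOCNO> AP880212-0167 </DOCNO>
<HEAD>Second headline</HEAD>
<TEXT>
Second body
</TEXT>
</DOC>
"""

# Wrap everything in a single root element so the XML parser sees one document.
root_element_name = "XMLROOT"  # arbitrary name
rooted_html = "<{0}>\n{1}\n</{0}>".format(root_element_name, html)
soup = bs4.BeautifulSoup(rooted_html, features="xml")

index = {}
for doc in soup.find_all("DOC"):
    # The XML parser keeps tag names case-sensitive, so DOCNO/HEAD/TEXT work as-is.
    key = doc.DOCNO.get_text().strip()
    index[key] = (doc.HEAD.get_text().strip(), doc.TEXT.get_text().strip())

print(sorted(index))
```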

score 0 · Accepted Answer

docnos = soup.findAll('docno')
for docno in docnos:
    # renderContents() returns the markup inside the tag
    print(docno.renderContents())

You can also use renderContents() to extract the information from a tag.
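In bs4, `renderContents()` is a legacy name that returns bytes; `decode_contents()` returns the inner markup as a string. The sketch below uses `decode_contents()` under that assumption, with `html.parser` (lowercased tag names) and sample DOCNO values from the question.

```python
from bs4 import BeautifulSoup

html = "<DOCNO> AP880212-0166 </DOCNO><DOCNO> AP880212-0167 </DOCNO>"
soup = BeautifulSoup(html, "html.parser")

# decode_contents() gives the markup inside each tag as a str;
# strip() removes the padding spaces around the number.
inner = [tag.decode_contents().strip() for tag in soup.find_all("docno")]
print(inner)  # ['AP880212-0166', 'AP880212-0167']
```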
