python - Parsing Solr output in Python

Question

I am trying to parse Solr output of the following form:

<doc>
<str name="source">source:A</str>
<str name="url">URL:A</str>
<date name="p_date">2012-09-08T10:02:01Z</date>
</doc>
<doc>
<str name="source">source:B</str>
<str name="url">URL:B</str>
<date name="p_date">2012-08-08T11:02:01Z</date>
</doc>

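I would like to use Beautiful Soup (the version that has BeautifulStoneSoup; I think that is before BS4) to parse the document. I have used Beautiful Soup for HTML parsing, but somehow I cannot find an efficient way to extract the contents of the tags.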

I have written:

for tags in soup('doc'):
    print tags.renderContents()

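I get the feeling that I could brute-force my way to the output (for example, by running each result through Beautiful Soup again), but I would prefer an efficient solution for extracting the data. The output I need is: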

source:A
URL:A
2012-09-08T10:02:01Z
source:B
URL:B
2012-08-08T11:02:01Z

Thanks

score 2 · Accepted Answer

Use an XML parser for the task instead; xml.etree.ElementTree is included with Python:

from xml.etree import ElementTree as ET
import urllib2

# `ET.fromstring()` expects a string containing XML to parse:
# tree = ET.fromstring(solrdata)
# Use `ET.parse()` for a filename or open file object, such as returned by urllib2
# (`url` is the address of your Solr query):
tree = ET.parse(urllib2.urlopen(url))

for doc in tree.findall('.//doc'):
    for elem in doc:
        print elem.attrib['name'], elem.text
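
For readers on Python 3, here is a minimal self-contained sketch of the same approach. It assumes the <doc> elements sit under a single root element; the <result> wrapper below is added only so that the sample from the question is one well-formed XML document. `extract_values` is a hypothetical helper name, not part of any library. It collects only the element text, which is the output the question asks for.

```python
from xml.etree import ElementTree as ET

# Sample data from the question, wrapped in an assumed <result> root so the
# string is a single well-formed XML document.
SOLR_XML = """<result>
<doc>
<str name="source">source:A</str>
<str name="url">URL:A</str>
<date name="p_date">2012-09-08T10:02:01Z</date>
</doc>
<doc>
<str name="source">source:B</str>
<str name="url">URL:B</str>
<date name="p_date">2012-08-08T11:02:01Z</date>
</doc>
</result>"""


def extract_values(xml_text):
    """Return the text of every child element of every <doc>, in document order."""
    root = ET.fromstring(xml_text)
    return [elem.text for doc in root.iter('doc') for elem in doc]


if __name__ == "__main__":
    # Print one value per line, matching the layout requested in the question.
    for value in extract_values(SOLR_XML):
        print(value)
```

If the field names are needed as well, `elem.attrib['name']` is available on each child element, as in the loop above.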

score 1 · Accepted Answer

Do you have to use this particular output format? Solr supports a Python output format out of the box (at least in version 4); just use wt=python in the query.
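
A rough sketch of how such a response might be consumed, in Python 3. The `SAMPLE_RESPONSE` text is hand-written for illustration in the general shape of a Solr reply (a real one carries more fields), and `docs_from_python_response` is a hypothetical helper name. It uses `ast.literal_eval` rather than `eval`, so that only Python literals are accepted from the server.

```python
import ast

# Hypothetical response text, written by hand for illustration; an actual
# wt=python response from Solr will contain additional fields.
SAMPLE_RESPONSE = """{
 'responseHeader': {'status': 0, 'QTime': 1},
 'response': {'numFound': 2, 'start': 0, 'docs': [
   {'source': 'source:A', 'url': 'URL:A', 'p_date': '2012-09-08T10:02:01Z'},
   {'source': 'source:B', 'url': 'URL:B', 'p_date': '2012-08-08T11:02:01Z'}]}}"""


def docs_from_python_response(text):
    """Parse a Python-literal response body and return its list of documents."""
    data = ast.literal_eval(text)  # accepts literals only, never executes code
    return data['response']['docs']


if __name__ == "__main__":
    for doc in docs_from_python_response(SAMPLE_RESPONSE):
        print(doc['source'])
        print(doc['url'])
        print(doc['p_date'])
```

Each document arrives as a plain dict, so no XML handling is needed on the client side.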
