python - Get data between two tags in Python

Question

<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>

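Using Python, I want to get the text from the anchor tag, which should be: Granular computing based data mining in the views of rough set and fuzzy set

I tried using lxml: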

from io import StringIO
from lxml import etree

# html holds the markup shown above
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
xpath1 = "//h3/a/child::text() | //h3/a/span/child::text()"
rawResponse = tree.xpath(xpath1)
print(rawResponse)

and got the following output:

['\r\n\t\t', '\r\n\t\t\t\t\t\t\t\t\tgranular computing based', 'data', 'mining', 'in the view of roughset and fuzzyset\r\n\t\t\t\t\t\t']

score 3 · Accepted Answer

You can use the text_content method:

import lxml.html as LH

html = '''<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet">data</span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>'''

root = LH.fromstring(html)
for elt in root.xpath('//a'):
    print(elt.text_content())

which yields

Granular computing based
data
mining
in the views of rough set and fuzzy set

Or, to remove the extra whitespace, you could use

print(' '.join(elt.text_content().split()))

to obtain

Granular computing based data mining in the views of rough set and fuzzy set

Here is another option that you might find useful:

print(' '.join([elt.strip() for elt in root.xpath('//a/descendant-or-self::text()')]))

which yields

Granular computing based data  mining in the views of rough set and fuzzy set

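(Note that it leaves an extra space between data and mining: the whitespace-only text node between the two spans is stripped to an empty string, which still gets joined.)

'//a/descendant-or-self::text()' is a more general version of "//a/child::text() | //a/span/child::text()". It iterates through the text of all children, grandchildren, and so on.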
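To make the difference between the two XPath expressions concrete, here is a minimal sketch. It assumes a hypothetical variant of the question's HTML in which "data" is wrapped in one extra <b> tag, and it uses a small helper, joined, that is not part of lxml. The helper also skips whitespace-only text nodes, which is one way to avoid the double space mentioned above.

```python
import lxml.html as LH

# Hypothetical variant of the question's HTML: "data" is nested one level
# deeper, inside <b>, to show where the narrower XPath stops working.
html = '''<h3>
<a href="article.jsp?tp=&arnumber=16">
Granular computing based
<span class="snippet"><b>data</b></span>
<span class="snippet">mining</span>
in the views of rough set and fuzzy set
</a>
</h3>'''

root = LH.fromstring(html)


def joined(xpath):
    """Illustrative helper (not an lxml API): join the non-empty text nodes."""
    # Strip each text node and drop whitespace-only ones, so that joining
    # never produces a double space.
    return ' '.join(t.strip() for t in root.xpath(xpath) if t.strip())


narrow = joined('//a/child::text() | //a/span/child::text()')
general = joined('//a/descendant-or-self::text()')

print(narrow)   # misses "data", which now sits inside <b>
print(general)  # picks up text at every depth below <a>
```

The narrow expression only looks one level below <a> and its <span> children, so any deeper nesting silently drops text; the descendant-or-self axis does not depend on how deep the markup goes.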

score 1 · Accepted Answer

With BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> # html is the HTML string posted in the question
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(" ".join(soup.h3.text.split()))
Granular computing based data mining in the views of rough set and fuzzy set

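Explanation:

BeautifulSoup parses the HTML, making it easy to navigate. soup.h3 accesses the h3 tag in the HTML.

.text, simply put, returns all the text inside the h3 tag, with the markup of nested tags such as the spans stripped out.

I use split() here to get rid of the excess whitespace and newlines, and then " ".join() because split() returns a list.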
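The split/join step can be tried on its own, without any parser. The sketch below uses only the standard library; the raw string is an assumed stand-in for the kind of text .text returns, not actual BeautifulSoup output.

```python
# Assumed sample: text with stray newlines, tabs and spaces, similar in
# shape to what a parser hands back for the question's HTML.
raw = "\n\nGranular computing based\n\t data\nmining\n in the views of rough set and fuzzy set\n\n"

# split() with no argument splits on any run of whitespace and never
# produces empty strings, so leading/trailing whitespace disappears too.
words = raw.split()

# " ".join() turns the list of words back into one string, single-spaced.
cleaned = " ".join(words)

print(words[:3])
print(cleaned)
```

Calling split() with no argument matters here: split(" ") would split on single spaces only and leave empty strings and newlines in the result.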
