python - How to extract a list of label/value pairs with Scrapy when HTML tags are missing

Question

I am currently working with a document that looks like this:

<b> label1 </b>
value1 <br>
<b> label2 </b>
value2 <br>
....

I can't figure out a clean way to handle this with XPath in Scrapy. This is my best implementation so far:

hxs = HtmlXPathSelector(response)

section = hxs.select(..............)
values = section.select("text()[preceding-sibling::b/text()]")
labels = section.select("text()/preceding-sibling::b/text()")

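But I don't like this approach of matching nodes from two lists by index. I would rather iterate over one list (values or labels) and query the matching node with a relative XPath. For example: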

values = section.select("text()[preceding-sibling::b/text()]")
for value in values:
    value.select("/preceding-sibling::b/text()")

I have been tweaking this expression, but it never returns any matches.

Update

I am looking for a robust approach that tolerates "noise", for example:

garbage1<br>
<b> label1 </b>
value1 <br>
<b> label2 </b>
value2 <br>
garbage2<br>
<b> label3 </b>
value3 <br>
<div>garbage3</div>

score 1 · Accepted Answer

Edit: Sorry, I used lxml, but it works the same way as Scrapy's own selectors.

For the specific HTML you provided, this will work:

>>> s = """<b> label1 </b>
... value1 <br>
... <b> label2 </b>
... value2 <br>
... """
>>> 
>>> import lxml.html
>>> lxml.html.fromstring(s)
<Element span at 0x10fdcadd0>
>>> soup = lxml.html.fromstring(s)
>>> soup.xpath("//text()")
[' label1 ', '\nvalue1 ', ' label2 ', '\nvalue2 ']
>>> res = soup.xpath("//text()")
>>> for i in range(0, len(res), 2):
...     print(res[i:i+2])
... 
[' label1 ', '\nvalue1 ']
[' label2 ', '\nvalue2 ']
>>>

Edit 2:

>>> # etree is assumed to be the parsed tree of the noisy HTML from the update
>>> bs = etree.xpath("//text()[preceding-sibling::b/text()]")
>>> for b in bs:
...     if b.getparent().tag == "b":
...         print([b.getparent().text, b])
... 
[' label1 ', '\nvalue1 ']
[' label2 ', '\nvalue2 ']
[' label3 ', '\nvalue3 ']
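
The session above does not show how the tree was built, so here is a self-contained sketch of the same idea. It assumes the noisy HTML from the question's update, wrapped in a `<div>` that is added here only so the fragment has a single root. The helper name `extract_pairs` is hypothetical. It relies on lxml returning text nodes as "smart strings" whose `getparent()` is the element the text follows when `is_tail` is true.

```python
import lxml.html

# Noisy sample from the question's update. The outer <div> is an assumption
# added here so that the fragment has exactly one root element.
HTML = """<div>
garbage1<br>
<b> label1 </b>
value1 <br>
<b> label2 </b>
value2 <br>
garbage2<br>
<b> label3 </b>
value3 <br>
<div>garbage3</div>
</div>"""


def extract_pairs(html):
    """Return (label, value) tuples for every <b> that is directly followed by text."""
    root = lxml.html.fromstring(html)
    pairs = []
    # Only text nodes that have a <b> somewhere before them among their siblings.
    for text in root.xpath("//text()[preceding-sibling::b]"):
        parent = text.getparent()
        # For tail text, getparent() is the element the text comes right after.
        # Text that follows a <br> (garbage, bare newlines) is skipped here.
        if text.is_tail and parent.tag == "b" and text.strip():
            pairs.append(((parent.text or "").strip(), text.strip()))
    return pairs


pairs = extract_pairs(HTML)
print(pairs)
```

Each pair is built from a single node and its parent, so nothing is matched by list index, and text that follows a `<br>` or sits inside another element is ignored.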

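It is also worth mentioning that when you loop over selected elements and run an XPath query inside the for loop, you should write "./foo" rather than "/foo": a leading "/" makes the expression absolute, so it is evaluated from the document root instead of the current element.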
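
To illustrate the difference, here is a minimal sketch with lxml. It assumes a small sample in the shape of the question's document (the wrapping `<div>` is added here), and it loops over the `<b>` elements rather than the text nodes. The variable names `relative` and `absolute` are only for illustration.

```python
import lxml.html

# Small sample in the shape of the question's document; the wrapping <div>
# is an assumption added so that the fragment has one root element.
root = lxml.html.fromstring(
    "<div><b> label1 </b>\nvalue1 <br><b> label2 </b>\nvalue2 <br></div>"
)

relative, absolute = [], []
for b in root.xpath("//b"):
    # "./" starts at the current <b>, so each iteration sees its own next text node.
    relative.append(b.xpath("./following-sibling::text()[1]"))
    # "/" starts at the document root, which has no siblings, so nothing matches.
    absolute.append(b.xpath("/following-sibling::text()[1]"))

print(relative)
print(absolute)
```

The relative form yields one value per label, while the absolute form comes back empty on every iteration, which matches the "never returns any matches" symptom described in the question.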
