html - How to use XPath to extract text containing < that is not encoded as &lt;

Question

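I want to extract some text from an HTML page using Scrapy.

One of the elements contains a < character that is not encoded as &lt; (the page is not valid HTML).

For example: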

<div>
  years < 7
</div>

Using the XPath '//div/text()' (in Chrome or in Scrapy code), I can only extract 'years'.

Is there a way to get the full text, i.e. 'years < 7'?

score 1 · Accepted Answer

XPath operates on the DOM level, not on how things are encoded. XPath does not see whether entities were used for certain things or not. That is the DOM parser's business. So, if the DOM parser dropped < 7 because it could not make sense of it, then XPath won't see that part at all.

To get reliable results, fix the HTML by other means before applying XPath.
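One way to "fix the HTML by other means" is to escape stray < characters before the markup reaches the parser. The following is only a minimal sketch using the Python standard library: the helper name escape_stray_lt and the regular expression are assumptions for illustration, not part of Scrapy or lxml. The heuristic treats a < as stray when it is not followed by a letter, "/", "!" or "?", so it will miss cases such as "a <b" and it would also rewrite comparisons inside script blocks.

```python
import re
import xml.etree.ElementTree as ET

# Heuristic (an assumption, not a general HTML repair): a "<" that is not
# followed by a letter, "/", "!" or "?" cannot start a tag, a closing tag,
# a comment/doctype or a processing instruction.
STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(markup: str) -> str:
    """Replace every '<' that cannot open a tag with the entity '&lt;'."""
    return STRAY_LT.sub("&lt;", markup)


broken = "<div>\n  years < 7\n</div>"
fixed = escape_stray_lt(broken)

# Once repaired, this particular snippet is well-formed XML, so the
# standard-library XML parser is enough to check the result here.
# Real pages still need an HTML parser.
div = ET.fromstring(fixed)
text = div.text.strip()
print(text)  # years < 7
```

In a spider, the repaired string could then be handed to a selector built from text (for example scrapy.Selector(text=fixed)) instead of querying the response directly.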

score 0 · Accepted Answer

You can use a different module instead of the basic selector. For example, I use my own:

from lxml import etree
from lxml.html.clean import clean_html

import html5lib
from lxml.etree import XMLSyntaxError, XPathEvalError

def parse_user(self, response):
    # smarte_html_parser is the answerer's own module; its source is not shown here
    m = smarte_html_parser.dive_html_root_level(html=response.body)

From a title containing years < 7

I get years < 7
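Since the helper module above is not shown, here is a rough sketch of the same idea (switching to a more tolerant parser) using only Python's standard html.parser in place of the html5lib/lxml combination the answer imports. The class DivTextCollector and the function div_text are hypothetical names for this illustration. This parser passes a < that cannot start a tag through as ordinary text, so nothing is dropped. Note that it gives you parser callbacks, not XPath.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text found inside <div> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of <div> elements currently open
        self.chunks = []  # text pieces seen while inside a <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # A "<" that does not open a tag arrives here as plain data.
        if self.depth > 0:
            self.chunks.append(data)


def div_text(markup: str) -> str:
    """Return the concatenated, stripped text of all <div> elements."""
    collector = DivTextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks).strip()


result = div_text("<div>\n  years < 7\n</div>")
print(result)  # years < 7
```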
