python - Python XPath: find text() containing @domain

Question

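I'm looking for the right XPath expression to find every text() node in an HTML page that contains the string: @domain

From each match, I want to extract everything between the first whitespace to the left and the first whitespace to the right,

just to get the email address.

Thanks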

score 1 · Accepted Answer

This XPath query returns the text of every element whose text contains "@domain":

//*[contains(text(), '@domain')]/text()

Note that in XPath 1.0, contains(text(), '@domain') tests only the first text node child of each element. To select every matching text node directly, //text()[contains(., '@domain')] can be used instead.

You can then parse that text in Python to extract the email addresses:

>>> import re
>>> re.findall(r'[\w\.]+@domain\.[\w\.]+', 'this is our info: info@domain.co.uk')
['info@domain.co.uk']
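
An XPath query normally returns a list of strings rather than a single one, so the same regex has to be applied to each of them. The following is a minimal sketch of that step, not a definitive implementation: the `texts` list is made-up sample data standing in for whatever the XPath query returned, and `extract_emails` is a hypothetical helper name.

```python
import re

# Hypothetical sample data: stands in for the list of text nodes
# that the XPath query would return for a real page.
texts = [
    "this is our info: info@domain.co.uk",
    "Contact sales@domain.com or support@domain.com today",
    "no address here",
]

# Same idea as the pattern above: word characters or dots
# on both sides of the literal "@domain.".
EMAIL_RE = re.compile(r"[\w.]+@domain\.[\w.]+")


def extract_emails(strings):
    """Return the unique matches across all strings, in first-seen order."""
    found = []
    for text in strings:
        for match in EMAIL_RE.findall(text):
            if match not in found:
                found.append(match)
    return found


emails = extract_emails(texts)
print(emails)
```

Because the character class includes the dot, an address followed directly by a sentence-ending period would be captured with that period attached, so trimming trailing dots may be worth adding for real pages.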

Update:

It turns out that XPath selectors in Scrapy have an re method, which I didn't know about:

>>> hxs.select('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']
