python - lxml - Using a regular expression in findall() to find tags by attribute value

Question

I am trying to use lxml to get an array of comments that are formatted like this:

<div id="comment-1">
  TEXT
</div>

<div id="comment-2">
  TEXT
</div>

<div id="comment-3">
  TEXT
</div>
...

I tried using

html.findall(".//div[@id='comment-*']")

but this searches for a literal asterisk.

What is the correct syntax for what I am trying to do?

Edit: I finally got it working with

import lxml.html

doc = lxml.html.parse(url).getroot()
comment_array = doc.xpath('.//div[starts-with(@id, "comment-")]')

score 1 · Accepted Answer

You can use regular XPath functions to find the comments as you suggested:

comments = doc.xpath('.//div[starts-with(@id, "comment-")]')

But, for more complex matching, you could use regular expressions: with lxml, XPath supports regular expressions in the EXSLT namespace. See the official documentation Regular expressions in XPath.

Here is a demo:

from lxml import etree

content = """\
<body>
<div id="comment-1">
  TEXT
</div>

<div id="comment-2">
  TEXT
</div>

<div id="comment-3">
  TEXT
</div>

<div id="note-4">
  not matched
</div>
</body>
"""

doc = etree.XML(content)

# You must give the namespace to use EXSLT RegEx
REGEX_NS = "http://exslt.org/regular-expressions"

comments = doc.xpath(r'.//div[re:test(@id, "^comment-\d+$")]',
                          namespaces={'re': REGEX_NS})

To see the result, you can "dump" the matched nodes:

for comment in comments:
    print("---")
    etree.dump(comment)

You get:

---
<div id="comment-1">
  TEXT
</div>


---
<div id="comment-2">
  TEXT
</div>


---
<div id="comment-3">
  TEXT
</div>
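
Once the nodes are matched, a common next step is to pull the number and text out of each comment. The following is only a sketch of one way to do that, assuming the same div/comment-N markup as the demo. The helper name extract_comments and the sample string are hypothetical; etree.XPath is used to compile the expression once, which is optional.

```python
from lxml import etree

REGEX_NS = "http://exslt.org/regular-expressions"

# Compiling the expression once is optional; etree.XPath accepts the same
# namespaces mapping as the xpath() method.
find_comments = etree.XPath(
    r'.//div[re:test(@id, "^comment-\d+$")]',
    namespaces={"re": REGEX_NS},
)


def extract_comments(root):
    """Return (number, text) pairs for every matching div (hypothetical helper)."""
    pairs = []
    for div in find_comments(root):
        # The id is known to look like "comment-<digits>" at this point.
        number = int(div.get("id").split("-", 1)[1])
        text = (div.text or "").strip()
        pairs.append((number, text))
    return pairs


# Illustrative input only; "comment-x" is there to show a non-match.
sample = etree.XML(
    '<body><div id="comment-1"> A </div>'
    '<div id="comment-20"> B </div>'
    '<div id="comment-x"> C </div></body>'
)
print(extract_comments(sample))
```

The anchors ^ and $ matter here: without them, re:test would also accept ids that merely contain "comment-" followed by a digit.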

score 0 · Accepted Answer

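The path argument of html.findall accepts only a subset of XPath as its expression, and that subset does not include regular expressions.

To match with a regular expression, you have to use the EXSLT extension (available through xpath() rather than findall()), or you can use the XPath core functions.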
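
A minimal sketch of the difference, assuming markup like the one in the question (the sample string and variable names are illustrative only): findall() compares the attribute value literally, while xpath() can call core functions such as starts-with() and contains().

```python
from lxml import html

# Illustrative markup modeled on the question; not taken from a real page.
SAMPLE = """
<html><body>
<div id="comment-1">first</div>
<div id="comment-2">second</div>
<div id="note-3">other</div>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# findall() only understands the ElementPath subset: the value is compared
# literally, so 'comment-*' matches nothing in this sample.
literal = doc.findall(".//div[@id='comment-*']")

# xpath() supports the XPath 1.0 core function library.
by_prefix = doc.xpath('.//div[starts-with(@id, "comment-")]')
by_substring = doc.xpath('.//div[contains(@id, "-")]')

print(len(literal), len(by_prefix), len(by_substring))
```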

score 0 · Accepted Answer

I had a similar desire and did something that while I'm not terribly proud of, got the job done.

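def node_checker(node):
    # get() avoids a KeyError on nodes that have no id attribute
    return 'hurf-durf' in node.get('id', '')


# r is the parsed root element; the built-in filter() replaces
# Python 2's itertools.ifilter
for node in filter(node_checker, r.iterdescendants(tag='sometag')):
    print(node.tag)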

Not my finest work, but it got me close enough to getElementById with some flexibility that I was able to move on to another problem.
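
The same filtering idea can be combined with Python's re module, which gives full regular expressions without EXSLT. This is a sketch under assumptions: it uses the div/comment-N markup from the question, and the function name find_by_id_pattern is made up for illustration, not part of lxml.

```python
import re

from lxml import etree


def find_by_id_pattern(root, tag, pattern):
    """Yield descendants of root with the given tag whose id matches pattern.

    Hypothetical helper, not part of lxml.
    """
    compiled = re.compile(pattern)
    for node in root.iterdescendants(tag=tag):
        # get() returns '' for nodes without an id instead of raising KeyError
        if compiled.search(node.get("id", "")):
            yield node


# Illustrative input only.
root = etree.XML(
    '<body><div id="comment-1"/><div id="note-2"/>'
    '<p id="comment-3"/><div/></body>'
)
ids = [n.get("id") for n in find_by_id_pattern(root, "div", r"^comment-\d+$")]
print(ids)
```

Because the matching happens in Python, this walks every element with the given tag; for large documents the xpath() approaches are likely to be faster.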
