regex - groovy - Regular expression to retrieve inner HTML tags

Question

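I want to match the inner part of the string between span tags, where it is guaranteed that the id of the span tag starts with blk.

How can I match this with Groovy?

Example: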

<p>I wanted to try to <span id="blk1">match</span> the inner part of the string<span id="blk2"> between </span>the span tags <span>where</span> it is guaranteed that the id of this span tags <span id="blk3">starts</span> with blk.</p>

From the example above, I want:

   match
   between
   starts

I tried the following, but it returns null:

 def html='''<p>I wanted to try to <span id="blk1">match</span> the inner part of the string<span id="blk2"> between </span>the span tags <span>where</span> it is guaranteed that the id of this span tags <span id="blk3">starts</span> with blk.</p>''' 

 html=html.findAll(/<span id="blk(.)*">(.)*<\/span>/).join();
 println html;

score 4 · Accepted Answer

Rather than messing around with regular expressions, why not just parse the HTML and extract the nodes from it?

@Grab( 'net.sourceforge.nekohtml:nekohtml:1.9.18' )
import org.cyberneko.html.parsers.SAXParser

def html = '''<p>
             |  I wanted to try to <span id="blk1">match</span> the inner part
             |  of the string<span id="blk2"> between </span> the span tags <span>where</span>
             |  it is guaranteed that the id of this span tags <span id="blk3">starts</span>
             |  with blk.
             |</p>'''.stripMargin()

def content = new XmlSlurper( new SAXParser() ).parseText( html )

List<String> spans = content.'**'.findAll { it.name() == 'SPAN' && it.@id?.text()?.startsWith( 'blk' ) }*.text()
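If the markup is known to be well-formed XML (as the sample in the question happens to be), a similar lookup can be sketched without the extra dependency. This is only an illustration under that assumption, and it assumes Groovy 3 or later, where XmlSlurper is available in the groovy.xml package. The shortened sample string is made up for the example. Note that plain XmlSlurper keeps element names as written, so the check is against 'span' rather than the upper-cased 'SPAN' that NekoHTML reports.

```groovy
import groovy.xml.XmlSlurper   // assumption: Groovy 3+; older versions ship groovy.util.XmlSlurper

// Shortened, well-formed sample (hypothetical, modelled on the question's HTML)
def html = '<p>try to <span id="blk1">match</span> text<span id="blk2"> between </span>tags <span>where</span> the id <span id="blk3">starts</span> with blk.</p>'

def content = new XmlSlurper().parseText(html)

// Depth-first walk; keep only span elements whose id attribute starts with 'blk'
def matching = content.'**'.findAll { it.name() == 'span' && it.@id.text().startsWith('blk') }

// Extract the text of each node and strip surrounding whitespace
List<String> spans = matching.collect { it.text().trim() }

assert spans == ['match', 'between', 'starts']
```

Real-world HTML is often not well-formed, which is where a tolerant parser such as NekoHTML earns its place.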

score 3 · Accepted Answer

You seem to have span on one side and strong on the other.

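Also, be careful with .* on its own: because quantifiers are greedy by default, it will swallow as much of the string as it can in one go. You should usually make it lazy by using .*?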
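A small sketch of the difference between greedy and lazy matching, using a made-up two-span string rather than the question's full HTML:

```groovy
// Hypothetical sample with two matching spans
def sample = '<span id="blk1">match</span> and <span id="blk2">more</span>'

// Greedy: .* runs on to the LAST </span>, so both spans end up in one match
def greedy = sample.findAll(/<span id="blk\d+">.*<\/span>/)

// Lazy: .*? stops at the FIRST </span> it can reach, so each span matches on its own
def lazy = sample.findAll(/<span id="blk\d+">.*?<\/span>/)

assert greedy.size() == 1
assert greedy[0] == sample
assert lazy == ['<span id="blk1">match</span>', '<span id="blk2">more</span>']
```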

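When you use (.)* to match the text between the tags, you won't get the actual text from that group, only the last character matched; you need to put the quantifier inside the capturing group.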
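To illustrate the point about the quantifier position, here is a minimal sketch on a single made-up span, comparing what group 1 holds in each case:

```groovy
def sample = '<span id="blk1">match</span>'

// Quantifier OUTSIDE the group: the group is overwritten on every repetition,
// so only the last character it matched is left in it
def outside = (sample =~ /<span id="blk\d+">(.)*<\/span>/)
assert outside.find()
def lastCharOnly = outside.group(1)

// Quantifier INSIDE the group: the group spans the whole inner text
def inside = (sample =~ /<span id="blk\d+">(.*?)<\/span>/)
assert inside.find()
def fullText = inside.group(1)

assert lastCharOnly == 'h'
assert fullText == 'match'
```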

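Using [^<>]+ is a better way to match text between HTML tags. It is similar to .*, except for a few points:

It will match any character except "<" and ">".
It requires at least one character, so it will not match an empty span.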
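Both properties can be seen in a short sketch. The sample string is invented for the illustration: it has one span with plain text, one empty span, and one span containing a nested tag.

```groovy
// Hypothetical sample: plain text, empty span, span with a nested tag
def sample = '<span id="blk1">text</span><span id="blk2"></span><span id="blk3">a <b>b</b></span>'

// The closure form of findAll receives the whole match, then each capture group
def found = sample.findAll(/<span id="blk\d+">([^<>]+)<\/span>/) { whole, inner -> inner }

// blk2 is skipped because + needs at least one character;
// blk3 is skipped because [^<>] cannot cross the nested <b> tag
assert found == ['text']
```

The second property cuts both ways: spans with nested markup are silently ignored, which may or may not be what is wanted.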

Also, if you can be sure that what follows "blk" is always an integer, I suggest using \d+ to match it.

html=html.findAll(/<span id="blk\d+">([^<>]+)<\/span>/).join();

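That said, I have little experience with Groovy, but do you want to print a list containing the three words? The following regex will also extract the text from the HTML.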

html=html.findAll(/(?<=span id="blk\d">)([^<>]+)(?=<\/span>)/).join();
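If a list of the three words is wanted rather than one joined string, one possible sketch is to use the closure form of findAll with a capturing group and trim each result. It is run here against the HTML from the question. The variable name words is chosen only for this illustration.

```groovy
def html = '''<p>I wanted to try to <span id="blk1">match</span> the inner part of the string<span id="blk2"> between </span>the span tags <span>where</span> it is guaranteed that the id of this span tags <span id="blk3">starts</span> with blk.</p>'''

// The closure is called with the whole match followed by each capture group;
// its return value becomes the list element
List<String> words = html.findAll(/<span id="blk\d+">([^<>]+)<\/span>/) { whole, inner ->
    inner.trim()   // drop the spaces around " between "
}

assert words == ['match', 'between', 'starts']
println words
```

Using a capture group instead of lookarounds also allows \d+ here, whereas a lookbehind generally needs a bounded length in Java-based regex engines.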
