html - Using XmlSlurper: how to select sub-elements while iterating over a GPathResult

Question

I am writing an HTML parser that uses TagSoup to pass a well-formed structure to XmlSlurper.

Here is the generalised code:

def htmlText = """
<html>
<body>
<div id="divId" class="divclass">
<h2>Heading 2</h2>
<ol>
<li><h3><a class="box" href="#href1">href1 link text</a> <span>extra stuff</span></h3><address>Here is the address<span>Telephone number: <strong>telephone</strong></span></address></li>
<li><h3><a class="box" href="#href2">href2 link text</a> <span>extra stuff</span></h3><address>Here is another address<span>Another telephone: <strong>0845 1111111</strong></span></address></li>
</ol>
</div>
</body>
</html>
"""     

def html = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText( htmlText );

html.'**'.grep { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}

I expected each to hand me every "li" in turn, so that I could retrieve the corresponding href and address details. Instead, I get this output:

#href1#href2: Here is the addressTelephone number: telephoneHere is another addressAnother telephone: 0845 1111111

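I have checked various examples on the web, and they either deal with XML or are one-liners such as "retrieve all links from this file". It seems that the it.h3.a.@href expression is collecting all hrefs in the text, even though I am passing it a reference to the parent "li" node.

Can you tell me:

Why I am getting the output shown
How I can retrieve the href/address pair for each "li" item

Thanks.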

score 11 · Accepted Answer

Replace grep with find:

html.'**'.find { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}

Then you will get:

#href1: Here is the addressTelephone number: telephone

#href2: Here is another addressAnother telephone: 0845 1111111

grep returns an ArrayList, whereas find returns a NodeChild:

println html.'**'.grep { it.@class == 'divclass' }.getClass()
println html.'**'.find { it.@class == 'divclass' }.getClass()

The result is:

class java.util.ArrayList
class groovy.util.slurpersupport.NodeChild

So if you want to keep grep, you can nest another each like this to make it work:

html.'**'.grep { it.@class == 'divclass' }.ol.li.each {
    it.each { linkItem ->
        def link = linkItem.h3.a.@href
        def address = linkItem.address.text()
        println "$link: $address\n"
    }
}

Long story short: in your case, use find instead of grep.
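
The difference in return types can also be seen without any XML. The following is only a minimal sketch: the list of integers is a made-up stand-in for the parsed nodes, not part of the original code.

```groovy
// Assumption: plain integers stand in for parsed nodes; no XML parsing happens here.
def numbers = [1, 2, 3, 4]

// grep collects every match, so the result is always a List,
// even when only one element matches.
def allMatches = numbers.grep { it > 3 }
assert allMatches instanceof List
assert allMatches == [4]

// find stops at the first match and returns that element itself.
def firstMatch = numbers.find { it > 3 }
assert firstMatch == 4

// When nothing matches, grep yields an empty list and find yields null.
assert numbers.grep { it > 10 } == []
assert numbers.find { it > 10 } == null
```

Because grep wraps even a single match in a list, every property access that follows it is applied to the list rather than to the node.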

score 1 · Answer

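This was a tricky one. When there is only one element with class='divclass', the previous answer is certainly fine. If grep has multiple results, however, then find(), which returns a single result, is not the answer. Pointing out that the result is an ArrayList is correct. Inserting an outer nested .each() loop provides a GPathResult in the closure parameter div. From there, the drill-down can continue with the expected result.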

html."**".grep { it.@class == 'divclass' }.each { div -> div.ol.li.each { linkItem ->
   def link = linkItem.h3.a.@href
   def address = linkItem.address.text()
   println "$link: $address\n"
}}

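The behaviour of the original code deserves a bit more explanation as well. When you access a property on a List in Groovy, you get a new list (of the same size) containing that property of each element in the list. The list found by grep() has only one entry. Accessing the property ol then gives a list with one entry, which is fine. Next we get the result of ol.li for that entry. It is again a list with size() == 1, but this time its single entry has size() == 2. If we want to, we can apply the outer loop there and get the same result: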

html."**".grep { it.@class == 'divclass' }.ol.li.each { it.each { linkItem ->
   def link = linkItem.h3.a.@href
   def address = linkItem.address
   println "$link: $address\n"
}}

On any GPathResult that represents multiple nodes, we get the concatenation of all the text. That is the original result: first for @href, then for address.
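
The list behaviour described above can be reproduced with ordinary collections. This is only an illustrative sketch: the maps and the names divs, name and children are hypothetical stand-ins for the matched nodes, not XmlSlurper objects.

```groovy
// Assumption: plain maps stand in for the matched div nodes.
def divs = [[name: 'div', children: ['li-1', 'li-2']]]

// Property access on a List is applied to each element, and the results
// are gathered into a new list of the same size.
def names = divs.name
assert names == ['div']

def children = divs.children
assert children.size() == 1        // one entry per div...
assert children[0].size() == 2     // ...and that entry holds both items

// Iterating the outer list visits the single inner entry once,
// so a closure passed to each sees both items together.
def outerVisits = []
children.each { outerVisits << it }
assert outerVisits == [['li-1', 'li-2']]

// Nesting a second each reaches the individual items.
def innerVisits = []
children.each { group -> group.each { innerVisits << it } }
assert innerVisits == ['li-1', 'li-2']
```

In the original code, the single outer entry is a GPathResult covering both li nodes, which is why the closure ran once and printed the concatenated values.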

score 0 · Answer

I believe the previous answers are all correct at the time of writing, for the version used. But I am using HTTPBuilder 0.7.1 and Grails 2.4.4 with Groovy 2.3.7 and there is a big issue - HTML elements are transformed to uppercase. It appears this is due to NekoHTML used under the hood:

http://nekohtml.sourceforge.net/faq.html#uppercase

Because of this, the solution in the accepted answer must be written as:

html.'**'.find { it.@class == 'divclass' }.OL.LI.each { linkItem ->
    def link = linkItem.H3.A.@href
    def address = linkItem.ADDRESS.text()
    println "$link: $address\n"
}

This was very frustrating to debug, hope it helps someone.
