xml - Searching concatenated text across multiple XML nodes

Question

I have to search in "ordered" XML files in which the text I am looking for is split across several nodes, like this:

<root>
    <div id="1">Hello</div>
    <div id="2">Hel</div>
    <div id="3">lo dude</div>   
    <div id="4">H</div>
    <div id="5">el</div>
    <div id="6">lo</div>
</root>

The search has to be run on the concatenated text:

HelloHello dudeHello

But I also need to be able to retrieve the node attributes. For example, for a search for "ll", I want to get these nodes:

<div id="1">Hello</div>
<div id="2">Hel</div>
<div id="3">lo dude</div>   
<div id="5">el</div>
<div id="6">lo</div>

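Or at least the IDs.

Does anyone know how to do this in XPath, or in any other way?

I think this is a bit of a challenge, and I have no (simple) idea for now. Thanks for your help.

Edit: the fact that the text has to be concatenated before searching is key information and needed to be stated precisely.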

score 0 · Accepted Answer

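Your updated requirement makes the problem considerably more complex, because the "element wrapping" can occur at any point inside the search token and may even span multiple elements. I don't think you can write this query in XPath < 3.0 (if it can be done in XPath alone at all). I used XQuery for this, which extends XPath. The code runs fine in BaseX and should also run in other XQuery engines (it may require XQuery 3.0; I have not checked).

The code turned out rather complex; I think I have added enough comments to make it understandable. It requires each continuation of the text to be in the next sibling element, but with a few adjustments it could also traverse arbitrary XML structures (think of HTML with <span/>s and other markup).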

(: functx dependencies :)
declare namespace functx = "http://www.functx.com";
declare function functx:is-node-in-sequence 
  ( $node as node()? ,
    $seq as node()* )  as xs:boolean {

   some $nodeInSeq in $seq satisfies $nodeInSeq is $node
 } ;
declare function functx:distinct-nodes 
  ( $nodes as node()* )  as node()* {

    for $seq in (1 to count($nodes))
    return $nodes[$seq][not(functx:is-node-in-sequence(
                                .,$nodes[position() < $seq]))]
 } ;

declare function local:search( $elements as item()*, $pattern as xs:string) as item()* {
  functx:distinct-nodes(
    for $element in $elements
    return ($element[contains(./text(), $pattern)], local:start-search($element, $pattern))
  )
};

declare function local:start-search( $element as item(), $pattern as xs:string) as item()* {
    let $splits := (
      (: all possible prefixes of search token :)
      for $i in 1 to string-length($pattern) - 1
      (: check whether element text ends with prefix :)
      where ends-with($element/text(), substring($pattern, 1, $i))
      return $i
    )
    (: go on for all matching prefixes :)
    for $split in $splits
    return
      (: recursive call to next element :)
      let $continue := local:continue-search($element/following-sibling::*[1], substring($pattern, $split+1))
      where not(empty($continue))
      return ($element, $continue)
};

declare function local:continue-search( $element as item()*, $pattern as xs:string) as item()* {
  if (empty($element)) then () else
  (: case a) text node contains whole remaining token :)
  if (starts-with($element/text(), $pattern))
  then ($element)
  (: case b) text node is part of token :)
  else if (starts-with($pattern, $element/text()))
  then
    (: recursive call to next element :)
    let $continue := local:continue-search($element/following-sibling::*[1], substring($pattern, 1+string-length($element/text())))
    where not(empty($continue))
    return ($element, $continue)
  (: token not found :)
  else ()
};

let $token := 'll'
return local:search(//div, $token)
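For readers without an XQuery processor at hand, the same prefix-then-continue idea can be sketched in Python. This is only an illustrative port, not the code above: the names `search` and `continue_search`, the use of `xml.etree.ElementTree`, and the assumption that every `div` holds one plain text run with no child markup are choices made for this sketch.

```python
import xml.etree.ElementTree as ET

XML = """<root>
    <div id="1">Hello</div>
    <div id="2">Hel</div>
    <div id="3">lo dude</div>
    <div id="4">H</div>
    <div id="5">el</div>
    <div id="6">lo</div>
</root>"""


def continue_search(texts, i, pattern):
    """Return the indices (from i onwards) that complete `pattern`, or []."""
    if i >= len(texts):
        return []
    text = texts[i]
    # case a) this element contains the whole remaining token
    if text.startswith(pattern):
        return [i]
    # case b) this element is a middle piece of the token
    if pattern.startswith(text):
        rest = continue_search(texts, i + 1, pattern[len(text):])
        return [i] + rest if rest else []
    # token not found
    return []


def search(texts, pattern):
    """Return sorted indices of all elements taking part in a match."""
    hits = set()
    for i, text in enumerate(texts):
        # match fully inside one element
        if pattern in text:
            hits.add(i)
        # match starting here: element text ends with a proper prefix
        for k in range(1, len(pattern)):
            if text.endswith(pattern[:k]):
                rest = continue_search(texts, i + 1, pattern[k:])
                if rest:
                    hits.update([i] + rest)
    return sorted(hits)


divs = ET.fromstring(XML).findall("div")
texts = [d.text or "" for d in divs]
matched_ids = [divs[i].get("id") for i in search(texts, "ll")]
print(matched_ids)
```

On the sample document from the question this prints the ids 1, 2, 3, 5 and 6, which are the nodes the question asks for. A set takes the place of `functx:distinct-nodes` here, since indices rather than nodes are collected.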

score 0 · Accepted Answer

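In XPath 2 you can use tokenize to count how often the searched text occurs, and then test for each node whether the number of occurrences drops when that node is left out of the text. If the count drops, the node has to be part of the result. This is not that fast.

Assuming that only the text in direct children matters, as in the example, it looks like this: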

for $searched in "ll" 
return //*/ (for $matches in count(tokenize(string-join(*, ""), $searched)) - 1
             return *[$matches > count(tokenize(concat(" ",string-join(preceding-sibling::*, "")), $searched)) +
                                 count(tokenize(concat(" ",string-join(following-sibling::*, "")), $searched)) - 2])
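The counting idea may be easier to follow outside XPath. Below is a minimal Python sketch of it, under stated assumptions: the helper name `nodes_needed` is hypothetical, `str.count` (literal, non-overlapping matches) stands in for the `tokenize`-based count (which treats the search text as a regular expression), and only the direct text of each `div` is considered.

```python
import xml.etree.ElementTree as ET

XML = """<root>
    <div id="1">Hello</div>
    <div id="2">Hel</div>
    <div id="3">lo dude</div>
    <div id="4">H</div>
    <div id="5">el</div>
    <div id="6">lo</div>
</root>"""


def nodes_needed(texts, pattern):
    """Indices of elements whose removal lowers the number of matches."""
    total = "".join(texts).count(pattern)
    needed = []
    for i in range(len(texts)):
        # Count matches in the text before and after element i separately,
        # so that removing the element cannot create a new match.
        before = "".join(texts[:i]).count(pattern)
        after = "".join(texts[i + 1:]).count(pattern)
        if before + after < total:
            needed.append(i)
    return needed


divs = ET.fromstring(XML).findall("div")
texts = [d.text or "" for d in divs]
needed_ids = [divs[i].get("id") for i in nodes_needed(texts, "ll")]
print(needed_ids)
```

Each element triggers a fresh join and count over its siblings, so the work grows roughly quadratically with the number of siblings, which fits the remark that this approach is not that fast.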
