marklogic - MarkLogic: Getting word counts from an element word lexicon

Question

I have two sample XML files, shown below:

abc.xml

<data>
<text>i am a test user and doing testing here more and more. What are you doing?</text>
</data>

def.xml

<data>
<text>We are a doing nothing here you can say it time pass. what are you doing?</text>
</data>

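I have created an element word lexicon for the <text> element. I am interested in the following:

Getting all unique words across the whole database, along with their counts (the database contains only the two files above).
Getting all unique words for a given file.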

score 1 · Accepted Answer

1) For all unique words and the number of matching fragments:

for $w in cts:element-words(xs:QName('text'))
return 
element word {
    attribute count { 
      xdmp:estimate(cts:search(doc(), cts:word-query($w)))
    },
    $w }

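This should be fast, but to get the actual number of word occurrences rather than just the number of fragments, I think you would have to inspect every matching fragment, which could get very slow: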

sum(
  cts:search(doc(), cts:word-query($w))/cts:highlight(.,
    cts:word-query($w),<match/>)/count(//match)
  )
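The expression above leaves $w unbound when read on its own. A minimal sketch of how it might be combined with the lexicon loop from 1) follows; it keeps the same cts:word-query as the answer, and the `match` element name is only a marker for counting. It has not been run against a real database, so treat it as an illustration rather than a tuned query.

```xquery
xquery version "1.0-ml";

(: Sketch: one <word> element per lexicon entry, with the total number
   of occurrences across all matching documents. Assumes an element word
   lexicon exists on <text>. :)
for $w in cts:element-words(xs:QName('text'))
let $q := cts:word-query($w)
return element word {
    attribute count {
        sum(
            (: Wrap every hit in <match/>, then count the wrappers. :)
            for $doc in cts:search(doc(), $q)
            return count(cts:highlight($doc, $q, <match/>)//match)
        )
    },
    $w
}
```

Because this reads every matching document once per word, it is only practical for small databases such as the two-file example in the question.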

2) For all unique words in each file:

for $d in doc()
return element file {
    for $w in cts:element-words(xs:QName('text'), (), (),
        cts:document-query(xdmp:node-uri($d)))
    return element word { $w }
}

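If you have the URI lexicon enabled, you can optimize 2) further by iterating over cts:uris() instead of doc() and passing each URI, wrapped in cts:document-query(), as the fourth argument to cts:element-words(), rather than calling xdmp:node-uri on each document.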
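A minimal sketch of that URI-lexicon variant, assuming the URI lexicon is enabled on the database. The `uri` attribute on each `file` element is an addition for readability and is not part of the original answer.

```xquery
xquery version "1.0-ml";

(: Sketch: iterate URIs from the URI lexicon instead of loading each
   document, then restrict the word lexicon lookup to that one URI. :)
for $uri in cts:uris()
return element file {
    attribute uri { $uri },  (: hypothetical extra, for readability :)
    for $w in cts:element-words(xs:QName('text'), (), (),
        cts:document-query($uri))
    return element word { $w }
}
```

Since both cts:uris() and cts:element-words() are resolved from lexicons, this version should avoid fetching the documents themselves.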

score 1 · Accepted Answer

See http://docs.marklogic.com/guide/search-dev/lexicon#chapter

Answered 2012-10-09T16:48:45.820