4

Here's an easy one for the XPath experts! :)

Document structure:

<tokens>
  <token>
    <word>Newt</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>admires</word><entityType>VERB</entityType>
  </token>
  <token>
    <word>Garry</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
  </token>
</tokens>

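Ignoring the semantic improbability of the document, I want to pull out [["Newt", "Gingrich"], ["Garry", "Trudeau"]], i.e. whenever two consecutive tokens both have an entityType of PROPER_NOUN, I want to extract the words from those two tokens.

Here's how far I've gotten: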

"//token[entityType='PROPER_NOUN']/following-sibling::token[1][entityType='PROPER_NOUN']"

...which finds the second of two consecutive PROPER_NOUN tokens, but I'm not sure how to get it to emit the first token along with it.

Some notes:

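  • I don't mind doing higher-level processing on the NodeSet (e.g. in Ruby / Nokogiri) if that simplifies the problem.
  • If there are three or more consecutive PROPER_NOUN tokens (call them A, B, C), ideally I'd like to emit [A, B], [B, C].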
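
To make the wanted output concrete, here is a minimal pure-Ruby sketch of the pairing rule, including the three-in-a-row case. It does not parse XML: the `tokens` array is a hypothetical in-memory stand-in for the <token> elements, with each entry written as [word, entityType].

```ruby
# Hypothetical stand-in for the parsed <token> elements: [word, entityType].
tokens = [
  ["Newt", "PROPER_NOUN"],
  ["Gingrich", "PROPER_NOUN"],
  ["Rep", "PROPER_NOUN"],
  ["admires", "VERB"],
  ["Garry", "PROPER_NOUN"],
  ["Trudeau", "PROPER_NOUN"],
]

# Slide a window of two over the tokens and keep only the windows
# in which both tokens are proper nouns, then reduce each to its words.
pairs = tokens.each_cons(2)
              .select { |a, b| a[1] == "PROPER_NOUN" && b[1] == "PROPER_NOUN" }
              .map { |a, b| [a[0], b[0]] }

p pairs
```

Because the window advances one token at a time, a run A, B, C yields [A, B] and [B, C].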

Update

Here's my solution using higher-level Ruby functions. But I'm tired of all those XPath bullies kicking sand in my face, and I want to know how REAL XPath coders do it!

def extract(doc)
  names = []
  sentences = doc.xpath("//tokens")
  sentences.each do |sentence| 
    tokens = sentence.xpath("token")
    prev = nil
    tokens.each do |token|
      name = token.xpath("word").text if token.xpath("entityType").text == "PROPER_NOUN"
      names << [prev, name] if (name && prev)
      prev = name
    end
  end
  names
end
4 Answers

1

This XPath 1.0 expression:

   /*/token
      [entityType='PROPER_NOUN'
     and
       following-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word

selects all "first-in-pair noun-words".

This XPath expression:

/*/token
  [entityType='PROPER_NOUN'
 and
   preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
  ]
   /word

selects all "second-in-pair noun-words".

To produce the actual pairs, take the k-th node from each of the two resulting node-sets.

XSLT-based verification:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "/*/token
      [entityType='PROPER_NOUN'
     and
       following-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word
  "/>
==============
  <xsl:copy-of select=
   "/*/token
      [entityType='PROPER_NOUN'
     and
       preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word
  "/>
 </xsl:template>
</xsl:stylesheet>

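This transformation simply evaluates the two XPath expressions and outputs the results of the two evaluations (with a delimiter to show where the first result ends and the second begins).

When applied to the provided XML document: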

<tokens>
  <token>
    <word>Newt</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>admires</word><entityType>VERB</entityType>
  </token>
  <token>
    <word>Garry</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
  </token>
</tokens>

the output is:

<word>Newt</word>
<word>Garry</word>
==============
  <word>Gingrich</word>
<word>Trudeau</word>

and combining (zipping) the two results, which you would do in your favorite programming language, gives:

["Newt", "Gingrich"]

["Garry", "Trudeau"]

When the same transformation is applied to this XML document (note that we now have a triple):

<tokens>
  <token>
    <word>Newt</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Rep</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>admires</word><entityType>VERB</entityType>
  </token>
  <token>
    <word>Garry</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
  </token>
</tokens>

the result is now:

<word>Newt</word>
<word>Gingrich</word>
<word>Garry</word>
==============
  <word>Gingrich</word>
<word>Rep</word>
<word>Trudeau</word>

and zipping the two results produces the correct, wanted final result:

["Newt", "Gingrich"],

["Gingrich", "Rep"],

["Garry", "Trudeau"]
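
The zipping step might look like this in Ruby. This is only a sketch: the two arrays are copied by hand from the two result sets shown above for the triple example, standing in for the text of the nodes each XPath expression returns.

```ruby
# Words selected by the "first-in-pair" expression, in document order.
firsts  = ["Newt", "Gingrich", "Garry"]
# Words selected by the "second-in-pair" expression, in document order.
seconds = ["Gingrich", "Rep", "Trudeau"]

# Pair the k-th element of one list with the k-th element of the other.
pairs = firsts.zip(seconds)

p pairs
```

The two lists have the same length because every adjacent pair contributes exactly one first and one second word, so no pair is left without a partner.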

Note:

The wanted result can be produced with a single XPath 2.0 expression. Let me know if you are interested in an XPath 2.0 solution.

answered 2012-09-15T05:19:23.273
1

I'd do this in two steps. First step is to select a set of nodes:

//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]

This gives you all the tokens that start a 2-word pair. Then to get the actual pair, iterate over the node list and extract ./word and following-sibling::token[1]/word

Using XmlStarlet ( http://xmlstar.sourceforge.net/ - awesome tool for quick xml manipulation) the command line is

xml sel -t -m "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]" -v word -o "," -v "following-sibling::token[1]/word" -n /tmp/tok.xml 

giving

Newt,Gingrich
Garry,Trudeau

XmlStarlet will also compile that command line to xslt, the relevant bit is

  <xsl:for-each select="//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]">
    <xsl:value-of select="word"/>
    <xsl:value-of select="','"/>
    <xsl:value-of select="following-sibling::token[1]/word"/>
    <xsl:value-of select="'&#10;'"/>
  </xsl:for-each>

Using Nokogiri it could look something like:

#parse the document
doc = Nokogiri::XML(the_document_string)

#select all tokens that start 2-word pair
pair_starts = doc.xpath '//token[entityType = "PROPER_NOUN" and following-sibling::token[1][entityType = "PROPER_NOUN"]]'

#extract each word and the following one
result = pair_starts.each_with_object([]) do |node, array|
  array << [node.at_xpath('word').text, node.at_xpath('following-sibling::token[1]/word').text]
end
answered 2012-09-14T23:10:58.093
0

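XPath returns a node or a node-set, but not groups. So you have to identify the start of each group and then grab the rest.

# "next" is a reserved word in Ruby, so the two expressions get longer names
first_word = "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]/word"
next_word = "../following-sibling::token[1]/word"

doc.xpath(first_word).map{|word| [word.text, word.xpath(next_word).text] }

Output:

[["Newt", "Gingrich"], ["Garry", "Trudeau"]]
answered 2012-09-15T00:12:54.423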
0

XPath alone isn't really up to this task. But it's easy in XSLT (xsl:for-each-group requires XSLT 2.0):

<xsl:for-each-group select="token" group-adjacent="entityType">
  <xsl:if test="current-grouping-key() = 'PROPER_NOUN'">
     <xsl:copy-of select="current-group()"/>
     <xsl:text>====</xsl:text>
  </xsl:if>
</xsl:for-each-group>
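
For comparison, the same group-adjacent idea can be sketched in plain Ruby with Enumerable#chunk. This is an illustration rather than a translation of the stylesheet: the `Token` struct and the hand-built `tokens` list are hypothetical stand-ins for the parsed <token> elements. Grouping yields whole runs of adjacent proper nouns, so an extra each_cons(2) step is needed to turn a run of three into the pairs the question asks for.

```ruby
# Hypothetical stand-in for a parsed <token> element.
Token = Struct.new(:word, :entity_type)

tokens = [
  Token.new("Newt", "PROPER_NOUN"),
  Token.new("Gingrich", "PROPER_NOUN"),
  Token.new("Rep", "PROPER_NOUN"),
  Token.new("admires", "VERB"),
  Token.new("Garry", "PROPER_NOUN"),
  Token.new("Trudeau", "PROPER_NOUN"),
]

# chunk groups consecutive tokens that share the same entity type,
# much like group-adjacent; keep only the proper-noun runs.
runs = tokens.chunk { |t| t.entity_type }
             .select { |type, _group| type == "PROPER_NOUN" }
             .map { |_type, group| group.map(&:word) }

# Split each run into overlapping pairs of neighbours.
pairs = runs.flat_map { |run| run.each_cons(2).to_a }

p runs
p pairs
```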
answered 2012-09-15T18:20:16.657