
Document: http://en.wikiquote.org/wiki/The_Matrix

I want to get all the quotes of the first section (//ul/li), i.e. Neo's quotes.

I can't use //ul[1]/li, because on some Wikiquote pages the quotes are marked up like this:

<h2><span class="mw-headline" id="Neo">Neo</span></h2>  

<ul>
 <li> First quote </li>
</ul> 

<ul>
 <li> Second quote </li>
</ul> 

<h2><span class="mw-headline" id="dont wanna this">Useless</span></h2>  

instead of

<ul>
     <li> First quote </li>
     <li> Second quote </li>
</ul>

I tried this to get the first section:

(//*[@id='mw-content-text']/ul/preceding-sibling::h2/span[@class='mw-headline'])[1]

But I can't get only the first section's quotes. Can you help me?


3 Answers


Use:

(//h2[span/@id='Neo'])[1]/following-sibling::ul
  [count(.
        |
         (//h2[span/@id='Neo'])[1]
            /following-sibling::h2[1]
              /preceding-sibling::ul
         )
  =
   count((//h2[span/@id='Neo'])[1]
            /following-sibling::h2[1]
              /preceding-sibling::ul
         )
  ]
   /li

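This selects every li inside the ul elements that follow the first h2 whose span child has an id attribute with the value "Neo", stopping at the next h2.

To select the quotations for the second such h2, replace the index 1 in (//h2[span/@id='Neo'])[1] with 2 wherever it occurs in the expression above. The [1] in following-sibling::h2[1] stays unchanged.

Do this for every number: 1, 2, ..., count(//h2[span/@id='Neo'])

XSLT-based verification: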

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "(//h2[span/@id='Neo'])[1]/following-sibling::ul
      [count(.
            |
             (//h2[span/@id='Neo'])[1]
                /following-sibling::h2[1]
                  /preceding-sibling::ul
             )
      =
       count((//h2[span/@id='Neo'])[1]
                /following-sibling::h2[1]
                  /preceding-sibling::ul
             )
      ]
        /li

   "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document:

<html>
 <h2><span class="mw-headline" id="Neo">Neo</span></h2>

 <ul>
  <li> First quote </li>
 </ul>

 <ul>
  <li> Second quote </li>
 </ul>

 <h2><span class="mw-headline" id="dont wanna this">Useless</span></h2>
</html>

the XPath expression is evaluated and the selected nodes are copied to the output:

<li> First quote </li>
<li> Second quote </li>

Explanation:

This follows from the Kayessian formula (named after Dr. Michael Kay) for the intersection of two node-sets:

$ns1[count(.|$ns2) = count($ns2)]

The expression above selects exactly the nodes that belong to both node-set $ns1 and node-set $ns2.
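The intersection idea can also be sketched outside XPath. The following is a minimal Python illustration, not part of the original answer: the sample markup extends the question's snippet with an extra list that must be excluded, the names intersect and section_quotes are hypothetical, and the sketch assumes the h2 and ul elements are direct children of one parent.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, plus a trailing list that must be excluded.
DOC = """<html>
 <h2><span class="mw-headline" id="Neo">Neo</span></h2>
 <ul><li>First quote</li></ul>
 <ul><li>Second quote</li></ul>
 <h2><span class="mw-headline" id="Other">Useless</span></h2>
 <ul><li>Unwanted quote</li></ul>
</html>"""


def intersect(ns1, ns2):
    """Mimic $ns1[count(.|$ns2) = count($ns2)] using node identity."""
    ns2_ids = {id(node) for node in ns2}
    # A node of ns1 is kept only if adding it to ns2 does not grow ns2.
    return [node for node in ns1 if len(ns2_ids | {id(node)}) == len(ns2_ids)]


def section_quotes(doc=DOC, section_id="Neo"):
    children = list(ET.fromstring(doc))
    headings = [i for i, el in enumerate(children) if el.tag == "h2"]
    # Position of the first h2 whose span child carries the wanted id.
    start = next(i for i in headings
                 if children[i].find(f"span[@id='{section_id}']") is not None)
    # Position of the next h2. Assumption: if there is none, use the end of
    # the parent (the XPath expression would select nothing in that case).
    end = next((i for i in headings if i > start), len(children))
    ns1 = [el for el in children[start + 1:] if el.tag == "ul"]  # following ul siblings
    ns2 = [el for el in children[:end] if el.tag == "ul"]        # ul before the next h2
    return [li.text.strip() for ul in intersect(ns1, ns2) for li in ul.findall("li")]


print(section_quotes())  # ['First quote', 'Second quote']
```

The standard-library ElementTree module has no sibling axes, which is why the two node-sets are built from child positions here; an XPath 1.0 engine evaluates the original expression directly.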

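So we substitute for $ns1 the node-set of all ul following siblings of the h2 of interest, and for $ns2 the node-set of all ul preceding siblings of the h2 that is the immediate (first) following sibling h2 of the h2 of interest.

The intersection of these two node-sets contains exactly the wanted ul elements.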


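Update: In a comment the OP stated that he only knows he wants the results from the first section; the string "Neo" is not known in advance.

Here is the modified solution: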

(//h2[span/@id=$vSectionId])[1]
            /following-sibling::ul
  [count(.
        |
         (//h2[span/@id=$vSectionId])[1]
            /following-sibling::h2[1]
              /preceding-sibling::ul
         )
  =
   count((//h2[span/@id=$vSectionId])[1]
            /following-sibling::h2[1]
              /preceding-sibling::ul
         )
  ]
    /li

The variable $vSectionId must be obtained as the string value of the following XPath expression:

  substring(//div[h2='Contents']
              /following-sibling::ul[1]
                 /li[1]/a/@href,
            2)

Here we take the wanted id from the href of the a in the first table-of-contents entry, skipping its first character "#".
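The off-by-one between XPath's 1-based substring() and 0-based slicing is easy to get wrong, so here is a small Python sketch of the same step. The table-of-contents markup is a simplified, hypothetical stand-in (the real page may differ), and first_section_id is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified table of contents; the real page markup may differ.
TOC = """<div>
 <div><h2>Contents</h2></div>
 <ul>
  <li><a href="#Neo">1 Neo</a></li>
  <li><a href="#Morpheus">2 Morpheus</a></li>
 </ul>
</div>"""


def first_section_id(toc=TOC):
    root = ET.fromstring(toc)
    # find() returns the first match in document order: the first TOC entry.
    href = root.find("ul/li/a").get("href")
    # XPath substring(s, 2) starts at the second character (1-based),
    # which corresponds to the slice s[1:] in 0-based Python.
    return href[1:]


print(first_section_id())  # Neo
```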

Here again is an XSLT-based verification:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:variable name="vSectionId" select=
 "substring(//div[h2='Contents']
                      /following-sibling::ul[1]
                         /li[1]/a/@href,
                    2)
 "/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "(//h2[span/@id=$vSectionId])[1]
                /following-sibling::ul
      [count(.
            |
             (//h2[span/@id=$vSectionId])[1]
                /following-sibling::h2[1]
                  /preceding-sibling::ul
             )
      =
       count((//h2[span/@id=$vSectionId])[1]
                /following-sibling::h2[1]
                  /preceding-sibling::ul
             )
      ]
        /li

   "/>
 </xsl:template>
</xsl:stylesheet>

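When this transformation is applied to the complete XML document at http://en.wikiquote.org/wiki/The_Matrix, the result of applying the two XPath expressions (substituting the result of the first into the second, then evaluating the second) is the wanted, correct one: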

<li>I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.</li>
<li>Whoa.</li>
<li>I know kung-fu.</li>
<li>Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.</li>
<li>Guns.. lots of guns...</li>
<li>There is no spoon.</li>
<li>My name...is Neo!</li>
answered 2012-12-16T20:38:31.087

Using the API makes this easier to parse. Here is a query that extracts the first section:

http://en.wikiquote.org/w/api.php?action=parse&page=The_Matrix&section=1&prop=wikitext
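Hand-typed query strings are easy to corrupt, so one option is to build the URL from its parameters. This is a minimal Python sketch; section_url is a hypothetical helper, and format=xml is added on the assumption that XML output is wanted, as in the Ruby code later in this answer.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://en.wikiquote.org/w/api.php"


def section_url(page, section=1):
    # Parameters mirror the query in this answer; format=xml is an assumption.
    params = {
        "action": "parse",
        "page": page,
        "section": section,
        "prop": "wikitext",
        "format": "xml",
    }
    # urlencode joins the pairs with "&" and escapes unsafe characters.
    return API_ENDPOINT + "?" + urlencode(params)


print(section_url("The_Matrix"))
```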

Output:

<?xml version="1.0"?>
<api>
  <parse title="The Matrix">
    <wikitext xml:space="preserve">== Neo ==
[[File:The.Matrix.glmatrix.2.png|thumb|right|Unfortunately, no one can be ''told'' what The Matrix is. You have to see it for yourself.]]
[[Image:Arty spoon.jpg|thumb|right|Do not try to bend the spoon — that's impossible. Instead, only try to realize the truth: there is no spoon.]]

* I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.

* Whoa.
* I know kung-fu.

* Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.

* Guns.. lots of guns...

* There is no spoon. 

* My name...is Neo!</wikitext>
  </parse>
</api>

Here is one way to parse it (using HTTParty):

require 'httparty'

class Wikiquote
  include HTTParty
  base_uri 'en.wikiquote.org/w/'

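  # Returns the bulleted quotes from section 1 of the given page.
  def self.get_quotes(page)
    url = "/api.php?action=parse&page=#{page}&section=1&prop=wikitext&format=xml"
    # Send an identifying User-Agent with the request.
    headers = {"User-Agent" => "Wikiquote scraper 1.0"}
    # HTTParty parses the XML response into nested hashes.
    content = get(url, headers: headers)['api']['parse']['wikitext']['__content__']
    # Each quote is a wikitext line that starts with "* ".
    content.scan(/^\* (.*)$/).flatten
  end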
end

Usage:

Wikiquote.get_quotes("The_Matrix")

Output:

["I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change. I don't know the future. I didn't come here to tell you how this is going to end. I came here to tell you how it's going to begin. I'm going to hang up this phone, and then I'm going to show these people what you don't want them to see. I'm going to show them a world … without you. A world without rules and controls, without borders or boundaries; a world where anything is possible. Where we go from there is a choice I leave to you.",
 "Whoa.",
 "I know kung-fu.",
 "Yeah. Well, that sounds like a pretty good deal. But I think I may have a better one. How about, I give you the finger [He does] and you give me my phone call.",
 "Guns.. lots of guns...",
 "There is no spoon. ",
 "My name...is Neo!"]
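For readers not using Ruby, the extraction step can be sketched in Python as well. The wikitext below is a shortened, made-up sample in the same shape as the API output, and extract_quotes is an illustrative name; only the regular expression is taken from the Ruby code.

```python
import re

# Shortened sample in the same shape as the API output.
WIKITEXT = """== Neo ==
[[Image:Arty spoon.jpg|thumb|right|There is no spoon.]]

* Whoa.
* I know kung-fu.

* There is no spoon.
"""


def extract_quotes(wikitext):
    # re.MULTILINE makes ^ and $ match at every line boundary,
    # which is how ^ and $ always behave in Ruby.
    return re.findall(r"^\* (.*)$", wikitext, flags=re.MULTILINE)


print(extract_quotes(WIKITEXT))  # ['Whoa.', 'I know kung-fu.', 'There is no spoon.']
```

Lines that do not start with "* ", such as the heading and the image link, are skipped.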
answered 2012-12-18T23:53:54.027

I suggest //ul[preceding-sibling::h2[1][span/@id = 'Neo']]/li. Or, if the id attribute is not relevant to the search either, then based on the comments on the answer I think you want:

(//h2[span[contains(@class, 'mw-headline')]])[1]/following-sibling::ul
   [1 = count(preceding-sibling::h2[1] | (//h2[span[contains(@class, 'mw-headline')]])[1])]/li

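See "XPath axis, get all following nodes until" for an explanation. I hope I have managed to close all the brackets and parentheses properly; I don't have time to test it right now.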
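The test behind the first expression (keep a ul only if its nearest preceding h2 is the wanted one) can be checked with a short Python sketch. It is an illustration under assumptions, not a test of the XPath itself: the markup is the question's sample plus one extra list, quotes_under is a hypothetical name, and the h2 and ul elements are assumed to be siblings.

```python
import xml.etree.ElementTree as ET

DOC = """<html>
 <h2><span class="mw-headline" id="Neo">Neo</span></h2>
 <ul><li>First quote</li></ul>
 <ul><li>Second quote</li></ul>
 <h2><span class="mw-headline" id="Other">Useless</span></h2>
 <ul><li>Unwanted quote</li></ul>
</html>"""


def quotes_under(section_id, doc=DOC):
    quotes = []
    nearest_h2 = None  # plays the role of preceding-sibling::h2[1]
    for el in ET.fromstring(doc):
        if el.tag == "h2":
            nearest_h2 = el
        elif el.tag == "ul" and nearest_h2 is not None:
            # Keep the list only if the closest heading before it matches.
            if nearest_h2.find(f"span[@id='{section_id}']") is not None:
                quotes.extend(li.text.strip() for li in el.findall("li"))
    return quotes


print(quotes_under("Neo"))  # ['First quote', 'Second quote']
```

Like the XPath it mirrors, this matches every section whose heading has the given id, not only the first one.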

answered 2012-12-16T17:14:33.563