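I am trying to work out the exact XPath query expressions so that I can data-mine a site using RapidMiner. I need a query that isolates each of the following lines individually:

Wed 7/11/2012

TROLL

9999999999999

07.11.12

CONNOTE FILE LODGED

Tue 20/11/2012 1:12 PM

So far all I have is //td[@class='select']/text()

Note: the values will change, so the queries need to be position-specific.

What would the six separate queries be, one for each value?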

        <tr>
          <td class="select" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">
            Wed 7/11/2012<br>
            TROLL&nbsp;
            
          </td>
          <td class="select" align="center" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">
            9999999999999
            <br>07.11.12
            
            &nbsp;
          </td>
          <td class="select" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')">
             
              
              
                      
                CONNOTE FILE LODGED <br>
                Tue 20/11/2012 1:12 PM
              &nbsp;
            
            
            
&nbsp;
          </td>
          
        </tr>
      
    </table>

1 Answer


Tested using the Ruby library Nokogiri (which sits on top of libxml2 and implements XPath 1.0):

XPATHS = %w{
  //tr/td[1]/text()[1]
  //tr/td[1]/text()[2]
  //tr/td[2]/text()[1]
  //tr/td[2]/text()[2]
  //tr/td[3]/text()[1]
  //tr/td[3]/text()[2]
}

require 'nokogiri'
# `html` is assumed to hold the markup from the question as a String
d = Nokogiri::HTML(html)

XPATHS.each{ |expression| p d.at_xpath(expression).content }
#=> "\n            Wed 7/11/2012"
#=> "\n            TROLL\u00A0\n\n          "
#=> "\n            9999999999999\n            "
#=> "07.11.12\n\n            \u00A0\n          "
#=> "\n\n\n\n\n                CONNOTE FILE LODGED "
#=> "\n                Tue 20/11/2012 1:12 PM\n              \u00A0\n\n\n\n\u00A0\n          "

As you can see, the text nodes contain a lot of extra leading and trailing whitespace that you probably want to remove. We can strip it using normalize-space:

XPATHS = %w{
  normalize-space(//tr/td[1]/text()[1])
  normalize-space(//tr/td[1]/text()[2])
  normalize-space(//tr/td[2]/text()[1])
  normalize-space(//tr/td[2]/text()[2])
  normalize-space(//tr/td[3]/text()[1])
  normalize-space(//tr/td[3]/text()[2])
}

XPATHS.each{ |expression| p d.xpath(expression) }
#=> "Wed 7/11/2012"
#=> "TROLL\u00A0"
#=> "9999999999999"
#=> "07.11.12 \u00A0"
#=> "CONNOTE FILE LODGED"
#=> "Tue 20/11/2012 1:12 PM \u00A0 \u00A0"
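
The \u00A0 characters that remain in the output above come from &nbsp;. XPath 1.0's normalize-space only treats space, tab, carriage return and line feed as whitespace, so the non-breaking space survives. Here is a minimal sketch of two ways around that, assuming the helper names clean_xpath and clean_text (both hypothetical, not part of Nokogiri or RapidMiner): map U+00A0 to an ordinary space with translate() inside the query, or tidy the string in Ruby after extraction.

```ruby
# Hypothetical helper: wrap a node-selecting XPath so that U+00A0 (what
# &nbsp; becomes after parsing) is mapped to an ordinary space before
# normalize-space() trims and collapses the whitespace.
def clean_xpath(path)
  "normalize-space(translate(#{path}, '\u00A0', ' '))"
end

# Hypothetical helper: the same cleanup done in Ruby on a string that has
# already been extracted from the document.
def clean_text(str)
  str.tr("\u00A0", ' ').split.join(' ')
end

puts clean_xpath('//tr/td[1]/text()[2]')
p clean_text("\n            TROLL\u00A0\n\n          ")
#=> "TROLL"
```

With the document d from above, d.xpath(clean_xpath('//tr/td[1]/text()[2]')) should then give "TROLL" without the trailing \u00A0. Whether RapidMiner accepts a literal U+00A0 inside the expression is an assumption worth checking there.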
answered 2013-05-03T03:15:19.873