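I'm trying to parse a page using Nokogiri, Mechanize, and XPath, but no matter what I try, I get back an empty array.

The page I'm trying to parse.

I inspected it in Chrome and copied the XPath, then tried several ways of parsing it, but I always get an empty array.

I've tried: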

puts page.search('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr[2]/td[2]').inspect

puts  post_page.parser.xpath('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr[2]/td[2]').inspect

puts  post_page.parser.at_xpath('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr[2]/td[2]').inspect

All of them with and without a trailing "/text()".

This is the source of the page I'm trying to scrape:

<SCRIPT language="JavaScript">
<!-- 
document.cookie = "IV_JCT=%2FMPIS; path=/";
//--> 
</SCRIPT>

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>

  <head>
    <title>My Schedule</title>

    <meta http-equiv="pragma" content="no-cache">
    <meta http-equiv="cache-control" content="no-cache">
    <meta http-equiv="expires" content="-1">    
    <meta http-equiv="keywords" content="keyword1,keyword2,keyword3">
    <meta http-equiv="description" content="This is my schedule">
    <!--
    <link rel="stylesheet" type="text/css" href="styles.css">
    -->

  </head>

  <body>
  <div align="center"> 
    <strong>My Schedule</strong><br>as of Sun Feb 24 2013 06:43:09 PM CST<br><br>
    <div align="left"><pre><br>Employee Name: Johnson Appleseed    
Unit = 12345</pre>    
    <br> 
  </div>

    <table border="0" cellpadding="0" cellspacing="0" width="100%">
     <tr>
     <td colspan="8" align="center"><b><font size="+1">Schedules may be subject to change based on business needs or demand</font></b></td>
     </tr>   

      <tr><td>
      <table border="4" bordercolor="#2D73B9" cellpadding="2" cellspacing="2" width="100%">
      <tr bgcolor="#7C9BCF">
        <td width="12%" align="center"><b>Sunday</b></td>
        <td width="12%" align="center"><b>Monday</b></td>
        <td width="12%" align="center"><b>Tuesday</b></td>
        <td width="12%" align="center"><b>Wednesday</b></td>
        <td width="12%" align="center"><b>Thursday</b></td>
        <td width="12%" align="center"><b>Friday</b></td>
        <td width="12%" align="center"><b>Saturday</b></td>
        <td rowspan="2" width="12%" align="center"><b>Total weekly Hours</b></td>
      </tr>  

      <tr bgcolor="#7C9BCF">

        <td width="14%" align="center">2013-02-24</td>

        <td width="14%" align="center">2013-02-25</td>

        <td width="14%" align="center">2013-02-26</td>

        <td width="14%" align="center">2013-02-27</td>

        <td width="14%" align="center">2013-02-28</td>

        <td width="14%" align="center">2013-03-01</td>

        <td width="14%" align="center">2013-03-02</td>

      </tr>  

      <tr bgcolor="#FFFFFF">

        <td width="14%" align="left"><pre>&nbsp;</pre></td>

        <td width="14%" align="left"><pre><b>Shift: </b>
5:30 PM - 9:00 PM
<b>Meal:</b>
 - </pre></td>


        <td width="14%" align="left"><pre>&nbsp;</pre></td>

        <td width="14%" align="left"><pre>&nbsp;</pre></td>

        <td width="14%" align="left"><pre>&nbsp;</pre></td>

        <td width="14%" align="left"><pre><b>Shift: </b>
2:00 PM - 9:15 PM
<b>Meal:</b>
5:45 PM - 6:30 PM</pre></td>

        <td width="14%" align="left"><pre><b>Shift: </b>
4:45 PM - 9:15 PM
<b>Meal:</b>
 - </pre></td>

        <td width="12%" align="center">14.5</td> 
      </tr>  

      <tr bgcolor="#FFFFFF">

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">3.5</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">6.5</td>

        <td width="14%" align="center">4.5</td>

        <td width="14%" align="center">Daily Hours</td>   
      </tr>  

     </table>
     </td></tr>

      <tr><td>
      <table border="4" bordercolor="#2D73B9" cellpadding="2" cellspacing="2" width="100%">
      <tr bgcolor="#7C9BCF">
        <td width="12%" align="center"><b>Sunday</b></td>
        <td width="12%" align="center"><b>Monday</b></td>
        <td width="12%" align="center"><b>Tuesday</b></td>
        <td width="12%" align="center"><b>Wednesday</b></td>
        <td width="12%" align="center"><b>Thursday</b></td>
        <td width="12%" align="center"><b>Friday</b></td>
        <td width="12%" align="center"><b>Saturday</b></td>
        <td rowspan="2" width="12%" align="center"><b>Total weekly Hours</b></td>
      </tr>  

      <tr bgcolor="#7C9BCF">

        <td width="14%" align="center">2013-03-03</td>

        <td width="14%" align="center">2013-03-04</td>

        <td width="14%" align="center">2013-03-05</td>

        <td width="14%" align="center">2013-03-06</td>

        <td width="14%" align="center">2013-03-07</td>

        <td width="14%" align="center">2013-03-08</td>

        <td width="14%" align="center">2013-03-09</td>

      </tr>  

      <tr bgcolor="#FFFFFF">

        <td width="14%" align="left"><pre>Sched Not Posted</pre></td>

        <td width="14%" align="left"><pre>Sched Not Posted</pre></td>

        <td width="14%" align="left"><pre>Sched Not Posted</pre></td>

        <td width="14%" align="left"><pre>Sched Not Posted</pre></td>

        <td width="14%" align="left"><pre>Sched Not Posted</pre></td>

        <td width="14%" align="left"><pre>Sched Not Posted</pre></td>

        <td width="14%" align="left"><pre>Sched Not Posted</pre></td>

        <td width="12%" align="center">0.0</td> 
      </tr>  

      <tr bgcolor="#FFFFFF">

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">Daily Hours</td>   
      </tr>  

     </table>
     </td></tr>

      <tr><td>
      <table border="4" bordercolor="#2D73B9" cellpadding="2" cellspacing="2" width="100%">
      <tr bgcolor="#7C9BCF">
        <td width="12%" align="center"><b>Sunday</b></td>
        <td width="12%" align="center"><b>Monday</b></td>
        <td width="12%" align="center"><b>Tuesday</b></td>
        <td width="12%" align="center"><b>Wednesday</b></td>
        <td width="12%" align="center"><b>Thursday</b></td>
        <td width="12%" align="center"><b>Friday</b></td>
        <td width="12%" align="center"><b>Saturday</b></td>
        <td rowspan="2" width="12%" align="center"><b>Total weekly Hours</b></td>
      </tr>  

      <tr bgcolor="#7C9BCF">

        <td width="14%" align="center">2013-03-10</td>

        <td width="14%" align="center">2013-03-11</td>

        <td width="14%" align="center">2013-03-12</td>

        <td width="14%" align="center">2013-03-13</td>

        <td width="14%" align="center">2013-03-14</td>

        <td width="14%" align="center">2013-03-15</td>

        <td width="14%" align="center">2013-03-16</td>

      </tr>  

      <tr bgcolor="#FFFFFF">

        <td width="14%" align="left"><pre>Sched Not Posted</pre></td>

        <td width="14%" align="left"><pre>Sched Not Posted</pre></td>

        <td width="14%" align="left"><pre>Sched Not Posted</pre></td>

        <td width="14%" align="left"><pre>Sched Not Posted</pre></td>

        <td width="14%" align="left"><pre>Sched Not Posted</pre></td>

        <td width="14%" align="left"><pre>Sched Not Posted</pre></td>

        <td width="14%" align="left"><pre>Sched Not Posted</pre></td>

        <td width="12%" align="center">0.0</td> 
      </tr>  

      <tr bgcolor="#FFFFFF">

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">0.0</td>

        <td width="14%" align="center">Daily Hours</td>   
      </tr>  

     </table>
     </td></tr>

     <tr>
     <td colspan="8" align="center"><b><font size="+1">Schedules may be subject to change based on business needs or demand</font></b></td>
     </tr>
    </table >

        <p><br>
        </p>
        <p class="align_center" >      
            <input type=button value="Print this page" onClick="javascript:window.print();">     
            <input type=button value="Close This Window" onClick="javascript:window.close();">
        </p>

  </div>
  </body>

</html>

1 Answer

Notice that your XPath accessors require tbody to be part of the path:

puts page.search('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr[2]/td[2]').inspect
puts  post_page.parser.xpath('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr[2]/td[2]').inspect
puts  post_page.parser.at_xpath('/html/body/div/table/tbody/tr[2]/td/table/tbody/tr[2]/td[2]').inspect

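The HTML as served doesn't contain a tbody tag, so your lookup fails. Browsers insert tbody elements into tables when they build the DOM, which is why an XPath copied from Chrome includes them even though the source, which is what Nokogiri parses, does not.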

Try simplifying your accessor. I usually start with CSS selectors, which Nokogiri supports, and switch to XPath only if CSS can't get me there. Your mileage may vary.

For instance:

(rdb:1) puts doc.at('table table tr').to_html

Outputs:

<tr bgcolor="#7C9BCF">
<td width="12%" align="center"><b>Sunday</b></td>
        <td width="12%" align="center"><b>Monday</b></td>
        <td width="12%" align="center"><b>Tuesday</b></td>
        <td width="12%" align="center"><b>Wednesday</b></td>
        <td width="12%" align="center"><b>Thursday</b></td>
        <td width="12%" align="center"><b>Friday</b></td>
        <td width="12%" align="center"><b>Saturday</b></td>
        <td rowspan="2" width="12%" align="center"><b>Total weekly Hours</b></td>
        </tr>

That's a much simpler way of getting at the column headers.

To get at the second row you can use:

(rdb:1) puts doc.at('table table tr[2]').to_html

Which gets you:

<tr bgcolor="#7C9BCF">
<td width="14%" align="center">2013-02-24</td>
        <td width="14%" align="center">2013-02-25</td>
        <td width="14%" align="center">2013-02-26</td>
        <td width="14%" align="center">2013-02-27</td>
        <td width="14%" align="center">2013-02-28</td>
        <td width="14%" align="center">2013-03-01</td>
        <td width="14%" align="center">2013-03-02</td>
        </tr>

To get at the cell contents you can use:

(rdb:1) puts doc.search('table table tr[2] td').map(&:text)

Which returns:

2013-02-24
2013-02-25
2013-02-26
2013-02-27
2013-02-28
2013-03-01
2013-03-02
2013-03-03
2013-03-04
2013-03-05
2013-03-06
2013-03-07
2013-03-08
2013-03-09
2013-03-10
2013-03-11
2013-03-12
2013-03-13
2013-03-14
2013-03-15
2013-03-16

Notice how that returns the date rows for all three tables. To limit it to the first table we can use at instead of search. at returns only the first matching node, whereas search looks through the entire document and returns every match in a NodeSet, which behaves like an array.

This code finds the first table's second row, then walks the embedded cells:

(rdb:1) puts doc.at('table table tr[2]').search('td').map(&:text)
2013-02-24
2013-02-25
2013-02-26
2013-02-27
2013-02-28
2013-03-01
2013-03-02

It's much simpler, and easier to understand and maintain.

answered 2013-02-25T03:23:58.290