
I have the following code, which works well:

rows = diary_HTML.xpath('//*[@id="main"]/div[2]/table/tbody/tr')
food_diary = rows.collect do |row|
  detail = {}
  [
    ["Food", 'td[1]/text()'],   
    ["Calories", 'td[2]/text()'],
    ["Carbs", 'td[3]/text()'],
    ["Fat", 'td[4]/text()'],
    ["Protein", 'td[5]/text()'],
    ["Cholest", 'td[6]/text()'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end

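However, the "Food" td contains not only plain text but sometimes a link, and I want the text from that link as well.

I know I can use 'td[1]/a/text()' to get the link text, but how do I express "one or the other"?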

'td[1]/a/text()' or 'td[1]/text()'

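Edited - added a code snippet.

I'm trying to include every row: the first row, which contains <tr class="meal_header"> <td class="first alt">Breakfast</td>, and the other rows with their regular tds, while excluding td1 of the bottom row.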

<tr class="meal_header">
  <td class="first alt">Breakfast</td>
  <td class="alt">Calories</td>
  <td class="alt">Carbs</td>
  <td class="alt">Fat</td>
  <td class="alt">Protein</td>
  <td class="alt">Sodium</td>
  <td class="alt">Sugar</td>
</tr>
<tr>  
<td class="first alt">            
  <a onclick="showEditFood(3992385560);" href="#">Hovis (Uk - White Bread (40g) Toasted With Flora Light Marg, 2 slice</a> </td>
  <td>262</td>   
  <td>36</td>
  <td>9</td>
  <td>7</td>
  <td>0</td>
  <td>3</td>
</tr>
<tr class="bottom">
  <td class="first alt" style="z-index: 10">
    <a href="/food/add_to_diary?meal=0" class="add_food">Add Food</a>
    <div class="quick_tools">
    <a href="#quick_tools_0" class="toggle_diary_options">Quick Tools</a>
    <div id="quick_tools_0" class="quick_tools_options hidden">
    <ul>
      <li><a onclick="showLightbox(200, 250, '/food/quick_add?meal=0&amp;date=2013-04-15'); return false;">Quick add calories</a></li>
     <li><a href="/meal/new?meal=0">Remember meal</a></li>
     <li><a href="/food/copy_meal?date=2013-04-15&amp;from_date=2013-04-14&amp;meal=0&amp;username=nickwild1">Copy yesterday</a></li>  
     <li><a href="#recent_meals_0" class="toggle_diary_options">Copy from date</a></li>             
     <li><a href="#recent_meals_copy_to_0" class="toggle_diary_options">Copy to date</a></li>
    </ul>
    </div>
   <div id="recent_meals_0" class="recent_meal_options hidden">
    <ul id="recent_meal_options_0">
    <li class="header">Copy from which date?</li>        
    <li><a href="/food/copy_meal?date=2013-04-15&amp;from_date=2013-04-14&amp;meal=0&amp;username=nickwild1">Sunday, April 14</a></li>
    <li><a href="/food/copy_meal?date=2013-04-15&amp;from_date=2013-04-13&amp;meal=0&amp;username=nickwild1">Saturday, April 13</a></li>
    </ul>
    </div>
    </div>
  </td>
  <td>285</td>
  <td>39</td>
  <td>9</td>
  <td>10</td>
  <td>0</td>
  <td>3</td>
  <td></td>


2 Answers


The short answer is: use Nokogiri::XML::Element#text, which returns the text of the element plus that of its child elements (your a, for example).

You can also clean that code up quite a bit:

keys = ["Food", "Calories", "Carbs", "Fat", "Protein", "Cholest"]
food_diary = rows.collect do |row|
  Hash[keys.zip row.search('td').map(&:text)]
end

As a final tip, avoid using XPath with HTML; CSS selectors are nicer.

answered 2013-04-19T08:28:52.220

I think you can achieve this by changing the logic so that it looks at the element's content whenever the XPath does not explicitly extract text():

rows = diary_HTML.xpath('//*[@id="main"]/div[2]/table/tbody/tr')
food_diary = rows.collect do |row|
  detail = {}
  [
    ["Food", 'td[1]'],   
    ["Calories", 'td[2]/text()'],
    ["Carbs", 'td[3]/text()'],
    ["Fat", 'td[4]/text()'],
    ["Protein", 'td[5]/text()'],
    ["Cholest", 'td[6]/text()'],
  ].each do |name, xpath|
    if xpath.include?('/text()')
      detail[name] = row.at_xpath(xpath).to_s.strip
    else
      detail[name] = row.at_xpath(xpath).content.strip
    end
  end
  detail
end

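You could also add a symbol to each array entry describing how the data should be extracted, and use a case block to handle each item according to what the final step of the XPath needs.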

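Note that you could also do what you want by recursively walking the node structure returned by the XPath, but that seems like overkill if you just want to ignore markup, links, and so on.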

answered 2013-04-19T08:23:41.237