ruby-on-rails - Grabbing elements containing text with Nokogiri XPath

Question

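I am still learning how to use Nokogiri, and so far I can grab elements by CSS selector without trouble. There is a page I would like to scrape, http://www.bbc.co.uk/sport/football/results, and I want to get all the results for the Barclays Premier League. Those can be rendered by an Ajax call, but from what I have read that is not possible with Nokogiri.

The link I have provided has lots of results for all the different leagues, so I want to grab only those under the title Barclays Premier League, which is contained in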

class="competition-title"

So far I can get all the results like this:

def get_results # Get me all results
  doc = Nokogiri::HTML(open(RESULTS_URL))
  doc.css('#results-data h2').each do |h2_tag|
    date = Date.parse(h2_tag.text.strip).to_date
    # The element right after each date heading holds the matches.
    matches = h2_tag.xpath('following-sibling::*[1]').css('tr.report')
    matches.each do |match|
      home_team = match.css('.team-home').text.strip
      away_team = match.css('.team-away').text.strip
      score = match.css('.score').text.strip
      Result.create!(home_team: home_team, away_team: away_team, score: score, fixture_date: date)
    end
  end
end

Any help appreciated.

Edit

OK, it seems I could use some Ruby, using select? I am not sure how to implement it. Example below:

.select{|th|th.text =~ /Barclays Premier League/}

Or, from more reading, it seems XPath can be used:

matches = h2_tag.xpath('//th[contains(text(), "Barclays Premier League")]').css('tr.report')

or

matches = h2_tag.xpath('//b/a[contains(text(),"Barclays")]/../following-sibling::*[1]').css('tr.report')

I have tried the XPath way, but it is clearly wrong because nothing gets saved.

Thanks

score 2 · Accepted Answer

I prefer an approach where you drill down to exactly what you need. Looking at the source, you need the match details:

    <td class='match-details'>
        <p>
            <span class='team-home teams'><a href='...'>Brechin</a></span>
            <span class='score'><abbr title='Score'> 0-2 </abbr></span>
            <span class='team-away teams'><a href='...'>Alloa</a></span>
        </p>
    </td>

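You need the three text content items in the p element, and only for "Barclays Premier League".

Looking at the source, notice that the elements needed above happen to sit in a table that contains only the scores for that league. How convenient! The table can be identified by a <th> element containing "Barclays Premier League". Then all you have to do is identify that table using XPath: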

matches = doc.xpath('//table[.//th[contains(., "Barclays Premier League")]]//td/p')

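td/p should be sufficient, because the match-details cells are the only ones that contain a p, but you could add the class to the td if you like.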

Then you grab your information exactly as you did before:

matches.each do |match|
  home_team = match.css('.team-home').text.strip
  away_team = match.css('.team-away').text.strip
  score = match.css('.score').text.strip
  ...
end

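One task remains: getting the date of each match. Looking back at the source, you can walk up to the first containing table and see that the first h2 node preceding it holds the date. You can express this in XPath: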

date = match.at_xpath('ancestor::table[1]/preceding-sibling::h2[1]').text

Putting it all together:

def get_results    
  doc = Nokogiri::HTML(open(RESULTS_URL))
  matches = doc.xpath('//table[.//th[contains(., "Barclays Premier League")]]//td/p')
  matches.each do |match|
    home_team = match.css('.team-home').text.strip
    away_team = match.css('.team-away').text.strip
    score = match.css('.score').text.strip
    date = Date.parse(match.at_xpath('ancestor::table[1]/preceding-sibling::h2[1]').text).to_date
    Result.create!(home_team: home_team, away_team: away_team, score: score, fixture_date: date)
  end
end

score 1 · Accepted Answer

Just for fun, here is how I would transform @Mark Thomas's solution:

def get_results    
  doc = Nokogiri::HTML(open(RESULTS_URL))
  doc.search('h2.table-header').each do |h2|
    date = Date.parse(h2.text).to_date
    next unless h2.at('+ table th[2]').text['Barclays Premier League']
    h2.search('+ table tbody tr').each do |tr|
      home_team = tr.at('.team-home').text.strip
      away_team = tr.at('.team-away').text.strip
      score = tr.at('.score').text.strip
      Result.create!(home_team: home_team, away_team: away_team, score: score, fixture_date: date)
    end
  end
end

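By iterating over those h2s first, you get:

Pros:

The date is pulled out of the inner loop
Simpler expressions (you might not worry too much about these, but think of the guy who comes after you.)

Cons:

A few extra bytes of code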
