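I built a scraper that pulls all the information out of a Wikipedia table and uploads it to my database. Everything was fine until I realized I was pulling the wrong URL for the images: I want the actual image URL "http://upload.wikimedia.org/wikipedia/commons/thumb/3/38/Baconbutty.jpg", not the "/wiki/File:Baconbutty.jpg" link my code gives me. Here is my code so far: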

def initialize
  @url = "http://en.wikipedia.org/wiki/List_of_sandwiches"
  @nodes = Nokogiri::HTML(open(@url))  
end

def summary

  sammich_data = @nodes

  sammiches = sammich_data.css('div.mw-content-ltr table.wikitable tr') 
    sammich_data.search('sup').remove

    sammich_hashes = sammiches.map {|x| 

      if content = x.css('td')[0]
        name = content.text
      end
      if content = x.css('td a.image').map {|link| link ['href']}
        image =content[0]
      end
      if content = x.css('td')[2]
        origin = content.text
      end
      if content = x.css('td')[3]
        description =content.text
      end

My problem is with these lines:

if content = x.css('td a.image').map {|link| link ['href']}
            image =content[0]

If I use td a.image img instead, it just gives me a nil entry.

Any suggestions?

2 Answers

Here's how I'd do it (if I were going to scrape Wikipedia at all, which I wouldn't, because they do have an API for this kind of thing):

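require 'nokogiri'
require 'open-uri'
require 'pp'

# URI.open comes from open-uri; on Ruby 3.0+ a bare open() no longer fetches URLs.
doc = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/List_of_sandwiches"))

sammich_hashes = doc.css('table.wikitable tr').map { |tr|
  # Each row has four cells: name, image, origin, description (th cells in the header row).
  name, image, origin, description = tr.css('td,th')
  name, origin, description = [name, origin, description].map { |n| n&.text }
  # Read src from the <img>, not href from the <a>; rows without an image get nil.
  image = image&.at('img')&.[]('src')

  {
    name: name,
    origin: origin,
    description: description,
    image: image
  }
}

pp sammich_hashes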

Which outputs:

[
  {:name=>"Name", :origin=>"Origin", :description=>"Description", :image=>nil},
  {
    :name=>"Bacon",
    :origin=>"United Kingdom",
    :description=>"Often served with ketchup or brown sauce",
    :image=>"//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Baconbutty.jpg/120px-Baconbutty.jpg"
  },
  ... [lots removed] ...
  {
    :name=>"Zapiekanka",
    :origin=>"Poland",
    :description=>"A halved baguette or other bread usually topped with mushrooms and cheese, ham or other meats, and vegetables",
    :image=>"//upload.wikimedia.org/wikipedia/commons/thumb/1/12/Zapiekanka_3..jpg/120px-Zapiekanka_3..jpg"
  }
]

If an image isn't available, the field is set to nil in the returned hash.
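
The src values above are protocol-relative thumbnail URLs, while the question asks for a full image URL. A minimal sketch of one way to convert them, assuming the thumbnail layout seen in the output (".../thumb/<path>/<file>/<width>px-<file>") also holds for the other rows; full_image_url is a hypothetical helper name, not part of Nokogiri or of any Wikipedia API:

```ruby
# Hypothetical helper: turn a Wikimedia thumbnail src into an absolute URL
# for the original file. Assumes the ".../thumb/<path>/<file>/<size>-<file>"
# layout seen in the output above.
def full_image_url(thumb_src, scheme: 'https')
  return nil if thumb_src.nil? || thumb_src.empty?

  # Protocol-relative URLs ("//host/...") need a scheme to be usable on their own.
  url = thumb_src.start_with?('//') ? "#{scheme}:#{thumb_src}" : thumb_src

  # Drop the "/thumb" segment and the trailing "/<size>px-<file>" segment.
  # URLs without a "/thumb/" segment are returned unchanged.
  url.sub(%r{/thumb/(.+)/[^/]+\z}, '/\1')
end

puts full_image_url('//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Baconbutty.jpg/120px-Baconbutty.jpg')
# prints "https://upload.wikimedia.org/wikipedia/commons/3/38/Baconbutty.jpg"
```

To keep the thumbnail instead, only the scheme needs to be prepended; the regular expression step can be skipped.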

answered 2013-05-09T18:15:39.680

You can use the srcset attribute of the img element: split it and keep one of the available resized images.

if content = x.at_css('td a.image img')
  # Everything before " 1.5x," is the URL of the 1.5x-density image.
  image = content['srcset'].split(' 1.5x,').first
end
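
Splitting on the literal ' 1.5x,' only works when that exact descriptor is present. As a rough sketch, the whole srcset value can be parsed into a descriptor-to-URL hash instead; parse_srcset is a hypothetical helper, and it assumes candidates are separated by a comma followed by whitespace, which is a simplification of the full srcset grammar:

```ruby
# Hypothetical helper: parse a srcset string such as
# "//a/180px-X.jpg 1.5x, //a/240px-X.jpg 2x" into { "1.5x" => url, "2x" => url }.
# Assumes candidates are separated by a comma plus whitespace.
def parse_srcset(srcset)
  srcset.to_s.split(/,\s+/).each_with_object({}) do |candidate, urls|
    url, descriptor = candidate.strip.split(/\s+/, 2)
    # A candidate without a descriptor counts as 1x.
    urls[descriptor || '1x'] = url if url
  end
end

sources = parse_srcset('//a/180px-X.jpg 1.5x, //a/240px-X.jpg 2x')
puts sources['2x']
# prints "//a/240px-X.jpg"
```

A missing srcset attribute (nil) yields an empty hash, so the lookup returns nil rather than raising.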

answered 2013-05-09T17:49:04.093