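This is what I have so far. The problem is that it generates a JSON file that looks like the one shown below. When I inspect the code on the page, I can't see anything unique about the CSS selectors: they are all just tr td a. Any tips would be appreciated.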

Thanks!

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
require 'json'

sammiches = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/List_of_sandwiches"))

class Scraper
  def initialize
    @url = "http://en.wikipedia.org/wiki/List_of_sandwiches"
    # URI.open (from open-uri) is needed on Ruby 3.0+, where a bare open no longer accepts URLs
    @nodes = Nokogiri::HTML(URI.open(@url))
  end

  def summary(filename)
    sammich_data = @nodes

    sammiches = sammich_data.css('div.mw-content-ltr table.wikitable tr')

    sammich_hashes = sammiches.map { |x|
      # name, country, and description all use the same 'td a' selector
      name = x.css('td a').text
      image = x.css('td a.image').text
      country = x.css('td a').text
      description = x.css('td a').text

      {
        :name => name,
        :image => image,
        :country => country,
        :description => description,
      }
    }

    File.open("public/#{filename}", "w") do |f|
      f.write(JSON.pretty_generate(sammich_hashes))
    end
  end
end

sammy = Scraper.new
puts sammy.summary('listy')

Part of the JSON file output:

[
{
"name": "",
"image": "",
"country": "",
"description": ""
},
{
"name": "BaconUnited Kingdomketchupbrown sauce",
"image": "",
"country": "BaconUnited Kingdomketchupbrown sauce",
"description": "BaconUnited Kingdomketchupbrown sauce"
},
{
"name": "Bacon, egg and cheesebreakfast sandwich",
"image": "",
"country": "Bacon, egg and cheesebreakfast sandwich",
"description": "Bacon, egg and cheesebreakfast sandwich"

2 Answers

Just use the td index:

name = x.at('td[1]').text
country = x.at('td[3]').text

You may want to remove the citations first:

sammich_data.search('sup').remove
answered 2013-04-25T02:34:56.113

Rather than parsing Wikipedia's HTML, take advantage of their API, which will give you the data as XML, JSON, or other formats. It's cleaner and more reusable.

You can even get the HTML used to render the page, without all the borders and boxes.
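
As a rough sketch of what this could look like, the code below builds two request URLs: one for the MediaWiki action=parse API module and one for the index.php action=render view, which returns only the page body HTML. The helper names (parse_api_url, render_url, html_from_parse_response) are hypothetical, and the response shape read by html_from_parse_response (the HTML under parse → text → *) is assumed to be the API's default legacy JSON layout, so check it against a real response. No network request is made here.

```ruby
require 'uri'
require 'json'

API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'
INDEX_ENDPOINT = 'https://en.wikipedia.org/w/index.php'

# Hypothetical helper: URL asking the API for the parsed HTML of a page, as JSON.
def parse_api_url(page)
  params = { action: 'parse', page: page, prop: 'text', format: 'json' }
  "#{API_ENDPOINT}?#{URI.encode_www_form(params)}"
end

# Hypothetical helper: URL for the bare rendered page body (no site chrome).
def render_url(page)
  params = { title: page, action: 'render' }
  "#{INDEX_ENDPOINT}?#{URI.encode_www_form(params)}"
end

# Hypothetical helper: pulls the HTML string out of an action=parse response body.
# Assumes the legacy layout {"parse": {"text": {"*": "<html...>"}}}.
def html_from_parse_response(body)
  JSON.parse(body).dig('parse', 'text', '*')
end

puts parse_api_url('List_of_sandwiches')
puts render_url('List_of_sandwiches')
```

The HTML returned either way can be handed to Nokogiri::HTML exactly as in the question, so the row-extraction logic does not need to change.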

answered 2013-04-25T05:34:37.937