0

我有一个 Xpath 查询,它接受使用 Axslx 输出的数组元素,我需要为某些条件整理我的输出,其中之一是“包含的软件”

我的 xpath 抓取以下 URL http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1

我的代码示例如下:

clues = Array.new
clues << 'Optical drive'
clues << 'Pointing device'
clues << 'Software included'

selector = "//td[text()='%s']/following-sibling::td"

data = clues.map do |clue| 
         xpath = selector % clue
         [clue, doc.at(xpath).text.strip]
       end

Axlsx::Package.new do |p|
  p.workbook.add_worksheet do |sheet|
    data.each { |datum| sheet.add_row datum }
  end
  p.serialize 'output.xlsx'
end

我当前的输出格式

在此处输入图像描述

我想要的输出格式

在此处输入图像描述

4

1 回答 1

0

如果您可以始终使用“;”来依赖数据 对于分隔符,试试这个:

data = []
clues.each do |clue|
  xpath = selector % clue
  details = doc.at(xpath).text.strip.split(';')
  data << [clue, details.pop]
  details.each { |detail| data << ['', detail] }
end

在 Axlsx::Package.new 块之前生成数据

回答你的评论/问题:你用这样的东西来做;)

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'axlsx'

class Scraper

   def initialize(url, selector)
     @url = url
     @selector = selector
   end

   def hooks
     @hooks ||= {}
   end

   def add_hook(clue, p_roc)
     hooks[clue] = p_roc
   end

   def export(file_name)
     Scraper.clues.each do |clue|
       if detail = parse_clue(clue)
         output << [clue, detail.pop]
         detail.each { |datum| output << ['', datum] }
       end
     end
     serialize(file_name)
   end

   private

   def self.clues
     @clues ||= ['Operating system', 'Processors', 'Chipset', 'Memory type', 'Hard drive', 'Graphics',
                 'Ports', 'Webcam', 'Pointing device', 'Keyboard', 'Network interface', 'Chipset', 'Wireless',
                 'Power supply type', 'Energy efficiency', 'Weight', 'Minimum dimensions (W x D x H)',
                 'Warranty', 'Software included', 'Product color']
   end

   def doc
     @doc ||= begin 
                Nokogiri::HTML(open(@url))
              rescue
                raise ArgumentError, 'Invalid URL - Nothing to parse'
              end
   end

   def output
     @output ||= []
   end

   def selector_for_clue(clue)
     @selector % clue
   end

   def parse_clue(clue)
     if element = doc.at(selector_for_clue(clue))
       call_hook(clue, element) || element.inner_html.split('<br>').each(&:strip)
     end
   end

   def call_hook(clue, element)
     if hooks[clue].is_a? Proc
        value = hooks[clue].call(element)
        value.is_a?(Array) ? value : [value]
     end
   end

   def package
     @package ||= Axlsx::Package.new
   end

   def serialize(file_name)
     package.workbook.add_worksheet do |sheet|
       output.each { |datum| sheet.add_row datum }
     end
     package.serialize(file_name)
   end
end

scraper = Scraper.new("http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1", "//td[text()='%s']/following-sibling::td")

# define a custom action to take against any elements found.
os_parse = Proc.new do |element|
  element.inner_html.split('<br>').each(&:strip!).each(&:upcase!)
end

scraper.add_hook('Operating system', os_parse)

scraper.export('foo.xlsx')

最终的答案是……一颗宝石。

http://rubydoc.info/gems/ninja2k/0.0.2/frames

于 2012-08-02T00:30:31.170 回答