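ruby - How do I crawl the right way?

Question

For a month now I have been working with Nokogiri, REXML and Ruby. I have a huge database that I am trying to crawl. What I am scraping are HTML links and XML files.

There are exactly 43612 XML files that I want to crawl and store in a CSV file.

My script works when I crawl around 500 XML files, but a larger number takes far too long and the script freezes or something similar.

I have split the code into pieces here to make it easier to read. The whole script is here: https://gist.github.com/1981074

I am using two libraries because I could not find a way to do all of this in Nokogiri. Personally, I find REXML easier to use.

My question: how can I fix this so that crawling everything does not take a week? How do I make it run faster?

Here is my script:

Require the necessary libraries: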

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML

Create a bunch of arrays to store the scraped data:

@urls = Array.new 
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new

Get all the XML links from the spec site and store them in an array named @urls:

htmldoc = Nokogiri::HTML(open('http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI'))

htmldoc.xpath('//a/@href').each do |links|
  @urls << links.content
end

Loop through the @urls array and grab every element node I want with XPath:

@urls.each do |url|
  # Loop through the XML files and grab element nodes
  xmldoc = REXML::Document.new(open(url).read)
  # Root element
  root = xmldoc.root
  # Get the info id
  @ID << root.attributes["id"]
  # TitleSv
  xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]"){
    |e| m = e.text
        m = m.to_s
        next if m.empty?
        @titleSv << m
  }
end

Then store them in a CSV file:

CSV.open("eduction_normal.csv", "wb") do |row|
  (0..@ID.length - 1).each do |index|
    row << [@ID[index], @titleSv[index], @titleEn[index], @identifier[index], @typeOfLevel[index], @typeOfResponsibleBody[index], @courseTyp[index], @credits[index], @degree[index], @preAcademic[index], @subjectCodeVhs[index], @descriptionSv[index], @lastedited[index], @expires[index]]
  end
end

score 4 · Accepted Answer

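Because of the way the code is structured, it is hard to pinpoint the exact problem. Here are a few suggestions to improve the speed and structure of the program, so that it is easier to find what is holding you up.

Libraries

You are using a lot of libraries here that are probably not necessary.

You use both REXML and Nokogiri. They do the same job, except that Nokogiri is much better at it (benchmark).

Use hashes

Instead of storing the data at an index in 15 arrays, use a set of hashes.

For example,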

require 'set' # needed on older Rubies, where Set is not loaded by default

items = Set.new

doc.xpath('//a/@href').each do |url|
  item = {}
  item[:url] = url.content
  items << item
end

items.each do |item|
  xml = Nokogiri::XML(open(item[:url]))

  item[:id] = xml.root['id']
  ...
end

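Collect the data, then write to the file

Now that you have your items set, you can iterate over it and write to the file. This is much faster than doing it line by line.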
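As a rough sketch of that last step, the collected hashes could be written out in a single pass like this. The `write_items` helper and the `COLUMNS` list are hypothetical names introduced for illustration, and the column order is an assumption; the real script has more fields.

```ruby
require 'csv'

# Hypothetical column order; the original script has more fields than this.
COLUMNS = [:id, :title_sv, :title_en]

# Write every collected item in one pass, one CSV row per item.
# Keys missing from an item end up as empty fields.
def write_items(items, path, columns = COLUMNS)
  CSV.open(path, "wb") do |csv|
    items.each do |item|
      csv << columns.map { |key| item[key] }
    end
  end
end
```

Calling `write_items(items, "eduction_normal.csv")` once, after the crawl has finished, opens the file a single time instead of once per row.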

Keep it DRY

In your original code, you repeat the same thing a dozen times. Instead of copying and pasting, try to abstract out the common code.

xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]"){
    |e| m = e.text 
     m = m.to_s
     next if m.empty? 
     @titleSv << m
}

Move what is common into a method

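# Expects a REXML document; returns the last non-empty text matched by path,
# or an empty string if nothing matches.
def get_value(xml, path)
  str = ''
  xml.elements.each(path) do |e|
    text = e.text.to_s
    next if text.empty? # skip empty nodes instead of overwriting the result
    str = text
  end

  str
end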

And move anything that does not change into another hash

xml_paths = {
  :title_sv => "/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]",
  :title_en => "/educationInfo/titles/title[2] | /ns:educationInfo/ns:titles/ns:title[2]",
  ...
}

Now you can combine these techniques to make much cleaner code

item[:title_sv] = get_value(xml, xml_paths[:title_sv])
item[:title_en] = get_value(xml, xml_paths[:title_en])
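Taken one step further, the path hash can drive the extraction, so that adding a field means adding one hash entry. The following is only a sketch under some assumptions: `build_item` is a hypothetical helper, the paths are simplified (no namespace alternatives), and the sample XML is made up rather than taken from the real feed.

```ruby
require 'rexml/document'

# Simplified, hypothetical paths; the real ones also carry ns: alternatives.
XML_PATHS = {
  :title_sv => "/educationInfo/titles/title[1]",
  :title_en => "/educationInfo/titles/title[2]"
}

# Returns the last non-empty text matched by path, or '' if nothing matches.
def get_value(xml, path)
  str = ''
  xml.elements.each(path) do |e|
    text = e.text.to_s
    str = text unless text.empty?
  end
  str
end

# Build one item hash from a parsed REXML document by looping over the paths.
def build_item(xml, paths = XML_PATHS)
  item = { :id => xml.root.attributes["id"] }
  paths.each { |key, path| item[key] = get_value(xml, path) }
  item
end

# Made-up sample document, only for illustration.
sample = '<educationInfo id="i.example.1"><titles>' \
         '<title>Matematik</title><title>Mathematics</title>' \
         '</titles></educationInfo>'
item = build_item(REXML::Document.new(sample))
```

Here `item` holds the id plus one entry per key in `XML_PATHS`, ready to be added to the items set.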

I hope this helps!

score 2 · Accepted Answer

It won't work without your fixes. I believe you should refactor your parsing code as @Ian Bishop said.

require 'rubygems'
require 'pioneer'
require 'nokogiri'
require 'rexml/document'
require 'csv'

class Links < Pioneer::Base
  include REXML
  def locations
    ["http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI"]
  end

  def processing(req)
    doc = Nokogiri::HTML(req.response.response)
    doc.xpath('//a/@href').map do |links|
      links.content
    end
  end
end

class Crawler < Pioneer::Base
  include REXML
  def locations
    Links.new.start.flatten
  end

  def processing(req)
    xmldoc = REXML::Document.new(req.response.response)
    root = xmldoc.root
    id = root.attributes["id"]
    xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
      title = e.text.to_s
      CSV.open("eduction_normal.csv", "a") do |f|
        f << [id, title ...]
      end
    end
  end
end

Crawler.start
# or you can run 100 concurrent processes
Crawler.start(concurrency: 100)

score 1 · Accepted Answer

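If you really want to speed this up, you will have to go concurrent.

One of the simplest ways is to install JRuby and then run your application with one small modification: install either the 'peach' or the 'pmap' gem, then change your items.each to items.peach(n) (parallel each), where n is the number of threads. You need at least one thread per CPU core, but if you put I/O in your loop you will want more.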
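For readers who would rather not add a gem, the same idea can be sketched with only Thread and Queue from the standard library. This is a swapped-in technique, not the peach or pmap API: `parallel_each` is a hypothetical helper, and the file names below are made up. On MRI, threads still help here because the interpreter lock is released while a thread waits on network I/O.

```ruby
# A minimal parallel each built on Thread and Queue (standard library only).
# Hypothetical helper for illustration; items must not be nil or false.
def parallel_each(items, thread_count)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # no more work will be added

  workers = Array.new(thread_count) do
    Thread.new do
      # pop returns nil once the closed queue is empty, which ends the loop
      while (item = queue.pop)
        yield item
      end
    end
  end
  workers.each(&:join) # wait for all workers; re-raises a worker's exception
end

# Stand-in for the real download: just record each name and its length.
urls = %w[a.xml b.xml c.xml d.xml e.xml] # made-up names
lengths = Queue.new                      # Queue is safe to share between threads
parallel_each(urls, 3) { |url| lengths << [url, url.length] }
```

In the crawler, the block would fetch and parse one URL, and the results would be collected in a thread-safe structure such as a Queue before being written to the CSV file.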

Also, use Nokogiri; it is much faster. If there is a specific problem you need to solve with Nokogiri, ask a separate Nokogiri question. I am sure it can do what you need.
