ruby - Parsing multiple files at once with Nokogiri/RSpec

Question

I have a main parse method that is composed of other methods, each of which parses part of an HTML file:

class Parser
  def self.parse(html)
    @data = Nokogiri.HTML(open(html))
    merged_hashes = {}

    array_of_hashes = [
      parse_title,
      parse_description,
      parse_related
    ]
    array_of_hashes.inject(merged_hashes,:update)

    return merged_hashes
  end

  def self.parse_title
    title_hash = {}

    title = @data.at_css('.featureProductInfo a')
    return title_hash if title.nil?
    title_hash[:title] = @data.at_css('.featureProductInfo a').text

    title_hash
  end
  .
  .
  .

So this is what I do in RSpec:

require File.dirname(__FILE__) + '/parser.rb'

def html_starcraft
  File.open("amazon_starcraft.html")
end

describe ".parse_title (StarCraft)" do
  let(:title_hash) { Parser.parse html_starcraft } 

  it "scrapes the featured product title" do
    expect(title_hash[:title]).to eq("StarCraft II: Wings of Liberty (Bradygames Signature Guides)")
  end
end

As you can see, I only parse one file at a time. How can I parse several at once? Say, all the files in a folder?

score 2 · Accepted Answer

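As @theTinMan pointed out, Nokogiri only handles one file at a time. If you want to parse all the files in a folder, you have to read the folder (again, as @theTinMan pointed out) and spawn a process or thread for each file.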
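Before adding any concurrency, the "read the folder" part can be sketched as a plain loop. This is only a minimal sketch: `parse_folder` and its `pattern` argument are hypothetical names, not part of Nokogiri or of the question's code, and the block you pass in stands in for whatever parsing you need.

```ruby
# Hypothetical helper: run a block over every regular file in a folder,
# one file at a time, and collect the results by file name.
def parse_folder(dir, pattern = "*.html")
  paths = Dir.glob(File.join(dir, pattern)).sort
  paths.each_with_object({}) do |path, results|
    next unless File.file?(path) # skip directories that match the pattern
    results[File.basename(path)] = yield(path)
  end
end
```

With the question's class, this could presumably be called as `parse_folder("pages") { |path| Parser.parse(path) }`, where `"pages"` is an assumed folder name.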

Of course, you first need to understand how fork works or what a thread is.
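To make the difference concrete, here is a small sketch, assuming a platform where `fork` is available (Linux or macOS, not Windows). The method names are made up for illustration. A thread shares memory with the code that started it, while a forked child works on its own copy.

```ruby
# A thread runs inside the same process, so it changes the same variable.
def count_with_thread
  counter = 0
  Thread.new { counter += 1 }.join
  counter
end

# A forked child gets a copy of the parent's memory, so its change
# to `counter` is never seen by the parent.
def count_with_fork
  counter = 0
  pid = fork { counter += 1 } # the block runs only in the child
  Process.wait(pid)           # wait for that child to exit
  counter
end
```

`count_with_thread` returns 1 and `count_with_fork` returns 0. That is why a forked child cannot simply fill in a hash that lives in the parent.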

Example using processes

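OK, let's use processes, because MRI Ruby's threads are held back by the global interpreter lock and cannot run Ruby code in parallel: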

files = Dir.glob("files/**")

files.each do |file|
  # Here the program becomes two:
  # the child executes the block, the parent continues the loop
  fork do
    puts File.read(file)
  end
end

# Wait for all child processes to finish before continuing;
# otherwise the main program could exit while its children
# are still running.
Process.waitall
puts "All done. closing."

And the output:

$ ls files/
a.txt  b.txt  c.txt  d.txt
$ ruby script.rb 
Content of a.txt
Content of b.txt
Content of d.txt
Content of c.txt
All done. closing.

Note that because this runs concurrently, the order in which the files are read can change each time you run the program.
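If you need the parsed data back in the parent (for example, to assert on it in a spec) rather than just printed, one option is to give each child a pipe and send the result through it. The following is a rough sketch, not a definitive implementation: `parallel_map` is a hypothetical helper, it assumes `fork` is available and that each result can be serialized with `Marshal`, and it does not handle a child that raises.

```ruby
# Hypothetical helper: run the block for each path in its own child
# process and return a hash of path => result in the parent.
def parallel_map(paths)
  readers = paths.map do |path|
    reader, writer = IO.pipe
    reader.binmode
    writer.binmode
    fork do
      reader.close                            # the child only writes
      writer.write(Marshal.dump(yield(path))) # send the result to the parent
      writer.close
    end
    writer.close                              # the parent only reads
    [path, reader]
  end

  # Read the results in the original order, whatever order the children ran in.
  results = readers.each_with_object({}) do |(path, reader), out|
    out[path] = Marshal.load(reader.read)
    reader.close
  end
  Process.waitall # reap every child before returning
  results
end
```

Because the results are keyed by path, the nondeterministic execution order no longer matters. Something like `parallel_map(Dir.glob("files/**")) { |file| Parser.parse(file) }` would be the assumed usage with the question's class. Keep in mind this starts one process per file, so for a very large folder you would probably want to work in batches.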
