ruby - Parsing large XML files with Ruby and Nokogiri

Question

I have a large XML file (about 10K rows) that I need to parse regularly. It is in this format:

<summarysection>
    <totalcount>10000</totalcount>
</summarysection>
<items>
     <item>
         <cat>Category</cat>
         <name>Name 1</name>
         <value>Val 1</value>
     </item>
     ...... 10,000 more times
</items>

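What I'd like to do is parse each individual node with Nokogiri to count the number of items in one category. Then I'd like to subtract that number from the totalcount value to get output that reads "Count of Interest_Category: n, Count of All Else: z".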

This is my code now:

#!/usr/bin/ruby

require 'rubygems'
require 'nokogiri'
require 'open-uri'

icount = 0 
xmlfeed = Nokogiri::XML(open("/path/to/file/all.xml"))
all_items = xmlfeed.xpath("//items")

  all_items.each do |adv|
            if (adv.children.filter("cat").first.child.inner_text.include? "partofcatname")
                icount = icount + 1
            end
  end

othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount 

puts icount
puts othercount

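This seems to work, but it is very slow! I'm talking more than 10 minutes for 10,000 items. Is there a better way to do this? Am I doing something in a less than optimal way?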

score 26 · Accepted Answer

Here's an example comparing a SAX parser count with a DOM-based count, counting 500,000 <item> elements that each belong to one of seven categories. First, the output:

Create XML file: 1.7s
Count via SAX: 12.9s
Create DOM: 1.6s
Count via DOM: 2.5s

Both techniques produce the same hash, counting the number of items seen in each category:

{"Cats"=>71423, "Llamas"=>71290, "Pigs"=>71730, "Sheep"=>71491, "Dogs"=>71331, "Cows"=>71536, "Hogs"=>71199}

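The SAX version takes 12.9s to count and categorize, while the DOM version takes only 1.6s to create the DOM elements and another 2.5s to find and categorize all the <cat> values. At 4.1s in total, the DOM version is around 3x as fast!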

...but that's not the whole story. We have to look at RAM usage as well.

For 500,000 items, SAX (12.9s) peaked at 238MB of RAM; DOM (4.1s) peaked at 1.0GB.
For 1,000,000 items, SAX (25.5s) peaked at 243MB of RAM; DOM (8.1s) peaked at 2.0GB.
For 2,000,000 items, SAX (55.1s) peaked at 250MB of RAM; DOM (???) peaked at 3.2GB.

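I had enough memory on my machine to handle 1,000,000 items, but at 2,000,000 I ran out of RAM and had to start using virtual memory. Even with an SSD and a fast machine, I let the DOM code run for almost ten minutes before finally killing it.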

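The long times you are reporting are very likely because you are running out of RAM and hitting the disk continuously as part of virtual memory. If you can fit the DOM into memory, use it, as it is FAST. If you can't, however, you really have to use the SAX version.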

Here's the test code:

require 'nokogiri'

CATEGORIES = %w[ Cats Dogs Hogs Cows Sheep Pigs Llamas ]
ITEM_COUNT = 500_000

def test!
  create_xml
  sleep 2; GC.start # Time to read memory before cleaning the slate
  test_sax
  sleep 2; GC.start # Time to read memory before cleaning the slate
  test_dom
end

def time(label)
  t1 = Time.now
  yield.tap{ puts "%s: %.1fs" % [ label, Time.now-t1 ] }
end

def test_sax
  item_counts = time("Count via SAX") do
    counter = CategoryCounter.new
    # Use parse_file so we can stream data from disk instead of flooding RAM
    Nokogiri::XML::SAX::Parser.new(counter).parse_file('tmp.xml')
    counter.category_counts
  end
  # p item_counts
end

def test_dom
  doc = time("Create DOM"){ File.open('tmp.xml','r'){ |f| Nokogiri.XML(f) } }
  counts = time("Count via DOM") do
    counts = Hash.new(0)
    doc.xpath('//cat').each do |cat|
      counts[cat.children[0].content] += 1
    end
    counts
  end
  # p counts
end

class CategoryCounter < Nokogiri::XML::SAX::Document
  attr_reader :category_counts
  def initialize
    @category_counts = Hash.new(0)
  end
  def start_element(name,att=nil)
    @count = name=='cat'
  end
  def characters(str)
    if @count
      @category_counts[str] += 1
      @count = false
    end
  end
end

def create_xml
  time("Create XML file") do
    File.open('tmp.xml','w') do |f|
      f << "<root>
      <summarysection><totalcount>10000</totalcount></summarysection>
      <items>
      #{
        ITEM_COUNT.times.map{ |i|
          "<item>
            <cat>#{CATEGORIES.sample}</cat>
            <name>Name #{i}</name>
            <value>Value #{i}</value>
          </item>"
        }.join("\n")
      }
      </items>
      </root>"
    end
  end
end

test! if __FILE__ == $0

How does the DOM counting work?

If we strip away some of the test structure, the DOM-based counter looks like this:

# Open the file on disk and pass it to Nokogiri so that it can stream read;
# Better than  doc = Nokogiri.XML(IO.read('tmp.xml'))
# which requires us to load a huge string into memory just to parse it
doc = File.open('tmp.xml','r'){ |f| Nokogiri.XML(f) }

# Create a hash with default '0' values for any 'missing' keys
counts = Hash.new(0) 

# Find every `<cat>` element in the document (assumes one per <item>)
doc.xpath('//cat').each do |cat|
  # Get the child text node's content and use it as the key to the hash
  counts[cat.children[0].content] += 1
end

How does the SAX counting work?

First, let's focus on this code:

class CategoryCounter < Nokogiri::XML::SAX::Document
  attr_reader :category_counts
  def initialize
    @category_counts = Hash.new(0)
  end
  def start_element(name,att=nil)
    @count = name=='cat'
  end
  def characters(str)
    if @count
      @category_counts[str] += 1
      @count = false
    end
  end
end

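When we create a new instance of this class, we get an object with a Hash that defaults to 0 for every key, plus a couple of methods that can be called on it. The SAX parser will call these methods as it runs through the document.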

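Each time the SAX parser sees a new element, it calls the start_element method on this class. When that happens, we set a flag based on whether the element is named "cat" (so that we can pick up its text later).
Each time the SAX parser slurps up a chunk of text, it calls the characters method of our object. When that happens, we check whether the last element we saw was a category (i.e. whether @count was set to true); if so, we use the value of this text node as the category name and add one to our counter.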

To use our custom object with Nokogiri's SAX parser, we do this:

# Create a new instance, with its empty hash
counter = CategoryCounter.new

# Create a new parser that will call methods on our object, and then
# use `parse_file` so that it streams data from disk instead of flooding RAM
Nokogiri::XML::SAX::Parser.new(counter).parse_file('tmp.xml')

# Once that's done, we can get the hash of category counts back from our object
counts = counter.category_counts
p counts["Pigs"]

score 4 · Accepted Answer

I'd recommend using a SAX parser rather than a DOM parser for a file this large. Nokogiri has a nice SAX parser built in: http://nokogiri.org/Nokogiri/XML/SAX.html

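The SAX way of doing things suits large files simply because it doesn't build a giant DOM tree, which in your case is overkill; you can build up your own structures as events fire (for counting nodes, for example).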

score 3 · Accepted Answer

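You can dramatically decrease your execution time by changing your code to the following. Just change the "99" to whatever category you want to check: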

require 'rubygems'
require 'nokogiri'
require 'open-uri'

icount = 0 
xmlfeed = Nokogiri::XML(open("test.xml"))
items = xmlfeed.xpath("//item")
items.each do |item|
  text = item.children.children.first.text  
  if ( text =~ /99/ )
    icount += 1
  end
end

othercount = xmlfeed.xpath("//totalcount").inner_text.to_i - icount 

puts icount
puts othercount

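This took about three seconds on my machine. I think a key error you made was that you chose to iterate over "items" instead of creating a collection of the "item" nodes. That made your iteration code awkward and slow.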

score 0 · Accepted Answer

You may want to try this out: https://github.com/amolpujari/reading-huge-xml

HugeXML.read xml, elements_lookup do |element|
  # => element{ :name, :value, :attributes}
end

I have also tried using ox.

score 0 · Accepted Answer

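Check out Greg Weber's version of Paul Dix's sax-machine gem: http://blog.gregweber.info/posts/2011-06-03-high-performance-rb-part1

Parsing large file with SaxMachine seems to be loading the whole file into memory

sax-machine makes the code much simpler; Greg's variant makes it stream.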
