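This is a benchmark against a real log file. Of the methods used to read the file, only the one using foreach is scalable, because it avoids slurping the file (reading all of it into memory at once).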
Using lazy adds overhead, resulting in slower times than map alone.
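As a rough illustration of where lazy can pay off, here is a minimal sketch that chains it onto the enumerator returned by File.foreach instead of onto an array that File.readlines has already built. The helper name first_replaced and the sample data are made up for this example and are not part of the benchmark. Because the source is streamed, the block only runs for the lines that are actually requested.

```ruby
require 'tempfile'

# Hypothetical helper: transform only the first n lines of a file.
# File.foreach without a block returns an Enumerator, so lazy can stop
# reading as soon as n results have been produced.
def first_replaced(path, n)
  seen = 0
  result = File.foreach(path).lazy.map { |line|
    seen += 1                 # count how many lines were actually processed
    line.sub('foo', 'bar')
  }.first(n)
  [result, seen]
end

# Small sample file so the sketch runs on its own.
sample = Tempfile.new('lazy_sample')
sample.write("foo 1\nfoo 2\nfoo 3\nfoo 4\n")
sample.close

lines, processed = first_replaced(sample.path, 2)
puts lines
puts "lines processed: #{ processed }"
```

In the benchmark every line is needed, so there is nothing for lazy to skip and only its bookkeeping cost remains.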
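Notice that foreach is right up there as far as processing speed goes, and it results in a scalable solution. Ruby won't care if the file is a zillion lines or a zillion TB; it still sees only a single line at a time. See "Why is "slurping" a file not a good practice?" for some related information about reading files.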
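A minimal sketch of that line-at-a-time behavior, assuming a small made-up log file (the helper name count_matches and the sample contents are only for illustration):

```ruby
require 'tempfile'

# Hypothetical helper: count the lines matching a pattern without ever
# holding more than the current line in memory.
def count_matches(path, pattern)
  count = 0
  File.foreach(path) do |line|
    count += 1 if line =~ pattern
  end
  count
end

# Small sample log so the sketch runs on its own.
sample = Tempfile.new('sample_log')
sample.write("ok start\nerror disk full\nok done\nerror timeout\n")
sample.close

puts count_matches(sample.path, /\Aerror\b/)
```

The same method works unchanged on a file far larger than available memory, since only one line is alive per iteration.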
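People often gravitate to something that pulls in the entire file at once and then splits it into parts. That ignores the work Ruby then has to do to rebuild the array based on line ends, using split or something similar. That adds up, and is why I think foreach pulls ahead.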
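To make the comparison concrete, here is a small sketch of the same computation done both ways. The helper names and the sample text are invented for this example. Both return the same answer, but the slurping version builds the full string, then an array of every line, before doing any real work.

```ruby
require 'tempfile'

# Hypothetical helper: slurp the whole file, split it, then scan the array.
def longest_line_slurped(path)
  File.read(path).split("\n").map(&:length).max || 0
end

# Hypothetical helper: stream the file and keep only a running maximum.
def longest_line_streamed(path)
  longest = 0
  File.foreach(path) do |line|
    length = line.chomp.length
    longest = length if length > longest
  end
  longest
end

# Small sample file so the sketch runs on its own.
sample = Tempfile.new('longest_sample')
sample.write("a\nlonger line\nmid\n")
sample.close

puts longest_line_slurped(sample.path)
puts longest_line_streamed(sample.path)
```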
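Also notice that the results shift a little between the two benchmark runs. This is probably due to system tasks running on my Mac Pro while the jobs were running. The important thing is that the runs show the speed difference is a wash, which confirms to me that using foreach is the right way to process big files, because it's not going to kill the machine if the input file exceeds available memory.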
require 'benchmark'

REGEX = /\bfoo\z/
LOG = 'debug.log'
N = 1

# File.readlines: "Reads the entire file specified by name as individual
# lines, and returns those lines in an array."
#
# Because every line is loaded into an array before lazy is applied, this
# isn't scalable. It will work if Ruby has enough memory to hold the array
# plus all other variables and its overhead.
def lazy_map(filename)
  File.open("lazy_map.out", 'w') do |fo|
    fo.puts File.readlines(filename).lazy.map { |li|
      li.gsub(REGEX, 'bar')
    }.force
  end
end

# File.readlines: "Reads the entire file specified by name as individual
# lines, and returns those lines in an array."
#
# Because the whole file is read into an array, and map then builds a second
# array of the same size, this isn't scalable. It will work if Ruby has enough
# memory to hold both arrays plus all other variables and its overhead.
def map(filename)
  File.open("map.out", 'w') do |fo|
    fo.puts File.readlines(filename).map { |li|
      li.gsub(REGEX, 'bar')
    }
  end
end

# "Reads the entire file specified by name as individual lines, and returns
# those lines in an array."
#
# As a result of returning all the lines in an array this isn't scalable. It
# will work if Ruby has enough memory to hold the array plus all other
# variables and its overhead.
def readlines(filename)
  File.open("readlines.out", 'w') do |fo|
    File.readlines(filename).each do |li|
      fo.puts li.gsub(REGEX, 'bar')
    end
  end
end

# This is completely scalable because no file slurping is involved.
# "Executes the block for every line in the named I/O port..."
#
# It isn't always the fastest, but it works reliably.
def foreach(filename)
  File.open("foreach.out", 'w') do |fo|
    File.foreach(filename) do |li|
      fo.puts li.gsub(REGEX, 'bar')
    end
  end
end

puts "Ruby version: #{ RUBY_VERSION }"
puts "log bytes: #{ File.size(LOG) }"
puts "log lines: #{ `wc -l #{ LOG }`.to_i }"

2.times do
  Benchmark.bm(13) do |b|
    b.report('lazy_map') { lazy_map(LOG) }
    b.report('map') { map(LOG) }
    b.report('readlines') { readlines(LOG) }
    b.report('foreach') { foreach(LOG) }
  end
end

%w[lazy_map map readlines foreach].each do |s|
  puts `wc #{ s }.out`
end

Which results in:

Ruby version: 2.0.0
log bytes: 733978797
log lines: 5540058
                    user     system      total        real
lazy_map       35.010000   4.120000  39.130000 ( 43.688429)
map            29.510000   7.440000  36.950000 ( 43.544893)
readlines      28.750000   9.860000  38.610000 ( 43.578684)
foreach        25.380000   4.120000  29.500000 ( 35.414149)
                    user     system      total        real
lazy_map       32.350000   9.000000  41.350000 ( 51.567903)
map            24.740000   3.410000  28.150000 ( 32.540841)
readlines      24.490000   7.330000  31.820000 ( 37.873325)
foreach        26.460000   2.540000  29.000000 ( 33.599926)
 5540058 83892946 733978797 lazy_map.out
 5540058 83892946 733978797 map.out
 5540058 83892946 733978797 readlines.out
 5540058 83892946 733978797 foreach.out
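
The use of gsub is innocuous since every method uses it, but it isn't needed; it was added only for a bit of frivolous resistive loading, so that each method has some per-line work to do.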