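What you are asking for won't actually help you.
First, to jump to line n of a file, you have to read everything before it in order to count the newlines. For example: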
$ ruby -e '(1..10000000).each { |i| puts "This is line number #{i}"}' > large_file.txt
$ du -h large_file.txt
266M large_file.txt
$ purge # mac os x command - clears any in memory disk caches in use
$ time sed -n -e "5000000p; 5000000q" large_file.txt
This is line number 5000000
sed -n -e "5000000p; 5000000q" large_file.txt 0.52s user 0.13s system 28% cpu 2.305 total
$ time sed -n -e "5000000p; 5000000q" large_file.txt
This is line number 5000000
sed -n -e "5000000p; 5000000q" large_file.txt 0.49s user 0.05s system 99% cpu 0.542 total
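Note that the sed command is not instantaneous: it has to read through the initial part of the file to work out where line 5,000,000 is. That is why the second run is so much faster for me: by then my computer had cached the file in memory.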
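To make the same point in Ruby, here is a minimal sketch of a line lookup. The helper name `nth_line` is mine, not something from a library. It assumes a plain text file with newline-terminated lines, and it has no choice but to stream the file from the beginning until it has counted n lines:

```ruby
# Hypothetical helper: return the n-th line (1-based) of the file at `path`,
# or nil if the file has fewer than n lines.
# File.foreach streams the file from the start, so the cost grows with n.
def nth_line(path, n)
  File.foreach(path).with_index(1) do |line, lineno|
    # Stop as soon as the requested line is reached; later lines are never read.
    return line.chomp if lineno == n
  end
  nil
end
```

Something like `nth_line('large_file.txt', 5_000_000)` still reads the first five million lines, just as sed does.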
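Even if you did pull this off (by splitting the file manually), you would get poor IO performance if you constantly jump between different parts of one or more files to read the next line.
A better approach is to process every nth line on a separate thread (or process). That lets you use multiple CPU cores while keeping good IO performance. It is easy to do with the parallel library.
Example use (my machine has 4 cores):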
$ ruby -e '(1..100000).each { |i| puts "This is line number #{i}"}' > large_file.txt # use a smaller file to speed up the tests
$ time ruby -r parallel -e "Parallel.each(File.open('large_file.txt').each_line, in_processes: 4) { |line| puts line if (line * 10000) =~ /9999/ }"
This is line number 9999
This is line number 19999
This is line number 29999
This is line number 39999
This is line number 49999
This is line number 59999
This is line number 69999
This is line number 79999
This is line number 89999
This is line number 99990
This is line number 99991
This is line number 99992
This is line number 99993
This is line number 99994
This is line number 99995
This is line number 99996
This is line number 99997
This is line number 99999
This is line number 99998
ruby -r parallel -e 55.84s user 10.73s system 400% cpu 16.613 total
$ time ruby -r parallel -e "Parallel.each(File.open('large_file.txt').each_line, in_processes: 1) { |line| puts line if (line * 10000) =~ /9999/ }"
This is line number 9999
This is line number 19999
This is line number 29999
This is line number 39999
This is line number 49999
This is line number 59999
This is line number 69999
This is line number 79999
This is line number 89999
This is line number 99990
This is line number 99991
This is line number 99992
This is line number 99993
This is line number 99994
This is line number 99995
This is line number 99996
This is line number 99997
This is line number 99998
This is line number 99999
ruby -r parallel -e 47.04s user 7.46s system 97% cpu 55.738 total
The 4-process run finished in 29.81% of the time of the single-process run (16.6 s versus 55.7 s), which makes it about 3.4 times as fast.
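If you cannot install the parallel gem, the "every nth line on its own process" idea can be sketched with only the standard library. This is not how the gem works internally. It is a rough illustration that assumes a platform where `fork` is available (so not Windows), and `parallel_select_lines` is a made-up name. Each worker streams the whole file front to back, so its reads stay sequential, and it only does the expensive work on its own share of the lines:

```ruby
# Hypothetical helper: fork `workers` processes. Worker k handles lines
# k, k + workers, k + 2 * workers, ... (0-based) and sends back the lines
# for which the block returns a truthy value.
def parallel_select_lines(path, workers)
  readers = Array.new(workers) do |k|
    reader, writer = IO.pipe
    fork do
      reader.close
      File.foreach(path).with_index do |line, i|
        next unless i % workers == k     # not this worker's line
        writer.write(line) if yield(line)
      end
      writer.close
      exit!(0)                           # skip at_exit handlers inherited from the parent
    end
    writer.close                         # the parent keeps only the read end
    reader
  end

  # Drain the pipes one after another; results come back grouped by worker,
  # not in file order.
  results = readers.flat_map do |reader|
    lines = reader.readlines(chomp: true)
    reader.close
    lines
  end
  Process.waitall
  results
end
```

A call such as `parallel_select_lines('large_file.txt', 4) { |line| (line * 10000) =~ /9999/ }` would mirror the test above. Because the parent drains one pipe at a time, a worker that produces a lot of output may stall until its turn, so treat this as a starting point rather than a tuned implementation.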