I have to fetch a specific line from a big file (1,500,000 lines), many times over in a loop across multiple files, and I was asking myself what the best option would be (in terms of performance). There are many ways to do this; I mostly use these two:
cat ${file} | head -1
or
cat ${file} | sed -n '1p'
I could not find an answer to this: do both fetch only the first line, or does one of the two (or both) first open the whole file and then fetch line 1?
Drop the useless use of cat and do:
$ sed -n '1{p;q}' file
This will quit the sed script after the line has been printed.
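A quick, hedged illustration of the early-quit behavior (the test file is generated here just for the demo):

seq 1 1500000 > bigfile       # build a 1,500,000-line test file
sed -n '1{p;q}' bigfile       # prints "1" and exits right after the first line
head -n 1 bigfile             # same result; head also stops after one line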
Benchmarking script:
#!/bin/bash
TIMEFORMAT='%3R'
n=25
heading=('head -1 file' 'sed -n 1p file' "sed -n '1{p;q}' file" 'read line < file && echo $line')

# files up to a hundred million lines (if you're on a slow machine, decrease!!)
for (( j=1; j<=100000000; j=j*10 ))
do
    echo "Lines in file: $j"
    # create file containing j lines
    seq 1 $j > file
    # initial read of file
    cat file > /dev/null

    for comm in {0..3}
    do
        avg=0
        echo
        echo ${heading[$comm]}
        for (( i=1; i<=$n; i++ ))
        do
            case $comm in
                0)
                    t=$( { time head -1 file > /dev/null; } 2>&1);;
                1)
                    t=$( { time sed -n 1p file > /dev/null; } 2>&1);;
                2)
                    t=$( { time sed -n '1{p;q}' file > /dev/null; } 2>&1);;
                3)
                    t=$( { time read line < file && echo $line > /dev/null; } 2>&1);;
            esac
            # accumulate the raw times as a "+"-joined expression for bc
            avg=$avg+$t
        done
        # average over the n runs, to 3 decimal places
        echo "scale=3;($avg)/$n" | bc
    done
done
Just save the above as benchmark.sh and run bash benchmark.sh.
Results:

head -1 file
.001

sed -n 1p file
.048

sed -n '1{p;q}' file
.002

read line < file && echo $line
0

*Results are from a file containing 1,000,000 lines.*
So the time for sed -n 1p grows linearly with the length of the file, while the times for the other variations remain constant (and negligible), because they all quit after reading the first line.
Note: the timings differ from the original post, as this was run on a faster Linux machine.
If you are really just getting the very first line, and you are reading hundreds of files, then consider shell builtins instead of external commands: use read, which is a shell builtin for bash and ksh. This eliminates the overhead of process creation with awk, sed, head, etc.
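A minimal sketch of that approach for many files, assuming the paths sit one per line in a hypothetical files.txt:

while IFS= read -r path; do
    IFS= read -r first < "$path"     # builtin read: no fork, no exec per file
    printf '%s\n' "$first"
done < files.txt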
The other issue is doing timed performance analysis on I/O. The first time you open and then read a file, the file data is probably not cached in memory. However, if you run a second command on the same file, the data as well as the inode have been cached, so the timed results may be faster, pretty much regardless of the command you use. Plus, inodes can stay cached practically forever (they do on Solaris, for example, or anyway for several days).
For example, Linux caches everything and the kitchen sink, which is a good performance attribute, but it makes benchmarking problematic if you are not aware of the issue.
All of this caching effect "interference" is both OS and hardware dependent.
So: pick one file and read it with a command; now it is cached. Then run the same test command several dozen times. This samples the effect of the command and of child-process creation, not of your I/O hardware.
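A minimal sketch of that methodology (the file name and sample count are illustrative):

cat bigfile > /dev/null                  # first read warms the page cache
for i in {1..30}; do
    time head -n 1 bigfile > /dev/null   # samples now measure command overhead, not disk I/O
done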
This is sed vs read for 10 iterations of getting the first line of the same file, after reading the file once:
sed: sed '1{p;q}' uopgenl20121216.lis

real    0m0.917s
user    0m0.258s
sys     0m0.492s

read: read foo < uopgenl20121216.lis ; export foo; echo "$foo"

real    0m0.017s
user    0m0.000s
sys     0m0.015s
This is clearly contrived, but it does show the difference between builtin performance and using an external command.
How about avoiding pipes altogether? Both sed and head accept the filename as an argument, so you can avoid passing the data through cat. I have not measured it, but head should be faster on larger files, since it stops processing after N lines (whereas sed goes through the whole file even if it does not print the lines, unless you specify the q(uit) option as suggested above).
Examples:
sed -n '1{p;q}' /path/to/file
head -n 1 /path/to/file
Again, I did not test the efficiency.
If you want to print just one line (say the 20th) from a big file, you could also do:
head -20 filename | tail -1
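A parameterized sketch of the same combination (N and filename are placeholders):

N=20
head -n "$N" filename | tail -n 1    # head stops after N lines; tail keeps only the last of them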
I did a "basic" test with bash and it seems to perform better than the sed -n '1{p;q}' solution above.
The test takes a big file and prints a line from somewhere in the middle (at line 10000000), repeating 100 times, each time selecting the next line. So it selects lines 10000000, 10000001, 10000002, ... and so on up to 10000099.
$wc -l english
36374448 english
$time for i in {0..99}; do j=$((i+10000000)); sed -n $j'{p;q}' english >/dev/null; done;
real 1m27.207s
user 1m20.712s
sys 0m6.284s
vs.
$time for i in {0..99}; do j=$((i+10000000)); head -$j english | tail -1 >/dev/null; done;
real 1m3.796s
user 0m59.356s
sys 0m32.376s
For printing one line out of multiple files:
$wc -l english*
36374448 english
17797377 english.1024MB
3461885 english.200MB
57633710 total
$time for i in english*; do sed -n '10000000{p;q}' $i >/dev/null; done;
real 0m2.059s
user 0m1.904s
sys 0m0.144s
$time for i in english*; do head -10000000 $i | tail -1 >/dev/null; done;
real 0m1.535s
user 0m1.420s
sys 0m0.788s
I have done extensive testing and found that, if you want every line of a file:
while IFS=$'\n' read -r LINE; do
echo "$LINE"
done < your_input.txt
is much, much faster than any other (Bash-based) method out there. All other methods (like sed) read the file every time, at least up to the matching line. If the file is 4 lines long, you would get 1 -> 1,2 -> 1,2,3 -> 1,2,3,4 = 10 line reads, whereas the while loop just maintains a position cursor (based on IFS), so it only does 4 reads in total.
On a file with ~15k lines, the difference is phenomenal: ~25-28 seconds (sed-based, extracting a specific line each time) versus ~0-1 seconds (while ... read based, reading through the file once).
The example above also shows how to set IFS to a newline in a better way (with thanks to Peter from the comments below), which will hopefully fix some of the other issues occasionally seen when using while ... read ... in Bash.
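As a hedged sketch, the same single-pass idea can be applied to the range-of-lines benchmark above (the line numbers are illustrative):

n=0
while IFS=$'\n' read -r LINE; do
    n=$((n+1))
    (( n >= 10000000 )) && echo "$LINE"   # inside the wanted range: print the line
    (( n >= 10000099 )) && break          # past the last wanted line: stop reading
done < your_input.txt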