21

I have to extract a specific line from a big file (1,500,000 lines), many times in a loop over multiple files, and I was asking myself what the best option would be (performance-wise). There are many ways to do this; I use these two:

cat ${file} | head -1

or

cat ${file} | sed -n '1p'

I could not find an answer to this: do they both fetch only the first line, or does one of them (or both) first open the whole file and then fetch line 1?


5 Answers

38

Drop the useless use of cat and do:

$ sed -n '1{p;q}' file

This will quit the sed script after the line has been printed.
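The same pattern generalizes to any line number, not just the first. A minimal sketch (the line number and file path are illustrative placeholders):

#!/bin/bash
# Print line $n of $file, then quit so the rest of the file is never read.
n=42                 # illustrative line number
file=/path/to/file   # illustrative path
sed -n "${n}{p;q}" "$file"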


Benchmark script:

#!/bin/bash

TIMEFORMAT='%3R'
n=25
heading=('head -1 file' 'sed -n 1p file' "sed -n '1{p;q}' file" 'read line < file && echo $line')

# files up to a hundred million lines (decrease this if you're on a slow machine!)
for (( j=1; j<=100000000; j=j*10 ))
do
    echo "Lines in file: $j"
    # create file containing j lines
    seq 1 $j > file
    # initial read of file
    cat file > /dev/null

    for comm in {0..3}
    do
        avg=0
        echo
        echo ${heading[$comm]}    
        for (( i=1; i<=$n; i++ ))
        do
            case $comm in
                0)
                    t=$( { time head -1 file > /dev/null; } 2>&1);;
                1)
                    t=$( { time sed -n 1p file > /dev/null; } 2>&1);;
                2)
                    t=$( { time sed -n '1{p;q}' file > /dev/null; } 2>&1);;
                3)
                    t=$( { time { read line < file && echo $line; } > /dev/null; } 2>&1);;
            esac
            avg=$avg+$t
        done
        echo "scale=3;($avg)/$n" | bc
    done
done

Just save it as benchmark.sh and run bash benchmark.sh.

Results:

head -1 file
.001

sed -n 1p file
.048

sed -n '1{p;q}' file
.002

read line < file && echo $line
0

*Results from a file with 1,000,000 lines.*

So the time for sed -n 1p grows linearly with the length of the file, while the time for the other variants is constant (and negligible), because they all quit after reading the first line:

[Plot: runtime of each command versus the number of lines in the file]

Note: the timings differ from the original post because this was run on a faster Linux machine.

Answered 2013-03-26T08:50:40.483
6

If you are really just getting the very first line, and you are reading hundreds of files, then consider shell builtins instead of external commands: use read, which is a builtin in bash and ksh. This eliminates the overhead of process creation with awk, sed, head, etc.
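For example, a minimal sketch of that pattern (the glob is a placeholder for your own file set):

#!/bin/bash
# Grab the first line of each file with the builtin `read`:
# no external process is spawned per file.
for file in /path/to/files/*; do
    IFS= read -r first_line < "$file"
    printf '%s: %s\n' "$file" "$first_line"
done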

The other issue is doing timed performance analysis on I/O. The first time you open and then read a file, its data is probably not cached in memory. However, if you run a second command on the same file, the data as well as the inode have been cached, so the timed results may be faster, pretty much regardless of the command you use. Plus, inodes can stay cached practically forever; they do on Solaris, for example, or at any rate for several days.

For example, Linux caches everything and the kitchen sink, which is a good performance attribute. But it makes benchmarking problematic if you are not aware of the issue.

All of this caching effect "interference" is both OS and hardware dependent.

So: pick one file and read it with a command. Now it is cached. Then run the same test command several dozen times; this samples the effect of the command and child-process creation, not your I/O hardware.
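A minimal sketch of that procedure (the file name and iteration count are illustrative):

#!/bin/bash
file=sample.txt           # any test file
cat "$file" > /dev/null   # first read: warm the page cache
# With the cache warm, this measures process creation and line
# extraction, not disk I/O.
time for i in {1..30}; do
    sed -n '1{p;q}' "$file" > /dev/null
done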

This is sed vs. read for 10 iterations of getting the first line of the same file, after reading the file once:

sed: sed '1{p;q}' uopgenl20121216.lis

real    0m0.917s
user    0m0.258s
sys     0m0.492s

read: read foo < uopgenl20121216.lis ; export foo; echo "$foo"

real    0m0.017s
user    0m0.000s
sys     0m0.015s

This is clearly contrived, but it does show the difference between builtin performance and running an external command.
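For reference, a sketch of how such a 10-iteration timing could be driven (the loop is an assumption about the harness, not the author's exact script; the file name is taken from the example above):

#!/bin/bash
# Time 10 repetitions of each approach on an already-cached file.
time for i in {1..10}; do sed '1{p;q}' uopgenl20121216.lis > /dev/null; done
time for i in {1..10}; do read foo < uopgenl20121216.lis && echo "$foo" > /dev/null; done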

Answered 2013-03-26T12:49:16.613
3

How about avoiding pipes altogether? Both sed and head accept the filename as an argument, so you can avoid the detour through cat. I haven't measured it, but head should be faster on larger files, since it stops the computation after N lines, whereas sed goes through the whole file even if it doesn't print the lines (unless you specify the q(uit) option as suggested above).

Examples:

sed -n '1{p;q}' /path/to/file
head -n 1 /path/to/file

Again, I didn't test the efficiency.

Answered 2013-03-26T10:13:49.110
3

If you want to print just one line (say the 20th) from a large file, you could also do:

head -20 filename | tail -1

I did a "basic" test with bash, and it seems to perform better than the sed -n '1{p;q}' solution above.

The test takes a large file and prints a line from somewhere in the middle (at line 10000000), repeating 100 times and selecting the next line each time; so it selects lines 10000000, 10000001, 10000002, ... and so on up to 10000099.

$ wc -l english
36374448 english

$ time for i in {0..99}; do j=$((i+10000000)); sed -n $j'{p;q}' english >/dev/null; done

real    1m27.207s
user    1m20.712s
sys     0m6.284s

vs.

$ time for i in {0..99}; do j=$((i+10000000)); head -$j english | tail -1 >/dev/null; done

real    1m3.796s
user    0m59.356s
sys     0m32.376s

For printing one line out of multiple files:

$ wc -l english*
  36374448 english
  17797377 english.1024MB
   3461885 english.200MB
  57633710 total

$ time for i in english*; do sed -n '10000000{p;q}' $i >/dev/null; done

real    0m2.059s
user    0m1.904s
sys     0m0.144s



$ time for i in english*; do head -10000000 $i | tail -1 >/dev/null; done

real    0m1.535s
user    0m1.420s
sys     0m0.788s
Answered 2015-06-13T01:37:21.393
0

I have done extensive testing and found that, if you want every line of a file:

while IFS=$'\n' read LINE; do
  echo "$LINE"
done < your_input.txt

is much, much faster than any other (Bash-based) method. All other methods (like sed) read the file from the beginning every time, at least up to the matching line. If the file is 4 lines long, you would get 1 -> 1,2 -> 1,2,3 -> 1,2,3,4 = 10 line reads, whereas the while loop just maintains a position cursor (based on IFS), so it does only 4 reads in total.

On a file with ~15k lines, the difference is phenomenal: ~25-28 seconds (sed-based, extracting a specific line each time) versus ~0-1 seconds (while...read-based, reading through the file once).

The example above also shows how to set IFS to a newline in a better way (thanks to Peter from the comments below), which will hopefully fix some of the other issues sometimes seen when using while ... read ... in Bash.
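Building on the same single-pass idea, a minimal sketch of extracting one specific line with the builtin loop (the target line number and input file name are placeholders):

#!/bin/bash
# Print only line $target, reading the file once and stopping early.
target=20
i=0
while IFS=$'\n' read -r LINE; do
    i=$((i+1))
    if (( i == target )); then
        echo "$LINE"
        break
    fi
done < your_input.txt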

Answered 2020-08-29T03:18:26.620