linux - awk: loop and save different lines to different files?

Question

I am looping over a series of large files with a shell script:

i=0
while read line
do

    # get first char of line
    first=`echo "$line" | head -c 1`

    # make output filename
    name="$first"
    if [ "$first" = "," ]; then
        name='comma'
    fi
    if [ "$first" = "." ]; then
        name='period'
    fi

    # save line to new file
    echo "$line" >> "$2/$name.txt"

    # show live counter and inc
    echo -en "\rLines:\t$i"
    ((i++))

done <$file

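The first character of each line is either alphanumeric or one of the characters handled above (which is why I rename them for use in the output file name).

It is far too slow.

5,000 lines take 128 seconds.

At this rate I have a solid month of processing ahead of me.

Would awk be faster here?

If so, how do I fit this logic into awk?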

score 3 · Accepted Answer

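This can certainly be done more efficiently in bash.

To give an example: echo foo | head calls fork(), creates a subshell, sets up a pipeline, and starts the external head program... and there is no reason for any of it.

If you want the first character of a line without any inefficient mucking about with subprocesses, it is as simple as: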

c=${line:0:1}
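
As a minimal sketch of the difference, the snippet below computes the first character both ways for one sample line. The sample value and the variable name slow are made up for illustration. Both give the same result for single-byte characters, but only the second form starts extra processes.

```shell
#!/usr/bin/env bash
line=',some text'   # illustrative sample line, not from the question's data

# Parameter expansion: done by bash itself, no fork and no external program.
c=${line:0:1}

# Pipeline as in the question: a subshell plus an external head process,
# repeated for every input line.
slow=$(printf '%s\n' "$line" | head -c 1)

printf 'expansion=%s pipeline=%s\n' "$c" "$slow"
```

Note that ${line:0:1} returns the first character in the current locale, while head -c 1 returns the first byte, so the two can differ on multi-byte input.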

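I would also seriously consider sorting your input, so that you only reopen the output file when you see a new first character, rather than every time through the loop.

That is: preprocess with sort (for example, by replacing <$file with < <(sort "$file")) and, on each pass through the loop, reopen the output file only when the name changes: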

if [[ $name != "$current_name" ]] ; then
  current_name="$name"
  exec 4>>"$2/$name" # open the output file on FD 4
fi

...and then append to the open file descriptor:

printf '%s\n' "$line" >&4

(echo is avoided because it can misbehave if your line is -e or -n.)
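
Put together, the pieces above might look like the following sketch. The function name split_sorted, the .txt suffix (taken from the question), the empty.txt name for blank lines, and the use of LC_ALL=C to keep lines with the same first byte adjacent are all assumptions for illustration. Because the input is sorted, lines end up in each output file in sorted order rather than input order.

```shell
#!/usr/bin/env bash
# split_sorted INFILE OUTDIR  (hypothetical helper name)
split_sorted() {
    local infile=$1 outdir=$2
    local line first name current_name=''

    while IFS= read -r line; do
        first=${line:0:1}
        case $first in
            ,)  name=comma ;;
            .)  name=period ;;
            '') name=empty ;;      # assumption: blank lines go to empty.txt
            *)  name=$first ;;
        esac

        # Reopen FD 4 only when the target file changes.
        if [[ $name != "$current_name" ]]; then
            current_name=$name
            exec 4>>"$outdir/$name.txt"
        fi
        printf '%s\n' "$line" >&4
    done < <(LC_ALL=C sort -- "$infile")

    # Close FD 4 if it was ever opened.
    if [[ -n $current_name ]]; then
        exec 4>&-
    fi
}
```

It would be called as split_sorted "$file" "$2", with the output directory already in place.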

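Alternatively, if the number of possible output files is small, you can open them all up front on different FDs (substituting other, higher numbers where I chose 4) and conditionally write to one of those pre-opened files. Opening and closing files is expensive (every >> redirection inside the loop costs an open() and a close() per line), so this should be a substantial help.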
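
A variation on that idea, sketched under some assumptions: rather than opening every file up front, the function below opens each output file the first time it is needed and keeps the descriptor in an associative array. It relies on bash 4.1 or later for the exec {fd}>>file form, which lets bash pick a free descriptor number. The function name split_keep_open and the empty.txt name are hypothetical, and the first character is assumed to be alphanumeric, a comma, or a period, as stated in the question.

```shell
#!/usr/bin/env bash
# split_keep_open INFILE OUTDIR  (hypothetical helper name; needs bash >= 4.1)
split_keep_open() {
    local infile=$1 outdir=$2
    local line first name fd
    local -A fds=()               # output name -> open file descriptor

    while IFS= read -r line; do
        first=${line:0:1}
        case $first in
            ,)  name=comma ;;
            .)  name=period ;;
            '') name=empty ;;      # assumption: blank lines go to empty.txt
            *)  name=$first ;;
        esac

        # Open each output file once; bash stores the new FD number in fd.
        if [[ -z ${fds[$name]:-} ]]; then
            exec {fd}>>"$outdir/$name.txt"
            fds[$name]=$fd
        fi
        printf '%s\n' "$line" >&"${fds[$name]}"
    done <"$infile"

    # Close every descriptor that was opened.
    for fd in "${fds[@]}"; do
        exec {fd}>&-
    done
}
```

Unlike the sorted approach, this keeps the lines in their original order within each output file.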

score 2 · Accepted Answer

#!/usr/bin/awk -f
BEGIN {
    punctlist = ", . ? ! - '"
    pnamelist = "comma period question_mark exclamation_mark hyphen apostrophe"
    pcount = split(punctlist, puncts)
    ncount = split(pnamelist, pnames)
    if (pcount != ncount) {print "error: counts don't match, pcount:", pcount, "ncount:", ncount; exit}
    for (i = 1; i <= pcount; i++) {
        punct_lookup[puncts[i]] = pnames[i]
    }
}
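{
    # Fall back to the character itself when it is not in the punctuation table.
    c = substr($0, 1, 1)
    name = (c in punct_lookup) ? punct_lookup[c] : c
    # Parenthesize the concatenation so every awk parses the target the same way.
    print > (name ".txt")
    printf "\r%6d", ++i
}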
END {
    printf "\n"
}

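The BEGIN block builds an associative array so that you can do punct_lookup[","] and get "comma".

The main block simply looks up the file name and writes the line to that file. In AWK, > truncates the file the first time it is opened and appends after that. If you have existing files that you don't want truncated, change it to >> (but don't use >> otherwise).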
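
The difference between > and >> in awk can be seen with a small experiment. This is only an illustrative sketch: the temporary directory and the variable names demo_dir and out are made up for the demonstration.

```shell
#!/usr/bin/env bash
demo_dir=$(mktemp -d)
out="$demo_dir/out.txt"
printf 'old contents\n' > "$out"

# ">" truncates out.txt when awk first opens it, then keeps appending for
# the rest of this awk run: the file now holds only a and b.
printf 'a\nb\n' | awk -v f="$out" '{ print > f }'

# ">>" never truncates, so a second awk run adds c after a and b.
printf 'c\n' | awk -v f="$out" '{ print >> f }'

cat "$out"
```

After both runs the file contains the three lines a, b, and c, and the old contents are gone.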

score 2 · Accepted Answer

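A few things will speed this up:

Don't use echo/head to get the first character. You are spawning at least two extra processes per line. Instead, use bash's parameter expansion facilities to get the first character.
Use if-elif to avoid checking $first against every possibility each time. Better still, if you are using bash 4.0 or later, use an associative array to store the output file names instead of checking $first in a big if statement for every line.

If you don't have a version of bash that supports associative arrays, replace your if statements with the following.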

if [[ "$first" = "," ]]; then
    name='comma'
elif [[ "$first" = "." ]]; then
    name='period'
else
    name="$first"
fi

But the following is what I would recommend. Note that read uses $REPLY as its default variable when no name is given (just FYI).

declare -A output   # must match the array name used below
output[","]=comma
output["."]=period
output["?"]=question_mark
output["!"]=exclamation_mark
output["-"]=hyphen
output["'"]=apostrophe
i=0
while read -r
do

    # get first char of line
    first=${REPLY:0:1}

    # make output filename
    name=${output[$first]:-$first}

    # save line to new file
    printf '%s\n' "$REPLY" >> "$name.txt"

    # show live counter and inc
    echo -en "\r$i"
    ((i++))

done < "$file"
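
The lookup with a fallback can be tried on its own. In the sketch below, the helper name name_for and the placeholder name for blank lines are assumptions for illustration. The expansion ${output[$first]:-$first} returns the mapped name when the character is a key of the array and the character itself otherwise.

```shell
#!/usr/bin/env bash
declare -A output=( [","]=comma ["."]=period ["?"]=question_mark )

# name_for LINE: print the output base name for LINE (hypothetical helper).
name_for() {
    local first=${1:0:1}
    if [[ -z $first ]]; then
        printf 'empty\n'          # assumption: placeholder for blank lines,
                                  # which also avoids indexing with an empty key
    else
        printf '%s\n' "${output[$first]:-$first}"
    fi
}

name_for ',leading comma'
name_for 'plain text'
```

The two calls print comma and p respectively.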

score 1 · Accepted Answer

Another take:

declare -i i=0
declare -A names
while IFS= read -r line; do
    first=${line:0:1}
    if [[ -z ${names[$first]} ]]; then
        case $first in
            ,) names[$first]="$2/comma.txt" ;;
            .) names[$first]="$2/period.txt" ;;
            *) names[$first]="$2/$first.txt" ;;
        esac
    fi
    printf "%s\n" "$line" >> "${names[$first]}"
    printf "\rLine $((++i))"
done < "$file"

And:

awk -v dir="$2" '
    {
        first = substr($0,1,1)
        if (! (first in names)) {
            if (first == ",")      names[first] = dir "/comma.txt"
            else if (first == ".") names[first] = dir "/period.txt"
            else                   names[first] = dir "/" first ".txt"
        }
        print > names[first]
        printf("\rLine %d", NR)
    }
' "$file"
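
For reuse, the awk version could be wrapped in a shell function along these lines. This is a sketch, not a drop-in replacement: the name split_with_awk, the empty.txt name for blank lines, and the choice to report progress on stderr only every 1000 lines (on the assumption that a per-line counter adds noticeable overhead on large inputs) are not from the answer above.

```shell
#!/usr/bin/env bash
# split_with_awk INFILE OUTDIR  (hypothetical helper name)
split_with_awk() {
    awk -v dir="$2" '
        {
            first = substr($0, 1, 1)
            if (!(first in names)) {
                if (first == ",")      base = "comma"
                else if (first == ".") base = "period"
                else if (first == "")  base = "empty"   # assumption: blank lines
                else                   base = first
                names[first] = dir "/" base ".txt"
            }
            print > names[first]
            # Assumption: report every 1000 lines on stderr, not every line.
            if (NR % 1000 == 0) printf("\rLine %d", NR) > "/dev/stderr"
        }
        END { printf("\rLine %d\n", NR) > "/dev/stderr" }
    ' "$1"
}
```

It would be called as split_with_awk "$file" "$2". Since awk interprets backslash escapes in -v values, an output directory whose path contains backslashes would need different handling.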
