shell - Print text between two lines in Unix (from a list of line numbers in a file)

Question

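I have a sample file with thousands of lines. I want to print the text between two line numbers in that file. I don't want to enter the line numbers by hand; instead, I have a file containing the list of line numbers between which the text must be printed.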

Example: linenumbers.txt

345|789
999|1056
1522|1366
3523|3562

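I need a shell script that reads the line numbers from this file and prints the text in each line range to a separate (new) file.

That is, it should print the lines from 345 to 789 to a new file, say File1.txt, the text from line 999 to line 1056 to another new file, say File2.txt, and so on.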

score 2 · Accepted Answer

Given that your target file has only a few thousand lines, here is a quick and dirty solution.

awk -F'|' '{system("sed -n \""$1","$2"p\" targetFile > file"NR)}' linenumbers.txt

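Here targetFile is the file containing thousands of lines.
The one-liner does not require your linenumbers.txt to be sorted.
The one-liner allows the line ranges in your linenumbers.txt to overlap.

After running the command above you will have n files named filex, where n is the number of lines in linenumbers.txt and x runs from 1 to n. You can change the file name pattern as needed.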
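
To see what the one-liner produces, it can be tried on a tiny input first. The following is only a sketch for experimenting, not part of the original answer: the scratch directory, the ten-line targetFile and the two ranges are made-up test data.

#!/bin/sh
# Demo only: the scratch directory and its contents are invented test data.
demo_dir=$(mktemp -d) || exit 1
cd "$demo_dir" || exit 1

# A 10-line target file whose content is simply the line number.
awk 'BEGIN { for (i = 1; i <= 10; i++) print i }' > targetFile

# Two ranges, deliberately unsorted and overlapping.
printf '4|6\n2|5\n' > linenumbers.txt

# The one-liner from the answer: one sed run per range, output in file1, file2, ...
awk -F'|' '{system("sed -n \""$1","$2"p\" targetFile > file"NR)}' linenumbers.txt

# Show each result file on a single line.
for f in file1 file2; do
    printf '%s: %s\n' "$f" "$(paste -s -d ' ' "$f")"
done

With this input, file1 should hold lines 4 to 6 and file2 lines 2 to 5. Note that the approach starts one sed process per range, which is fine for a handful of ranges but may get slow for very long range lists.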

score 2 · Accepted Answer

Here is one way using GNU awk. Run it like this:

awk -f script.awk numbers.txt file.txt

Contents of script.awk:

BEGIN {
    # set the field separator
    FS="|"
}

# for the first file in the arguments list
FNR==NR {

    # add the row number and field one as keys to a multidimensional array with
    # a value of field two
    a[NR][$1]=$2

    # skip processing the rest of the code
    next
}

# for the second file in the arguments list
{
    # for every element in the array's first dimension
    for (i in a) {

        # for every element in the second dimension
        for (j in a[i]) {

            # ensure that the first field is treated numerically
            j+=0

            # if the line number is greater than or equal to the first field
            # and less than or equal to the second field
            if (FNR>=j && FNR<=a[i][j]) {

                # print the line to a file with the suffix of the first file's 
                # line number (the first dimension)
                print > "File" i
            }
        }
    }
}

Alternatively, here is the one-liner:

awk -F "|" 'FNR==NR { a[NR][$1]=$2; next } { for (i in a) for (j in a[i]) { j+=0; if (FNR>=j && FNR<=a[i][j]) print > "File" i } }' numbers.txt file.txt

If you have an 'old' awk, here is a compatible version. Run it like this:

awk -f script.awk numbers.txt file.txt

Contents of script.awk:

BEGIN {
    # set the field separator
    FS="|"
}

# for the first file in the arguments list
FNR==NR {

    # add the row number and field one as a key to a pseudo-multidimensional
    # array with a value of field two
    a[NR,$1]=$2

    # skip processing the rest of the code
    next
}

# for the second file in the arguments list
{
    # for every element in the array
    for (i in a) {

        # split the element in to another array
        # b[1] is the row number and b[2] is the first field 
        split(i,b,SUBSEP)

        # if the line number is greater than or equal to the first field
        # and less than or equal to the second field
        if (FNR>=b[2] && FNR<=a[i]) {

            # print the line to a file with the suffix of the first file's
            # line number (the first pseudo-dimension); the parentheses keep
            # the concatenated file name unambiguous in non-GNU awks
            print > ("File" b[1])
        }
    }
}

Alternatively, here is the one-liner:

awk -F "|" 'FNR==NR { a[NR,$1]=$2; next } { for (i in a) { split(i,b,SUBSEP); if (FNR>=b[2] && FNR<=a[i]) print > ("File" b[1]) } }' numbers.txt file.txt
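
For a quick check of the single-pass idea on small data, something like the sketch below could be used. It is a variation rather than the script above: it stores the range bounds in two plain arrays (lo and hi, names chosen here for illustration) instead of one compound key, and the scratch directory, file contents and ranges are invented test data.

#!/bin/sh
# Demo only: the scratch directory and its contents are invented test data.
demo_dir=$(mktemp -d) || exit 1
cd "$demo_dir" || exit 1

awk 'BEGIN { for (i = 1; i <= 10; i++) print "line " i }' > file.txt
printf '2|4\n3|8\n' > numbers.txt

# Remember each range while reading numbers.txt, then test every line of
# file.txt against all ranges in a single pass.
awk -F'|' '
FNR == NR { lo[NR] = $1 + 0; hi[NR] = $2 + 0; n = NR; next }
{
    for (i = 1; i <= n; i++)
        if (FNR >= lo[i] && FNR <= hi[i])
            print > ("File" i)    # parentheses keep the file name unambiguous
}' numbers.txt file.txt

# First and last line of each output file.
for f in File1 File2; do
    head -n 1 "$f"
    tail -n 1 "$f"
done

Because every input line is compared against every range, overlapping ranges work, and the data file is read only once no matter how many ranges there are.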

score 1 · Accepted Answer

You can do the following:

# myscript.sh
linenumbers="linenumbers.txt"
somefile="afile"
# split each input line on "|" into start and end, then print (not run) the sed command
while IFS=\| read -r start end; do
    echo "sed -n '$start,${end}p;${end}q;' $somefile  > $somefile-$start-$end"
done < "$linenumbers"

Run it like this: sh myscript.sh. It prints the commands it would execute:

sed -n '345,789p;789q;' afile  > afile-345-789
sed -n '999,1056p;1056q;' afile  > afile-999-1056
sed -n '1522,1366p;1366q;' afile  > afile-1522-1366
sed -n '3523,3562p;3562q;' afile  > afile-3523-3562

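Then, when you are happy with the output, run sh myscript.sh | sh

EDIT: added William's excellent points on style and correctness.

EDIT: explanation

The basic idea is to have a script generate a series of shell commands, which can be checked for correctness before they are executed by "| sh".

sed -n '345,789p;789q;' means: use sed without echoing every line (-n). There are two commands: the first p(rints) lines 345 through 789, and the second q(uits) at line 789. By quitting at the last line you save sed from reading the rest of the input file.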
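
The effect of the print-then-quit pair can be tried on a short input. This is a minimal sketch with made-up data (ten numbered lines standing in for the real file); it also shows what happens when a range is reversed, as in the 1522|1366 entry of the sample data.

#!/bin/sh
# Demo only: ten numbered lines stand in for the real input file.
numbers=$(awk 'BEGIN { for (i = 1; i <= 10; i++) print i }')

# Print lines 3-5, then quit at line 5 so lines 6-10 are never processed.
picked=$(printf '%s\n' "$numbers" | sed -n '3,5p;5q')
echo "$picked"

# With a reversed range the quit fires first: sed stops at line 3,
# before line 5 is ever reached, so nothing is printed.
reversed=$(printf '%s\n' "$numbers" | sed -n '5,3p;3q')
echo "reversed range printed: [$reversed]"

So a start|end pair whose start is larger than its end yields an empty output file with this approach; if such pairs are expected, they would need to be swapped or rejected before the sed commands are generated.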

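The while loop reads from the $linenumbers file using read. If read is given more than one variable name, it fills each one with a field from the input. Fields are normally separated by whitespace, and if there are too few variable names, read puts the remaining data into the last one.

You can type the following at a shell prompt to see the behavior.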

ls -l | while read first rest ; do
   echo $first XXXX $rest
done

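Try adding another variable, second, to the loop above and see what happens; it should be obvious.

The problem is that your data is delimited by |s, and this is where William's suggestion of IFS=\| does its work: with IFS changed for the read, the input is split on |s and we get the desired result.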
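
The splitting can be checked in isolation with a short sketch like the one below. The two ranges are made-up sample input in the same start|end layout, and the variable names ranges and parsed are only used for this demonstration.

#!/bin/sh
# Demo only: two made-up ranges in the same start|end layout.
ranges='345|789
999|1056'

# IFS is set only for the read command itself, so each line is split on "|"
# without changing word splitting in the rest of the script.
parsed=$(printf '%s\n' "$ranges" | while IFS='|' read -r start end; do
    echo "start=$start end=$end"
done)
echo "$parsed"

This should print start=345 end=789 followed by start=999 end=1056. The -r option stops read from treating backslashes specially, which is harmless here and a good habit in general.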

Others should feel free to edit, correct, and expand.

score 1 · Accepted Answer

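I would use sed to process the sample data file because it is simple and fast. That requires a mechanism for converting the line numbers file into the appropriate sed script. There are many ways to do this.

One way uses sed to convert the set of line numbers into a sed script. If everything were going to standard output, this would be trivial. Since the output needs to go to different files, we need a line number for each line in the line numbers file. One way to supply those line numbers is the nl command. Another possibility is pr -n -l1. The same sed command line works with both: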

nl linenumbers.txt |
sed 's/ *\([0-9]*\)[^0-9]*\([0-9]*\)|\([0-9]*\)/\2,\3w file\1.txt/'

For the given data file, this generates:

345,789w file1.txt
999,1056w file2.txt
1522,1366w file3.txt
3523,3562w file4.txt

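Another option is to have awk generate the sed script. The w command treats the rest of the line as the file name, so no > is written before it:

awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt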

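If your version of sed can read its script from standard input with -f - (GNU sed can; BSD sed cannot), you can convert the line numbers file into a sed script on the fly and use it to parse the sample data:

awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt |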
sed -n -f - sample.data

If your system supports /dev/stdin, you can use one of the following:

awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt |
sed -n -f /dev/stdin sample.data

awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt |
sed -n -f /dev/fd/0 sample.data

Failing that, use an explicit script file:

awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt > sed.script
sed -n -f sed.script sample.data
rm -f sed.script
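
The explicit-script route can be tried end to end on small data. The sketch below is only an illustration under stated assumptions: the scratch directory, the ten-line sample.data and the two ranges are invented, and the generated commands put the file name directly after w, because sed takes the rest of the line as the name of the output file.

#!/bin/sh
# Demo only: the scratch directory and its contents are invented test data.
demo_dir=$(mktemp -d) || exit 1
cd "$demo_dir" || exit 1

awk 'BEGIN { for (i = 1; i <= 10; i++) print "line " i }' > sample.data
printf '2|4\n7|9\n' > linenumbers.txt

# Turn each start|end pair into a sed "w" command with its own output file.
awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt > sed.script
cat sed.script

# One pass over the data; -n suppresses the default output to stdout.
sed -n -f sed.script sample.data

cat file1.txt
cat file2.txt

Here sed.script should contain 2,4w file1.txt and 7,9w file2.txt, and the two output files should hold lines 2 to 4 and 7 to 9 of sample.data. sed creates every file named in a w command before it starts reading input, so a range that matches nothing still leaves an empty file behind.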

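Strictly, you should make sure that the temporary file name is unique (mktemp) and that the file is removed even if the script is interrupted (trap):

tmp=$(mktemp sed.script.XXXXXX) || exit 1
# remove the temporary file on exit (0) and on HUP, INT, QUIT, PIPE and TERM
trap 'rm -f "$tmp"; exit 1' 0 1 2 3 13 15

awk -F'|' '{ printf "%d,%dw file%d.txt\n", $1, $2, NR }' linenumbers.txt > "$tmp"
sed -n -f "$tmp" sample.data
rm -f "$tmp"
trap 0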

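The final trap 0 allows your script to exit successfully; omit it and your script will always exit with status 1.

I have ignored Perl and Python; either could be used for this in a single command. The file management is fiddly enough that using sed seems simpler. You could also use just awk, either with a first awk script that writes a second awk script to do the heavy lifting (a trivial extension of the above), or with a single awk process that reads both files and produces the required output (harder, but far from impossible).

If nothing else, this shows that there are many possible ways of doing the job. If this is a one-off exercise, it really does not matter which one you choose. If you will be doing this repeatedly, choose the mechanism you like. If you are worried about performance, measure. Converting the line numbers into a command script is likely to have negligible cost; processing the sample data with the command script is where the time is spent. I would expect sed to excel at that point; I have not measured to confirm that it does.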

score 0 · Accepted Answer

To extract the first field from 345|789, you can use awk, for example:

awk -F'|' '{print $1}'

Combine that with the answers you received to your other question and you will have your solution.

score 0 · Accepted Answer

This might work for you (GNU sed):

sed -r 's/(.*)\|(.*)/\1,\2w file-\1-\2.txt/' linenumbers.txt | sed -nf - file
