linux - Delete lines from a text file using line numbers listed in another file

Question

I have a text file containing a huge list of line numbers that I have to remove from another, main file. Here is what my data looks like:

lines.txt

and documents.txt

string1
string2
string3
...

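If I had a short list of line numbers, I could easily use

sed -i '1d;4d;5d' documents.txt

But there are a great many line numbers that I have to delete. Alternatively, I could use a bash/perl script to store the line numbers in an array and echo the lines that are not in the array. But I would like to know whether there is a built-in command that does just this.

Any help would be highly appreciated.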

score 10 · Accepted Answer

An awk one-liner should work for you; see the test below:

kent$  head lines.txt doc.txt 
==> lines.txt <==
1
3
5
7

==> doc.txt <==
a
b
c
d
e
f
g
h

kent$  awk 'NR==FNR{l[$0];next;} !(FNR in l)' lines.txt doc.txt
b
d
f
h

As Levon suggested, I am adding some explanation:

awk                     # the awk command
 'NR==FNR{l[$0];next;}  # while reading the first file (lines.txt), save each line (a line number to delete) as a key of the array "l"
 !(FNR in l)'           # while reading the second file (doc.txt), print the line if its line number is not in "l"
 lines.txt              # 1st argument, file: lines.txt
 doc.txt                # 2nd argument, file: doc.txt
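
Since awk prints to standard output and does not modify doc.txt, applying the one-liner to the real file takes one more step. The following is a minimal sketch, assuming illustrative sample files created in a scratch directory and a temporary output name (doc.txt.new) that is not part of the original answer:

```shell
#!/bin/sh
# Illustrative sample data in a scratch directory (an assumption, not from the question).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '%s\n' 1 3 5 7 > lines.txt
printf '%s\n' a b c d e f g h > doc.txt

# First file: remember each listed line number as an array key.
# Second file: print only the lines whose number is not a key.
awk 'NR==FNR { l[$0]; next } !(FNR in l)' lines.txt doc.txt > doc.txt.new

# awk does not edit in place, so replace the original afterwards.
mv doc.txt.new doc.txt
result=$(cat doc.txt)
echo "$result"
```

One caveat: if lines.txt is empty, NR==FNR stays true while doc.txt is being read, so nothing is printed. It may be worth checking for that case before overwriting the file.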

score 2 · Accepted Answer

Here is one way to do it using sed:

sed ':a;${s/\n//g;s/^/sed \o47/;s/$/d\o47 documents.txt/;b};s/$/d\;/;N;ba' lines.txt | sh

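It uses sed to build a sed command and pipes it to a shell to be executed. The resulting sed command looks like sed '3d;5d;11d' documents.txt.

To build it, the outer sed command appends d; to each number and loops to the next line by branching back to the beginning (N; ba). When the last line is reached ($), all newlines are removed, sed ' is prepended, and d and ' documents.txt are appended. Then b branches out of the :a ... ba loop to the end of the script, since no label is specified.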
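
To see what gets built before anything is executed, the same command can be run without the final pipe to sh. This is a small sketch that assumes GNU sed (the \o47 escape for a single quote is a GNU extension) and uses an illustrative lines.txt created in a scratch directory:

```shell
#!/bin/sh
# Illustrative list of line numbers (an assumption, not from the question).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '%s\n' 3 5 11 > lines.txt

# Same script as in the answer, but the result is captured instead of piped to sh.
generated=$(sed ':a;${s/\n//g;s/^/sed \o47/;s/$/d\o47 documents.txt/;b};s/$/d\;/;N;ba' lines.txt)

# Show the command that would be handed to the shell.
echo "$generated"
```

Printing the command first is a cheap way to confirm that the list contains only numbers before letting a shell run it.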

Here is a way to do it using join and cat -n (assuming lines.txt is sorted):

join -t $'\v' -v 2 -o 2.2 lines.txt <(cat -n documents.txt | sed 's/^ *//;s/\t/\v/')

If lines.txt is not sorted:

join -t $'\v' -v 2 -o 2.2 <(sort lines.txt) <(cat -n documents.txt | sed 's/^ *//;s/\t/\v/')

Edit:

Fixed a bug in the join commands: the original versions output only the first word of each line in documents.txt.

score 2 · Accepted Answer

Well, I don't speak Perl, and bash I only manage painfully, by trial after trial. Rexx, however, would do this easily.

lines_to_delete = ""

do while lines( "lines.txt" )
   lines_to_delete = lines_to_delete linein( "lines.txt" )
end

n = 0
do while lines( "documents.txt" )
   line = linein( "documents.txt" )
   n = n + 1
   if wordpos( n, lines_to_delete ) == 0 then
      call lineout "temp_out.txt", line
end

This leaves your output in temp_out.txt, which you can rename to documents.txt as needed.

score 1 · Accepted Answer

This might work for you (GNU sed):

sed 's/.*/&d/' lines.txt | sed -i -f - documents.txt

or:

sed ':a;$!{N;ba};s/\n/d;/g;s/^/sed -i '\''/;s/$/d'\'' documents.txt/' lines.txt | sh
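
To inspect the generated script before modifying anything, the first variant can be split into two steps. This is a minimal sketch; the sample data and the file name delete.sed are illustrative assumptions, and -i is left out on purpose so that documents.txt stays untouched:

```shell
#!/bin/sh
# Illustrative sample data in a scratch directory (an assumption, not from the question).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '%s\n' 1 4 5 > lines.txt
printf '%s\n' string1 string2 string3 string4 string5 string6 > documents.txt

# Turn each line number N into the sed command "Nd", one command per line.
sed 's/.*/&d/' lines.txt > delete.sed
script=$(cat delete.sed)

# Preview the result on stdout; documents.txt itself is not changed.
preview=$(sed -f delete.sed documents.txt)
echo "$script"
echo "$preview"
```

Once the preview looks right, adding -i to the second sed call (GNU sed) applies the deletion to the file itself.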

score 0 · Accepted Answer

I asked a similar question on Unix SE and got great answers, among them the following awk script:

#!/bin/bash
#
# filterline keeps a subset of lines of a file.
#
# cf. https://unix.stackexchange.com/q/209404/376
#
set -eu -o pipefail

if [ "$#" -ne 2 ]; then
    echo "Usage: filterline FILE1 FILE2"
    echo
    echo "FILE1: one integer per line indicating line number, one-based, sorted"
    echo "FILE2: input file to filter"
    exit 1
fi

LIST="$1" LC_ALL=C awk '
  function nextline() {
    if ((getline n < list) <=0) exit
  }
  BEGIN{
    list = ENVIRON["LIST"]
    nextline()
  }
  NR == n {
    print
    nextline()
  }' < "$2"

There is also a C version, which is a bit more performant:

https://github.com/miku/filterline
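
Note that the awk script above keeps the listed lines, whereas the question asks to delete them. The following is a hedged sketch of the inverted logic, not part of the original answer. It assumes that lines.txt is sorted in ascending numeric order with no duplicates, and it uses illustrative sample data:

```shell
#!/bin/sh
# Illustrative sample data in a scratch directory (an assumption, not from the question).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '%s\n' 2 3 6 > lines.txt
printf '%s\n' one two three four five six seven > documents.txt

kept=$(LIST=lines.txt LC_ALL=C awk '
  function nextline() {
    # Read the next line number to delete; once the list is exhausted,
    # fall back to 0, which never equals NR.
    if ((getline n < list) <= 0) n = 0
  }
  BEGIN {
    list = ENVIRON["LIST"]
    nextline()
  }
  # Listed line: skip it and advance to the next number in the list.
  NR == n { nextline(); next }
  # Any other line is kept.
  { print }
' < documents.txt)
echo "$kept"
```

Like the original script, this holds only one line number in memory at a time, which is the point of requiring a sorted list.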
