linux - How to check whether a file is part of another file?

Question

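I need to check, from a bash script, whether one file is contained in another, given a multi-line pattern and an input file.

Return value:

I want to receive a status (as with the grep command): 0 if any match is found, 1 if no match is found.

Pattern:

multi-line,
the order of the lines matters (they are treated as a single block of lines),
includes characters such as digits, letters, ?, &, *, #, etc.

Explanation

Only the following examples should find a match: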

pattern     file1 file2 file3 file4
222         111   111   222   222
333         222   222   333   333
            333   333         444
            444

The following should not:

pattern     file1 file2 file3 file4 file5 file6 file7
222         111   111   333   *222  111   111   222
333         *222  222   222   *333  222   222   
            333   333*        444   111         333
            444                     333   333

Here is my script:

#!/bin/bash

function writeToFile {
    if [ -w "$1" ] ; then
        echo "$2" >> "$1"
    else
        echo -e "$2" | sudo tee -a "$1" > /dev/null
    fi
}

function writeOnceToFile {
        pcregrep --color -M "$2" "$1"
        #echo $?

        if [ $? -eq 0 ]; then
            echo This file contains text that was added previously
        else
            writeToFile "$1" "$2"
        fi
}

file=file.txt 
#1?1
#2?2
#3?3
#4?4

pattern=`cat pattern.txt`
#2?2
#3?3

writeOnceToFile "$file" "$pattern"

I could run grep on each line of the pattern, but that fails in this example:

file.txt 
#1?1
#2?2
#=== added line
#3?3
#4?4

pattern.txt
#2?2
#3?3

Or even if you swap lines 2 and 3:

file=file.txt 
#1?1
#3?3
#2?2
#4?4

It should not return 0.

How can I solve this? Note that I would prefer to use natively installed programs (without pcregrep, if possible). Maybe sed or awk can solve this?

score 6 · Accepted Answer

I would just use diff for this task:

diff pattern <(grep -f file pattern)

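Explanation

diff file1 file2 reports whether two files differ.
By saying grep -f file pattern, you see which lines of pattern are in file.

So what you are doing is checking which lines of pattern are in file, and then comparing that result with pattern itself. If they match, it means that pattern is a subset of file!

Test

seq 10 is part of seq 20! Let's check: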

$ diff <(seq 10) <(grep -f <(seq 20) <(seq 10))
$

seq 10 is not entirely inside seq 2 20 (1 is not in the second one):

$ diff -q <(seq 10) <(grep -f <(seq 2 20) <(seq 10))
Files /dev/fd/63 and /dev/fd/62 differ
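
One caveat with grep -f as used above: each line of file is treated as a regular expression and may match anywhere in a line, so a line such as 22 or 2.2 in file would count as matching 222 in pattern. A minimal sketch that compares whole lines literally instead, using the standard -F (fixed strings) and -x (whole line) options; the function name lines_subset is made up for illustration:

```shell
#!/bin/bash
# lines_subset PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Returns 0 if every line of PATTERN_FILE also occurs, as a complete
# literal line, somewhere in TARGET_FILE; returns non-zero otherwise.
lines_subset() {
    # grep prints the lines of the pattern file that equal some line of
    # the target file; diff then checks that no pattern line was dropped.
    diff -q "$1" <(grep -Fx -f "$2" "$1") > /dev/null
}

# Usage: lines_subset pattern.txt file.txt && echo "all lines present"
```

This still only tests membership line by line. It does not check that the lines appear in the same order or next to each other, so it would accept the "added line" and "swapped lines" examples from the question.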

score 2 · Accepted Answer

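I have a working version using perl.

I thought I could get it to work with GNU awk, but I couldn't. RS=empty string splits on blank lines. See the edit history for the broken awk version.

"How can I search for a multiline pattern in a file?" shows how to use pcregrep, but I can't see a way to make it work when the pattern to search for may contain regex special characters. -F fixed-string mode doesn't work with multi-line mode: it still treats the pattern as a set of lines to be matched separately (not as one multi-line fixed string to be matched). I see you were already trying to use pcregrep.

By the way, I think your code has a bug in the non-sudo case: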

function writeToFile {
    if [ -w "$1" ] ; then
        "$2" >> "$1"   # probably you mean  echo "$2" >> "$1"
    else
        echo -e "$2" | sudo tee -a "$1" > /dev/null
    fi
}

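Anyway, the attempts with line-based tools failed, so it's time to pull out a more serious programming language that doesn't force a newline convention on us. Just read both files into variables, and use a non-regex search: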

#!/usr/bin/perl -w
# multi_line_match.pl  pattern_file  target_file
# exit(0) if a match is found, else exit(1)

#use IO::File;
use File::Slurp;
my $pat = read_file($ARGV[0]);
my $target = read_file($ARGV[1]);

if ((substr($target, 0, length($pat)) eq $pat) or index($target, "\n".$pat) >= 0) {
    exit(0);
}
exit(1);
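
For comparison, the same "read both files into variables and do a literal search" idea can be sketched in plain bash. This is not the answer's method, only an illustration: the function name contains_block is made up, and the sketch assumes the files contain no NUL bytes and that trailing blank lines do not matter (command substitution strips trailing newlines). Wrapping both strings in newlines makes the match start and end on line boundaries.

```shell
#!/bin/bash
# contains_block PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Returns 0 if the lines of PATTERN_FILE occur in TARGET_FILE as one
# contiguous block of complete lines, 1 otherwise.
contains_block() {
    local nl=$'\n' needle haystack
    # Surround both texts with newlines so the block can only match
    # from the start of a line to the end of a line.
    needle="$nl$(cat -- "$1")$nl"
    haystack="$nl$(cat -- "$2")$nl"
    # The quoted "$needle" is matched literally, so ?, *, # and other
    # special characters in the pattern are not interpreted.
    [[ $haystack == *"$needle"* ]]
}

# Usage: contains_block pattern.txt file.txt; echo $?
```

Both files are held in memory at once, as in the Perl version.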

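See "What is the best way to slurp a file into a string in Perl?" to avoid the dependency on File::Slurp (which is not part of the standard perl distribution, nor of a default Ubuntu 15.04 system). I chose File::Slurp partly so that non-perl geeks can see what the program is doing, compared to: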

my $contents = do { local(@ARGV, $/) = $file; <> };

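I was trying to avoid reading the full file into memory, using an idea from http://www.perlmonks.org/?node_id=98208. I think the non-matching case will usually still read the whole file at once. Also, the logic for handling a match at the front of the file was quite complicated, and I didn't want to spend a long time testing it to make sure it was correct in all cases. Here's what I had before giving up: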

#IO::File->input_record_separator($pat);
$/ = $pat;  # pat must include a trailing newline if you want it to match one

my $fh = IO::File->new($ARGV[2], O_RDONLY)
    or die 'Could not open file ', $ARGV[2], ": $!";

$tail = substr($fh->getline, -1);  #fast forward to the first match
#print each occurrence in the file
#print IO::File->input_record_separator  while $fh->getline;

#FIXME: something clever here to handle the case where $pat matches at the beginning of the file.
do {
    # fixme: need to check defined($fh->getline)
    if (($tail eq '\n') or ($tail = substr($fh->getline, -1))) {
    exit(0);  # if there's a 2nd line
    }
} while($tail);

exit(1);
$fh->close;

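Another idea is to filter both the pattern and the file to be searched through tr '\n' '\r' or something similar, so that each of them becomes a single line. (\r is probably a safe choice that won't clash with anything already in the file or the pattern.)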
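
A rough sketch of that tr idea, assuming neither file already contains \r (so no CRLF line endings) or NUL bytes. The function name contains_block_tr is made up for illustration. A newline is prepended to each input before translating so that the match has to begin at a line boundary, and awk 1 is used only to make sure the last line ends with a newline.

```shell
#!/bin/bash
# contains_block_tr PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Returns 0 if PATTERN_FILE occurs in TARGET_FILE as a contiguous block
# of complete lines, non-zero otherwise.
contains_block_tr() {
    local needle
    # Turn the pattern into one line: "\r" + each line terminated by "\r".
    needle=$( { printf '\n'; awk 1 "$1"; } | tr '\n' '\r' )
    # Do the same to the target, then search it as a fixed string.
    # -F: no regex interpretation, -q: status only, -e: pattern follows.
    { printf '\n'; awk 1 "$2"; } | tr '\n' '\r' | grep -qF -e "$needle"
}

# Usage: contains_block_tr pattern.txt file.txt; echo $?
```

Note that with an empty pattern file the needle is a single \r, which matches any target.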

score 2 · Accepted Answer

I ran into this problem again, and I think awk can handle it better:

awk 'FNR==NR {a[FNR]=$0; next}
     FNR==1 && NR>1 {for (i in a) len++}
     {for (i=last; i<=len; i++) {
         if (a[i]==$0) 
            {last=i; next}
     } status=1}
     END {print status+0}' file pattern

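The idea is:

- Read the whole of file into memory, in an array a[line_number] = line.
- Count the elements in the array.
- Loop through pattern and check whether the current line occurs in file anywhere between the cursor position and the end of file. If it does, move the cursor to the position where it was found. If not, set status to 1, meaning that some line in pattern does not occur in file after the previous match.
- Print the status, which is 0 unless it was set to 1 at some point.

Test

They do match: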

$ tail f p
==> f <==
222
333
555

==> p <==
222
333
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' f p
0

They don't:

$ tail f p
==> f <==
333
222
555

==> p <==
222
333
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' f p
1

With seq:

$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' <(seq 2 20) <(seq 10)
1
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' <(seq 20) <(seq 10)
0
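
As far as I can tell, the awk program above checks that the lines of pattern occur in file in the same order, but it does not require them to be adjacent, and it prints the status instead of returning it. The question asks for a single contiguous block and a grep-like exit status, so here is a hedged variant along those lines. It is a sketch, not the answer's code: the function name contains_block_awk is made up, note that the pattern file comes first here, and it assumes the pattern file is not empty.

```shell
#!/bin/bash
# contains_block_awk PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Exit status 0 if the pattern lines occur in the target as one
# contiguous block of complete lines, 1 otherwise.
contains_block_awk() {
    awk '
        FNR == NR { pat[++n] = $0; next }   # first file: pattern lines
        { line[++m] = $0 }                  # second file: target lines
        END {
            # Try every possible starting line in the target.
            for (start = 1; start + n - 1 <= m; start++) {
                for (k = 1; k <= n; k++)
                    # Appending "" forces a string comparison, so that
                    # 010 and 10 are not treated as equal numbers.
                    if ((line[start + k - 1] "") != (pat[k] "")) break
                if (k > n) exit 0           # all n lines matched in a row
            }
            exit 1
        }
    ' "$1" "$2"
}

# Usage: contains_block_awk pattern.txt file.txt; echo $?
```

Comparing with == or != instead of a regex match keeps characters such as ?, * and # literal.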
