bash - Bash shell script using cmp with output redirection does not work as expected

Question

I am trying to write a bash script that removes duplicate files from a folder, keeping only one copy. The script is the following:

#!/bin/sh

for f1 in `find ./ -name "*.txt"`
do
    if test -f $f1
    then
        for f2 in `find ./ -name "*.txt"`
        do
            if [ -f $f2 ] && [ "$f1" != "$f2" ]
            then
                # if cmp $f1 $f2 &> /dev/null # DOES NOT WORK
                if cmp $f1 $f2
                then
                    rm $f2
                    echo "$f2 purged"
                fi 
            fi
        done
    fi 
done

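I want to redirect stdout and stderr to /dev/null so that nothing is printed to the screen. But with the commented-out statement, the script does not work as expected and deletes every file except the first one.

I will provide more information if needed.

Thanks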

score 8 · Accepted Answer

A few comments:

First:

for f1 in `find ./ -name "*.txt"`
do
    if test -f $f1
    then

is the same as the following, which finds only regular files with the txt extension:

for f1 in `find ./ -type f -name "*.txt"`

A better syntax is $(...) rather than backticks (it is specified by POSIX, so it is not limited to bash):

for f1 in $(find ./ -type f -name "*.txt")

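Finally, the whole approach is wrong, because if a filename contains a space, the f1 variable will not receive the full pathname. So instead of for, use: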

find ./ -type f -name "*.txt" -print | while read -r f1

As @Sir Athos pointed out, filenames can contain \n, so it is better to use

find . -type f -name "*.txt" -print0 | while IFS= read -r -d '' f1
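For illustration, here is a minimal sketch of that loop wrapped in a function. The name list_txt_files is hypothetical (it does not appear in the original script), and the sketch assumes bash, since read -d '' is not available in a plain POSIX sh.

```bash
#!/usr/bin/env bash
# list_txt_files is a hypothetical helper name used only for this sketch.
# It prints every regular *.txt file under the given directory.
list_txt_files() {
    local dir=$1 f1
    # -print0 ends each path with a NUL byte; read -d '' splits on NUL,
    # so spaces and even newlines inside a filename stay intact.
    find "$dir" -type f -name "*.txt" -print0 |
    while IFS= read -r -d '' f1; do
        printf 'found: %s\n' "$f1"
    done
}
```

Calling list_txt_files . would report each file once, with a name such as "with space.txt" kept as a single path.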

Second:

Again, use "$f1" instead of $f1, because $f1 can contain spaces.
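A small sketch of why the quotes matter, assuming the default IFS. The helper count_args and the sample filename are made up for the demonstration.

```bash
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

f1="./my notes.txt"
count_args $f1      # unquoted: the value is split at the space, prints 2
count_args "$f1"    # quoted: the value is passed as one argument, prints 1
```

The same splitting happens in test -f $f1, cmp $f1 $f2 and rm $f2, which then see more arguments than intended.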

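Third:

Doing N*N comparisons is not very efficient. You should compute a checksum (md5 or, better, sha256) for each txt file. When the checksums are identical, the files are duplicates.

If you do not trust checksums, just compare the files that have identical checksums. Files with different checksums are definitely not duplicates. ;)

Computing checksums is slow, so you should first narrow the set down to files that have the same size. Files with different sizes cannot be duplicates...

You can skip empty txt files - they are all duplicates of each other :).

So the final command can be: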

find -not -empty -type f -name \*.txt -printf "%s\n" | sort -rn | uniq -d |\
xargs -I% -n1 find -type f -name \*.txt -size %c -print0 | xargs -0 md5sum |\
sort | uniq -w32 --all-repeated=separate

The same command, with comments:

#find all non-empty files with the txt extension and print their sizes (in bytes)
find . -not -empty -type f -name \*.txt -printf "%s\n" |\

#sort the sizes numerically, and keep only duplicated sizes
sort -rn | uniq -d |\

#for each duplicated size, find all files with that size and print their names (paths)
xargs -I% -n1 find . -type f -name \*.txt -size %c -print0 |\

#make an md5 checksum for them
xargs -0 md5sum |\

#sort the checksums and keep duplicated files separated with an empty line
sort | uniq -w32 --all-repeated=separate

From this output, you can simply edit the output file and decide which files to remove and which to keep.

score 3 · Accepted Answer

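&> is bash syntax; you need to change the shebang line (the first line) to #!/bin/bash (or the path to bash on your system).

Or, if you really are using the Bourne shell (/bin/sh), then you have to use the old-style redirection, i.e.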

cmp ... >/dev/null 2>&1

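Also, I believe it is the appending form &>> that was only introduced in bash 4; plain &> already works in bash 3.X, and the old-style redirection works in every version.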
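A likely explanation for the behavior in the question, assuming /bin/sh is a POSIX shell such as dash that does not implement &>: the condition is parsed as cmp $f1 $f2 & (run in the background) followed by an empty command > /dev/null, which always succeeds. The if branch therefore runs for every pair, and every file but the first is deleted. A minimal sketch of a portable check follows; files_identical is a hypothetical helper name, not part of the original script.

```bash
# files_identical is a hypothetical helper name used only for this sketch.
# It returns 0 when both files have the same content and prints nothing,
# using only redirections that every POSIX shell understands.
files_identical() {
    cmp "$1" "$2" >/dev/null 2>&1
}
```

In the original loop this would read: if files_identical "$f1" "$f2"; then rm "$f2"; fi. cmp -s is another option for suppressing the normal "files differ" output.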

IHTH

score 3 · Accepted Answer

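Credit to @kobame for this answer: this is really a comment, but posted as an answer for the formatting.

There is no need to call find twice; print both the size and the filename in a single find command: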

find . -not -empty -type f -name \*.txt -printf "%8s %p\n" |
# find the files that have duplicate sizes
sort -n | uniq -Dw 8 | 
# strip off the size and get the md5 sum
cut -c 10- | xargs md5sum

An example

$ cat a.txt
this is file a
$ cat b.txt
this is file b
$ cat c.txt
different contents 
$ cp a.txt d.txt
$ cp b.txt e.txt
$ find . -not -empty -type f -name \*.txt -printf "%8s %p\n" |
sort -n | uniq -Dw 8 | cut -c 10- | xargs md5sum

76fd4c1589ef708d9203f3cf09cfd032  ./a.txt
e2d75fd6a1080efb6230d0608b1f9014  ./b.txt
76fd4c1589ef708d9203f3cf09cfd032  ./d.txt
e2d75fd6a1080efb6230d0608b1f9014  ./e.txt

To keep one copy and delete the rest, I would pipe the output into:

...  | awk '++seen[$1] > 1 {print $2}' | xargs echo rm

rm ./d.txt ./e.txt

Remove the echo once you are satisfied with your testing.
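To see what the awk filter does on its own, here is a small sketch that feeds it the md5sum lines from the example. The function name keep_first is hypothetical. ++seen[$1] counts how often each checksum has appeared, so only the second and later files with a given checksum are printed.

```bash
# keep_first is a hypothetical name for the awk filter shown above.
# It reads "checksum  path" lines and prints the path of every repeat.
keep_first() {
    awk '++seen[$1] > 1 {print $2}'
}

printf '%s\n' \
    '76fd4c1589ef708d9203f3cf09cfd032  ./a.txt' \
    'e2d75fd6a1080efb6230d0608b1f9014  ./b.txt' \
    '76fd4c1589ef708d9203f3cf09cfd032  ./d.txt' \
    'e2d75fd6a1080efb6230d0608b1f9014  ./e.txt' | keep_first
# prints ./d.txt and ./e.txt, one per line
```

Because awk prints only $2, a path containing a space would be cut at the first space.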

As with many complex pipelines, filenames containing newlines will break it.
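If such filenames are a concern, one possible alternative is to keep every path NUL-terminated from start to finish. The sketch below is not taken from any of the answers: list_dupes is a hypothetical function, and it assumes bash 4 or later (associative arrays) plus GNU find, sort and sha256sum. It skips the size pre-filter for brevity and confirms each checksum match with cmp -s.

```bash
#!/usr/bin/env bash
# list_dupes is a hypothetical helper name used only for this sketch.
# It prints, NUL-terminated, every non-empty *.txt file whose content
# matches an earlier file (in sorted path order).
list_dupes() {
    local dir=$1 f sum
    local -A first_seen=()
    while IFS= read -r -d '' f; do
        sum=$(sha256sum < "$f") || continue
        sum=${sum%% *}                      # keep only the hex digest
        if [[ -n ${first_seen[$sum]+x} ]]; then
            # Same checksum: confirm byte-for-byte before reporting.
            if cmp -s -- "${first_seen[$sum]}" "$f"; then
                printf '%s\0' "$f"
            fi
        else
            first_seen[$sum]=$f
        fi
    done < <(find "$dir" -type f -name '*.txt' ! -empty -print0 | sort -z)
}
```

A cautious way to use it, assuming GNU xargs, is list_dupes . | xargs -0 -r echo rm -- and then dropping the echo once the listed files look right.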

score 2 · Accepted Answer

All good answers, so just one short suggestion: you can install and use

fdupes -r .

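From the man page:

Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.

Added by @Francesco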

fdupes -rf . | xargs rm -f

to delete the dupes. (-f in fdupes omits the first occurrence of each file, so only the dupes are listed.)
