bash - String comparison in bash (structured text)

Question

I need to compare two files (new.txt and old.txt) that have the following structure:

 <Field1>,<Field2>,<Field3>,<Field4>,<Field5>,<Field6>

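1. Lines common to both files must be skipped.
2. Similar lines from new.txt and old.txt should be grouped together. I consider a line in old.txt similar to a line in new.txt if Field1, Field2, Field3 and Field4 are the same.
3. All remaining unique lines should be printed below, grouped by file name.

So the end goal is to make a visual comparison easier.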

Added: an example.

$ cat old.txt 
 one,two,three,four,five,six
 un,deux,trois,quatre,cinq,six
 eins, zwei, drei, vier, fünf, sechs
$ cat new.txt 
 one,two,three,four,FIVE,SIX
 un,deux,trois,quatre,cinq,six
 en,två,tre,fyra,fem,sex

$ cat comparison_result
# Similar lines are grouped, so it is easy to spot the difference without scrolling.
old.txt> one,two,three,four,five,six
new.txt> one,two,three,four,FIVE,SIX
# End of task 2. There are no more similar lines.
#
# Start of task 3.
# Print all remaining unique lines of old.txt
echo "the rest unique line in old.txt"
eins, zwei, drei, vier, fünf, sechs
.... 
# Print all remaining unique lines of new.txt
echo "the rest unique line in new.txt"
en,två,tre,fyra,fem,sex

This could be step 1: skipping the common lines.

 # This is only in old.txt
 comm -2 -3 <(sort old.txt) <(sort new.txt) > uniq_old

 # This is only in new.txt
 comm -1 -3 <(sort old.txt) <(sort new.txt) > uniq_new

I wrote step 1 and use this sorted diff as a temporary solution:

 # The additional sort slightly improves the diff output.
 diff <(sort uniq_old) <(sort uniq_new)

It works, but it is not ideal. I gave up on diff because it starts comparing blocks and loses track of the common lines.

Is there a better way to meet the three requirements listed above?

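I think it could be done with either:

1. some improvements to these sort, diff and comm commands (adding sed/tr to temporarily "hide" the last two fields and compare the rest), or
2. awk

I suspect awk can do this better?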

score 1 · Accepted Answer

How about this?

awk -F, 'NR==FNR{old[$0];next} $0 in old{delete old[$0];next} 1; END{for(line in old) print line}' old.txt <(sort -u new.txt) | sort

Let's break it down into pieces.

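-F, - tells awk to use a comma as the field separator. The script only ever looks at $0, so this matters only if you extend it to work on individual fields.
NR==FNR{old[$0];next} - if NR (the overall record/line number) equals the line number within the current file (that is, while we are reading the first input file), store the whole line as an index of an associative array, then skip to the next record.
$0 in old{delete old[$0];next} - now we are reading the second file. If the current line is in the array, delete it from the array and move on. This addresses condition #1 in your question.
1 - awk shorthand for "print the line". This handles part of condition #3 in your question by printing the lines that are unique to the second file.
END{...} - this loop prints everything that was not deleted from the array. This handles the other part of condition #3 by printing the lines that are unique to the first file.
<(sort -u new.txt) - feeds awk a de-duplicated new.txt. If you know new.txt is already free of duplicates, you can drop this and the bash dependency that comes with it.
| sort - sorts the output, which "groups" things as per condition #2 in your question.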

Sample output:

 $ cat old.txt 
 one,two,three,four,five,six
 un,deux,trois,quatre,cinq,six
 $ cat new.txt 
 one,two,three,four,FIVE,SIX
 un,deux,trois,quatre,cinq,six
 en,två,tre,fyra,fem,sex
 $ awk -F, 'NR==FNR{old[$0];next} $0 in old{delete old[$0];next} 1; END{for(line in old) print line}' old.txt new.txt | sort
 en,två,tre,fyra,fem,sex
 one,two,three,four,FIVE,SIX
 one,two,three,four,five,six
 $

Note that the French line was a duplicate, so it was dropped. Everything else was printed, and the two English lines are "grouped" by the sort.
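The sort only groups the two English lines because they happen to sort next to each other. If explicit pairing on Field1 through Field4 is wanted, something along the following lines might work. This is a sketch, not part of the one-liner above: the function name `compare_by_key`, the `old.txt>`-style prefixes and the `# only in` headings are made up for illustration, and it assumes that each four-field key occurs at most once per file and that the first file is not empty (the NR==FNR idiom misbehaves otherwise).

```shell
#!/bin/sh
# compare_by_key OLD NEW   (hypothetical helper, not from the answer)
compare_by_key() {
    awk -F, -v oldname="$1" -v newname="$2" '
        # The "similar" test from the question: first four fields form the key.
        { key = $1 FS $2 FS $3 FS $4 }

        # First file: remember each line under its key.
        NR == FNR { old[key] = $0; next }

        # Second file, identical line: common to both, so drop it silently.
        (key in old) && old[key] == $0 { delete old[key]; next }

        # Same key but a different tail: print the pair together.
        key in old {
            print oldname "> " old[key]
            print newname "> " $0
            delete old[key]
            next
        }

        # Key unknown in the first file: unique to the second, keep for later.
        { onlynew[++n] = $0 }

        END {
            print "# only in " oldname
            for (k in old) print old[k]      # iteration order is unspecified
            print "# only in " newname
            for (i = 1; i <= n; i++) print onlynew[i]
        }
    ' "$1" "$2"
}
```

Called as `compare_by_key old.txt new.txt` on the sample files from the question, this should print the one,two,three,four pair under the `old.txt>` and `new.txt>` prefixes, then the eins line under the first heading and the en line under the second. Like the one-liner, it keeps all of the first file in memory.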

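Also note that the one-liner above suffers with very large files, because all of old.txt is loaded into memory as an array. An alternative that might work for you is: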

 $ sort old.txt new.txt | awk '$0==last{last="";next} last!=""{print last} {last=$0} END{if(last!="") print last}' | sort
 en,två,tre,fyra,fem,sex
 one,two,three,four,FIVE,SIX
 one,two,three,four,five,six
 $

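The idea here is that you simply take all the input data from both files, sort it, and then use an awk script to skip repeated lines and print everything else. The output is then sorted. As far as awk is concerned this works on a stream, but note that for very large input your sort command will still need to load the data into memory and/or temporary files.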

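Also, this second solution will fail if a particular line is repeated more than once, for example if it exists once in old.txt and twice in new.txt. You'll need to make your input files unique, or adapt the script for that situation.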
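One possible way to adapt it, offered as a sketch rather than as the definitive fix: de-duplicate each file on its own first, then keep only the lines that occur exactly once in the merged stream. The function name `unique_lines` is made up for illustration, and `LC_ALL=C` is an assumption added so that sort and uniq compare lines byte for byte instead of by locale collation.

```shell
#!/bin/sh
# unique_lines OLD NEW   (hypothetical helper, not from the answer)
# Print the lines that occur in exactly one of the two files,
# no matter how often they repeat inside that file.
unique_lines() {
    {
        LC_ALL=C sort -u "$1"    # collapse repeats inside the first file
        LC_ALL=C sort -u "$2"    # collapse repeats inside the second file
    } |
    LC_ALL=C sort |              # a shared line now appears exactly twice
    LC_ALL=C uniq -u             # keep only lines that appear once
}
```

With this, a line that exists once in old.txt and twice in new.txt is treated as common and dropped. The output no longer says which file a surviving line came from, so it covers only the "skip common lines" part of the task.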
