python - Removing lines from files based on character/word ratio - unix/bash

Question

I have two files, and I need to remove the lines that fall below a certain token ratio, e.g.

File 1:

This is a foo bar question
that is not a parallel sentence because it's too long
hello world

File 2:

c'est le foo bar question
creme bulee
bonjour tout le monde

The ratio is calculated as total no. of words in file 1 / total no. of words in file 2, and a sentence is deleted if it falls below this ratio.

The output is then a joint file in which the sentences from file 1 and file 2 are separated by a tab:

[out]:

This is a foo bar question\tc'est le foo bar question
hello world\tbonjour tout le monde

The files always have the same number of lines. I have been doing it as shown below, but how can I do the same in unix bash instead of using python?

import io

# Calculate the overall ratio: total words in file1 / total words in file2.
with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2:
    ratio = len(f1.read().split()) / float(len(f2.read().split()))
# Check each line pair and write the pairs that pass to the output file.
with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2, io.open('fileout', 'w', encoding='utf8') as fout:
    for l1, l2 in zip(f1, f2):
        if len(l1.split()) / float(len(l2.split())) > ratio:
            fout.write(u"\t".join([l1.strip(), l2.strip()]) + u"\n")

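Also, if the ratio calculation is based on characters instead of words, I can do this in python, but how do I achieve the same in unix bash? Note that the only difference is that the counts use len(str) instead of len(str.split()).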

import io

# Calculate the overall ratio: total characters in file1 / total characters in file2.
with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2:
    ratio = len(f1.read()) / float(len(f2.read()))
# Check each line pair and write the pairs that pass to the output file.
with io.open('file1', 'r', encoding='utf8') as f1, io.open('file2', 'r', encoding='utf8') as f2, io.open('fileout', 'w', encoding='utf8') as fout:
    for l1, l2 in zip(f1, f2):
        if len(l1) / float(len(l2)) > ratio:
            fout.write(u"\t".join([l1.strip(), l2.strip()]) + u"\n")

score 1 · Accepted Answer

Here is a simple ratio calculator in Awk.

awk 'NR == FNR { a[NR] = NF; next }
    { print NF/a[FNR] }' file1 file2

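This merely prints the ratio for each line. Extending it to print a line of the second file only when the ratio is within a particular range is easy.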

awk 'NR == FNR { a[NR] = NF; next }
    NF/a[FNR] >= 0.5 && NF/a[FNR] <= 2' file1 file2

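(This uses an Awk shorthand: in the general form condition { action }, if you omit the { action }, it defaults to { print }. Similarly, if you omit the condition, the action is taken unconditionally.)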
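As a quick illustration of that shorthand, here is a minimal sketch on made-up inline input. The sample text and the function names are assumptions for the demo, not part of the question's data.

```shell
# Hypothetical sample input: three lines with 3, 1, and 2 fields.
sample() { printf 'one two three\nsingle\nfour five\n'; }

# Pattern only: the action defaults to { print }.
pattern_only() { sample | awk 'NF >= 2'; }

# The same filter with the action spelled out.
pattern_and_action() { sample | awk 'NF >= 2 { print }'; }

# Action only: it runs for every line, here printing each field count.
action_only() { sample | awk '{ print NF }'; }
```

Calling pattern_only and pattern_and_action should print the same two lines (the ones with at least two fields), while action_only prints one field count per input line.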

You could run a second pass over file1 to do the same thing, or just run it again with the file names reversed.

Oh, wait, here is a complete solution.

awk 'NR == FNR { a[NR] = NF; w[NR] = $0; next }
    NF/a[FNR] >= 0.5 && NF/a[FNR] <= 2 { print w[FNR] "\t" $0 }' file1 file2
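For the character-based variant the question asks about, one possible sketch swaps NF for length($0). The helper name filter_by_chars and the default bounds 0.5 and 2 are assumptions carried over from the example above, not something the question specifies. Note that some Awk implementations count bytes rather than characters in length(), so results on non-ASCII text may differ between implementations.

```shell
# Sketch: same structure as the word-based filter, but comparing line lengths.
# filter_by_chars FILE1 FILE2 [LO] [HI] is a hypothetical helper;
# LO and HI default to 0.5 and 2.
filter_by_chars() {
    awk -v lo="${3:-0.5}" -v hi="${4:-2}" '
        # First file: remember each line and its length.
        NR == FNR { len[NR] = length($0); text[NR] = $0; next }
        # Second file: skip pairs whose first line is empty (no division by zero).
        len[FNR] > 0 {
            r = length($0) / len[FNR]
            if (r >= lo && r <= hi) print text[FNR] "\t" $0
        }' "$1" "$2"
}
```

It would be called as filter_by_chars file1 file2 > fileout, optionally with different bounds as the third and fourth arguments.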

score 1 · Accepted Answer

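Tripleee's comment about bash not being suited to non-integers is right, but if you really want to do it in bash, this should get you started. You can do it with the program wc and its -w argument, which counts words. bc does the floating-point division.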

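while IFS= read -r line1 <&3 && IFS= read -r line2 <&4; do
    # Count the words on each line.
    line1_count=$(printf '%s\n' "$line1" | wc -w)
    line2_count=$(printf '%s\n' "$line2" | wc -w)
    # bc -l does the floating-point division.
    ratio=$(echo "$line1_count / $line2_count" | bc -l)
    echo "$ratio"
done 3<file1 4<file2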

Also, see the section on relational expressions in man bc. That should let you compare the ratio against a threshold.
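A rough sketch of that comparison follows. The helpers in_range and filter_pairs are hypothetical names, and the bounds 0.5 and 2 are assumed defaults. The relational operators are placed inside bc if statements, since that is where POSIX bc allows them; GNU bc is more permissive.

```shell
# in_range RATIO LO HI: succeeds when LO <= RATIO <= HI.
# bc prints 1 only when both nested "if" tests pass.
in_range() {
    [ "$(echo "if ($1 >= $2) if ($1 <= $3) 1" | bc -l)" = "1" ]
}

# filter_pairs FILE1 FILE2 [LO] [HI]: print tab-joined line pairs whose
# word-count ratio (FILE1 words / FILE2 words) lies within [LO, HI].
filter_pairs() {
    while IFS= read -r line1 <&3 && IFS= read -r line2 <&4; do
        # $(( ... )) strips any padding that wc puts around the number.
        n1=$(( $(printf '%s\n' "$line1" | wc -w) ))
        n2=$(( $(printf '%s\n' "$line2" | wc -w) ))
        [ "$n2" -gt 0 ] || continue    # avoid dividing by zero
        ratio=$(echo "$n1 / $n2" | bc -l)
        if in_range "$ratio" "${3:-0.5}" "${4:-2}"; then
            printf '%s\t%s\n' "$line1" "$line2"
        fi
    done 3<"$1" 4<"$2"
}
```

Something like filter_pairs file1 file2 > fileout would then write the joined file. Spawning wc and bc for every line is slow on large inputs, which is one reason the Awk approach is usually preferable.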
