linux - Shell script to read a list of words and count their occurrences in a corpus

Question

I need to write a command-line script on Linux that does the following:

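Read a list of words from a text file (one word per line); call each word w_i.
For each w_i, count its occurrences in a different text file.
Sum these counts.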

Any help would be much appreciated!

score 2 · Accepted Answer

Here is an awk one-liner that prints each word's count and the total:

awk 'NR==FNR{w[$1];next}{for(i=1;i<=NF;i++)if($i in w)w[$i]++}END{for(k in w){print k,w[k];s+=w[k]}print "Total",s}' file1 file2
hello 13
foo 20
world 13
baz
bar 20
Total 66

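Note: this uses Kent's example input (from the grep answer). baz is printed without a count because it never occurs in file2, so its array entry is never incremented and stays empty.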

A more readable script version, which uses int() so that words with no occurrences print as 0:

BEGIN {
    OFS="\t"                              # Space the output with a tab 
}
NR==FNR {                                 # Only true in file1
    word_count[$1]                        # Build keys for all words           
    next                                  # Get next line
}
{                                         # In file2 here
    for(i=1;i<=NF;i++)                    # For each word on the current line
        if($i in word_count)              # If the word has a key in the array
            word_count[$i]++              # Increment the count
}
END {                                     # After all files have been read
    for (word in word_count) {            # For each word in the array
        print word,int(word_count[word])  # Print the word and the count
        sum+=word_count[word]             # Sum the values
    }
    print "Total",sum                     # Print the total
}

Save it as script.awk and run it as follows:

$ awk -f script.awk file1 file2
hello   13
foo     20
world   13
baz     0
bar     20
Total   66

score 2 · Accepted Answer

This grep line may work for you; give it a try:

 grep -oFwf wordlist textfile|wc -l

I just ran this small test, and it seems to work the way you expect.

(P.S. I used vim to insert these words into file2, so I know how many I inserted.)

kent$  head file1 file2
==> file1 <==
foo
bar
baz
hello
world

==> file2 <==
 foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo foo bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar bar
 hello world hello world hello world hello world hello world hello world hello world hello world hello world hello world hello world hello world hello world 
blah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo bablah bbbb fo ba 

kent$  grep -oFwf file1 file2|wc -l
66
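
The grep pipeline above prints only the grand total. If per-word counts are wanted as well, one possible variant (a sketch, not part of the original answer) saves the matches and runs them through sort and uniq -c. The file names wordlist.txt, corpus.txt and matches.txt and the sample content are made up for illustration. Words that never occur in the corpus do not appear in the per-word output.

```shell
# Hypothetical sample data in a temporary directory (names are assumptions).
tmpdir=$(mktemp -d)
printf '%s\n' foo bar baz > "$tmpdir/wordlist.txt"
printf '%s\n' 'foo bar foo' 'bar foo qux' > "$tmpdir/corpus.txt"

# -o prints each match on its own line, -F treats patterns as fixed strings,
# -w matches whole words only, -f reads the patterns from the word list.
# grep exits non-zero when nothing matches, hence the "|| true".
grep -oFwf "$tmpdir/wordlist.txt" "$tmpdir/corpus.txt" > "$tmpdir/matches.txt" || true

# Per-word counts: sort groups identical matches, uniq -c counts each group,
# awk swaps the columns so the word comes first.
per_word=$(sort "$tmpdir/matches.txt" | uniq -c | awk '{print $2, $1}')

# Grand total: one line per match, so the line count is the total.
total=$(wc -l < "$tmpdir/matches.txt" | tr -d ' ')

printf '%s\n' "$per_word"
printf 'Total %s\n' "$total"
rm -rf "$tmpdir"
```

With this sample data, foo is matched three times and bar twice, while baz is absent from the per-word list because it never matches.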

score 1 · Accepted Answer

Assuming you have a file named file containing one word per line, and a file named corpus, you can use the following command:

$ cat file | xargs -I% sh -c '{ echo "%\c"; grep -o "%" corpus | wc -l; }' | \
  tee /dev/tty | awk '{ sum+=$2} END {print "Total " sum}'

For example, for file:

car
plane
bike

And for corpus:

car is a plane is on a car
or in the car via a plane
plane plane
car

The output will be:

$ cat file | xargs -I% sh -c '{ echo "%\c"; grep -o "%" corpus | wc -l; }' | \
  tee /dev/tty | awk '{ sum+=$2} END {print "Total " sum}'
car       4
plane       4
bike       0
Total 8
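
The xargs command above depends on echo interpreting \c, which not every shell's echo does, and grep -o "%" also counts substrings (for example car inside cart). A more portable sketch, offered as an alternative rather than as the original author's method, reads the list with while read, prints with printf, and matches whole fixed strings with grep -owF. The sample files mirror the example above; the variable names are chosen for illustration.

```shell
# Recreate the example files from above in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' car plane bike > "$tmpdir/file"
printf '%s\n' 'car is a plane is on a car' 'or in the car via a plane' \
    'plane plane' 'car' > "$tmpdir/corpus"

total=0
report=""
# Read the word list line by line; the redirect keeps the loop in the
# current shell, so total and report survive after the loop ends.
while IFS= read -r word; do
    [ -n "$word" ] || continue          # skip empty lines
    # -o: one match per line, -w: whole words, -F: fixed string,
    # -e: protects words that start with a dash.
    count=$(grep -owF -e "$word" "$tmpdir/corpus" | wc -l | tr -d ' ')
    report="${report}${word} ${count}
"
    total=$((total + count))
done < "$tmpdir/file"

printf '%s' "$report"
printf 'Total %s\n' "$total"
rm -rf "$tmpdir"
```

On the sample data this reports the same counts as the output shown above (car 4, plane 4, bike 0, total 8), separated by a single space.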
