linux - Using grep to filter out words from a stop-word file

Question

I want to use grep with a stop-word file to filter common English words out of another file. The file "somefile" contains one word per line.

cat somefile | grep -v -f stopwords

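The problem with this approach is that it checks whether the words in stopwords occur inside the lines of somefile, but I want the opposite: to check whether the words in somefile occur in stopwords.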

How can I do this?

Example

somefile contains the following:

hello
o
orange

stopwords contains the following:

o

I only want to filter the word "o" out of somefile, not hello and orange.
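
The failure described above can be reproduced in a scratch directory. This is only a minimal sketch: the temporary directory is illustrative, and it assumes the stopwords file holds the single line o.

```shell
# Recreate the example files in a throwaway directory (illustrative paths).
workdir=$(mktemp -d)
printf '%s\n' hello o orange > "$workdir/somefile"
printf '%s\n' o > "$workdir/stopwords"   # assumption: the only stop word is "o"

# Each line of stopwords is a pattern that may match anywhere in a line,
# so every word that merely contains the letter "o" is dropped as well.
# grep exits with status 1 when it selects no lines, hence the "|| true".
naive=$(grep -v -f "$workdir/stopwords" "$workdir/somefile" || true)
printf 'naive result: [%s]\n' "$naive"

rm -rf "$workdir"
```

With these inputs nothing survives the filter, because hello, o and orange all contain an "o".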

score 14 · Accepted Answer

I thought about it some more and found a solution...

Use grep's -w switch to match whole words only:

grep -v -w -f stopwords somefile
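
The effect of -w can be checked with a small sketch like the one below. The extra entry o-ring and the -x -F variant are not part of the answer; they are added here as assumptions to illustrate that -w treats punctuation as a word boundary, whereas -x (whole-line match) combined with -F (fixed strings) only removes lines that equal a stop word exactly.

```shell
# Scratch files; "o-ring" is a hypothetical extra entry added for illustration.
workdir=$(mktemp -d)
printf '%s\n' hello o orange o-ring > "$workdir/somefile"
printf '%s\n' o > "$workdir/stopwords"

# -w: a pattern only matches when it forms a whole word in the line.
# "o-ring" is dropped too, because "-" is not a word character.
by_word=$(grep -v -w -f "$workdir/stopwords" "$workdir/somefile")

# -x: a pattern must match the entire line; -F: patterns are literal strings.
by_line=$(grep -v -x -F -f "$workdir/stopwords" "$workdir/somefile")

printf 'with -w:\n%s\n\nwith -x -F:\n%s\n' "$by_word" "$by_line"

rm -rf "$workdir"
```

For a file that really holds one plain word per line, both variants give the same result; they differ only for entries containing non-word characters.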

score 5 · Accepted Answer

Suppose you have a stop-word file /tmp/words:

in
the

You can generate a sed program from it like this:

sed 's|^|s/\\<|; s|$|\\>/[CENSORED]/g;|' /tmp/words > /tmp/words.sed

This gives you /tmp/words.sed:

s/\<in\>/[CENSORED]/g;
s/\<the\>/[CENSORED]/g;

Then use it to censor any text file:

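sed -E -f /tmp/words.sed /input/file/to/filter.txt > /censored/output.txt

The -E option tells sed to use extended regular expressions. The \< and \> word-boundary anchors that the generated program relies on are a GNU sed extension. Of course, you can change [CENSORED] to any other string, or to an empty string, if you prefer.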

This solution handles files with several words per line as well as files with one word per line.
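
The whole flow can be tried end to end with a sketch like the following. It assumes GNU sed (for the \< and \> anchors), invokes the generated program with sed -E -f, and uses a temporary directory and a made-up input sentence in place of the /tmp and /input paths above.

```shell
# Scratch directory standing in for the /tmp paths used above.
workdir=$(mktemp -d)
printf '%s\n' in the > "$workdir/words"
# Hypothetical input: several words per line, plus "tin" and "then",
# which contain stop words but must not be censored.
printf '%s\n' 'the cat sat in the tin' 'then' > "$workdir/input.txt"

# Turn every stop word into one substitution command of the sed program.
sed 's|^|s/\\<|; s|$|\\>/[CENSORED]/g;|' "$workdir/words" > "$workdir/words.sed"
program=$(cat "$workdir/words.sed")

# Apply the generated program to the input (GNU sed assumed).
censored=$(sed -E -f "$workdir/words.sed" "$workdir/input.txt")

printf 'generated program:\n%s\n\ncensored text:\n%s\n' "$program" "$censored"

rm -rf "$workdir"
```

Only the standalone words in and the are replaced; tin and then are left alone because the anchors require a word boundary on both sides.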
