unix - Using sed to remove the words in a stopword list (giving sed a list of arguments to delete from a text file)

Question

So, we all know that sed is great at finding and replacing every occurrence of a word in a file:

sed -i 's/original_word/new_word/g' file.txt

But can someone tell me how to give sed the list of "original_words" from a file (similar to grep -f)? I just want to replace all of them with '' (that is, delete them).

The word list file is just a bunch of stopwords, one per line (wordlist.txt):

a
about
above
according
across
after
afterwards

This would be a simple way to take a stopword list and remove its words from a corpus (useful for cleaning data).

file.txt looks like this:

05ricardo   RT @shakira: Immigration reform isn't about politics. It's about people mothers, kids. Obama is working for all of them. http://t.co/rAW ...    0
05ricardo   ?@ItsReginaG: Don't vote Obama. Because you will lose jobs, and die.? Lol   0
05ricardo   ?@shakira: Obama doubles Pell Grants - 700,000 more Latinos get help to go to college. Meet Johanny Adames http://t.co/EMg8NLGl Shak?. ?    -1
05rodriguez_a   My Comm teacher gave me a copy of Obama's speech that he gave the other night and I cried while reading it. It was that moving. -3

score 2 · Accepted Answer

You can also have sed write the sed script for you (tested with GNU sed):

<stopwords sed 's:.*:s/\\b&\\b//:g' | sed -f - file.txt

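Output (as written, the trailing g belongs to the outer command, so each generated command removes only the first match on a line; write s:.*:s/\\b&\\b//g: instead to remove every occurrence):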

05ricardo   RT @shakira: Immigration reform isn't  politics. It's about people mothers, kids. Obama is working for all of them. http://t.co/rAW ...    0
05ricardo   ?@ItsReginaG: Don't vote Obama. Because you will lose jobs, and die.? Lol   0
05ricardo   ?@shakira: Obama doubles Pell Grants - 700,000 more Latinos get help to go to college. Meet Johanny Adames http://t.co/EMg8NLGl Shak?. ?    -1
05rodriguez_a   My Comm teacher gave me  copy of Obama's speech that he gave the other night and I cried while reading it. It was that moving. -3
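
To see what the second sed actually receives, it can help to print the generated script before applying it. The following is a minimal sketch, assuming GNU sed (both `\b` and `-f -` are used as in the answer) and made-up temporary files and sample text. Unlike the command above, it places the `g` inside the generated commands so that every occurrence on a line is removed.

```shell
# Sketch only: assumes GNU sed; the file names and sample text are hypothetical.
tmp=$(mktemp -d)
printf '%s\n' a about above > "$tmp/stopwords"
printf '%s\n' "It's about a plan, not about politics." > "$tmp/sample.txt"

# Step 1: turn each stopword into its own substitution command.
script=$(sed 's:.*:s/\\b&\\b//g:' "$tmp/stopwords")
printf '%s\n' "$script"

# Step 2: feed the generated script to a second sed on stdin.
result=$(printf '%s\n' "$script" | sed -f - "$tmp/sample.txt")
printf '%s\n' "$result"

rm -r "$tmp"
```

The spaces that surrounded each removed word are left in place, so gaps of two or more spaces remain in the result.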

score 1 · Accepted Answer

Here's one way using GNU sed:

while IFS= read -r word; do sed -ri "s/( |)\b$word\b//g" file; done < wordlist

Contents of file:

how about I decide to look at it afterwards. What
across do you think? Is it a good idea to go out and about? I 
think I'd rather go up and above.

Results:

how I decide to look at it. What
 do you think? Is it good idea to go out and? I 
think I'd rather go up and.
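
The loop above rewrites the file once per stopword. One possible variation is to join the list into a single alternation and run sed only once. This is a sketch under stated assumptions rather than part of the original answer: it assumes GNU sed, uses hypothetical temporary files, and assumes the stopwords contain no regex metacharacters or slashes.

```shell
# Sketch only: assumes GNU sed and stopwords free of regex metacharacters.
tmp=$(mktemp -d)
printf '%s\n' a about afterwards > "$tmp/wordlist"
printf '%s\n' 'how about I decide to look at it afterwards. What' > "$tmp/file"

# Join the stopwords with "|" to form one extended-regex alternation.
pattern=$(paste -sd '|' "$tmp/wordlist")

# One pass: remove an optional leading space plus any whole-word match.
result=$(sed -E "s/( |)\b($pattern)\b//g" "$tmp/file")
printf '%s\n' "$result"

rm -r "$tmp"
```

For a very long word list the single pattern can become large, in which case generating a script file for sed -f may be the more practical route.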

score 1 · Accepted Answer

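First, not all seds support -i, but it is not a necessary option, since providing that functionality in a general way is trivial. One simple option (assuming a shell outside the csh family):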

inline() { f=$1; shift; "$@" < "$f" > "$f.out" && mv "$f.out" "$f"; }
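
As a quick illustration of how such a helper behaves, the sketch below runs it with tr on a throwaway file. The temporary file and the choice of tr are assumptions made for the example only. The helper is repeated here, with its variables quoted, so that the block runs on its own.

```shell
# Same helper as above, with variables quoted for file names containing spaces.
inline() { f=$1; shift; "$@" < "$f" > "$f.out" && mv "$f.out" "$f"; }

# Hypothetical demo file.
tmp=$(mktemp -d)
printf '%s\n' 'hello world' > "$tmp/demo.txt"

# Run "tr" with the file on stdin, then replace the file with its output.
inline "$tmp/demo.txt" tr 'a-z' 'A-Z'
result=$(cat "$tmp/demo.txt")
printf '%s\n' "$result"

rm -r "$tmp"
```

The file is replaced only if the command succeeds, because of the && before mv.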

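Then, doing the replacement is straightforward with either awk or perl. You haven't specified how you want to handle word separators, so if "foo" is in the blacklist, "bar foo baz" will end up with two spaces between "bar" and "baz". Note also that both commands treat each entry as a pattern that can match anywhere, so a short entry such as "a" is also removed from inside longer words: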

awk 'NR==FNR{a[$0]; next} {for( i in a ) gsub( i, "" )} 1' original-words file.txt
perl -wne 'if( @ARGV ){ chomp; push @no, $_; next }
    foreach $x( @no ) {s/$x//g } print ' original-words file.txt

If you are happy with the results, use -i with perl (not all seds support -i, but every perl > 5.0 does), or you can modify the file with:

inline file.txt awk 'NR==FNR{a[$0]; next} 
    {for( i in a ) gsub( i, "" )} 1' original-words -

Either of these solutions will be much faster than invoking sed once for each word in the blacklist.
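
The awk command above removes a blacklisted string wherever it appears in a line. As a rough sketch of a whole-token alternative (not part of the original answer), the variant below compares each whitespace-separated field against the list and keeps only the fields that are not in it. The temporary files and sample text are hypothetical. It has two limitations: runs of whitespace, including tabs, collapse to single spaces, and a word with attached punctuation such as "about?" is not treated as a match.

```shell
# Sketch only: whole-field matching with awk; file names are made up.
tmp=$(mktemp -d)
printf '%s\n' a about > "$tmp/original-words"
printf '%s\n' 'a tale about a banana' > "$tmp/sample.txt"

# First file: remember each stopword as an array key.
# Second file: rebuild each line from the fields that are not stopwords.
result=$(awk 'NR==FNR { stop[$0]; next }
  {
    out = ""
    for (i = 1; i <= NF; i++) {
      if (!($i in stop)) {
        sep = (out == "" ? "" : " ")
        out = out sep $i
      }
    }
    print out
  }' "$tmp/original-words" "$tmp/sample.txt")
printf '%s\n' "$result"

rm -r "$tmp"
```

Because the lookup is an array membership test rather than a regex, characters such as "." or "*" in the word list are taken literally here.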

score 0 · Accepted Answer

Maybe this:

#!/bin/sh
while IFS= read -r k
do
  sed -i "s/$k//g" file.txt
done < dict.txt

answered 2013-02-07T05:58:22.380

score -1 · Accepted Answer

cat file.txt | grep  -vf wordlist.txt

answered 2014-05-21T16:18:50.380
