3

So, we all know sed is great at finding and replacing every occurrence of a word in a file:

sed -i 's/original_word/new_word/g' file.txt

But can someone tell me how to feed sed a list of "original_words" from a file (similar to grep -f)? I just want to replace all of them with '' (that is, delete them).

The word list file is just a bunch of stop words, one per line (wordlist.txt):

a
about
above
according
across
after
afterwards

This would be a simple way to take a list of stop words and remove them from a corpus (useful for cleaning data).

file.txt looks like

05ricardo   RT @shakira: Immigration reform isn't about politics. It's about people mothers, kids. Obama is working for all of them. http://t.co/rAW ...    0
05ricardo   ?@ItsReginaG: Don't vote Obama. Because you will lose jobs, and die.? Lol   0
05ricardo   ?@shakira: Obama doubles Pell Grants - 700,000 more Latinos get help to go to college. Meet Johanny Adames http://t.co/EMg8NLGl Shak?. ?    -1
05rodriguez_a   My Comm teacher gave me a copy of Obama's speech that he gave the other night and I cried while reading it. It was that moving. -3
4

5 Answers

2

You can also have sed write the sed script for you (tested with GNU sed):

<wordlist.txt sed 's:.*:s/\\b&\\b//g:' | sed -f - file.txt

Output:

05ricardo   RT @shakira: Immigration reform isn't  politics. It's  people mothers, kids. Obama is working for all of them. http://t.co/rAW ...    0
05ricardo   ?@ItsReginaG: Don't vote Obama. Because you will lose jobs, and die.? Lol   0
05ricardo   ?@shakira: Obama doubles Pell Grants - 700,000 more Latinos get help to go to college. Meet Johanny Adames http://t.co/EMg8NLGl Shak?. ?    -1
05rodriguez_a   My Comm teacher gave me  copy of Obama's speech that he gave the other night and I cried while reading it. It was that moving. -3
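
The generated script pastes each stop word straight into a sed pattern, so an entry containing `.`, `*` or `/` would be misread. Here is a minimal sketch of a more defensive variant, assuming GNU sed (for `\b` and `-f -`) and one word per line; the function name `strip_words` is made up for illustration and is not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper: build a sed script from a word list, then apply it.
# Assumes GNU sed (for \b and for reading the script from stdin via -f -).
strip_words() {
    wordlist=$1
    target=$2
    # 1. Skip blank lines in the word list.
    # 2. Backslash-escape characters that are special in a basic regex,
    #    plus "/" which would otherwise end the generated s/// command.
    # 3. Wrap every word as s/\bWORD\b//g so all occurrences are removed.
    sed -e '/^$/d' \
        -e 's|[][\\/.*^$]|\\&|g' \
        -e 's|.*|s/\\b&\\b//g|' "$wordlist" |
        sed -f - "$target"
}
```

Calling `strip_words wordlist.txt file.txt` prints the cleaned text to standard output and leaves file.txt untouched.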
answered 2013-02-07T11:31:13.037
1

Here's one way using GNU sed:

while IFS= read -r word; do sed -ri "s/( |)\b$word\b//g" file; done < wordlist

Contents of file:

how about I decide to look at it afterwards. What
across do you think? Is it a good idea to go out and about? I 
think I'd rather go up and above.

Results:

how I decide to look at it. What
 do you think? Is it good idea to go out and? I 
think I'd rather go up and.
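
The loop above runs sed once per stop word, rewriting the file each time. As a rough sketch of a single-pass alternative, the list can be joined into one alternation and handed to a single sed call. This assumes GNU sed (`-E` and `\b`) and a word list free of regex metacharacters; `strip_words_once` is a hypothetical name, not something from the answer.

```shell
#!/bin/sh
# Hypothetical single-pass variant of the loop: one sed call for all words.
# Assumes GNU sed (-E, \b) and plain alphabetic stop words.
strip_words_once() {
    wordlist=$1
    target=$2
    # paste -s joins every line of the list, using "|" as the separator,
    # giving an alternation such as a|about|above.
    pattern=$(paste -sd '|' "$wordlist")
    # Remove each listed word together with one preceding space, if present.
    sed -E "s/( |)\b($pattern)\b//g" "$target"
}
```

This version writes to standard output; adding `-i` to the sed call would edit the file in place, as the loop does.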
answered 2013-02-07T06:03:21.003
1

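First, not every sed supports -i, but the option is not essential, since providing that functionality generically is trivial. One simple option (assuming a shell outside the csh family):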

inline() { f=$1; shift; "$@" < "$f" > "$f.out" && mv "$f.out" "$f"; }

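Then, doing the replacement (you have not specified how you want word separators handled, so if "foo" is in the blacklist, "bar foo baz" will end up with two spaces between "bar" and "baz") is very simple with either awk or perl: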

awk 'NR==FNR{a[$0]; next} {for( i in a ) gsub( i, "" )} 1' original-words file.txt
perl -wne 'if( $ARGV = $ARGV[0] ){ chomp; push @no, $_; next }
    foreach $x( @no ) {s/$x//g } print ' original-words file.txt

If you are happy with the results, use -i with perl (not every sed supports -i, but every perl > 5.0 does), or modify the file in place with:

inline file.txt awk 'NR==FNR{a[$0]; next} 
    {for( i in a ) gsub( i, "" )} 1' original-words -

Either of these solutions will be much faster than invoking sed once for every word in the blacklist.
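
Note that the awk and perl commands above treat each blacklist entry as a pattern matched anywhere in the line, so an entry such as "a" also strips that letter from inside other words. A minimal sketch of a stricter variant follows, assuming that a "word" means a whitespace-separated token; `strip_fields` is a hypothetical name. It collapses runs of whitespace to single spaces and does not match tokens with attached punctuation (for example "about,").

```shell
#!/bin/sh
# Hypothetical token-based filter: drop fields that exactly equal a stop word.
# Assumes words are separated by whitespace; output is re-joined with
# single spaces, so the original spacing is not preserved.
strip_fields() {
    awk 'NR == FNR { stop[$0]; next }      # first file: remember stop words
         {
             out = ""
             for (i = 1; i <= NF; i++)
                 if (!($i in stop))        # keep only tokens not in the list
                     out = out (out == "" ? "" : " ") $i
             print out
         }' "$1" "$2"
}
```

Usage would be `strip_fields original-words file.txt`, with the result going to standard output.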

answered 2013-02-07T10:40:15.613
0

Perhaps this:

#!/bin/sh
# Delete every occurrence of each dictionary entry, one sed pass per entry.
while IFS= read -r k
do
  sed -i "s/$k//g" file.txt
done < dict.txt
answered 2013-02-07T05:58:22.380
-1
grep -vf wordlist.txt file.txt
answered 2014-05-21T16:18:50.380