4

I have x lines like this:

Unable to find latest released revision of 'CONTRIB_046578'.   

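I need to extract the word between revision of ' and ' (CONTRIB_046578 in this example) and, if possible, count the number of occurrences of that word using grep, sed, or any other command.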


6 Answers

8

The cleanest solution is grep -Po "(?<=')[^']+(?=')"

$ cat file
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'foo'
Unable to find latest released revision of 'bar'
Unable to find latest released revision of 'CONTRIB_046578'

# Print occurrences
$ grep -Po "(?<=')[^']+(?=')" file
CONTRIB_046578
foo
bar
CONTRIB_046578

# Count matching lines (equals the number of occurrences when each line has one quoted word)
$ grep -Pc "(?<=')[^']+(?=')" file
4

# Count unique occurrences 
$ grep -Po "(?<=')[^']+(?=')" file | sort | uniq -c 
2 CONTRIB_046578
1 bar
1 foo
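
The -P option needs a GNU grep built with PCRE support, and -c counts matching lines rather than individual matches. A minimal sketch of a variant that avoids both, assuming a grep that supports -o (GNU and BSD greps do, although it is not in POSIX); the helper names quoted_words and count_quoted_words are hypothetical:

```shell
# quoted_words FILE: print every single-quoted word in FILE, one per line.
# (hypothetical helper name, not part of the original answer)
quoted_words() {
    # -o prints each match on its own line; tr then strips the quotes.
    grep -o "'[^'][^']*'" "$1" | tr -d "'"
}

# count_quoted_words FILE: total number of quoted words,
# even when a single line holds more than one.
count_quoted_words() {
    # tr removes the padding some wc implementations add.
    quoted_words "$1" | wc -l | tr -d ' '
}
```

Calling quoted_words file | sort | uniq -c then gives the per-word counts in the same way as the pipeline above.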
answered 2012-12-21T13:37:31.570
1

Here is an awk script that extracts each single-quoted word and counts how often it occurs:

awk '{for (i=1; i<=NF; i++) {if ($i ~ /^'"'.*?'"'/ ) cnt[$i]++;}} 
      END {for (a in cnt) {b=a; gsub(/'"'"'/, "", b); print b, cnt[a]}}' infile

Test:

cat infile
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'

Output:

 awk '{for (i=1; i<=NF; i++) {if ($i ~ /^'"'.*?'"'/ ) cnt[$i]++;}} 
      END {for (a in cnt) {b=a; gsub(/'"'"'/, "", b); print b, cnt[a]}}' infile

CONTRIB_046579 3
CONTRIB_046578 1
CONTRIB_046570 1
CONTRIB_046572 2
answered 2012-12-21T13:26:40.647
1

All you need is a very simple awk script to count the occurrences of whatever sits between the quotes:

awk -F\' '{c[$2]++} END{for (w in c) print w,c[w]}' file

Using @anubhava's test input file:

$ cat file
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'
$
$ awk -F\' '{c[$2]++} END{for (w in c) print w,c[w]}' file
CONTRIB_046578 1
CONTRIB_046579 3
CONTRIB_046570 1
CONTRIB_046572 2
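
The one-liner above tallies $2 on every line, so a line without quotes would be counted under an empty key, and for (w in c) visits keys in an unspecified order. A minimal sketch that guards against both, assuming lines without a quoted word should simply be skipped; count_by_quote is a hypothetical helper name:

```shell
# count_by_quote FILE: print "word count" pairs, sorted by word.
# (hypothetical helper name, not part of the original answer)
count_by_quote() {
    # With ' as the field separator, a line containing a quoted word has
    # at least 3 fields: the text before, the word itself, the text after.
    awk -F"'" 'NF >= 3 { c[$2]++ } END { for (w in c) print w, c[w] }' "$1" | sort
}
```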
answered 2012-12-21T14:20:58.780
0

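Assumptions:

  • Each word can appear more than once, and the OP wants a count of occurrences per word.
  • There are no other lines in the file.

Input file: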

$ cat test.txt 
Unable to find latest released revision of 'CONTRIB_046578'.
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046579'.
Unable to find latest released revision of 'CONTRIB_046570'.
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046578'.

Shell command to filter and count the words:

$ sed "s/.*'\(.*\)'.*/\1/" test.txt | sort | uniq -c
  1 CONTRIB_046570
  2 CONTRIB_046572
  2 CONTRIB_046578
  1 CONTRIB_046579
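
The second assumption matters because sed prints lines that do not match the pattern unchanged, so any stray line would be counted as if it were a word. A minimal sketch that relaxes it by using -n together with the p flag, so only lines where the substitution succeeded are printed; count_sed is a hypothetical helper name:

```shell
# count_sed FILE: count each single-quoted word, ignoring lines without one.
# (hypothetical helper name, not part of the original answer)
count_sed() {
    # -n suppresses the default output; the trailing p prints only lines
    # where the substitution succeeded, i.e. lines with a quoted word.
    sed -n "s/.*'\([^']*\)'.*/\1/p" "$1" | sort | uniq -c
}
```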
answered 2012-12-21T13:20:55.453
0
sed "s/.*'\([^']*\)'.*/\1/" myfile.txt
answered 2012-12-21T13:22:55.213
0

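If the test file below is representative of the file in the actual problem, then the following may be useful.

Given that every line in the test file is homogeneous, i.e. well-formed and containing 8 columns (or fields), a convenient solution using the cut command is as follows:

File: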

Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'

Code:

cut -d ' ' -f 8 file | tr -d "'" | sort | uniq -c

Output:

1 CONTRIB_046570
2 CONTRIB_046572
1 CONTRIB_046578
3 CONTRIB_046579

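Note on the cut code: the default delimiter cut uses to separate fields is a tab, but since the fields here are separated by a single space, we specify the option -d ' '. The rest of the code is similar to the other answers, so I won't repeat what has already been said.

General note: if the file is not well-formed, as mentioned above, this code may not produce the desired output.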
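
One way to check that caveat before trusting the cut pipeline is to list the lines that do not have exactly 8 space-separated fields. A minimal sketch, assuming the 8-field layout of the test file; malformed_lines is a hypothetical helper name:

```shell
# malformed_lines FILE: print "line number: content" for every line that
# does not have exactly 8 fields. (hypothetical helper name)
malformed_lines() {
    # -F'[ ]' splits on every single space, as cut -d ' ' does,
    # instead of awk's default splitting on runs of whitespace.
    awk -F'[ ]' 'NF != 8 { print NR ": " $0 }' "$1"
}
```

Empty output means every line has the expected shape, so field 8 is the quoted word on each of them.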

answered 2014-01-12T10:52:47.133