1

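Using sed and any basic commands, I am trying to count the number of words in each individual passage of a file that contains many passages. Each passage starts with a specific number, and the numbers increase. Example: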

0:1.1 This is the first passage...

0:1.2 This is the second passage...

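The difficulty is that each passage is a word-wrapped paragraph spanning several lines, not a single line. If each passage were on a single line, I could count the words in it. How can I do this? Thanks for your help.

I did figure out how to count the passages:

grep '[0-9]:[0-9]' file | wc -l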

4

4 Answers

1

This might work for you (GNU sed):

sed -r ':a;$bb;N;/\n[0-9]+:[0-9]+\.[0-9]+/!s/\n/ /g;ta;:b;h;s/\n.*//;s/([0-9]+:[0-9]+\.[0-9]+)(.*)/echo "\1 = $(wc -w <<<"\2")"/ep;g;D' file

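It joins each section into a single line (newlines are replaced with spaces), then counts the words in the section, excluding the section number.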
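Where GNU sed is not available, the same two-step idea (join each section into one line, then count the words minus the section number) can be sketched with POSIX awk. This is only a minimal sketch of the idea, not the sed command above; the helper names join_passages and count_joined and the sample text are assumptions made for illustration.

```shell
#!/bin/sh
# join_passages (hypothetical helper): print each passage on a single line.
# A passage is assumed to start with a number such as 0:1.1 at the start of a line.
join_passages() {
  awk '
    /^[0-9]+:[0-9]+\.[0-9]+/ {          # a new passage begins
      if (line != "") print line        # flush the previous passage
      line = $0
      next
    }
    NF { line = line " " $0 }           # append non-blank continuation lines
    END { if (line != "") print line }  # flush the last passage
  ' "$@"
}

# count_joined (hypothetical helper): one passage per input line;
# the first field is the passage number, so it is not counted.
count_joined() {
  awk '{ print $1 " = " (NF - 1) }'
}

# Illustrative sample input.
sample='0:1.1 I am le passage one.
There are many words in me.

0:1.2 I am le passage two.
One two three four five six
Seven'

printf '%s\n' "$sample" | join_passages | count_joined
```

To run it on a real file, something like join_passages file | count_joined should work, since the helper passes its arguments on to awk.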

Answered 2012-11-04T09:47:13.073
1

This awk solution might work for you:

awk '/^[0-9]:[0-9]\.[0-9]/{ 
       if (pass_num) printf "%s, word count: %i\n", pass_num, word_count
       pass_num=$1
       word_count=-1
     }
     { word_count+=NF }
     END { printf "%s, word count: %i\n", pass_num, word_count }
    ' file

Test input:

# cat file
0:1.1 I am le passage one.
There are many words in me.

0:1.2 I am le passage two.
One two three four five six
Seven

0:1.3 I am "Hello world"

Test output:

0:1.1, word count: 11
0:1.2, word count: 12
0:1.3, word count: 4


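How it works:

Words are separated by whitespace, so each word corresponds to one field in awk; that is, the number of words on a line equals NF. The count is accumulated line by line until the next passage begins.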
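The claim that a line's word count equals NF can be checked directly. The snippet below is a small sketch; the helper name fields_per_line and the three sample lines are assumptions chosen for illustration.

```shell
#!/bin/sh
# fields_per_line (hypothetical helper): print the number of
# whitespace-separated fields (NF) for every input line.
fields_per_line() {
  awk '{ print NF }'
}

# Three input lines: a sentence, an empty line, and a single word.
printf '%s\n' 'There are many words in me.' '' 'Seven' | fields_per_line
```

This prints 6, 0 and 1, which also shows why blank lines between passages add nothing to the running total.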

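When it encounters a new passage (indicated by the presence of a passage number), it

  • prints the previous passage's number and word count.
  • sets the passage number to the new passage number.
  • resets the passage word count (to -1, because we don't want to count the passage number itself).

The END{..} block is needed because nothing after the last passage triggers printing its number and word count.

The if (pass_num) is there to suppress the printf when awk encounters the first passage.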

Answered 2012-11-04T09:13:01.087
0
$ cat file
0:1.1 This is the first passage...
welcome to the SO, you leart a lot of things here.

0:1.2 This is the second passage...
wer qwerqrq            ewqr e
0:1.3 This is the second passage...

Using sed and GNU grep:

$ sed -n '/0:1.1/,/[0-9]:[0-9]\.[0-9]/{//!p}' file | grep -Eo '[[:alpha:]]*'   | wc -l
11

0:1.1 -> put the number of the passage you want to count here.
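To avoid editing the command for each passage, the number can be passed as a parameter. The following is a hedged sketch rather than the command above: the function name count_passage is hypothetical, it assumes the passage number is followed by a space, and it spells out the heading pattern instead of using the empty // regex. Like the original, it counts only alphabetic runs on the lines after the numbered line.

```shell
#!/bin/sh
# count_passage NUM (hypothetical helper): read the text on stdin and count
# alphabetic words on the lines that follow the heading line for passage NUM.
count_passage() {
  # Escape dots so that 0:1.1 is matched literally.
  pat=$(printf '%s' "$1" | sed 's/\./\\./g')
  # Assumed shape of any passage heading at the start of a line.
  head='^[0-9][0-9]*:[0-9][0-9]*\.[0-9]'
  # Print the range from the requested heading to the next heading,
  # dropping the heading lines themselves, then count the words.
  sed -n "/^$pat /,/$head/{/$head/!p;}" |
    grep -Eo '[[:alpha:]]+' |
    wc -l
}

# Illustrative sample input.
sample='0:1.1 This is the first passage...
welcome to the SO, you leart a lot of things here.

0:1.2 This is the second passage...
wer qwerqrq            ewqr e
0:1.3 This is the second passage...'

printf '%s\n' "$sample" | count_passage 0:1.1
printf '%s\n' "$sample" | count_passage 0:1.2
```

For a file on disk, count_passage 0:1.1 < file would be the equivalent call.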

Answered 2012-11-04T06:55:18.223
0

Here's one way with GNU awk:

awk -v RS='[0-9]+:[0-9]+\\.[0-9]+' -v FS='[ \t\n]+' 'NF > 0 { print R ": " NF - 2 } { R = RT }'

If it is run on the file listed by doubledown, the output is:

0:1.1: 11
0:1.2: 12
0:1.3: 4

Explanation

This works by splitting the input into records at [0-9]+:[0-9]+\\.[0-9]+ and into fields at whitespace. RT holds the separator that ends the current record, so the passage number lags one record behind its text, hence the { R = RT }. The field count is off by two because each record starts and ends with an FS, hence the NF - 2.

Edit - only count fields containing [:alnum:]

The above also counts things such as an ellipsis (...) as words. To avoid this, do something like:

awk -v RS='[0-9]+:[0-9]+\\.[0-9]+' -v FS='[ \t\n]+' '
  NF > 0 { 
    wc = NF-2
    for(i=2; i<NF; i++)
      if($i !~ /[[:alnum:]]+/)
        wc--
    print R ": " wc
  } 
  { R = RT }'
Answered 2012-11-04T10:09:48.537