linux - Shell program - determine average word length in a file

Question

I am trying to write a shell program to determine the average word length in a file. I'm assuming I need to use wc and expr somehow. Guidance in the right direction would be great!

score 4 · Accepted Answer

Assuming your file is ASCII and wc can indeed read it...

chars=$(cat inputfile | wc -c)
words=$(cat inputfile | wc -w)

Then a simple

avg_word_size=$(( ${chars} / ${words} ))

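will compute an integer (shell integer division truncates the result). But this is "more wrong" than the truncation error alone: you are also counting every whitespace character toward the average word length. And I assume you want more precision...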
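To make the whitespace problem concrete, here is a small illustration on a made-up sample string (the text `one two three` is an assumption chosen for the example, not part of the original question):

```shell
# Hypothetical sample: 3 words, 11 letters, 2 spaces, 1 trailing newline.
sample='one two three'
chars=$(printf '%s\n' "$sample" | wc -c)   # letters + spaces + newline = 14
words=$(printf '%s\n' "$sample" | wc -w)   # 3

# 14 / 3 truncates to 4, although the letters alone average 11 / 3 (about 3.67).
echo "naive average: $(( chars / words ))"
```

The two spaces and the newline are enough to push the result from below 4 up to 4, and the effect grows with longer files.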

The following increases precision by computing the integer from a number multiplied by 100:

_100x_avg_word_size=$(( $((${chars} * 100)) / ${words} ))

Now we can use that to tell the world:

 echo "Average word size is: ${avg_word_size}.${_100x_avg_word_size: -2:2}"
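The substring expansion `${_100x_avg_word_size: -2:2}` takes the last two digits of the scaled value, which only works when that value has at least two digits. As a hedged alternative sketch, the same fixed-point formatting can be done with `printf` and the remainder operator; `format_hundredths` is a hypothetical helper name, not something from the answer:

```shell
# format_hundredths: hypothetical helper that prints a value scaled by 100
# (e.g. 466) as a decimal with two places (4.66).
format_hundredths() {
    local scaled=$1
    # Integer part is scaled / 100; %02d zero-pads the remainder (5 -> "05").
    printf '%d.%02d\n' "$(( scaled / 100 ))" "$(( scaled % 100 ))"
}

chars=14 words=3
format_hundredths "$(( chars * 100 / words ))"   # 1400 / 3 = 466, prints 4.66
```

Multiplying before dividing keeps two extra digits that plain integer division would discard.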

To refine this further, we can assume that exactly 1 whitespace character separates the words:

 chars=$(cat inputfile | wc -c)
 words=$(cat inputfile | wc -w)

 # Subtract the (words - 1) separating spaces before averaging.
 letters=$(( chars - (words - 1) ))
 avg_word_size=$(( letters / words ))
 _100x_avg_word_size=$(( letters * 100 / words ))

 echo "Average word size is: ${avg_word_size}.${_100x_avg_word_size: -2:2}"

Now it is your job to try to work the concept of "lines" into your calculation... :-)
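One possible way to handle lines, sketched here under the assumption that the input is plain ASCII, is to delete all whitespace (spaces, tabs and newlines) before counting, so no guess about separators is needed. The function name `avg_word_len` is hypothetical:

```shell
# avg_word_len: hypothetical helper; prints the average word length of a
# file to two decimal places, counting only non-whitespace bytes.
avg_word_len() {
    local file=$1 letters words scaled
    # Strip every whitespace character, then count what is left.
    letters=$(tr -d '[:space:]' < "$file" | wc -c)
    words=$(wc -w < "$file")
    if (( words == 0 )); then
        echo "no words found" >&2
        return 1
    fi
    scaled=$(( letters * 100 / words ))
    printf '%d.%02d\n' "$(( scaled / 100 ))" "$(( scaled % 100 ))"
}
```

It would be called as `avg_word_len inputfile`. Punctuation still counts as part of a word here, because wc and tr only distinguish whitespace from everything else.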

score 1 · Accepted Answer

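Update: to (hopefully) show more clearly the difference between wc and this method; also fixed a "too many newlines" bug and added finer control over word-ending apostrophes.

If you want to treat a word as a bash word, then wc on its own is fine.
However, if you want to treat a word as a word in a spoken/written language, you cannot use wc for the word parsing.

For example, wc considers the following to contain 1 word (average size = 112.00), whereas the
script below shows it to contain 19 words (average size = 4.58):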

"/home/axiom/zap_notes/apps/eng-hin-devnag-itrans/Platt's_Urdu_and_classical_Hindi_to_English_-_preface5.doc't"
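The difference is easy to reproduce. The sketch below feeds the same path to `wc -w` and then to a crude letters-only split; the `tr` pipeline is an assumption made for illustration and is simpler than the sed script given further down, so its count will not match that script exactly:

```shell
# A whitespace-free string: wc sees exactly one word, however long it is.
path="/home/axiom/zap_notes/apps/eng-hin-devnag-itrans/Platt's_Urdu_and_classical_Hindi_to_English_-_preface5.doc't"
wc_words=$(printf '%s\n' "$path" | wc -w)

# Crude alternative: turn every run of non-letters into a newline,
# then count the non-empty lines that remain.
alpha_runs=$(printf '%s\n' "$path" | tr -cs '[:alpha:]' '\n' | grep -c .)

echo "wc -w: $(( wc_words )), letter runs: $(( alpha_runs ))"
```

How many words a natural-language split reports depends on the treatment of apostrophes, digits and underscores, which is what the sed rules below tune.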

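Using Kurt's script, the following line is reported as containing 7 words (average size = 8.14),
whereas the script shown below reports 7 words (average size = 4.43). Note that बे = 2 characters.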

"बे  = {Platts} ... —be-ḵẖẉabī, s.f. Sleeplessness:"
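Much of that gap comes from counting bytes rather than characters. A minimal sketch, assuming the text is UTF-8 encoded: `wc -c` counts bytes, while `wc -m` counts characters, but only when the current locale is UTF-8 aware.

```shell
text='बे'
# Each of the two Devanagari code points occupies 3 bytes in UTF-8.
bytes=$(printf '%s' "$text" | wc -c)
# Character count; this is 2 in a UTF-8 locale, but falls back to the
# byte count in the C/POSIX locale.
chars=$(printf '%s' "$text" | wc -m)

echo "bytes: $(( bytes )), characters: $(( chars ))"
```

A byte-based average therefore overstates word length for any non-ASCII text.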

So, if wc is to your taste, fine; if not, something like this may suit:

# Cater for special situation words: eg 's and 't   
# Convert each group of anything which isn't a "character" (including '_') into a newline.  
# Then, convert each CHARACTER which isn't a newline into a BYTE (not character!).  
# This leaves one 'word' per line, each 'word' being made up of the same BYTE ('x').  
# 
# Without any options, wc prints  newline, word, and byte counts (in that order),
#  so we can capture all 3 values in a bash array
#  
# Use `awk` as a floating point calculator (bash can only do integer arithmetic)

count=($(sed "s/\>'s\([[:punct:]]\|$\)/\1/g      # ignore apostrophe-s ('s) word endings 
              s/'t\>/xt/g      # consider words ending in apostrophe-t ('t) as base word + 2 characters   
              s/[_[:digit:][:blank:][:punct:][:cntrl:]]\+/\n/g 
              s/^\n*//; s/\n*$//; s/[^\n]/x/g" "$file" | wc))
echo "chars / word average:" \
      "$(awk -v nl="${count[0]}" -v ch="${count[2]}" 'BEGIN { printf("%.2f\n", (ch-nl)/nl) }')"
