
Here is my code for counting word frequencies:

word_arr = ["I", "received", "this", "in", "email", "and", "found", "it", "a", "good", "read", "to", "share......", "Yes,", "Dr", "M.", "Bakri", "Musa", "seems", "to", "know", "what", "is", "happening", "in", "Malaysia.", "Some", "of", "you", "may", "know.", "He", "is", "a", "Malay", "extra horny", "horny nor", "nor their", "their babes", "babes are", "are extra", "extra SEXY..", "SEXY.. .", ". .", ". .It's", ".It's because", "because their", "their CONDOMS", "CONDOMS are", "are Made", "Made In", "In China........;)", "China........;) &&"]

arr_stop_kwd = ["a", "and"]

# Unknown keys default to 0, so counts can be incremented directly.
frequencies = Hash.new(0)
word_arr.each do |word|
  key = word.downcase
  # Skip stop words and any token containing "&&".
  next if arr_stop_kwd.include?(key) || word.include?('&&')
  frequencies[key] += 1
end

With 100k entries this takes 9.03 seconds. Is there another way to compute it in less time?

Thanks in advance.


1 Answer


Take a look at the Facets gem.

You can do something like this using its frequency method:

require 'facets'
frequencies = (word_arr - arr_stop_kwd).frequency

Note that the stop words can be subtracted from word_arr. See the Array documentation.
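
One caveat: Array#- compares elements exactly, so "And" would survive a subtraction of ["a", "and"], whereas the code in the question downcases each word before checking it. The sketch below keeps the question's behaviour (lowercased keys, tokens containing "&&" skipped) and uses a Set for the stop-word lookup. The sample data is made up, and stop_set is a name introduced here for illustration.

```ruby
require 'set'

# Made-up sample covering mixed case, a stop word, and an "&&" token.
word_arr = ["And", "a", "Malay", "malay", "China........;) &&", "in", "In"]
arr_stop_kwd = ["a", "and"]

# A Set offers constant-time membership checks, unlike Array#include?.
stop_set = arr_stop_kwd.to_set

frequencies = Hash.new(0)
word_arr.each do |word|
  next if word.include?('&&')      # same filter as the question's match('&&')
  key = word.downcase              # normalise case before the stop-word check
  next if stop_set.include?(key)
  frequencies[key] += 1
end

# frequencies => {"malay"=>2, "in"=>2}
```

With only two stop words the Set is unlikely to change the timing much; it should matter more as the stop list grows.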

answered 2013-03-20T10:56:43.130