ruby-on-rails - Word frequency counting is very inefficient

Question

Here is my code for counting word frequencies:

  word_arr= ["I", "received", "this", "in", "email", "and", "found", "it", "a", "good", "read", "to", "share......", "Yes,", "Dr", "M.", "Bakri", "Musa", "seems", "to", "know", "what", "is", "happening", "in", "Malaysia.", "Some", "of", "you", "may", "know.", "He", "is", "a", "Malay",  "extra horny", "horny nor", "nor their", "their babes", "babes are", "are extra", "extra SEXY..", "SEXY.. .", ". .", ". .It's", ".It's because", "because their", "their CONDOMS", "CONDOMS are", "are Made", "Made In", "In China........;)", "China........;) &&"]

arr_stop_kwd = ["a", "and"]

frequencies = Hash.new(0)
word_arr.each { |word|
  if !arr_stop_kwd.include?(word.downcase) && !word.match('&&')
    frequencies["#{word.downcase}"] += 1
  end
}

With 100k words this takes 9.03 seconds. Is there another way to do the count in less time?

Thanks in advance.

score 2 · Accepted Answer

Take a look at the Facets gem.

Using its frequency method, you can do something like this:

require 'facets'
frequencies = (word_arr-arr_stop_kwd).frequency

Note that the stop words can simply be subtracted from word_arr. See the Array documentation.
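One caveat with the one-liner: array subtraction is case-sensitive, and it neither downcases the words nor skips tokens containing '&&', so its counts can differ from the loop in the question. If you prefer to stay in plain Ruby, the following is a minimal sketch that keeps the question's behaviour. It assumes the slowdown comes mostly from Array#include? scanning a long stop-word list for every word, which has not been measured here. The helper name word_frequencies and the sample data are made up for illustration.

```ruby
require 'set'

# Hypothetical helper, not part of the question or of Facets.
# Counts case-insensitive word frequencies, skipping stop words and any
# token that contains "&&", as the loop in the question does.
def word_frequencies(words, stop_words)
  # A Set gives constant-time membership checks instead of a linear scan.
  stop_set = stop_words.map(&:downcase).to_set
  frequencies = Hash.new(0)
  words.each do |word|
    next if word.include?('&&')       # plain substring check, no regexp
    key = word.downcase               # downcase once and reuse it
    next if stop_set.include?(key)
    frequencies[key] += 1
  end
  frequencies
end

# Small illustrative sample, not the asker's real data.
sample_words = ["I", "a", "to", "To", "and", "Malay", "China........;) &&"]
sample_stop_words = ["a", "and"]
result = word_frequencies(sample_words, sample_stop_words)
```

On Ruby 2.7 or later, Enumerable#tally can replace the manual hash once the words have been filtered and downcased. Whether either variant is fast enough for 100k words is worth checking with Benchmark on the real data.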
