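I'm very happy with using OpenNLP through JRuby. For something as simple as this, though, a simpler approach might be enough. Let's take a random tweet from a Twitter search for #justinbieber: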
s = "If u never give up and if u fight for everything that u want, u can live our dreams. #JustinBieber"
Remove some unneeded words (the comparison is case-sensitive, which is why the capitalized "If" survives):
words = s.split(/\W/).reject(&:empty?) - %w(the and u our if for that)
# => ["If", "never", "give", "up", "fight", "everything", "want", "can", "live", "dreams", "JustinBieber"]
Build the counts:
words.each_with_object(Hash.new{ |h,k| h[k] = 0}) { |w, h| h[w] += 1 }
#=> {"If"=>1, "never"=>1, "give"=>1, "up"=>1, "fight"=>1, "everything"=>1, "want"=>1, "can"=>1, "live"=>1, "dreams"=>1, "JustinBieber"=>1}
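The counts become more meaningful if you do this for more than one tweet. Also, since you already have a Ruby hash, it is easy to store it in, for example, a MongoDB collection.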
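As a rough sketch of the multi-tweet case, the same steps can be folded into one method that accumulates counts across several tweets. The `STOP_WORDS` constant, the `word_counts` method name, and the second tweet are made up for illustration; lowercasing before filtering is an extra assumption, so that "If" and "if" are treated alike.

```ruby
# Hypothetical stop-word list (same words as above), for illustration only.
STOP_WORDS = %w(the and u our if for that).freeze

# Accumulate word counts over a list of tweets.
# Each tweet is lowercased, split on runs of non-word characters,
# and stripped of stop words before counting.
def word_counts(tweets)
  tweets.each_with_object(Hash.new(0)) do |tweet, counts|
    words = tweet.downcase.split(/\W+/).reject(&:empty?) - STOP_WORDS
    words.each { |w| counts[w] += 1 }
  end
end

tweets = [
  "If u never give up and if u fight for everything that u want, u can live our dreams. #JustinBieber",
  "Never give up on our dreams #JustinBieber" # made-up second tweet
]

counts = word_counts(tweets)

# Print the most frequent words first.
counts.sort_by { |_, n| -n }.each { |word, n| puts "#{word}: #{n}" }
```

`Hash.new(0)` behaves like the block form used earlier for this purpose: a missing key reads as 0, so `+= 1` works on first sight of a word.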