ruby - Simple filtering of common words out of a text description

Question

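Words such as "a", "the", "best", and "kind". I'm pretty sure there are good ways of achieving this.

To be clear, I am looking for:

The simplest solution that can be implemented, preferably in Ruby.
I have a high tolerance for errors.
If I need a library of common phrases, I'm perfectly happy with that too.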

score 2 · Accepted Answer

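These common words are known as "stop words". There is a similar Stack Overflow question here: "Stop words" list for English?

To summarize:

If you have a large amount of text to deal with, it is worth gathering statistics on word frequency in that particular data set and taking the most frequent words as your stop-word list. (That you include "kind" in your examples suggests to me that you might have quite an unusual data set, for example one with lots of colloquial expressions like "kind", so perhaps you would need to do this.)
Since you say you don't mind much about errors, it may be sufficient to just use a list of English stop words that someone else has produced, such as the fairly long one used by MySQL or anything else that Google turns up.

If you just put these words into a hash in your program, it should be easy to filter any list of words.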
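
Both suggestions above can be sketched in a few lines of Ruby. This is a minimal sketch, not a definitive implementation: the method names `frequent_words` and `remove_stop_words`, the `top_n` cutoff, and the sample text are all assumptions made up for illustration.

```ruby
require 'set'

# Hypothetical helper: return the top_n most frequent words in text
# as a Set, to be used as a data-set-specific stop-word list.
def frequent_words(text, top_n)
  counts = Hash.new(0)
  text.downcase.scan(/\w+/) { |word| counts[word] += 1 }
  # Sort by descending count; break ties alphabetically so the result is stable
  counts.sort_by { |word, count| [-count, word] }.first(top_n).map(&:first).to_set
end

# Hypothetical helper: drop every word that appears in stop_words
def remove_stop_words(text, stop_words)
  text.scan(/\w+/).reject { |word| stop_words.include?(word.downcase) }.join(' ')
end

sample = 'the cat and the dog and the bird saw the cat'
stop_words = frequent_words(sample, 2)      # "the" and "and" in this sample
puts remove_stop_words(sample, stop_words)  # prints "cat dog bird saw cat"
```

A `Set` (or a hash, as suggested above) gives constant-time lookups, which matters once the stop-word list grows to a few hundred entries. For a ready-made list, replace the call to `frequent_words` with a set built from whichever published list you choose.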

score 1 · Accepted Answer

  Common = %w{ a and or to the is in be }
Uncommon = %{
  To be, or not to be: that is the question: 
  Whether 'tis nobler in the mind to suffer
  The slings and arrows of outrageous fortune,
  Or to take arms against a sea of troubles,
  And by opposing end them? To die: to sleep;
  No more; and by a sleep to say we end
  The heart-ache and the thousand natural shocks
  That flesh is heir to, 'tis a consummation
  Devoutly to be wish'd. To die, to sleep;
  To sleep: perchance to dream: ay, there's the rub;
  For in that sleep of death what dreams may come
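}.split(/\b/)  # split on word boundaries; punctuation and spaces stay as tokens
ignore_me, result = {}, []
# Build a lookup hash of the common words
  Common.each { |w| ignore_me[w.downcase] = :Common          }
# Keep every token whose leading word characters are not a common word
Uncommon.each { |w| result << w unless ignore_me[w.downcase[/\w*/]] }
puts result.join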

 ,  not  : that   question: 
Whether 'tis nobler   mind  suffer
 slings  arrows of outrageous fortune,
  take arms against  sea of troubles,
 by opposing end them?  die:  sleep;
No more;  by  sleep  say we end
 heart-ache   thousand natural shocks
That flesh  heir , 'tis  consummation
Devoutly   wish'd.  die,  sleep;
 sleep: perchance  dream: ay, there's  rub;
For  that sleep of death what dreams may come

score 1 · Accepted Answer

This is a variation on DigitalRoss's answer.

str=<<EOF
To be, or not to be: that is the question: 
  Whether 'tis nobler in the mind to suffer
  The slings and arrows of outrageous fortune,
  Or to take arms against a sea of troubles,
  And by opposing end them? To die: to sleep;
  No more; and by a sleep to say we end
  The heart-ache and the thousand natural shocks
  That flesh is heir to, 'tis a consummation
  Devoutly to be wish'd. To die, to sleep;
  To sleep: perchance to dream: ay, there's the rub;
  For in that sleep of death what dreams may come
EOF

common = {}
%w{ a and or to the is in be }.each{|w| common[w] = true}
puts str.gsub(/\b\w+\b/){|word| common[word.downcase] ? '': word}.squeeze(' ')

Also relevant: What is the fastest way to check if the words in one string are in another string?

score 0 · Accepted Answer

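Hold on: before you take out stop words (also known as noise words or junk words), you need to do some research. Index size and processing resources are not the only issues. A lot depends on whether end users will be typing queries or whether you will be working with long automated queries.

All search log analysis shows that people tend to type one to three words per query. When that is all a search has to go on, you can't afford to lose anything. For example, a collection might have the word "copyright" on many documents, making it very common, but if the word is not in the index, exact phrase searches and proximity-based relevance ranking become impossible. In addition, there are perfectly legitimate reasons to search for the most common words: people may be looking for "The Who", or worse, "The The".

So while there are technical issues to consider, and removing stop words is one solution, it may not be the right solution for the overall problem you are trying to solve.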
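
The risk with short queries is easy to reproduce. Here is a minimal sketch, assuming a hypothetical `strip_stop_words` helper and a deliberately tiny stop-word list made up for illustration:

```ruby
# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = %w[a and or to the is in be].freeze

# Hypothetical helper: return the query words that survive stop-word removal
def strip_stop_words(query)
  query.scan(/\w+/).reject { |word| STOP_WORDS.include?(word.downcase) }
end

['The Who', 'The The', 'copyright law'].each do |query|
  remaining = strip_stop_words(query)
  puts "#{query.inspect} -> #{remaining.inspect}"
end
```

With this list, "The Who" is reduced to a single word and "The The" to nothing at all, so neither query can be matched as a phrase.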

score 0 · Accepted Answer

If you have an array of words to remove named stop_words, then this expression gives you the result:

description.scan(/\w+/).reject do |word|
  stop_words.include? word
end.join ' '

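If you want to keep the non-word characters between the words:

# \W* rather than \W+, so a final word with no trailing punctuation is still matched
description.scan(/(\w+)(\W*)/).reject do |(word, other)|
  stop_words.include? word
end.flatten.join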
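
One caveat with the expressions above: `stop_words.include? word` is case-sensitive, so "The" is not removed when the list contains "the", and `Array#include?` scans the whole array for every word. The following is a minimal sketch of one way around both, assuming a hypothetical method name `filter_description` and a made-up stop-word list:

```ruby
require 'set'

# Hypothetical helper: remove stop words case-insensitively while keeping
# the punctuation and spacing that follows each surviving word.
def filter_description(description, stop_words)
  lookup = stop_words.map(&:downcase).to_set
  description.scan(/(\w+)(\W*)/)
             .reject { |word, _separator| lookup.include?(word.downcase) }
             .flatten
             .join
end

puts filter_description('The best kind of tea, in a cup.', %w[a the best kind of in])
# prints "tea, cup."
```

Note that each removed word takes its trailing separator with it, and any non-word characters before the first word are not captured by the pattern.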
