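Words such as "a", "the", "best", and "kind". I'm fairly sure there are good ways to achieve this.
To be clear, I'm looking for:
- The simplest solution that can be implemented, preferably in Ruby.
- I have a high tolerance for errors.
- If I need a library of common phrases, I'm perfectly happy with that too.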
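These common words are known as "stop words". There is a similar Stack Overflow question here: "Stop words" list for English?
To summarize:
If you just put these words into a hash in your program, filtering any list of words should be easy.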
Common = %w{ a and or to the is in be }
Uncommon = %{
To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
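}.split(/\b/) # split on word boundaries so punctuation and spaces survive as tokens
# Build a lookup hash of the common words.
ignore_me, result = {}, []
Common.each { |w| ignore_me[w.downcase] = :Common }
# Keep every token whose leading word characters are not a common word.
Uncommon.each { |w| result << w unless ignore_me[w.downcase[/\w*/]] }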
puts result.join
, not : that question:
Whether 'tis nobler mind suffer
slings arrows of outrageous fortune,
take arms against sea of troubles,
by opposing end them? die: sleep;
No more; by sleep say we end
heart-ache thousand natural shocks
That flesh heir , 'tis consummation
Devoutly wish'd. die, sleep;
sleep: perchance dream: ay, there's rub;
For that sleep of death what dreams may come
This is a variation on DigitalRoss's answer.
str=<<EOF
To be, or not to be: that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles,
And by opposing end them? To die: to sleep;
No more; and by a sleep to say we end
The heart-ache and the thousand natural shocks
That flesh is heir to, 'tis a consummation
Devoutly to be wish'd. To die, to sleep;
To sleep: perchance to dream: ay, there's the rub;
For in that sleep of death what dreams may come
EOF
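# Hash used as a fast membership test for the common words.
common = {}
%w{ a and or to the is in be }.each{|w| common[w] = true}
# Blank out each common word, then collapse the runs of spaces left behind.
puts str.gsub(/\b\w+\b/){|word| common[word.downcase] ? '': word}.squeeze(' ')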
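Wait, before you take out stop words (also known as noise words or junk words), you need to do some research. Index size and processing resources are not the only issues. A lot depends on whether end users will be typing the queries, or whether you will be running long automated queries.
Search log analyses show that people tend to type one to three words per query. When that is all a search has to go on, we can't afford to lose anything. For example, a collection might have the word "copyright" on many documents, which makes it very common, but if the word is not in the index, exact phrase searches and proximity-based relevance ranking become impossible. There are also perfectly legitimate reasons to search for the most common words: people may be looking for "The Who", or worse, "The The".
So while there are technical issues to consider, and removing stop words is one solution, it may not be the right solution for the overall problem you are trying to solve.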
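To make the "The Who" problem concrete, here is a minimal sketch. The STOP_WORDS list and the strip_stop_words and safe_query helpers are hypothetical names made up for this illustration, and falling back to the unfiltered query is only one possible mitigation, assumed here for the sake of the example.

```ruby
# Hypothetical stop-word list and helper names, for illustration only.
STOP_WORDS = %w[a and or to the is in be].freeze

# Remove stop words from a query, comparing case-insensitively.
def strip_stop_words(query)
  query.split.reject { |word| STOP_WORDS.include?(word.downcase) }.join(' ')
end

# Assumed mitigation: keep the original query when filtering leaves nothing.
def safe_query(query)
  stripped = strip_stop_words(query)
  stripped.empty? ? query : stripped
end

p strip_stop_words('The Who')  # the band name loses its article
p strip_stop_words('The The')  # nothing is left to search for
p safe_query('The The')        # the fallback keeps the query intact
```

With short queries like these, blind filtering either changes the meaning of the query or leaves nothing at all, which is the risk described above.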
If you have an array of words to remove named stop_words, then this expression gives you the result:
description.scan(/\w+/).reject do |word|
stop_words.include? word
end.join ' '
If you want to keep the non-word characters between the words:
# \W* (rather than \W+) also captures a final word that has no trailing separator.
description.scan(/(\w+)(\W*)/).reject do |(word, _separator)|
stop_words.include? word
end.flatten.join
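The expressions above compare case-sensitively and search the array linearly on every lookup. A possible variant, assuming you are free to store the stop words in a Set, is sketched below. STOP_WORD_SET, remove_stop_words, and the sample description are hypothetical names and data chosen for this example.

```ruby
require 'set'

# Hypothetical stop-word set; Set#include? is a hash lookup rather than a linear scan.
STOP_WORD_SET = Set.new(%w[a and or to the is in be]).freeze

# Delete stop words (case-insensitively) and leave punctuation in place.
def remove_stop_words(text, stop_words = STOP_WORD_SET)
  kept = text.gsub(/\w+/) { |word| stop_words.include?(word.downcase) ? '' : word }
  # Collapse the runs of spaces left behind and trim the ends.
  kept.squeeze(' ').strip
end

description = 'To be, or not to be: that is the question'
puts remove_stop_words(description)
```

Punctuation that belonged to a removed word stays behind, much as in the sample output earlier in the thread, so you may want an extra cleanup pass if that matters for your use.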