
Given the input:

str = "foo bar jim jam. jar jee joon."

I need to output every 2-word and 3-word phrase whose words are separated only by spaces:

[ "foo bar", "bar jim", "jim jam", "jar jee", "jee joon",
  "foo bar jim", "bar jim jam", "jar jee joon" ]

Note in particular that "jam jar", "jim jam jar", and "jam jar jee" are absent from the list above, because of the period.

I can't use str.scan(/\w+/).each_cons(2).map{ |a| a.join(' ') }, because that would include "jam jar".

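Scanning for /\w+ \w+/ yields ["foo bar", "jim jam", "jar jee"], notably missing "bar jim" and "jee joon", which highlights the problem: scan matches never overlap, so a word consumed by one pair cannot start the next.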
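
Both failure modes can be reproduced side by side. The sketch below only re-runs the two expressions mentioned above on the sample string; the variable name `all_pairs` is introduced purely for illustration.

```ruby
str = "foo bar jim jam. jar jee joon."

# each_cons over every word ignores the period, so "jam jar" sneaks in.
all_pairs = str.scan(/\w+/).each_cons(2).map { |a| a.join(' ') }
p all_pairs.include?("jam jar")
#=> true

# scan consumes both words of each match, so overlapping pairs are skipped.
p str.scan(/\w+ \w+/)
#=> ["foo bar", "jim jam", "jar jee"]
```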

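The real-world application is generating a phrase-based index for a search engine. I want to find every run of truly consecutive words as a phrase, excluding runs in which punctuation separates the words.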

Edit: It appears there is a way to do this with a regex and scan, using a variation of the following:

"a b c d".scan(/(?=([abc] [abc]) )[abc]/)
#=> [["a b"], ["b c"]]
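
Applied to the original string, the same trick might look like the sketch below. It assumes words are separated by exactly one space and that any other character breaks a phrase; `overlapping_phrases` is a hypothetical helper name, not something from the question.

```ruby
# Hypothetical helper: n-word phrases via a capturing lookahead.
# Assumes a single space between words; any other separator breaks a phrase.
def overlapping_phrases(str, n)
  # The lookahead captures the whole phrase without consuming it, so the
  # trailing \w+ advances by one word only and later matches may overlap.
  pattern = /\b(?=(\w+(?: \w+){#{n - 1}}))\w+/
  str.scan(pattern).flatten
end

str = "foo bar jim jam. jar jee joon."
p overlapping_phrases(str, 2)
#=> ["foo bar", "bar jim", "jim jam", "jar jee", "jee joon"]
p overlapping_phrases(str, 3)
#=> ["foo bar jim", "bar jim jam", "jar jee joon"]
```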

4 Answers

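str = "foo bar jim jam. jar jee joon."
# Pair up adjacent whitespace-separated tokens and keep a pair only when a
# word character sits directly before the space (no punctuation in between).
arr = str.split(' ').each_cons(2).map do |a|
  pair = a.join(' ')
  pair if pair.match(/\w+ \w+/)
end
p arr.compact
#=> ["foo bar", "bar jim", "jim jam.", "jar jee", "jee joon."]
# Note: punctuation trailing the second word is kept (e.g. "jim jam.").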

EDIT: It appears you've changed your question to ask for 3-word phrases as well. ಠ_ಠ
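
The token filter above could plausibly be stretched to n-word phrases. The sketch below is one such extension rather than the code from this answer: `token_phrases` is a hypothetical name, and it assumes punctuation only ever trails a word, so every token except the last must consist purely of word characters.

```ruby
# Hypothetical extension of the split/each_cons filter to n-word phrases.
# Assumption: punctuation only appears at the end of a token.
def token_phrases(str, n)
  str.split(' ').each_cons(n).map do |tokens|
    *head, last = tokens
    # Punctuation on any token but the last separates the words.
    next unless head.all? { |t| t.match?(/\A\w+\z/) }
    word = last[/\A\w+/] # drop trailing punctuation, e.g. "jam." -> "jam"
    next unless word
    (head << word).join(' ')
  end.compact
end

str = "foo bar jim jam. jar jee joon."
p token_phrases(str, 2)
#=> ["foo bar", "bar jim", "jim jam", "jar jee", "jee joon"]
p token_phrases(str, 3)
#=> ["foo bar jim", "bar jim jam", "jar jee joon"]
```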

answered 2012-05-26T03:12:56.270

I believe this does the job, although it assumes that the only punctuation is the period:

str.split(".").map do |s|
  pairs_and_triples = []
  words = s.split
  words.each_cons(2) { |pair| pairs_and_triples << pair.join(" ") }
  words.each_cons(3) { |triple| pairs_and_triples << triple.join(" ") }
  pairs_and_triples
end.flatten

Edit: or, with a little less repetition:

str.split(".").map do |s|
  [2,3].map do |i|
    s.split.each_cons(i).map{ |words| words.join(" ") }
  end.flatten
end.flatten
answered 2012-05-26T03:50:31.323

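The robust, efficient solution I ended up with is the one suggested by @muistooshort and sketched out by @ChrisRice:

  1. Split on sentence boundaries
  2. Scan for words (ignoring uninteresting punctuation such as commas)
  3. Use each_cons to process variations of that array, one pass per phrase length

In code: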

max_words_per_phrase = 5
str = "foo bar, jim jam. jar: jee joon."

phrases = str.split(/[.!?]+/).flat_map do |sentence|
  words = sentence.scan(/\w+/)
  2.upto(max_words_per_phrase).flat_map do |i|
    words.each_cons(i).map{ |a| a.join(' ') }
  end
end

p phrases
#=> ["foo bar", "bar jim", "jim jam", "foo bar jim", "bar jim jam",
#=>  "foo bar jim jam", "jar jee", "jee joon", "jar jee joon"]
answered 2012-05-31T21:31:42.783

After removing the punctuation:

str = "foo bar jim jam jar jee joon"

As you suggested in the question, a positive lookahead can be used:

r2 = /(\w+)(?=(\s+\w+))/
r3 = /(\w+)(?=(\s+\w+)(\s+\w+))/
str.scan(r2).concat(str.scan(r3)).map(&:join)
  #=> ["foo bar", "bar jim", "jim jam", "jam jar", "jar jee", "jee joon",
  #    "foo bar jim", "bar jim jam", "jim jam jar", "jam jar jee", "jar jee joon"] 
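
Because the example above starts from a string whose punctuation has already been removed, it also yields "jam jar". One way to keep the lookahead and still respect sentence boundaries, sketched here under the assumption that '.', '!' and '?' end a sentence, is to scan each sentence separately:

```ruby
r2 = /(\w+)(?=(\s+\w+))/
r3 = /(\w+)(?=(\s+\w+)(\s+\w+))/

str = "foo bar jim jam. jar jee joon."
# Assumption: sentences end at '.', '!' or '?'. Each sentence is scanned on
# its own, so no match can span a sentence boundary.
phrases = str.split(/[.!?]+/).flat_map do |sentence|
  sentence.scan(r2).concat(sentence.scan(r3)).map(&:join)
end
p phrases
#=> ["foo bar", "bar jim", "jim jam", "foo bar jim", "bar jim jam",
#    "jar jee", "jee joon", "jar jee joon"]
```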
answered 2015-12-20T05:46:01.440