ruby - Ruby Regex, get all possible matches (without cutting the string)

Question

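I'm having a problem with a Ruby regular expression. I need to find all (possibly overlapping) matches. Here is a simplified version of the problem: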

#Simple example
"Hey".scan(/../)
=> ["He"] #Actual results

#With overlapping matches the result should be
=> ["He"], ["ey"]

The regex I'm actually trying to run, and get all the results for, looks like this:

"aaaaaa".scan(/^(..+)\1+$/) #This looks for multiples of (here) "a" bigger than one that "fills" the entire string. "aa"*3 => true, "aaa"*2 => true. "aaaa"*1,5 => false.
 => [["aaa"]] 

#With overlapping results this should be
 => [["aa"],["aaa"]]

Is there a library or a method in Ruby that runs a regex and gives me the results I'm after?

I found some hints that this is possible in Perl, but after hours of research I found nothing about a way to do it in Ruby.

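I did find this: "Javascript Regex - Find all possible matches, even in already capture matches", but I couldn't find anything similar for Ruby, nor anything like the lastIndex property in Ruby. To be honest, I don't think it would work anyway, because the regex I intend to use is recursive and depends on the whole string, whereas that method cuts the string up.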

score 6 · Accepted Answer

This is a bit of an old topic... I'm not sure I understand, but the best I could find is:

"Hey".scan(/(?=(..))/)
 => [["He"], ["ey"]] 

"aaaaaa".scan(/(?=(..+)\1)/)
 => [["aaa"], ["aa"], ["aa"]]

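scan steps through the string one position at a time, and at each step the positive lookahead (?=) tests the regex (..+)\1. A lookahead does not consume any characters, but the capture group inside it returns the match, if there is one.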
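
The same idea can be sketched as a small helper that also records where each match starts. The name overlapping_matches is hypothetical and not part of the answer. The sketch assumes the pattern has no numbered backreferences of its own, because wrapping it adds an outer capture group and shifts the group numbers.

```ruby
# Hypothetical helper, for illustration only.
# Wraps the pattern in a zero-width lookahead so scan retries at every
# position, then pairs each captured match with its start offset.
def overlapping_matches(str, pattern)
  lookahead = /(?=(#{pattern}))/ # interpolating a Regexp keeps its options
  results = []
  str.scan(lookahead) do
    m = Regexp.last_match        # scan sets the last match before yielding
    results << [m.begin(0), m[1]]
  end
  results
end

p overlapping_matches("Hey", /../)       # => [[0, "He"], [1, "ey"]]
p overlapping_matches("banana", /ana/)   # => [[1, "ana"], [3, "ana"]]
```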

score 3 · Accepted Answer

Are you just missing the second capture group?

"aaaaaa".scan(/(..+?)(\1+)/)
#=> [["aa", "aaaa"]]

It looks like your expectations may be off.

score 3 · Accepted Answer

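The problem with any solution based on scan is that it won't find overlapping matches, because scan always moves forward past each match. It might be possible to recast the regex so that it sits entirely inside a zero-width positive lookahead and then use scan, but IIRC there are otherwise valid regex patterns that don't work inside a lookahead or lookbehind.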
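
A small illustration of that forward progress, using a made-up input string: plain scan resumes after the end of each match, while the lookahead form retries at every position.

```ruby
text = "abababa"

# Plain scan consumes "aba" at 0..2 and resumes at index 3,
# so the occurrence starting at index 2 is never seen.
plain = text.scan(/aba/)

# Inside a zero-width lookahead nothing is consumed, so every
# starting position gets tried.
overlapping = text.scan(/(?=(aba))/).flatten

p plain        # => ["aba", "aba"]
p overlapping  # => ["aba", "aba", "aba"]
```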

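The question as posed is somewhat ambiguous. This answer reads it as asking for all unique substrings of the target string that the regex matches. Although not strictly necessary, it uses Ruby 2.0 lazy evaluation to avoid allocating excessive intermediate arrays.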

class String
  def each_substring
    Enumerator.new do |y|
      (0...length).each do |b|
        (b...length).each do |e|
          y << self[b..e]
        end
      end
      y << '' 
    end
  end
end

class Regexp
  def all_possible_matches(str)
    str.each_substring.lazy.
    map { |s| match(s) }.
    reject(&:nil?).
    map { |m| m.size > 1 ? m[1..-1] : m[0] }.
    to_a.uniq
  end
end

/.{2,4}/.all_possible_matches('abcde')
=> ["ab", "abc", "abcd", "bc", "bcd", "bcde", "cd", "cde", "de"]

/^(..+?)\1+$/.all_possible_matches('aaaaaa')
=> [["aa"]]
/^(..+)\1+$/.all_possible_matches('aaaaaa')
=> [["aa"], ["aaa"]]
/^(..+?)\1+$/.all_possible_matches('aaaaaaaaa')
=> [["aa"], ["aaa"]]
/^(..+)\1+$/.all_possible_matches('aaaaaaaaa')
=> [["aa"], ["aaa"], ["aaaa"]]

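Edit: made it return the capture groups when there are any. The OP's expected result for the non-greedy form /^(..+?)\1+$/ is wrong, because the ? makes the group settle for the match with the fewest characters.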
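
A quick way to see the difference between the two forms is to run both against the whole string and read capture group 1. This is only a sketch of the greedy versus non-greedy behaviour, not part of the answer's code:

```ruby
str = "aaaaaa"

# Non-greedy: the group stops at the shortest unit that can tile the string.
lazy = str[/^(..+?)\1+$/, 1]

# Greedy: the group backtracks down from the longest candidate and
# keeps the first one that tiles the string.
greedy = str[/^(..+)\1+$/, 1]

p lazy    # => "aa"
p greedy  # => "aaa"
```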

score 1 · Accepted Answer

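I don't see why your expected results should be what they are, but if you just want to apply the regex from different starting points, this will do: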

class String
  def awesome_regex_scan r
    (0...length).map{|i| match(r, i)}.map(&:to_a).reject(&:empty?).uniq
  end
end

"Hey".awesome_regex_scan(/../) # => [["He"], ["ey"]]

As I said above, this doesn't match your expected results, and I don't understand why you expect what you do:

"aaaaaa".awesome_regex_scan(/^(..+?)\1+$/) # => [["aaaaaa", "aa"]]
"aaaaaa".awesome_regex_scan(/^(..+)\1+$/) # => [["aaaaaa", "aaa"]]

score 0 · Accepted Answer

class String
  def awesome_regex_scan(pattern)
    result = []
    source = self
    while (match = source.match(pattern))
      result << match.to_s
      source = source.slice(match.begin(0)+1..-1)
    end
    result
  end
end

p "Hey".awesome_regex_scan(/../)
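
The loop above cuts the string on every iteration, which the question wanted to avoid. A possible variant, sketched here under the hypothetical name scan_without_slicing, passes a start offset to String#match instead, so the pattern always runs against the original string and anchors such as ^ still refer to it.

```ruby
# Hypothetical variant, for illustration only: same loop as above, but it
# uses the optional start position of String#match rather than slicing.
def scan_without_slicing(str, pattern)
  result = []
  pos = 0
  while pos <= str.length && (m = str.match(pattern, pos))
    result << m.to_s
    pos = m.begin(0) + 1 # resume one character after the match start
  end
  result
end

p scan_without_slicing("Hey", /../)  # => ["He", "ey"]
```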
