ruby - How to write a regular expression that matches one or two lines of text

Question

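I am trying to match some text that can be either one line or two lines, and I would like to handle both cases efficiently. The text strings are in a consistent format and contain multiple tabs. I am trying to do the matching in Ruby. The text looks like this:

Single line: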

#3  Hello Stormy    Scratched - Reason Unavailable                           11:10 AM ET

Two lines:

#3  Hello Stormy    Scratched - Reason Unavailable                            11:10 AM ET   
                    Scratch Reason - Reason Unavailable changed to Trainer     2:19 PM ET

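I had to use spaces to format the strings here, but the actual text uses tabs to separate the parts: the number and name, the scratch and reason, and the time.

Sample output:

One line: #3 Hello Stormy Scratched - Reason Unavailable 11:10 AM ET

Two lines: #3 Hello Stormy Scratched - Reason Unavailable changed to Trainer 2:19 PM

Note: Ideally, the two-line output would include the number and name from the first line.

I was able to build an expression that matches the individual parts, but the tabs, the second line, and the requirement to include the number and horse name in the two-line output are giving me trouble.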

score 2 · Accepted Answer

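You don't need a fancy regular expression to do what you want; you just need to know how to go about it.

Ruby's Enumerable has a method called slice_before, which takes a regular expression and uses it to determine which elements of an array are grouped together. Array includes Enumerable. For example: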

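# The real data is tab-separated, so the tabs are written as "\t" escapes here.
text = "#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET\n" \
       "#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET\n" \
       "\t\t\tScratch\tReason\t-\tReason Unavailable changed to Trainer\t2:19 PM ET\n"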

data = text.split("\n").slice_before(/\A\S/).to_a

require 'pp'
pp data

Output:

[["#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET"],
["#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET",
  "\t\t\tScratch\tReason\t-\tReason Unavailable changed to Trainer\t2:19 PM ET"]]

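In other words, the array created by splitting the text on "\n" is grouped so that a new group starts at every line that does not begin with whitespace, which is what the pattern /\A\S/ matches. Each single line ends up in its own sub-array. Lines that are continuations of the previous line are grouped with that line.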

If you are reading a file from disk, you can use IO.readlines to read the file as an array of lines, which avoids the need to split it yourself.
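A minimal sketch of the file-based variant follows. The temporary file and its contents are stand-ins for the real data on disk, and the chomp: true option (Ruby 2.4 or later) is my addition so that the lines come back without their trailing newlines. File.readlines is used here; File inherits it from IO.

```ruby
require 'tempfile'

# Stand-in for the real file: the same sample lines, tab-separated.
sample = "#3\tHello Stormy\tScratched - Reason Unavailable\t11:10 AM ET\n" \
         "#3\tHello Stormy\tScratched - Reason Unavailable\t11:10 AM ET\n" \
         "\t\tScratch Reason - Reason Unavailable changed to Trainer\t2:19 PM ET\n"

file = Tempfile.new('scratches')
file.write(sample)
file.close

# One array element per line; chomp: true strips the line terminator.
lines = File.readlines(file.path, chomp: true)

# Start a new group at every line that begins with a non-whitespace character.
groups = lines.slice_before(/\A\S/).to_a

file.unlink
```

Here groups holds one sub-array per message, exactly as in the split("\n") version.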

If needed, you can process the array further to rebuild each line together with its continuation lines, using something like:

data = text.split("\n").slice_before(/\A\S/).map{ |i| i.join("\n") }

which turns data into:

["#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET",
"#3\tHello Stormy\tScratched\t-\tReason Unavailable\t\t\t11:10 AM ET\n\t\t\tScratch\tReason\t-\tReason Unavailable changed to Trainer\t2:19 PM ET"]

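If you need to split each line into its component fields, use split("\t"). How to do that within the sub-arrays is left as an exercise for you, but I would involve map.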
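One possible way to do that exercise is sketched below. It assumes the field layout described in the question (number, name, reason, time, separated by one or more tabs) and takes the number and name from the first line of each group and the reason and time from the last line. The variable names are illustrative only.

```ruby
# Sample data with explicit tab escapes: number, name, reason, time.
text = "#3\tHello Stormy\tScratched - Reason Unavailable\t11:10 AM ET\n" \
       "#3\tHello Stormy\tScratched - Reason Unavailable\t11:10 AM ET\n" \
       "\t\tScratch Reason - Reason Unavailable changed to Trainer\t2:19 PM ET\n"

records = text.split("\n").slice_before(/\A\S/).map do |group|
  # Split every line of the group on tabs and drop the empty fields
  # produced by leading or repeated tabs.
  rows = group.map { |line| line.split("\t").reject(&:empty?) }

  number, name = rows.first.first(2) # always from the first line
  reason, time = rows.last.last(2)   # from the last line of the group

  [number, name, reason, time].join(' ')
end

puts records
```

For a single-line group the first and last rows are the same row, so the same code covers both cases.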

Edit:

...I like your solution, but I get an undefined method error for slice_before.

Try this:

require 'pp'
require 'rubygems'

class Array

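  # method_defined? checks instance methods; Array.respond_to? would test the class object itself
  unless Array.method_defined?(:slice_before)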
    def slice_before(pat)
      result = []
      temp_result = []
      self.each do |i|

        if (temp_result.empty?)
          temp_result << i
          next
        end

        if i[pat]
          result << temp_result
          temp_result = []
        end

        temp_result << i
      end
      result << temp_result

    end
  end

end

Call it like this:

ary = [
  '#3  Hello Stormy    Scratched - Reason Unavailable                           11:10 AM ET',
  '#3  Hello Stormy    Scratched - Reason Unavailable                            11:10 AM ET',
  '                    Scratch Reason - Reason Unavailable changed to Trainer     2:19 PM ET',
]

pp ary.slice_before(/\A\S/)

The result looks like:

[
  ["#3  Hello Stormy    Scratched - Reason Unavailable                           11:10 AM ET"],
  ["#3  Hello Stormy    Scratched - Reason Unavailable                            11:10 AM ET",
   "                    Scratch Reason - Reason Unavailable changed to Trainer     2:19 PM ET"]
]

score 1 · Accepted Answer

If you can assume that the "#" character does not appear anywhere else in the string, it becomes fairly simple. Something like this should then do it:

 /^#[^#]*/m

Another, more general approach is to match a first line that starts with #, plus any following lines that start with a space or a tab:

 /^#.*?$(\n^[ \t].*?$)*/m

If the line does not always start with #, you can replace the # with [^ \t] (anything that is not a space or a tab).
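As a rough sketch of how the more general pattern might be applied, the snippet below runs it over the sample data with String#scan. One change is my own: the group is written as non-capturing, (?:...), because scan returns only the captures when a pattern contains a capturing group, and the whole match is what is wanted here.

```ruby
# Sample data with explicit tab escapes.
text = "#3\tHello Stormy\tScratched - Reason Unavailable\t11:10 AM ET\n" \
       "#3\tHello Stormy\tScratched - Reason Unavailable\t11:10 AM ET\n" \
       "\t\tScratch Reason - Reason Unavailable changed to Trainer\t2:19 PM ET\n"

# A line starting with "#", followed by any number of lines that start
# with a space or a tab. The lazy .*? stops at the first end of line.
pattern = /^#.*?$(?:\n^[ \t].*?$)*/m

messages = text.scan(pattern)

messages.each_with_index do |message, index|
  puts "message #{index + 1}: #{message.inspect}"
end
```

Each element of messages is one complete message; a two-line message still contains the "\n" between its lines, so it can be split on tabs and newlines afterwards.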

score 1 · Accepted Answer

Fun with REs! This is a bit hacky, but it uses a few different kinds of matching strategies.

# Two-line example
s = <<-EOS
  #3\tHello Stormy\t\tScratched - Reason Unavailable\t\t\t11:10 AM ET\t
  \t\t\tScratch Reason - Reason Unavailable changed to Trainer\t2:19 PM ET
EOS
# allow leading/trailing whitespace, get the number, name, last reason and time
s =~ /\A\s*(#\d)\t+([^\t]+)(?:\t+.*)?(?:\t+(.*))\t+(\d+:\d+ (?:AM|PM) ET)\s*\Z/m
# ["#3", "Hello Stormy", "Scratch Reason - Reason Unavailable changed to Trainer", "2:19 PM ET"]
a = $1, $2, $3, $4

Note: this assumes there is only one message in the string you are matching.
Note: not tested for the single-line case :)
