ruby - Ruby scan regular expression

Question

I am trying to split the string:

"[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

into the following array:

[
  ["test", "blah"],
  ["foo", "bar bar bar"],
  ["test", "abc", "123", "456 789"]
]

I tried the following, but it is not quite right:

"[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
.scan(/\[(.*?)\s*\|\s*(.*?)\]/)
# =>
# [
#   ["test", "blah"]
#   ["foo", "bar bar bar"]
#   ["test", "abc |123 | 456 789"]
# ]

I need to split on every pipe, not just the first one. What is the correct regular expression to achieve this?

score 7 · Accepted Answer

 s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
 arr = s.scan(/\[(.*?)\]/).map {|m| m[0].split(/ *\| */)}

score 6 · Accepted Answer

Two options:

s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

s.split(/\s*\n\s*/).map{ |p| p.scan(/[^|\[\]]+/).map(&:strip) }
#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]

s.split(/\s*\n\s*/).map do |line|
  line.sub(/^\s*\[\s*/,'').sub(/\s*\]\s*$/,'').split(/\s*\|\s*/)
end
#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]

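Both of them start by splitting on newlines (discarding the surrounding whitespace).

The first one then splits each chunk by looking for runs of anything that is not a [, | or ], and then throws away the extra whitespace (calling strip on each piece).

The second one throws away the leading [ and trailing ] (with whitespace), and then splits on | (with whitespace).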
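To make the two descriptions concrete, here is a small sketch that follows one chunk through each option. The variable names (chunk, pieces, inner and so on) are only for illustration and are not part of the answer above.

```ruby
# One chunk, as it looks after the newline split.
chunk = "[foo |bar bar bar]"

# Option 1: collect every run of characters that is not [, | or ].
pieces = chunk.scan(/[^|\[\]]+/)   # the pieces still carry stray spaces
via_scan = pieces.map(&:strip)     # strip removes them

# Option 2: drop the leading "[" and trailing "]" first, then split on "|".
inner = chunk.sub(/^\s*\[\s*/, '').sub(/\s*\]\s*$/, '')
via_split = inner.split(/\s*\|\s*/)

p pieces     # ["foo ", "bar bar bar"]
p via_scan   # ["foo", "bar bar bar"]
p inner      # "foo |bar bar bar"
p via_split  # ["foo", "bar bar bar"]
```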

You cannot get the result you want with a single scan. About the closest you can get is this:

s.scan(/\[(?:([^|\]]+)\|)*([^|\]]+)\]/)
#=> [["test", " blah"], ["foo ", "bar bar bar"], ["123 ", " 456 789"]]

…which discards information, or this:

s.scan(/\[((?:[^|\]]+\|)*[^|\]]+)\]/)
#=> [["test| blah"], ["foo |bar bar bar"], ["test| abc |123 | 456 789"]]

…which captures the contents of each "array" as a single capture, or this:

s.scan(/\[(?:([^|\]]+)\|)?(?:([^|\]]+)\|)?(?:([^|\]]+)\|)?([^|\]]+)\]/)
#=> [["test", nil, nil, " blah"], ["foo ", nil, nil, "bar bar bar"], ["test", " abc ", "123 ", " 456 789"]]

…which is hardcoded to a maximum of four items, and inserts nil entries that you would need to remove with .compact.
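As a rough sketch of that cleanup, assuming the same input string and no more than four items per bracket group, compact drops the nil placeholders and strip trims the leftover spaces. The variable name four_items is made up for this example.

```ruby
s = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

# Illustrative name; this is the four-slot pattern shown above.
four_items = /\[(?:([^|\]]+)\|)?(?:([^|\]]+)\|)?(?:([^|\]]+)\|)?([^|\]]+)\]/

# compact removes the nil slots left by the unused optional groups;
# strip trims the whitespace around each captured item.
result = s.scan(four_items).map { |captures| captures.compact.map(&:strip) }

p result
# [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
```

Note that a bracket group with five or more items would not match this pattern at all, so it would be skipped silently rather than truncated.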

There is no way to use Ruby's scan with a regex like /(?:(aaa)b)+/ and get multiple captures, one for each time the repetition matched.
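A minimal demonstration of that limitation, followed by one possible workaround (a nested scan, which is a suggestion of this sketch rather than something from the answer above):

```ruby
text = "aaabaaabaaab"

# Each repetition of the group overwrites the previous capture,
# so scan reports only the last one for the single overall match.
repeated = text.scan(/(?:(aaa)b)+/)

# Workaround sketch: match the whole run first, then scan inside it.
pieces = text.scan(/(?:aaab)+/).map { |run| run.scan(/aaa(?=b)/) }

p repeated  # [["aaa"]]
p pieces    # [["aaa", "aaa", "aaa"]]
```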

score 2 · Accepted Answer

Why take the hard path (a single regex)? Why not a simple combination of splits? Here are the steps, to visualize the process.

str = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

arr = str.split("\n").map(&:strip) # => ["[test| blah]", "[foo |bar bar bar]", "[test| abc |123 | 456 789]"]
arr = arr.map{|s| s[1..-2] } # => ["test| blah", "foo |bar bar bar", "test| abc |123 | 456 789"]
arr = arr.map{|s| s.split('|').map(&:strip)} # => [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]

This is probably much less efficient than scan, but at least it is simple :)
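The efficiency guess is easy to check rather than assume. The sketch below times the split pipeline against a scan-based variant using a monotonic clock. The lambda names and the run count of 10,000 are arbitrary choices for illustration, and the timings will vary by machine and Ruby version, so no numbers are claimed here.

```ruby
str = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"

# The split-based pipeline from this answer, chained into one expression.
split_version = lambda do |text|
  text.split("\n").map(&:strip).map { |s| s[1..-2] }.map { |s| s.split('|').map(&:strip) }
end

# A scan-based variant: capture each bracket's contents, then split on pipes.
scan_version = lambda do |text|
  text.scan(/\[(.*?)\]/).map { |m| m[0].split(/\s*\|\s*/).map(&:strip) }
end

# Hypothetical helper: wall-clock seconds taken by `runs` calls.
time_it = lambda do |runs, callable, text|
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  runs.times { callable.call(text) }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

split_seconds = time_it.call(10_000, split_version, str)
scan_seconds  = time_it.call(10_000, scan_version, str)
puts format("split: %.4fs  scan: %.4fs", split_seconds, scan_seconds)
```

For an input this small the difference is unlikely to matter in practice; readability is a reasonable tie-breaker.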

score 2 · Accepted Answer

The "Scan, Split, Strip, and Delete" Train-Wreck

The whole premise seems flawed, since it assumes that you will always find alternation in your sub-arrays and that expressions won't contain character classes. Still, if that's the problem you really want to solve for, then this should do it.

First, str.scan( /\[.*?\]/ ) will net you three array elements, each containing pseudo-arrays. Then you map the sub-arrays, splitting on the alternation character. Each element of the sub-array is then stripped of whitespace, and the square brackets deleted. For example:

str = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
str.scan( /\[.*?\]/ ).map { |arr| arr.split('|').map { |m| m.strip.delete '[]' }}

#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]

Verbosely, Step-by-Step

Mapping nested arrays is not always intuitive, so I've unwound the train-wreck above into more procedural code for comparison. The results are identical, but the following may be easier to reason about.

string = "[test| blah] \n [foo |bar bar bar]\n[test| abc |123 | 456 789]"
array_of_strings = string.scan( /\[.*?\]/ )
#=> ["[test| blah]", "[foo |bar bar bar]", "[test| abc |123 | 456 789]"]

sub_arrays = array_of_strings.map { |sub_array| sub_array.split('|') }
#=> [["[test", " blah]"],
#    ["[foo ", "bar bar bar]"],
#    ["[test", " abc ", "123 ", " 456 789]"]]

stripped_sub_arrays = sub_arrays.map { |sub_array| sub_array.map(&:strip) }
#=> [["[test", "blah]"],
#    ["[foo", "bar bar bar]"],
#    ["[test", "abc", "123", "456 789]"]]

sub_arrays_without_brackets =
  stripped_sub_arrays.map { |sub_array| sub_array.map {|elem| elem.delete '[]'} }
#=> [["test", "blah"], ["foo", "bar bar bar"], ["test", "abc", "123", "456 789"]]
