ruby - Why does this regex with named groups capture the wrong text?

Question

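Does anyone know why the named group ref_id contains "Some address: loststreet 4" in the capture for regex1 below?

I would like it to be just "loststreet 4", and I don't understand why it isn't. The code below is from an IRB session.

I have considered the encoding of the string: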

str1 = <<eos
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
eos
# => "Burp\nFirstName: Al Bundy\nRef person:\nSome address: loststreet 4\nSome other address: loststreet 4\nZip code:\n" 

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi
# => /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi 

str1.match(regex1)
# => #<MatchData "FirstName: Al Bundy\nRef person:\nSome address: loststreet 4\nSome other address: loststreet 4" name:"Al Bundy" ref_id:"Some address: loststreet 4" other:"loststreet 4"> 

str1.encoding
# => #<Encoding:UTF-8> 

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/miu
# => /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi 

str1.match(regex1)
# => #<MatchData "FirstName: Al Bundy\nRef person:\nSome address: loststreet 4\nSome other address: loststreet 4" name:"Al Bundy" ref_id:"Some address: loststreet 4" other:"loststreet 4">

score 0 · Accepted Answer

Use MatchData#[] to get the string for a specific group:

str1 = <<eos
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
eos

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi
matched = str1.match(regex1)

matched['name'] # => "Al Bundy"
matched['other'] # => "loststreet 4"
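
The same lookup on ref_id shows the value the question is about. This is a small sketch that assumes the str1 and regex1 from the question; symbol keys and MatchData#named_captures (available in Ruby 2.4 and later) are alternatives to string keys:

```ruby
str1 = <<EOS
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
EOS

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi
matched = str1.match(regex1)

# Named groups can be read with a string or a symbol key.
matched[:ref_id] # => "Some address: loststreet 4"

# named_captures returns all named groups as a Hash with string keys.
captures = matched.named_captures
captures['name']  # => "Al Bundy"
captures['other'] # => "loststreet 4"
```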

score 0 · Accepted Answer

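One goal of writing code is to make it maintainable. That means making it easy to read and understand for whoever works on the code after you.

Regular expressions are often a maintenance nightmare. In my experience they can usually be reduced in complexity, or replaced entirely, and still yield equally useful code. Parsing this kind of text is a good example of when not to use a complex pattern.

I'd do it this way: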

str1 = <<eos
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
eos

def get_value(s)
  _, value = s.split(':')
  value.strip if value
end

rows = str1.split("\n")
firstname          = get_value(rows[1]) # => "Al Bundy"
ref_person         = get_value(rows[2]) # => nil
some_address       = get_value(rows[3]) # => "loststreet 4"
some_other_address = get_value(rows[4]) # => "loststreet 4"
zip_code           = get_value(rows[5]) # => nil

Split the text into lines, then pick out the data you need.

With map, this can be reduced to something more succinct:

firstname, ref_person, some_address, some_other_address, zip_code = rows[1..-1].map{ |s| get_value(s) }
firstname          # => "Al Bundy"
ref_person         # => nil
some_address       # => "loststreet 4"
some_other_address # => "loststreet 4"
zip_code           # => nil

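If you absolutely must have a regular expression, just for the sake of having one, then simplify it and isolate its task. It is possible to write a pattern that spans multiple lines, skipping and capturing text as it goes, but getting there is painful, the pattern becomes more fragile as it grows, and it is likely to break if the incoming text changes. By reducing its complexity, you are more likely to avoid that fragility and make your code more robust: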

def get_value(s)
  s[/^([^:]+):(.*)/]
  name, value = $1, $2
  value.strip! if value

  [name.downcase.tr(' ', '_'), value]
end

data_hash = Hash[
  str1.split("\n").select{ |s| s[':'] }.map{ |s| get_value(s) }
]
data_hash # => {"firstname"=>"Al Bundy", "ref_person"=>"", "some_address"=>"loststreet 4", "some_other_address"=>"loststreet 4", "zip_code"=>""}

score 0 · Accepted Answer

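Because you wrote an optional \s? in your regex (after "Ref person:"), it can match a newline \n when a field is empty. The ref_id group then starts on the next line and captures "Some address: loststreet 4". Replace it with [^\S\n]? (you have to do the same for every \s? that must not match a newline).

(Note that after each field you use .* to get to the next one; replace it with the lazy .*? to avoid too much backtracking.)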
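
A minimal sketch of both changes applied to the pattern from the question, assuming the same str1. The name fixed_regex is only for illustration, and the single space after "Some other address:" is also replaced with [^\S\n]? for consistency:

```ruby
str1 = <<EOS
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
EOS

# [^\S\n]? allows at most one whitespace character that is not a newline,
# so an empty field can no longer swallow the line break.
# .*? is lazy, so it stops at the first occurrence of the next label.
fixed_regex = /FirstName:[^\S\n]?(?<name>[^\n]*).*?Ref person:[^\S\n]?(?<ref_id>[^\n]*).*?Some other address:[^\S\n]?(?<other>[^\n]*)/mi

m = str1.match(fixed_regex)
m[:name]   # => "Al Bundy"
m[:ref_id] # => ""
m[:other]  # => "loststreet 4"
```

With this pattern the empty "Ref person:" field yields an empty string instead of the contents of the following line.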

score 0 · Accepted Answer

It looks like your regex is missing some parts. Try:

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some address:\s?(?<address>[^\n]*).*Some other address:\s?(?<other>[^\n]*)/mi

It is easier to read in extended mode:

regex1 = %r{
  FirstName:\s?(?<name>[^\n]*).*
  Ref\ person:\s?(?<ref_id>[^\n]*).*
  Some\ address:\s?(?<address>[^\n]*).*
  Some\ other\ address:\s?(?<other>[^\n]*)
}xmi

Just make sure to escape literal spaces, because extended mode ignores unescaped whitespace in the pattern.
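
For what it's worth, here is a sketch of the extended pattern run against the str1 from the question. On this particular input the extra "Some address" group forces ref_id to backtrack to the empty string. That relies on the "Some address:" line following "Ref person:" directly, so treat it as an observation about this sample rather than a general fix for \s? matching a newline:

```ruby
str1 = <<EOS
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
EOS

# The x flag ignores unescaped whitespace and newlines inside the pattern,
# so the literal spaces in the labels are written as "\ ".
regex1 = %r{
  FirstName:\s?(?<name>[^\n]*).*
  Ref\ person:\s?(?<ref_id>[^\n]*).*
  Some\ address:\s?(?<address>[^\n]*).*
  Some\ other\ address:\s?(?<other>[^\n]*)
}xmi

m = str1.match(regex1)
m[:name]    # => "Al Bundy"
m[:ref_id]  # => ""
m[:address] # => "loststreet 4"
m[:other]   # => "loststreet 4"
```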
