ruby - Unable to split data correctly using Ruby regex (Rubular)

Question

I'm trying to organize and break down the content of emails pulled in via Net::POP3. In my code, when I use

p mail.pop

I get

****************************\r\n>>=20\r\n>>11) <> Summary: Working with Vars on Social Influence =\r\nplatform=20\r\n>>=20\r\n>> Name: Megumi Lindon \r\n>>=20\r\n>> Category: Social Psychology=20\r\n>>=20\r\n>> Email: information@example.com =\r\n<mailto:information@example.com>=20\r\n>>=20\r\n>> Journal News: Saving Grace \r\n>>=20\r\n>> Deadline: 10:00 PM EST - 15 February=20\r\n>>=20\r\n>> Query:=20\r\n>>=20\r\n>> Lorem ipsum dolor sit amet \r\n>> consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.\r\n>>=20\r\n>> Duis aute irure dolor in reprehenderit in voluptate \r\n>> velit esse cillum dolore eu fugiat nulla pariatur. =20\r\n>>=20\r\n>> Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.=20\r\n>> Requirements:=20\r\n>>=20\r\n>> Psychologists; anyone with good knowdledge\r\n>> with sociology and psychology.=20\r\n>>=20\r\n>> Please do send me your article and profile\r\n>> you want to be known as well. Thank you!=20\r\n>> Back to Top <x-msg://30/#top> Back to Category Index =\r\n<x-msg://30/#SocialPsychology>\r\n>>-----------------------------------\r\n>>=20\r\n>>

I'm trying to break it down and organize it into

11) Summary: Working with Vars on Social Influence 

Name: Megumi Lindon 

Category: Social Psychology 

Email: information@example.com 

Journal News: Saving Grace 

Deadline: 10:00 PM EST - 15 February

Questions:Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Requirements: Psychologists; anyone with good knowdledge with sociology and psychology.

So far I've been using Rubular, with mixed results, because I'm still learning how to use regular expressions, gsub and split properly. My code so far is below.

  p mail.pop.scan(/Summary: (.+) Name:/)
  p mail.pop.scan(/Name: (.+) Category:/)
  p mail.pop.scan(/Category: (.+) Email:/) 
  p mail.pop.scan(/Email: (.+) Journal News:/)     
  p mail.pop.scan(/Journal News: (.+) Deadline:/)       
  p mail.pop.scan(/Deadline: (.+) Questions:/)    
  p mail.pop.scan(/Questions:(.+) Requirements:/) 
  p mail.pop.scan(/Requirements:(.+) Back to Top/)

But I keep getting empty arrays.

[]
[]
[]
[]
[]
[]
[]
[]

I'd like to know how I can do this better. Thanks in advance.

score 1 · Accepted Answer

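Wow! What a mess!

There are of course many ways to approach this problem, but I expect they all involve multiple steps and a lot of trial and error. I can only say how I went about it.

Having lots of small steps is a good thing, for a couple of reasons. Firstly, it breaks the problem down into manageable tasks whose solutions can be tested individually. Secondly, the parsing rules may change in the future. If you have several steps, you may only have to change and/or add one or two operations. If you have few steps and complex regular expressions, you may as well start over, particularly if the code was written by someone else.

Suppose text is a variable holding your string.

First, I don't like all those newlines, because they complicate the regexes, so the first thing I'd do is get rid of them: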

s1 = text.gsub(/\n/, '')
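One reason the newlines get in the way: without the /m flag, "." does not match "\n", so a pattern like the ones in the question cannot reach across lines. Here is a small sketch on a shortened, hypothetical sample (the sample_text helper is mine, not from the question):

```ruby
# Hypothetical, shortened sample of the raw mail text (not the full message).
def sample_text
  "Summary: Working with Vars =\r\nplatform=20\r\n>>=20\r\n>> Name: Megumi Lindon \r\n"
end

# Without /m, "." stops at "\n", so the pattern can never reach " Name:".
p sample_text.scan(/Summary: (.+) Name:/)
# => []

# With /m, "." also matches "\n". The pattern now matches, but the capture
# still carries the "\r\n", ">>" and "=20" debris, hence the cleanup steps.
p sample_text.scan(/Summary: (.+) Name:/m)
# => [["Working with Vars =\r\nplatform=20\r\n>>=20\r\n>>"]]
```

So adding /m alone would probably not give the tidy fields you want, which is why I clean the string first.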

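Next, there are lots of "20\r"s, which could be troublesome. Since we may want to keep other text that contains digits, we remove only digits that are immediately followed by "\r" (which also takes care of "7941\r"):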

s2 = s1.gsub(/\d+\r/, '')
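To check that these two steps leave ordinary numbers alone, here is a minimal sketch on one line of the message (the helper name strip_newlines_and_digit_cr is only for illustration):

```ruby
# Hypothetical helper wrapping the two steps above.
def strip_newlines_and_digit_cr(text)
  text.gsub(/\n/, '')      # drop every newline
      .gsub(/\d+\r/, '')   # drop digits only when a "\r" follows them
end

sample = ">> Deadline: 10:00 PM EST - 15 February=20\r\n>>=20\r\n"
p strip_newlines_and_digit_cr(sample)
# => ">> Deadline: 10:00 PM EST - 15 February=>>="
```

The "10:00" and "15" survive because no "\r" follows them; only the "20\r" runs are removed, leaving the "=" behind.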

Now let's look at the fields you want, together with the text immediately before and after each one:

puts s2.scan(/.{4}(?:\w+\s+)*\w+:.{15}/)
  # <> Summary: Working with V
  #=>> Name: Megumi Lindon 
  #=>> Category: Social Psychol
  #=>> Email: information@ex
  #<mailto:information@exa
  #=>> Journal News: Saving Grace 
  #=>> Deadline: 10:00 PM EST -
  #=>> Query:=>>=>> Lorem ip
  #=>> Requirements:=>>=>> Psycholo
  # <x-msg://30/#top> Back
  #<x-msg://30/#SocialPsy

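We see that the fields of interest are preceded by "> " and that the field name is followed by ": " or ":=". Let's simplify by changing ":=" to ": " after the field name, and "> " to " :" before the field name: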

s3 = s2.gsub(/(?<=\w):=/, ": ")
s4 = s3.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")

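In the regex for s3, (?<=\w) is a "positive lookbehind": the match must be immediately preceded by a word character, which is not part of the match. In the regex for s4, (?=(?:\w+\s+)*\w+: ) is a "positive lookahead": the match must be immediately followed by one or more words, then a colon, then a space. Note that s3 and s4 must be computed in the order given.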
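A small sketch of why the order matters, using a made-up fragment in the shape s2 has at this point (the helper names normalize_colon and mark_fields are hypothetical):

```ruby
# The s3 step: ":=" directly after a word character becomes ": ".
def normalize_colon(s)
  s.gsub(/(?<=\w):=/, ": ")
end

# The s4 step: "> " directly before "Field Name: " becomes " :".
def mark_fields(s)
  s.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
end

s2 = "=>> Query:=>>=>> Lorem ipsum=>> Requirements:=>>"

s3 = normalize_colon(s2)
p s3  # => "=>> Query: >>=>> Lorem ipsum=>> Requirements: >>"

s4 = mark_fields(s3)
p s4  # => "=> :Query: >>=>> Lorem ipsum=> :Requirements: >>"

# Run on s2 directly, the lookahead never finds "name: " (the names are still
# followed by ":="), so nothing is marked.
p mark_fields(s2) == s2  # => true
```

Notice that "> Lorem ipsum" is left alone, because no colon and space follow those words.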

We can now remove all non-word characters other than punctuation and spaces:

s5 = s4.gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")
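One thing to be aware of: inside a character class, !-( is read as a range (from "!" to "("), not as three separate characters. So "#" is kept, while "-" and "@" are stripped, which is why the results further down show "informationexample.com" and "EST  15 February". If you want to keep the hyphen and the "@", something like the variant below might do. It is only a sketch, and both helper names are hypothetical:

```ruby
# The class exactly as used above; "!-(" is the range "!" through "(".
def clean_as_written(s)
  s.gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")
end

# Hypothetical variant: the hyphen is escaped (so it is literal) and "@" is kept.
def clean_variant(s)
  s.gsub(/[^a-zA-Z0-9 :;.?!()\[\]{}@\-]/, "")
end

sample = "Email: information@example.com Deadline: 10:00 PM EST - 15 February <x-msg://30/#top>"

p clean_as_written(sample)
# => "Email: informationexample.com Deadline: 10:00 PM EST  15 February xmsg:30#top"

p clean_variant(sample)
# => "Email: information@example.com Deadline: 10:00 PM EST - 15 February x-msg:30top"
```

With the variant, the row of dashes at the end of each entry would survive as well, so you might need one more step to drop it. The rest of this answer uses the class as written.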

Then (finally) split on the fields:

a1 = s5.split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
  # => ["11)  :", "Summary: ", "Working with Vars on Social Influence platform :",
  #     "Name: ", "Megumi Lindon  :",
  #     "Category: ", "Social Psychology :",
  #     "Email: ", "informationexample.com mailto:informationexample.com :",
  #     "Journal News: ", "Saving Grace  :",
  #     "Deadline: ", "10:00 PM EST  15 February :",
  #     "Query:  ", "Lorem ipsum ...laborum. :",
  #     "Requirements:  ", "Psychologists; anyone...psychology...Top xmsg:30#top...Psychology"]

Note that I've enclosed (?<= :)(?:\w+\s+)*\w+:\s+ in a capture group, so that String#split includes the bits it splits on in the resulting array.

All that's left is some cleaning up:
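To see what the capture group buys you, compare the two splits on a short, made-up fragment in the shape of s5 (the helper names are hypothetical):

```ruby
# Without a capture group, split throws the field names away.
def split_dropping_names(s)
  s.split(/(?<= :)(?:\w+\s+)*\w+:\s+/)
end

# With the whole pattern in a capture group, split keeps them.
def split_keeping_names(s)
  s.split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
end

s5 = "11)  :Name: Megumi Lindon  :Category: Social Psychology"

p split_dropping_names(s5)
# => ["11)  :", "Megumi Lindon  :", "Social Psychology"]

p split_keeping_names(s5)
# => ["11)  :", "Name: ", "Megumi Lindon  :", "Category: ", "Social Psychology"]
```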

a2 = a1.map { |s| s.chomp(':') }
a2[0] = a2.shift + a2.first
  #=> "11)  Summary: "
a3 = a2.each_slice(2).to_a
  #=> [["11)  Summary: ", "Working with Vars on Social Influence platform "],
  #    ["Name: ", "Megumi Lindon  "],
  #    ["Category: ", "Social Psychology "],
  #    ["Email: ", "informationexample.com mailto:informationexample.com "],
  #    ["Journal News: ", "Saving Grace  "],
  #    ["Deadline: ", "10:00 PM EST  15 February "],
  #    ["Query:  ", "Lorem...est laborum. "],
  #    ["Requirements:  ", "Psychologists;...psychology. Please...xmsg:30#SocialPsychology"]] 

idx = a3.index { |n,_| n =~ /Email: / }
  #=> 3 
a3[idx][1] = a3[idx][1][/.*?\s/] if idx
  #=> "informationexample.com "
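The line a2[0] = a2.shift + a2.first is a bit terse, so here is the same idea on a short, made-up array. The right-hand side is evaluated first: shift removes and returns the leading "11)  ", and first then returns what has become the first element. The helper name pair_up is hypothetical:

```ruby
# Merge the leading item number into the first field name, then pair up
# [name, value]. Works on a copy so the caller's array is left untouched.
def pair_up(parts)
  parts = parts.dup
  parts[0] = parts.shift + parts.first
  parts.each_slice(2).to_a
end

parts = ["11)  ", "Summary: ", "Working with Vars ",
         "Email: ", "informationexample.com mailto:informationexample.com "]

pairs = pair_up(parts)
p pairs
# => [["11)  Summary: ", "Working with Vars "],
#     ["Email: ", "informationexample.com mailto:informationexample.com "]]

# The non-greedy /.*?\s/ keeps only the text up to and including the first
# whitespace character, which discards the "mailto:" duplicate.
p pairs[1][1][/.*?\s/]
# => "informationexample.com "
```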

Join the strings and remove the extra spaces:

a4 = a3.map { |b| b.join(' ').split.join(' ') }
  #=> ["11) Summary: Working with Vars on Social Influence platform",
  #    "Name: Megumi Lindon",
  #    "Category: Social Psychology",
  #    "Email: informationexample.com",
  #    "Journal News: Saving Grace",
  #    "Deadline: 10:00 PM EST 15 February",
  #    "Query: Lorem...laborum.",
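  #    "Requirements: Psychologists...psychology. Please...well. Thank...Psychology"]

"Requirements" is still a problem, but without additional rules nothing can be done about it. We can't limit every field value to a single sentence, because "Query" can have several. If you want to limit "Requirements" to one sentence: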

idx = a4.index { |n,_| n =~ /Requirements: / }
  #=> 7
a4[idx] = a4[idx][/.*?[.!?]/] if idx
  # => "Requirements: Psychologists; anyone with good knowdledge with sociology and psychology."
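One caveat with this step: String#[] with a regex returns nil when nothing matches, so a "Requirements" value with no ".", "!" or "?" would be replaced by nil. A possible guard, as a sketch with hypothetical helper names:

```ruby
# The step as written: text up to the first ".", "!" or "?", or nil.
def first_sentence(s)
  s[/.*?[.!?]/]
end

# Hypothetical guard: fall back to the whole string if there is no terminator.
def first_sentence_or_all(s)
  s[/.*?[.!?]/] || s
end

p first_sentence("Requirements: Psychologists; anyone. Please send. Thank you!")
# => "Requirements: Psychologists; anyone."

p first_sentence("Requirements: none")
# => nil

p first_sentence_or_all("Requirements: none")
# => "Requirements: none"
```

Note that this also cuts at the first period of an abbreviation such as "Dr.", so it is only as good as the rule "one sentence ends at the first terminator".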

If you want to combine these operations:

def parse_it(text)
  a1 = text.gsub(/\n/, '')
           .gsub(/\d+\r/, '') 
           .gsub(/(?<=\w):=/, ": ")
           .gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
           .gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")
           .split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
           .map { |s| s.chomp(':') }

  a1[0] = a1.shift + a1.first

  a2 = a1.each_slice(2).to_a
  idx = a2.index { |n,_| n =~ /Email: / }
  a2[idx][1] = a2[idx][1][/.*?\s/] if idx

  a3 = a2.map { |b| b.join(' ').split.join(' ') }    
  idx = a3.index { |n,_| n =~ /Requirements: / }
  a3[idx] = a3[idx][/.*?[.!?]/] if idx

  a3
end
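If you would rather look fields up by name than work with an array of strings, the result could be turned into a Hash. This is just a sketch: fields_to_hash is a hypothetical helper, and it assumes every line contains ": " (a line without it would make to_h raise):

```ruby
# Turn ["Name: value", ...] into {"Name" => "value", ...}.
# The leading item number, e.g. "11) ", is stripped before splitting.
def fields_to_hash(lines)
  lines.map { |line| line.sub(/\A\d+\)\s*/, '').split(': ', 2) }.to_h
end

# A few lines in the shape parse_it returns.
lines = ["11) Summary: Working with Vars on Social Influence platform",
         "Name: Megumi Lindon",
         "Deadline: 10:00 PM EST 15 February"]

fields = fields_to_hash(lines)
p fields["Summary"]   # => "Working with Vars on Social Influence platform"
p fields["Deadline"]  # => "10:00 PM EST 15 February"
```

Splitting on ": " with a limit of 2 keeps later colons, such as the one in "10:00", inside the value.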
