I'm trying to organize and break apart the content of emails retrieved via Net::POP3. In my code, when I use

p mail.pop

I get:


11) Summary: Working with Vars on Social Influence 

Name: Megumi Lindon 

Category: Social Psychology 

Email: information@example.com 

Journal News: Saving Grace 

Deadline: 10:00 PM EST - 15 February

Questions:Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Requirements: Psychologists; anyone with good knowdledge with sociology and psychology. 



So far I've been using Rubular, but my results vary because I'm still learning how to use regular expressions, gsub, and split properly. My code so far is below.

  p mail.pop.scan(/Summary: (.+) Name:/)
  p mail.pop.scan(/Name: (.+) Category:/)
  p mail.pop.scan(/Category: (.+) Email:/) 
  p mail.pop.scan(/Email: (.+) Journal News:/)     
  p mail.pop.scan(/Journal News: (.+) Deadline:/)       
  p mail.pop.scan(/Deadline: (.+) Questions:/)    
  p mail.pop.scan(/Questions:(.+) Requirements:/) 
  p mail.pop.scan(/Requirements:(.+) Back to Top/)  





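First, remove the newlines (here text is assumed to hold the message string, for example the value returned by mail.pop):

s1 = text.gsub(/\n/, '')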

Next, there are a lot of "20\r" sequences that could cause trouble. Since we may want to keep other text that contains digits, we remove only those (and "7941\r"):

s2 = s1.gsub(/\d+\r/, '') 
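As a quick check of these two substitutions, here is a minimal sketch on a made-up string. The sample text is hypothetical and only imitates the "20\r" noise described above; it is not the real message.

```ruby
# Hypothetical sample: two fields, each line ending in "20\r\n" noise.
raw = "> Name: Megumi Lindon 20\r\n> Deadline: 10:00 PM EST - 15 February 20\r\n"

no_newlines = raw.gsub(/\n/, '')            # drop every "\n"
cleaned     = no_newlines.gsub(/\d+\r/, '') # drop only digits directly followed by "\r"

p cleaned
# => "> Name: Megumi Lindon > Deadline: 10:00 PM EST - 15 February "
```

The digits in "10:00" and "15" survive because they are not immediately followed by "\r".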


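To see what surrounds each field name, look at the four characters before it and the fifteen characters after its colon:

puts s2.scan(/.{4}(?:\w+\s+)*\w+:.{15}/)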
  # <> Summary: Working with V
  #=>> Name: Megumi Lindon 
  #=>> Category: Social Psychol
  #=>> Email: information@ex
  #=>> Journal News: Saving Grace 
  #=>> Deadline: 10:00 PM EST -
  #=>> Query:=>>=>> Lorem ip
  #=>> Requirements:=>>=>> Psycholo
  # <x-msg://30/#top> Back

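We see that the fields of interest begin with "> " and that the field name is followed by ": " or ":=". Let's simplify by changing ":=" to ": " after the field name, and "> " to " :" before the field name: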

s3 = s2.gsub(/(?<=\w):=/, ": ")
s4 = s3.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")

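In the regex for s3, (?<=\w) is a "positive lookbehind": the match must be immediately preceded by a word character, which is not part of the match. In the regex for s4, (?=(?:\w+\s+)*\w+: ) is a "positive lookahead": the match must be immediately followed by one or more words, then a colon, then a space. Note that s3 and s4 must be computed in the order given.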
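The two lookarounds, and why the order matters, can be tried on a short string. This is a sketch; the sample string is hypothetical and only imitates the "=>> " and ":=" patterns seen in the scan output.

```ruby
# Hypothetical sample with one ": " field and one ":=" field.
sample = "=>> Name: Megumi Lindon =>> Query:=>>=>> Lorem ipsum"

# Lookbehind: ":=" is replaced only when a word character precedes it.
step1 = sample.gsub(/(?<=\w):=/, ": ")
# Lookahead: "> " is replaced only when a field name followed by ": " comes next.
step2 = step1.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")

p step2
# => "=> :Name: Megumi Lindon => :Query: >>=>> Lorem ipsum"

# Running the second substitution first misses "Query", because at that
# point the name is still followed by ":=" rather than ": ".
swapped = sample.gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
p swapped
# => "=> :Name: Megumi Lindon =>> Query:=>>=>> Lorem ipsum"
```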


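Now remove every character that is not a letter, a digit, a space or one of a few punctuation marks. Note that !-( inside the character class is a range (from "!" through "("), so a literal hyphen is not kept; that is why the "-" and "@" are missing from the results below.

s5 = s4.gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")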
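The effect of the `!-(` range in that character class can be checked on a small string. The sketch below compares the class as written with a variant that places the hyphen last so it is taken literally; the variant and the sample string are illustrations, not part of the original code.

```ruby
# Hypothetical sample containing "-", ",", "@" and "#".
sample = "EST - 15 February, info@example.com #top"

# As written: "!-(" is a range covering ! " # $ % & ' (, so "#" is kept and "-" is removed.
as_written = sample.gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")
p as_written
# => "EST  15 February infoexample.com #top"

# Variant (an assumption about the intent): a trailing "-" is a literal hyphen.
hyphen_last = sample.gsub(/[^a-zA-Z0-9 :;.?!()\[\]{}-]/, "")
p hyphen_last
# => "EST - 15 February infoexample.com top"
```

Switching to the variant would change the intermediate results shown in the rest of this answer, which were produced with the class as written.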


Next, split the string on the field names:

a1 = s5.split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
  # => ["11)  :", "Summary: ", "Working with Vars on Social Influence platform :",
  #     "Name: ", "Megumi Lindon  :",
  #     "Category: ", "Social Psychology :",
  #     "Email: ", "informationexample.com mailto:informationexample.com :",
  #     "Journal News: ", "Saving Grace  :",
  #     "Deadline: ", "10:00 PM EST  15 February :",
  #     "Query:  ", "Lorem ipsum ...laborum. :",
  #     "Requirements:  ", "Psychologists; anyone...psychology...Top xmsg:30#top...Psychology"] 

Note that I've enclosed (?<= :)(?:\w+\s+)*\w+:\s+ in a capture group, so that String#split includes the bits it splits on in the resulting array.
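The difference the capture group makes can be seen by splitting the same string with and without it. A small sketch, using a shortened, hypothetical version of the cleaned-up string:

```ruby
# Hypothetical shortened input in the " :Field: value" form produced above.
s = "11)  :Summary: Working with Vars :Name: Megumi Lindon"

# Without a capture group the field names are discarded.
without_group = s.split(/(?<= :)(?:\w+\s+)*\w+:\s+/)
p without_group
# => ["11)  :", "Working with Vars :", "Megumi Lindon"]

# With a capture group String#split also returns what it split on.
with_group = s.split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
p with_group
# => ["11)  :", "Summary: ", "Working with Vars :", "Name: ", "Megumi Lindon"]
```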


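We are almost finished. Strip the trailing colon from each piece, merge the first two elements, and pair each field name with its value:

a2 = a1.map { |s| s.chomp(':') }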
a2[0] = a2.shift + a2.first
  #=> "11)  Summary: "
a3 = a2.each_slice(2).to_a
  #=> [["11)  Summary: ", "Working with Vars on Social Influence platform "],
  #    ["Name: ", "Megumi Lindon  "],
  #    ["Category: ", "Social Psychology "],
  #    ["Email: ", "informationexample.com mailto:informationexample.com "],
  #    ["Journal News: ", "Saving Grace  "],
  #    ["Deadline: ", "10:00 PM EST  15 February "],
  #    ["Query:  ", "Lorem...est laborum. "],
  #    ["Requirements:  ", "Psychologists;...psychology. Please...xmsg:30#SocialPsychology"]] 
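The chomp, shift and each_slice steps can be followed on a short array. This is a sketch with a hypothetical five-element input in the same shape as a1:

```ruby
# Hypothetical pieces, shaped like the result of the split above.
pieces = ["11)  :", "Summary: ", "Working with Vars :", "Name: ", "Megumi Lindon"]

# Remove one trailing ":" where present; strings ending in a space are unchanged.
trimmed = pieces.map { |s| s.chomp(':') }

# shift removes "11)  " from the front; it is then prepended to the new first element.
trimmed[0] = trimmed.shift + trimmed.first

# Group consecutive elements into [field name, value] pairs.
pairs = trimmed.each_slice(2).to_a
p pairs
# => [["11)  Summary: ", "Working with Vars "], ["Name: ", "Megumi Lindon"]]
```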

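Remove the duplicated address from the Email field by keeping only the text up to the first whitespace:

idx = a3.index { |n,_| n =~ /Email: / }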
  #=> 3 
a3[idx][1] = a3[idx][1][/.*?\s/] if idx
  #=> "informationexample.com " 


Join each pair into one string and collapse the extra whitespace:

a4 = a3.map { |b| b.join(' ').split.join(' ') }
  #=> ["11) Summary: Working with Vars on Social Influence platform",
  #    "Name: Megumi Lindon",
  #    "Category: Social Psychology",
  #    "Email: informationexample.com",
  #    "Journal News: Saving Grace",
  #    "Deadline: 10:00 PM EST 15 February",
  #    "Query: Lorem...laborum.",
  #    "Requirements: Psychologists...psychology. Please...well. Thank...Psychology"] 
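The join/split/join idiom used for a4 can be seen on a single pair. In this sketch the pair is taken from the a3 output above; the squeeze-based alternative is only shown for comparison and is not part of the original code.

```ruby
pair = ["Deadline: ", "10:00 PM EST  15 February "]

# join(' ') glues the pair together, split breaks it on any run of whitespace,
# and the final join(' ') puts exactly one space between the words.
line = pair.join(' ').split.join(' ')
p line
# => "Deadline: 10:00 PM EST 15 February"

# For input that contains only spaces, squeeze plus strip gives the same result.
alternative = pair.join(' ').squeeze(' ').strip
p alternative == line
# => true
```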


Finally, keep only the first sentence of the Requirements field:

idx = a4.index { |s| s =~ /Requirements: / }
  #=> 7
a4[idx] = a4[idx][/.*?[.!?]/] if idx
  # => "Requirements: Psychologists; anyone with good knowsledge with sociology and psychology."
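Both the Email and the Requirements clean-ups rely on a non-greedy `.*?` inside `String#[]` with a regex. A minimal sketch of that behavior, where the Requirements text is a shortened, hypothetical example:

```ruby
# Hypothetical shortened Requirements line with trailing sentences.
text = "Requirements: Psychologists; anyone with good knowledge. Please reply. Thank you!"

# Non-greedy: stop at the first ".", "!" or "?".
first_sentence = text[/.*?[.!?]/]
p first_sentence
# => "Requirements: Psychologists; anyone with good knowledge."

# Greedy: runs to the last such character, so here it returns the whole string.
greedy = text[/.*[.!?]/]
p greedy == text
# => true

# The same idea keeps only the first address in the Email value.
email = "informationexample.com mailto:informationexample.com "[/.*?\s/]
p email
# => "informationexample.com "
```

One caveat with this approach: a field whose first sentence contains an abbreviation such as "e.g." would be cut at that first period.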


Wrapping it all up in a method:

def parse_it(text)
  a1 = text.gsub(/\n/, '')
           .gsub(/\d+\r/, '') 
           .gsub(/(?<=\w):=/, ": ")
           .gsub(/>\s+(?=(?:\w+\s+)*\w+: )/, " :")
           .gsub(/[^a-zA-Z0-9 :;.?!-()\[\]{}]/, "")
           .split(/((?<= :)(?:\w+\s+)*\w+:\s+)/)
           .map { |s| s.chomp(':') }

  a1[0] = a1.shift + a1.first

  a2 = a1.each_slice(2).to_a
  idx = a2.index { |n,_| n =~ /Email: / }
  a2[idx][1] = a2[idx][1][/.*?\s/] if idx

  a3 = a2.map { |b| b.join(' ').split.join(' ') }    
  idx = a3.index { |s| s =~ /Requirements: / }
  a3[idx] = a3[idx][/.*?[.!?]/] if idx

  a3
end

