0

在正则表达式中,如何匹配任意数量的任意字符(例如,(.|\n)*)而不消耗可能遵循的其他匹配?如果这个问题不清楚,这是我的情况:

在一个文本文件中,我有一堆包含标题的电子邮件都粘贴在一起。

编辑:下面的清洁版本在换行符的开头有每个标题。我的实际数据可能会,也可能不会。每个标头组件(如'From: xxx')前面可能有任何东西或什么都没有。在某些情况下,许多电子邮件和标题可能都在一条线上,在一堆其他杂物之后。最重要的是,我需要识别其他包含“发件人:”的电子邮件标题。所以,我需要识别整个标题样式。

在我编辑之前给出的几个答案依赖于 ^ 或制表符分隔之类的东西,我不能指望它们。他们似乎可以稍微修改一下,但我(显然)对正则表达式不太好,而且我自己也无法调整它们。我很抱歉之前忽略了这一点,只是为了让几个回答者抓住它......我对正则表达式缺乏经验的另一个产品。

这是一个丑陋的版本- 这是我实际上试图匹配的字符串。它包含两个标题和要提取的消息。

emailsString = u"""From:\n     Lastname, Firstname\n     Sent:\n     Monday, June 24, 2013 1:48 PM\n     To:\n     Othername, Name\n     Subject:\n     RE: Center update\n    Message message message.\n    Such a lovely message\n    Take care,\n    Firstname Lastname, MS\n     Long signature\n     in this email\n   \n    E-mail:\n     email@email.com\n     Web\n     my blog\n     From:\n     Lastname, Firstname\n     Sent:\n     Monday, June 24, 2013 9:33 AM\n     To:\n     Othername, Name\n     Subject:\n     Center update\n     Importance:\n     High\n    Good Morning Name,\n    I hope this finds you doing well.\n    I wanted to inform you of some changes. The Center will be closing August 30\n     th\n     .  or September 1\n     st\n     .  I\u2019ve enjoyed my experience. """

这是一个更简洁的版本来显示标题的样子

From: Lastname, Firstname
Sent: Monday, July 15th, 2011, 9:36 AM
To: Othername, Name
Subject: blah
Importance: High

Message message message
second line of message

second para of message

From: Lastname, Firstname
Sent: Thursday, July 18th, 2011, 10:45 AM
To: Othername, Name
Subject: blahblah

message

...

我正在尝试将标头中的信息与消息本身一起正则表达式。我有一个可以成功匹配所有标题的正则表达式,但我正在努力处理这个消息。问题是,一条消息可以包含任何东西(或什么都没有)。可能有多个换行符等。我想得到所有这些,但我仍然想拆分电子邮件。我的尝试(请注意,标题的“重要”部分是可选的):

for hit in re.finditer(r'[\s\n]*From:[\s\n]*(?P<from>.*)[\s\n]*Sent:[\s\n]*(?P<date>.*)[\s\n]*To:[\s\n]*(?P<to>.*)[\s\n]*Subject:[\s\n]*(?P<subject>.*)[\s\n]*(?:Importance:)?[\s\n]*.*[\s\n]*(?P<message>(.|\n)*)', allEmailsString):
    print "from: " + hit.group("from")
    print "to: " + hit.group("to")
    print "date: " + hit.group("date")
    print "subject: " + hit.group("subject")
    print "message: " + hit.group("message")

问题是,消息组正在抓住一切。因此,我正确获取了第一个电子邮件标头的 from/to/etc,然后看到一条包含该电子邮件消息的消息,以及所有后续电子邮件标头和消息。我需要抓住“直到下一个电子邮件标题/正则表达式匹配或直到字符串结尾的所有内容”。

我已经有一个解决方法 - 我可以摆脱消息捕获组并只获取标题。然后,遍历匹配对象并根据它们的开始/结束对字符串进行切片。例如,message1 是从 match1.end 到 match2.start。

所以,我问...

  • 有没有办法可以通过在我的正则表达式中捕获组来做到这一点?
  • 有更好的解决方法吗?
4

3 回答 3

1

仅当文本由可变部分和稳定部分(或至少具有稳定可变性的部分......)组成时,正则表达式才能用于提取文本块

在以下正则表达式模式中,我对“稳定”部分进行了一些假设以提高它们的数量,从而可以区分电子邮件并在似乎没有确定锚点的文本中提取所需的块:

  • 我想在“发送”部分中,总是有一个星期几的名称

  • 我想如果存在“重要性”这一行,那么只有一个词可以描述这种重要性,然后[^ \t\r\n]+

  • 我认为主题描述不能多行,然后[^\r\n]+

如果文本中稳定部分的数量太少,也就是说文本的结构太松散,使用正则表达式开始变得不可能。

该模式[ \t\r\n]*(?P<from>.*?[^ \t\r\n])[ \t\r\n]*'strip捕获的组有影响。
然后,如果几个空行构成消息,则匹配的结果说该消息是''

\Z如果在最后一条消息之后没有其他行,如我的文本示例中所示,则必须存在才能捕获桅杆电子邮件。

import re


emailsString = (u'     From:\n'
                '     Lastname, Firstname\n'
                '     Sent:\n'
                '     Monday, June 24, 2013 1:48 PM\n'
                '     To:\n'
                '     Othername, Name\n'
                '     Subject:\n'
                '     RE: Center update\n'
                '    Message message message.\n'
                '    Such a lovely message\n'
                '    Take care,\n'
                '    Firstname Lastname, MS\n'
                '     Long signature\n'
                '     in this email\n'
                '   \n'
                '    E-mail:\n'
                '     email@email.com\n'
                '     Web\n'
                '     my blog\n'
                '     From:\n'
                '     Lastname, Firstname\n'
                '     Sent:\n'
                '     Monday, June 24, 2013 9:33 AM\n'
                '     To:\n'
                '     Othername, Name\n'
                '     Subject:\n'
                '     Center update\n'
                '     Importance:\n'
                '     High\n'
                '    Good Morning Name,\n'
                '    I hope this finds you doing well.\n'
                '    I wanted to inform you of some changes. The Center will be closing August 30\n'
                '     th\n'
                '     .  or September 1\n'
                '     st\n'
                '     .  I\u2019ve enjoyed my experience. ')


allEmailsString = '''
From: FirstLastname, FirstFirstname
Sent: Monday, July 15th, 2011, 9:36 AM
To: TheOne
Subject: blah
Importance: High

Message message message
second line of message

second para of message

From: MidLastname, MidFirstname
Sent: Thursday, July 18th, 2011, 10:45 AM
To: TWOTWO
Subject: once upon



From: LastLastname, LastFirstname
Sent: Saturday, July 20th, 2011, 12:51 AM
To: Mr Three
Subject: blobloblo

Nothing to say. '''



dispat = ("*  from: {from}\n"
          "*  to: {to}\n"
          "*  date: {date}\n"
          "*  subject: {subject}\n"
          "** message (beginning on next line):\n{message}\n"
          "-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-")



regx = re.compile('From:[ \t\r\n]*(?P<from>.*?[^ \t\r\n])'
                  '[ \t\r\n]*'
                  'Sent:[ \t\r\n]*'
                  '(?P<date>.*?(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day.*?[^ \t\r\n])'
                  '[ \t\r\n]*'
                  'To:[ \t\r\n]*(?P<to>.*?[^ \t\r\n])'
                  '[ \t\r\n]*'
                  'Subject:[ \t\r\n]*(?P<subject>[^\r\n]+)'
                  '[ \t\r\n]*'
                  '(?:Importance:[ \t\r\n]*(?P<importance>[^ \t\r\n]+))?'
                  '[ \t\r\n]*'
                  '(?P<message>.*?)'
                  '(?=[ \t\r\n]*From:.*?'
                  'Sent:.*?(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day.*?'
                  'To.*?Subject:|\Z)',
                  re.DOTALL)


for s in (emailsString,allEmailsString):
    print ''.join(dispat.format(**d)
                  for d in (ma.groupdict('') for ma in regx.finditer(s)))
    print '\n#######################################\n'

结果

*  from: Lastname, Firstname
*  to: Othername, Name
*  date: Monday, June 24, 2013 1:48 PM
*  subject: RE: Center update
** message (beginning on next line):
Message message message.
    Such a lovely message
    Take care,
    Firstname Lastname, MS
     Long signature
     in this email

    E-mail:
     email@email.com
     Web
     my blog
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-*  from: Lastname, Firstname
*  to: Othername, Name
*  date: Monday, June 24, 2013 9:33 AM
*  subject: Center update
** message (beginning on next line):
Good Morning Name,
    I hope this finds you doing well.
    I wanted to inform you of some changes. The Center will be closing August 30
     th
     .  or September 1
     st
     .  I\u2019ve enjoyed my experience. 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

#######################################

*  from: FirstLastname, FirstFirstname
*  to: TheOne
*  date: Monday, July 15th, 2011, 9:36 AM
*  subject: blah
** message (beginning on next line):
Message message message
second line of message

second para of message
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-*  from: MidLastname, MidFirstname
*  to: TWOTWO
*  date: Thursday, July 18th, 2011, 10:45 AM
*  subject: once upon
** message (beginning on next line):

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-*  from: LastLastname, LastFirstname
*  to: Mr Three
*  date: Saturday, July 20th, 2011, 12:51 AM
*  subject: blobloblo
** message (beginning on next line):
Nothing to say. 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

#######################################
于 2013-10-29T00:09:05.173 回答
0

我只是划分(split)和征服(re.match):

import re

# `data` is your text file
delimiter = r'(^|\n)From:'
capturer = re.compile(r'From:[\n\s]*(?P<from>.*)[\n\s]*'
                      r'Sent:[\n\s]*(?P<date>.*)[\n\s]*'
                      r'To:[\n\s]*(?P<to>.*)[\n\s]*'
                      r'Subject:[\n\s]*(?P<subject>.*)[\n\s]*'
                      r'(?:Importance:)?[\n\s]*.*[\n\s]*'
                      r'(?P<message>(\n|.)*)')

raw_emails = ['From:' + d for d in re.split(delimiter, data) if d.strip()]
emails = []
for raw_email in raw_emails:
    parts = capturer.match(raw_email)
    emails.append(parts.groupdict())

对于您的示例数据,此输出:

[{'date': 'Monday, July 15th, 2011, 9:36 AM',
  'from': 'Lastname, Firstname',
  'message': 'Message message message\nsecond line of message\n\nsecond para of message\n',
  'subject': 'blah',
  'to': 'Othername, Name'},
 {'date': 'Thursday, July 18th, 2011, 10:45 AM',
  'from': 'Lastname, Firstname',
  'message': '...\n',
  'subject': 'blahblah',
  'to': 'Othername, Name'}]
于 2013-10-29T00:39:55.043 回答
0

这可能看起来很痛苦。为了清楚起见,它已扩展。
使用多行模式和 No-DotAll。

@mobabo - 在您发表第一条评论后进行编辑。

必须清楚地描述您的关键字,并且确实存在。您的陈述
I can't count on things like '^From' to work表明您没有查看以前的
正则表达式,并且该部分与此部分相同。^[^\S\n]*From:不一样^From


此外,主题和消息或重要性和消息之间没有明确的划分。如果“重要性”是电子邮件的一部分,则主题有一个终点。

我制作了一个正则表达式来处理你的脏邮件和干净的电子邮件,底部是一个 Perl
程序来练习它。输出包括在内。看看这是否可以解决您的问题
(见下文)。

不幸的是,这是您所希望的最好的。

祝先生好运!
(注意——如果 Python 有递归,这个正则表达式将是这个大小的 1/4)

 # Compressed
 # -------------------
 #  ^[^\S\n]*From:\s*(?P<from>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*)(?:\s*^[^\S\n]*Sent:\s*(?P<sent>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*To:\s*(?P<to>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*Subject:\s*(?P<subject>(?:(?!\s*^[^\S\n]*(?:(?:From|Sent|To|Subject|Importance)):)[\S\s])*)(?:\s*^[^\S\n]*Importance:\s*(?P<importance>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?)?

 # Expanded
 # -------------------
 #

 ^ [^\S\n]* From: \s* 
 (?P<from>
      (?:
           (?!
                \s* ^ [^\S\n]* 
                (?: From | Sent | To | Subject | Importance )
                :
           )
           [\S\s] 
      )*
 )

 (?:
      \s* ^ [^\S\n]* Sent: \s* 
      (?P<sent>
           (?:
                (?!
                     \s* ^ [^\S\n]* 
                     (?: From | Sent | To | Subject | Importance )
                     :
                )
                [\S\s] 
           )*
      )
 )?

 (?:
      \s* ^ [^\S\n]* To: \s* 
      (?P<to>
           (?:
                (?!
                     \s* ^ [^\S\n]* 
                     (?: From | Sent | To | Subject | Importance )
                     :
                )
                [\S\s] 
           )*
      )
 )?

 (?:
      \s* ^ [^\S\n]* Subject: \s* 
      (?P<subject>
           (?:
                (?!
                     \s* ^ [^\S\n]* 
                     (?:
                          (?: From | Sent | To | Subject | Importance )
                     )
                     :
                )
                [\S\s] 
           )*
      )

      (?:
           \s* ^ [^\S\n]* Importance: \s* 
           (?P<importance>
                (?:
                     (?!
                          \s* ^ [^\S\n]* 
                          (?: From | Sent | To | Subject | Importance )
                          :
                     )
                     [\S\s] 
                )*
           )
      )?
 )?


 # // Output from Perl sample code (below)
 # //
 # // ======================
 # // From:
 # //         Lastname, Firstname
 # // Sent:
 # //         Monday, July 15th, 2011, 9:36 AM
 # // To:
 # //         Othername, Name
 # // Subject:
 # //         blah
 # // Importance/Message:
 # //         High
 # // 
 # // Message message message
 # // second line of message
 # // 
 # // second para of message
 # // 
 # // 
 # // ======================
 # // From:
 # //         Lastname, Firstname
 # // Sent:
 # //         Thursday, July 18th, 2011, 10:45 AM
 # // To:
 # //         Othername, Name
 # // Subject/Message:
 # //         blahblah
 # // 
 # // message
 # // 
 # // 
 # // ======================
 # // From:
 # //         Lastname, Firstname
 # // Sent:
 # //         Monday, June 24, 2013 1:48 PM
 # // To:
 # //         Othername, Name
 # // Subject/Message:
 # //         RE: Center update
 # //     Message message message.
 # //     Such a lovely message
 # //     Take care,
 # //     Firstname Lastname, MS
 # //      Long signature
 # //      in this email
 # // 
 # //     E-mail:
 # //      email@email.com
 # //      Web
 # //      my blog
 # // 
 # // 
 # // ======================
 # // From:
 # //         Lastname, Firstname
 # // Sent:
 # //         Monday, June 24, 2013 9:33 AM
 # // To:
 # //         Othername, Name
 # // Subject:
 # //         Center update
 # // Importance/Message:
 # //         High
 # //     Good Morning Name,
 # //     I hope this finds you doing well.
 # //     I wanted to inform you of some changes. The Center will be closing August 30
 # // 
 # //      th
 # //      .  or September 1
 # //      st
 # //      .  I've enjoyed my experience.
 # // 

 # ------------------------------------------------------------
 # # Perl sample code
 # use strict;
 # use warnings;
 # 
 # $/ = undef;
 # 
 # my $str = <DATA>;
 # 
 # 
 # 
 # while ( $str =~ /
 #     ^[^\S\n]*From:\s*(?P<from>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*)(?:\s*^[^\S\n]*Sent:\s*(?P<sent>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*To:\s*(?P<to>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?(?:\s*^[^\S\n]*Subject:\s*(?P<subject>(?:(?!\s*^[^\S\n]*(?:(?:From|Sent|To|Subject|Importance)):)[\S\s])*)(?:\s*^[^\S\n]*Importance:\s*(?P<importance>(?:(?!\s*^[^\S\n]*(?:From|Sent|To|Subject|Importance):)[\S\s])*))?)?
 # /xmg)
 # 
 # {
 #  print "\n\n======================\n";
 #  print "From: \n\t$+{from}\n";
 #  if (defined $+{sent})
 #  {
 #      print "Sent: \n\t$+{sent}\n";
 #  }
 #  if (defined $+{to})
 #  {
 #      print "To: \n\t$+{to}\n";
 #  }
 #  if (defined $+{importance})
 #  {
 #      print "Subject: \n\t$+{subject}\n";
 #      print "Importance/Message: \n\t$+{importance}\n";
 #  }
 #  elsif (defined $+{subject})
 #  {
 #      print "Subject/Message: \n\t$+{subject}\n";
 #  }
 # }
 # 
 # 
 # __DATA__
 # 
 # From: Lastname, Firstname
 # Sent: Monday, July 15th, 2011, 9:36 AM
 # To: Othername, Name
 # Subject: blah
 # Importance: High
 # 
 # Message message message
 # second line of message
 # 
 # second para of message
 # 
 # From: Lastname, Firstname
 # Sent: Thursday, July 18th, 2011, 10:45 AM
 # To: Othername, Name
 # Subject: blahblah
 # 
 # message
 # 
 # 
 # 
 # 
 # 
 # From:
 #      Lastname, Firstname
 #      Sent:
 #      Monday, June 24, 2013 1:48 PM
 #      To:
 #      Othername, Name
 #      Subject:
 #      RE: Center update
 #     Message message message.
 #     Such a lovely message
 #     Take care,
 #     Firstname Lastname, MS
 #      Long signature
 #      in this email
 #    
 #     E-mail:
 #      email@email.com
 #      Web
 #      my blog
 #      From:
 #      Lastname, Firstname
 #      Sent:
 #      Monday, June 24, 2013 9:33 AM
 #      To:
 #      Othername, Name
 #      Subject:
 #      Center update
 #      Importance:
 #      High
 #     Good Morning Name,
 #     I hope this finds you doing well.
 #     I wanted to inform you of some changes. The Center will be closing August 30
 #      th
 #      .  or September 1
 #      st
 #      .  I've enjoyed my experience.
 # 
 # 
于 2013-10-29T01:02:34.773 回答