因此,这是我要解析的电子邮件示例,仅提取其正文。
RECEIVED: 2012-11 20 09:59:24
SUBJECT: Get Boddy
--- Original Sender: Mark Twain. ---
----- Original Message -----
From: Boby Indo
To: Obum Hunter
At: 11/20 9:59:22
***NEW ISSUE SUPPORTED THROUGH UNIVERSALITY vs 104-13 on AY 3s JAN
10+BB {MYXV ABC 4116 SM MYXV YA 102-15 <DO>} | 2010/11 4.0s 4.0s
6+ BB {MYXV ABC 4132 NS MYXV YT 102-22 <DO>} | 2010 4.5s 4.5s
ABO 2006-OP1 M1 00442PAG5 19-24 p5
***SECOND SUPPORTED TRHOUGH INVERSALITY GEVINGS
10+BB {NXTW VXA 4061 SL MYXV YA 103-22 <DO>} | 11 wala 3.5s 3.5s
10+BB {NXTW VXA 12-47 SP MYXV YA 106-20 <DO>} | 22 wala 4.0s 4.0s
------------------------------------------------------------
© Copyright 2012 The Ridgly Group, Inc. All rights reserved. See
http://www.examply.html for important information disclosure.
这是我的期望:
***NEW ISSUE SUPPORTED THROUGH UNIVERSALITY vs 104-13 on AY 3s JAN
10+BB {MYXV ABC 4116 SM MYXV YA 102-15 <DO>} | 2010/11 4.0s 4.0s
6+ BB {MYXV ABC 4132 NS MYXV YT 102-22 <DO>} | 2010 4.5s 4.5s
ABO 2006-OP1 M1 00442PAG5 19-24 p5
***SECOND SUPPORTED TRHOUGH INVERSALITY GEVINGS
10+BB {NXTW VXA 4061 SL MYXV YA 103-22 <DO>} | 11 wala 3.5s 3.5s
10+BB {NXTW VXA 12-47 SP MYXV YA 106-20 <DO>} | 22 wala 4.0s 4.0s
如果***
线条也可以消除就好了。
这就是我到目前为止所得到的(?P<header>[\S+\s]+At:.*)\n+(?P<body>[\S+\s]([\d\.\d]+[a-z]?$))
。这似乎做得不好,因为它在最后 4.0 秒后抓住了虚线并卡在了非 ascii 字符©
上。谢谢!
PS:我认为最好的方法是用组切断电子邮件的标题和尾部。所以剩下的就是身体了。因为标题和尾部始终保持不变,但正文在不同的电子邮件中会发生变化。解决方案不需要特定于电子邮件。