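I'm writing a script for our internal customer support system. I want to collect email from our IMAP inbox (hosted on Gmail) and parse the messages into a database.
What's the best way to clean up the frames, badly encoded tags, and messy formatting, so that I end up with clean text with minimal formatting?
I know regular expressions will most likely play a big part, but I'm wondering whether this functionality already exists in a library I'm missing.
Edit: To be more specific about what needs to be removed:
All inline CSS/styles and all HTML, except simple formatting such as bold, underline, and italics.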
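To make that requirement concrete, here is a rough sketch of the behavior I'm after, written in Python with only the standard library's `html.parser`. The names `EmailCleaner`, `clean_email_html`, `KEEP`, `SKIP`, and `BLOCK` are made up for illustration, and it assumes the HTML body has already been pulled out of the MIME message and decoded to a string. It is not meant as a finished solution, only as a statement of the expected input and output.

```python
import html
import re
from html.parser import HTMLParser

KEEP = {"b", "strong", "i", "em", "u"}     # simple formatting to preserve
SKIP = {"script", "style"}                 # drop these tags and their contents
BLOCK = {"br", "p", "div", "tr", "li"}     # turn these into line breaks


class EmailCleaner(HTMLParser):
    """Collects text plus a whitelist of bare formatting tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif self.skip_depth:
            return
        elif tag in KEEP:
            # Re-emit the tag without attributes, so inline styles are lost.
            self.parts.append(f"<{tag}>")
        elif tag in BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Escape text so a literal "<" cannot be mistaken for a tag.
            self.parts.append(html.escape(data, quote=False))


def clean_email_html(raw):
    parser = EmailCleaner()
    parser.feed(raw)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t\r\f\v]+", " ", text)   # collapse runs of spaces
    text = re.sub(r" *\n[ \n]*", "\n", text)    # collapse blank lines
    return text.strip()


sample = (
    '<td style="color:red"><span style="font-size:12px">Connect with '
    '<b class="x">ZoneAlarm</b></span><br><img src="spacer.gif"></td>'
)
print(clean_email_html(sample))
```

Links and images are dropped entirely in this sketch; whether link text or URLs should survive is something I haven't decided yet.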
Here is an email I'm using as a test case. It's a fairly elaborate piece of spam I received from ZoneAlarm, and it has a bit of everything in it.
<td>
<br>
<br>
<table align="center" bgcolor="#749FD0" border="0" cellpadding="0" cellspacing="0" style="font-family:Arial,Helvetica,sans-serif;font-size:12px;line-height:16px;color:#555555" valign="top" width="700">
<tbody>
<tr>
<td>
<table align="center" border="0" cellpadding="0" cellspacing="0" valign="top" width="680">
<tbody>
<tr>
<td height="10">
<img border="0" height="1" src="http://download.zonealarm.com/bin/images/email/socialguard/spacer.gif" style="display: block; max-width: 2880px;" width="1"></td>
</tr>
</tbody>
</table>
<table align="center" border="0" cellpadding="0" cellspacing="0" valign="top" width="680">
<tbody>
<tr>
<td height="10" width="10">
<img border="0" height="10" src="http://www.zonealarm.com/email/campaigns/2013/2013_06_SummerSale/nw.png" style="display: block; max-width: 2880px;" width="10"></td>
<td bgcolor="#E3ECEC" height="10" width="660">
<a href="http://track.zonealarm.com:80/track?type=click&enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&&&2000&&&http://www.zonealarm.com?cid=E200246" target="_blank"><img alt="ZoneAlarm by Check Point Software Technologies LTD." border="0" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/za_transparent.png" width="120" style="display: block; max-width: 2880px;" title="ZoneAlarm by Check Point Software Technologies LTD."></a></td>
<td align="right" style="font-family:Arial,Helvetica,sans-serif" width="150">
<span style="color:#999999;font-size:12px">Connect with ZoneAlarm</span></td>
<td align="right" valign="middle" width="125">
<a href="http://track.zonealarm.com:80/track?type=click&enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&&&2001&&&http://www.facebook.com/ZoneAlarmFirewall" target="_blank"><img alt="ZoneAlarm Facebook" border="0" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/facebook.png" width="22" title="ZoneAlarm Facebook" style="max-width: 2880px;"></a> <a href="http://track.zonealarm.com:80/track?type=click&enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&&&2002&&&http://twitter.com/zonealarm" target="_blank"><img alt="ZoneAlarm Twitter" border="0" width="22" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/twitter.png" title="ZoneAlarm Twitter" style="max-width: 2880px;"></a> <a href="http://track.zonealarm.com:80/track?type=click&enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&&&2003&&&http://www.youtube.com/zonealarmsecurity" target="_blank"><img alt="ZoneAlarm YouTube" border="0" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/youtube.png" title="ZoneAlarm YouTube" height="22" style="max-width: 2880px;"></a><img border="0" height="15" src="http://download.zonealarm.com/bin/images/email/socialguard/spacer.gif" width="10" style="max-width: 2880px;"></td>
<td bgcolor="#E3ECEC" rowspan="6" align="center" valign="top" width="1">
<img align="right" height="32" src="http://download.zonealarm.com/bin/images/emails/welcome/borderx1.png" width="1" style="max-width: 2880px;">
</td>
</tr>
</tbody>
</table>
<table align="center" border="0" cellpadding="0" cellspacing="0" valign="top" width="680">
<tbody>
<tr>
<td height="10" width="10">
<img border="0" height="10" src="http://www.zonealarm.com/email/campaigns/2013/2013_06_SummerSale/sw.png" style="display: block; max-width: 2880px;" width="10"></td>
<td bgcolor="#E3ECEC" height="10" width="660">