I'm writing scripts for our in-house customer support system. I want to collect emails from our IMAP inbox (hosted on Gmail) and parse them into a database.

What is the best way to clean up frames, badly encoded tags, and messy formatting so that I end up with clean text with minimal formatting?

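I know regular expressions will most likely play a big part, but I'm wondering whether this functionality already exists in another library that I'm missing.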

Edit: To be more specific, here is what needs to be removed:

All inline CSS/styles and all HTML, except for simple formatting such as bold, underline, and italics.

Here is an email I'm using as a test case. It's a fairly hefty spam message I received from ZoneAlarm, and it has a bit of everything in it.

<td>
                    <br>
                    <br>


                    <table align="center" bgcolor="#749FD0" border="0" cellpadding="0" cellspacing="0" style="font-family:Arial,Helvetica,sans-serif;font-size:12px;line-height:16px;color:#555555" valign="top" width="700">
                        <tbody>
                            <tr>
                                <td>

                                    <table align="center" border="0" cellpadding="0" cellspacing="0" valign="top" width="680">
                                        <tbody>
                                            <tr>
                                                <td height="10">
                                                    <img border="0" height="1" src="http://download.zonealarm.com/bin/images/email/socialguard/spacer.gif" style="display: block; max-width: 2880px;" width="1"></td>
                                            </tr>
                                        </tbody>
                                    </table>
                                    <table align="center" border="0" cellpadding="0" cellspacing="0" valign="top" width="680">
                                        <tbody>
                                            <tr>
                                                <td height="10" width="10">
                                                    <img border="0" height="10" src="http://www.zonealarm.com/email/campaigns/2013/2013_06_SummerSale/nw.png" style="display: block; max-width: 2880px;" width="10"></td>
                                                <td bgcolor="#E3ECEC" height="10" width="660">

                                                    <a href="http://track.zonealarm.com:80/track?type=click&amp;enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&amp;&amp;&amp;2000&amp;&amp;&amp;http://www.zonealarm.com?cid=E200246" target="_blank"><img alt="ZoneAlarm by Check Point Software Technologies LTD." border="0" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/za_transparent.png" width="120" style="display: block; max-width: 2880px;" title="ZoneAlarm by Check Point Software Technologies LTD."></a></td>
                                                <td align="right" style="font-family:Arial,Helvetica,sans-serif" width="150">
                                                    <span style="color:#999999;font-size:12px">Connect with ZoneAlarm</span></td>
                                                <td align="right" valign="middle" width="125">
                                                    <a href="http://track.zonealarm.com:80/track?type=click&amp;enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&amp;&amp;&amp;2001&amp;&amp;&amp;http://www.facebook.com/ZoneAlarmFirewall" target="_blank"><img alt="ZoneAlarm Facebook" border="0" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/facebook.png" width="22" title="ZoneAlarm Facebook" style="max-width: 2880px;"></a> <a href="http://track.zonealarm.com:80/track?type=click&amp;enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&amp;&amp;&amp;2002&amp;&amp;&amp;http://twitter.com/zonealarm" target="_blank"><img alt="ZoneAlarm Twitter" border="0" width="22" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/twitter.png" title="ZoneAlarm Twitter" style="max-width: 2880px;"></a> <a href="http://track.zonealarm.com:80/track?type=click&amp;enid=ZWFzPTEmbXNpZD0xJmF1aWQ9ODY4NjI4Jm1haWxpbmdpZD01NTE0MCZtZXNzYWdlaWQ9MzAwMDAmZGF0YWJhc2VpZD0xODQwMiZzZXJpYWw9MTY3OTIwMzgmZW1haWxpZD1nZWVrc2l4QGdtYWlsLmNvbSZ1c2VyaWQ9MV82MTE3JnRhcmdldGlkPSZmbD0mZXh0cmE9TXVsdGl2YXJpYXRlSWQ9JiYm&amp;&amp;&amp;2003&amp;&amp;&amp;http://www.youtube.com/zonealarmsecurity" target="_blank"><img alt="ZoneAlarm YouTube" border="0" src="http://www.zonealarm.com/email/campaigns/2013/2013_05_MemorialDay/youtube.png" title="ZoneAlarm YouTube" height="22" style="max-width: 2880px;"></a><img border="0" height="15" src="http://download.zonealarm.com/bin/images/email/socialguard/spacer.gif" width="10" style="max-width: 2880px;"></td>
                                                    <td bgcolor="#E3ECEC" rowspan="6" align="center" valign="top" width="1">
                                                <img align="right" height="32" src="http://download.zonealarm.com/bin/images/emails/welcome/borderx1.png" width="1" style="max-width: 2880px;">
                                                    </td>
                                            </tr>
                                        </tbody>
                                    </table>
                                    <table align="center" border="0" cellpadding="0" cellspacing="0" valign="top" width="680">
                                        <tbody>
                                            <tr>
                                                <td height="10" width="10">
                                                    <img border="0" height="10" src="http://www.zonealarm.com/email/campaigns/2013/2013_06_SummerSale/sw.png" style="display: block; max-width: 2880px;" width="10"></td>
                                                <td bgcolor="#E3ECEC" height="10" width="660">

1 Answer

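You can use HTML Purifier for this. It is a PHP library that filters HTML against an allowlist of elements and attributes, so you can keep only the few formatting tags you want. See: http://htmlpurifier.org/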
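
To make the allowlist idea concrete, here is a minimal sketch in Python using only the standard library. It is not HTML Purifier's API (that library is PHP) and does far less than it. The names `ALLOWED_TAGS`, `AllowlistCleaner`, and `clean_html` are made up for illustration, and the set of tags to keep is an assumption based on the "bold, underline, and italics" requirement in the question.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: "simple formatting" means bold, italic, and underline tags only.
ALLOWED_TAGS = {"b", "strong", "i", "em", "u"}
# Elements whose text content should be discarded along with the tag.
SKIP_CONTENT = {"script", "style"}


class AllowlistCleaner(HTMLParser):
    """Keeps allowlisted tags (without attributes) and plain text; drops the rest."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            # Attributes (style, class, ...) are intentionally not emitted.
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape text so stray "<" or "&" cannot turn back into markup.
            self.parts.append(escape(data, quote=False))


def clean_html(markup):
    """Return markup reduced to allowlisted tags and whitespace-collapsed text."""
    parser = AllowlistCleaner()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    sample = (
        '<td style="color:red"><span style="font-size:12px">Connect with '
        '<b class="x">ZoneAlarm</b></span><style>p{}</style> &amp; more</td>'
    )
    print(clean_html(sample))
```

This sketch does not balance unclosed tags, does not preserve line breaks between block elements, and has not been hardened against hostile input, so for mail from untrusted senders a maintained sanitizer such as the one recommended above is the safer choice.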

Answered 2013-06-27T09:11:55.523