I am trying to extract 6 fields (Sender, Customer ID, etc.) from the body of an HTML email:

$string = '... some other html text ... <p>
   <strong>Sender:</strong>&nbsp;Holly Schöne<br>
   <strong>Customer ID:</strong>&nbsp;3853XXXX<br>
   <strong>Email:</strong>&nbsp;email@test.net<br>
   <strong>Transaction ID:</strong>&nbsp;836248467<br>
   <strong>Reference:</strong>&nbsp;product<br>
   <strong>Explanation:</strong>&nbsp;Holly Schöne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
</p>... some more html text ...';

...which I fetch and decode like this:

$message = imap_fetchbody($inbox, $email_number, $section);
// determine $encoding and $charset
$decodedMessage = decodeMessage($message, $encoding, $charset);
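
The "determine $encoding and $charset" step is not shown in the post. A minimal sketch of how it could look, assuming the values are read from a part object returned by imap_fetchstructure() (for a multipart message that would be one entry of its parts array). The helper name getEncodingAndCharset and the stand-in $part object are hypothetical and only there so the example runs without a mailbox:

```php
<?php
// Hypothetical helper (not from the original post): reads the transfer
// encoding and the charset from one part object of imap_fetchstructure().
function getEncodingAndCharset(object $part): array
{
    // The transfer encoding is reported as an integer
    // (3 = BASE64, 4 = QUOTED-PRINTABLE); a missing value is treated as 0.
    $encoding = isset($part->encoding) ? (int) $part->encoding : 0;

    $charset = null;
    if (!empty($part->ifparameters)) {
        foreach ($part->parameters as $parameter) {
            // Attribute names may arrive in upper or lower case.
            if (strtolower($parameter->attribute) === 'charset') {
                $charset = $parameter->value;
                break;
            }
        }
    }

    return [$encoding, $charset];
}

// Stand-in for a real part object (assumption, for illustration only).
$part = (object) [
    'encoding'     => 4,
    'ifparameters' => 1,
    'parameters'   => [(object) ['attribute' => 'CHARSET', 'value' => 'ISO-8859-1']],
];

[$encoding, $charset] = getEncodingAndCharset($part);
```

The two values can then be passed to decodeMessage() unchanged.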

using this function (the cases for the other encodings are omitted because nothing is done there):

function decodeMessage($message, $encoding, $charset) {
    switch ($encoding) {
        case 3: // BASE64
            $message = base64_decode($message);
            break;
        case 4: // QUOTED-PRINTABLE
            $message = quoted_printable_decode($message);
            break;
        default:
            break;
    }
    if ($charset != NULL) {
        $message = mb_convert_encoding($message , 'utf-8' , $charset);
        //$message = mb_convert_encoding($message , 'iso-8859-1' , $charset);
    }
    return $message;
}

This all works like a charm. The problem starts here:

$regex = '/\<p\>[\w\W. ]*?\<strong\>Sender\:\<\/strong\>&nbsp;(?<sender>[\w\W ]+?)\<br\>.*?\<strong\>Customer ID\:\<\/strong\>&nbsp;(?<customerId>[\w\W ]+?)\<br\>.*?\<strong\>Email\:\<\/strong\>&nbsp;(?<email>[\w\W ]+?)\<br\>.*?\<strong\>Transaction ID\:\<\/strong\>&nbsp;(?<transactionId>[\w\W ]+?)\<br\>.*?\<strong\>Reference\:\<\/strong\>&nbsp;(?<reference>[\w\W ]+?)\<br\>.*?\<strong\>Explanation\:\<\/strong\>&nbsp;(?<explanation>[\w\W ]+?)\<\/p\>/is';
$result = preg_match($regex, $decodedMessage, $matches);

If I apply that regex to the string above, I get what I want, an array like this:

print_r($matches) = Array (
    [0] => <p>
       <strong>Sender:</strong>&nbsp;Holly Schöne<br>
       <strong>Customer ID:</strong>&nbsp;3853XXXX<br>
       <strong>Email:</strong>&nbsp;email@test.net<br>
       <strong>Transaction ID:</strong>&nbsp;836248467<br>
       <strong>Reference:</strong>&nbsp;product<br>
       <strong>Explanation:</strong>&nbsp;Holly Schöne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
    </p>
    [sender] => Holly Schöne
    [1] => Holly Schöne
    [customerId] => 3853XXXX
    [2] => 3853XXXX
    [email] => email@test.net
    [3] => email@test.net
    [transactionId] => 836248467
    [4] => 836248467
    [reference] => product
    [5] => product
    [explanation] => Holly Schöne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
    [6] => Holly Schöne #gfh65f4h65sg1h65sd1hf61ht3d51ht41g#
)

...but if I do the same with $decodedMessage, I get:

preg_last_error() -> PREG_NO_ERROR
$result -> [empty string]
$matches -> array()

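I have tried everything I could think of and searched around, but I cannot figure out what the problem is. My guess is that it has something to do with the encoding or charset of the email body. Any help would be greatly appreciated.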
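
When a pattern matches a hand-typed sample but not the real message, it can help to look at the exact bytes around one of the labels instead of a rendered or pretty-printed view. A small sketch of such a check; the helper name showContext and the sample string are hypothetical:

```php
<?php
// Hypothetical debugging helper: returns the text around the first
// occurrence of $needle with control and non-ASCII bytes escaped.
function showContext(string $haystack, string $needle, int $radius = 40): ?string
{
    $pos = strpos($haystack, $needle);
    if ($pos === false) {
        return null; // the label itself is not present in the text
    }

    $start   = max(0, $pos - $radius);
    $snippet = substr($haystack, $start, strlen($needle) + 2 * $radius);

    // Escape control characters and bytes >= 0x7F, so a UTF-8 no-break
    // space becomes visible as \302\240 and a line feed as \n.
    return addcslashes($snippet, "\0..\37\177..\377");
}

// Stand-in for $decodedMessage (assumption, for illustration only).
$sample = "<p>\n<strong>Sender:</strong> Holly<br />\n";
echo showContext($sample, 'Sender:', 12), PHP_EOL;
```

Differences such as a plain space instead of &nbsp; or <br /> instead of <br> show up directly in that output.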


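OK, you asked for it. I thought this question was already long enough... here is the var_dump; I only changed a few personal details.

...ah, damn... and there is my problem.

I let myself be fooled by Waterfox's source viewer.

It displays <br /> as <br> and adds a <tbody> to every table, so the source I based my regex on is not the source the email actually has. I feel stupid now. Below is the actual HTML source: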

<html>
<table width="750" cellpadding="0" cellspacing="0">
    <tr>
        <td style="background-repeat:no-repeat;" background="http://i1.mbsvr.net/images/bg_mailframe.gif" width="100%" align="center">
            <table width="95%" align="center">
                <tr>
                    <td align="left" style="padding:10px 0 0 10px;">
                        <a href="http://www.moneybookers.com/app/?l=EN" target="_blank" style="color:FD932C;font-weight:normal;" onfocus="this.blur()">
                            <img src="http://i1.mbsvr.net/images/skrill/mb-logo-the-future.png" border="0" />
                        </a>
                    </td>
                </tr>
            </table>
            <table width="740">
                <tr><td style="padding:0px 40px 0px 0px" align="center">
<table width="100%" border="0" cellpadding="0" cellspacing="0">
    <tr>
        <td valign="top" align="middle">
            <table cellspacing="0" cellpadding="0" width="100%" border="0">
                <tr>
                    <td>
                        <hr style="!important; font-family: verdana, arial, sans-serif; border: 0; width: 100%; height: 2px; border-top: 1px solid #9AA6CD; overflow: hidden;" />
                    </td>
                </tr>
                <tr> 
                    <td style="!important; font-family: verdana, arial, sans-serif; margin: 0; padding: 0px 0px 10px 0px; color: #EF8116; font-weight: bold; font-size: 18px;" nowrap width="50%">
                        You have received EUR 0.05
                    </td>
                </tr>
                <tr>
                    <td style="!important; font-family: verdana, arial, sans-serif;  font-size: 11px;   color: #656565;">
                        <br/> 
                        Dear Mmmmmmm Bbbbbbb,<br />
                        <br/>
                        Holly Schöne has sent you EUR 0.05 via Skrill (Moneybookers). The full details of the transaction are:<br />
                        <p>
                            <strong>Sender:</strong> Holly Schöne<br />
                            <strong>Customer ID:</strong> 3853XXXX<br />
                            <strong>Email:</strong> email@test.net<br />
                            <strong>Transaction ID:</strong> 836151721<br />
                            <strong>Reference:</strong> TPBwishes<br />
                            <strong>Explanation:</strong> Holly Schoene
#gsg4sda65g4r65e4g8s4g56asd54e#
                        </p>
                        Your money is waiting for you in your Skrill (Moneybookers) account - <a href="https://www.moneybookers.com">https://www.moneybookers.com</a>.<br />
                        <br />
                        <b>IMPORTANT:</b> If you are using Skrill (Moneybookers) commercially, we <b>STRONGLY</b> advise that you check in your Skrill (Moneybookers) account history that the money is there.<br />
                        <br />
                        Have you increased your withdrawal and receiving limits? Just log into your Skrill (Moneybookers) account and click <b>View Limits</b> in the "My Account" section.<br />
                        <br />
                        Kind regards,<br />
                        Skrill (Moneybookers)<br />
                    </td>
                </tr>
                <tr>
                    <td>
                        <hr style="!important; font-family: verdana, arial, sans-serif; border: 0; margin: 8px 0px 0px 0px; padding: 6px 0px 0px 0px; width: 100%; height: 2px; border-top: 1px solid #9AA6CD; overflow: hidden;" />
                    </td>
                </tr>
            </table>
            <table cellspacing=0 cellpadding=0 width="100%" border=0> 
    <tr>
        <td style="font-family: verdana, arial, sans-serif; font-size: 12px;    color: #656565;"><b>Skrill (Moneybookers) Security Reminders</b></td>
    </tr>
       <tr>          

<td class=smooth valign="top" style="font-family: verdana, arial, sans-serif; font-size: 11px;  color: #656565;"><p> <br>              <strong>Protect Your Password</strong><br>Skrill (Moneybookers) and its representatives will NEVER ask you to reveal your password. There are NO EXCEPTIONS to this policy. If anyone asks for your password by phone or by email, or on any website other than moneybookers.com, refuse and immediately report this to <a href="mailto:security@moneybookers.com" style="color: #862165; text-decoration: none; outline: none !important; font-weight: bold;">security@moneybookers.com</a>.<br><br><strong>Access your account ONLY using the login link on the Moneybookers homepage</strong><br>Please be advised that Skrill (Moneybookers) and its representatives will NEVER send you an email asking you to provide your login details within a form provided or to click on a hyperlink to access your account! Immediately report any incident to <a href="mailto:security@moneybookers.com"              style="color: #862165; text-decoration: none; outline: none !important; font-weight: bold;">security@moneybookers.com</a>.<br><br><strong>Case Sensitive Login</strong><br>Please remember your password is case-sensitive, at least 8 characters long and contains at least one number or non-alphabetic character such as '-'. <br>              <br>            </p></td>                </tr>      </table>
        </td>
    </tr>
</table>                </td></tr>
                <tr>
                    <td style="padding:0px 54px 0px 0px" class="separator"><hr style="border: 0; margin: 8px 0px 0px 0px; padding: 6px 0px 0px 0px; width: 100%; height: 2px; border-top: 1px solid #9AA6CD; overflow: hidden;"/></td>
                </tr>
            </table>
            <table align="left" width="740">
                <tr>
                    <td width="10"> </td>
                    <td style="font-family: verdana, arial, sans-serif; font-size: 11px;    color: #656565;" valign="top" width="100%" align="center">
                    Moneybookers Ltd., London, Registered in England and Wales no 4260907.<br>
Registered office: Welken House, 10-11 Charterhouse Square, London, EC1M 6EH, United Kingdom.<br>
Authorised by the Financial Services Authority (FSA) under the Electronic Money Regulations 2011 for the issuing of electronic money.
                    </td>
                </tr>
            </table>
        </td>
    <tr>
        <td valign="top">
            <img src="http://i1.mbsvr.net/images/bg_mailframe_bottom.gif" border="0" />
        </td>
    </tr>
</table>
</html>

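So, together with Tomalak's answer, I now have two working solutions.

My regex, which now works, takes the properly closed <br /> tags into account and also parses the amount: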

$regex = '/<td .*?>.*?You have received(?<value>.+?\d+\.\d\d).*?<\/td>.*?<p>.*?<strong>Sender:<\/strong>(?<sender>.+?)<br\s*\/?>.*?<strong>Customer ID:<\/strong>(?<customerId>.+?)<br\s*\/?>.*?<strong>Email:<\/strong>(?<email>.+?)<br\s*\/?>.*?<strong>Transaction ID:<\/strong>(?<transactionId>.+?)<br\s*\/?>.*?<strong>Reference:<\/strong>(?<reference>.+?)<br\s*\/?>.*?<strong>Explanation:<\/strong>(?<explanation>.+?)<\/p>/is';
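
As an alternative to one long pattern that fixes the order of all six fields, the label/value pairs can be collected one at a time. This is only a sketch under the assumption that every field looks like <strong>Label:</strong> value and ends at a <br>, <br /> or </p>; the helper name extractFields and the shortened sample are hypothetical:

```php
<?php
// Hypothetical helper: collects every "<strong>Label:</strong> value" pair.
function extractFields(string $html): array
{
    // Label: text up to the colon inside <strong>.
    // Value: everything up to the next <br>, <br /> or </p>,
    // with a leading &nbsp; or whitespace skipped.
    $pattern = '~<strong>\s*([^<:]+):\s*</strong>(?:&nbsp;|\s)*(.*?)\s*(?=<br\s*/?>|</p>)~is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $fields = [];
    foreach ($matches as $match) {
        $fields[trim($match[1])] = $match[2];
    }

    return $fields;
}

// Shortened stand-in for the real email body (assumption).
$html = <<<'HTML'
<p>
    <strong>Sender:</strong> Holly Schöne<br />
    <strong>Customer ID:</strong>&nbsp;3853XXXX<br>
    <strong>Explanation:</strong> Holly Schoene
#gsg4sda65g4r65e4g8s4g56asd54e#
</p>
HTML;

$fields = extractFields($html);
print_r($fields);
```

Headings such as <strong>Protect Your Password</strong> are skipped because they contain no colon before the closing tag.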

And the adjusted XPath for Tomalak's solution below:

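$path = "//p/strong[contains(., '$info')]/following-sibling::text()[1]";

The leading double slash means: match this path anywhere in the DOM tree. Without it the path is relative to the context node. Requiring the <p> parent makes it match only where I want it to.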
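
How the start of an XPath expression changes what DOMXPath::query() finds can be checked on a tiny document. A sketch with a made-up fragment (the markup and variable names are assumptions, not taken from the email):

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><div><p><strong>Sender:</strong> Holly<br></p></div></body></html>');
$xpath = new DOMXPath($doc);

$tail = "p/strong[contains(., 'Sender:')]/following-sibling::text()[1]";

// Relative path: evaluated from the context node. Without a second
// argument query() starts at the root element (<html>), so only a <p>
// directly below <html> could match.
$relative = $xpath->query($tail);

// Leading "//": the <p> may sit at any depth in the document.
$anywhere = $xpath->query('//' . $tail);

// Relative path with an explicit context node (the <div>).
$div     = $doc->getElementsByTagName('div')->item(0);
$fromDiv = $xpath->query($tail, $div);

printf("%d %d %d\n", $relative->length, $anywhere->length, $fromDiv->length);
```

Here only the second and third query return the text node after the label.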

Thanks to everyone who tried to help.


1 Answer


Just for the sake of it, here is an implementation that avoids regex entirely.

$doc = new DOMDocument();
$doc->loadHTML($decodedMessage);
$xpath = new DOMXPath($doc);

$info = array(
  'sender'         => get_info($xpath, 'Sender:'),
  'customer_id'    => get_info($xpath, 'Customer ID:'),
  'email'          => get_info($xpath, 'Email:'),
  'transaction_id' => get_info($xpath, 'Transaction ID:'),
  'reference'      => get_info($xpath, 'Reference:'),
  'explanation'    => get_info($xpath, 'Explanation:')
);


function get_info($xpath_object, $info) 
{
    $result = null;
    $path   = "//strong[contains(., '$info')]/following-sibling::text()[1]";
    $nodes  = $xpath_object->query($path);

    foreach ($nodes as $node)
    {
        $result = $node->textContent;
        break;
    }

    return $result;
}
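
Two practical details are worth guarding against with this approach: loadHTML() may misread UTF-8 input as ISO-8859-1 when the markup declares no charset, and the returned text node still carries the leading space or no-break space. A hedged sketch of one way to handle both, assuming $decodedMessage is UTF-8; the helper names loadUtf8Html and getInfoTrimmed are hypothetical and not part of the code above:

```php
<?php
// Hypothetical wrapper: builds a DOMXPath from UTF-8 HTML.
function loadUtf8Html(string $html): DOMXPath
{
    // Turn every non-ASCII character into a numeric entity so that
    // loadHTML() cannot misinterpret the bytes as ISO-8859-1.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    // Collect markup warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return new DOMXPath($doc);
}

// Hypothetical variant of get_info() that trims the value.
function getInfoTrimmed(DOMXPath $xpath, string $label): ?string
{
    $nodes = $xpath->query("//strong[contains(., '$label')]/following-sibling::text()[1]");
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Strip ASCII whitespace and the no-break space (U+00A0) at both ends.
    return preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $nodes->item(0)->textContent);
}

// Stand-in for $decodedMessage (assumption, for illustration only).
$xpath = loadUtf8Html("<p><strong>Sender:</strong>&nbsp;Holly Sch\u{00F6}ne<br /></p>");
var_dump(getInfoTrimmed($xpath, 'Sender:'));
```

The label is interpolated into the XPath string as-is, so this only suits fixed labels that contain no single quote.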
answered 2013-03-24T19:30:07.593