php - extract body from raw email with regex

Question

--047d7b33d6decd251504bfe78895
Content-Type: multipart/alternative; boundary=047d7b33d6decd250d04bfe78893

--047d7b33d6decd250d04bfe78893
Content-Type: text/plain; charset=UTF-8

twest

ini sebuah proiduct abru

awdawdawdawdwa

aw
awdawdaw

--047d7b33d6decd250d04bfe78893
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div class=3D"gmail_quote">twest=C2=A0<div><br></div><div>ini sebuah proidu=
ct abru</div><div><br></div><div>awdawdawdawdwa</div><div><br></div><div>aw=
</div><div>awdawdaw</div>
</div><br>

--047d7b33d6decd250d04bfe78893--

How can I get the text/plain and the text/html content of this mail with regex?
Does an email have only one content body, consisting of a text/html part and a text/plain part?

Here's a snippet of what I'm currently doing, which is wrong:

    $parts = explode('--', $this->rawemail);
    $this->headers = imap_rfc822_parse_headers($this->rawemail);
    # var_dump($parts);
    # Process the parts
    foreach ($parts as $part) 
    {
        # Get Content text/plain
        if (preg_match('/Content-Type: text\/plain;/', $part)) 
        {
            $body_parts = preg_split('/\n\n/', $part);

            # If Above the newline (Headers)
            if ($body_parts[0]) 
            {
                # var_dump($body_parts[0]);
            }

            # If Below the newline (Data)
            if ($body_parts[1]) 
            {
                var_dump($body_parts[1]);
            }
        }

        # Get Content text/html
        if (preg_match('/Content-Type: text\/html;/', $part)) 
        {
            $body_parts = preg_split('/\n\n/', $part);

            # If Above the newline (Headers)
            if ($body_parts[0]) 
            {
                # var_dump($body_parts[0]);
            }

            # If Below the newline (Data)
            if ($body_parts[1]) 
            {
                var_dump($body_parts[1]);
            }
        }
    }

score 4 · Accepted Answer

I think you'd be better off going through the email a line at a time, as it's the line breaks that matter most in how an e-mail is structured.

Your rules would be:

If you get a double line break before seeing any boundary, then the body is starting; treat it as plain text (there are no headers to indicate otherwise).
Otherwise, carry on until you get the "boundary=" bit, then record the boundary and hop into a "looking for boundary" mode.
Then, when you find a boundary, hop into "looking for Content-Type or double newline" mode: look for Content-Type (and note the content type) or a double newline (the headers have finished, and the body comes next and runs until the next boundary).
While reading the body of the message, you're back in "looking for boundary" mode, and the process repeats.
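
The rules above can be sketched as a small state machine. This is only an illustration under a few assumptions: the function name `extractBodies`, the state names, and the returned array shape are made up for this example, only the first boundary is tracked, and no Content-Transfer-Encoding decoding is attempted.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: walk a raw message one line at a time and collect
 * each body together with its declared Content-Type.
 */
function extractBodies(string $raw): array
{
    $boundary = null;        // first boundary seen; later ones are ignored
    $state    = 'headers';   // headers | seek | part_headers | body
    $type     = 'text/plain'; // default when no Content-Type is declared
    $buffer   = [];
    $parts    = [];

    foreach (preg_split('/\r\n|\r|\n/', $raw) as $line) {
        $trimmed = rtrim($line);

        // A boundary line closes the current body (if any) and opens a new part.
        if ($boundary !== null
            && ($trimmed === '--' . $boundary || $trimmed === '--' . $boundary . '--')) {
            if ($state === 'body') {
                // The blank line before a boundary belongs to the delimiter, so trim it.
                $parts[] = ['type' => $type, 'body' => rtrim(implode("\n", $buffer), "\n")];
            }
            if ($trimmed === '--' . $boundary . '--') {
                $state = 'done'; // closing delimiter: nothing more to read
                break;
            }
            $state  = 'part_headers';
            $type   = 'text/plain';
            $buffer = [];
            continue;
        }

        if ($state === 'headers' || $state === 'part_headers') {
            if ($boundary === null && preg_match('/boundary="?([^";\s]+)"?/i', $line, $m)) {
                $boundary = $m[1]; // record the boundary and stick with it
            }
            if (preg_match('/^Content-Type:\s*([^;\s]+)/i', $line, $m)) {
                $type = strtolower($m[1]);
            }
            if ($line === '') {
                // Double line break: headers are over. With a known boundary at the
                // top level we look for the first part; otherwise the body starts.
                $state = ($state === 'headers' && $boundary !== null) ? 'seek' : 'body';
            }
        } elseif ($state === 'body') {
            $buffer[] = $line;
        }
        // In 'seek' state, lines before the first boundary are skipped.
    }

    // Non-multipart (or truncated) message: whatever was read is the body.
    if ($state === 'body' && $buffer !== []) {
        $parts[] = ['type' => $type, 'body' => rtrim(implode("\n", $buffer), "\n")];
    }
    return $parts;
}
```

Calling `extractBodies($this->rawemail)` on a message like the one in the question should give one entry for text/plain and one for text/html; the HTML body would still be quoted-printable encoded at that point.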

Something I remember from a long time ago, so the following may not be 100% accurate, but I'll mention it just in case: be careful with messages that have attachments, as you can get two "boundary" markers. One boundary is nested within the other, though, so if you follow the rules above (i.e. grab the first boundary and stick with it) then you should be fine. But test your script with some attachments :)

Edit: additional info as asked in the question. An e-mail can have as many "bodies" as the sender wishes to encode. You can have a plain version, an HTML version, a UTF encoded version, an RTF version or even a Morse Code version (if the client knew how to handle "Content-Type: Morse/Code"!). Sometimes you don't get plain text, but only an HTML version (naughty users). Sometimes the HTML actually comes without the content type declaration (which may or may not get displayed as HTML, depending on the client). The boundary also splits off the attachments. Rich text is a gotcha from Outlook (although, to be fair, it usually IS converted to HTML). So no, there's somewhere between 0 and X bodies.
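
Because the number of bodies varies, code that consumes them probably wants a fallback order. The following is a minimal sketch, assuming the parts have already been collected into a list of `['type' => ..., 'body' => ...]` arrays and decoded from any transfer encoding; the function name `pickPlainText` and that array shape are hypothetical, chosen only for this example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: pick a displayable plain-text version from a list of
 * parts shaped like ['type' => 'text/plain', 'body' => '...'].
 * Bodies are assumed to be already decoded (e.g. from quoted-printable).
 */
function pickPlainText(array $parts): ?string
{
    $byType = [];
    foreach ($parts as $part) {
        // Keep the first body seen for each content type.
        $byType[$part['type']] ??= $part['body'];
    }

    if (isset($byType['text/plain'])) {
        return $byType['text/plain'];
    }
    if (isset($byType['text/html'])) {
        // HTML-only message: strip the markup as a rough plain-text fallback.
        $text = strip_tags($byType['text/html']);
        return trim(html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }
    return null; // zero usable bodies
}
```

Returning `null` keeps the "zero bodies" case explicit, so the caller has to decide what to show when a message has neither a plain nor an HTML version.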
