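What is the correct way to parse the Received headers of an .eml file to extract all the hop information? Specifically, I need to extract the following:

  • Sender address
  • Sender IP
  • Receiver address
  • Receiver IP
  • Date
  • Protocol

I found the following specifications, but there seems to be no standard convention for the format of the Received header, and it may vary from server to server:

  1. RFC 2821 - Received Lines in Gatewaying
  2. RFC 2822 - Trace fields
  3. RFC 822 - 4.1 Syntax

For me, the clearest explanation is the one in the RFC 822 specification: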

 received    =  "Received"    ":"           ; one per relay
                  ["from" domain]           ; sending host
                  ["by"   domain]           ; receiving host
                  ["via"  atom]             ; physical path
                 *("with" atom)             ; link/mail protocol
                  ["id"   msg-id]           ; receiver msg id
                  ["for"  addr-spec]        ; initial form
                   ";"    date-time         ; time received

Consider the following Received headers:

Received: from VE1PR01MB5599.eurprd01.prod.exchangelabs.com
 (2603:10a6:7:7c::43) by HE1PR0102MB2714.eurprd01.prod.exchangelabs.com with
 HTTPS via HE1PR0402CA0054.EURPRD04.PROD.OUTLOOK.COM; Thu, 9 Jan 2020 16:34:13
 +0000

Received: from VI1PR0102CA0029.eurprd01.prod.exchangelabs.com
 (2603:10a6:802::42) by VE1PR01MB5599.eurprd01.prod.exchangelabs.com
 (2603:10a6:803:11f::30) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2602.12; Thu, 9 Jan
 2020 16:34:13 +0000

Received: from DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com
 (2a01:111:f400:7e02::203) by VI1PR0102CA0029.outlook.office365.com
 (2603:10a6:802::42) with Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2623.9 via Frontend
 Transport; Thu, 9 Jan 2020 16:34:13 +0000

Received: from relay-out.ohc.cu (200.55.138.44) by
 DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246) with Microsoft SMTP
 Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 15.20.2623.9 via Frontend Transport; Thu, 9 Jan 2020 16:34:12 +0000

Received: from relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1])
    by relay-out.ohc.cu (Postfix) with ESMTP id 69EA722DD
    for <some.email@some.domain>; Thu,  9 Jan 2020 11:29:43 -0500 (CST)

Received: from relay-out.ohc.cu ([127.0.0.1])
    by relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
    with ESMTP id 7CZku5Y59vGC for <some.email@some.domain>;
    Thu,  9 Jan 2020 11:29:38 -0500 (CST)

Received: from correo.patrimonio.ohc.cu (unknown [192.168.229.20])
    by relay-out.ohc.cu (Postfix) with ESMTP id B83BA22F5
    for <some.email@some.domain>; Thu,  9 Jan 2020 11:29:36 -0500 (CST)

Received: from localhost (localhost.localdomain [127.0.0.1])
    by correo.patrimonio.ohc.cu (Postfix) with ESMTP id 65413232A001
    for <some.email@some.domain>; Thu,  9 Jan 2020 11:40:05 -0500 (CST)

Received: from correo.patrimonio.ohc.cu ([127.0.0.1])
    by localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
    with ESMTP id hNMp-6lHHtzH for <some.email@some.domain>;
    Thu,  9 Jan 2020 11:40:05 -0500 (CST)

Received: from correoweb.patrimonio.ohc.cu (unknown [192.168.229.23])
    by correo.patrimonio.ohc.cu (Postfix) with ESMTPA id EC62A232A00A;
    Thu,  9 Jan 2020 11:39:53 -0500 (CST)
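Each header above is folded across several lines, so it usually helps to unfold the raw header block before extracting any fields. The following is a minimal sketch of that step, not a full RFC parser. The class name `ReceivedUnfolder` and the method `receivedValues` are hypothetical, and the sketch assumes that in the real .eml file the headers are not separated by blank lines (the blank lines in the listing above are taken to be for display only). It also simplifies unfolding by joining continuation lines with a single space.

```java
import java.util.ArrayList;
import java.util.List;

class ReceivedUnfolder {
    /** Returns the unfolded value of every Received header, top-most first. */
    static List<String> receivedValues(String rawMessage) {
        List<String> values = new ArrayList<>();
        StringBuilder current = null; // the Received value being assembled, if any
        for (String line : rawMessage.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // the first blank line ends the header block
            }
            boolean continuation = line.charAt(0) == ' ' || line.charAt(0) == '\t';
            if (continuation) {
                // Folded line: append it only if it belongs to a Received header.
                if (current != null) {
                    current.append(' ').append(line.trim());
                }
                continue;
            }
            // A new header field starts here, so flush the previous Received value.
            if (current != null) {
                values.add(current.toString());
                current = null;
            }
            // Header names are case-insensitive.
            if (line.regionMatches(true, 0, "Received:", 0, 9)) {
                current = new StringBuilder(line.substring(9).trim());
            }
        }
        if (current != null) {
            values.add(current.toString());
        }
        return values;
    }

    public static void main(String[] args) {
        String raw = "Received: from relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1])\r\n"
                   + "    by relay-out.ohc.cu (Postfix) with ESMTP id 69EA722DD\r\n"
                   + "    for <some.email@some.domain>; Thu,  9 Jan 2020 11:29:43 -0500 (CST)\r\n"
                   + "Subject: example\r\n"
                   + "\r\n"
                   + "body\r\n";
        receivedValues(raw).forEach(System.out::println);
    }
}
```

Each returned string is one hop on a single line, which is easier to feed to whatever field extraction follows.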

The parts that vary the most seem to be:

  1. The hosts

    e.g.

    • from VE1PR01MB5599.eurprd01.prod.exchangelabs.com (2603:10a6:7:7c::43)
    • by HE1PR0102MB2714.eurprd01.prod.exchangelabs.com
    • from VI1PR0102CA0029.eurprd01.prod.exchangelabs.com (2603:10a6:802::42)
    • by VE1PR01MB5599.eurprd01.prod.exchangelabs.com (2603:10a6:803:11f::30)
    • from DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com (2a01:111:f400:7e02::203)
    • by VI1PR0102CA0029.outlook.office365.com (2603:10a6:802::42)
    • from relay-out.ohc.cu (200.55.138.44)
    • by DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246)
    • from relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1])
    • by relay-out.ohc.cu (Postfix)
    • from relay-out.ohc.cu ([127.0.0.1])
    • by relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
    • from correo.patrimonio.ohc.cu (unknown [192.168.229.20])
    • by relay-out.ohc.cu (Postfix)
    • from localhost (localhost.localdomain [127.0.0.1])
    • by correo.patrimonio.ohc.cu (Postfix)
    • from correo.patrimonio.ohc.cu ([127.0.0.1])
    • by localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
    • from correoweb.patrimonio.ohc.cu (unknown [192.168.229.23])
    • by correo.patrimonio.ohc.cu (Postfix)
  2. The mail protocol, e.g.

    • with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)
    • with ESMTP
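Given how much these fields vary, what is a comprehensive approach for extracting this information? Other answers on SO discourage using regular expressions for this task, but how else should this kind of parsing be done? If there is a tested regular expression or some Java code/library that parses Received headers and extracts the information above, that would be fine with me.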

1 Answer

I would like to offer the following solution. You can find a full explanation of the regular expression used here.

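import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.LinkedList;
import java.util.HashMap;

class Rextester {
    public static void main(String[] args) {
        // Group 1: "Received:" at the start of a header; otherwise \G continues
        //          right where the previous match ended (never at input start).
        // Group 2: the clause keyword (";" introduces the date).
        // Group 3: the clause value, matched lazily up to the next keyword,
        //          the next "Received:" or the end of input.
        // Note: the keywords in the lookahead are not anchored on word boundaries,
        //       so a value with e.g. "id" or "by" inside a word would be cut short.
        Pattern p = Pattern.compile("(?:(Received:)|\\G(?!\\A))" +
                                    "\\s*(from|by|with|id|via|for|;)" +
                                    "\\s*(\\S+?(?:\\s+\\S+?)*?)\\s*" +
                                    "(?=Received:|by|with|id|via|for|;|\\z)");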
        String text = "Received: from VE1PR01MB5599.eurprd01.prod.exchangelabs.com\n" +
                      " (2603:10a6:7:7c::43) by HE1PR0102MB2714.eurprd01.prod.exchangelabs.com with\n" +
                      " HTTPS via HE1PR0402CA0054.EURPRD04.PROD.OUTLOOK.COM; Thu, 9 Jan 2020 16:34:13\n" +
                      " +0000\n" +
                      "\n" +
                      "Received: from VI1PR0102CA0029.eurprd01.prod.exchangelabs.com\n" +
                      " (2603:10a6:802::42) by VE1PR01MB5599.eurprd01.prod.exchangelabs.com\n" +
                      " (2603:10a6:803:11f::30) with Microsoft SMTP Server (version=TLS1_2,\n" +
                      " cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2602.12; Thu, 9 Jan\n" +
                      " 2020 16:34:13 +0000\n" +
                      "\n" +
                      "Received: from DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com\n" +
                      " (2a01:111:f400:7e02::203) by VI1PR0102CA0029.outlook.office365.com\n" +
                      " (2603:10a6:802::42) with Microsoft SMTP Server (version=TLS1_2,\n" +
                      " cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2623.9 via Frontend\n" +
                      " Transport; Thu, 9 Jan 2020 16:34:13 +0000\n" +
                      "\n" +
                      "Received: from relay-out.ohc.cu (200.55.138.44) by\n" +
                      " DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246) with Microsoft SMTP\n" +
                      " Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id\n" +
                      " 15.20.2623.9 via Frontend Transport; Thu, 9 Jan 2020 16:34:12 +0000\n" +
                      "\n" +
                      "Received: from relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1])\n" +
                      "    by relay-out.ohc.cu (Postfix) with ESMTP id 69EA722DD\n" +
                      "    for <some.email@some.domain>; Thu,  9 Jan 2020 11:29:43 -0500 (CST)\n" +
                      "\n" +
                      "Received: from relay-out.ohc.cu ([127.0.0.1])\n" +
                      "    by relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)\n" +
                      "    with ESMTP id 7CZku5Y59vGC for <some.email@some.domain>;\n" +
                      "    Thu,  9 Jan 2020 11:29:38 -0500 (CST)\n" +
                      "\n" +
                      "Received: from correo.patrimonio.ohc.cu (unknown [192.168.229.20])\n" +
                      "    by relay-out.ohc.cu (Postfix) with ESMTP id B83BA22F5\n" +
                      "    for <some.email@some.domain>; Thu,  9 Jan 2020 11:29:36 -0500 (CST)\n" +
                      "\n" +
                      "Received: from localhost (localhost.localdomain [127.0.0.1])\n" +
                      "    by correo.patrimonio.ohc.cu (Postfix) with ESMTP id 65413232A001\n" +
                      "    for <some.email@some.domain>; Thu,  9 Jan 2020 11:40:05 -0500 (CST)\n" +
                      "\n" +
                      "Received: from correo.patrimonio.ohc.cu ([127.0.0.1])\n" +
                      "    by localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)\n" +
                      "    with ESMTP id hNMp-6lHHtzH for <some.email@some.domain>;\n" +
                      "    Thu,  9 Jan 2020 11:40:05 -0500 (CST)\n" +
                      "\n" +
                      "Received: from correoweb.patrimonio.ohc.cu (unknown [192.168.229.23])\n" +
                      "    by correo.patrimonio.ohc.cu (Postfix) with ESMTPA id EC62A232A00A;\n" +
                      "    Thu,  9 Jan 2020 11:39:53 -0500 (CST)";
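        // One map per Received header, in the order they appear in the text.
        LinkedList<HashMap<String, String>> data = new LinkedList<>();
        HashMap<String, String> e;
        StringBuilder sb = new StringBuilder(4096);
        Matcher m = p.matcher(text);
        while (m.find()) {
            // A match that begins with "Received:" opens a new header.
            if (m.group(1) != null) {
                data.add(new HashMap<>());
            }
            // Store keyword -> value; a repeated keyword overwrites the earlier value.
            e = data.getLast();
            e.put(m.group(2), m.group(3));
        }
        // Print the maps as a bracketed list, one header per entry.
        sb.append("[");
        data.forEach((x) -> sb.append(x).append(",\n"));
        if (sb.length() > 2) {
            // Drop the trailing ",\n".
            sb.setLength(sb.length() - 2);
        }
        sb.append("]");
        System.out.println(sb);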
    }
}

Output:

[{with=HTTPS, by=HE1PR0102MB2714.eurprd01.prod.exchangelabs.com, from=VE1PR01MB5599.eurprd01.prod.exchangelabs.com
 (2603:10a6:7:7c::43), ;=Thu, 9 Jan 2020 16:34:13
 +0000, via=HE1PR0402CA0054.EURPRD04.PROD.OUTLOOK.COM},
{with=Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384), by=VE1PR01MB5599.eurprd01.prod.exchangelabs.com
 (2603:10a6:803:11f::30), from=VI1PR0102CA0029.eurprd01.prod.exchangelabs.com
 (2603:10a6:802::42), id=15.20.2602.12, ;=Thu, 9 Jan
 2020 16:34:13 +0000},
{with=Microsoft SMTP Server (version=TLS1_2,
 cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384), by=VI1PR0102CA0029.outlook.office365.com
 (2603:10a6:802::42), from=DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com
 (2a01:111:f400:7e02::203), id=15.20.2623.9, ;=Thu, 9 Jan 2020 16:34:13 +0000, via=Frontend
 Transport},
{with=Microsoft SMTP
 Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384), by=DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246), from=relay-out.ohc.cu (200.55.138.44), id=15.20.2623.9, ;=Thu, 9 Jan 2020 16:34:12 +0000, via=Frontend Transport},
{with=ESMTP, by=relay-out.ohc.cu (Postfix), for=<some.email@some.domain>, from=relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]), id=69EA722DD, ;=Thu,  9 Jan 2020 11:29:43 -0500 (CST)},
{with=ESMTP, by=relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024), for=<some.email@some.domain>, from=relay-out.ohc.cu ([127.0.0.1]), id=7CZku5Y59vGC, ;=Thu,  9 Jan 2020 11:29:38 -0500 (CST)},
{with=ESMTP, by=relay-out.ohc.cu (Postfix), for=<some.email@some.domain>, from=correo.patrimonio.ohc.cu (unknown [192.168.229.20]), id=B83BA22F5, ;=Thu,  9 Jan 2020 11:29:36 -0500 (CST)},
{with=ESMTP, by=correo.patrimonio.ohc.cu (Postfix), for=<some.email@some.domain>, from=localhost (localhost.localdomain [127.0.0.1]), id=65413232A001, ;=Thu,  9 Jan 2020 11:40:05 -0500 (CST)},
{with=ESMTP, by=localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024), for=<some.email@some.domain>, from=correo.patrimonio.ohc.cu ([127.0.0.1]), id=hNMp-6lHHtzH, ;=Thu,  9 Jan 2020 11:40:05 -0500 (CST)},
{with=ESMTPA, by=correo.patrimonio.ohc.cu (Postfix), from=correoweb.patrimonio.ohc.cu (unknown [192.168.229.23]), id=EC62A232A00A, ;=Thu,  9 Jan 2020 11:39:53 -0500 (CST)}]
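The `;` values in the output above are still raw text: some contain a line break from folding, some have a double space before the day, and the Postfix ones end with a `(CST)` comment. To get an actual timestamp, one option is to clean the value and hand it to `java.time`. The following is a minimal sketch under those assumptions; the class name `ReceivedDate` is hypothetical, and dates that use only an obsolete zone name instead of a numeric offset are not expected to parse with it.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

class ReceivedDate {
    /**
     * Parses the text that follows ';' in a Received header.
     * Throws java.time.format.DateTimeParseException if the cleaned
     * text is not an RFC 1123 style date.
     */
    static OffsetDateTime parse(String raw) {
        String cleaned = raw
            .replaceAll("\\([^()]*\\)", " ") // drop comments such as "(CST)"
            .replaceAll("\\s+", " ")         // collapse folding and double spaces
            .trim();
        return OffsetDateTime.parse(cleaned, DateTimeFormatter.RFC_1123_DATE_TIME);
    }

    public static void main(String[] args) {
        System.out.println(parse("Thu,  9 Jan 2020 11:29:43 -0500 (CST)"));
        System.out.println(parse("Thu, 9 Jan\n 2020 16:34:13 +0000"));
    }
}
```

Converting each result with `toInstant()` puts all hops on one timeline, which makes it possible to compare hops that were stamped in different time zones.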

Answered 2020-07-06T09:34:31.883