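What is the proper way to parse the Received headers of an .eml file in order to extract information about every hop? Specifically, I need to extract the following:
- Sender host name
- Sender IP
- Receiver host name
- Receiver IP
- Date
- Protocol
I have found specifications for this, but there seems to be no standard convention for the Received header format, and it may vary from server to server.
The clearest explanation, to me, is the one in the RFC 822 specification: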
received = "Received" ":" ; one per relay
["from" domain] ; sending host
["by" domain] ; receiving host
["via" atom] ; physical path
*("with" atom) ; link/mail protocol
["id" msg-id] ; receiver msg id
["for" addr-spec] ; initial form
";" date-time ; time received
Consider the following Received headers:
Received: from VE1PR01MB5599.eurprd01.prod.exchangelabs.com
(2603:10a6:7:7c::43) by HE1PR0102MB2714.eurprd01.prod.exchangelabs.com with
HTTPS via HE1PR0402CA0054.EURPRD04.PROD.OUTLOOK.COM; Thu, 9 Jan 2020 16:34:13
+0000
Received: from VI1PR0102CA0029.eurprd01.prod.exchangelabs.com
(2603:10a6:802::42) by VE1PR01MB5599.eurprd01.prod.exchangelabs.com
(2603:10a6:803:11f::30) with Microsoft SMTP Server (version=TLS1_2,
cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2602.12; Thu, 9 Jan
2020 16:34:13 +0000
Received: from DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com
(2a01:111:f400:7e02::203) by VI1PR0102CA0029.outlook.office365.com
(2603:10a6:802::42) with Microsoft SMTP Server (version=TLS1_2,
cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2623.9 via Frontend
Transport; Thu, 9 Jan 2020 16:34:13 +0000
Received: from relay-out.ohc.cu (200.55.138.44) by
DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246) with Microsoft SMTP
Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
15.20.2623.9 via Frontend Transport; Thu, 9 Jan 2020 16:34:12 +0000
Received: from relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1])
by relay-out.ohc.cu (Postfix) with ESMTP id 69EA722DD
for <some.email@some.domain>; Thu, 9 Jan 2020 11:29:43 -0500 (CST)
Received: from relay-out.ohc.cu ([127.0.0.1])
by relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
with ESMTP id 7CZku5Y59vGC for <some.email@some.domain>;
Thu, 9 Jan 2020 11:29:38 -0500 (CST)
Received: from correo.patrimonio.ohc.cu (unknown [192.168.229.20])
by relay-out.ohc.cu (Postfix) with ESMTP id B83BA22F5
for <some.email@some.domain>; Thu, 9 Jan 2020 11:29:36 -0500 (CST)
Received: from localhost (localhost.localdomain [127.0.0.1])
by correo.patrimonio.ohc.cu (Postfix) with ESMTP id 65413232A001
for <some.email@some.domain>; Thu, 9 Jan 2020 11:40:05 -0500 (CST)
Received: from correo.patrimonio.ohc.cu ([127.0.0.1])
by localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
with ESMTP id hNMp-6lHHtzH for <some.email@some.domain>;
Thu, 9 Jan 2020 11:40:05 -0500 (CST)
Received: from correoweb.patrimonio.ohc.cu (unknown [192.168.229.23])
by correo.patrimonio.ohc.cu (Postfix) with ESMTPA id EC62A232A00A;
Thu, 9 Jan 2020 11:39:53 -0500 (CST)
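The timestamp is the one part of these headers that looks regular: it follows the final semicolon, may be folded across lines, and sometimes carries a trailing comment such as (CST). As a minimal sketch, assuming the value after the semicolon has already been isolated, the JDK's `DateTimeFormatter.RFC_1123_DATE_TIME` seems able to parse it once the comment and line folding are removed. The class name `ReceivedDate` is made up for illustration.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: parses the date-time that follows the last ';' of a Received header.
public class ReceivedDate {

    public static OffsetDateTime parse(String rawDate) {
        String cleaned = rawDate
                .replaceAll("\\([^)]*\\)", " ") // drop comments such as (CST)
                .replaceAll("\\s+", " ")        // unfold line breaks and collapse whitespace
                .trim();
        // Throws DateTimeParseException if the text is not an RFC 1123 style date.
        return OffsetDateTime.parse(cleaned, DateTimeFormatter.RFC_1123_DATE_TIME);
    }

    public static void main(String[] args) {
        System.out.println(parse("Thu, 9 Jan 2020 11:29:43 -0500 (CST)"));
        System.out.println(parse("Thu, 9 Jan\n 2020 16:34:13 +0000"));
    }
}
```

Converting each result with `toInstant()` makes hops comparable across time zones. Per-hop delays computed this way can come out negative when server clocks disagree, which appears to be the case between correo.patrimonio.ohc.cu and relay-out.ohc.cu in the sample above.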
The parts that seem to vary the most are the host domains, for example:
- from VE1PR01MB5599.eurprd01.prod.exchangelabs.com (2603:10a6:7:7c::43)
- by HE1PR0102MB2714.eurprd01.prod.exchangelabs.com
- from VI1PR0102CA0029.eurprd01.prod.exchangelabs.com (2603:10a6:802::42)
- by VE1PR01MB5599.eurprd01.prod.exchangelabs.com (2603:10a6:803:11f::30)
- from DB5EUR01FT034.eop-EUR01.prod.protection.outlook.com (2a01:111:f400:7e02::203)
- by VI1PR0102CA0029.outlook.office365.com (2603:10a6:802::42)
- from relay-out.ohc.cu (200.55.138.44)
- by DB5EUR01FT034.mail.protection.outlook.com (10.152.4.246)
- from relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1])
- by relay-out.ohc.cu (Postfix)
- from relay-out.ohc.cu ([127.0.0.1])
- by relay-in.ohc.cu (relay-in.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
- from correo.patrimonio.ohc.cu (unknown [192.168.229.20])
- by relay-out.ohc.cu (Postfix)
- from localhost (localhost.localdomain [127.0.0.1])
- by correo.patrimonio.ohc.cu (Postfix)
- from correo.patrimonio.ohc.cu ([127.0.0.1])
- by localhost (correo.patrimonio.ohc.cu [127.0.0.1]) (amavisd-new, port 10024)
- from correoweb.patrimonio.ohc.cu (unknown [192.168.229.23])
- by correo.patrimonio.ohc.cu (Postfix)
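Despite the variation, the samples listed here share a loose shape: a host name first, then optional parenthesised comments in which an IP address, if present, is wrapped in ( ) or [ ]. The sketch below is only a heuristic built on that observation, not a validated parser. The class name `HostClause` is invented for illustration, the IPv6 match is deliberately rough, and the optional `IPv6:` prefix is an assumption about how some servers write address literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls a host name and an optional IP out of a "from" or "by" clause.
public class HostClause {
    // An IPv4 dotted quad, or a rough IPv6 literal, enclosed in ( ) or [ ].
    private static final Pattern IP_LITERAL = Pattern.compile(
            "[(\\[](?:IPv6:)?((?:\\d{1,3}\\.){3}\\d{1,3}|[0-9A-Fa-f]*:[0-9A-Fa-f:.]+)[)\\]]");

    public final String host;
    public final String ip; // null when the clause carries no address

    private HostClause(String host, String ip) {
        this.host = host;
        this.ip = ip;
    }

    public static HostClause parse(String clause) {
        String text = clause.trim();
        // The host name is everything up to the first space or opening parenthesis.
        int end = 0;
        while (end < text.length()
                && !Character.isWhitespace(text.charAt(end))
                && text.charAt(end) != '(') {
            end++;
        }
        Matcher m = IP_LITERAL.matcher(text.substring(end));
        return new HostClause(text.substring(0, end), m.find() ? m.group(1) : null);
    }

    public static void main(String[] args) {
        HostClause c = parse("correo.patrimonio.ohc.cu (unknown [192.168.229.20])");
        System.out.println(c.host + " / " + c.ip);
    }
}
```

For a clause such as "relay-out.ohc.cu (Postfix)" the `ip` field stays null, so callers need to treat the address as optional. Comments like (amavisd-new, port 10024) are skipped because they contain no address-shaped text.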
The mail protocol also varies, for example:
- with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)
- with ESMTP
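Both protocol forms can probably be handled the same way: treat the text outside any parentheses as the protocol name and read key=value pairs such as version and cipher from the comment. This is a small sketch under that assumption; `WithClause` is a made-up name, and other servers may well use comment formats it does not cover.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits the text after "with" into a protocol name and key=value details.
public class WithClause {
    private static final Pattern PARAM = Pattern.compile("([A-Za-z][\\w-]*)=([^,\\s)]+)");

    /** Protocol text with any (comment) removed, e.g. "ESMTP" or "Microsoft SMTP Server". */
    public static String protocol(String clause) {
        return clause.replaceAll("\\([^)]*\\)", " ").replaceAll("\\s+", " ").trim();
    }

    /** key=value pairs found in the clause, in order of appearance. */
    public static Map<String, String> params(String clause) {
        Map<String, String> out = new LinkedHashMap<>();
        Matcher m = PARAM.matcher(clause);
        while (m.find()) {
            out.put(m.group(1), m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String clause = "Microsoft SMTP Server (version=TLS1_2, "
                + "cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)";
        System.out.println(protocol(clause));
        System.out.println(params(clause));
    }
}
```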
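Given how much these vary, what is a comprehensive approach to extracting this information? Other answers on SO discourage using regular expressions for this task, but then how should the parsing be done? If there are tested regular expressions, or Java code or libraries, for parsing Received headers to extract the information above, that would be fine with me.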
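One regex-free direction, sketched here rather than offered as a tested solution, is a small tokenizer that follows the RFC 822 grammar quoted earlier: unfold the header, cut the date at the last semicolon outside any comment, then walk the remaining tokens and start a new clause at each keyword (from, by, via, with, id, for). Parenthesised comments are kept as single tokens so keywords inside them are ignored. The class name `ReceivedClauses` and the `" | "` separator for repeated keywords are choices made for this example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: splits one Received header value into keyword -> raw clause text.
public class ReceivedClauses {
    private static final List<String> KEYWORDS =
            Arrays.asList("from", "by", "via", "with", "id", "for");

    public static Map<String, String> parse(String headerValue) {
        // Unfold: collapse line breaks and runs of whitespace into single spaces.
        String value = headerValue.replaceAll("\\s+", " ").trim();
        // The date-time follows the last ';' that is not inside a (comment).
        int semi = -1;
        int depth = 0;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '(') depth++;
            else if (c == ')' && depth > 0) depth--;
            else if (c == ';' && depth == 0) semi = i;
        }
        String clauses = semi >= 0 ? value.substring(0, semi) : value;

        Map<String, String> result = new LinkedHashMap<>();
        String key = null;
        StringBuilder text = new StringBuilder();
        for (String token : tokenize(clauses)) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (KEYWORDS.contains(lower)) {
                store(result, key, text);
                key = lower;
                text.setLength(0);
            } else {
                if (text.length() > 0) text.append(' ');
                text.append(token);
            }
        }
        store(result, key, text);
        if (semi >= 0) result.put("date", value.substring(semi + 1).trim());
        return result;
    }

    private static void store(Map<String, String> result, String key, StringBuilder text) {
        if (key != null) {
            // A keyword such as "with" may repeat; keep every occurrence.
            result.merge(key, text.toString(), (a, b) -> a + " | " + b);
        }
    }

    // Splits on spaces outside parentheses; a (comment) stays one token.
    private static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int depth = 0;
        for (char c : s.toCharArray()) {
            if (c == '(') depth++;
            if (c == ')' && depth > 0) depth--;
            if (c == ' ' && depth == 0) {
                if (cur.length() > 0) {
                    tokens.add(cur.toString());
                    cur.setLength(0);
                }
            } else {
                cur.append(c);
            }
        }
        if (cur.length() > 0) tokens.add(cur.toString());
        return tokens;
    }
}
```

A known weakness of this approach is that any free-text word outside a comment that happens to equal a keyword would be misread as the start of a new clause, so results on unfamiliar servers should be checked by hand.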