regex - Parse EML text with a regular expression

Question

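Can you help me parse EML text with a regular expression?

I want to get, separately:

1). The text between Content-Transfer-Encoding: base64 and --=_alternative, if the line above is Content-Type: text/html

2). The text between Content-Transfer-Encoding: base64 and --=_related, if the line two lines above is Content-Type: image/jpeg

Please take a look at this piece of code in PowerShell: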

$text = @"
--=_alternative XXXXXXXXXXXXXX_=
Content-Type: text/html; charset="KOI8-R"
Content-Transfer-Encoding: base64

111111111111111111111111111111111111111111111111111111

--=_alternative XXXXXXXXXXXXXX_=
Content-Type: text/html; charset="KOI8-R"
Content-Transfer-Encoding: base64

222222222222222222222222222222222222222222222222222222
--=_alternative XXXXXXXXXXXXXX_=--
--=_related XXXXXXXXXXXXXX_=--_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64

333333333333333333333333333333333333333333333333333333
--=_related XXXXXXXXXXXXXX_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64
444444444444444444444444444444444444444444444444444444

--=_related XXXXXXXXXXXXXX_=
Content-Type: image/jpeg
Content-ID: <_2_XXXXXXXXXXXXXX>
Content-Transfer-Encoding: base64

555555555555555555555555555555555555555555555555555555
--=_related XXXXXXXXXXXXXX_=--
"@

$regex1 = "(?ms).+?Content-Transfer-Encoding: base64(.+?)--=_alternative"
$text1 = ([regex]::Matches($text,$regex1) | foreach {$_.groups[1].value})
Write-Host "text1 : " -fore red
Write-Host  $text1

#I want to get as output elements (of array, maybe, or one after another)
#1). text between  Content-Transfer-Encoding: base64 and --=_alternative, if there is above line Content-Type: text/html
#this
#1111111111111111111111111111111111111111111111111111111
#then this
#2222222222222222222222222222222222222222222222222222222

$regex2 = "(?ms).+?Content-Transfer-Encoding: base64(.+?)--=_related"
$text2 = ([regex]::Matches($text,$regex2) | foreach {$_.groups[1].value})
#I want to get as output elements (of array, maybe, or one after another)
#2). text between  Content-Transfer-Encoding: base64 and --=_related, if there is two lines above line Content-Type: image/jpeg
#this
#3333333333333333333333333333333333333333333333333333333
#then this
#4444444444444444444444444444444444444444444444444444444
#then this
#5555555555555555555555555555555555555555555555555555555
Write-Host "text2 : " -fore red
Write-Host  $text2

Thank you for your help. Have a nice day.

P.S. Based on Jessie Westlake's code, here is a slightly edited version of the regex that works for me:

$files = Get-ChildItem -Path "\\<SERVER_NAME>\mailroot\Drop"
Foreach ($file in $files){
    $text = Get-Content $file.FullName

    $RegexText = '(?:Content-Type: text/html.+?Content-Transfer-Encoding: base64(.+?)(?:--=_))'
    $RegexImage = '(?:Content-Type: image/jpeg.+?Content-Transfer-Encoding: base64(.+?)(?:--=_))'

    $TextMatches = [Regex]::Matches($text, $RegexText, [System.Text.RegularExpressions.RegexOptions]::Singleline)
    $ImageMatches = [Regex]::Matches($text, $RegexImage, [System.Text.RegularExpressions.RegexOptions]::Singleline)

    If ($TextMatches[0].Success)
    {
        Write-Host "Found $($TextMatches.Count) Text Matches:"
        Write-Output $TextMatches.ForEach({$_.Groups[1].Value})
    }
    If ($ImageMatches[0].Success)
    {
        Write-Host "Found $($ImageMatches.Count) Image Matches:"
        Write-Output $ImageMatches.ForEach({$_.Groups[1].Value})
    }
}

score 1 · Accepted Answer

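TL;DR: just skip to the code at the bottom...

The code below is ugly, so bear with me.

Essentially, I just created a regex that starts with Content-Type: text/html. It matches anything after that until it hits a line feed \n, a carriage return \r, or the two combined one after the other, \r\n.

You have to wrap these in parentheses to use the or operator |. We don't want to actually capture/return any of these groups, so we use (?:text-to-match). As you can see, we use it in other places too. You can also nest capturing and non-capturing groups inside each other.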
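To make the difference between capturing and non-capturing groups concrete, here is a minimal sketch. The `$sample` string and variable names are made up for illustration. Note that this sketch lists `\r\n` first in the alternation so a CRLF pair is tried as one unit; the order used in the answer also matches, because the regex engine backtracks.

```powershell
# Hypothetical sample: two header lines separated by CRLF.
$sample = "Content-Type: text/html`r`nContent-Transfer-Encoding: base64"

# Non-capturing alternation: groups the three line-break forms without storing them.
$nonCapturing = [regex]::Match($sample, 'text/html(?:\r\n|\n|\r)Content')

# Capturing alternation: same match, but the line break is stored as group 1.
$capturing = [regex]::Match($sample, 'text/html(\r\n|\n|\r)Content')

$nonCapturingGroupCount = $nonCapturing.Groups.Count   # 1: only group 0 (the whole match)
$capturingGroupCount    = $capturing.Groups.Count      # 2: group 0 plus group 1

Write-Output "Non-capturing groups: $nonCapturingGroupCount"
Write-Output "Capturing groups:     $capturingGroupCount"
```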

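Anyway, moving on. After matching the line break, we want to see Content-Transfer-Encoding: base64. That appears to be required in each of your examples.

After that we want to identify the next line break, just like last time. Except this time we want to match 1 or more, by using +. The reason we need to match more than one is that sometimes there is an extra blank line before the data you want to save. But since sometimes there is no extra line before it, we make the quantifier "lazy" by putting a question mark after the plus sign: +?.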
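The effect of the lazy quantifier can be seen in a small sketch. The sample string here is an assumption chosen to keep the output easy to check; it is not taken from the EML text.

```powershell
# Hypothetical sample: "base64" followed by three line feeds, then the data.
$sample = "base64`n`n`nDATA"

# Greedy: + takes as many line feeds as it can.
$greedyLength = [regex]::Match($sample, 'base64(\n+)').Groups[1].Length        # 3

# Lazy: +? takes as few as possible when nothing after it forces more.
$lazyLength = [regex]::Match($sample, 'base64(\n+?)').Groups[1].Length         # 1

# Lazy, but the following "DATA" forces it to expand until the match succeeds.
$lazyAnchoredLength = [regex]::Match($sample, 'base64(\n+?)DATA').Groups[1].Length   # 3

Write-Output "greedy=$greedyLength lazy=$lazyLength lazyAnchored=$lazyAnchoredLength"
```

In other words, a lazy quantifier still matches several line breaks when the rest of the pattern requires it; it just starts from the smallest count.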

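Next is the part where we capture your actual data. This is the first time we use an actual capturing group rather than a non-capturing one (i.e. no question mark followed by a colon).

We want to capture anything that is NOT a line break, because sometimes your data seems to be followed by a blank line and sometimes it is not. By not allowing ourselves to capture any line breaks, we also force the previous group to gobble up any extra line breaks before the data. That capture group is ([^(?:\n|\n\r)]+)

What we did there is wrap the expression in parentheses to capture it. Inside the parentheses we put the expression in square brackets, because we want to create our own character "class". Any of the characters inside the brackets are what the regex will look for. The difference with ours, though, is that we put a caret ^ as the first character inside the brackets. That means NOT any of these characters. Obviously we want to match everything up to the end of the line, so we capture anything that is not a line break, one or more times, as many times as possible.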
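One detail worth illustrating: inside square brackets, characters such as `(`, `?`, `:`, `|` and `)` are treated as literal characters, not as grouping or alternation. So the class above excludes those characters as well as `\n` and `\r`. Base64 data does not contain them, so the result is the same here, but `[^\r\n]+` expresses "anything that is not a line break" more directly. The sketch below uses a made-up `$line` value to show the difference.

```powershell
# Hypothetical line that contains parentheses, followed by CRLF and more text.
$line = "QUJD(REVG)`r`nnext"

# Plain negated class: everything up to the first CR or LF.
$simple = [regex]::Match($line, '[^\r\n]+').Value

# Class as written in the answer: also stops at "(" because it is a literal inside [].
$original = [regex]::Match($line, '[^(?:\n|\n\r)]+').Value

Write-Output "simple   = $simple"     # QUJD(REVG)
Write-Output "original = $original"   # QUJD
```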

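Then we make sure our regex is anchored to some ending text, so we keep matching. We start with another line break group that matches at least one line break, but as few as are needed for the match to succeed: (?:\n|\r|\r\n)+?.

Finally, we finish by identifying a point where we can be sure we can stop looking for the important data. That is --=_. I wasn't sure whether we would run into the word "alternative" or "related", so I didn't go that far. Now it's done.

The key to it all

If we don't turn on the regex "Singleline" mode, the dot . will not match line breaks, so .+? cannot span multiple lines. To turn it on here, we use .NET to create the matches and pass the option from the [System.Text.RegularExpressions.RegexOptions] enumeration. The relevant options are "Singleline" and "Multiline".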
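A minimal sketch of what Singleline changes, using a made-up three-line header block. It also shows the inline `(?s)` form, which enables the same mode from inside the pattern; passing the enumeration value is one way to do it, not the only one.

```powershell
# Hypothetical header block; Content-ID sits between the two lines we care about.
$sample  = "Content-Type: image/jpeg`nContent-ID: <x>`nContent-Transfer-Encoding: base64"
$pattern = 'image/jpeg.+?Content-Transfer-Encoding'

# Default mode: "." does not match a line feed, so the pattern cannot span lines.
$defaultMode = [regex]::IsMatch($sample, $pattern)                       # False

# Singleline mode passed as an option: "." now matches line feeds too.
$singleline = [System.Text.RegularExpressions.RegexOptions]::Singleline
$withOption = [regex]::IsMatch($sample, $pattern, $singleline)           # True

# The same mode enabled inline at the start of the pattern.
$withInline = [regex]::IsMatch($sample, "(?s)$pattern")                  # True

Write-Output "default=$defaultMode option=$withOption inline=$withInline"
```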

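I created a separate regex for the text/html search and for the image/jpeg search. We save the results of those matches to their own variables.

We can test whether the match succeeded by indexing at 0, which holds the first match object, and accessing its .Success property, which returns a boolean. The number of matches is available through the .Count property. To access specific groups and captures, we use dot notation on the match together with the index of the capture group we want. Since we only use one capturing group and the rest are non-capturing, index [0] holds the whole matched text and [1] should contain our capture group's match. Because it is an object, we have to access its Value property.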
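The properties described above can be tried on a small, made-up string. This is only a sketch; `$sample`, `$found` and `$none` are illustrative names. It also checks `.Count` before reading results, which avoids indexing into an empty collection when nothing matched.

```powershell
# Hypothetical input with two data lines, each followed by a boundary marker.
$sample = "base64`nAAAA`n--=_a`nbase64`nBBBB`n--=_b"
$found  = [regex]::Matches($sample, 'base64\n([^\r\n]+)\n--=_')

$matchCount   = $found.Count               # 2
$firstSuccess = $found[0].Success          # True
$firstWhole   = $found[0].Groups[0].Value  # whole match, including "base64" and "--=_"
$firstData    = $found[0].Groups[1].Value  # AAAA (the single capturing group)

# Collect group 1 from every match.
$allData = @(foreach ($m in $found) { $m.Groups[1].Value })

# A pattern that matches nothing still returns a collection; its Count is 0.
$none = [regex]::Matches($sample, 'no-such-text')
if ($none.Count -eq 0) { Write-Output 'No matches for the second pattern.' }

Write-Output "Found $matchCount matches: $($allData -join ', ')"
```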

Obviously, the code below requires your $text variable to contain the data to search.

$RegexText = '(?:Content-Type: text/html.+?(?:\n|\r|\r\n)Content-Transfer-Encoding: base64(?:\n|\r|\r\n)+?([^(?:\n|\n\r)]+)(?:\n|\r|\r\n)+?(?:\n|\r|\r\n)(?:--=_))'
$RegexImage = '(?:Content-Type: image/jpeg.+?(?:\n|\r|\r\n)Content-Transfer-Encoding: base64(?:\n|\r|\r\n)+?([^(?:\n|\n\r)]+)(?:\n|\r|\r\n)+?(?:\n|\r|\r\n)(?:--=_))'

$TextMatches = [Regex]::Matches($text, $RegexText, [System.Text.RegularExpressions.RegexOptions]::Singleline)
$ImageMatches = [Regex]::Matches($text, $RegexImage, [System.Text.RegularExpressions.RegexOptions]::Singleline)

If ($TextMatches[0].Success)
{
    Write-Host "Found $($TextMatches.Count) Text Matches:"
    Write-Output $TextMatches.ForEach({$_.Groups[1].Value})
}
If ($ImageMatches[0].Success)
{
    Write-Host "Found $($ImageMatches.Count) Image Matches:"
    Write-Output $ImageMatches.ForEach({$_.Groups[1].Value})
}

The code above prints the following output to the screen:

Found 2 Text Matches:
111111111111111111111111111111111111111111111111111111
222222222222222222222222222222222222222222222222222222
Found 3 Image Matches:
333333333333333333333333333333333333333333333333333333
444444444444444444444444444444444444444444444444444444
555555555555555555555555555555555555555555555555555555
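One caveat: as written, the patterns above need at least two line-break characters between the data and --=_, so for blocks with no blank line after the data they appear to rely on CRLF line endings. The following is a sketch of a variant that accepts either LF or CRLF, under the assumption that each data block is a single line, as in the question. The function name `Get-Base64Part` and the shortened sample are hypothetical and not part of the original answer.

```powershell
# Hypothetical helper: returns the single-line base64 data of every part with the given content type.
function Get-Base64Part {
    param([string]$Text, [string]$ContentType)

    # \r?\n accepts LF or CRLF; [^\r\n]+ captures one line of data.
    $pattern = 'Content-Type: ' + [regex]::Escape($ContentType) +
               '.+?Content-Transfer-Encoding: base64(?:\r?\n)+([^\r\n]+)(?:\r?\n)+--=_'
    $options = [System.Text.RegularExpressions.RegexOptions]::Singleline

    foreach ($m in [regex]::Matches($Text, $pattern, $options)) {
        $m.Groups[1].Value
    }
}

# Shortened sample with the same quirks as the question: blank lines before or
# after the data are sometimes present and sometimes not.
$lines = @(
    '--=_alternative X_=', 'Content-Type: text/html; charset="KOI8-R"',
    'Content-Transfer-Encoding: base64', '', '1111', '',
    '--=_alternative X_=', 'Content-Type: text/html; charset="KOI8-R"',
    'Content-Transfer-Encoding: base64', '', '2222',
    '--=_alternative X_=--',
    '--=_related X_=', 'Content-Type: image/jpeg', 'Content-ID: <_2_X>',
    'Content-Transfer-Encoding: base64', '3333', '',
    '--=_related X_=', 'Content-Type: image/jpeg', 'Content-ID: <_2_X>',
    'Content-Transfer-Encoding: base64', '', '4444',
    '--=_related X_=--'
)
$sample = $lines -join "`n"   # LF only; joining with "`r`n" should give the same result

$textParts  = @(Get-Base64Part -Text $sample -ContentType 'text/html')
$imageParts = @(Get-Base64Part -Text $sample -ContentType 'image/jpeg')

Write-Output "Text parts:  $($textParts -join ', ')"
Write-Output "Image parts: $($imageParts -join ', ')"
```

Real EML files usually wrap base64 data across many lines, which this single-line capture does not handle.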
