regex - Optimizing a log regular expression in PowerShell

Question

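We have two SMTP gateways that write text .log files, each covering roughly a week of data (usually about 10-30 MB apiece). In total, the logs from the two gateways usually come to about 1.2 GB.

I have two read-only shares set up to the log directories and am trying to parse the log entries with Select-String (say, for example, I want to see whether email from "bdole" came in). If I only want the line number of each hit, it isn't that bad.

However, I want to get the whole "log entry". My initial research suggests I need to read the entire contents of a log at once and then run the regex against it. So that is what I'm doing, across nearly 200 files.

However, I don't think I/O is the real problem. I'm spawning about 200 threads (one per file), throttled to 20 at a time. The first 20 threads take a long time to run. I put in some debug code and went back to a single thread; it seems that simply running the regex over the contents of one 10-20 MB file takes a very long time.

I suspect the regex I wrote is very inefficient (if I let it run overnight, it works fine). Also, network I/O is very low (peaking at 0.6% of a 2 Gbps connection), while CPU/RAM usage is very high.

An ideal log entry looks like this: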

---- SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

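The only reliable delimiter is the leading ---- (sometimes the line also ends with ----, sometimes it doesn't).

The contents of a "log entry" can vary a lot, including notices of blocked connections and so on.

The regex I'm using: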

(?sm)----((?!----).*?)(log entry)((?!----).*?)(#USERINPUT#)((?!----).*?)----

where #USERINPUT# is replaced by whatever is passed to the script.

The parsing code, run after getting the list of file paths with gci:

if ( !(Test-Path $path) ) {
    Write-Error "issue accessing $path"
} else {
    try {
        # Read the whole file into memory as a single string
        $buffer = [io.file]::ReadAllText($path)
    }
    catch {
        $errArray += $path
        $_
    }
    [string[]]$matchBuffer = @()
    $matchBuffer += $entrySeperator
    $matchBuffer += $_
    $matchBuffer += $entrySeperator
    # Collect every regex match, each followed by the separator
    $matchBuffer += $buffer | Select-String $regex -AllMatches |
        % {$_.Matches} |
        % {$_.Value; $entrySeperator}

    if ($errArray) {
        Write-Warning "There were errors, probably in accessing files. "
        $errArray
    }

    $fileName = (gi $path).Name
    sc -path $tmpDir\$fileName -value $matchBuffer
    $matchBuffer | Out-String
}

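I almost wonder whether it would be faster/better to find the "hits" (e.g. XXXX.LOG at line 21) and rebuild the log entry backward from the surrounding context.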

score 0 · Accepted Answer

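Description

Your expression has a few problems:

By including ---- at both the start and the end of the regex, you may end up missing the next entry in the log, and you will miss the last entry of the log.
With the construct ((?!----).*?) you seem to be trying to limit how much .*? can match. However, that construct checks only once that the next 4 characters are not ----, and then carries on matching .*?. You would be better off using ((?:(?!----).)*). Since this construct is self-terminating, you don't need the ? to prevent greediness. The bad news is that this construct is slightly less efficient than simply using ([^\r\n]*?) to match the known items in the first line and (.*?)(?=^----|\Z) to match the body of the log entry.
Assuming the reliable ---- text will always be at the start of a line, you can also include the start-of-line anchor ^.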

(?m)^----\s(.*?)\s(log\sentry)\s(.*?)\s(mm\/dd\/yyyy\sHH:mm:ss)(?sm).*?^(.*?)(?=^----|\Z)
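The second point in the list above can be seen on a small made-up sample. This is only a sketch: the two-entry string and the word "target" are hypothetical stand-ins for real log text and for the user input.

```powershell
# Two hypothetical entries; only the second one contains the word "target".
$sample = "---- A log entry one`n---- B log entry target`n"

# Lookahead is tested once, right after the opening ----, so .*? is then
# free to run straight across the next ---- delimiter.
$once = [regex]'(?s)----((?!----).*?)target'

# Lookahead is tested before every single character, so the group can
# never cross a ---- delimiter.
$each = [regex]'(?s)----((?:(?!----).)*)target'

$once.Match($sample).Groups[1].Value   # text from entry A through entry B
$each.Match($sample).Groups[1].Value   # text from entry B only
```

With the first pattern the captured group begins in entry A and swallows the delimiter of entry B; with the second it stays inside the entry that actually contains the search text.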


Example

PowerShell example

$String = '---- 1 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.
---- 2 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss ----
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.
---- 3 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. ----
---- 4 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.
---- 5 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. 
---- 6 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss ----
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.
'
clear

[regex]$Regex = '(?m)^----\s(.*?)\s(log\sentry)\s(.*?)\s(mm\/dd\/yyyy\sHH:mm:ss)(?sm).*?^(.*?)(?=^----|\Z)'
# [regex]$Regex = '(?sm)----((?!----).*?)(log\sentry)((?!----).*?)(mm\/dd\/yyyy\sHH:mm:ss)((?!----).*?)'

# cycle through all matches
$intCount = 0
Measure-Command {
    $Regex.matches($String) | foreach {
            $intCount += 1
            Write-Host "[$intCount][0]=" $_.Groups[0].Value
            Write-Host "[$intCount][1]=" $_.Groups[1].Value
            Write-Host "[$intCount][2]=" $_.Groups[2].Value
            Write-Host "[$intCount][3]=" $_.Groups[3].Value
            Write-Host "[$intCount][4]=" $_.Groups[4].Value
            Write-Host "[$intCount][5]=" $_.Groups[5].Value

        } # next match
    } | select Milliseconds

Output

[1][0]= ---- 1 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[1][1]= 1 SMTPRS
[1][2]= log entry
[1][3]= made at
[1][4]= mm/dd/yyyy HH:mm:ss
[1][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[2][0]= ---- 2 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss ----
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[2][1]= 2 SMTPRS
[2][2]= log entry
[2][3]= made at
[2][4]= mm/dd/yyyy HH:mm:ss
[2][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[3][0]= ---- 3 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. ----

[3][1]= 3 SMTPRS
[3][2]= log entry
[3][3]= made at
[3][4]= mm/dd/yyyy HH:mm:ss
[3][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. ----

[4][0]= ---- 4 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[4][1]= 4 SMTPRS
[4][2]= log entry
[4][3]= made at
[4][4]= mm/dd/yyyy HH:mm:ss
[4][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.

[5][0]= ---- 5 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. 

[5][1]= 5 SMTPRS
[5][2]= log entry
[5][3]= made at
[5][4]= mm/dd/yyyy HH:mm:ss
[5][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. 

[6][0]= ---- 6 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss ----
Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss.
[6][1]= 6 SMTPRS
[6][2]= log entry
[6][3]= made at
[6][4]= mm/dd/yyyy HH:mm:ss
[6][5]= Incoming SMTP call from x.x.x.x at HH:mm:ss.
<<< 220 mail.foo.com
>>> QUIT
<<< 221 mail.foo.com closing
Incoming SMTP call from x.x.x.x completed at HH:mm:ss. 


Milliseconds
------------
16

Unfortunately, on my system this expression ran slightly slower, but I wasn't using real data, so I'm curious whether you see any improvement.
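Because the timing above comes from a tiny sample, one way to compare patterns on real data is to time each of them against the full text of one real log. The helper below is a hypothetical sketch, not part of the original answer: Measure-LogPattern is a made-up name, and RegexOptions.Compiled is an optional .NET flag that may or may not help for a given pattern.

```powershell
function Measure-LogPattern {
    param(
        [string]$Text,     # full log contents, e.g. from [IO.File]::ReadAllText
        [string]$Pattern   # regex to time
    )
    # Assumption: compile the pattern once up front; drop this option to compare.
    $options = [System.Text.RegularExpressions.RegexOptions]::Compiled
    $regex   = New-Object System.Text.RegularExpressions.Regex($Pattern, $options)

    $watch = [System.Diagnostics.Stopwatch]::StartNew()
    $count = $regex.Matches($Text).Count   # .Count forces every match to be evaluated
    $watch.Stop()

    [pscustomobject]@{ Matches = $count; Milliseconds = $watch.ElapsedMilliseconds }
}

# Small made-up sample; on real data, pass the text of an actual log file instead.
$text = "---- 1 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss`nbody one`n" +
        "---- 2 SMTPRS log entry made at mm/dd/yyyy HH:mm:ss`nbody two`n"
$pattern = '(?m)^----\s(.*?)\s(log\sentry)\s(.*?)\s(mm\/dd\/yyyy\sHH:mm:ss)(?sm).*?^(.*?)(?=^----|\Z)'

$result = Measure-LogPattern -Text $text -Pattern $pattern
$result
```

Running it once per candidate pattern against the same file gives a like-for-like comparison; the match counts should agree if the patterns are meant to be equivalent.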

score 0 · Accepted Answer

You don't necessarily need a regular expression to parse a log like this. Something like the following should work as well:

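$userInput = "..."

$logfile = 'C:\path\to\your.log'

$entry = $null
# @(...) keeps $log an array even when only one entry (or none) is emitted
$log = @(Get-Content $logfile | % {
  $len = [Math]::Min(4, $_.Length)
  # A line starting with ---- begins a new entry, so emit the previous one
  if ($_.SubString(0, $len) -eq '----' -and $null -ne $entry) {
    "$entry"
    $entry = $null
  }
  $entry += "$_`n"
})
# Append the final entry, which has no following ---- line to flush it
$log += $entry

$log | ? { $_ -match [regex]::Escape($userInput) }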
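The same line-grouping idea can be wrapped in a function that filters as it goes, so entries that don't contain the search text are never kept. This is a sketch under assumptions: Select-LogEntry and its parameter names are made up for illustration, and a StringBuilder is swapped in for the string concatenation used above.

```powershell
function Select-LogEntry {
    param(
        [string[]]$Line,     # log lines, e.g. from Get-Content
        [string]$UserInput   # literal text to look for (case-insensitive)
    )
    $entry = New-Object System.Text.StringBuilder
    # A trailing sentinel line makes the last real entry flush through the same code path
    foreach ($l in (@($Line) + '----')) {
        if ($l.StartsWith('----') -and $entry.Length -gt 0) {
            $text = $entry.ToString()
            # Emit the finished entry only if it contains the search text
            if ($text.IndexOf($UserInput, [StringComparison]::OrdinalIgnoreCase) -ge 0) {
                $text
            }
            [void]$entry.Clear()
        }
        [void]$entry.AppendLine($l)
    }
}

# Made-up lines standing in for a real log file
$lines = '---- 1 SMTPRS log entry', 'mail from bdole',
         '---- 2 SMTPRS log entry', 'mail from someone else'
$found = @(Select-LogEntry -Line $lines -UserInput 'bdole')
$found
```

For a real file, something like Select-LogEntry -Line (Get-Content $logfile) -UserInput $userInput would apply, with the caveat that Get-Content inside parentheses still loads every line into memory first.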
