powershell - Searching many large text files in PowerShell

Question

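I regularly need to search server log files in a directory that may contain 50 or more files, each over 200 MB. I wrote a PowerShell function to do the search. It finds and extracts all values of a given query parameter. It works well on a single large file or on a collection of small files, but it chokes completely in the situation above, i.e. a directory of large files.

The function takes one argument: the query parameter to search for.

In pseudocode: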

Take parameter (e.g. someParam or someParam=([^& ]+))
Create a regex (if one is not supplied)
Collect a directory list of *.log, pipe to Select-String
For each pipeline object, add the matches to a hash as keys
Increment a match counter
Call GC
At the end of the pipelining: 
if (hash has keys) 
    enumerate the hash keys, 
    sort and append to string array
    set-content the string array to a file 
    print summary to console
    exit
else
    print summary to console
    exit

Here is a stripped-down version of the file processing:

$wtmatches = @{};
gci -Filter *.log | Select-String -Pattern $searcher |       
%{ $wtmatches[$_.Matches[0].Groups[1].Value]++; $items++; [GC]::Collect(); }

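I'm just using an old Perl trick of deduplicating the found items by making them keys of a hash. Maybe that's a mistake, but typical output from the processing is at most about 30,000 items. More often, the number of items found is in the low thousands. From what I can see, the number of keys in the hash doesn't affect processing time; it's the size and number of files that breaks it. I recently threw in the GC call out of desperation, and it does have some positive effect, but it's marginal.

The problem is that, for a large collection of large files, the processing sucks the RAM pool dry in about 60 seconds. Interestingly, it doesn't actually use much CPU, but there is a lot of volatile storage activity going on. Once RAM usage climbs above 90%, I might as well punch out and go watch TV. Finishing the processing to produce a file with 15,000 or 20,000 unique values can take hours.

I'd like advice and/or suggestions for improving efficiency, even if that means using a different paradigm for the processing. I went with what I knew. I use this tool almost daily.

Oh, and I'm committed to using PowerShell. ;-) This function is part of a complete module I wrote for my job, so suggestions of Python, Perl, or other useful languages aren't helpful in this case.

Thanks.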

mp

Update: Using latkin's ProcessFile function, I tested with the following wrapper. His function is orders of magnitude faster than my original.

function Find-WtQuery {

<#
 .Synopsis
  Takes a parameter with a capture regex and a wildcard for files list.

 .Description
  This function is intended to be used on large collections of large files that have
  the potential to take an unacceptably long time to process using other methods. It
  requires that a regex capture group be passed in as the value to search for.

 .Parameter Target
  The parameter with capture group to find, e.g. WT.z_custom=([^ &]+).

 .Parameter Files
  The file wildcard to search, e.g. '*.log'

 .Outputs
  An object with an array of unique values and a count of total matched lines.
#>

        param(
        [Parameter(Mandatory = $true)] [string] $target,
        [Parameter(Mandatory = $false)] [string] $files
    )

    begin{
        $stime = Get-Date
    }
    process{
        $results = gci -Filter $files | ProcessFile -Pattern $target  -Group 1;
    }
    end{
        $etime = Get-Date;
        $ptime = $etime - $stime;
        Write-Host ("Processing time for {0} files was {1}:{2}:{3}." -f (gci -Filter $files).Count, $ptime.Hours, $ptime.Minutes, $ptime.Seconds);
        return $results;
    }
}

Output:

clients:\test\logs\global
{powem} [4] --> Find-WtQuery -target "WT.ets=([^ &]+)" -files "*.log"
Processing time for 53 files was 0:1:35.

Thanks to everyone for the comments and help.

score 2 · Accepted Answer

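If you want to do this in PowerShell without resorting to some specialized tool, @latkin's approach is the way to go, IMO. I did make a few changes so that the command plays better with pipeline input. I also changed the regex handling to find all matches on a given line. Neither approach searches across multiple lines, although that case would be fairly easy to handle as long as the pattern only spans a few lines. Here's my take on the command (put it in a file named Search-File.ps1):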

[CmdletBinding(DefaultParameterSetName="Path")]
param(
    [Parameter(Mandatory=$true, Position=0)]
    [ValidateNotNullOrEmpty()]
    [string]
    $Pattern,

    [Parameter(Mandatory=$true, Position=1, ParameterSetName="Path", 
               ValueFromPipeline=$true, ValueFromPipelineByPropertyName=$true,
               HelpMessage="Path to ...")]
    [ValidateNotNullOrEmpty()]
    [string[]]
    $Path,

    [Alias("PSPath")]
    [Parameter(Mandatory=$true, Position=1, ParameterSetName="LiteralPath", 
               ValueFromPipelineByPropertyName=$true,
               HelpMessage="Path to ...")]
    [ValidateNotNullOrEmpty()]
    [string[]]
    $LiteralPath,

    [Parameter()]
    [ValidateRange(0, [int]::MaxValue)]
    [int]
    $Group = 0
)

Begin 
{ 
    Set-StrictMode -Version latest 
    $count = 0
    $matched = @{}
    $regex = New-Object System.Text.RegularExpressions.Regex $Pattern,'Compiled'
}

Process 
{
    if ($psCmdlet.ParameterSetName -eq "Path")
    {
        # In the -Path (non-literal) case we may need to resolve a wildcarded path
        $resolvedPaths = @($Path | Resolve-Path | Convert-Path)
    }
    else 
    {
        # Must be -LiteralPath
        $resolvedPaths = @($LiteralPath | Convert-Path)
    }

    foreach ($rpath in $resolvedPaths) 
    {
        Write-Verbose "Processing $rpath"

        $stream = new-object System.IO.FileStream $rpath,'Open','Read','Read',4096
        $reader = new-object System.IO.StreamReader $stream
        try
        {
            while (($line = $reader.ReadLine())-ne $null)
            {
                $matchColl = $regex.Matches($line)
                foreach ($match in $matchColl)
                {
                    $count++
                    $key = $match.Groups[$Group].Value
                    if ($matched.ContainsKey($key))
                    {
                        $matched[$key]++
                    }
                    else
                    {
                        $matched[$key] = 1;
                    }
                }
            }
        }
        finally
        {
            $reader.Close()
        }
    }
}

End
{
    new-object psobject -Property @{TotalCount = $count; Matched = $matched}
}

I ran this against my IIS log directory (8.5 GB and ~1000 files) to find all IP addresses in all the logs, e.g.:

$r = ls . -r *.log | C:\Users\hillr\Search-File.ps1 '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'

This took 27 minutes on my system and found 54356330 matches:

$r.Matched.GetEnumerator() | sort Value -Descending | select -f 20


Name                           Value
----                           -----
xxx.140.113.47                 22459654
xxx.29.24.217                  13430575
xxx.29.24.216                  13321196
xxx.140.113.98                 4701131
xxx.40.30.254                  53724
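
On the remark about patterns that span a few lines: one possible way to handle that is a small sliding window of lines, joined with a newline and scanned as a single string. The sketch below is not part of Search-File.ps1; the function name `Find-MultiLineMatch` and its `-WindowSize` parameter are hypothetical. It assumes a match never spans more than `$WindowSize` lines, and it may report overlapping matches that a single pass over the whole file would not.

```powershell
function Find-MultiLineMatch {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,      # full path; .NET ignores the PowerShell location
        [Parameter(Mandatory = $true)] [string] $Pattern,
        [int] $WindowSize = 3,                              # assumed maximum number of lines a match may span
        [int] $Group = 0
    )

    $regex  = New-Object System.Text.RegularExpressions.Regex $Pattern, 'Compiled'
    $window = New-Object 'System.Collections.Generic.Queue[string]'
    $found  = New-Object 'System.Collections.Generic.List[string]'
    $reader = New-Object System.IO.StreamReader $Path
    try {
        $line = $reader.ReadLine()
        while ($null -ne $line -or $window.Count -gt 0) {
            if ($null -ne $line) { $window.Enqueue($line) }

            # Scan once the window is full, or while draining it at end of file.
            if ($window.Count -eq $WindowSize -or $null -eq $line) {
                $firstLength = $window.Peek().Length
                $text = [string]::Join("`n", $window.ToArray())
                foreach ($m in $regex.Matches($text)) {
                    # Keep a match only if it starts in the oldest line,
                    # so each match is reported by exactly one window position.
                    if ($m.Index -le $firstLength) { $found.Add($m.Groups[$Group].Value) }
                }
                [void] $window.Dequeue()
            }

            if ($null -ne $line) { $line = $reader.ReadLine() }
        }
    }
    finally {
        $reader.Close()
    }
    $found
}
```

Something like `Find-MultiLineMatch -Path (Convert-Path .\some.log) -Pattern 'BEGIN\n(\w+)\nEND' -Group 1` would then return the captured values. Because every window is rescanned from its first line, this does more regex work per line than the single-line version, so it is worth keeping the window as small as the pattern allows.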

score 2 · Accepted Answer

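Here's a function that will hopefully speed up the file-processing part and reduce its memory footprint. It returns an object with 2 properties: the total count of lines matched, and a sorted array of the unique strings from the specified match group. (From your description, it sounds like you don't really care about per-string counts, just the string values themselves.)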

function ProcessFile
{
   param(
      [Parameter(ValueFromPipeline = $true, Mandatory = $true)]
      [System.IO.FileInfo] $File,

      [Parameter(Mandatory = $true)]
      [string] $Pattern,

      [Parameter(Mandatory = $true)]
      [int] $Group
   )

   begin
   {
      $regex = new-object Regex @($pattern, 'Compiled')
      $set = new-object 'System.Collections.Generic.SortedDictionary[string, int]'
      $totalCount = 0
   }

   process
   {
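      # Reset first so a failed open never closes the previous file's reader.
      $reader = $null
      try
      {
        $reader = new-object IO.StreamReader $File.FullName

        while( ($line = $reader.ReadLine()) -ne $null)
        {
           $m = $regex.Match($line)
           if($m.Success)
           {
              # The value 1 is a placeholder; only the sorted keys are used.
              $set[$m.Groups[$group].Value] = 1
              $totalCount++
           }
        }
      }
      finally
      {
         if ($reader) { $reader.Close() }
      }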
   }

   end
   {
      new-object psobject -prop @{TotalCount = $totalCount; Unique = ([string[]]$set.Keys)}
   }
}

You can use it like this:

$results = dir *.log | ProcessFile -Pattern 'stuff (capturegroup)' -Group 1
"Total matches: $($results.TotalCount)"
$results.Unique | Out-File .\Results.txt
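
Since only the distinct values matter, the same collection step could also be written with a `SortedSet[string]` instead of a `SortedDictionary[string, int]` with placeholder values. This is just a sketch of that variation, assuming PowerShell 3.0 or later (the type needs .NET 4); the sample lines and the `WT.ets` pattern below are made-up test data, not real log content.

```powershell
# Hypothetical sample lines standing in for the contents of a log file.
$lines = @(
    'GET /a?WT.ets=promo1&x=1',
    'GET /b?WT.ets=alpha&x=2',
    'GET /c?WT.ets=promo1&x=3',
    'GET /d?other=1'
)

$regex = New-Object System.Text.RegularExpressions.Regex 'WT\.ets=([^ &]+)', 'Compiled'
$set = New-Object 'System.Collections.Generic.SortedSet[string]'
$totalCount = 0

foreach ($line in $lines) {
    $m = $regex.Match($line)
    if ($m.Success) {
        # Add returns $false for a value already in the set; the result is discarded here.
        [void] $set.Add($m.Groups[1].Value)
        $totalCount++
    }
}

# Same output shape as ProcessFile: a total count plus the sorted unique values.
$result = New-Object psobject -Property @{ TotalCount = $totalCount; Unique = [string[]] @($set) }
```

The set keeps its elements in sorted order as they are added, so no separate sort is needed before writing `$result.Unique` to a file.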
