c# - Searching for a string in multiple text files of 150 MB each in C#

Question

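I have multiple .txt files, each 150 MB in size. Using C#, I need to retrieve all the lines containing a string pattern from each file and then write those lines to a newly created file.

I have already looked into similar questions, but none of the suggested answers gave me the fastest way to get the results. I tried regular expressions, LINQ queries, the Contains method, and searching with byte arrays, but all of them take more than 30 minutes to read and compare the file content.

My test files don't have any specific format; they are raw data that we can't split on a semicolon and filter using DataViews. Below is the sample format of each line in the file.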

Sample.txt

LTYY;;0,0,;123456789;;;;;;;20121002 02:00;;
ptgh;;0,0,;123456789;;;;;;;20121002 02:00;;
HYTF;;0,0,;846234863;;;;;;;20121002 02:00;;
Multiple records......

My code

using (StreamWriter SW = new StreamWriter(newFile))
{
    using (StreamReader sr = new StreamReader(sourceFilePath))
    {
        while (sr.Peek() >= 0)
        {
            if (sr.ReadLine().Contains(stringToSearch))
                SW.WriteLine(sr.ReadLine().ToString());
        }
    }
}

I want sample code that can search for 123456789 in Sample.txt in less than a minute. Let me know if my requirement is not clear. Thanks in advance!

Edit

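I found the root cause: the files reside on a remote server, and reading them from there is what takes the time. When I copied the files to my local machine, all of the comparison methods finished very quickly, so this is not a problem with the way we read or compare the content; they all took more or less the same time.

But how do I fix this now? I can't copy all of these files to my machine for comparison, and I am getting OutOfMemory exceptions.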

score 3 · Accepted Answer

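The fastest way to search is to use the Boyer–Moore string search algorithm, since this method does not need to read every byte of the file but does require random access to the bytes. Alternatively, you could try the Rabin-Karp algorithm.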
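
To make the skipping idea concrete, here is a minimal sketch of the Boyer-Moore-Horspool variant, a simplified relative of Boyer-Moore that uses only the bad-character table. It is not the full Boyer-Moore algorithm. The class name HorspoolSearch and its IndexOf method are hypothetical names chosen for illustration, and the sketch assumes the text is already in memory as a string, for example a single line returned by ReadLine.

```csharp
using System;
using System.Collections.Generic;

public static class HorspoolSearch
{
    // Returns the index of the first occurrence of pattern in text, or -1 if absent.
    public static int IndexOf(string text, string pattern)
    {
        if (pattern.Length == 0) return 0;
        if (pattern.Length > text.Length) return -1;

        // Bad-character table: for every pattern character except the last one,
        // the distance from its rightmost occurrence to the end of the pattern.
        var shift = new Dictionary<char, int>();
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;

        int pos = 0;
        while (pos <= text.Length - pattern.Length)
        {
            // Compare the current window against the pattern from right to left.
            int j = pattern.Length - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;

            // Shift by the table entry for the character under the window's last
            // position; characters that never occur in the pattern allow a full skip.
            char last = text[pos + pattern.Length - 1];
            pos += shift.TryGetValue(last, out int s) ? s : pattern.Length;
        }
        return -1;
    }
}
```

A line filter could then test HorspoolSearch.IndexOf(line, stringToSearch) >= 0 in place of line.Contains(stringToSearch). Whether that is actually faster than the built-in Contains for short lines would need to be measured.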

Or you can try the following code from this answer:

using System;
using System.Collections.Generic;

public static int FindInFile(string fileName, string value)
{   // returns complement of number of characters in file if not found
    // else returns index where value found
    int index = 0;
    using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName))
    {
        if (String.IsNullOrEmpty(value))
            return 0;
        StringSearch valueSearch = new StringSearch(value);
        int readChar;
        // Feed the file to the matcher one character at a time.
        while ((readChar = reader.Read()) >= 0)
        {
            ++index;
            if (valueSearch.Found(readChar))
                return index - value.Length;
        }
    }
    return ~index;
}

public class StringSearch
{   // Call Found one character at a time until string found
    private readonly string value;
    // Each entry is the number of characters already matched by one partial match in progress.
    private readonly List<int> indexList = new List<int>();

    public StringSearch(string value)
    {
        this.value = value;
    }

    public bool Found(int nextChar)
    {
        for (int index = 0; index < indexList.Count; )
        {
            int valueIndex = indexList[index];
            if (value[valueIndex] == nextChar)
            {
                ++valueIndex;
                if (valueIndex == value.Length)
                {
                    // Full match: drop this entry by swapping in the last one.
                    indexList[index] = indexList[indexList.Count - 1];
                    indexList.RemoveAt(indexList.Count - 1);
                    return true;
                }
                else
                {
                    indexList[index] = valueIndex;
                    ++index;
                }
            }
            else
            {   // next char does not match
                // Swap in the last entry; index is not advanced so that entry is checked next.
                indexList[index] = indexList[indexList.Count - 1];
                indexList.RemoveAt(indexList.Count - 1);
            }
        }
        // The character may also start a new partial match.
        if (value[0] == nextChar)
        {
            if (value.Length == 1)
                return true;
            indexList.Add(1);
        }
        return false;
    }

    public void Reset()
    {
        indexList.Clear();
    }
}

score 1 · Accepted Answer

I don't know how long this will take to run, but here are some improvements:

using (StreamWriter SW = new StreamWriter(newFile))
{
    using (StreamReader sr = new StreamReader(sourceFilePath))
    {
        while (!sr.EndOfStream)
        {
            var line = sr.ReadLine();
            if (line.Contains(stringToSearch))
                SW.WriteLine(line);
        }
    }
}

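Note that you don't need Peek; EndOfStream will give you what you want. You were calling ReadLine twice (probably not what you intended). And there is no need to call ToString() on a string.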

score 1 · Accepted Answer

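150MB is 150MB. If you have one thread going through the entire 150MB line by line (a "line" being terminated by a newline character or group, or by EOF), your process must read in and spin through all 150MB of the data (not all at once, and it doesn't have to hold all of it at the same time). A linear search through 157,286,400 characters is, very simply, going to take time, and you say you have many such files.

First thing: you are reading the line out of the stream twice. In most cases this will cause you to read two lines whenever there is a match; what is written to the new file will be the line after the one containing the search string. This is probably not what you want (then again, it may be). If you want to write the line that actually contains the search string, read it into a variable before performing the Contains check.

Second, String.Contains() will, by necessity, perform a linear search. In your case the behavior will actually approach N^2, because when searching for a string within a string, the first character must be found, and where it is found, each following character is then matched one by one against the subsequent characters until either all characters of the search string have matched or a non-matching character is found. When a mismatch occurs, the algorithm must go back to the character after the initial match to avoid skipping a possible match, which means it can test the same character many times when checking for a long string against a longer string with many partial matches. This strategy is therefore technically a "brute force" solution. Unfortunately, when you don't know where to look (as in unsorted data files), there is no more efficient solution.

Apart from being able to sort the files' data and then perform an indexed search, the only possible speedup I can suggest is to multithread the solution. If you run this method on only one thread that looks through every file, then not only is just one thread doing the work, but that thread is constantly waiting for the hard drive to serve up the data it needs. Having 5 or 10 threads, each working through one file at a time, would not only make better use of the true power of modern multi-core CPUs, but while one thread is waiting on the hard drive, another thread whose data has already been loaded can execute, further increasing the efficiency of this approach. Remember, the further away the data is from the CPU, the longer it takes for the CPU to get it, and when your CPU can do 2 to 4 billion things per second, ...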
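
The one-file-per-thread idea described above could be sketched roughly as follows. The class ParallelFilter, its method names, the maxThreads parameter, and the ".matches.txt" output naming are all assumptions made for illustration; they do not come from the question. The sketch also assumes the source files have distinct file names, since each output file is named after its source.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class ParallelFilter
{
    // Streams one file line by line and writes only the matching lines.
    public static void FilterFile(string sourcePath, string outputPath, string stringToSearch)
    {
        File.WriteAllLines(outputPath,
            File.ReadLines(sourcePath).Where(line => line.Contains(stringToSearch)));
    }

    // Processes several files concurrently, at most maxThreads at a time.
    public static void FilterAll(string[] sourcePaths, string outputDir,
                                 string stringToSearch, int maxThreads)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };
        Parallel.ForEach(sourcePaths, options, sourcePath =>
        {
            // Hypothetical naming scheme: one output file per source file.
            string outputPath = Path.Combine(
                outputDir, Path.GetFileName(sourcePath) + ".matches.txt");
            FilterFile(sourcePath, outputPath, stringToSearch);
        });
    }
}
```

Because File.ReadLines enumerates lazily, each worker holds only the current line in memory rather than the whole 150 MB file. If the files sit on a remote server, as the question's edit says, the network link may remain the limiting factor, so the best thread count is something to measure rather than assume.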

score 1 · Accepted Answer

As I said already, you should have a database, but whatever.

The fastest, shortest and nicest way to do it (even one-lined) is this:

File.AppendAllLines("b.txt", File.ReadLines("a.txt")
                                 .Where(x => x.Contains("123456789")));

But fast? 150MB is 150MB. It's gonna take a while. You can replace the Contains method with your own, for faster comparison, but that's a whole different question.

Another possible solution:

var sb = new StringBuilder();

foreach (var x in File.ReadLines("a.txt").Where(x => x.Contains("123456789")))
{
    sb.AppendLine(x);
}

File.WriteAllText("b.txt", sb.ToString()); // That is one heavy operation there...

I tested it with a 150 MB file, and it found all results within 3 seconds. The thing that takes time is writing the results into the second file (in case there are many results).

score 0 · Accepted Answer

Don't read and write at the same time. Search first, save the list of matching lines, and write the file at the end.

using System;
using System.Collections.Generic;
using System.IO;
...
List<string> list = new List<string>();
using (StreamReader reader = new StreamReader("input.txt")) {
  string line;
  while ((line = reader.ReadLine()) != null) {
    if (line.Contains(stringToSearch)) {
      list.Add(line); // Add to list.
    }
  }
}
using (StreamWriter writer = new StreamWriter("output.txt")) {
  foreach (string line in list) {
    writer.WriteLine(line);
  }
}

score 0 · Accepted Answer

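You are going to run into performance problems with any approach that blocks on input from these files while doing the string comparisons.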

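But Windows has a pretty high-performance GREP-like tool for string searches of text files called FINDSTR, which may be fast enough. You could simply call it as a shell command or redirect the command's results to your output file.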
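
As a rough illustration of calling FINDSTR from C#, the sketch below starts it as a child process and copies its output to a file. The class FindStrRunner and its methods are hypothetical names. The sketch assumes a Windows machine with findstr on the PATH and a search string that contains no double quotes; the /C: switch asks findstr to treat the text as a single literal string.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class FindStrRunner
{
    // Builds the findstr arguments: /C:"text" followed by the quoted file path.
    public static string BuildArguments(string stringToSearch, string sourcePath)
    {
        return "/C:\"" + stringToSearch + "\" \"" + sourcePath + "\"";
    }

    // Runs findstr on one file and writes every matching line to outputPath.
    public static void Run(string stringToSearch, string sourcePath, string outputPath)
    {
        var startInfo = new ProcessStartInfo("findstr", BuildArguments(stringToSearch, sourcePath))
        {
            RedirectStandardOutput = true,
            UseShellExecute = false,   // required for output redirection
            CreateNoWindow = true
        };
        using (Process process = Process.Start(startInfo))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = process.StandardOutput.ReadLine()) != null)
                writer.WriteLine(line);
            process.WaitForExit();
        }
    }
}
```

A call such as FindStrRunner.Run("123456789", sourceFilePath, newFile) would then stand in for the read loop in the question. Note that findstr still has to read the whole file, so it does not remove the cost of pulling the data from a remote server.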

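Preprocessing (sorting) or loading your large files into a database would be faster, but I am assuming that you already have existing files that you need to search.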

score 0 · Accepted Answer

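I don't have sample code for you, but have you tried sorting the content of the files?

Trying to search for a string in 150MB worth of files is going to take some time any way you slice it. If regex takes too long for you, then I'd suggest sorting the content of the files, so that you know roughly where "123456789" will occur before you actually search. That way you won't have to search the unimportant parts.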
