c# - Frequency of KeyValuePair in a large text file (> 1 GB) using File.ReadLine

Question

I am new to C# and to object-oriented programming in general. I have an application that parses a very large text file.

I have two dictionaries:

Dictionary<string, string> parsingDict; // key: original value, value: replacement
Dictionary<int, string> Frequency;      // key: count, value: counted string

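I am looking for the frequency of each key. I am able to get the desired output, i.e.:

System1 was replaced by MachineA 5 times

System2 was replaced by MachineB 7 times

System3 was replaced by MachineC 10 times

System4 was replaced by MachineD 19 times

Here is my code: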

String[] arrayofLine = File.ReadAllLines(File);
foreach (var replacement in parsingDict.Keys)
{
    for (int i = 0; i < arrayofLine.Length; i++)
    {
        if (arrayofLine[i].Contains(replacement))
        {
            countr++;

            Frequency.Add(countr, Convert.ToString(replacement));
        }
    }
}

Frequency = Frequency.GroupBy(s => s.Value)
        .Select(g => g.First())
        .ToDictionary(kvp => kvp.Key, kvp => kvp.Value);  //Get only the distinct records.

foreach (var freq in Frequency)
{
    sbFreq.AppendLine(string.Format("The text {0} was replaced {2} time(s) with {1} \n",
        freq.Value, parsingDict[freq.Value],
        arrayofLine.Where(x => x.Contains(freq.Value)).Count()));
}

Using String[] arrayofLine = File.ReadAllLines(File); increases memory utilization.

How can arrayofLine.Where(x => x.Contains(freq.Value)).Count() be achieved using File.ReadLine, since it is memory friendly?

score 0 · Accepted Answer

string line = string.Empty;
// Key: the text of a line; value: how many times that line has been seen.
Dictionary<string, int> found = new Dictionary<string, int>();
using (System.IO.StreamReader file = new System.IO.StreamReader(@"c:\Path\To\BigFile.txt"))
{
    // Only one line is held in memory at a time.
    while (!file.EndOfStream)
    {
        line = file.ReadLine();
        // Matches found logic
        if (!found.ContainsKey(line)) found.Add(line, 1);
        else found[line] = found[line] + 1;
    }
}
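
To tie this back to the question, the "matches found" step could test each line against the keys of parsingDict instead of counting whole lines. The following is only a minimal sketch under that assumption: it uses the same substring test as the question's Contains call, and the names KeyFrequency and CountLinesContaining are hypothetical, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class KeyFrequency
{
    // Hypothetical helper: for each key, counts the lines that contain it.
    public static Dictionary<string, int> CountLinesContaining(
        string path, IEnumerable<string> keys)
    {
        // Copy the keys once so the dictionary is never enumerated while it is updated.
        string[] keyList = keys.Distinct().ToArray();

        // Start every key at zero so keys that never match still appear in the result.
        var counts = new Dictionary<string, int>();
        foreach (string key in keyList)
        {
            counts[key] = 0;
        }

        // File.ReadLines enumerates the file lazily, one line at a time,
        // unlike File.ReadAllLines, which loads every line into an array.
        foreach (string line in File.ReadLines(path))
        {
            foreach (string key in keyList)
            {
                if (line.Contains(key))
                {
                    counts[key]++;
                }
            }
        }

        return counts;
    }
}
```

With the counts in hand, the report loop from the question could iterate over parsingDict and format parsingDict[key] together with counts[key], so the file is read only once rather than once per key.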

score 0 · Accepted Answer

You can very easily read one line at a time (ref).

The relevant code looks like this:

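Dictionary<string,int> lineCount = new Dictionary<string,int>();
string line;

// Read the file one line at a time; ReadLine returns null at end of file.
using (System.IO.StreamReader file =
    new System.IO.StreamReader("c:\\test.txt"))
{
    while ((line = file.ReadLine()) != null)
    {
        // DiscoverFreq is a placeholder for your own matching logic.
        string value = DiscoverFreq(line);
        // Simplified: this throws KeyNotFoundException if value has no entry yet (see Note 2).
        lineCount[value] += 1;
    }
}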

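Note: it is important that you also consider what other information you are storing. Appending the lines of a large file to a string is basically the same as reading the whole file at once, only with more garbage collection.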

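Note 2: I simplified the part that updates the counts. You have to check whether the count entry exists and add it if it does not, or increment it if it does. Alternatively, you can initialize lineCount to 0 for every entry in freq.Values before scanning the file.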
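
Both options from Note 2 can be sketched briefly. This is an illustration only; the class and method names (CountUpdate, Increment, InitializeZero) are hypothetical and are not part of the answer's code.

```csharp
using System.Collections.Generic;

public static class CountUpdate
{
    // Option 1: add-or-increment. TryGetValue leaves current at 0 when the
    // key is missing, so a new key ends up with a count of 1.
    public static void Increment(Dictionary<string, int> counts, string key)
    {
        counts.TryGetValue(key, out int current);
        counts[key] = current + 1;
    }

    // Option 2: create a zero entry for every expected key before scanning,
    // so that counts[key] += 1 is safe for those keys.
    public static Dictionary<string, int> InitializeZero(IEnumerable<string> keys)
    {
        var counts = new Dictionary<string, int>();
        foreach (string key in keys)
        {
            counts[key] = 0;
        }
        return counts;
    }
}
```

Option 2 only helps if every value produced during the scan is one of the pre-initialized keys; for values discovered on the fly, option 1 is the safer choice.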

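If the number of unique words is large enough, you may want to use a small database such as SQLite to store the counts for you. That lets you query the information quickly without having to work out how to store and read a custom file format of your own.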
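
As a rough sketch of the SQLite idea, the counts could live in a single table that is updated with an upsert. This assumes the Microsoft.Data.Sqlite NuGet package, C# 8 or later, and a bundled SQLite recent enough to support ON CONFLICT ... DO UPDATE (3.24 or later); the table name, column names, and the SqliteCounts class are made up for illustration.

```csharp
using System.Collections.Generic;
using Microsoft.Data.Sqlite; // assumption: NuGet package Microsoft.Data.Sqlite

public static class SqliteCounts
{
    // Hypothetical helper: adds 1 to the stored count of every word passed in.
    public static void AddCounts(string dbPath, IEnumerable<string> words)
    {
        using var connection = new SqliteConnection($"Data Source={dbPath}");
        connection.Open();

        using (var create = connection.CreateCommand())
        {
            create.CommandText =
                "CREATE TABLE IF NOT EXISTS counts (word TEXT PRIMARY KEY, n INTEGER NOT NULL)";
            create.ExecuteNonQuery();
        }

        // A single transaction keeps many small writes reasonably fast.
        using var transaction = connection.BeginTransaction();
        using var upsert = connection.CreateCommand();
        upsert.Transaction = transaction;
        upsert.CommandText =
            "INSERT INTO counts (word, n) VALUES ($word, 1) " +
            "ON CONFLICT(word) DO UPDATE SET n = n + 1";
        var wordParam = upsert.Parameters.Add("$word", SqliteType.Text);

        foreach (string word in words)
        {
            wordParam.Value = word;
            upsert.ExecuteNonQuery();
        }

        transaction.Commit();
    }
}
```

The counts can then be read back with an ordinary SELECT, for example ordered by n, without holding the whole table in memory.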
