c# - Counting how many times certain words occur in a text with C#

Question

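I'm close, but my program still doesn't work properly. I'm trying to count how many times a set of words occurs in a text file, list those words with their individual counts, and then give the sum of all matching words that were found.

If there are 3 instances of "lorem" and 2 instances of "ipsum", the total should be 5. My sample text file is just the "Lorem ipsum" paragraph repeated a few times.

My problem is that the code I have so far only counts the first occurrence of each word, even though every word is repeated many times throughout the text file.

I'm using a "paid" parser called "GroupDocs.Parser", which I added through the NuGet package manager. If possible, I'd rather not use a paid product.

Is there a simpler way to do this in C#?

Here is a screenshot of the result I want.

Here is my complete code so far.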

using GroupDocs.Parser;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ConsoleApp5
{
    class Program
    {
        static void Main(string[] args)
        {

            using (Parser parser = new Parser(@"E:\testdata\loremIpsum.txt"))
            {

                // Extract a text into the reader
                using (TextReader reader = parser.GetText())
                {
                    // Define the search terms. 
                    string[] wordsToMatch = { "Lorem", "ipsum", "amet" };

                    Dictionary<string, int> stats = new Dictionary<string, int>();
                    string text = reader.ReadToEnd();
                    char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };
                    // split words
                    string[] words = text.Split(chars);
                    int minWordLength = 2;// to count words having more than 2 characters

                    // iterate over the word collection to count occurrences
                    foreach (string word in wordsToMatch)
                    {
                        string w = word.Trim().ToLower();
                        if (w.Length > minWordLength)
                        {
                            if (!stats.ContainsKey(w))
                            {
                                // add new word to collection
                                stats.Add(w, 1);
                            }
                            else
                            {
                                // update word occurrence count
                                stats[w] += 1;
                            }
                        }
                    }

                    // order the collection by word count
                    var orderedStats = stats.OrderByDescending(x => x.Value);


                    // print occurrence of each word
                    foreach (var pair in orderedStats)
                    {
                        Console.WriteLine("Total occurrences of {0}: {1}", pair.Key, pair.Value);

                    }
                    // print total word count
                    Console.WriteLine("Total word count: {0}", stats.Count);
                    Console.ReadKey();
                }
            }
        }
    }
}

Any suggestions about what I'm doing wrong?

Thanks in advance.

score 1 · Accepted Answer

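Splitting the entire contents of the text file to get a string array of words is not a good idea, because it creates a new string object in memory for every word. You can imagine the cost when dealing with large files.

An alternative approach is:

Use the Parallel.ForEach method to read the lines of the text file in parallel.
Use the thread-safe ConcurrentDictionary<TKey,TValue> collection so the parallel threads can access it.
Increment the value of each word (key) by the count returned by the Regex.Matches method.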

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.IO;
using System.Threading.Tasks;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        var file = @"loremIpsum.txt";
        var obj = new object(); // guards the read-modify-write on each counter
        var wordsToMatch = new ConcurrentDictionary<string, int>();

        // Register the search terms, each starting at zero.
        wordsToMatch.TryAdd("Lorem", 0);
        wordsToMatch.TryAdd("ipsum", 0);
        wordsToMatch.TryAdd("amet", 0);

        Console.WriteLine("Press a key to continue...");
        Console.ReadKey();

        // File.ReadLines streams the file lazily; the lines are processed in parallel.
        Parallel.ForEach(File.ReadLines(file),
            (line) =>
            {
                foreach (var word in wordsToMatch.Keys)
                    lock (obj)
                        wordsToMatch[word] += Regex.Matches(line, word,
                            RegexOptions.IgnoreCase).Count;
            });

        // Print the per-word counts, highest first.
        foreach (var kv in wordsToMatch.OrderByDescending(x => x.Value))
            Console.WriteLine($"Total occurrences of {kv.Key}: {kv.Value}");

        // Sum the values to get the total number of matches.
        Console.WriteLine($"Total word count: {wordsToMatch.Values.Sum()}");
        Console.ReadKey();
    }
}
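
One caveat with the code above: Regex.Matches(line, word, ...) also counts a term when it appears inside a longer word, and the pattern is not escaped. The following is only a sketch of one possible variation, not a definitive implementation. It assumes whole-word matching is wanted and that the search terms are distinct. ParallelWordCounter and Count are hypothetical names introduced for illustration. It wraps each term in \b word boundaries and replaces the lock with ConcurrentDictionary.AddOrUpdate.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical helper, not part of the original answer.
static class ParallelWordCounter
{
    public static ConcurrentDictionary<string, int> Count(
        IEnumerable<string> lines, IEnumerable<string> wordsToMatch)
    {
        // One case-insensitive, whole-word pattern per search term.
        // Assumption: the search terms are distinct (ToDictionary throws on duplicates).
        var patterns = wordsToMatch.ToDictionary(
            w => w,
            w => new Regex(@"\b" + Regex.Escape(w) + @"\b", RegexOptions.IgnoreCase));

        // Start every term at zero so terms that never match are still reported.
        var counts = new ConcurrentDictionary<string, int>();
        foreach (var w in patterns.Keys)
            counts[w] = 0;

        Parallel.ForEach(lines, line =>
        {
            foreach (var entry in patterns)
            {
                int found = entry.Value.Matches(line).Count;
                if (found > 0)
                    // Atomic add, so no explicit lock is needed.
                    counts.AddOrUpdate(entry.Key, found, (key, current) => current + found);
            }
        });

        return counts;
    }
}

static class ParallelDemo
{
    static void Main()
    {
        // In-memory lines keep the sketch self-contained;
        // File.ReadLines(path) could be passed instead.
        var lines = new[] { "Lorem ipsum dolor sit amet", "lorem IPSUM", "Lorem" };
        var counts = ParallelWordCounter.Count(lines, new[] { "Lorem", "ipsum", "amet" });

        foreach (var kv in counts.OrderByDescending(x => x.Value))
            Console.WriteLine($"Total occurrences of {kv.Key}: {kv.Value}");

        Console.WriteLine($"Total word count: {counts.Values.Sum()}");
    }
}
```

For a small file the parallel version is unlikely to be faster than a plain loop, so it may be worth measuring before adopting it.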

score 0 · Accepted Answer

You can replace this code with a LINQ query that uses case-insensitive grouping. For example:

char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };
var text = File.ReadAllText(somePath);
// RemoveEmptyEntries skips the empty tokens produced by adjacent separators
var query = text.Split(chars, StringSplitOptions.RemoveEmptyEntries)
                .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
                .Select(g => new { word = g.Key, count = g.Count() })
                .Where(stat => stat.count > 2)
                .OrderByDescending(stat => stat.count);

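At this point you can iterate over the query, or copy the results into an array, list or dictionary with ToArray(), ToList() or ToDictionary().

This isn't the most efficient code: for one thing, the whole file is loaded into memory. File.ReadLines can be used to load and iterate over the file line by line. LINQ can also be used to iterate over those lines: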

var lines = File.ReadLines(somePath);
var query = lines.SelectMany(line => line.Split(chars, StringSplitOptions.RemoveEmptyEntries))
                 .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
                 .Select(g => new { word = g.Key, count = g.Count() })
                 .Where(stat => stat.count > 2)
                 .OrderByDescending(stat => stat.count);
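
The queries above count every word in the file, whereas the question only cares about a fixed set of search terms. A minimal sketch of how the same LINQ pipeline could be restricted to those terms is shown below. LinqWordCounter and Count are hypothetical names, and the separator list is simply copied from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer.
static class LinqWordCounter
{
    static readonly char[] Separators = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };

    public static Dictionary<string, int> Count(
        IEnumerable<string> lines, IEnumerable<string> wordsToMatch)
    {
        // Case-insensitive lookup of the search terms.
        var targets = new HashSet<string>(wordsToMatch, StringComparer.OrdinalIgnoreCase);

        return lines
            .SelectMany(line => line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            .Where(word => targets.Contains(word))                 // keep only the search terms
            .GroupBy(word => word, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);
    }
}

static class LinqDemo
{
    static void Main()
    {
        // File.ReadLines(path) could be passed instead of this in-memory sample.
        var lines = new[] { "Lorem ipsum dolor sit amet.", "lorem ipsum, Lorem" };
        var stats = LinqWordCounter.Count(lines, new[] { "Lorem", "ipsum", "amet" });

        foreach (var pair in stats.OrderByDescending(p => p.Value))
            Console.WriteLine("Total occurrences of {0}: {1}", pair.Key, pair.Value);

        Console.WriteLine("Total word count: {0}", stats.Values.Sum());
    }
}
```

Note that a search term that never occurs is simply absent from the resulting dictionary, and each key keeps the casing of its first occurrence in the text.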

score 0 · Accepted Answer

stats is a dictionary, so stats.Count only tells you how many distinct words there are. You need to add up all of the values in it. Something like stats.Values.Sum().
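
Besides the total, the loop in the question iterates over wordsToMatch rather than over words, which is why each term is only ever counted once. A minimal sketch of both fixes together follows, assuming a plain .txt file so that GroupDocs.Parser is not needed (File.ReadAllText could supply the text). WordStats and Count are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not taken from the question or the answers.
static class WordStats
{
    // Counts how often each search term occurs in the text, ignoring case.
    public static Dictionary<string, int> Count(string text, IEnumerable<string> wordsToMatch)
    {
        char[] separators = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };

        // Case-insensitive keys; every search term starts at zero.
        var stats = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string target in wordsToMatch)
            stats[target] = 0;

        // Walk every word of the text, not the list of search terms.
        foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (stats.ContainsKey(word))
                stats[word] += 1;
        }

        return stats;
    }
}

static class WordStatsDemo
{
    static void Main()
    {
        // Sample text with 3 x "lorem" and 2 x "ipsum", as in the question's example.
        string text = "Lorem ipsum dolor. Lorem ipsum. lorem";
        var stats = WordStats.Count(text, new[] { "Lorem", "ipsum", "amet" });

        foreach (var pair in stats.OrderByDescending(p => p.Value))
            Console.WriteLine("Total occurrences of {0}: {1}", pair.Key, pair.Value);

        // Sum the values; stats.Count would only give the number of distinct terms.
        Console.WriteLine("Total word count: {0}", stats.Values.Sum());
    }
}
```

With the sample text above, the total comes out as 5 (3 for Lorem, 2 for ipsum, 0 for amet), which matches the example in the question.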
