c# - Count number of times a token appears in a document

Question

I have extracted tokens from the document and inserted them into an array. Now I want to determine how many times each token appears in the document.

The algorithm I currently have in mind takes each token and scans the whole array of tokens to find matching tokens. As you may have guessed, this is not very scalable.

I am looking for a better algorithm to help me solve this problem.

I have seen some similar questions on SO, but they all assume that the token is known at compile time.

score 4 · Accepted Answer

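Suppose, to simplify the example, that a "token" is an integer. Partition the tokens into equivalence classes using group by, and then count the size of each group.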

var tokens = new[] { 10, 20, 30, 10, 30, 20, 20, 20, 10 };
var grouped = from token in tokens group token by token;
foreach (var grp in grouped)
     Console.WriteLine("{0} {1}", grp.Key, grp.Count());

The output is:

10 3
20 4
30 2

score 3 · Accepted Answer

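Use a Map/HashMap whose key is a String (the token) and whose value is an Integer (the count).

This is the logic you need.

For each token:

If the token exists, fetch its entry by the token key and increment the count, storing the new value in place of the old one.
If the token does not exist, add the token to the map and set the integer value to 1.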
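
As a minimal C# sketch of the logic above, assuming the tokens are strings held in some enumerable collection, `Dictionary<string, int>` can play the role of the map. The names `TokenCounter` and `CountTokens` are made up for illustration and are not part of any library.

```csharp
using System;
using System.Collections.Generic;

public static class TokenCounter
{
    // Hypothetical helper: counts each token in a single pass over the input.
    public static Dictionary<string, int> CountTokens(IEnumerable<string> tokens)
    {
        var counts = new Dictionary<string, int>();
        foreach (var token in tokens)
        {
            // TryGetValue leaves current at 0 when the token is not in the map yet,
            // so the same line handles both the "exists" and "does not exist" cases.
            counts.TryGetValue(token, out var current);
            counts[token] = current + 1;
        }
        return counts;
    }

    public static void Main()
    {
        var tokens = new[] { "A", "B", "A", "A", "B", "C" };
        foreach (var pair in CountTokens(tokens))
            Console.WriteLine("{0} {1}", pair.Key, pair.Value);
    }
}
```

Each token is looked up once and written once, so the work grows roughly linearly with the number of tokens, instead of rescanning the whole array for every token.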

score 1 · Accepted Answer

I am not sure I fully understand the question, but this is how you can group the values (tokens) and then count how many times each one occurs.

List<string> tokens = new List<string> { "A", "B", "A", "A", "B", "C"};
var tokensCount = tokens.GroupBy(g => g).Select(g => new KeyValuePair<string, int>(g.Key, g.Count()));
// Returns A 3, B 2, C 1
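
If the counts need to be looked up by token afterwards, one possible variation is to materialize the groups into a dictionary with `ToDictionary`. This is only a sketch under that assumption; `GroupedTokenCounts` and `Build` are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupedTokenCounts
{
    // Hypothetical helper: groups equal tokens, then maps each token to its group size.
    public static Dictionary<string, int> Build(IEnumerable<string> tokens)
    {
        return tokens
            .GroupBy(t => t)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var counts = Build(new List<string> { "A", "B", "A", "A", "B", "C" });
        Console.WriteLine(counts["A"]); // prints 3
    }
}
```

Unlike the lazily evaluated query, the dictionary is built once, so repeated lookups do not re-run the grouping.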

score 1 · Accepted Answer

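This answer is for Java.

You can use a HashMap<String,Integer> (or a SortedMap<String,Integer> if you want the results in alphabetical order), where the keys are the tokens and the value is the count. For each element in the list, check whether it is already present in the map. If it is not, create a new key with the value 1. If it already exists, simply increase the value (the count) by 1.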

HashMap<String,Integer> counts= new HashMap<String,Integer>() ;
for(String e: myTokenList ) {
    if( counts.get(e) == null )
        counts.put(e,1);
    else
        counts.put(e,counts.get(e)+1);
}

There is a possible micro-optimization, which looks each token up only once per iteration:

HashMap<String,Integer> counts= new HashMap<String,Integer>() ;
for(String e: myTokenList ) {
    Integer c= counts.get(e) ;
    if( c == null )
        counts.put(e,1);
    else
        counts.put(e,c+1);
}

score 0 · Accepted Answer

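OK, following some of the other suggestions: do not insert the words from the document into an array (unless you have a good reason to, but none is mentioned in your question).

Instead, insert them into a map/dictionary, as in the example below (note that this could be done more efficiently, but it shows every step performed explicitly).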

var wordCounts = new Dictionary<string, int>();
var wordSeparators = new char[] {',', ' ', '\t', ';' /* etc */ };
using (var reader = File.OpenText("allmaywords.txt"))
{
    while (!reader.EndOfStream)
    {
        var words = reader
            .ReadLine() 
            .Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries)
            .Select(f => f.Trim()).ToList();
        foreach (var word in words)
        {
            if (!wordCounts.ContainsKey(word))
                wordCounts[word] = 1;
            else
                wordCounts[word] = wordCounts[word] + 1;
        } 
    }    
}

Now you can also get all the unique words (or tokens) with:

var uniqueTokens = wordCounts.Keys;

You can check whether a token is present:

var gotAFoo = wordCounts.ContainsKey("Foo");

And how often it occurs:

var numberOfFoosGiven = wordCounts["Foo"];
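
Note that the dictionary indexer throws a `KeyNotFoundException` when the key is absent, so reading the count of a word that never occurred needs a guard. A minimal sketch, assuming a count of 0 is the desired answer for unseen tokens (`SafeLookup` and `CountOf` are illustrative names):

```csharp
using System;
using System.Collections.Generic;

public static class SafeLookup
{
    // Hypothetical helper: returns the stored count, or 0 when the token never appeared.
    public static int CountOf(Dictionary<string, int> wordCounts, string token)
    {
        return wordCounts.TryGetValue(token, out var count) ? count : 0;
    }

    public static void Main()
    {
        var wordCounts = new Dictionary<string, int> { { "Foo", 2 } };
        Console.WriteLine(CountOf(wordCounts, "Foo")); // prints 2
        Console.WriteLine(CountOf(wordCounts, "Bar")); // prints 0
    }
}
```

This combines the presence check and the read into a single lookup. Keys are compared case-sensitively by default, so "Foo" and "foo" count as different tokens unless the dictionary is created with a comparer such as `StringComparer.OrdinalIgnoreCase`.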
