c# - Simple histogram generation of integer data in C#

Question

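As part of a test bench I'm building, I'm looking for a simple class to calculate a histogram of integer values (the number of iterations an algorithm takes to solve a problem). The answer should be called something like this: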

Histogram myHist = new Histogram();

for( uint i = 0; i < NUMBER_OF_RESULTS; i++ )
{

    myHist.AddValue( some_result );
}

for( uint j = 0; j < myHist.NumOfBins; j++ )
{
     Console.WriteLine( "{0} occurred {1} times", myHist.BinValues[j], myHist.BinCounts[j] );
}

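I'm surprised a bit of Googling didn't turn up a neat solution, but maybe I wasn't searching for the right things. Is there a generic solution out there, or is it worth rolling my own?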

score 19 · Accepted Answer

You can use a SortedDictionary:

uint[] items = new uint[] {5, 6, 1, 2, 3, 1, 5, 2}; // sample data
SortedDictionary<uint, int> histogram = new SortedDictionary<uint, int>();
foreach (uint item in items) {
    if (histogram.ContainsKey(item)) {
        histogram[item]++;
    } else {
        histogram[item] = 1;
    }
}
foreach (KeyValuePair<uint, int> pair in histogram) {
    Console.WriteLine("{0} occurred {1} times", pair.Key, pair.Value);
}

This will leave out empty bins, though.
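If the empty bins matter, one possible way to include them is to count first and then walk every value between the smallest and largest observed value, reporting 0 where nothing was counted. The following is only a sketch under that assumption; the class name `DenseHistogram` and the method `Build` are made up for illustration and are not part of the answer above.

```csharp
using System.Collections.Generic;

// Hypothetical helper: name and shape are illustrative only.
public static class DenseHistogram
{
    // Counts each value, then emits one bin for every value from the
    // smallest to the largest observed value, with 0 for missing values.
    public static List<KeyValuePair<uint, int>> Build(IEnumerable<uint> items)
    {
        var counts = new Dictionary<uint, int>();
        uint min = uint.MaxValue;
        uint max = uint.MinValue;

        foreach (uint item in items)
        {
            counts.TryGetValue(item, out int current); // current is 0 if absent
            counts[item] = current + 1;
            if (item < min) min = item;
            if (item > max) max = item;
        }

        var bins = new List<KeyValuePair<uint, int>>();
        if (counts.Count == 0)
        {
            return bins; // no data, so no bins
        }

        // Break inside the loop so that max == uint.MaxValue cannot overflow.
        for (uint value = min; ; value++)
        {
            counts.TryGetValue(value, out int count); // 0 for an empty bin
            bins.Add(new KeyValuePair<uint, int>(value, count));
            if (value == max) break;
        }
        return bins;
    }
}
```

Note that this produces one entry per integer in the range, so it only makes sense when the spread between the smallest and largest value is modest.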

score 6 · Accepted Answer

Based on BastardSaint's suggestion, I came up with a neat and fairly generic wrapper:

public class Histogram<TVal> : SortedDictionary<TVal, uint>
{
    public void IncrementCount(TVal binToIncrement)
    {
        if (ContainsKey(binToIncrement))
        {
            this[binToIncrement]++;
        }
        else
        {
            Add(binToIncrement, 1);
        }
    }
}

So now I can do:

const uint numOfInputDataPoints = 5;
Histogram<uint> hist = new Histogram<uint>();

// Fill the histogram with data
for (uint i = 0; i < numOfInputDataPoints; i++)
{
    // Grab a result from my algorithm
    uint numOfIterationsForSolution = MyAlgorithm.Run();

    // Add the number to the histogram
    hist.IncrementCount( numOfIterationsForSolution );
}

// Report the results
foreach (KeyValuePair<uint, uint> histEntry in hist.AsEnumerable())
{
    Console.WriteLine("{0} occurred {1} times", histEntry.Key, histEntry.Value);
}

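It took me a while to figure out how to make it generic (at first I just overrode the SortedDictionary constructor, which meant you could only use it for uint keys).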
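As a possible variation on the wrapper above, `TryGetValue` can replace the `ContainsKey` check followed by a separate read, which saves one lookup per increment. This is a sketch, not the author's code; the class name `TryGetHistogram` is invented here to keep it apart from the original `Histogram<TVal>`.

```csharp
using System.Collections.Generic;

// Hypothetical variant of the wrapper above; the class name is illustrative.
public class TryGetHistogram<TVal> : SortedDictionary<TVal, uint>
{
    public void IncrementCount(TVal binToIncrement)
    {
        // TryGetValue leaves count at default(uint), i.e. 0, when the key is missing.
        TryGetValue(binToIncrement, out uint count);

        // The indexer setter adds the key if it is absent and overwrites it otherwise.
        this[binToIncrement] = count + 1;
    }
}
```

Usage is the same as before: call `IncrementCount` for each result and enumerate the dictionary to report the bins in key order.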

score 5 · Accepted Answer

You can use Linq:

var items = new[] {5, 6, 1, 2, 3, 1, 5, 2};
items
    .GroupBy(i => i)
    .Select(g => new {
        Item = g.Key,
        Count = g.Count()
    })
    .OrderBy(g => g.Item)
    .ToList()
    .ForEach(g => {
        Console.WriteLine("{0} occurred {1} times", g.Item, g.Count);
    });

score 0 · Accepted Answer

I implemented a simple extension method to create a histogram:

public static IReadOnlyDictionary<T, int> ToHistogram<T>(this IEnumerable<T> enumerable)
   => enumerable.GroupBy(item => item).ToDictionary(grouping => grouping.Key, grouping => grouping.Count());
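For completeness, here is a sketch of how that method might be hosted. C# requires extension methods to live in a non-generic static class, so the snippet needs a wrapper; the class name `HistogramExtensions` is an arbitrary choice, not something from the answer.

```csharp
using System.Collections.Generic;
using System.Linq;

// Extension methods must be declared in a non-generic static class;
// the class name here is an assumption made for this sketch.
public static class HistogramExtensions
{
    // Groups equal items and maps each distinct item to its number of occurrences.
    public static IReadOnlyDictionary<T, int> ToHistogram<T>(this IEnumerable<T> enumerable)
        => enumerable.GroupBy(item => item)
                     .ToDictionary(grouping => grouping.Key, grouping => grouping.Count());
}
```

With that in place, `new[] {5, 6, 1, 2, 3, 1, 5, 2}.ToHistogram()` returns the counts per value. A plain `Dictionary` does not guarantee any enumeration order, so sort the entries (for example with `OrderBy(pair => pair.Key)`) before printing if order matters.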

score 0 · Accepted Answer

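This builds on the accepted answer. The problem is that building a SortedDictionary iteratively is slow, because both insertion and retrieval cost O(log(N)).

You can avoid that if you don't need to display the histogram while it is accumulating.

My modification uses a normal Dictionary and sorts it only once, at the end, into a SortedList.

For a sample size of 10 million items, this version is about 11 times faster (on my machine), at the cost of slightly higher memory usage until the GC kicks in (~10% extra memory).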

//generate a random sample
Random r = new Random();
var items = Enumerable
    .Range(1, 10_000_000)
    .Select( _ => (uint)r.Next(100_000))
    .ToList();

//build the histogram using a normal dictionary with O(1) lookups and insertions.
var tempHistogram = new Dictionary<uint, int>();
foreach (uint item in items)
{
    if (tempHistogram.ContainsKey(item))
    {
        tempHistogram[item]++;
    }
    else
    {
        tempHistogram[item] = 1;
    }
}

//Sort it once. SortedList conveniently has a ctor that takes a dictionary.
var sortedHistogram = new SortedList<uint, int>(tempHistogram);

foreach (KeyValuePair<uint, int> pair in sortedHistogram.Take(100))
{
    Console.WriteLine("{0} occurred {1} times", pair.Key, pair.Value);
}
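To compare the two approaches on your own machine, a rough timing harness along the following lines could be used. It is only a sketch: the class and method names (`HistogramTiming`, `BuildSorted`, `BuildThenSort`, `Time`) are invented for illustration, and a single `Stopwatch` run is a coarse measurement, not a proper benchmark.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical timing helper; all names are illustrative.
public static class HistogramTiming
{
    // Approach from the accepted answer: keep the counts sorted while accumulating.
    public static SortedDictionary<uint, int> BuildSorted(IEnumerable<uint> items)
    {
        var histogram = new SortedDictionary<uint, int>();
        foreach (uint item in items)
        {
            histogram.TryGetValue(item, out int count);
            histogram[item] = count + 1;
        }
        return histogram;
    }

    // Approach from this answer: accumulate in a hash-based Dictionary, sort once at the end.
    public static SortedList<uint, int> BuildThenSort(IEnumerable<uint> items)
    {
        var temp = new Dictionary<uint, int>();
        foreach (uint item in items)
        {
            temp.TryGetValue(item, out int count);
            temp[item] = count + 1;
        }
        return new SortedList<uint, int>(temp);
    }

    // Runs the action once and returns the elapsed wall-clock time.
    public static TimeSpan Time(Action action)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

For example, `HistogramTiming.Time(() => HistogramTiming.BuildSorted(items))` and `HistogramTiming.Time(() => HistogramTiming.BuildThenSort(items))` can be printed side by side for the `items` list generated above. The ratio will depend on the sample size, the number of distinct keys, and the runtime.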

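For very large samples (larger than available memory), there are amazing probabilistic algorithms that solve this problem.
They are also very well suited to streaming data. Look for "quantile sketches". Here is an implementation from the Apache Foundation: https://datasketches.apache.org/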
