47

I'm looking for an algorithm to determine percentiles for live data capture.

For example, consider the development of a server application.

The server's response times might look like this: 17 ms, 33 ms, 52 ms, 60 ms, 55 ms, and so on.

It would be useful to report the 90th percentile response time, the 80th percentile response time, and so on.

The naive algorithm is to insert every response time into a list; when statistics are requested, sort the list and take the value at the appropriate position.

Memory usage scales linearly with the number of requests.
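For concreteness, a minimal Java sketch of this naive approach (class and method names are just illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Naive percentile tracker: stores every sample, so memory is O(n). */
class NaivePercentiles {
    private final List<Long> samplesMs = new ArrayList<>();

    void record(long responseTimeMs) {
        samplesMs.add(responseTimeMs);
    }

    /** percentile in (0, 100]; sorts all samples on demand. */
    long percentile(double percentile) {
        List<Long> sorted = new ArrayList<>(samplesMs);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.size());
        return sorted.get(Math.max(rank - 1, 0));
    }
}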

Is there an algorithm that produces "approximate" percentile statistics with bounded memory usage? For example, suppose I want to solve this in a way that handles millions of requests, but only want to use, say, one kilobyte of memory for the percentile tracking (discarding the tracking for old requests is not an option, since the percentiles are supposed to cover all requests).

It is also required that there be no a priori knowledge of the distribution; for example, I don't want to specify any bucket ranges up front.


8 Answers

34

If you want to keep memory usage constant as you get more and more data, then you're going to have to resample that data somehow. That means you must apply some kind of rebinning scheme. You can wait until you've acquired a certain amount of raw input before you begin rebinning, but you cannot avoid it entirely.

So your question is really asking "what's the best way of dynamically binning my data"? There are lots of approaches, but if you want to minimise your assumptions about the range or distribution of the values you may receive, then a simple approach is to average over buckets of fixed size k, with logarithmically distributed widths. For example, say you want to hold 1000 values in memory at any one time. Pick a size for k, say 100. Pick your minimum resolution, say 1 ms. Then (a minimal sketch of the bucket-index computation follows the list):

  • the first bucket covers values between 0 and 1 ms (width = 1 ms)
  • second bucket: 1-3 ms (width = 2 ms)
  • third bucket: 3-7 ms (width = 4 ms)
  • fourth bucket: 7-15 ms (width = 8 ms)
  • ...
  • tenth bucket: 511-1023 ms (width = 512 ms)
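For concreteness, here is a minimal Java sketch of the bucket-index computation implied by this scheme, assuming the 1 ms minimum resolution above (the helper name is mine):

static int bucketIndex(long millis) {
    // maps a response time onto log-width buckets: 0-1ms, 1-3ms, 3-7ms, 7-15ms, ...
    if (millis < 1) {
        return 0;                                    // first bucket: 0-1 ms
    }
    // bucket i (i >= 1) covers [2^i - 1, 2^(i+1) - 1), i.e. a width of 2^i ms
    return 63 - Long.numberOfLeadingZeros(millis + 1);
}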

This type of log-scaled approach is similar to the chunking systems used in hash table algorithms, used by some filesystems and memory allocation algorithms. It works well when your data has a large dynamic range.

As new values come in, you can choose how you want to resample, depending on your requirements. For example, you could track a moving average, use a first-in-first-out scheme, or some other more sophisticated method. See the Kademlia algorithm for one approach (used by Bittorrent).

Ultimately, rebinning must lose you some information. Your choices regarding the binning will determine the specifics of what information is lost. Another way of saying this is that the constant size memory store implies a trade-off between dynamic range and the sampling fidelity; how you make that trade-off is up to you, but like any sampling problem, there's no getting around this basic fact.

If you're really interested in the pros and cons, then no answer on this forum can hope to be sufficient. You should look into sampling theory. There's a huge amount of research on this topic available.

For what it's worth, I suspect that your server times will have a relatively small dynamic range, so a more relaxed scaling to allow higher sampling of common values may provide more accurate results.

Edit: To answer your comment, here's an example of a simple binning algorithm (a short code sketch follows the list).

  • You store 1000 values, in 10 bins. Each bin therefore holds 100 values. Assume each bin is implemented as a dynamic array (a 'list', in Perl or Python terms).
  • When a new value comes in:

    • Determine which bin it should be stored in, based on the bin limits you've chosen.
    • If the bin is not full, append the value to the bin list.
    • If the bin is full, remove the value at the top of the bin list, and append the new value to the bottom of the bin list. This means old values are thrown away over time.
  • To find the 90th percentile, sort bin 10. The 90th percentile is the first value in the sorted list (element 900/1000).
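A small Java sketch of the scheme described in the bullets above, assuming 10 bins of 100 values with FIFO eviction (the class, the bin limits in binIndexFor, and all names are illustrative choices of mine):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

class BinnedSamples {
    private static final int BINS = 10;
    private static final int PER_BIN = 100;
    private final List<Deque<Long>> bins = new ArrayList<>();

    BinnedSamples() {
        for (int i = 0; i < BINS; i++) {
            bins.add(new ArrayDeque<>());
        }
    }

    void record(long millis) {
        Deque<Long> bin = bins.get(binIndexFor(millis));
        if (bin.size() == PER_BIN) {
            bin.removeFirst();                  // bin is full: throw away the oldest value
        }
        bin.addLast(millis);
    }

    // per the scheme above: sort the last bin; its smallest value approximates the 90th percentile
    long estimate90thPercentile() {
        List<Long> lastBin = new ArrayList<>(bins.get(BINS - 1));
        Collections.sort(lastBin);
        return lastBin.get(0);                  // assumes the bin holds at least one value
    }

    private int binIndexFor(long millis) {
        // illustrative log-width limits (0-1ms, 1-3ms, 3-7ms, ...), clamped to the last bin
        int i = millis < 1 ? 0 : 63 - Long.numberOfLeadingZeros(millis + 1);
        return Math.min(i, BINS - 1);
    }
}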

If you don't like throwing away old values, then you can implement some alternative scheme to use instead. For example, when a bin becomes full (reaches 100 values, in my example), you could take the average of the oldest 50 elements (i.e. the first 50 in the list), discard those elements, and then append the new average element to the bin, leaving you with a bin of 51 elements that now has space to hold 49 new values. This is a simple example of rebinning.

Another example of rebinning is downsampling; throwing away every 5th value in a sorted list, for example.

I hope this concrete example helps. The key point to take away is that there are lots of ways of achieving a constant memory aging algorithm; only you can decide what is satisfactory given your requirements.

Answered 2009-08-08T14:22:00.657
20

I've once published a blog post on this topic. The blog is now defunct but the article is included in full below.

The basic idea is to trade the exact calculation for a statement like "95% of responses take 500ms-600ms or less" (a bucketed answer instead of an exact percentile value somewhere in that 500ms-600ms range).


As we’ve recently started feeling that the response times of one of our webapps got worse, we decided to spend some time tweaking the app’s performance. As a first step, we wanted to get a thorough understanding of current response times. For performance evaluations, using minimum, maximum or average response times is a bad idea: “The ‘average’ is the evil of performance optimization and often as helpful as ‘average patient temperature in the hospital'” (MySQL Performance Blog). Instead, performance tuners should be looking at the percentile: “A percentile is the value of a variable below which a certain percent of observations fall” (Wikipedia). In other words: the 95th percentile is the time within which 95% of requests finished. Therefore, a performance goal related to the percentile could be something like “The 95th percentile should be lower than 800 ms”. Setting such performance goals is one thing, but efficiently tracking them for a live system is another.

I’ve spent quite some time looking for existing implementations of percentile calculations (e.g. here or here). All of them required storing response times for each and every request and either calculating the percentile on demand or inserting new response times in sorted order. This was not what I wanted. I was hoping for a solution that would allow memory- and CPU-efficient live statistics for hundreds of thousands of requests. Storing response times for hundreds of thousands of requests and calculating the percentile on demand sounds neither CPU nor memory efficient.

The kind of solution I was hoping for simply doesn’t seem to exist. On second thought, I came up with another idea: for the type of performance evaluation I was looking for, it’s not necessary to get the exact percentile. An approximate answer like “the 95th percentile is between 850ms and 900ms” would totally suffice. Lowering the requirements this way makes an implementation extremely easy, especially if upper and lower bounds for the possible results are known. For example, I’m not interested in response times higher than several seconds – they are extremely bad anyway, regardless of being 10 seconds or 15 seconds.

So here is the idea behind the implementation:

  1. Define any number of response time buckets (e.g. 0-100ms, 100-200ms, 200-400ms, 400-800ms, 800-1200ms, …)
  2. Count the total number of responses and the number of responses in each bucket (for a response time of 360ms, increment the counter for the 200ms – 400ms bucket)
  3. Estimate the n-th percentile by summing the bucket counters until the sum exceeds n percent of the total

It’s that simple. And here is the code.

Some highlights:

// _limits[i] is the upper bound (in ms) of bucket i; _counts[i] is that bucket's counter.
public void increment(final int millis) {
    final int i = index(millis);
    if (i < _limits.length) {
        _counts[i]++;
    }
    _total++;
}

// Walk the buckets in ascending order of limit and return the upper limit of the
// first bucket whose cumulative percentage covers the requested percentile.
public int estimatePercentile(final double percentile) {
    if (percentile < 0.0 || percentile > 100.0) {
        throw new IllegalArgumentException("percentile must be between 0.0 and 100.0, was " + percentile);
    }

    for (final Percentile p : this) {
        if (percentile - p.getPercentage() <= 0.0001) {
            return p.getLimit();
        }
    }
    return Integer.MAX_VALUE;
}

This approach only requires two int values (= 8 bytes) per bucket, allowing you to track 128 buckets with 1K of memory, which is more than sufficient for analysing the response times of a web application at a granularity of 50ms. Additionally, for the sake of performance, I’ve intentionally implemented this without any synchronization (e.g. using AtomicIntegers), knowing that some increments might get lost.

By the way, using Google Charts and 60 percentile counters, I was able to create a nice graph out of one hour of collected response times:

percentiles graph

Answered 2009-11-07T08:33:40.440
18

(It's been quite some time since this question was asked, but I'd like to point out a few related research papers)

There has been a significant amount of research on approximate percentiles of data streams in the past few years. A few interesting papers with full algorithm definitions:

All of these papers propose algorithms with sub-linear space complexity for the computation of approximate percentiles over a data stream.

Answered 2011-10-05T10:08:52.183
17

I believe there are many good approximate algorithms for this problem. A good first-cut approach is to simply use a fixed-size array (say 1K worth of data). Fix some probability p. For each request, with probability p, write its response time into the array (replacing the oldest time in there). Since the array is a subsampling of the live stream and since subsampling preserves the distribution, doing the statistics on that array will give you an approximation of the statistics of the full, live stream.
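A hedged Java sketch of this fixed-size, probabilistic replacement buffer (the buffer size and probability p are the knobs described above; everything else, including names, is illustrative):

import java.util.Arrays;
import java.util.Random;

/** Keeps a fixed-size subsample of the stream: with probability p,
 *  a new sample overwrites the oldest slot in a circular buffer. */
class SubsampledTimes {
    private final long[] buffer;
    private final double p;
    private final Random rng = new Random();
    private int next = 0;       // next slot to overwrite (the oldest entry)
    private int filled = 0;

    SubsampledTimes(int capacity, double p) {
        this.buffer = new long[capacity];
        this.p = p;
    }

    void maybeRecord(long millis) {
        if (rng.nextDouble() >= p) {
            return;                             // skip this sample with probability 1 - p
        }
        buffer[next] = millis;
        next = (next + 1) % buffer.length;
        filled = Math.min(filled + 1, buffer.length);
    }

    /** Approximate percentile computed over the subsample only. */
    long percentile(double percentile) {
        if (filled == 0) {
            throw new IllegalStateException("no samples recorded yet");
        }
        long[] sorted = Arrays.copyOf(buffer, filled);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * filled);
        return sorted[Math.max(rank - 1, 0)];
    }
}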

This approach has several advantages: it requires no a-priori information, and it's easy to code. You can build it quickly and experimentally determine, for your particular server, at what point growing the buffer has only a negligible effect on the answer. That is the point where the approximation is sufficiently precise.

If you find that you need too much memory to give you statistics that are precise enough, then you'll have to dig further. Good keywords are: "stream computing", "stream statistics", and of course "percentiles". You can also try "ire and curses"'s approach.

Answered 2009-08-11T09:02:16.963
5

Try the simple algorithm defined in the paper “Sequential Procedure for Simultaneous Estimation of Several Percentiles” (Raatikainen). It’s fast, requires 2*m+3 markers (for m percentiles) and tends to an accurate approximation quickly.

Answered 2010-01-05T20:13:50.087
3

Use a dynamic array T[] of large integers, where T[n] counts the number of times the response time was n milliseconds. If you really are doing statistics on a server application then 250 ms response times are probably your absolute limit anyway. So your 1 KB holds one 32-bit integer for every ms between 0 and 250, and you have some room to spare for an overflow bin. If you want something with more bins, go with 8-bit numbers for 1000 bins, and the moment a counter would overflow (i.e. the 256th request at that response time) you shift the bits in all bins down by 1 (effectively halving the value in all bins). This means you disregard all bins that capture less than 1/127th of the delays that the most visited bin catches.
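A rough Java sketch of the 1000-bin, 8-bit-counter variant just described (the clamping of out-of-range values and the percentile walk are my own additions; the short type simply avoids Java's signed bytes):

/** Per-millisecond histogram with byte-sized counters. When any counter
 *  would pass 255, every bin is halved, trading exact counts for bounded memory. */
class ByteHistogram {
    private static final int BINS = 1000;            // response times 0..999 ms; larger values clamp to the last bin
    private final short[] counts = new short[BINS];  // each holds 0..255

    void record(int millis) {
        int bin = Math.min(Math.max(millis, 0), BINS - 1);
        if (counts[bin] == 255) {
            for (int i = 0; i < BINS; i++) {
                counts[i] >>= 1;                     // halve every bin before counting this sample
            }
        }
        counts[bin]++;
    }

    /** Estimate the n-th percentile by walking the cumulative (scaled) counts. */
    int estimatePercentile(double percentile) {
        long total = 0;
        for (short c : counts) total += c;
        long threshold = (long) Math.ceil(percentile / 100.0 * total);
        long running = 0;
        for (int ms = 0; ms < BINS; ms++) {
            running += counts[ms];
            if (running >= threshold) return ms;
        }
        return BINS - 1;
    }
}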

If you really, really need a set of specific bins I'd suggest using the first day of requests to come up with a reasonable fixed set of bins. Anything dynamic would be quite dangerous in a live, performance-sensitive application. If you choose that path you'd better know what you're doing, or one day you're going to get called out of bed to explain why your statistics tracker is suddenly eating 90% CPU and 75% memory on the production server.

As for additional statistics: for mean and variance there are some nice recursive algorithms that take up very little memory. These two statistics can be useful enough in themselves for a lot of distributions, because the central limit theorem states that distributions arising from a sufficiently large number of independent variables approach the normal distribution, which is fully defined by mean and variance. You can then use one of the normality tests on the last N values (where N is sufficiently large but constrained by your memory requirements) to monitor whether the assumption of normality still holds.
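One well-known example of such a constant-memory recursive scheme is Welford's online algorithm for mean and variance; a minimal Java sketch (illustrative, not taken from this answer):

/** Welford's online algorithm: running mean and variance in O(1) memory. */
class RunningStats {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;   // running sum of squared deviations from the mean

    void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);   // uses the updated mean, per Welford's recurrence
    }

    double getMean()     { return mean; }
    double getVariance() { return n > 1 ? m2 / (n - 1) : 0.0; }   // sample variance
}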

Answered 2009-08-08T15:08:01.797
1

@thkala started off with some literature citations. Let me extend that.

Implementations

Literature

Cheap hack for P99 of <10M values: just store top 1% in a priority queue!

This'll sound stupid, but if I want to calculate the P99 of 10M float64s, I just create a priority queue with 100k float32s (which takes 400 kB). This takes only 4x as much space as "GK01" and is much faster. For 5M or fewer items, it takes less space than GK01!!

struct TopValues {
    // min-heap (via Reverse) holding the largest 1% of values seen so far
    values: std::collections::BinaryHeap<std::cmp::Reverse<ordered_float::NotNan<f32>>>,
}

impl TopValues {
    fn new(count: usize) -> Self {
        // reserve room for the top 1% of `count` values
        let capacity = std::cmp::max(count / 100, 1);
        let values = std::collections::BinaryHeap::with_capacity(capacity);
        TopValues { values }
    }

    fn render(&mut self) -> String {
        // the smallest kept value is the p99 boundary; the largest kept value is the overall max
        let p99 = self.values.peek().unwrap().0;
        let max = self.values.drain().min().unwrap().0;
        format!("TopValues, p99={:.4}, max={:.4}", p99, max)
    }

    fn insert(&mut self, value: f64) {
        let value = value as f32;
        let value = std::cmp::Reverse(unsafe { ordered_float::NotNan::new_unchecked(value) });

        if self.values.len() < self.values.capacity() {
            self.values.push(value);
        } else if self.values.peek().unwrap().0 < value.0 {
            // new value beats the current p99 boundary: evict the smallest kept value
            self.values.pop();
            self.values.push(value);
        } else {
            // value is below the p99 boundary: ignore it
        }
    }
}
Answered 2021-11-10T14:49:21.160
0

You can try the following structure:

Take n as input, e.g. n = 100.

We'll keep an array of ranges [min, max], each with a count, sorted by min.

Insertion of value x – binary-search for the range whose min matches x; if there is no exact match, take the preceding range (the one where min < x). If the value belongs to that range (x <= max), increment its count. Otherwise insert a new range [min = x, max = x, count = 1].

If the number of ranges hits 2*n – collapse/merge the array into n (half) by taking the min from odd entries and the max from even entries, summing their counts.

To get, e.g., p95: walk from the end, summing the counts, until the next addition would cross the threshold (i.e. the values below the current position would no longer account for 95% of the total), then take p95 = min + (max - min) * partial, where partial is the fraction of that range needed to reach exactly 95%.
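A hedged Java sketch of this structure as I read the steps above (the class, method names, and the interpolation details are mine):

import java.util.ArrayList;
import java.util.List;

/** Dynamic-range histogram: at most 2*n ranges, merged in pairs when full. */
class RangeHistogram {
    private static class Range {
        double min, max;
        long count;
        Range(double min, double max, long count) { this.min = min; this.max = max; this.count = count; }
    }

    private final int n;
    private final List<Range> ranges = new ArrayList<>();  // kept sorted by min
    private long total = 0;

    RangeHistogram(int n) { this.n = n; }

    void insert(double x) {
        total++;
        int i = findRangeIndex(x);                          // last range with min <= x, or -1
        if (i >= 0 && x <= ranges.get(i).max) {
            ranges.get(i).count++;                          // x falls inside an existing range
        } else {
            ranges.add(i + 1, new Range(x, x, 1));          // new singleton range, keeps order by min
            if (ranges.size() >= 2 * n) {
                mergePairs();
            }
        }
    }

    /** e.g. percentile(95): walk from the top until the upper tail covers (100 - p)% of samples. */
    double percentile(double p) {
        long threshold = Math.round((1.0 - p / 100.0) * total);
        long seen = 0;
        for (int i = ranges.size() - 1; i >= 0; i--) {
            Range r = ranges.get(i);
            seen += r.count;
            if (seen >= threshold) {
                double partial = (double) (seen - threshold) / r.count;  // fraction of this range below the cut
                return r.min + (r.max - r.min) * partial;
            }
        }
        return ranges.isEmpty() ? Double.NaN : ranges.get(0).min;
    }

    private int findRangeIndex(double x) {
        int lo = 0, hi = ranges.size() - 1, ans = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (ranges.get(mid).min <= x) { ans = mid; lo = mid + 1; } else { hi = mid - 1; }
        }
        return ans;
    }

    private void mergePairs() {
        // collapse 2n ranges into n: pair up neighbours, keep the pair's min and max, sum counts
        List<Range> merged = new ArrayList<>();
        for (int i = 0; i + 1 < ranges.size(); i += 2) {
            Range a = ranges.get(i), b = ranges.get(i + 1);
            merged.add(new Range(a.min, Math.max(a.max, b.max), a.count + b.count));
        }
        if (ranges.size() % 2 == 1) {
            merged.add(ranges.get(ranges.size() - 1));      // odd leftover range carried over as-is
        }
        ranges.clear();
        ranges.addAll(merged);
    }
}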

It will settle on dynamic ranges of measurements. n can be adjusted to trade accuracy for memory (and, to a lesser extent, CPU). If you make the values more discrete, e.g. by rounding to 0.01 before insertion, it will stabilise on ranges sooner.

You could improve accuracy by not assuming that each range holds uniformly distributed entries; something cheap like keeping a sum of values per range (which gives you avg = sum / count) would help you read a closer p95 value from the range it sits in.

You can also rotate the arrays, e.g. after m = 1,000,000 entries start filling a new array and take p95 as a weighted sum over the counts in the arrays (if array B has 10% of the count of A, then it contributes 10% to the p95 value).

Answered 2021-07-12T04:08:27.463