Looking around the web for information on doing maths in Redis and don't actually find much. I'm using the Redis-RB gem in Rails, and caching lists of results:
e = [1738738.0, 2019461.0, 1488842.0, 2272588.0, 1506046.0, 2448701.0, 3554207.0, 1659395.0, ...]
$redis.lpush "analytics:math_test", e
Currently, our lists of numbers max in the thousands to tens of thousands per list per day, with number of lists likely in the thousands per day. (This is not actually that much; however, we're growing, and expect much larger sample sizes very soon.)
For each of these lists, I'd like to be able to run stats. I currently do this in-app
def basic_stats(arr)
return nil if arr.nil? or arr.empty?
min = arr.min.to_f
max = arr.max.to_f
total = arr.inject(:+)
len = arr.length
mean = total.to_f / len # to_f so we don't get an integer result
sorted = arr.sort
median = len % 2 == 1 ? sorted[len/2] : (sorted[len/2 - 1] + sorted[len/2]).to_f / 2
sum = arr.inject(0){|accum, i| accum +(i-mean)**2 }
variance = sum/(arr.length - 1).to_f
std_dev = Math.sqrt(variance).nan? ? 0 : Math.sqrt(variance)
{min: min, max: max, mean: mean, median: median, std_dev: std_dev, size: len}
end
and, while I could simply store the stats, I will often have to aggregate lists together to run stats on the aggregated list. Thus, it makes sense to store the raw numbers rather than every possible aggregated set. Because of this, I need the math to be fast, and have been exploring ways to do this. The simplest way is just doing it in-app, with 150k items in a list, this isn't actually too terrible:
$redis_analytics.llen "analytics:math_test", 0, -1
=> 156954
Benchmark.measure do
basic_stats $redis_analytics.lrange("analytics:math_test", 0, -1).map(&:to_f)
end
=> 2.650000 0.060000 2.710000 ( 2.732993)
While I'd rather not push 3 seconds for a single calculation, given that this might be outside of my current use-case by about 10x number of samples, it's not terrible. What if we were working with a sample size of one million or so?
$redis_analytics.llen("analytics:math_test")
=> 1063454
Benchmark.measure do
basic_stats $redis_analytics.lrange("analytics:math_test", 0, -1).map(&:to_f)
end
=> 21.360000 0.340000 21.700000 ( 21.847734)
Options
- Use the SORT method on the list, then you can instantaneously get min/max/length in Redis. Unfortunately, it seems that you still have to go in-app for things like median, mean, std_dev. Unless we can calculate these in Redis.
- Use Lua scripting to do the calculations. (I haven't learned any Lua yet, so can't say I know what this would look like. If it's likely faster, I'd like to know so I can try it.)
- Some more efficient way to utilize Ruby, which seems a wee bit unlikely since utilizing what seems like a fairly decent stats gem has analogous results
- Use a different database.
Example using StatsSample gem
Using a gem seems to gain me nothing. In Python, I'd probably write a C module, not sure if many ruby stats gems are in C.
require 'statsample'
def basic_stats(stats)
return nil if stats.nil? or stats.empty?
arr = stats.to_scale
{min: arr.min, max: arr.max, mean: arr.mean, median: arr.median, std_dev: arr.sd, size: stats.length}
end
Benchmark.measure do
basic_stats $redis_analytics.lrange("analytics:math_test", 0, -1).map(&:to_f)
end
=> 20.860000 0.440000 21.300000 ( 21.436437)
Coda
It's quite possible, of course, that such large stats calculations will simply take a long time and that I should offload them to a queue. However, given that much of this math is actually happening inside Ruby/Rails, rather than in the database, I thought I might have other options.