I've looked around the web for information on doing maths in Redis and haven't found much. I'm using the redis-rb gem in Rails, and caching lists of results:

e = [1738738.0, 2019461.0, 1488842.0, 2272588.0, 1506046.0, 2448701.0, 3554207.0, 1659395.0, ...]
$redis.lpush "analytics:math_test", e

Currently, our lists of numbers max in the thousands to tens of thousands per list per day, with number of lists likely in the thousands per day. (This is not actually that much; however, we're growing, and expect much larger sample sizes very soon.)

For each of these lists, I'd like to be able to run stats. I currently do this in-app:

def basic_stats(arr)
  return nil if arr.nil? || arr.empty?
  min = arr.min.to_f
  max = arr.max.to_f
  total = arr.inject(:+)
  len = arr.length
  mean = total.to_f / len # to_f so we don't get an integer result
  sorted = arr.sort
  # middle element for odd lengths, average of the two middle elements otherwise
  median = len % 2 == 1 ? sorted[len/2] : (sorted[len/2 - 1] + sorted[len/2]).to_f / 2
  sum = arr.inject(0) { |accum, i| accum + (i - mean)**2 } # sum of squared deviations
  variance = sum / (len - 1).to_f # sample variance; NaN when len == 1
  std_dev = variance.nan? ? 0 : Math.sqrt(variance)

  {min: min, max: max, mean: mean, median: median, std_dev: std_dev, size: len}
end

While I could simply store the stats, I will often have to aggregate lists together and run stats on the aggregated list, so it makes sense to store the raw numbers rather than every possible aggregated set. Because of this, I need the math to be fast, and have been exploring ways to do this. The simplest way is just doing it in-app. With 150k items in a list, this isn't actually too terrible:

$redis_analytics.llen "analytics:math_test"
=> 156954
Benchmark.measure do
  basic_stats $redis_analytics.lrange("analytics:math_test", 0, -1).map(&:to_f)
end 
=>   2.650000   0.060000   2.710000 (  2.732993)

While I'd rather not spend 3 seconds on a single calculation, this is about 10x the number of samples in my current use case, so it's not terrible. What if we were working with a sample size of one million or so?

$redis_analytics.llen("analytics:math_test")
=> 1063454 
Benchmark.measure do
  basic_stats $redis_analytics.lrange("analytics:math_test", 0, -1).map(&:to_f)
end
=>  21.360000   0.340000  21.700000 ( 21.847734) 

Options

  1. Use the SORT command on the list; min/max/length are then available almost instantly in Redis. Unfortunately, it seems that you still have to go in-app for things like median, mean, and std_dev, unless these can be calculated in Redis.
  2. Use Lua scripting to do the calculations. (I haven't learned any Lua yet, so can't say I know what this would look like. If it's likely faster, I'd like to know so I can try it.)
  3. Find some more efficient way to use Ruby, which seems a wee bit unlikely, since a fairly decent stats gem gives analogous results.
  4. Use a different database.

Example using the Statsample gem

Using a gem seems to gain me nothing. In Python, I'd probably write a C module; I'm not sure whether many Ruby stats gems are written in C.

require 'statsample'
def basic_stats(stats)
  return nil if stats.nil? or stats.empty?
  arr = stats.to_scale

  {min: arr.min, max: arr.max, mean: arr.mean, median: arr.median, std_dev: arr.sd, size: stats.length}
end

Benchmark.measure do
  basic_stats $redis_analytics.lrange("analytics:math_test", 0, -1).map(&:to_f)
end
=>  20.860000   0.440000  21.300000 ( 21.436437)

Coda

It's quite possible, of course, that such large stats calculations will simply take a long time and that I should offload them to a queue. However, given that much of this math is actually happening inside Ruby/Rails, rather than in the database, I thought I might have other options.

3 Answers

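I'd like to keep this open in case anyone has input that could help others doing the same thing. For me, however, I've just realized that I've been spending too much time trying to force Redis to do something SQL does quite well. If I simply dump this into Postgres, I can do really efficient aggregation and math directly in the database. I think I was just stuck using Redis for something that was a good idea when it started, but scaled out to something bad.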

Answered 2012-09-11T19:10:28.707

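If you can switch to Redis 2.6, Lua scripting is probably the best way to solve this problem. Testing the speed should be quite simple, so given the small time investment needed, I strongly suggest trying Lua scripting and seeing what results you get.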
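
As a rough idea of what the Lua route could look like, here is a minimal, unbenchmarked sketch. It assumes Redis 2.6+ and a redis-rb style client whose `eval` accepts a `keys:` option; `STATS_LUA` and `lua_stats` are hypothetical names, not part of any library. The script makes a single pass over the list and leaves the median out, since that would require sorting inside Redis. Keep in mind that Redis runs scripts atomically, so a long list blocks other commands while the script runs.

```ruby
# Sketch only: STATS_LUA and lua_stats are hypothetical names. `client` is
# assumed to be a redis-rb connection (Redis 2.6+), e.g. $redis_analytics.
STATS_LUA = <<~'LUA'
  local vals = redis.call('LRANGE', KEYS[1], 0, -1)
  local n = #vals
  if n == 0 then return false end
  local min, max, mean, m2 = math.huge, -math.huge, 0, 0
  for i = 1, n do
    local x = tonumber(vals[i]) -- assumes every element is numeric
    if x < min then min = x end
    if x > max then max = x end
    local d = x - mean
    mean = mean + d / i
    m2 = m2 + d * (x - mean) -- running sum of squared deviations
  end
  -- Return strings: Lua numbers are truncated to integers in Redis replies.
  local out = {n, min, max, mean, m2}
  for i = 1, #out do out[i] = string.format('%.17g', out[i]) end
  return out
LUA

# Runs the script server-side and converts the reply into a stats hash.
def lua_stats(client, key)
  reply = client.eval(STATS_LUA, keys: [key])
  return nil unless reply # empty or missing list
  n, min, max, mean, m2 = reply.map(&:to_f)
  n = n.to_i
  std_dev = n < 2 ? 0.0 : Math.sqrt(m2 / (n - 1)) # sample std dev
  { min: min, max: max, mean: mean, std_dev: std_dev, size: n }
end
```

With a live connection this would be called as `lua_stats($redis_analytics, "analytics:math_test")`. Only the five summary numbers cross the wire, so the `lrange` transfer and the `map(&:to_f)` step in Ruby both go away; whether that is actually faster overall would have to be measured.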

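Another thing you can do is use Lua to set the data, and make sure it also updates a related hash per list that holds the min/max/average stats directly, so you don't have to compute them every time, since they are updated incrementally. This is not always possible, by the way; it depends on your specific use case.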
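
The incremental idea can be sketched in plain Ruby. This is an illustration under assumptions, not the answerer's code: a Ruby Hash stands in for the per-list Redis hash, the function names are hypothetical, and the update rule is Welford's online algorithm, with the pairwise merge formula from Chan et al. used to combine summaries of separate lists.

```ruby
# A plain Hash stands in for the per-list Redis hash; all names are hypothetical.
def empty_summary
  { n: 0, min: nil, max: nil, mean: 0.0, m2: 0.0 }
end

# Fold one new value into the summary in O(1) (Welford's update).
def add_value(s, x)
  x = x.to_f
  n = s[:n] + 1
  delta = x - s[:mean]
  mean = s[:mean] + delta / n
  {
    n: n,
    min: s[:min].nil? ? x : [s[:min], x].min,
    max: s[:max].nil? ? x : [s[:max], x].max,
    mean: mean,
    m2: s[:m2] + delta * (x - mean) # running sum of squared deviations
  }
end

# Combine two summaries without revisiting the raw values.
def merge_summaries(a, b)
  return b if a[:n].zero?
  return a if b[:n].zero?
  n = a[:n] + b[:n]
  delta = b[:mean] - a[:mean]
  {
    n: n,
    min: [a[:min], b[:min]].min,
    max: [a[:max], b[:max]].max,
    mean: a[:mean] + delta * b[:n] / n,
    m2: a[:m2] + b[:m2] + delta**2 * a[:n] * b[:n] / n
  }
end

# Sample standard deviation (n - 1 denominator, as in basic_stats).
def std_dev(s)
  s[:n] < 2 ? 0.0 : Math.sqrt(s[:m2] / (s[:n] - 1))
end

list_a = [1738738.0, 2019461.0, 1488842.0].reduce(empty_summary) { |s, x| add_value(s, x) }
list_b = [2272588.0, 1506046.0].reduce(empty_summary) { |s, x| add_value(s, x) }
both = merge_summaries(list_a, list_b)
```

Because summaries merge, aggregating several lists only needs their stored summaries rather than the raw numbers. The median is the exception: it cannot be maintained or merged this way, so it would still need the raw values or an approximation.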

Answered 2012-09-24T00:02:07.860

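I would look at NArray. From their home page:

This extension library incorporates fast calculation and easy manipulation of large numerical arrays into the Ruby language.

It looks like their array class has most of the functions you need built in. Cmd-F "Statistics" on that page.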

Answered 2012-09-11T19:12:05.793