mysql - Dataset too large to load into memory for processing

Question

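I have a large, fast-growing dataset of about 4 million rows. To define and exclude outliers (for statistics/analytics purposes), I need the algorithm to consider every entry in the dataset. However, that is too much data to load into memory, and my system chokes. This is what I currently use to collect and process the data: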

@scoreInnerFences = innerFence Post.where( :source => 1 ).
                                    order( :score ).
                                    pluck( :score )

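I don't think a typical divide-and-conquer approach will work, because every entry has to be considered to keep my outlier calculation accurate. How can this be done efficiently?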

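innerFence identifies the lower and upper quartiles of the dataset, then uses those values to calculate the outlier boundaries. Here is the (not yet refactored, non-DRY) code: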

def q1(s)
  q = s.length / 4

  if s.length % 2 == 0
    return ( s[ q ] + s[ q - 1 ] ) / 2
  else
    return s[ q ]
  end
end

def q2(s)
  q = s.length / 4

  if s.length % 2 == 0
    return ( s[ q * 3 ] + s[ (q * 3) - 1 ] ) / 2
  else
    return s[ q * 3 ]
  end
end

def innerFence(s)
  q1 = q1(s)
  q2 = q2(s)

  iq = (q2 - q1) * 3

  if1 = q1 - iq
  if2 = q2 + iq

  return [if1, if2]
end

score 1 · Accepted Answer

This isn't the best approach, but it is a simple one:

Run several queries. First, count the number of scores:

q = Post.where(:source => 1).count

Then do your calculations and fetch only the scores you need:

q1 = Post.where(:source => 1).reverse_order(:score).select("avg(score) as score").offset(q).limit((q%2)+1)

q2 = Post.where(:source => 1).reverse_order(:score).select("avg(score) as score").offset(q*3).limit((q%2)+1)

The code may be wrong, but I'm sure you get the idea.
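To make the idea concrete, here is a minimal sketch in plain Ruby, not a tested ActiveRecord implementation. It assumes the row count is known up front (at least 4 rows) and that a block can return a few sorted scores starting at a given offset. The method name `inner_fence_by_offset` and the block interface are hypothetical, introduced only for illustration. In a Rails app the block would presumably wrap an ordered `offset`/`limit` query like the one in the comment. The quartile positions and the fence formula mirror the question's `q1`, `q2` and `innerFence`.

```ruby
# Hypothetical helper: computes the fences while fetching only the rows
# around each quartile position, instead of loading every score.
# `fetch` stands in for a database call, assumed to look something like:
#   Post.where(:source => 1).order(:score).offset(offset).limit(limit).pluck(:score)
# Assumes count >= 4 so that no negative offset is requested.
def inner_fence_by_offset(count, &fetch)
  q = count / 4

  # Same rule as the question's q1/q2: when the count is even, average the
  # value at the index with its left neighbour; otherwise take the value itself.
  quartile = lambda do |index|
    if count.even?
      pair = fetch.call(index - 1, 2)
      (pair[0] + pair[1]) / 2
    else
      fetch.call(index, 1).first
    end
  end

  lower = quartile.call(q)
  upper = quartile.call(q * 3)
  spread = (upper - lower) * 3

  [lower - spread, upper + spread]
end

# Usage with an in-memory array standing in for the sorted table.
sorted_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
fences = inner_fence_by_offset(sorted_scores.length) do |offset, limit|
  sorted_scores[offset, limit]
end
```

With this shape, only up to four rows plus one count ever reach Ruby, and the sorting stays in the database. An index covering `source` and `score` would likely matter for large offsets, though that is an assumption about the schema.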

score 0 · Accepted Answer

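For large datasets, I sometimes drop down below ActiveRecord. I would think it is a memory hog even when using pluck. Of course this is less portable, but sometimes it is worth it.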

score = Post.connection.execute('select score from posts where score > 1 order by score').map(&:first)

I don't know whether this helps enough for 4 million records. If not, maybe look at a stored procedure?
