I have a bunch of CSV datasets, each about 10 GB in size. I want to generate histograms from their columns. But it seems that the only way to do this in numpy is to first load the entire column into a numpy array and then call numpy.histogram on that array, which consumes an unnecessary amount of memory.
Does numpy support online binning? I'm hoping to iterate over my CSV line by line and bin the values as I read them, so that at most one row is in memory at any given time.
It wouldn't be hard to roll my own, but I'm wondering whether someone has already invented this wheel.
As you said, it's not that hard to roll your own. You'll need to set up the bins yourself and reuse them as you iterate over the file. The following ought to be a decent starting point:
import numpy as np
datamin = -5
datamax = 5
numbins = 20
mybins = np.linspace(datamin, datamax, numbins)
myhist = np.zeros(numbins-1, dtype='int32')
for i in range(100):
    d = np.random.randn(1000, 1)
    htemp, jnk = np.histogram(d, mybins)
    myhist += htemp
I'm guessing that performance will be an issue with files this large, and that the overhead of calling histogram on every row may be too slow. @doug's suggestion of a generator seems like a good way to address that problem.
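As a hedged sketch of that idea, the same fixed bins could be applied to the CSV in chunks of rows rather than one row at a time, so np.histogram is only called once per chunk. The file name, chunk size, and column index below are illustrative assumptions, as is the absence of a header row:

import itertools
import numpy as np

mybins = np.linspace(-5, 5, 20)
myhist = np.zeros(len(mybins) - 1, dtype='int64')

with open("mydata.csv") as f:
    while True:
        # pull up to 100000 lines at a time, keeping only one chunk in memory
        chunk = list(itertools.islice(f, 100000))
        if not chunk:
            break
        # parse column 0 of each line and histogram the chunk with the fixed bins
        values = np.array([float(line.split(",")[0]) for line in chunk])
        htemp, _ = np.histogram(values, mybins)
        myhist += htemp

This keeps only one chunk of rows in memory while still letting numpy do the counting in vectorized batches.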
Here is a way to bin your values directly:
import numpy as NP
column_of_values = NP.random.randint(10, 99, 10)
# set the bin values:
bins = NP.array([0.0, 20.0, 50.0, 75.0])
binned_values = NP.digitize(column_of_values, bins)
'binned_values' is an array of indices, holding the index of the bin to which each value in column_of_values belongs.
'bincount' will give you (obviously) the bin counts:
NP.bincount(binned_values)
Given the size of your dataset, it might be useful to use Numpy's 'loadtxt' to build a generator:
data_array = NP.loadtxt("data_file.txt", delimiter=",")
def fnx():
    for i in range(0, data_array.shape[1]):
        yield data_array[:, i]
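One way the pieces above might fit together (this is a sketch on top of the answer's code, not part of it): digitize each column the generator yields and tally it with bincount, passing minlength so every column produces a count array of the same length:

for column in fnx():
    binned = NP.digitize(column, bins)
    # digitize can return indices from 0 up to len(bins), so pad to a fixed length
    counts = NP.bincount(binned, minlength=len(bins) + 1)
    print(counts)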
Binning with a Fenwick Tree (very large dataset; percentile boundaries needed)
I'm posting a second answer to the same question since this approach is very different, and addresses different issues.
What if you have a VERY large dataset (billions of samples), and you don't know ahead of time WHERE your bin boundaries should be? For example, maybe you want to bin things up into quartiles or deciles.
For small datasets, the answer is easy: load the data in to an array, then sort, then read off the values at any given percentile by jumping to the index that percentage of the way through the array.
For large datasets where the memory size to hold the array is not practical (not to mention the time to sort)... then consider using a Fenwick Tree, aka a "Binary Indexed Tree".
I think these only work for positive integer data, so you'll at least need to know enough about your dataset to shift (and possibly scale) your data before you tabulate it in the Fenwick Tree.
I've used this to find the median of a 100 billion sample dataset, in reasonable time and very comfortable memory limits. (Consider using generators to open and read the files, as per my other answer; that's still useful.)
More on Fenwick Trees:
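Below is a minimal sketch of the kind of Fenwick-tree counter described above, assuming the samples have already been shifted/scaled to integers in [1, size]; the class and method names are illustrative, not an existing library API:

class FenwickCounter:
    """Counts integer samples in [1, size] and answers order-statistic queries."""

    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)   # 1-indexed Fenwick array
        self.total = 0                 # samples seen so far

    def add(self, value):
        """Record one occurrence of value (1 <= value <= size)."""
        self.total += 1
        i = value
        while i <= self.size:
            self.tree[i] += 1
            i += i & (-i)

    def count_at_most(self, value):
        """Number of samples <= value (a prefix sum over the tree)."""
        s, i = 0, value
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def kth_smallest(self, k):
        """Smallest value v with count_at_most(v) >= k, found by descending the tree."""
        pos = 0
        bit = 1 << self.size.bit_length()
        while bit:
            nxt = pos + bit
            if nxt <= self.size and self.tree[nxt] < k:
                pos = nxt
                k -= self.tree[nxt]
            bit >>= 1
        return pos + 1

    def percentile(self, p):
        """Approximate p-th percentile (0 < p <= 100) of the samples seen so far."""
        k = max(1, int(round(p / 100.0 * self.total)))
        return self.kth_smallest(k)

Feeding it values one at a time keeps memory proportional to the value range rather than the number of samples; for example:

counter = FenwickCounter(size=1000)
for val in [3, 997, 512, 512, 40]:
    counter.add(val)
print(counter.percentile(50))   # median of the values seen so far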
Binning with Generators (large dataset; fixed-width bins; float data)
If you know the width of your desired bins ahead of time -- even if there are hundreds or thousands of buckets -- then I think rolling your own solution would be fast (both to write and to run). Here's some Python that assumes you have an iterator that gives you the next value from the file:
from math import floor
binwidth = 20
counts = dict()
filename = "mydata.csv"
for val in next_value_from_file(filename):
    binname = int(floor(val/binwidth)*binwidth)
    if binname not in counts:
        counts[binname] = 0
    counts[binname] += 1
print(counts)
The values can be floats, but this assumes an integer binwidth; if you want to use a binwidth that is some float value, you may need to tweak it a bit.
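One possible tweak for a float binwidth (an assumption, not something from the answer above) is to key the counts by integer bin index and only convert back to a bin start for display, which avoids float dictionary keys picking up rounding noise:

binwidth = 2.5
counts = dict()
for val in next_value_from_file(filename):
    binindex = int(floor(val / binwidth))
    counts[binindex] = counts.get(binindex, 0) + 1
for binindex in sorted(counts):
    print(binindex * binwidth, counts[binindex])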
As for next_value_from_file(), as mentioned earlier, you'll probably want to write a custom generator or an object with an __iter__() method to do this efficiently. The pseudocode for such a generator would be:
def next_value_from_file(filename):
    f = open(filename)
    for line in f:
        # parse out from the line the value or values you need
        val = parse_the_value_from_the_line(line)
        yield val
If a given line has multiple values, then make parse_the_values_from_the_line() either return a list or itself be a generator, and use this pseudocode:
def next_value_from_file(filename):
    f = open(filename)
    for line in f:
        for val in parse_the_values_from_the_line(line):
            yield val
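As a concrete (assumed) version of that generator, the standard csv module could pull a single float column out of each row; the column index is an illustrative parameter:

import csv

def next_value_from_file(filename, column=0):
    with open(filename) as f:
        for row in csv.reader(f):
            yield float(row[column])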