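I am trying to perform multivariate linear regression on array data that is larger than memory. I would like to know how to apply a dask_ml linear regression function across a multi-dimensional dask array.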
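On data that is small enough, I can use sklearn.linear_model.LinearRegression or sklearn.linear_model.Ridge (with alpha=0.0), because these estimators accept a multi-dimensional y with shape (n_samples, n_targets). The problem can then be viewed as performing linear regression n_targets times.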
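To make the "n_targets separate regressions" view concrete, here is a minimal NumPy-only sketch. It assumes plain unregularized least squares with an intercept, which is what LinearRegression fits; the array sizes and the names A, joint, and separate are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_targets = 100, 5, 3
X = rng.random((n_samples, n_features))
Y = rng.random((n_samples, n_targets))

# Append a column of ones so the last fitted row is the intercept
A = np.hstack([X, np.ones((n_samples, 1))])

# One least-squares call with a 2-D right-hand side
joint = np.linalg.lstsq(A, Y, rcond=None)[0]

# n_targets separate least-squares calls, one per column of Y
separate = np.column_stack(
    [np.linalg.lstsq(A, Y[:, j], rcond=None)[0] for j in range(n_targets)]
)

# Both routes give the same (n_features + 1, n_targets) parameter matrix
same = np.allclose(joint, separate)
```

Because each column of Y is fitted independently, the target axis can be split into blocks without changing the result.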
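Specifically, I am looking at using dask_ml.linear_model.LinearRegression (but I am open to suggestions for alternatives). However, this estimator only accepts a one-dimensional y. I could use a for loop, but that seems like a very slow and inefficient approach. What would be a better way to do this?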
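As a bonus question: I observed that the .coef_ output of dask_ml.linear_model.LinearRegression is a numpy array, which means it is computed eagerly. Is there a reason why it is not returned as a computable dask array?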
import dask.array as da
n_samples = 1024
n_features = 20
n_targets = 50 # this number is much larger in real life, around 1e6 to 1e8
# generate some random data
X = da.random.random((n_samples, n_features))
y = da.random.random((n_samples, n_targets))
# "regular" non-dask way of doing it, will result in MemoryError for large data
from sklearn.linear_model import LinearRegression
LR1 = LinearRegression()
LR1.fit(X, y)
LR1.coef_ # intended result, with shape (n_targets, n_features)
# very slow attempt at a dask version, but
# A) the for loop is slow, B) the coef_ output of each fit is a numpy array
from dask_ml.linear_model import LinearRegression
LR2 = LinearRegression(C=999999)  # large C, so the regularization strength 1/C is close to zero
coef_ = []
for i in range(n_targets):
    c = LR2.fit(X, y[:, i]).coef_
    coef_.append(c)
coef_ = da.asarray(coef_)
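One possible direction, sketched under assumptions rather than offered as a definitive answer: if X itself fits in memory (as it does with n_samples = 1024 and n_features = 20) and only the target axis is huge, each block of target columns can be fitted independently with da.map_blocks and np.linalg.lstsq. The helper fit_block, the name X_design, and the chunk size of 10 are hypothetical choices for illustration; y must be chunked along the target axis only.

```python
import numpy as np
import dask.array as da

n_samples, n_features, n_targets = 1024, 20, 50

X = da.random.random((n_samples, n_features), chunks=(n_samples, n_features))
# Chunk y along the target axis only, so every block holds all samples
y = da.random.random((n_samples, n_targets), chunks=(n_samples, 10))

# Assumption: X is small enough to materialize as a NumPy array.
# A trailing column of ones makes the last fitted row the intercept.
X_design = np.hstack([X.compute(), np.ones((n_samples, 1))])

def fit_block(y_block, design):
    # Unregularized least squares for all target columns in this block;
    # returns an array of shape (n_features + 1, targets_in_block).
    return np.linalg.lstsq(design, y_block, rcond=None)[0]

params = da.map_blocks(
    fit_block,
    y,
    design=X_design,
    dtype=float,
    chunks=((n_features + 1,), y.chunks[1]),
)

coef_ = params[:-1].T   # lazy dask array, shape (n_targets, n_features)
intercept_ = params[-1]  # lazy dask array, shape (n_targets,)
```

Nothing is computed until coef_.compute() is called or the result is written out, for example with da.to_zarr, so the coefficients never need to fit in memory all at once. If X were also too large for memory, this sketch would not apply as written.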