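I am reading a csv into a Dask DataFrame and then calling SimpleImputer from the dask_ml library. I am running into two different problems.

Problem 1) SimpleImputer on Dask fails with FileNotFoundError, even though I am able to read the columns. Code: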

    import dask.dataframe as dd
    from dask_ml.impute import SimpleImputer
    df = dd.read_csv('outlier.csv')
    X = df.drop('Column_A', axis=1)
    print(X.columns)  # This print works: it shows all the remaining columns
    p = SimpleImputer().fit_transform(X)

Output:

Error
Traceback (most recent call last):
 File "C:\Users\user\Documents\code\blah.py", line 127, in train_blahblah_model
    p = SimpleImputer().fit_transform(X)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\sklearn\base.py", line 699, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask_ml\impute.py", line 53, in fit
    self._fit_frame(X)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask_ml\impute.py", line 80, in _fit_frame
    self.statistics_ = pd.Series(dask.compute(avg)[0], index=X.columns)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\dask\base.py", line 561, in compute
    results = schedule(dsk, keys, **kwargs)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 2681, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 1990, in gather
    return self.sync(
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 836, in sync
    return sync(
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\utils.py", line 340, in sync
    raise exc.with_traceback(tb)
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\utils.py", line 324, in f
    result[0] = yield future
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\tornado\gen.py", line 762, in run
    value = future.result()
  File "C:\Users\user\Documents\code\condavirtualenv\lib\site-packages\distributed\client.py", line 1855, in _gather
    raise exception.with_traceback(traceback)
  File "/opt/conda/lib/python3.8/site-packages/dask/bytes/core.py", line 185, in read_block_from_file
  File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py", line 102, in __enter__
  File "/opt/conda/lib/python3.8/site-packages/fsspec/spec.py", line 930, in open
  File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py", line 117, in _open
  File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py", line 199, in __init__
  File "/opt/conda/lib/python3.8/site-packages/fsspec/implementations/local.py", line 204, in _open
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/user/Documents/code/outlier.csv'
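
Problem 2) Reading the csv with pandas and then converting it to a Dask DataFrame. Code:

    import pandas as pd
    import dask.dataframe as dd
    from dask_ml.impute import SimpleImputer
    df = pd.read_csv('outlier.csv', index_col='new')
    df = dd.from_pandas(df, npartitions=3)
    X = df.drop('Column_A', axis=1)
    print(X.columns)  # This print works: it shows all the remaining columns
    p = SimpleImputer().fit_transform(X)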

Output: an error on the SimpleImputer().fit_transform(X) line

AttributeError: 'DataFrame' object has no attribute '_data'

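Note: all of this works with pandas when I use IterativeImputer to fit_transform. The problem only appears when I try to build the model with dask, because I ultimately want to use dask workers to generate my model.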

1 Answer

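This issue is resolved. The problem was that the client and the workers had different pandas versions. The workers were on 1.0.1. I upgraded pandas to 1.2.3 on both machines and the error went away.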
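
A version mismatch like this can be spotted before running anything heavy. The sketch below is not part of the original fix; it assumes the dictionary returned by `distributed.Client.get_versions()` has the shape `{"client": {"packages": {...}}, "workers": {address: {"packages": {...}}}}` and that pandas is among the reported packages. The helper names `find_version_mismatches` and `check_cluster`, and the sample addresses, are made up for illustration.

```python
def find_version_mismatches(versions, package="pandas"):
    """Return {worker_address: version} for workers whose `package`
    version differs from the client's.

    `versions` is assumed to look like the result of
    distributed.Client.get_versions().
    """
    client_version = versions["client"]["packages"].get(package)
    mismatches = {}
    for address, info in versions["workers"].items():
        worker_version = info["packages"].get(package)
        if worker_version != client_version:
            mismatches[address] = worker_version
    return mismatches


def check_cluster(scheduler_address):
    """Query a live cluster (hypothetical address supplied by the caller)."""
    # Imported lazily so the pure helper above runs without distributed installed.
    from distributed import Client

    with Client(scheduler_address) as client:
        return find_version_mismatches(client.get_versions(), "pandas")


# Illustrative data only, mirroring the versions mentioned in this answer.
sample_versions = {
    "client": {"packages": {"pandas": "1.2.3"}},
    "workers": {
        "tcp://10.0.0.2:40000": {"packages": {"pandas": "1.0.1"}},
        "tcp://10.0.0.3:40000": {"packages": {"pandas": "1.2.3"}},
    },
}
print(find_version_mismatches(sample_versions))
```

On a real cluster, `client.get_versions(check=True)` can also be called directly; it is meant to flag mismatched packages between client, scheduler, and workers.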

See also the question "joblib connection to Dask backend: tornado.iostream.StreamClosedError: Stream is closed" for other possible issues.
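
As for problem 1, the last frames of that traceback come from `/opt/conda/...`, which suggests the FileNotFoundError was raised on a worker that cannot see the client's `C:/Users/...` path, since `dd.read_csv` opens the file lazily on the workers. This is an inference from the traceback, not something confirmed in the question. A minimal sketch for checking which workers can see a path, assuming `run` behaves like `distributed.Client.run` (calls a function on every worker and returns `{worker_address: result}`); `workers_missing_file` and `fake_run` are hypothetical names:

```python
import os


def workers_missing_file(run, path):
    """Return the sorted addresses of workers on which `path` does not exist.

    `run` is expected to behave like distributed.Client.run.
    """
    exists_by_worker = run(os.path.exists, path)
    return sorted(addr for addr, exists in exists_by_worker.items() if not exists)


def fake_run(function, *args):
    """Stand-in for Client.run so the sketch runs without a cluster.

    The first 'worker' evaluates the function locally; the second always
    reports the file as missing, imitating a remote machine.
    """
    return {"tcp://worker-a:1": function(*args), "tcp://worker-b:1": False}


print(workers_missing_file(fake_run, os.getcwd()))
```

With a live client this would be called as `workers_missing_file(client.run, 'C:/Users/user/Documents/code/outlier.csv')`. If any worker is listed, the file needs to live on storage every worker can read, or be loaded on the client with pandas and passed through `dd.from_pandas` as in problem 2.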

Answered 2021-04-07T14:06:10.467