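I am trying to convert all missing data (marked with "?") to NaN and then use sklearn's Imputer to replace those values with the column mean. To reproduce my problem, I have included the code below. I am using Python 2.7.12 with PyCharm as the IDE, on Mac OS X with Anaconda.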

Here is my code:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data', header=None, sep=',\s', na_values=["?"])
df.tail()
from sklearn.preprocessing import Imputer
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr= imr.fit(df)

Here is my error message:

/Users/zdong/anaconda/bin/python /Users/zdong/PycharmProjects/ml/crim_workingfile.py
/Users/zdong/PycharmProjects/ml/crim_workingfile.py:4: ParserWarning:   Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/communities/communities.data', header=None, sep=',\s', na_values=["?"])
Traceback (most recent call last):
  File "/Users/zdong/PycharmProjects/535_final/535_workingfile.py", line 8, in <module>
    imr= imr.fit(df)
  File "/Users/zdong/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/imputation.py", line 156, in fit
    force_all_finite=False)
  File "/Users/zdong/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: invalid literal for float(): 6,?,?,Ontariocity,10,0.2,0.78,0.14,0.46,0.24,0.77,0.5,0.62,0.4,0.17,0.21,1,0.4,0.73,0.22,0.25,0.26,0.47,0.29,0.36,0.24,0.28,0.32,0.22,0.27,0.25,0.29,0.16,0.35,0.5,0.55,0.16,0.47,0.58,0.53,0.2,0.6,0.24

Please help this devastated beginner QAQ...

1 Answer

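OK, I think there is enough here for an actual answer. Looking at your data, the first 5 columns look like information about the city (its name, plus other values >= 1), and the rest look like the data you are actually interested in fitting.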

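Your problem is that fit tries to convert all of the data to floats, and it evidently fails on the city names. The data passed to fit should probably be everything except the first 5 columns (or perhaps the first 4, if the 5th column is a bias?). Either way, try something like: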

df = pd.read_csv('communities.data', header=None, na_values=["?"], usecols=range(5, 128))

and change the 5 according to which columns you need.
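To make the whole flow concrete, here is a minimal sketch under a few assumptions. The traceback in the question shows an entire row as the failing float literal, which suggests that `sep=',\s'` never matched (the file has no whitespace after its commas) and each row was read as a single string column, so the sketch uses the default comma separator. The `SAMPLE` rows are made up for illustration (only the start of the first row comes from the error message), and `SimpleImputer` from `sklearn.impute` is used because `sklearn.preprocessing.Imputer` was removed from newer scikit-learn releases. On an older install such as the one in the question, `Imputer(missing_values='NaN', strategy='mean', axis=0)` plays the same role.

```python
import io

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up rows in the same layout as communities.data (illustrative only):
# 5 descriptive columns, then numeric columns, with "?" marking missing values.
SAMPLE = (
    "6,?,?,Ontariocity,10,0.2,0.78,?\n"
    "8,?,?,Examplecity,10,0.4,?,0.5\n"
    "34,5,100,Sampletown,10,?,0.22,0.1\n"
)

# With sep=',\s' the regex never matches, so every row ends up in one column.
one_col = pd.read_csv(io.StringIO(SAMPLE), header=None, sep=r",\s",
                      engine="python")

# Default comma separator: keep only the numeric columns (index 5 onward)
# and turn "?" into NaN while reading.
df = pd.read_csv(io.StringIO(SAMPLE), header=None, na_values=["?"],
                 usecols=list(range(5, 8)))

# Replace each NaN with the mean of its column.
imr = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imr.fit_transform(df)

print(one_col.shape)
print(imputed)
```

For the real file, the same pattern should apply with `usecols=range(5, 128)` as in the answer above, adjusted to whichever columns you want to keep.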

Answered 2016-12-05T10:34:23.037