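When I use OneHotEncoder with a ColumnTransformer on this dataset, the result comes back in Compressed Sparse Row format. After encoding, I want to split the data with train_test_split, but I get this error: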
Singleton array array(<32561x105 sparse matrix of type '<class 'numpy.float64'>'
with 394963 stored elements in Compressed Sparse Row format>,
dtype=object) cannot be considered a valid collection.
First, I handle the missing values like this:
import numpy as np
from sklearn.impute import SimpleImputer
# Nominal columns are filled with the most frequent value, numerical columns with the mean.
imputer_nominal = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer_numerical = SimpleImputer(missing_values=np.nan, strategy='mean')
# Nominal (categorical) columns
imputer_nominal.fit(x[:, [1,3,5,6,7,8,9,13]])
x[:, [1,3,5,6,7,8,9,13]] = imputer_nominal.transform(x[:, [1,3,5,6,7,8,9,13]])
# Numerical columns
imputer_numerical.fit(x[:, [0,2,4,10,11,12]])
x[:, [0,2,4,10,11,12]] = imputer_numerical.transform(x[:, [0,2,4,10,11,12]])
Then I encode the data:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [1,3,5,6,7,8,9,13])], remainder = 'passthrough')
x = np.array(ct.fit_transform(x))
When I print the NumPy array "x", it looks like this, which is Compressed Sparse Row format:
(0, 6) 1.0
(0, 17) 1.0
(0, 28) 1.0
(0, 31) 1.0
(0, 46) 1.0
(0, 55) 1.0
(0, 57) 1.0
(0, 96) 1.0
(0, 99) 39.0
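After this, I try to split the data and get the error shown above. I have used ColumnTransformer and OneHotEncoder before, but I don't know what went wrong this time. Also, I don't use the SciPy library anywhere in this code.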
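For reference, here is a minimal sketch of what seems to be happening. It rests on two assumptions: that ct.fit_transform(x) returns a SciPy sparse matrix (scikit-learn uses SciPy internally even when the calling code never imports it), and that the np.array(...) wrapper around it is what produces the "singleton array". The names encoded and y, and the small identity matrix, are hypothetical stand-ins for the real encoder output and labels, not the actual dataset.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for ct.fit_transform(x): a small CSR matrix.
encoded = sparse.csr_matrix(np.eye(6))
y = np.arange(6)  # hypothetical labels

# np.array() does not densify a SciPy sparse matrix. It wraps the whole
# matrix in a 0-dimensional object array, the "singleton array" in the error.
wrapped = np.array(encoded)
print(wrapped.shape, wrapped.dtype)

# Option 1: pass the sparse matrix to train_test_split as-is.
X_train, X_test, y_train, y_test = train_test_split(
    encoded, y, test_size=2, random_state=0
)
print(X_train.shape, X_test.shape)

# Option 2: convert to a dense array explicitly (uses more memory).
dense = encoded.toarray()
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    dense, y, test_size=2, random_state=0
)
print(type(Xd_train), Xd_train.shape)
```

Applied to the code in the question, this would mean dropping the np.array(...) wrapper around ct.fit_transform(x), or calling .toarray() on its result. Passing sparse_threshold=0 to ColumnTransformer is another way to ask for dense output, at the cost of extra memory.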