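I'm trying to implement a decision tree algorithm in Python to predict missing input data.
Say I have a column with 99 entries, 20 of which are NaN. I want to break this single array into x subarrays of size y (here y = 5).
Subarrays whose cells are all filled are assigned to the features, and subarrays that contain a NaN are assigned to the target.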
import numpy as np

# break the column into subarrays of 5 entries each
subarray_size = 5
target = []
features = []
# split `test` (the data column) into consecutive "chunks";
# range replaces the Python 2-only xrange
chunks = [test[x : x + subarray_size] for x in range(0, len(test), subarray_size)]
# assign NaN-containing subarrays to "target" and complete subarrays to "features"
for chunk in chunks:
    if np.isnan(chunk).any():
        target.append(chunk)
    else:
        features.append(chunk)
The code works up to the end of the for loop. Now that I have the features and the target, I tried the following block:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split as tts
X_train, X_test, y_train, y_test = tts(features, target, test_size=0.2)
This produced the following error:
    202     if len(uniques) > 1:
    203         raise ValueError("Found input variables with inconsistent numbers of"
--> 204                          " samples: %r" % [int(l) for l in lengths])
    205
    206

ValueError: Found input variables with inconsistent numbers of samples: [5, 15]
I think the error happens somewhere during the array manipulation, but I'm having trouble fixing it. Any suggestions or insights?
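For context on the message, here is a minimal sketch of what seems to be going on. It is an illustration, not a fix. `train_test_split` expects all of its inputs to hold the same number of samples, pairing row i of the first with entry i of the second, whereas the chunking step yields two lists of different lengths. With 99 entries in chunks of 5 there are 20 chunks, and 5 + 15 = 20, which would be consistent with the `[5, 15]` in the error being the lengths of `features` and `target`. The helper `split_chunks` and the values in `toy_column` are made up for illustration, and the sketch uses only the standard library rather than NumPy.

```python
import math


def split_chunks(column, size):
    """Split `column` into consecutive chunks of `size` entries and return
    (chunks without NaN, chunks containing at least one NaN)."""
    chunks = [column[i:i + size] for i in range(0, len(column), size)]
    complete = [c for c in chunks if not any(math.isnan(v) for v in c)]
    with_nan = [c for c in chunks if any(math.isnan(v) for v in c)]
    return complete, with_nan


nan = float("nan")
# Hypothetical toy column: 15 values, NaN in the second and third chunk.
toy_column = [1.0, 2.0, 3.0, 4.0, 5.0,
              6.0, nan, 8.0, 9.0, 10.0,
              11.0, 12.0, nan, 14.0, 15.0]

features, target = split_chunks(toy_column, 5)

# These two counts are what a sample-count consistency check would compare.
sample_counts = [len(features), len(target)]
print(sample_counts)  # [1, 2]
```

Because the two lists are just a partition of the chunks, their lengths match only by coincidence, so nothing pairs a given feature row with a given target entry.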
Edit: Below is a sample of the `test` column. I'm not sure how to put it in table format, so apologies for the poor formatting.
Site2_ThirdIonizationEnergy
39.722
39.722
33.667
39.722
39.722
23.32
25.04
NaN
27.491
22.99
39.722
23.32
25.04
NaN
27.491
22.99
33.667
23.32
33.667
NaN
27.491
22.99
39.722
23.32
25.04
NaN
27.491
22.99
19.174
19.174
19.174
19.174
39.722
39.722
33.667
39.722
39.722
23.32
25.04
NaN
27.491
22.99
39.722
23.32
25.04
NaN
27.491
22.99
33.667
23.32
33.667
NaN
27.491
22.99
39.722
23.32
25.04
NaN
27.491
22.99
39.722
39.722
33.667
39.722
39.722
39.722
33.667
39.722
39.722
23.32
25.04
NaN
27.491
22.99
39.722
23.32
25.04
NaN
27.491
22.99
33.667
23.32
33.667
NaN
27.491
22.99
39.722
23.32
25.04
NaN
27.491
22.99
21.62
21.62
21.62
21.62
39.722
39.722
33.667