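This is my code in a Jupyter notebook, and I am confused about why I get a "bad input shape" error. Below are the lines from my code that fail. They open the dataset file and split the rows by the two possible output classes: above 50K, or less than or equal to 50K. The dataset has a slight twist in that each data point is a mixture of numbers and strings.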

with open(input_file, 'r') as f:
    for line in f.readlines():
        if '?' in line:
            continue
        data = line[:-1].split(', ')

        if data[-1] == '<=50K' and count_lessthan50k < num_images_threshold:
            X.append(data)
            count_lessthan50k = count_lessthan50k + 1
        elif data[-1] == '>50K' and count_morethan50k < num_images_threshold:
            X.append(data)
            count_morethan50k = count_morethan50k + 1
        if count_lessthan50k >= num_images_threshold and count_morethan50k >= num_images_threshold:
            break
X = np.array(X)

This part converts the string data to numeric data:

label_encoder = []
X_encoded = np.empty(X.shape)

for i, item in enumerate(X[0]):
    if item.isdigit():
        X_encoded[:, i] = X[:, i]
    else:
        label_encoder.append(preprocessing.LabelEncoder())
        X_encoded[:, i] = label_encoder[-1].fit_transform(X[:,i])


X = X_encoded[:, :-1].astype(int)
y = X_encoded[:, -1].astype(int)

Splitting the data for cross-validation and training the classifier:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                random_state=5)

classifier_gaussiannb = GaussianNB()
classifier_gaussiannb.fit(X_train, y_train)

y_test_pred = classifier_gaussiannb.predict(X_test)

Testing the encoding on a single data instance:

input_data = ['39', 'State-gov', '77516', 'Bachelors', '13', 'Never-married', 'Adm-clerical', 'Not-in-family', 'White', 'Male', '2174', '0', '40', 'United-States']

count = 0
input_data_encoded = [-1] * len(input_data)

for i, item in enumerate(input_data):
    if item.isdigit():
        input_data_encoded[i] = int(input_data[i])
    else:
        input_data_encoded[i] = int(label_encoder[count].transform(input_data[i]))
        count = count + 1

input_data_encoded = np.array(input_data_encoded)

I have gone through the sklearn documentation, but it did not help me. Any help would be appreciated.

1 Answer

LabelEncoder.transform() expects an iterable containing all the samples to transform at once, as stated in the documentation:

Transform labels to normalized encoding.

Parameters     y : array-like of shape [n_samples]
               Target values.
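
To see why a bare string does not match that description, it can help to look at the shape NumPy assigns to each kind of input. This is a minimal sketch that assumes only NumPy; the value 'State-gov' is taken from the question's input_data.

```python
import numpy as np

# A bare string becomes a zero-dimensional array: there is no n_samples axis.
scalar_shape = np.asarray('State-gov').shape

# Wrapping the same string in a list gives a 1-D array holding one sample.
wrapped_shape = np.asarray(['State-gov']).shape

print(scalar_shape)   # ()
print(wrapped_shape)  # (1,)
```

The empty shape of the first case is what the "bad input shape" message in the question refers to.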

If you want to pass it a single value at a time, you need to wrap that value in a list, like this:

else:
    input_data_encoded[i] = int(label_encoder[count].transform([input_data[i]]))

Note the extra square brackets around input_data[i].
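
Putting the fix in context, here is a minimal sketch of encoding a single row; it is not the asker's exact pipeline. It assumes scikit-learn and NumPy are installed, and the three-row training array and the names encoders and encode_row are made up for illustration. It keeps one encoder per column in a dict keyed by column index instead of a running counter, and it reads element [0] of the result rather than calling int() on the whole returned array.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy training data (hypothetical): age, workclass, education.
X = np.array([
    ['39', 'State-gov', 'Bachelors'],
    ['50', 'Private', 'HS-grad'],
    ['38', 'Self-emp-not-inc', 'Bachelors'],
])

# Fit one LabelEncoder per non-numeric column, keyed by column index.
encoders = {}
for i, item in enumerate(X[0]):
    if not item.isdigit():
        encoders[i] = LabelEncoder().fit(X[:, i])


def encode_row(row):
    """Encode one row using the fitted per-column encoders."""
    encoded = []
    for i, item in enumerate(row):
        if i in encoders:
            # transform() wants an array-like of samples, so wrap the
            # single value in a list and take the first result.
            encoded.append(int(encoders[i].transform([item])[0]))
        else:
            encoded.append(int(item))
    return np.array(encoded)


row_encoded = encode_row(['39', 'State-gov', 'Bachelors'])
print(row_encoded)
```

Keying the encoders by column index avoids relying on a counter staying in step with the column order. As with any fitted LabelEncoder, a value that was not seen during fit raises an error in transform().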

Answered 2018-03-05T11:06:19.100