I have one dataset that looks like this:
X_data =
BankNum  | ID
00987772 | AB123
00987772 | AB123
00987772 | AB123
00987772 | ED245
00982123 | GH564
And a second one:
y_data =
ID | Labels
AB123 | High
ED245 | Low
GH564 | Low
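This is what I am doing:
from sklearn import svm
from sklearn import metrics
from sklearn.model_selection import train_test_split
import numpy as np

# RBF-kernel SVM; probability=True enables predict_proba
clf = svm.SVC(gamma=0.001, C=100., probability=True)
# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.20, random_state=42)
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)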
But how do I convert X_data to floats before calling clf.fit()? Can I use DictVectorizer in this case? If so, how should I use it?
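For illustration, here is a minimal sketch of what DictVectorizer does with the sample rows above. It assumes X_data is a pandas DataFrame whose two columns are both strings, which the post does not state. Each distinct string value becomes its own 0/1 column, so the result is a float array that clf.fit() can accept.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Sample rows copied from X_data above (assumed to be a DataFrame of strings)
X_data = pd.DataFrame({
    "BankNum": ["00987772", "00987772", "00987772", "00987772", "00982123"],
    "ID": ["AB123", "AB123", "AB123", "ED245", "GH564"],
})

# DictVectorizer expects one dict per row
records = X_data.to_dict(orient="records")

# String values are one-hot encoded into "column=value" features
vectorizer = DictVectorizer(sparse=False)
X_float = vectorizer.fit_transform(records)

print(vectorizer.feature_names_)
print(X_float)
```

With this sample the array has one column per distinct BankNum value and one per distinct ID value.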
Also, I am passing X_data and y_data to train_test_split to measure prediction accuracy, but will it split them correctly? That is, will each ID in X_data get the correct Label from y_data?
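On the splitting question: train_test_split pairs rows by position and does not look at the ID column, so it needs X and y to have the same number of rows. With the samples above (5 rows against 3) it raises a ValueError. One possible way to line them up, sketched here under the assumption that both inputs are pandas DataFrames, is to join on ID first.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Sample rows copied from the post (assumed to be DataFrames)
X_data = pd.DataFrame({
    "BankNum": ["00987772", "00987772", "00987772", "00987772", "00982123"],
    "ID": ["AB123", "AB123", "AB123", "ED245", "GH564"],
})
y_data = pd.DataFrame({
    "ID": ["AB123", "ED245", "GH564"],
    "Labels": ["High", "Low", "Low"],
})

# A left join keeps every X row and attaches the label that matches its ID
merged = X_data.merge(y_data, on="ID", how="left")

X = merged[["BankNum", "ID"]]
y = merged["Labels"]

# X and y now have one row each per sample, so they stay paired after the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
```

After the join, every row carries its own label, and the indices of X_train and y_train match row for row.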
Update:
Can someone tell me whether I am doing the following correctly?
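import pandas as pd
from sklearn import svm
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Attach the label to every row by joining on ID
new_df = pd.merge(df, df3, on="ID")
columns = ['BankNum', 'ID']
labels = new_df['Labels']
# Encode the string labels as integers
le = LabelEncoder()
labels = le.fit_transform(labels)
X_train, X_test, y_train, y_test = train_test_split(new_df[columns], labels, test_size=0.25, random_state=42)
# Fill missing values; assign the result rather than using inplace=True on a split
X_train = X_train.fillna('NA')
X_test = X_test.fillna('NA')
# DictVectorizer expects one dict per row
x_cat_train = X_train.to_dict(orient='records')
x_cat_test = X_test.to_dict(orient='records')
# Fit the vectorizer on the training rows only, then reuse it for the test rows
vectorizer = DictVectorizer(sparse=False)
vec_x_cat_train = vectorizer.fit_transform(x_cat_train)
vec_x_cat_test = vectorizer.transform(x_cat_test)
x_train = vec_x_cat_train
x_test = vec_x_cat_test
clf = svm.SVC(gamma=0.001, C=100., probability=True)
clf.fit(x_train, y_train)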
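To get from the fitted classifier to an accuracy figure, the remaining step would be roughly as sketched below. The new_df here is a hypothetical stand-in, not the real data: it repeats the five sample rows four times, only so that both classes appear on each side of the split. The stratify argument and the omission of probability=True are also choices made for this small example and are not part of the code in the post.

```python
import pandas as pd
from sklearn import metrics, svm
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for new_df: the merged sample rows, repeated four times
new_df = pd.DataFrame({
    "BankNum": ["00987772", "00987772", "00987772", "00987772", "00982123"] * 4,
    "ID": ["AB123", "AB123", "AB123", "ED245", "GH564"] * 4,
    "Labels": ["High", "High", "High", "Low", "Low"] * 4,
})

# Encode the string labels as integers (classes are sorted: High -> 0, Low -> 1)
labels = LabelEncoder().fit_transform(new_df["Labels"])

# stratify keeps the class proportions the same in both parts (an added assumption)
X_train, X_test, y_train, y_test = train_test_split(
    new_df[["BankNum", "ID"]], labels,
    test_size=0.25, random_state=42, stratify=labels,
)

# Fit the vectorizer on the training rows only, then reuse it for the test rows
vectorizer = DictVectorizer(sparse=False)
x_train = vectorizer.fit_transform(X_train.to_dict(orient="records"))
x_test = vectorizer.transform(X_test.to_dict(orient="records"))

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(x_train, y_train)

# Compare predictions on the held-out rows with their true labels
predicted = clf.predict(x_test)
accuracy = metrics.accuracy_score(y_test, predicted)
print(f"accuracy: {accuracy:.2f}")
```

Because the same IDs appear on both sides of the split in this toy data, the printed number only demonstrates the calls and says nothing about real model quality.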