I am trying to find the best parameters for my classifier and CSP, but the F1 scores I get from train_test_split and from cross-validation are different. Here is my train_test_split code:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_fscore_support
from mne.decoding import Scaler, CSP

X_train, X_test, y_train, y_test = train_test_split(
    data_1, classes_1, test_size=0.2, stratify=classes_1, random_state=42)
sc = Scaler(scalings='mean')
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
csp = CSP(n_components=10)
X_train = csp.fit_transform(X_train, y_train)
X_test = csp.transform(X_test)
svc = SVC(random_state=42)
svc.fit(X_train, y_train)
precision_recall_fscore_support(y_test, svc.predict(X_test), average='macro')
Result:
Computing rank from data with rank=None
Using tolerance 9.9 (2.2e-16 eps * 22 dim * 2e+15 max singular value)
Estimated rank (mag): 22
MAG: rank 22 computed from 22 data channels with 0 projectors
Reducing data rank from 22 -> 22
Estimating covariance using EMPIRICAL
Done.
(similar rank/covariance output repeated three more times)
(0.6939451810472472, 0.6924810661243371, 0.690826330421708, None)
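For reference, precision_recall_fscore_support with average='macro' returns a (precision, recall, f1, support) tuple in which support is None, which is why the tuple above ends in None. A minimal toy example (the labels here are made up, purely to show the return format):

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels, just to illustrate the return value
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
p, r, f, s = precision_recall_fscore_support(y_true, y_pred, average='macro')
# When an average is requested, the per-class support is not returned (s is None)
print(p, r, f, s)
```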
Here is my cross_validate code:
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score
from mne.decoding import Scaler, CSP

scorer = {'precision': make_scorer(precision_score, average='macro'),
          'recall': make_scorer(recall_score, average='macro'),
          'f1_score': make_scorer(f1_score, average='macro')}
pipe = Pipeline([('sc', Scaler(scalings='mean')), ('csp', CSP(n_components=10)), ('svc', SVC(random_state=42))])
cross_validate(pipe, data_1, classes_1, cv=5, scoring=scorer, n_jobs=-1)
Result:
{'fit_time': array([14.22801042, 14.27563739, 14.52271605, 14.33256483, 14.07182837]),
'score_time': array([0.51321149, 0.46285748, 0.39439225, 0.47808695, 0.49983597]),
'test_precision': array([0.45328272, 0.51844098, 0.45984035, 0.57410918, 0.53001667]),
'test_recall': array([0.45506263, 0.5237793 , 0.45782758, 0.56854307, 0.53123101]),
'test_f1_score': array([0.45276712, 0.51451502, 0.45853913, 0.56818711, 0.5278188 ])}
I expected the results to differ only slightly, but a gap of about 0.2 seems too large. Does anyone know why this happens?
I am using the A01E BCICV 2a GDF dataset from https://github.com/bregydoc/bcidatasetIV2a and applying a custom windowing function to cut my data into 1-second segments:
import numpy as np

# Apply sliding windows
def windowing(data, classes, duracion=1, overlap=0.8, fs=250):
    # Number of samples per trial (1000)
    limite = len(data[0, 0])
    data_convertida = []
    classes_convertida = []
    # Window length in samples
    muestras = int(duracion * fs)
    for idx in range(len(data)):
        ptrLeft = 0
        ptrRight = muestras
        while limite >= ptrRight:
            data_convertida.append(data[idx, :, ptrLeft:ptrRight])
            ptrLeft = ptrRight - int((ptrRight - ptrLeft) * overlap)
            ptrRight = ptrLeft + muestras
            classes_convertida.append(classes[idx])
    return np.array(data_convertida), np.array(classes_convertida)
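With the defaults used above (1 s windows at fs = 250 Hz, 80% overlap), the pointer arithmetic advances the window by 50 samples per iteration, so each 1000-sample trial yields 16 windows, and adjacent windows from the same trial share 200 of their 250 samples. A quick sketch of that arithmetic (the trial length of 1000 is taken from the comment in the function):

```python
# Window arithmetic for the defaults above: duracion=1, overlap=0.8, fs=250
fs, duracion, overlap = 250, 1, 0.8
limite = 1000                              # samples per trial
muestras = int(duracion * fs)              # 250 samples per window
hop = muestras - int(muestras * overlap)   # each window advances 50 samples
n_windows = (limite - muestras) // hop + 1
shared = muestras - hop                    # adjacent windows share 200 samples
print(muestras, hop, n_windows, shared)    # 250 50 16 200
```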