scikit-learn - advanced feature extraction for cross-validation using sklearn

Question

Given a dataset with 1000 samples, suppose I would like to preprocess it to obtain 10000 rows, so that each original row yields 10 new samples. When training my model, I would also like to be able to perform cross-validation. My scoring function uses the original data to compute the score, so I would like cross-validation scoring to work on the original data rather than the generated data. Since I am feeding the generated data to the trainer (I am using a RandomForestClassifier), I cannot rely on cross-validation to split the data correctly according to the original samples.

What I thought about doing:

Create a custom feature extractor to extract the features to feed to the classifier.
Add the feature extractor to a pipeline and feed it to, for example, GridSearchCV.
Implement a custom scorer which operates on the original data to score the model for a given set of selected parameters.
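The goal behind these steps (train on generated rows, score on original rows) can be sketched with a plain cross-validation loop. This is only a rough illustration under stated assumptions: `expand_rows` is a hypothetical stand-in for the real preprocessing (here it just jitters each row), the toy data and accuracy metric are placeholders, and a manual loop is used instead of a Pipeline inside GridSearchCV because, as far as I know, a Pipeline step is expected to keep the number of rows unchanged.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold


def expand_rows(X, y, n_copies=10, noise=0.01, seed=0):
    """Hypothetical expansion: each row becomes n_copies jittered rows."""
    rng = np.random.default_rng(seed)
    X_rep = np.repeat(X, n_copies, axis=0)
    X_new = X_rep + rng.normal(scale=noise, size=X_rep.shape)
    y_new = np.repeat(y, n_copies)
    return X_new, y_new


def cv_on_original(X, y, n_splits=5, seed=0):
    """Split the ORIGINAL rows, train on expanded rows, score on originals."""
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):
        # Expand only the training part, so no generated row derived from
        # a held-out original can end up in the training set.
        X_tr, y_tr = expand_rows(X[train_idx], y[train_idx], seed=seed)
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(X_tr, y_tr)
        # Score on the untouched original rows of the held-out fold.
        preds = model.predict(X[test_idx])
        scores.append(accuracy_score(y[test_idx], preds))
    return np.array(scores)


# Toy stand-in for the 1000-sample dataset (assumption, not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

scores = cv_on_original(X, y)
print(scores.mean())
```

To tune parameters, the same loop could be wrapped in an outer loop over candidate settings, which plays the role GridSearchCV would otherwise play.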

Is there a better method for what I am trying to accomplish?

I am asking this in connection with a competition currently running on Kaggle.

score 0 · Accepted Answer

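Maybe you can use stratified cross-validation (e.g. Stratified K-Fold or Stratified Shuffle Split) on the expanded samples, with the original sample idx as the stratification info, in combination with a custom score function that ignores the non-original samples in the model evaluation.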
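A rough sketch of splitting the already-expanded samples by original sample idx and scoring only the original rows. Note the swap: instead of the stratified splitters named above, this passes the original idx as `groups` to `GroupKFold`, which is my assumption about the intent, since it keeps every row derived from one original sample in the same fold. The names `orig_idx`, `is_original` and `original_only_score`, the jitter-based expansion, and the layout where the first row of each group is the untouched original are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Toy original data (assumption, stands in for the real dataset).
n_orig, n_copies = 100, 10
X_orig = rng.normal(size=(n_orig, 4))
y_orig = (X_orig[:, 0] > 0).astype(int)

# Expanded set: orig_idx records which original row each expanded row
# came from; the first row of every group is kept as the original itself.
orig_idx = np.repeat(np.arange(n_orig), n_copies)
X_exp = X_orig[orig_idx] + rng.normal(scale=0.01, size=(n_orig * n_copies, 4))
y_exp = y_orig[orig_idx]
is_original = np.tile(np.arange(n_copies) == 0, n_orig)
X_exp[is_original] = X_orig


def original_only_score(model, X, y, mask):
    """Custom score that ignores the non-original rows."""
    return accuracy_score(y[mask], model.predict(X[mask]))


scores, overlaps = [], []
splitter = GroupKFold(n_splits=5)
for train_idx, test_idx in splitter.split(X_exp, y_exp, groups=orig_idx):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_exp[train_idx], y_exp[train_idx])
    scores.append(
        original_only_score(
            model, X_exp[test_idx], y_exp[test_idx], is_original[test_idx]
        )
    )
    # Count original samples that appear on both sides of the split.
    shared = np.intersect1d(orig_idx[train_idx], orig_idx[test_idx])
    overlaps.append(len(shared))

print(np.mean(scores), overlaps)
```

With a stratified splitter on the original idx, rows from the same original sample would instead be spread across folds, so which splitter fits depends on whether the generated rows are allowed to appear in training while their original is being scored.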
