
The dataset is too large to load all at once, so I need to standardize it, extract features, and train in batches. To validate the idea I use the iris dataset with scikit-learn in Python. Step 1: I standardize the batches with StandardScaler.partial_fit():

from sklearn.preprocessing import StandardScaler

def batch_normalize(data):
    scaler = StandardScaler()
    # First pass: accumulate the running mean/std over every batch.
    for batch in data:
        scaler.partial_fit(batch)
    # Second pass: transform each batch with the fitted statistics.
    return [scaler.transform(batch) for batch in data]
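For reference, StandardScaler.partial_fit accumulates running mean/variance statistics, so after it has seen every batch the scaler is equivalent to one fitted on the full dataset in a single call. A minimal check, using synthetic data and an arbitrary batch split:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))

# Full fit on all rows at once.
full = StandardScaler().fit(X)

# Incremental fit: feed the same rows in three batches of 50.
inc = StandardScaler()
for batch in np.array_split(X, 3):
    inc.partial_fit(batch)

print(np.allclose(full.mean_, inc.mean_))    # the accumulated means agree
print(np.allclose(full.scale_, inc.scale_))  # the accumulated std devs agree
```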

Step 2: I extract features with IncrementalPCA.partial_fit():

from sklearn.decomposition import IncrementalPCA

def batch_feature_extracrton(dataset):
    ipca = IncrementalPCA(n_components=4)
    # First pass: accumulate the principal components over every batch.
    for batch in dataset:
        ipca.partial_fit(batch)
    # Second pass: transform, flattening the batches back into one list of rows.
    dataset_1 = []
    for batch in dataset:
        dataset_1.extend(ipca.transform(batch))
    return dataset_1
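One constraint worth keeping in mind with IncrementalPCA: every batch passed to partial_fit must contain at least n_components samples, so the batch size bounds the number of components you can extract. A small sketch with synthetic data (shapes chosen arbitrarily):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))

ipca = IncrementalPCA(n_components=2)
for batch in np.array_split(X, 3):  # each batch has 50 rows >= n_components
    ipca.partial_fit(batch)

Xt = ipca.transform(X)
print(Xt.shape)  # (150, 2)
```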

Step 3: I train the model with MLPClassifier.partial_fit():

from sklearn.neural_network import MLPClassifier

def batch_classify(X_train, X_test, y_train, y_test):
    batch_mlp = MLPClassifier(hidden_layer_sizes=(50, 10), max_iter=500,
                              solver='sgd', alpha=1e-4, tol=1e-4, random_state=1,
                              learning_rate_init=.01)
    for X_batch, y_batch in zip(X_train, y_train):
        # `classes` must list every label so batches missing a class are handled.
        batch_mlp.partial_fit(X_batch, y_batch, classes=[0, 1, 2])
    print("batch Test set score: %f" % batch_mlp.score(X_test, y_test))

Here is the main function that calls the three functions defined above:

def batch(iris, batch_size):
    dataset = batch_normalize(list(chunks(iris.data, batch_size)))
    dataset = batch_feature_extracrton(dataset)
    X_train, X_test, y_train, y_test = train_test_split(dataset, iris.target, test_size=0.2)
    batch_data = list(chunks(X_train, batch_size))
    batch_label = list(chunks(y_train, batch_size))
    batch_classify(batch_data, X_test, batch_label, y_test)
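The main function relies on a `chunks` helper whose definition is not shown in the question. Assuming it simply yields consecutive fixed-size slices, a minimal version would look like:

```python
def chunks(seq, batch_size):
    """Yield consecutive slices of `seq` with at most `batch_size` items each."""
    for start in range(0, len(seq), batch_size):
        yield seq[start:start + batch_size]

batches = list(chunks(list(range(10)), 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```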

However, with this approach every step, normalization and feature extraction alike, requires two passes over all the batches. Is there another way to simplify the pipeline? (For example, so that a single batch can flow directly from step 1 through step 3.)

