machine-learning - Optimizing a custom Gaussian process kernel in scikit-learn using GridSearchCV

Question

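I'm working with Gaussian processes, and when I use the scikit-learn GP module I struggle to use GridSearchCV with custom kernels. The best way to describe the problem is with the classic Mauna Loa example, where the appropriate kernel is built as a combination of already-defined kernels (e.g. RBF and RationalQuadratic). In that example, the parameters of the custom kernel are not optimized but treated as given. What if I want to run a more general case in which those hyperparameters are estimated by cross-validation? How should I construct the custom kernel, and then the corresponding param_grid object for the grid search?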

In a very naive way, I could construct the custom kernel like this:

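from sklearn.gaussian_process.kernels import RBF, RationalQuadratic, WhiteKernel

def custom_kernel(a, b, ls, l, alpha, nl):
    # a and b scale the RBF and RationalQuadratic terms respectively
    kernel = a * RBF(length_scale=ls) \
        + b * RationalQuadratic(length_scale=l, alpha=alpha) \
        + WhiteKernel(noise_level=nl)
    return kernel

However, GridSearchCV of course cannot call this function when the kernel is passed as, e.g., GaussianProcessRegressor(kernel=custom_kernel(a, b, ls, l, alpha, nl)).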

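One possible way forward is presented in this SO question, but I was wondering whether there is an easier way around the problem than coding the kernel from scratch (along with its hyperparameters), since I want to work with combinations of standard kernels and may also want to mix them in different ways.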

score 3 · Accepted Answer

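So this is how far I got. It answers the question, but it is really slow on the Mauna Loa example; then again, that is probably a difficult dataset to work with: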

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.gaussian_process.kernels import ConstantKernel,RBF,WhiteKernel,RationalQuadratic,ExpSineSquared
import numpy as np
from sklearn.datasets import fetch_openml

# from https://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_co2.html
def load_mauna_loa_atmospheric_co2():
    # as_frame=False returns NumPy arrays, which the [:, i] indexing below requires
    ml_data = fetch_openml(data_id=41187, as_frame=False)
    months = []
    ppmv_sums = []
    counts = []

    y = ml_data.data[:, 0]
    m = ml_data.data[:, 1]
    month_float = y + (m - 1) / 12
    ppmvs = ml_data.target

    for month, ppmv in zip(month_float, ppmvs):
        if not months or month != months[-1]:
            months.append(month)
            ppmv_sums.append(ppmv)
            counts.append(1)
        else:
            # aggregate monthly sum to produce average
            ppmv_sums[-1] += ppmv
            counts[-1] += 1

    months = np.asarray(months).reshape(-1, 1)
    avg_ppmvs = np.asarray(ppmv_sums) / counts
    return months, avg_ppmvs

X, y = load_mauna_loa_atmospheric_co2()

# Kernel with parameters given in GPML book
k1 = ConstantKernel(constant_value=66.0**2) * RBF(length_scale=67.0)  # long term smooth rising trend
k2 = ConstantKernel(constant_value=2.4**2) * RBF(length_scale=90.0) \
    * ExpSineSquared(length_scale=1.3, periodicity=1.0)  # seasonal component
# medium term irregularity
k3 = ConstantKernel(constant_value=0.66**2) \
    * RationalQuadratic(length_scale=1.2, alpha=0.78)
k4 = ConstantKernel(constant_value=0.18**2) * RBF(length_scale=0.134) \
    + WhiteKernel(noise_level=0.19**2)  # noise terms
kernel_gpml = k1 + k2 + k3 + k4
gp = GaussianProcessRegressor(kernel=kernel_gpml)

# print parameters
print(gp.get_params())

param_grid = {'alpha': np.logspace(-2, 4, 5),
              'kernel__k1__k1__k1__k1__constant_value': np.logspace(-2, 4, 5),
              'kernel__k1__k1__k1__k2__length_scale': np.logspace(-2, 2, 5),
              'kernel__k2__k2__noise_level':np.logspace(-2, 1, 5)
              }
grid_gp = GridSearchCV(gp,cv=5,param_grid=param_grid,n_jobs=4)
grid_gp.fit(X, y)

What helped me was to first initialize the model as gp = GaussianProcessRegressor(kernel=kernel_gpml) and then call its get_params method to get the list of the model's hyperparameters.
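
As a smaller illustration of that step, the sketch below builds a kernel shaped like the one in the question and filters the output of get_params down to the numeric kernel hyperparameters, which are the names that can be used as param_grid keys. The initial values of 1.0 and the filtering rule are assumptions chosen for illustration.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    ConstantKernel, RBF, RationalQuadratic, WhiteKernel)

# Same shape as the kernel in the question: a*RBF + b*RationalQuadratic + White.
kernel = (ConstantKernel(1.0) * RBF(length_scale=1.0)
          + ConstantKernel(1.0) * RationalQuadratic(length_scale=1.0, alpha=1.0)
          + WhiteKernel(noise_level=1.0))
gp = GaussianProcessRegressor(kernel=kernel)

# Keep only scalar kernel hyperparameters; this skips the nested kernel
# objects (k1, k2) and the *_bounds tuples that get_params also returns.
tunable = sorted(
    name for name, value in gp.get_params().items()
    if name.startswith("kernel__") and isinstance(value, (int, float))
)
for name in tunable:
    print(name)
```

Each `+` or `*` adds one level of k1/k2 nesting, which is why the names grow longer as more terms are combined.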

Finally, I note that Rasmussen and Williams appear to use leave-one-out cross-validation to tune the hyperparameters in their book.
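
A brute-force version of leave-one-out tuning can be sketched with scikit-learn's LeaveOneOut splitter. This is not the formulation from the book, just GridSearchCV refitting the model once per held-out point. The synthetic data, the small kernel and the grid values are assumptions for illustration. Mean squared error is used for scoring because the default R² score is not defined on a single test sample.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel
from sklearn.model_selection import GridSearchCV, LeaveOneOut

# Synthetic data, used only for illustration.
rng = np.random.RandomState(0)
X_demo = np.linspace(0, 10, 20).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + 0.1 * rng.randn(20)

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
# optimizer=None so that the grid values are used as given.
gp = GaussianProcessRegressor(kernel=kernel, optimizer=None)

param_grid = {
    "kernel__k1__k2__length_scale": [0.3, 1.0, 3.0],
    "kernel__k2__noise_level": [0.01, 0.1, 1.0],
}
grid = GridSearchCV(gp, param_grid=param_grid, cv=LeaveOneOut(),
                    scoring="neg_mean_squared_error")
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```

With n samples this fits the model n times per grid point, so it is only practical for small datasets or small grids.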
