python - Shogun / quadratic-time MMD errors caused by different train_test_ratio values

Question

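I am using Shogun to run the (quadratic-time) MMD test and compare two nonparametric distributions based on their samples (the code below is for 1D, but I have also looked at 2D samples). In the toy problem shown below, I try to change the ratio of training to testing samples used when selecting the optimized kernel (KSM_MAXIMIZE_MMD is the selection strategy; I have also used KSM_MEDIAN_HEURISTIC). It seems that any ratio other than 1 produces an error.

Can I change this ratio in this setup? (I see it used in http://www.shogun-toolbox.org/examples/latest/examples/statistical_testing/quadratic_time_mmd.html, but it is set to 1 there.)

A condensed version of my code (inspired by the notebook at http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html):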

import shogun as sg
import numpy as np
from scipy.stats import laplace, norm

n = 220
mu = 0.0
sigma2 = 1
b=np.sqrt(0.5)
X = sg.RealFeatures((norm.rvs(size=n) * np.sqrt(sigma2) + mu).reshape(1,-1))
Y = sg.RealFeatures(laplace.rvs(size=n, loc=mu, scale=b).reshape(1,-1))

mmd = sg.QuadraticTimeMMD(X, Y)
mmd.add_kernel(sg.GaussianKernel(10, 1.0))
mmd.set_kernel_selection_strategy(sg.KSM_MAXIMIZE_MMD)
mmd.set_train_test_mode(True)       
mmd.set_train_test_ratio(1)
mmd.select_kernel()

mmd_kernel = sg.GaussianKernel.obtain_from_generic(mmd.get_kernel())
kernel_width = mmd_kernel.get_width()
statistic = mmd.compute_statistic()
p_value = mmd.compute_p_value(statistic)

print(p_value)

This exact version runs fine and prints a p-value. If I change the argument passed to mmd.set_train_test_ratio() from 1 to 2, I get:

SystemError                               Traceback (most recent call last)
<ipython-input-30-dd5fcb933287> in <module>()
     25 kernel_width = mmd_kernel.get_width()
     26 
---> 27 statistic = mmd.compute_statistic()
     28 p_value = mmd.compute_p_value(statistic)
     29 

SystemError: [ERROR] In file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h line 90: assertion kernel_matrix.num_rows==size && kernel_matrix.num_cols==size failed in float32_t shogun::internal::mmd::ComputeMMD::operator()(const shogun::SGMatrix<T>&) const [with T = float; float32_t = float] file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/statistical_testing/internals/mmd/ComputeMMD.h line 90

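If I use a value below 1, things get worse. In addition to the error below, the Jupyter notebook kernel crashes every time (after which I need to rerun the whole notebook; the message says: "The kernel appears to have died. It will restart automatically.").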

SystemError                               Traceback (most recent call last)
<ipython-input-31-cb4a5224f4ef> in <module>()
     20 mmd.set_train_test_ratio(0.5)
     21 
---> 22 mmd.select_kernel()
     23 
     24 mmd_kernel = sg.GaussianKernel.obtain_from_generic(mmd.get_kernel())

SystemError: [ERROR] In file /feedstock_root/build_artefacts/shogun-cpp_1512688880429/work/shogun-shogun_6.1.3/src/shogun/kernel/Kernel.h line 210: GaussianKernel::kernel(): index out of Range: idx_a=146/146 idx_b=0/146

The full code (in a Jupyter notebook) can be found at: http://nbviewer.jupyter.org/url/dmitry.duplyakin.org/p/jn/kernel-minimal.ipynb

Please let me know if I am missing a step or need to try a different approach.

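Side questions:

Both http://www.shogun-toolbox.org/examples/latest/examples/statistical_testing/quadratic_time_mmd.html and http://www.shogun-toolbox.org/notebook/latest/mmd_two_sample_testing.html use sg.GaussianKernel(10, <width>). I could not find any information about the first argument beyond its name, cache size. How and when should I change it?
As noted in the referenced notebook, mmd.get_kernel_selection_strategy().get_name() returns only the generic name, namely KernelSelectionStrategy. How can I get a more specific name for the selected strategy (e.g. KSM_MEDIAN_HEURISTIC) from an instance of the sg.QuadraticTimeMMD class?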

Any relevant information or references would be greatly appreciated.

Shogun version: v6.1.3_2017-12-7_19:14

score 1 · Accepted Answer

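The train_test_ratio attribute is the ratio between the number of samples used for training and the number used for testing. When train_test_mode is turned on, it decides how many samples to take in each mode as follows.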
```
num_training_samples = m_num_samples * train_test_ratio / (train_test_ratio + 1)
num_testing_samples  = m_num_samples / (train_test_ratio + 1)
```
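This implicitly assumes divisibility. A train_test_ratio of 2 therefore tries to use 2/3 of the data for training and 1/3 for testing, which is a problem for your total of 220 samples. Following that logic, it sets num_training_samples = 146 and num_testing_samples = 73, which add up to 219 rather than 220. A similar problem arises with a train-test ratio of 0.5. If you use some other value for train_test_ratio that splits the total number of samples evenly, I think these errors will go away.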
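To check a ratio before passing it to set_train_test_ratio(), the arithmetic above can be reproduced in plain Python. This is only a sketch of the two formulas quoted in this answer: the helper names split_sizes and splits_evenly are made up for illustration, and truncation toward zero is an assumption about how the library rounds, inferred from the 146 and 73 figures.

```python
from fractions import Fraction


def split_sizes(num_samples, ratio):
    """Apply the two formulas from the answer, truncating toward zero.

    Truncation is an assumption inferred from the 146/73 figures;
    Fraction keeps the intermediate arithmetic exact.
    """
    r = Fraction(ratio)
    num_training = int(num_samples * r / (r + 1))
    num_testing = int(num_samples / (r + 1))
    return num_training, num_testing


def splits_evenly(num_samples, ratio):
    """True when the two parts add back up to the total sample count."""
    num_training, num_testing = split_sizes(num_samples, ratio)
    return num_training + num_testing == num_samples


n = 220
for ratio in (1, 2, Fraction(1, 2), 3, 4):
    train, test = split_sizes(n, ratio)
    print(ratio, train, test, splits_evenly(n, ratio))
```

For n = 220, the ratios 1, 3 and 4 split the samples without a remainder, while 2 and 1/2 each lose one sample, which matches the two failing cases in the question.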
I am not entirely sure, but I think the cache size matters when you use SVMLight with Shogun. See http://svmlight.joachims.org/ for details. From their page:
```
-m [5..]    - size of cache for kernel evaluations in MB (default 40)
              The larger the faster...
```
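There is no pretty-printing for the kernel selection strategy in use, but you can call mmd.get_kernel_selection_strategy().get_method(), which returns the enum value (of type EKernelSelectionMethod); that may help. Since it is not yet documented in the Shogun API docs, here is the C++ equivalent you can refer to.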
```
enum EKernelSelectionMethod
{
    KSM_MEDIAN_HEURISTIC,
    KSM_MAXIMIZE_MMD,
    KSM_MAXIMIZE_POWER,
    KSM_CROSS_VALIDATION,
    KSM_AUTO = KSM_MAXIMIZE_POWER
};
```

score 0 · Accepted Answer

Summary (from the comments):

The error does not occur with the latest code
The fix is in: https://github.com/shogun-toolbox/shogun/pull/4134
