python - How can I successfully run an ML algorithm on a medium-sized dataset on a mediocre laptop?

Question

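I have a Lenovo IdeaPad laptop with 8 GB of RAM and an Intel Core i5 processor. I have 60k data points, each with 100 dimensions. I want to do KNN, and for that I am running the LMNN algorithm to find a Mahalanobis metric.
The problem is that after 2 hours of running, my Ubuntu machine shows a blank screen. I can't tell what the problem is! Is my memory filling up, or is it something else?
So is there a way to optimize my code?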

My dataset: data
My LMNN implementation:

import numpy as np
import sys
from modshogun import LMNN, RealFeatures, MulticlassLabels
from sklearn.datasets import load_svmlight_file

def main(): 

    # Get training file name from the command line
    traindatafile = sys.argv[1]

    # The training file is in libSVM format
    tr_data = load_svmlight_file(traindatafile)

    Xtr = tr_data[0].toarray()  # Converts the sparse matrix to a dense array
    Ytr = tr_data[1]  # The training labels

    # Cast data to Shogun format to work with LMNN
    features = RealFeatures(Xtr.T)
    labels = MulticlassLabels(Ytr.astype(np.float64))



    # Number of target neighbours per example - tune this using validation
    k = 18

    # Initialize the LMNN package
    lmnn = LMNN(features, labels, k)
    init_transform = np.eye(Xtr.shape[1])

    # Cap the number of iterations (an iteration limit, not a wall-clock timeout)
    lmnn.set_maxiter(200000)
    lmnn.train(init_transform)

    # Let LMNN do its magic and return a linear transformation
    # corresponding to the Mahalanobis metric it has learnt
    L = lmnn.get_linear_transform()
    M = np.matrix(np.dot(L.T, L))

    # Save the model for use in testing phase
    # Warning: do not change this file name
    np.save("model.npy", M) 

if __name__ == '__main__':
    main()

score 0 · Accepted Answer

Exact k-NN has scalability problems.
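To make the scalability concern concrete, here is a rough back-of-the-envelope estimate for the sizes in the question (60k points, 100 dimensions). It assumes float64 values and a hypothetical brute-force method that stores every pairwise distance at once; whether Shogun's LMNN actually does this internally is not something this sketch claims.

```python
# Rough memory estimate. Assumptions: float64 values, and a hypothetical
# brute-force method that keeps all pairwise distances in memory at once.
n_points = 60_000
n_dims = 100
bytes_per_float = 8

# Dense feature matrix: n_points x n_dims
dense_data_bytes = n_points * n_dims * bytes_per_float

# Full pairwise distance matrix: n_points x n_points
pairwise_bytes = n_points * n_points * bytes_per_float

print(f"dense data:         {dense_data_bytes / 1e6:.0f} MB")
print(f"pairwise distances: {pairwise_bytes / 1e9:.1f} GB")
```

The data itself (about 48 MB) fits comfortably in 8 GB of RAM; anything that grows quadratically with the number of points (about 28.8 GB here) does not.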

Scikit-learn has a documentation page (scaling strategies) on what to do in this scenario. Many algorithms have a partial_fit method, but unfortunately kNN does not.
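As an illustration of what partial_fit looks like, here is a minimal sketch that trains scikit-learn's SGDClassifier in mini-batches. Note that this swaps in a linear model, not kNN or LMNN, and the synthetic data, batch size, and class count are assumptions made only for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data (assumption): 10,000 points, 100 dims, 3 classes
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 100))
y = rng.integers(0, 3, size=10_000)
classes = np.unique(y)

clf = SGDClassifier(random_state=0)
batch_size = 1_000  # illustrative value, tune for your memory budget

for start in range(0, X.shape[0], batch_size):
    stop = start + batch_size
    # The full list of classes is required on the first call;
    # passing it on every call is harmless.
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

predictions = clf.predict(X[:5])
print(predictions)
```

In a real out-of-core setting each batch would be read from disk inside the loop, so only one batch needs to be in memory at a time.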

If you are willing to trade some accuracy for speed, you can use something like approximate nearest neighbors.
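To give a feel for the idea, below is a minimal NumPy sketch of one approximate nearest neighbor technique, random-hyperplane locality-sensitive hashing. It is not the API of any particular library, and the function names (build_lsh_index, query_lsh), the number of hash bits, and the synthetic data are all hypothetical choices for illustration. Random hyperplanes group points by angular similarity; the candidates in the matching bucket are then re-ranked by Euclidean distance.

```python
from collections import defaultdict

import numpy as np


def _bucket_keys(points, planes):
    """Hash each row to an integer from the signs of its projections."""
    bits = (points @ planes) > 0
    return bits.dot(1 << np.arange(planes.shape[1]))


def build_lsh_index(X, n_bits=8, seed=0):
    """Hypothetical helper: bucket the rows of X by random-hyperplane hash."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))
    buckets = defaultdict(list)
    for i, key in enumerate(_bucket_keys(X, planes)):
        buckets[int(key)].append(i)
    return planes, buckets


def query_lsh(x, X, planes, buckets, k=5):
    """Return up to k indices from x's bucket, nearest first (approximate)."""
    key = int(_bucket_keys(x[None, :], planes)[0])
    candidates = np.asarray(buckets.get(key, []), dtype=int)
    if candidates.size == 0:
        return candidates
    # Only the candidates are compared exactly, not the whole dataset
    distances = np.linalg.norm(X[candidates] - x, axis=1)
    return candidates[np.argsort(distances)[:k]]


# Synthetic stand-in data (assumption): 2,000 points, 100 dims
X = np.random.default_rng(1).standard_normal((2_000, 100))
planes, buckets = build_lsh_index(X)
neighbors = query_lsh(X[3], X, planes, buckets, k=5)
print(neighbors)
```

True neighbors that fall into a different bucket are missed, which is the accuracy being traded away; practical implementations usually reduce this with several hash tables or by probing nearby buckets.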
