
I am performing logistic regression in MATLAB with L2 regularization on text data. My program works well for small datasets, but for larger ones it runs indefinitely.

I have seen the potentially duplicate question (matlab fminunc not quitting (running indefinitely)). In that question, the cost for the initial theta was NaN and an error was printed to the console. In my implementation, the cost is real-valued and no error appears even with verbose options passed to fminunc(). Hence I believe this question is not a duplicate.

I need help scaling it to larger sets. The training data I am currently working with is roughly 10k*12k (10k text files with a cumulative vocabulary of 12k words). Thus I have m = 10k training examples and n = 12k features.

My cost function is defined as follows:

function [J, gradient] = costFunction(X, y, lambda, theta)

    [m, n] = size(X);
    g = @(z) 1.0 ./ (1.0 + exp(-z));   % sigmoid (anonymous function; inline() is deprecated)
    h = g(X*theta);
    % Regularized cross-entropy cost; the bias term theta(1) is not regularized.
    J = (1/m)*sum(-y.*log(h) - (1-y).*log(1-h)) + (lambda/(2*m))*norm(theta(2:end))^2;

    gradient = zeros(n, 1);            % preallocate; also makes gradient a column like theta
    gradient(1) = (1/m)*sum((h-y) .* X(:,1));

    for i = 2:n
        % The regularization term is added, not subtracted:
        % dJ/dtheta_i includes +(lambda/m)*theta_i for i >= 2.
        gradient(i) = (1/m)*sum((h-y) .* X(:,i)) + (lambda/m)*theta(i);
    end
end

I am performing optimization using MATLAB's fminunc() function. The parameters I pass to fminunc() are:

options = optimset('LargeScale', 'on', 'GradObj', 'on', 'MaxIter', MAX_ITR);
theta0 = zeros(n, 1);

[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(X, y, lambda, t), theta0, options);

I am running this code on a machine with these specifications:

Macbook Pro i7 2.8GHz / 8GB RAM / MATLAB R2011b

The cost function seems to behave correctly. For initial theta, I get acceptable values of J and gradient.

K>> theta0 = zeros(n, 1);
K>> [j g] = costFunction(X, y, lambda, theta0);
K>> j

j =

    0.6931

K>> max(g)

ans =

    0.4082

K>> min(g)

ans =

  -2.7021e-05

The program takes incredibly long to run. I started profiling with MAX_ITR = 1 for fminunc(). Even with a single iteration, the program did not complete after a couple of hours had elapsed. My questions are:

  1. Am I doing something wrong mathematically?

  2. Should I use any other optimizer instead of fminunc()? With LargeScale=on, fminunc() uses trust-region algorithms.

  3. Is this problem cluster-scale, meaning it should not be run on a single machine?

Any other general tips will be appreciated. Thanks!



3 Answers


I was able to get this working by setting the LargeScale flag to 'off' in fminunc(). From what I gather, LargeScale = 'on' uses trust-region algorithms, while keeping it 'off' uses quasi-Newton methods. Using quasi-Newton methods and passing the gradient worked a lot faster for this particular problem and gave very nice results.
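Concretely, a minimal sketch of that working configuration, with everything else (MAX_ITR, X, y, lambda, theta0, costFunction) as in the question:

    % Quasi-Newton mode: LargeScale off, analytic gradient still supplied.
    options = optimset('LargeScale', 'off', 'GradObj', 'on', 'MaxIter', MAX_ITR);
    [optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(X, y, lambda, t), theta0, options);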

answered 2013-08-07T05:20:40.043

Here are my suggestions:

- Set the MATLAB flag that displays debug output during the run, or simply print the cost inside your cost function; this lets you monitor the iteration count and the error (a sketch follows below).
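For instance, a minimal sketch of one way to get that output, reusing the options struct from the question ('Display' is a standard optimset option):

    % Ask fminunc to print one progress line per iteration
    % (iteration count, function value, step size, optimality).
    options = optimset(options, 'Display', 'iter');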

Second, and this is important:

Your problem is ill-posed, or rather under-determined. You have a 12k feature space and provide only 10k examples, which means that for unconstrained optimization the answer is -Inf. As a simple analogy, your problem is like this: minimize x+y+z subject to x+y-z = 2. The feature space has dimension 3, but the spanned vector space is 1-d. I suggest using PCA or CCA to reduce the dimensionality of the text files, keeping 99% of their variance. This will probably give you a feature space of roughly 100-200 dimensions.
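As a rough sketch of that suggestion (assuming the Statistics Toolbox; princomp was the R2011b-era API, replaced by pca() in later releases):

    % Project X onto the smallest number of principal components that
    % retain 99% of the variance; 'econ' skips components beyond rank(X).
    [coeff, score, latent] = princomp(full(X), 'econ');
    k = find(cumsum(latent) / sum(latent) >= 0.99, 1);  % components for 99%
    Xreduced = score(:, 1:k);                           % m x k reduced features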

PS: Just to point out that this problem is far from cluster-scale, which usually means 1kk+ (over a million) data points, and fminunc is not overkill at all. LIBSVM has nothing to do with it, because fminunc is just a quadratic optimizer while LIBSVM is a classifier; to be clear about LIBSVM, it uses something similar to fminunc internally, just with a different objective function.

answered 2013-08-05T22:44:18.423

From my experience with this kind of problem, here is what I suspect: you are using a dense representation of X instead of a sparse one. You are also seeing the effect typical of text classification, where the number of terms grows roughly linearly with the number of samples. Effectively, the cost of the matrix multiplication X*theta rises quadratically with the number of samples.

By contrast, a good sparse matrix representation iterates only over the non-zero elements to perform a matrix multiplication. If the documents have roughly constant length, the cost per document stays roughly constant, so the slowdown is linear rather than quadratic in the number of samples.

I am no MATLAB guru, but I know it has a sparse matrix package, so try using it (see the sketch below).
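A minimal sketch of that idea, assuming X currently lives in memory as a dense m-by-n term matrix (for real data one would build the sparse matrix directly from term triplets):

    % sparse() keeps only the non-zero entries, so X*theta then costs
    % O(nnz(Xs)) instead of O(m*n).
    Xs = sparse(X);                          % or build via sparse(i, j, v, m, n)
    whos X Xs                                % compare memory footprints
    h = 1.0 ./ (1.0 + exp(-(Xs*theta)));     % same math, far fewer multiplies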

answered 2013-08-06T09:42:27.780