r - Gradient descent calculation error

Question

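I am trying to write a function that performs gradient descent for a linear regression model. However, the answers I get do not match the ones I get using the normal equation method.

My sample data is: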

df <- data.frame(c(1,5,6),c(3,5,6),c(4,6,8))

where c(4,6,8) are the y values.

lm_gradient_descent <- function(df,learning_rate, y_col=length(df),scale=TRUE){

n_features <- length(df) #n_features is the number of features in the data set

#using mean normalization to scale features

if(scale==TRUE){

for (i in 1:(n_features)){
  df[,i] <- (df[,i]-mean(df[,i]))/sd(df[,i])
    }
  }
  y_data <- df[,y_col]
  df[,y_col] <- NULL
  par <- rep(1,n_features)
  df <- merge(1,df)
  data_mat <- data.matrix(df)
  #we need a temp_arr to store each iteration of parameter values so that we can do a 
  #simultaneous update
  temp_arr <- rep(0,n_features)
  diff <- 1
  while(diff>0.0000001){
    for (i in 1:(n_features)){
      temp_arr[i] <- par[i]-learning_rate*sum((data_mat%*%par-y_data)*df[,i])/length(y_data)
    }
    diff <- par[1]-temp_arr[1]
    print(diff)
    par <- temp_arr
  }

  return(par)
}

Running this function,

lm_gradient_descent(df,0.0001,,0)

the result I get is

c(0.9165891,0.6115482,0.5652970)

When I use the normal equation method, I get

c(2,1,0).

I hope someone can shed some light on where I went wrong in this function.

score 0 · Accepted Answer

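It looks like you have not implemented a bias term. In a linear model like this, you always want an additional additive constant, i.e. your model should look like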

w_0 + w_1*x_1 + ... + w_n*x_n.

Without this w_0 term, you usually won't get a good fit.
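As a minimal sketch of what the bias term looks like in practice, assuming the sample data from the question: prepending a column of ones to the feature matrix gives the model its w_0. The names X_bias and w are illustrative and not from the original post, and the normal equation is used here only as a cross-check.

```r
# Sample data from the question: two features and a target.
X <- cbind(c(1, 5, 6), c(3, 5, 6))
y <- c(4, 6, 8)

# Prepend a column of ones so the first weight acts as the bias w_0.
X_bias <- cbind(1, X)

# Normal equation: solve (X'X) w = X'y for w.
w <- solve(crossprod(X_bias), crossprod(X_bias, y))
```

With three points and three parameters the fit is exact, so w should come out close to (-4, -1, 3); for instance, -4 - 1*1 + 3*3 = 4 reproduces the first y value.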

score 0 · Accepted Answer

You used the stopping criterion

old parameters - new parameters <= 0.0000001

First, I think there is an abs() missing if you want to use this criterion (though given my ignorance of R, I may be wrong). But even if you use

abs(old parameters - new parameters) <= 0.0000001

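this is not a good stopping criterion: it only tells you that progress has slowed down, not that the result is accurate enough. Try simply running a fixed number of iterations instead. Unfortunately, it is not easy to give a good, generally applicable stopping criterion for gradient descent here.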
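A fixed iteration count could be sketched as follows. This is only an illustration: the function name gd_fixed_iters, the learning rate of 0.5, the count of 10000, and the choice to standardize the features are all assumptions, and a suitable count depends on the data and the learning rate.

```r
# Gradient descent that stops after a fixed number of iterations.
# gd_fixed_iters, learning_rate and n_iters are illustrative choices.
gd_fixed_iters <- function(X, y, learning_rate = 0.5, n_iters = 10000) {
  X <- cbind(1, scale(X))        # bias column plus standardized features
  theta <- rep(0, ncol(X))
  n <- length(y)
  for (k in seq_len(n_iters)) {
    # Gradient of the mean squared error cost with respect to theta.
    gradient <- crossprod(X, X %*% theta - y) / n
    theta <- theta - learning_rate * as.numeric(gradient)
  }
  theta
}

X <- cbind(c(1, 5, 6), c(3, 5, 6))
y <- c(4, 6, 8)
theta <- gd_fixed_iters(X, y)
```

Because the loop always terminates, a poorly chosen learning rate shows up as a bad result rather than as an endless loop, which makes it easier to diagnose.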

score 0 · Accepted Answer

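I know this is a few weeks old at this point, but I'm going to take a stab at it for several reasons, namely:

I'm relatively new to R, so deciphering your code and rewriting it is good practice for me
I'm working on a different gradient descent problem, so this is all fresh to me
I need the Stack Overflow points, and
As far as I can tell, you never got a working answer.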

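First, regarding your data structures. You start with a data frame, rename a column, strip out a vector, and then strip out a matrix. It would be much easier to start with an X matrix (capitalized because its component "features" are referred to as x subscript i) and a y solution vector.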

X <- cbind(c(1,5,6),c(3,5,6))
y <- c(4,6,8)

By fitting a linear model, we can easily see the desired solutions, both with and without scaling. (Note that we only scale X/features, not y/solutions.)

> lm(y~X)

Call:
lm(formula = y ~ X)

Coefficients:
(Intercept)           X1           X2  
         -4           -1            3  

> lm(y~scale(X))

Call:
lm(formula = y ~ scale(X))

Coefficients:
(Intercept)    scale(X)1    scale(X)2  
      6.000       -2.646        4.583

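Regarding your code, one of the beauties of R is that it can perform matrix multiplication, which is significantly faster than using loops.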
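To make the loop-versus-matrix point concrete, here is a small sketch comparing the two ways of computing the same gradient. The theta values are arbitrary numbers chosen only for the comparison, and grad_loop and grad_vec are illustrative names.

```r
X <- cbind(1, c(1, 5, 6), c(3, 5, 6))   # bias column plus the two features
y <- c(4, 6, 8)
theta <- c(0.5, -0.2, 0.1)              # arbitrary parameter values
n <- length(y)
residuals <- as.numeric(X %*% theta - y)

# Loop version: one partial derivative per parameter.
grad_loop <- numeric(ncol(X))
for (j in seq_len(ncol(X))) {
  grad_loop[j] <- sum(residuals * X[, j]) / n
}

# Vectorized version: the same gradient as a single matrix product.
grad_vec <- as.numeric(t(X) %*% residuals) / n
```

Both vectors hold the same values. The vectorized form also updates every parameter at once, so no temporary array is needed for a simultaneous update.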

lm_gradient_descent <- function(X, y, learning_rate, scale=TRUE){

  if(scale==TRUE){X <- scale(X)}

  X <- cbind(1, X)

  theta <- rep(0, ncol(X)) #your old temp_arr
  diff <- 1
  old.error <- sum( (X %*% theta - y)^2 ) / (2*length(y))
  while(diff>0.000000001){
    theta <- theta - learning_rate * t(X) %*% (X %*% theta - y) / length(y)
    new.error <- sum( (X %*% theta - y)^2 ) / (2*length(y))
    diff <- abs(old.error - new.error)
    old.error <- new.error
  }
  return(theta)
}

To show that it works...

> lm_gradient_descent(X, y, .01, 0)
           [,1]
[1,] -3.9360685
[2,] -0.9851775
[3,]  2.9736566

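compared with the expected (-4, -1, 3).

Although I agree with @cfh that I would prefer a loop with a defined number of iterations, I'm actually not sure you need the abs function. If diff < 0, then your function is not converging.

Finally, rather than using something like old.error and new.error, I would suggest using a vector that records all of the errors. You can then plot that vector to see how quickly your function converges.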
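Recording the errors could look roughly like this. It is a sketch under assumptions: gd_with_history, cost_history and the default of 1000 iterations are illustrative names and values, and the features are left unscaled to match the call shown above.

```r
# Gradient descent that keeps the cost after every update.
gd_with_history <- function(X, y, learning_rate = 0.01, n_iters = 1000) {
  X <- cbind(1, X)                       # add the bias column
  theta <- rep(0, ncol(X))
  n <- length(y)
  cost_history <- numeric(n_iters)
  for (k in seq_len(n_iters)) {
    residuals <- as.numeric(X %*% theta - y)
    theta <- theta - learning_rate * as.numeric(crossprod(X, residuals)) / n
    cost_history[k] <- sum((X %*% theta - y)^2) / (2 * n)
  }
  list(theta = theta, cost_history = cost_history)
}

X <- cbind(c(1, 5, 6), c(3, 5, 6))
y <- c(4, 6, 8)
fit <- gd_with_history(X, y)

# Plot the cost curve when running in an interactive session.
if (interactive()) {
  plot(fit$cost_history, type = "l", xlab = "Iteration", ylab = "Cost")
}
```

A curve that keeps falling and then flattens suggests convergence, while a rising curve suggests the learning rate is too large.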
