
Since this post is long, I'll put my questions up front:

1. Is there a package in Python that will give me the maximum likelihood estimator parameters, for a given number of parameters p, covariates x, and data values y? (Preferably with comprehensive documentation on how to implement it.)

2. Is this approach scalable, i.e. is it feasible, if I try O(100) covariates and a similar number of data samples?

Background:

I am trying to investigate the distribution of maximum likelihood estimators with differing numbers of samples n / covariates p using Python. My script generates the data for logistic regression just fine, but I have been unable to get any method of parameter estimation (i.e. the parameter values maximising the log-likelihood) to work correctly.

Approaches I have tried:

- Writing my own version of the Newton-Raphson procedure. But the errors in my estimates diverged over repeated iterations (and of course I checked for the obvious sign and inequality mistakes!)

- Using a Newton-conjugate-gradient implementation. This also failed, with the error "the truth value of an array is ambiguous". When I write my own version I can get around this with all(), but not when using the package. Strange, since I thought Newton-CG was made for the multivariate case! (See the Newton-CG sketch further below, after the code I tried.)

- Finally, I intended to use the package statsmodels. I don't know whether I'm just being dense, but I can't find any comprehensive documentation for this? For getting logistic regression parameters, the best I have found is this, from which I tried e.g.

X,y = logit_data(np.power(10,6),p,theta)

y=np.reshape(y, (len(y),))

clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X, y)

thetaEst = clf.get_params(X, y)

For the last line, I have also tried:

thetaEst = clf.get_params()

But nothing seems to give me the estimated parameter values. I either get errors or objects I don't expect. Surely there is a Python package that does this?!? Surely I shouldn't have to resort to R (I don't know R, D: !!!)
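For what it's worth on the scikit-learn attempt above: get_params() returns the estimator's hyperparameters (its constructor arguments), not the fitted coefficients; after fit(), those are stored in the coef_ and intercept_ attributes. A minimal sketch, assuming the X, y produced by the script below (penalty=None requires a recent scikit-learn; older versions spell it penalty='none'):

from sklearn.linear_model import LogisticRegression

#X already includes a column of ones, so switch off the built-in intercept,
#and disable the default L2 penalty to obtain the plain (unpenalised) MLE
clf = LogisticRegression(penalty=None, fit_intercept=False, solver='lbfgs').fit(X, y.ravel())
thetaEst = clf.coef_.ravel()   #the fitted coefficient estimates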
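Similarly for the Newton-CG route: scipy.optimize.minimize with method='Newton-CG' expects the objective to return a scalar and the gradient to be supplied via jac=, and "the truth value of an array is ambiguous" is typically what appears when the objective returns an array instead. A minimal sketch under those assumptions, again using the X, y from the script below:

from scipy.optimize import minimize
import numpy as np

def sigmoid(z):
    return 1/(1+np.exp(-z))

def negloglik(theta, X, y):
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1-y) @ np.log(1-h))   #a scalar, as minimize requires

def negloglik_grad(theta, X, y):
    return -X.T @ (y - sigmoid(X @ theta))

res = minimize(negloglik, np.zeros(X.shape[1]), args=(X, y.ravel()),
               jac=negloglik_grad, method='Newton-CG')
thetaEst = res.x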

Optional code

I didn't want to make this post any longer, but I'm sure the code will be asked for. So here is the code implementing Newton-Raphson; this is the function I have been trying to replace with an existing package:

#Script for producing y data from p covariates drawn from a specified distribution,
#with specified theta parameters and n data samples, for a logit link function.

import numpy as np
import matplotlib.pyplot as plt


#Define link function here
def g(z):
    return 1/(1+np.exp(-z))
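
#Note: for large negative z, np.exp(-z) overflows and numpy emits a RuntimeWarning;
#scipy's expit is a numerically stable equivalent that could be used instead:
#   from scipy.special import expit
#   g = expit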

#For producing y data values given true parameters theta and number of covariates
def logit_data(n,p, theta):
    #Define parameters

    #1)Number of covariates
    p_i = int(p+1)  #with intercept

    #2)n as correct data type
    n = int(n)

    #3)Specify parameter values to be estimated
    theta=np.reshape(theta, (p_i,1))

    #4)Define distribution from which covariate values are drawn i.i.d., and initiate data values
    X=np.zeros((n,p_i))
    X[:,0]=1    #intercept
    mean=0
    sigma=1.5

    X[:,1:]=np.random.normal(mean,sigma,(n,p))


    #5)Produce y values treating y as a Bernoulli variable with p=g(X*theta)
    r=np.random.uniform(0,1,n)
    r=np.reshape(r, (len(r),1))
    htrue=g(X.dot(theta))
    y=htrue-r
    y[y>=0]=1
    y[y<0]=0

    return X, y
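
#Note: the uniform-threshold trick in step 5 above is equivalent to sampling the
#Bernoulli variable directly, e.g. y = np.random.binomial(1, htrue)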





#Newton Raphson implementation
def NewtonRaphson(X,y):
    ##NOTE: negloglikelihood, gradf and hessian all return values for f = -(log likelihood),
    ##so that the NR method can be used to minimise f (rather than maximise l)

    #Define log likelihood function to be maximised
    def negloglikelihood(y,h):
        l= y.transpose() @ np.log(h) + (1-y).transpose() @ np.log(1-h)
        f=-l
        return f

    #Define gradient of log likelihood function
    def gradf(y, h, X):
        a=(y-h).transpose()
        gradl= np.matmul(a,X)
        grad_f=-gradl
        return grad_f

    #Define second derivative (Hessian) of log likelihood function
    def hessian(h, X):
        D=np.identity(len(h))*(np.matmul(h,(1-h).transpose()))   #diag(h_i*(1-h_i)), built from an n x n outer product
        H=-X.transpose() @ D @ X
        Hf=-H
        return Hf


    #Minimise f=-l

    #Produce initial theta estimate and probability parameter h
    np.random.seed(555)
    thetaEst=np.random.normal(1.1, 0.4, (X.shape[1],1))   #column vector, one entry per covariate (incl. intercept)

    eta=np.matmul(X,thetaEst)
    h=g(eta)

    #While not at a minimum of f

    #control constants
    a=10e-8
    b=10e-8
    i=0
    j=0
    k=0

    while not (np.linalg.norm(gradf(y,h,X)) < np.absolute(negloglikelihood(y,h)) * a + b):
        i=i+1
        #print('i = %s' %i)
        h=g(np.matmul(X,thetaEst)) 
        H=hessian(h,X)      #Cholesky below tests that the Hessian (of f) is positive definite
       # print(H)
        try:
            np.linalg.cholesky(H)
            #print('j = %s' %j)
            j=j+1
        except np.linalg.LinAlgError:
            print('Hessian not positive semi-definite!')
            try:
                v,w=np.linalg.eig(H)
               # print(v,w)
                v=np.absolute(v)
                H=w @ np.diag(v) @ np.linalg.inv(w)
            except:
                return thetaEst


        delta = 0

        try:
            delta=-np.linalg.solve(H, np.reshape(gradf(y,h,X),(X.shape[1],1))) #Newton step on f: theta <- theta - Hf^(-1) grad_f
            #print('k = %s' %k)
            k=k+1
        except np.linalg.LinAlgError:
            return thetaEst   #Simply return the current theta estimate if the Hessian is singular

        while negloglikelihood(y, g(np.matmul(X,thetaEst+delta))) > negloglikelihood(y, h):
            print('!!')
            delta=0.5*delta               #Halve the step until it no longer increases f, so the iteration cannot diverge

        thetaEst=thetaEst+delta

    return thetaEst
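
#For reference, the Newton (IRLS) update implemented above, minimising f = -l, is
#   theta_new = theta - Hf^(-1) grad_f = theta + (X^T D X)^(-1) X^T (y - h)
#with D = diag(h*(1-h)); the step -Hf^(-1) grad_f must be added to theta (hence the
#negated solve), otherwise the iteration ascends f and diverges.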









#Main control

#1)Sample numbers to test for errors in theta, as powers of 10.
npowers=np.arange(1,2,0.05)
n=np.power(10,npowers)


#2)Number of independent covariates
p=5


#3)True theta to be estimated (parameter values)
theta=np.asarray([1,1.2,1.1,0.8,0.9,1.3])


#4)Initialise arrays to store estimates of theta (and errors) computed at specified sample numbers N
Thetas=np.zeros((len(npowers),p+1))
Errors=np.zeros((len(npowers),p+1))


#5)Obtain random covariate values from the specified (Gaussian) distribution, and
#corresponding Bernoulli y values using the true theta.
X,y = logit_data(n[-1],p,theta)


#6)Calculate theta estimates from the first N samples, for each sample number N
for ind,N in enumerate(n):
    N = int(N)
    thetaTemp=NewtonRaphson(X[0:N,:],y[0:N])
    Thetas[ind,:] = np.reshape(thetaTemp, p+1)



#7)Calculate true errors
#print(Thetas)
Errors=Thetas-theta.transpose()
absError=np.abs(Errors)
nerror=Errors*np.sqrt(n)[:,np.newaxis]
logerror = np.log10(absError)

#8)Save data as csv


#9)Plots

plt.scatter(X@theta, g(X@theta))
plt.scatter(X@theta,y)
plt.show()



fig=plt.figure()
for i in range(p+1):
    plt.plot(npowers, logerror[:,i])

fig.suptitle('log10(Absolute Error) in MLE against log10(Number of samples,N) for logistic regression')
plt.xlabel('log_10(N)')
plt.ylabel('log_10(Absolute Error)')

fig.savefig('logiterrors7.png')
plt.show()

1 Answer


Documentation on the logistic regression model in statsmodels may be found here, for the latest development version. All models follow a familiar series of steps, so this should provide sufficient information to implement it in practice (make sure to have a look at some of the examples, e.g. here). Unless you have some very specific need, I would not recommend re-implementing solvers/models that are already available in scipy or statsmodels.

Now, I have used your script to generate some data and fitted a Logit model to estimate the parameters, as follows,

import statsmodels.api as sm

X, y = logit_data(n[-1], p, theta)

model = sm.Logit(y, X)
result = model.fit()

print(result.summary())

This outputs (your mileage may vary),

Optimization terminated successfully.
         Current function value: 0.203609
         Iterations 9
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                   89
Model:                          Logit   Df Residuals:                       83
Method:                           MLE   Df Model:                            5
Date:                Wed, 17 Jul 2019   Pseudo R-squ.:                  0.7062
Time:                        13:40:37   Log-Likelihood:                -18.121
converged:                       True   LL-Null:                       -61.684
                                        LLR p-value:                 2.695e-17
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0735      0.540      1.986      0.047       0.014       2.133
x1             2.0890      0.594      3.518      0.000       0.925       3.253
x2             1.7191      0.459      3.746      0.000       0.820       2.618
x3             1.7228      0.464      3.713      0.000       0.813       2.632
x4             1.1897      0.410      2.902      0.004       0.386       1.993
x5             2.2008      0.653      3.370      0.001       0.921       3.481
==============================================================================

Possibly complete quasi-separation: A fraction 0.10 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

The coefficients can then be accessed through result.params, as follows,

>>>result.params
[1.0734945  2.08898192 1.71907914 1.72278748 1.18972079 2.20079805]
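
If you also need the uncertainty estimates shown in the summary table, the results object exposes them directly, e.g. result.bse for the standard errors and result.conf_int() for the confidence intervals.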
answered 2019-07-17 at 11:47