由于这篇文章很长,我将在这里提出我的问题:
1. python 中是否有一个包可以为我提供最大似然估计参数,对于给定数量的参数 p,协变量 x 和数据值 y?(最好有关于如何实现它的综合文档)
2. 如果我尝试 O(100) 协变量和相似数量的数据样本,这种方法是否可扩展,即是否可行?
背景:
我正在尝试使用 python 研究具有不同数量的样本 n /协变量 p 的最大似然估计量的分布。我的脚本可以很好地生成逻辑回归的数据,但是我无法让任何参数估计方法(即最大化对数似然的参数值)正常工作。
我尝试过的方法:
- 编写我自己版本的 Newton Raphson 程序。但是我的估计中的错误在重复迭代中是不同的(我当然检查了明显的符号和不等式错误!)
- 使用牛顿共轭梯度实现。这也失败了,因为错误“数组的真值不明确”。当我编写自己的版本时,我可以使用 all() 来解决这个问题,但在使用包时不行。奇怪,因为我认为 Newton CG 是为多变量案例工作的!
-最后,我打算使用包statsmodels。我不知道我是否真的很迟钝,但我找不到任何全面的文档?为了获得逻辑回归参数,我发现的最好的是 this,我尝试过例如
X,y = logit_data(np.power(10,6),p,theta)
y=np.reshape(y, (len(y),))
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X, y)
thetaEst = clf.get_params(X, y)
我已经尝试过最后一行:
thetaEst = clf.get_params()
但似乎没有什么可以给我估计参数值。我要么有错误,要么有我不期望的对象。当然有一个python包应该这样做?!?当然,我不应该求助于 R(我不知道 RD:!!!)
可选 代码
我不想让这篇文章更长,但我相信它会被要求。所以我已经输入了实现 Newton Raphson 的代码,并且我一直在用现有的包替换这个函数:
#Script for producing y data from p covariates from a specified distribution, specified theta paraemters,
#and n data samples for logit link function.
import numpy as np
import matplotlib.pyplot as plt
#Define link function here
def g(z):
g=1/(1+np.exp(-z))
return g
#For producing y data values given true paramters theta and number of covariates
def logit_data(n,p, theta):
#Define parameters
#1)Number of covariates
p_i = p+1 #with intercept
p_i=np.int(p_i)
#2) m as correct data type
n=np.int(n)
#4)Specify parameter valueas to be estimated
theta=np.reshape(theta, (p_i,1))
#5)Define distribution from which covariate values are drawn i.i.d., and initiate data values
X=np.zeros((n,p_i))
X[:,0]=1 #intercept
mean=0
sigma=1.5
X[:,1:]=np.random.normal(mean,sigma,(n,p))
#6)Produce y values treating y as a Bernoulli variable with p=g(X*theta)
r=np.random.uniform(0,1,n)
r=np.reshape(r, (len(r),1))
htrue=g(X.dot(theta))
y=htrue-r
y[y>=0]=1
y[y<0]=0
return X, y
#Newton Raphson implementation
def NewtonRaphson(X,y):
##NOTE: All functions negloglikelihood, gradf, hessian, return the values for f = (-ve of the log likelihood function)
#, to use NR method to minimise f (rather than maximise l)
#Define log likelihood function to be maximised
def negloglikelihood(y,h):
l= y.transpose() @ np.log(h) + (1-y).transpose() @ np.log(1-h)
f=-l
return f
#Define gradient of log likelihood function
def gradf(y, h, X):
a=(y-h).transpose()
gradl= np.matmul(a,X)
grad_f=-gradl
return grad_f
#Define second derivative (Hessian) of log likelihood function
def hessian(h, X):
D=np.identity(len(h))*(np.matmul(h,(1-h).transpose()))
H=-X.transpose() @ D @ X
Hf=-H
return Hf
#Minimise f=-l
#Produce initial theta estimate and probability parameter h
np.random.seed(555)
thetaEst=np.random.normal(1.1, 0.4, 6)
eta=np.matmul(X,thetaEst)
h=g(eta)
#While not at a manimum of f
#control constants
a=10e-8
b=10e-8
i=0
j=0
k=0
while not (np.linalg.norm(gradf(y,h,X)) < np.absolute(negloglikelihood(y,h)) * a + b):
i=i+1
#print('i = %s' %i)
h=g(np.matmul(X,thetaEst))
H=hessian(h,X) #Cholesky decomposition to ensure hessian (of f) is positive semi-definite)
# print(H)
try:
np.linalg.cholesky(H)
#print('j = %s' %j)
j=j+1
except np.linalg.LinAlgError:
print('Hessian not positive semi-definite!')
try:
v,w=np.linalg.eig(H)
# print(v,w)
v=np.absolute(v)
H=w @ np.diag(v) @ np.linalg.inv(w)
except:
return thetaEst
delta = 0
try:
delta=np.linalg.solve(H, np.reshape(gradf(y,h,X),(6,1))) #Solve for incremental theta step
#print('k = %s' %k)
k=k+1
except:
return thetaEst #Simply return theta estimate if have singular hessian
while negloglikelihood(y, h) > negloglikelihood(y, g(np.matmul(X,thetaEst+delta))):
print('!!')
delta=0.5*delta #Ensure added step is sufficently small so as not to diverge
thetaEst=thetaEst+delta
return thetaEst
#Main control
#1)Sample numbers to test for erros in beta, as powers of 10.
npowers=np.arange(1,2,0.05)
n=np.power(10,npowers)
#2)Number of independent covariates
p=5
#3)True theta to be estimated (parameter values)
theta=np.asarray([1,1.2,1.1,0.8,0.9,1.3])
#4)#Initiate arrays to store estimates of theta (and errors) computed at specified sample numbers N
Thetas=np.zeros((len(npowers),p+1))
Errors=np.zeros((len(npowers),p+1))
#5)Obtain random covariate values from specified distribution, and corresponding y values using true theta
#plus gaussian noise term.
X,y = logit_data(n[-1],p,theta)
#6)Calulcate cumulative means for given n values, for the theta estimates
for ind,N in enumerate(n):
N=np.int(N)
thetaTemp=NewtonRaphson(X[0:N,:],y[0:N])
Thetas[ind,:] = np.reshape(thetaTemp,6)
#7)Calculate true erros
#print(Thetas)
Errors=Thetas-theta.transpose()
absError=np.abs(Errors)
nerror=Errors*np.sqrt(n)[:,np.newaxis]
logerror = np.log10(absError)
#8)Save data as csv
#9)Plots
plt.scatter(X@theta, g(X@theta))
plt.scatter(X@theta,y)
plt.show()
fig=plt.figure()
for i in range(p+1):
plt.plot(npowers, logerror[:,i])
fig.suptitle('log10(Absolute Error) in MLE against log10(Number of samples,N) for logistic regression')
plt.xlabel('log_10(N)')
plt.ylabel('log_10(Absolute Error)')
fig.savefig('logiterrors7.png')
plt.show()