我正在尝试在 Python 中重现以下 R 结果。在这种特殊情况下,R 预测技能低于 Python 技能,但根据我的经验,这通常不是这种情况(因此想要在 Python 中重现结果的原因),所以请忽略这里的细节。
目的是预测花卉种类('versicolor' 0 或 'virginica' 1)。我们有 100 个标记样本,每个样本由 4 个花特征组成:萼片长度、萼片宽度、花瓣长度、花瓣宽度。我将数据分为训练集(60% 的数据)和测试集(40% 的数据)。对训练集进行 10 折交叉验证以搜索最优 lambda(优化的参数为 scikit-learn 中的“C”)。
我在 R 中使用glmnet,alpha 设置为 1(用于 LASSO 惩罚),对于 python,scikit-learn 的LogisticRegressionCV函数与“liblinear”求解器(唯一可用于 L1 惩罚的求解器)。交叉验证中使用的评分指标在两种语言之间是相同的。然而不知何故,模型结果是不同的(为每个特征找到的截距和系数变化很大)。
代码
library(glmnet)
library(datasets)
data(iris)
y <- as.numeric(iris[,5])
X <- iris[y!=1, 1:4]
y <- y[y!=1]-2
n_sample = NROW(X)
w = .6
X_train = X[0:(w * n_sample),] # (60, 4)
y_train = y[0:(w * n_sample)] # (60,)
X_test = X[((w * n_sample)+1):n_sample,] # (40, 4)
y_test = y[((w * n_sample)+1):n_sample] # (40,)
# set alpha=1 for LASSO and alpha=0 for ridge regression
# use class for logistic regression
set.seed(0)
model_lambda <- cv.glmnet(as.matrix(X_train), as.factor(y_train),
nfolds = 10, alpha=1, family="binomial", type.measure="class")
best_s <- model_lambda$lambda.1se
pred <- as.numeric(predict(model_lambda, newx=as.matrix(X_test), type="class" , s=best_s))
# best lambda
print(best_s)
# 0.04136537
# fraction correct
print(sum(y_test==pred)/NROW(pred))
# 0.75
# model coefficients
print(coef(model_lambda, s=best_s))
#(Intercept) -14.680479
#Sepal.Length 0
#Sepal.Width 0
#Petal.Length 1.181747
#Petal.Width 4.592025
Python代码
from sklearn import datasets
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0] # four features. Disregard one of the 3 species.
y = y[y != 0]-1 # two species: 'versicolor' (0), 'virginica' (1). Disregard one of the 3 species.
n_sample = len(X)
w = .6
X_train = X[:int(w * n_sample)] # (60, 4)
y_train = y[:int(w * n_sample)] # (60,)
X_test = X[int(w * n_sample):] # (40, 4)
y_test = y[int(w * n_sample):] # (40,)
X_train_fit = StandardScaler().fit(X_train)
X_train_transformed = X_train_fit.transform(X_train)
clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring = ‘accuracy’, random_state=0)
clf.fit(X_train_transformed, y_train)
print clf.score(X_train_fit.transform(X_test), y_test) # score is 0.775
print clf.intercept_ #-1.83569557
print clf.coef_ # [ 0, 0, 0.65930981, 1.17808155] (sepal length, sepal width, petal length, petal width)
print clf.C_ # optimal lambda: 0.35938137