144

I can't seem to find any Python libraries that do multiple regression. The only things I find do simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

For example, with this data:

print('y        x1      x2       x3       x4      x5     x6       x7')
for t in texts:
    print("{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}"
          .format(t.y, t.x1, t.x2, t.x3, t.x4, t.x5, t.x6, t.x7))

(Output of the above:)

      y        x1       x2       x3        x4     x5     x6       x7
   -6.0     -4.95    -5.87    -0.76     14.73   4.02   0.20     0.45
   -5.0     -4.55    -4.52    -0.71     13.74   4.47   0.16     0.50
  -10.0    -10.96   -11.64    -0.98     15.49   4.18   0.19     0.53
   -5.0     -1.08    -3.36     0.75     24.72   4.96   0.16     0.60
   -8.0     -6.52    -7.45    -0.86     16.59   4.29   0.10     0.48
   -3.0     -0.81    -2.36    -0.50     22.44   4.81   0.15     0.53
   -6.0     -7.01    -7.33    -0.33     13.93   4.32   0.21     0.50
   -8.0     -4.46    -7.65    -0.94     11.40   4.43   0.16     0.49
   -8.0    -11.54   -10.03    -1.03     18.18   4.28   0.21     0.55

How would I regress these in Python, to get the linear regression formula:

Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + a7x7 + c


15 Answers

111

sklearn.linear_model.LinearRegression will do it:

from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])

Then clf.coef_ will have the regression coefficients.
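
To recover the formula from the question, read the slopes off clf.coef_ and the intercept off clf.intercept_ (a minimal sketch, reusing the fitted clf from above):

print(clf.coef_)       # a1..a7 in Y = a1x1 + ... + a7x7 + c
print(clf.intercept_)  # c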

sklearn.linear_model also has similar interfaces to do various kinds of regularized regression.
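
For example, Ridge (L2-regularized least squares) exposes the same fit/coef_ interface; a small sketch, where alpha=1.0 is an arbitrary regularization strength:

from sklearn import linear_model

ridge = linear_model.Ridge(alpha=1.0)
ridge.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
          [t.y for t in texts])
print(ridge.coef_)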

Answered 2012-07-13T22:41:50.063
64

Here is a little workaround that I created. I checked it with R and it works correctly.

import numpy as np
import statsmodels.api as sm

y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]

x = [
     [4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
     [4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
     [4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
     ]

def reg_m(y, x):
    # Build the design matrix one predictor at a time. Each new predictor is
    # stacked in front, so the summary lists coefficients in *reverse* order
    # of the input lists (x1 in the output corresponds to the last list in x).
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

Result:

print(reg_m(y, x).summary())

Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.535
Model:                            OLS   Adj. R-squared:                  0.461
Method:                 Least Squares   F-statistic:                     7.281
Date:                Tue, 19 Feb 2013   Prob (F-statistic):            0.00191
Time:                        21:51:28   Log-Likelihood:                -26.025
No. Observations:                  23   AIC:                             60.05
Df Residuals:                      19   BIC:                             64.59
Df Model:                           3                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.2424      0.139      1.739      0.098        -0.049     0.534
x2             0.2360      0.149      1.587      0.129        -0.075     0.547
x3            -0.0618      0.145     -0.427      0.674        -0.365     0.241
const          1.5704      0.633      2.481      0.023         0.245     2.895

==============================================================================
Omnibus:                        6.904   Durbin-Watson:                   1.905
Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.708
Skew:                          -0.849   Prob(JB):                       0.0950
Kurtosis:                       4.426   Cond. No.                         38.6

pandas once provided a convenient way to run OLS (its built-in OLS model was later deprecated and removed; statsmodels is the replacement), as described in this answer:

Run an OLS regression with a Pandas data frame

Answered 2013-02-20T02:53:40.180
55

Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression. The difference:

The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term, multivariate linear regression, refers to cases where y is a vector, i.e., the same as general linear regression.

In short:

  • Multiple linear regression: the response y is a scalar.
  • Multivariate linear regression: the response y is a vector.

(Another source.)
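
To make the distinction concrete: sklearn's LinearRegression also accepts a vector-valued response, which is the multivariate case. A small sketch on made-up data (the arrays here are hypothetical):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(20, 3)  # 3 predictors -> multiple regression
Y = np.random.rand(20, 2)  # 2-dimensional response -> multivariate regression
model = LinearRegression().fit(X, Y)
print(model.coef_.shape)   # (2, 3): one row of coefficients per output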

Answered 2015-10-28T02:02:22.493
29

You can use numpy.linalg.lstsq:

import numpy as np

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array(
    [
        [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
        [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
        [-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
        [14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
        [4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
        [0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
        [0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
    ]
)
X = X.T  # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])]  # add bias term
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)

Result:

[ -0.49104607   0.83271938   0.0860167    0.1326091    6.85681762  22.98163883 -41.08437805 -19.08085066]

You can see the estimated output with:

print(np.dot(X,beta_hat))

Result:

[ -5.97751163,  -5.06465759, -10.16873217,  -4.96959788,  -7.96356915,  -3.06176313,  -6.01818435,  -7.90878145,  -7.86720264]
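
As a quick sanity check you can also look at the residuals, i.e. the gap between the observed and fitted values:

residuals = y - np.dot(X, beta_hat)
print(np.sum(residuals ** 2))  # residual sum of squares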
Answered 2014-11-13T21:01:59.500
13

Use scipy.optimize.curve_fit. And not only for linear fits.

from scipy.optimize import curve_fit
import numpy as np

def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]

# y(x0,x1) data:
#    x0=0 1 2
# ___________
# x1=0 |0 1 2
# x1=1 |1 2 3
# x1=2 |2 3 4

x = np.array([[0, 1, 2, 0, 1, 2, 0, 1, 2], [0, 0, 0, 1, 1, 1, 2, 2, 2]])
y = np.array([0, 1, 2, 1, 2, 3, 2, 3, 4])
popt, pcov = curve_fit(fn, x, y)
print(popt)
Answered 2014-11-24T00:12:00.080
10

Once you convert your data to a pandas dataframe (df),

import statsmodels.formula.api as smf
lm = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df).fit()
print(lm.params)

The intercept term is included by default.
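
If you do not want the intercept, the formula syntax lets you drop it with - 1 (a small sketch using the same df):

lm_no_intercept = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 - 1',
                          data=df).fit()
print(lm_no_intercept.params)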

For more examples, see this notebook.

Answered 2015-09-11T15:55:55.040
4

I think this may be the easiest way to finish this work:

from random import random
from pandas import DataFrame
from statsmodels.api import OLS
lr = lambda: [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2': lr(), 'x3': lr()})
x['b'] = 1
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4

print(x.head())

         x1        x2        x3  b
0  0.433681  0.946723  0.103422  1
1  0.400423  0.527179  0.131674  1
2  0.992441  0.900678  0.360140  1
3  0.413757  0.099319  0.825181  1
4  0.796491  0.862593  0.193554  1

print(y.head())

0    6.637392
1    5.849802
2    7.874218
3    7.087938
4    7.102337
dtype: float64

model = OLS(y, x)
result = model.fit()
print(result.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.859e+30
Date:                Wed, 09 Dec 2015   Prob (F-statistic):               0.00
Time:                        15:17:32   Log-Likelihood:                 3224.9
No. Observations:                 100   AIC:                            -6442.
Df Residuals:                      96   BIC:                            -6431.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.0000   8.98e-16   1.11e+15      0.000         1.000     1.000
x2             2.0000   8.28e-16   2.41e+15      0.000         2.000     2.000
x3             3.0000   8.34e-16    3.6e+15      0.000         3.000     3.000
b              4.0000   8.51e-16    4.7e+15      0.000         4.000     4.000
==============================================================================
Omnibus:                        7.675   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.022   Jarque-Bera (JB):                3.118
Skew:                           0.045   Prob(JB):                        0.210
Kurtosis:                       2.140   Cond. No.                         6.89
==============================================================================
Answered 2015-12-09T07:22:04.567
4

Multiple linear regression can be handled using the sklearn library as referenced above. I'm using the Anaconda install of Python 3.6.

Create your model as follows:

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)

# display coefficients
print(regressor.coef_)
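
The intercept and predictions come from the same object (a sketch, assuming X and y are already defined as arrays or a DataFrame):

# display the constant term and predictions for the first rows
print(regressor.intercept_)
print(regressor.predict(X[:5]))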
Answered 2017-05-08T22:23:01.257
3

You can use numpy.linalg.lstsq.

Answered 2012-07-14T07:25:34.580
1

You can use the function below and pass it a DataFrame:

def linear(x, y=None, show=True):
    """
    @param x: pd.DataFrame
    @param y: pd.DataFrame or pd.Series or None
              if None, then use last column of x as y
    @param show: if show regression summary
    """
    import pandas as pd
    import statsmodels.api as sm

    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    # .ix was removed from pandas; .iloc does the positional indexing here
    res = sm.OLS(xy.iloc[:, -1], xy.iloc[:, :-1], missing='drop').fit()

    if show:
        print(res.summary())
    return res
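
Hypothetical usage, assuming df is a pandas DataFrame whose last column is the response:

res = linear(df)   # regresses the last column on all the others
print(res.params)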
Answered 2015-08-26T09:37:25.897
1

Scikit-learn is a machine learning library for Python which can do this job for you. Just import the sklearn.linear_model module into your script.

A code template for multiple linear regression using sklearn in Python:

import numpy as np
import matplotlib.pyplot as plt #to plot visualizations
import pandas as pd

# Importing the dataset
df = pd.read_csv(<Your-dataset-path>)
# Assigning feature and target variables
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

# One-hot encode any categorical variable; OneHotEncoder's old
# categorical_features argument was removed from sklearn, so
# ColumnTransformer is used to target the selected column instead
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
onehotencoder = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['<column-name>'])], remainder='passthrough')
X = onehotencoder.fit_transform(X)

# Avoiding the dummy variable trap
X = X[:,1:] # Usually done by the algorithm itself

#Spliting the data into test and train set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)

# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the test set results
y_pred = regressor.predict(X_test)

That's it. You can use this code as a template for implementing multiple linear regression on any dataset. For a better understanding with an example, visit: Linear Regression with an example
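
As a follow-up, it is common to score the template's predictions on the held-out test set; a minimal sketch with sklearn's r2_score:

from sklearn.metrics import r2_score

print(r2_score(y_test, y_pred))  # coefficient of determination on the test set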

Answered 2019-11-27T09:33:12.833
0

Here is an alternative, basic method:

from patsy import dmatrices
import statsmodels.api as sm

y,x = dmatrices("y_data ~ x_1 + x_2 ", data = my_data)
### y_data is the name of the dependent variable in your data ### 
model_fit = sm.OLS(y,x)
results = model_fit.fit()
print(results.summary())

Instead of sm.OLS you can also use sm.Logit or sm.Probit, etc.
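
For instance, with a binary dependent variable the same pattern works with sm.Logit (a sketch; it assumes y_data in my_data is coded 0/1):

y, x = dmatrices("y_data ~ x_1 + x_2", data=my_data)
logit_results = sm.Logit(y, x).fit()
print(logit_results.summary())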

Answered 2019-02-11T21:04:35.897
0

It is possible to find a linear model such as this one with OpenTURNS.

In OpenTURNS this is done with the LinearModelAlgorithm class, which creates a linear model from numerical samples. More specifically, it builds the following linear model:

Y = a0 + a1.X1 + ... + an.Xn + epsilon,

where the error epsilon is Gaussian with zero mean and unit variance. Assuming your data is in a csv file, here is a simple script to get the regression coefficients ai:

import pandas as pd
import openturns as ot

# Assuming the data is a csv file with the given structure
# Y X1 X2 .. X7
df = pd.read_csv("./data.csv", sep=r"\s+")

# Build a sample from the pandas dataframe
sample = ot.Sample(df.values)

# The observation points are in the first column (dimension 1)
Y = sample[:, 0]

# The input vector (X1,..,X7) of dimension 7
X = sample[:, 1::]

# Build a Linear model approximation
result = ot.LinearModelAlgorithm(X, Y).getResult()

# Get the coefficients ai
print("coefficients of the linear regression model = ", result.getCoefficients())

You can then easily get the confidence intervals with the following call:

# Get the confidence intervals at 90% of the ai coefficients
print(
    "confidence intervals of the coefficients = ",
    ot.LinearModelAnalysis(result).getCoefficientsConfidenceInterval(0.9),
)

You can find a more detailed example in the OpenTURNS examples.

Answered 2020-09-11T14:36:46.733
0

Try a generalized linear model with a Gaussian family:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import glm

y = np.array([-6, -5, -10, -5, -8, -3, -6, -8, -8])
X = np.array([
    [-4.95, -4.55, -10.96, -1.08, -6.52, -0.81, -7.01, -4.46, -11.54],
    [-5.87, -4.52, -11.64, -3.36, -7.45, -2.36, -7.33, -7.65, -10.03],
    [-0.76, -0.71, -0.98, 0.75, -0.86, -0.50, -0.33, -0.94, -1.03],
    [14.73, 13.74, 15.49, 24.72, 16.59, 22.44, 13.93, 11.40, 18.18],
    [4.02, 4.47, 4.18, 4.96, 4.29, 4.81, 4.32, 4.43, 4.28],
    [0.20, 0.16, 0.19, 0.16, 0.10, 0.15, 0.21, 0.16, 0.21],
    [0.45, 0.50, 0.53, 0.60, 0.48, 0.53, 0.50, 0.49, 0.55],
])
# transpose (and reverse the variable order) so rows are observations;
# zip returns an iterator in Python 3, so materialize it with list()
X = list(zip(*reversed(X)))

df=pd.DataFrame({'X':X,'y':y})
columns=7
for i in range(0,columns):
    df['X'+str(i)]=df.apply(lambda row: row['X'][i],axis=1)

df=df.drop('X',axis=1)
print(df)


#model_formula='y ~ X0+X1+X2+X3+X4+X5+X6'
model_formula='y ~ X0'

model_family = sm.families.Gaussian()
model_fit = glm(formula = model_formula, 
             data = df, 
             family = model_family).fit()

print(model_fit.summary())

# Extract coefficients from the fitted model wells_fit
#print(model_fit.params)
intercept, slope = model_fit.params

# Print coefficients
print('Intercept =', intercept)
print('Slope =', slope)

# Extract and print confidence intervals
print(model_fit.conf_int())

df2=pd.DataFrame()
df2['X0']=np.linspace(0.50,0.70,50)

df3=pd.DataFrame()
df3['X1']=np.linspace(0.20,0.60,50)

prediction0=model_fit.predict(df2)
#prediction1=model_fit.predict(df3)

plt.plot(df2['X0'],prediction0,label='X0')
plt.ylabel("y")
plt.xlabel("X0")
plt.show()
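
To fit all seven predictors instead of just X0, switch to the commented-out formula; model_fit.params then holds the intercept plus seven slopes, so the two-value unpacking above no longer applies:

model_full = glm(formula='y ~ X0+X1+X2+X3+X4+X5+X6',
                 data=df, family=model_family).fit()
print(model_full.params)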
Answered 2021-02-11T22:24:08.523
0

Linear regression is a good example for starting with artificial intelligence.

Here is a good example of a machine learning algorithm doing multiple linear regression using Python:

##### Predicting House Prices Using Multiple Linear Regression - @Y_T_Akademi

#### In this project we are gonna see how machine learning algorithms help us predict house prices. Linear regression is a model for predicting new future data by using the existing correlation between the old data. Here, machine learning helps us identify this relationship between feature data and output, so we can predict future values.

import pandas as pd

##### we use the sklearn library for many machine learning calculations

from sklearn import linear_model

##### we import our dataset: housepricesdataset.csv

df = pd.read_csv("housepricesdataset.csv",sep = ";")

##### the feature set is area, roomcount and buildingage,
##### and the output (result) data is price;
##### we define a linear regression model here:

reg = linear_model.LinearRegression()
reg.fit(df[['area', 'roomcount', 'buildingage']], df['price'])

# Since our model is ready, we can make predictions now:
# lets predict a house with 230 square meters, 4 rooms and 10 years old building..

reg.predict([[230,4,10]])

# Now lets predict a house with 230 square meters, 6 rooms and 0 years old building - its new building..
reg.predict([[230,6,0]])

# Now lets predict a house with 355 square meters, 3 rooms and 20 years old building 
reg.predict([[355,3,20]])

# You can make as many predictions as you want..
reg.predict([[230,4,10], [230,6,0], [355,3,20], [275, 5, 17]])

My dataset is as below:

(screenshot of the dataset)

Answered 2021-10-30T11:12:23.077