r - Regression in R with months as independent variables (labels)

Question

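I am wondering whether there is a cleaner way than dummy coding the months (e.g., isJan, isFeb...) to get more meaningful names for my independent variables (the rows under the intercept). My real data set is fairly large, so I have simulated a simple one here.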

#create simulated data set with sales, and date
sales <- rnorm(1000, mean = 1000, sd = 40)
dates <- seq(from = 14610, to = 15609)
data <- cbind(sales, dates)

#regression with months 
model <- lm(sales ~ months(dates))
summary(model)

I would like the labels under the intercept to show the actual month they refer to... currently my output looks like this:

                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      999.1934     1.2673 788.432   <2e-16 ***
months(dates).L   -4.9537     4.5689  -1.084   0.2785    
months(dates).Q   -6.4931     4.4211  -1.469   0.1422    
months(dates).C   -5.5078     4.4180  -1.247   0.2128    
months(dates)^4    2.3713     4.4864   0.529   0.5972    
months(dates)^5   -1.7749     4.4605  -0.398   0.6908    
months(dates)^6    1.5774     4.4555   0.354   0.7234    
months(dates)^7  -10.9954     4.4511  -2.470   0.0137 *  
months(dates)^8   -0.9627     4.4032  -0.219   0.8270    
months(dates)^9    1.8847     4.2996   0.438   0.6612    
months(dates)^10  -8.5990     4.1776  -2.058   0.0398 *  
months(dates)^11   7.8436     4.1292   1.900   0.0578 .

Thanks in advance, --JT

score 7 · Accepted Answer

The problem is that R has created an ordered factor, and the contrasts produced for an ordered factor are polynomial contrasts (.L is linear, .Q is quadratic, .C is cubic and ^n is the n-th order polynomial). It may be better to define the month as an unordered factor, set the first level to January, and then fit the model.
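As a minimal sketch of the difference, the snippet below builds the same three-level factor twice, once ordered and once unordered, and looks at the column names R generates for the design matrix. The names f_ord and f_unord are made up for illustration, and the result assumes R's default contrasts option (contr.treatment for unordered factors, contr.poly for ordered ones).

```r
# Hypothetical three-level example; f_ord and f_unord are illustrative names
lv <- c("Jan", "Feb", "Mar")
f_ord   <- factor(c("Jan", "Feb", "Mar", "Jan"), levels = lv, ordered = TRUE)
f_unord <- factor(f_ord, ordered = FALSE)   # same levels, no ordering

# Ordered factor: polynomial contrasts (.L, .Q, ...)
cn_ord <- colnames(model.matrix(~ f_ord))
cn_ord     # "(Intercept)" "f_ord.L" "f_ord.Q"

# Unordered factor: treatment contrasts, named after the levels
cn_unord <- colnames(model.matrix(~ f_unord))
cn_unord   # "(Intercept)" "f_unordFeb" "f_unordMar"
```

If the month column in a data set is already an ordered factor, converting it with factor(x, ordered = FALSE) should be enough to get level-based coefficient names.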

If you are in an English locale, you can use the month.name or month.abb constants as follows:

set.seed(42)
dat <- data.frame(sales = rnorm(1000, mean = 1000, sd = 40),
                  dates = as.Date(seq(from = 14610, to = 15609),
                                  origin = "1970-01-01"))
dat <- transform(dat, month = factor(format(dates, format = "%B"),
                                     levels = month.name))

This gives

> head(dat)
      sales      dates   month
1 1054.8383 2010-01-01 January
2  977.4121 2010-01-02 January
3 1014.5251 2010-01-03 January
4 1025.3145 2010-01-04 January
5 1016.1707 2010-01-05 January
6  995.7550 2010-01-06 January
> with(dat, levels(month))
 [1] "January"   "February"  "March"     "April"     "May"      
 [6] "June"      "July"      "August"    "September" "October"  
[11] "November"  "December"

Note that the levels are in calendar order rather than alphabetical order. If you are in a non-English locale, the output of "%B" will be the month names in your local language or convention. You will then need to supply the correct levels as a character vector to the levels argument in the code above.
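One possible way to handle a non-English locale, sketched here rather than taken from the answer, is to work from the numeric month ("%m"), which does not depend on the locale. The variable names month_num, month_en, local_names and month_local are illustrative only.

```r
# Simulated dates, as in the code above
dates <- as.Date(seq(from = 14610, to = 15609), origin = "1970-01-01")

# "%m" gives the month as "01".."12" regardless of locale
month_num <- as.integer(format(dates, "%m"))

# Option 1: label the month numbers with the English constants
month_en <- factor(month_num, levels = 1:12, labels = month.name)

# Option 2: build the localized month names in calendar order
# and use them as the levels for the "%B" output
local_names <- format(ISOdate(2000, 1:12, 1), "%B")
month_local <- factor(format(dates, "%B"), levels = local_names)
```

Option 1 always yields English coefficient names; option 2 keeps the local month names but still puts them in calendar order.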

This data set can then be used to fit the model and we get more meaningful coefficient names

> mod <- lm(sales ~ month, data = dat)
> summary(mod)

Call:
lm(formula = sales ~ month, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-140.333  -24.551    0.108   28.102  134.349 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1001.7034     4.1567 240.983   <2e-16 ***
monthFebruary    -8.3618     6.0153  -1.390    0.165    
monthMarch       -0.5347     5.8785  -0.091    0.928    
monthApril       -7.5618     5.9273  -1.276    0.202    
monthMay         -2.2961     5.8785  -0.391    0.696    
monthJune         3.5091     5.9273   0.592    0.554    
monthJuly        -4.9975     5.8785  -0.850    0.395    
monthAugust      -0.3558     5.8785  -0.061    0.952    
monthSeptember    3.7597     5.9970   0.627    0.531    
monthOctober     -2.5948     6.5724  -0.395    0.693    
monthNovember   -10.5670     6.6378  -1.592    0.112    
monthDecember    -6.9064     6.5724  -1.051    0.294    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 40.09 on 988 degrees of freedom
Multiple R-squared: 0.01173,    Adjusted R-squared: 0.0007317 
F-statistic: 1.066 on 11 and 988 DF,  p-value: 0.3854

In the above, note that January is the first level so its mean is the (Intercept) estimate and the other estimates are deviations from the January mean. An alternative parameterisation of the model is to suppress the intercept:

> mod2 <- lm(sales ~ month - 1, data = dat)
> summary(mod2)

Call:
lm(formula = sales ~ month - 1, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-140.333  -24.551    0.108   28.102  134.349 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
monthJanuary   1001.703      4.157   241.0   <2e-16 ***
monthFebruary   993.342      4.348   228.5   <2e-16 ***
monthMarch     1001.169      4.157   240.9   <2e-16 ***
monthApril      994.142      4.225   235.3   <2e-16 ***
monthMay        999.407      4.157   240.4   <2e-16 ***
monthJune      1005.213      4.225   237.9   <2e-16 ***
monthJuly       996.706      4.157   239.8   <2e-16 ***
monthAugust    1001.348      4.157   240.9   <2e-16 ***
monthSeptember 1005.463      4.323   232.6   <2e-16 ***
monthOctober    999.109      5.091   196.3   <2e-16 ***
monthNovember   991.136      5.175   191.5   <2e-16 ***
monthDecember   994.797      5.091   195.4   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 40.09 on 988 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984 
F-statistic: 5.175e+04 on 12 and 988 DF,  p-value: < 2.2e-16

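Now the estimates are the monthly means, and the t-tests are of the hypothesis that each individual monthly mean is zero (0). Note that the much larger R-squared here is not comparable with that of the first fit: without an intercept, R computes it relative to zero rather than relative to the overall mean.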
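As a quick check of the two parameterisations, the sketch below refits both models on the simulated data and compares the coefficients with the raw monthly means from tapply(). It builds the month factor from the numeric month so that it runs in any locale; that detail, and the name means, are assumptions of this sketch rather than part of the original code.

```r
set.seed(42)
dat <- data.frame(sales = rnorm(1000, mean = 1000, sd = 40),
                  dates = as.Date(seq(from = 14610, to = 15609),
                                  origin = "1970-01-01"))
# Locale-independent month factor with January as the first level
dat$month <- factor(as.integer(format(dat$dates, "%m")),
                    levels = 1:12, labels = month.name)

mod  <- lm(sales ~ month, data = dat)       # intercept = January mean
mod2 <- lm(sales ~ month - 1, data = dat)   # one coefficient per month

# Raw monthly means for comparison
means <- as.vector(tapply(dat$sales, dat$month, mean))

# No-intercept coefficients are the monthly means
all.equal(unname(coef(mod2)), means)

# With an intercept: January mean, then deviations from it
all.equal(unname(coef(mod)[1] + c(0, coef(mod)[-1])), means)
```

Both comparisons should return TRUE, which shows that the two fits describe the same model and differ only in how the coefficients are expressed.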

score 2 · Accepted Answer

Create a month variable that is a factor, and R will automatically create nice names.

sales <- rnorm(1000, mean = 1000, sd = 40)
dates <- as.Date(seq(from = 14610, to = 15609),origin='1970-01-01')
data <- data.frame(sales, dates)
data$months=as.factor(months(dates))

model <- lm(sales ~ months,data=data)
summary(model)

It automatically picks April as the reference month (the first level alphabetically), but you can change this using contrasts.

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1001.3989     4.2880 233.535   <2e-16 ***
monthsAugust       6.8982     6.0150   1.147   0.2517    
monthsDecember    -6.0561     6.7140  -0.902   0.3673    
monthsFebruary    -1.3977     6.1527  -0.227   0.8203    
monthsJanuary     -3.2086     6.0150  -0.533   0.5939    
monthsJuly       -10.0742     6.0150  -1.675   0.0943 .  
monthsJune        -3.3393     6.0641  -0.551   0.5820    
monthsMarch        0.3159     6.0150   0.053   0.9581    
monthsMay         -0.1448     6.0150  -0.024   0.9808    
monthsNovember     3.4901     6.7799   0.515   0.6068    
monthsOctober      3.2082     6.7140   0.478   0.6329    
monthsSeptember   -7.3039     6.1343  -1.191   0.2341
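Another way to change the reference month, besides setting contrasts, is relevel(), which moves a chosen level to the front of the factor. The sketch below is an illustration under assumptions, not the answerer's code: it uses a fixed seed and builds English month names from the numeric month so that it behaves the same in any locale.

```r
set.seed(1)
sales <- rnorm(1000, mean = 1000, sd = 40)
dates <- as.Date(seq(from = 14610, to = 15609), origin = "1970-01-01")
data <- data.frame(sales, dates)

# English month names regardless of locale; levels sort alphabetically
data$months <- factor(month.name[as.integer(format(dates, "%m"))])
first_before <- levels(data$months)[1]   # "April"

# Make January the reference level instead
data$months <- relevel(data$months, ref = "January")

model <- lm(sales ~ months, data = data)
names(coef(model))   # no monthsJanuary term; January is now the intercept
```

The remaining levels stay in alphabetical order after relevel(); for full calendar order, pass levels = month.name to factor() instead.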
