
I am working with a fixed-effects regression model using plm.

The model looks like this:

FE.model <-plm(fml, data = data.reg2,
           index=c('Site.ID','date.hour'), # cross section ID and time series ID
           model='within', #coefficients are fixed
           effect='individual')
summary(FE.model)

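"fml" is a formula I defined earlier. I have many independent variables, so this is more efficient.

What I want to do is get my fitted values (my yhats) and join them to my base dataset, data.reg2.

I was able to get the fitted values with the following code: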

 Fe.model.fitted <- FE.model$model[[1]] - FE.model$residuals

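However, this only gives me a column vector containing the fitted values, with nothing I can use to join it to my base dataset.

Alternatively, I tried something like this: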

 Fe.model.fitted <- cbind(data.reg2, resid=resid(FE.model), fitted=fitted(FE.model))

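However, I got this error:

 Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class "pseries" to a data.frame

Is there another way to get my fitted values into my base dataset? Or can someone explain the error I am getting and how to fix it?

I should note that I do not want to compute the yhats by hand from my betas. I have too many independent variables for that, and the formula I defined (fml) may change, so that option would not be practical.

Many thanks!!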

5 Answers

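Merging the fitted values from plm back into the original dataset takes a few intermediate steps: plm drops any rows with missing data, and as far as I can tell a plm object does not contain the index information. plm also does not preserve the order of the data; see the comment in this thread by Giovanni Millo, one of the authors of plm:

"...the input order is not always preserved: observations are always reordered by (individual, time) internally, so that the output you get is ordered accordingly..."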
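The reordering described in the quote can be seen on a toy panel. This is only a small sketch: the data frame `df` and its columns are made up for illustration, and it assumes that `pdata.frame()` sorts rows by (individual, time), as the quote describes.

```r
library(plm)

# Hypothetical toy panel, deliberately entered out of order
df <- data.frame(id   = c("b", "a", "b", "a"),
                 time = c(2, 1, 1, 2),
                 x    = c(1, 2, 3, 4))

p <- pdata.frame(df, index = c("id", "time"))

# Individual index and x values in the order plm will work with
ids_sorted <- as.character(index(p)[[1]])
x_sorted   <- as.numeric(p$x)

print(data.frame(id = ids_sorted, x = x_sorted))
```

Because of this, fitted values should be matched back by their (id, time) keys rather than by row position in the original data.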

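The steps in short:

  1. Get the fitted values from the estimated plm object. They come as a single vector, but the entries are named. The names correspond to positions in the index.
  2. Get the index with the index() function. It can return both the individual index and the time index. Note that the index may contain more rows than there are fitted values, in case rows were dropped because of missing data. (The index could also be built directly from the original data, but I have seen no promise that the original order of the data is preserved in what plm returns.)
  3. Merge into the original data, looking up the id and time values from the index.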
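The three steps can also be sketched with plain data.frames. This is not the code from this answer: it uses the `Produc` dataset shipped with plm, and it assumes that the model frame stored in the fitted object (`fe$model`) carries an `"index"` attribute covering exactly the rows used in estimation, which holds for the plm versions I am aware of but is worth checking on yours. The names `base`, `fe`, `keyed` and `merged` are illustrative.

```r
library(plm)
data("Produc", package = "plm")

base <- Produc
base$state <- as.character(base$state)   # plain character key for merging

fe <- plm(log(gsp) ~ log(pcap) + log(emp) + unemp,
          data = base, index = c("state", "year"),
          model = "within", effect = "individual")

# Step 1: fitted values as response minus residuals
fit <- as.numeric(fe$model[[1]]) - as.numeric(fe$residuals)

# Step 2: individual and time index of the rows used in estimation
# (assumption: the model frame keeps its "index" attribute)
idx <- attr(fe$model, "index")
keyed <- data.frame(
  state  = as.character(idx[[1]]),
  year   = as.integer(as.character(idx[[2]])),  # stored as a factor
  fitted = fit,
  stringsAsFactors = FALSE
)

# Step 3: merge by key; all.x = TRUE keeps rows that plm dropped (fitted = NA)
merged <- merge(base, keyed, by = c("state", "year"), all.x = TRUE)
head(merged[, c("state", "year", "gsp", "fitted")])
```

Merging on the keys rather than binding columns side by side means the result does not depend on how plm ordered the observations internally.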

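Sample code is provided below. It is a bit long, but I have tried to comment it. The code is not optimized; my intention was to list the steps explicitly. Also, I am using data.tables rather than data.frames.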

library(data.table); library(plm)

### Generate dummy data. This way we know the "true" coefficients
set.seed(100)
n <- 500 # Run with more data if you want to get closer to the "true" coefficients
DT <- data.table(CJ(id = c("a","b","c","d","e"), time = c(1:(n / 5))))
DT[, x1 := rnorm(n)]
DT[, x2 := rnorm(n)]
DT[, y  := x1 + 2 * x2 + rnorm(n) / 10]

setkey(DT, id, time)
# # Make it an unbalanced panel & put in some NAs
DT <- DT[!(id == "a" & time == 4)]
DT[.("a", 3), x2 := as.numeric(NA)]
DT[.("d", 2), x2 := as.numeric(NA)]

str(DT)

### Run the model -- both individual and time effects; "within" model
summary(PLM <- plm(data = DT, index = c("id", "time"), formula = y ~ x1 + x2, model = "within", effect = "twoways", na.action = "na.omit"))

### Merge the fitted values back into the data.table DT
# Note that PLM$model$y is shorter than the data, i.e. the row(s) with NA have been dropped
cat("\nRows omitted (due to NA): ", nrow(DT) - length(PLM$model$y))

# Since the objects returned by plm() do not contain the index, need to generate it from the data
# The object returned by plm(), i.e. PLM$model$y, has names that point to the place in the index
# Note: The index can also be done as INDEX <- DT[, j = .(id, time)], but use the longer way with index() in case plm does not preserve the order
INDEX <- data.table(index(x = pdata.frame(x = DT, index = c("id", "time")), which = NULL)) # which = NULL extracts both the individual and time indexes
INDEX[, id := as.character(id)]
INDEX[, time := as.integer(time)] # it is returned as a factor, convert back to integer to match the variable type in DT

# Generate the fitted values as the difference between the y values and the residuals
if (all(names(PLM$residuals) == names(PLM$model$y))) { # this should not be needed, but just in case...
    FIT <- data.table(
        index   = as.integer(names(PLM$model$y)), # this index corresponds to the position in the INDEX, from where we get the "id" and "time" below
        fit.plm = as.numeric(PLM$model$y) - as.numeric(PLM$residuals)
    )
}

FIT[, id   := INDEX[index]$id]
FIT[, time := INDEX[index]$time]
# Now FIT has both the id and time variables, can match it back into the original dataset (i.e. we have the missing data accounted for)
DT <- merge(x = DT, y = FIT[, j = .(id, time, fit.plm)], by = c("id", "time"), all = TRUE) # Need all = TRUE, or some data from DT will be dropped!
answered 2015-11-13T10:51:22.737

The residuals are the deviations of the model from the value on the LHS of the formula, which you have not shown us. There is a fitted.panelmodel function in the 'plm' package, but it appears to expect a fitted value that the plm function does not return by default, nor is it documented to do so, nor is there a way that I can see to make it cough one up.

library(plm)
data("Produc", package = "plm")
zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
          data = Produc, index = c("state","year"))
summary(zz)  # the example on the plm page:
> str(fitted(zz))
 NULL
> names(zz$model)
[1] "log(gsp)"  "log(pcap)" "log(pc)"   "log(emp)"  "unemp"    
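> # Note: zz$residuals holds the residuals, so the "Fitted" column created here contains residuals; the fitted values are Yvar - Fitted
> Produc[ , c("Yvar", "Fitted")] <- cbind( zz$model[ ,"log(gsp)", drop=FALSE], zz$residuals)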
> str(Produc)
'data.frame':   816 obs. of  12 variables:
 $ state : Factor w/ 48 levels "ALABAMA","ARIZONA",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year  : int  1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
 $ pcap  : num  15033 15502 15972 16406 16763 ...
 $ hwy   : num  7326 7526 7765 7908 8026 ...
 $ water : num  1656 1721 1765 1742 1735 ...
 $ util  : num  6051 6255 6442 6756 7002 ...
 $ pc    : num  35794 37300 38670 40084 42057 ...
 $ gsp   : int  28418 29375 31303 33430 33749 33604 35764 37463 39964 40979 ...
 $ emp   : num  1010 1022 1072 1136 1170 ...
 $ unemp : num  4.7 5.2 4.7 3.9 5.5 7.7 6.8 7.4 6.3 7.1 ...
 $ Yvar  :Classes 'pseries', 'pseries', 'integer'  atomic [1:816] 10.3 10.3 10.4 10.4 10.4 ...
  .. ..- attr(*, "index")='data.frame': 816 obs. of  2 variables:
  .. .. ..$ state: Factor w/ 48 levels "ALABAMA","ARIZONA",..: 1 1 1 1 1 1 1 1 1 1 ...
  .. .. ..$ year : Factor w/ 17 levels "1970","1971",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Fitted: num  -0.04656 -0.03064 -0.01645 -0.00873 -0.02708 ...
answered 2014-04-17T21:58:55.423

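I have a simplified approach. There are two main issues here:

1) pdata.frame sorts your input alphabetically by name, then by year. This can be handled by sorting the data frame before running plm.

2) Rows with NA in any variable included in the formula are dropped. I handle this by creating a second formula that also contains my id and time variables, and then using model.frame to extract the data used in the regression (NAs excluded, but now with id and time included).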

library(plm)
set.seed(100)
n <- 10 # Run with more data if you want to get closer to the "true" coefficients
DT <- data.frame(id = c("a","c","b","d","e"), time = c(1:(n / 5)),x1 = rnorm(n),x2= rnorm(n),x3=rnorm(n))
DT$Y = DT$x2 + 2 * DT$x3 + rnorm(n) / 10 # make Y a function of the other variables
DT$x3[3]=NA  # add an NA to show this works with missing data 
DT  

# now can add drop.index = F, but note that DT is now sorted by order(id,time)
pdata.frame(DT,index=c('id','time'),drop.index = F)

# order DT to match pdata.frame that will be used for plm
DT=DT[order(DT$id,DT$time),]

# formulas
formulas =Y~x1+x2+x3 
formulas_dataframe = Y~x1+x2+x3 +id+time # add id and time for model.frame

# estimate
random <- plm(formulas, data=DT, index=c("id", "time"), model="random",na.action = 'na.omit')
summary(random) 

# merge prediction and model.frame
fitted = data.frame(fitted = random$model[[1]] - random$residuals)
model_data = cbind(as.data.frame(as.matrix(random$model)),fitted)  # this isn't really needed but shows that input and model.frame are same
model_data = cbind(model_data,na.omit(model.frame(formulas_dataframe,DT)))  
model_data
answered 2016-10-03T15:01:43.750

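I wrote a function (predict.out.plm) for use after estimating first-difference or fixed-effects models with plm.

The function also attaches the predicted values to the index of the original data. It does this using the rownames saved within the plm object, attributes(plmobject)$index, and the rownames of the model.matrix.

For more details, see the function posted here: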

https://stackoverflow.com/a/44185441/2409896

answered 2017-05-25T16:44:52.587

This post is a while old now, but I believe the easiest way today is:

Fe.model.fitted <- cbind(FE.model$model, 
                         resid=FE.model$residuals, 
                         fitted=plm:::fitted_exp.plm(FE.model))

The function fitted_exp.plm is not exported by the plm package, but we can use ::: to access it.
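If relying on an unexported function is a concern, a rough alternative is to strip the panel-series class before combining the columns. The sketch below uses the `Produc` data from plm rather than the question's data; the object names are illustrative. It assumes the residuals line up row by row with the stored model frame, which is how the other answers in this thread use them.

```r
library(plm)
data("Produc", package = "plm")

fe <- plm(log(gsp) ~ log(pcap) + log(emp) + unemp,
          data = Produc, index = c("state", "year"),
          model = "within", effect = "individual")

# as.numeric() drops the pseries class and attributes, which is what made
# cbind() onto a data.frame fail in the question
y <- as.numeric(fe$model[[1]])
e <- as.numeric(residuals(fe))

out <- data.frame(y = y, resid = e, fitted = y - e)
head(out)
```

The rows of `out` follow the order of `fe$model`, not necessarily the order of the input data, so they still need to be matched back by id and time if the input was unsorted or had rows dropped.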

answered 2020-12-11T23:15:06.983