r - 为什么 PLM 会创建大量对象并且无法打开它们

Question

我正在研究一个包含 110 万个观测值 x 41 个变量的大型（但不是巨大的）数据库。数据排列为不平衡的面板。使用这些变量，我指定了三个不同的模型，并将它们分别运行为 1) 固定效应、2) 随机效应和 3) 合并 OLS 回归。

仅包含数据库的原始 .RData 文件约为 15Mb。包含数据库和回归结果（共 9 个回归）的 .RData 权重约为 650Mb。我确实意识到（来自基本文档）

An object of class c("plm","panelmodel").

A "plm" object has the following elements :

coefficients   the vector of coefficients,
vcov           the covariance matrix of the coefficients,
residuals      the vector of residuals,
df.residual    degrees of freedom of the residuals,
formula        an object of class ’pFormula’ describing the model,
model          a data.frame of class ’pdata.frame’ containing the variables usedfor the estimation: the response is in first position and the two indexes in the last positions,
ercomp         an object of class ’ercomp’ providing the estimation of the components of the
errors         (for random effects models only),
call           the call

即便如此，我还是无法理解为什么这些文件应该如此庞大。为了避免在处理plm对象时内存过载，我将它们保存在三个不同的文件中（现在每个文件的权重约为 200Mb）。一小时前我打电话summary来查看固定效应模型的结果，但它还没有显示任何结果。我现在的问题很简单。你觉得这是正常行为吗？我可以做些什么来减小plm对象大小并加快结果检索？

以下是您可能想知道的一些事情：

我使用的数据库是data.table格式
formula回归中的 s 是预先组装的，并包含在plm前面的调用中as.formula()，如此处所建议的。例子：

form<-y~x1+x2+x3+...+xn

mod.fe<-plm(as.formula(form), regr, effect="individual", model="within", index=c("id", "year"))

请让我知道我是否可以提供任何其他信息，并且您可能需要回答这个问题。

编辑

我设法建立了一个与我正在研究的具有相似特征的小型数据库。这里是：

structure(list(id = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 
5L, 5L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 
10L, 10L, 11L, 11L), year = structure(c(1L, 2L, 1L, 2L, 3L, 4L, 
1L, 2L, 1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 
1L, 2L, 3L, 4L, 3L, 4L, 1L, 2L), .Label = c("2000", "2001", "2002", 
"2003"), class = "factor"), study = c(3.37354618925767, 4.18364332422208, 
5.32950777181536, 4.17953161588198, 5.48742905242849, 5.73832470512922, 
6.57578135165349, 5.69461161284364, 6.3787594194582, 4.7853001128225, 
7.98380973690105, 8.9438362106853, 9.07456498336519, 7.01064830413663, 
10.6198257478947, 9.943871260471, 9.84420449329467, 8.52924761610073, 
3.52184994489138, 4.4179415601997, 5.35867955152904, 3.897212272657, 
5.38767161155937, 4.9461949594171, 3.62294044317139, 4.58500543670032, 
7.10002537198388, 6.76317574845754, 6.83547640374641, 6.74663831986349
), ethnic = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 
2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
1L, 1L, 2L, 2L), .Label = c("hispanic", "black", "chinese"), class = "factor"), 
    sport = c(0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0), health = structure(c(1L, 
    1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 2L, 2L, 3L, 3L, 
    3L, 3L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("none", 
    "drink", "both", "smoke"), class = "factor"), gradec = c(2.72806403942929, 
    3.10067738633308, 4.04728186632456, 2.19701362539883, 1.73115878111307, 
    5.35879931359977, 5.79613840739381, 5.07050219214859, 4.26224490644077, 
    3.53554192927934, 6.10515669475491, 7.18032957183198, 6.73191149590581, 
    6.49512764543435, 6.4783689354808, 6.19974636196512, 5.54014977312232, 
    6.72545652880344, 1.00223129492982, 1.08994269214495, 3.06702680106689, 
    1.70103126320561, 4.82973481729635, 3.14010240687364, 3.8068435242348, 
    5.01254268106181, 5.66497772013949, 4.16303452633342, 4.2751229553617, 
    3.05652055248093), event = c(1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 
    0), evm3 = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0), evm2 = c(0, 
    0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 1, 0, 0, 1, 1, 0, 0, 0, 0), evm1 = c(0, 1, 0, 1, 1, 1, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 
    1, 0, 0, 0, 0), evp1 = c(0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1), 
    evp2 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1), evp3 = c(0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 1, 0), ndm3 = c(1, 1, 1, 1, 1, 0, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 
    1, 1, 1, 1), ndm2 = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1), ndm1 = c(1, 
    0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 
    0, 0, 1, 0, 0, 0, 1, 0, 1, 0), ndp1 = c(0, 1, 0, 0, 0, 1, 
    0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 
    1, 0, 1, 0, 0), ndp2 = c(1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0), 
    ndp3 = c(1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 
    1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1)), .Names = c("id", 
"year", "study", "ethnic", "sport", "health", "gradec", "event", 
"evm3", "evm2", "evm1", "evp1", "evp2", "evp3", "ndm3", "ndm2", 
"ndm1", "ndp1", "ndp2", "ndp3"), class = "data.frame", row.names = c(NA, 
30L))

我使用的公式和plm调用是：

form<-gradec~year+study+ethnic+sport+health+event+evm3+evm2+evm1+evp1+evp2+evp3+ndm3+ndm2+ndm1+ndp1+ndp2+ndp3

plm.f<-plm(as.formula(form), data, effect="individual", model="within", index=c("id", "year"))

使用object.size()@BenBolker 的建议，我发现调用生成了一个plm权重为 64.5Kb 的对象，而原始数据帧的大小为 6.9Kb，这意味着结果大约是输入矩阵的 10 倍。然后我在这里设置了下面@zx8754 建议的选项，但不幸的是它们没有效果。当我终于打电话时summary(plm.f)，我收到了错误消息：

Error in crossprod(t(X), beta) : non-conformable arguments

我最终也用我的大型数据库得到了它，但只是经过数小时的计算。这里建议问题可能是由于系数矩阵是奇异的。is.matrix.singular()然而，在包中发现的奇点测试matrixcalc结果证明并非如此。

您可能想知道的另外几件事：

year,是ethnic因子health
除最后一个变量外，公式中的变量或多或少是不言自明的。event是在特定时间发生的假定的创伤事件。如果某年发生事件，则编码为 1，否则编码为 0。如果这些事件之一发生在前一年（减 1），则该变量evm1等于 1，否则为 0。同样，evp1如果事件发生在下一年（加 1），则为 1，否则为 0。变量ndm.和ndp.工作方式相同，但当距离不可观察（因为某个人的时间段太短）时它们被编码为 1，否则为 0。如此紧密相连的变量的存在引发了对完美共线性的怀疑。然而，如上所述，测试表明该矩阵是非奇异的。

让我再告诉一次，如果有人能回答这个问题，我将非常感激。

score 0 · Accepted Answer

关于错误信息Error in crossprod(t(X), beta) : non-conformable arguments：

正如建议的那样，这可能是由于模型矩阵中的奇异性。请记住，固定效应模型的模型矩阵是转换后的数据（转换后的数据框）。

因此，您将需要检查转换后数据的奇异性。即使原始数据不是线性相关的，固定效应变换也会导致线性相关（奇点）！该plm软件包有一个关于该问题的很好的文档?detect.lindep，我将在此部分重复（仅一个示例）：

### Example 1 ###
# prepare the data
data(Cigar)
Cigar[ , "fact1"] <- c(0,1)
Cigar[ , "fact2"] <- c(1,0)
Cigar.p <- pdata.frame(Cigar)

# setup a pFormula and a model frame
pform <- pFormula(price ~ 0 + cpi + fact1 + fact2)
mf <- model.frame(pform, data = Cigar.p)

# no linear dependence in the pooling model's model matrix
# (with intercept in the formula, there would be linear depedence)
detect.lindep(model.matrix(pform, data = mf, model = "pooling"))

# linear dependence present in the FE transformed model matrix
modmat_FE <- model.matrix(pform, data = mf, model = "within")
detect.lindep(modmat_FE)
mod_FE <- plm(pform, data = Cigar.p, model = "within")
detect.lindep(mod_FE) 
alias(mod_FE) # => fact1 == -1*fact2
plm(pform, data = mf, model = "within")$aliased # "fact2" indicated as aliased

所以你应该运行你的函数来检测你得到的模型的转换数据的线性依赖性model.matrix(you_model)。您可以使用 plm: 提供的函数detect.lindep，alias或任何适用于矩阵的函数。

您还可以查看您的 plm 模型对象： your_model$aliased查看估计中是否删除了某些变量。

r - 为什么 PLM 会创建大量对象并且无法打开它们

1 回答 1

Related

Reference