我正在研究一个包含 110 万个观测值 x 41 个变量的大型(但不是巨大的)数据库。数据排列为不平衡的面板。使用这些变量,我指定了三个不同的模型,并将它们分别运行为 1) 固定效应、2) 随机效应和 3) 合并 OLS 回归。
仅包含数据库的原始 .RData 文件约为 15Mb。包含数据库和回归结果(共 9 个回归)的 .RData 权重约为 650Mb。我确实意识到(来自基本文档)
An object of class c("plm","panelmodel").
A "plm" object has the following elements :
coefficients the vector of coefficients,
vcov the covariance matrix of the coefficients,
residuals the vector of residuals,
df.residual degrees of freedom of the residuals,
formula an object of class ’pFormula’ describing the model,
model a data.frame of class ’pdata.frame’ containing the variables usedfor the estimation: the response is in first position and the two indexes in the last positions,
ercomp an object of class ’ercomp’ providing the estimation of the components of the
errors (for random effects models only),
call the call
即便如此,我还是无法理解为什么这些文件应该如此庞大。为了避免在处理plm
对象时内存过载,我将它们保存在三个不同的文件中(现在每个文件的权重约为 200Mb)。一小时前我打电话summary
来查看固定效应模型的结果,但它还没有显示任何结果。我现在的问题很简单。你觉得这是正常行为吗?我可以做些什么来减小plm
对象大小并加快结果检索?
以下是您可能想知道的一些事情:
- 我使用的数据库是
data.table
格式 formula
回归中的 s 是预先组装的,并包含在plm
前面的调用中as.formula()
,如此处所建议的。例子:
form<-y~x1+x2+x3+...+xn
mod.fe<-plm(as.formula(form), regr, effect="individual", model="within", index=c("id", "year"))
请让我知道我是否可以提供任何其他信息,并且您可能需要回答这个问题。
编辑
我设法建立了一个与我正在研究的具有相似特征的小型数据库。这里是:
structure(list(id = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L,
5L, 5L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L,
10L, 10L, 11L, 11L), year = structure(c(1L, 2L, 1L, 2L, 3L, 4L,
1L, 2L, 1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 3L, 4L, 1L, 2L), .Label = c("2000", "2001", "2002",
"2003"), class = "factor"), study = c(3.37354618925767, 4.18364332422208,
5.32950777181536, 4.17953161588198, 5.48742905242849, 5.73832470512922,
6.57578135165349, 5.69461161284364, 6.3787594194582, 4.7853001128225,
7.98380973690105, 8.9438362106853, 9.07456498336519, 7.01064830413663,
10.6198257478947, 9.943871260471, 9.84420449329467, 8.52924761610073,
3.52184994489138, 4.4179415601997, 5.35867955152904, 3.897212272657,
5.38767161155937, 4.9461949594171, 3.62294044317139, 4.58500543670032,
7.10002537198388, 6.76317574845754, 6.83547640374641, 6.74663831986349
), ethnic = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L,
2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
1L, 1L, 2L, 2L), .Label = c("hispanic", "black", "chinese"), class = "factor"),
sport = c(0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0), health = structure(c(1L,
1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 2L, 2L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("none",
"drink", "both", "smoke"), class = "factor"), gradec = c(2.72806403942929,
3.10067738633308, 4.04728186632456, 2.19701362539883, 1.73115878111307,
5.35879931359977, 5.79613840739381, 5.07050219214859, 4.26224490644077,
3.53554192927934, 6.10515669475491, 7.18032957183198, 6.73191149590581,
6.49512764543435, 6.4783689354808, 6.19974636196512, 5.54014977312232,
6.72545652880344, 1.00223129492982, 1.08994269214495, 3.06702680106689,
1.70103126320561, 4.82973481729635, 3.14010240687364, 3.8068435242348,
5.01254268106181, 5.66497772013949, 4.16303452633342, 4.2751229553617,
3.05652055248093), event = c(1, 0, 1, 1, 1, 0, 0, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0,
0), evm3 = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0), evm2 = c(0,
0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 1, 1, 0, 0, 0, 0), evm1 = c(0, 1, 0, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
1, 0, 0, 0, 0), evp1 = c(0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1),
evp2 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1), evp3 = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0), ndm3 = c(1, 1, 1, 1, 1, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0,
1, 1, 1, 1), ndm2 = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1), ndm1 = c(1,
0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,
0, 0, 1, 0, 0, 0, 1, 0, 1, 0), ndp1 = c(0, 1, 0, 0, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
1, 0, 1, 0, 0), ndp2 = c(1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0),
ndp3 = c(1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1)), .Names = c("id",
"year", "study", "ethnic", "sport", "health", "gradec", "event",
"evm3", "evm2", "evm1", "evp1", "evp2", "evp3", "ndm3", "ndm2",
"ndm1", "ndp1", "ndp2", "ndp3"), class = "data.frame", row.names = c(NA,
30L))
我使用的公式和plm
调用是:
form<-gradec~year+study+ethnic+sport+health+event+evm3+evm2+evm1+evp1+evp2+evp3+ndm3+ndm2+ndm1+ndp1+ndp2+ndp3
plm.f<-plm(as.formula(form), data, effect="individual", model="within", index=c("id", "year"))
使用object.size()
@BenBolker 的建议,我发现调用生成了一个plm
权重为 64.5Kb 的对象,而原始数据帧的大小为 6.9Kb,这意味着结果大约是输入矩阵的 10 倍。然后我在这里设置了下面@zx8754 建议的选项,但不幸的是它们没有效果。当我终于打电话时summary(plm.f)
,我收到了错误消息:
Error in crossprod(t(X), beta) : non-conformable arguments
我最终也用我的大型数据库得到了它,但只是经过数小时的计算。
这里建议问题可能是由于系数矩阵是奇异的。is.matrix.singular()
然而,在包中发现的奇点测试matrixcalc
结果证明并非如此。
您可能想知道的另外几件事:
year
,是ethnic
因子health
- 除最后一个变量外,公式中的变量或多或少是不言自明的。
event
是在特定时间发生的假定的创伤事件。如果某年发生事件,则编码为 1,否则编码为 0。如果这些事件之一发生在前一年(减 1),则该变量evm1
等于 1,否则为 0。同样,evp1
如果事件发生在下一年(加 1),则为 1,否则为 0。变量ndm.
和ndp.
工作方式相同,但当距离不可观察(因为某个人的时间段太短)时它们被编码为 1,否则为 0。如此紧密相连的变量的存在引发了对完美共线性的怀疑。然而,如上所述,测试表明该矩阵是非奇异的。
让我再告诉一次,如果有人能回答这个问题,我将非常感激。