我有一个大型数据集,我一直在使用线性混合模型(约 600000 个观察值)进行分析。通常,我估计最小二乘均值并使用 lsmeans 包中的lsmeans进行成对对比。我还使用自由度的估计值来测试每个对比的显着性,使用t统计量,因为很多时候我的数字实际上非常低。但是,我不能在我的数据集上应用lsmeans。下面我展示了一个示例,其中我模拟了不同大小的数据集。
library(lme4)
library(lsmeans)
SmallData<-data.frame(R=rnorm(10000,23,4),A=sample(letters[1:2],10000,replace=TRUE),B=sample(letters[11:23],10000,replace=TRUE))
MediumData<-data.frame(R=rnorm(100000,23,4),A=sample(letters[1:2],100000,replace=TRUE),B=sample(letters[11:23],10000,replace=TRUE))
LargeData<-data.frame(R=rnorm(600000,23,4),A=sample(letters[1:2],600000,replace=TRUE),B=sample(letters[11:23],10000,replace=TRUE))
modelSmall<-lmer(R ~ B + (1|A),data=SmallData)
modelMedium<-lmer(R ~ B + (1|A),data=MediumData)
modelLarge<-lmer(R ~ B + (1|A),data=LargeData)
Small.lsm<-lsmeans(modelSmall,~B)
Small.lsm
$lsmeans
B lsmean SE df lower.CL upper.CL
k 22.99691 0.05649338 74.85 22.88436 23.10945
l 22.99451 0.05656906 75.68 22.88184 23.10719
Results are averaged over the levels of: C
Confidence level used: 0.95
$contrasts
contrast estimate SE df t.ratio p.value
k - l 0.002397375 0.0799509 9992.92 0.03 0.9761
Results are averaged over the levels of: C
Small.lsm<-lsmeans(modelSmall,pairwise~B|C)
Small.lsm
$lsmeans
C = m:
B lsmean SE df lower.CL upper.CL
k 22.98629 0.07980879 294.10 22.82923 23.14336
l 23.01980 0.07946976 286.64 22.86338 23.17621
C = n:
B lsmean SE df lower.CL upper.CL
k 23.00752 0.07996624 292.46 22.85014 23.16490
l 22.96923 0.08053307 304.92 22.81075 23.12770
Confidence level used: 0.95
$contrasts
C = m:
contrast estimate SE df t.ratio p.value
k - l -0.03350181 0.1126268 9996.00 -0.297 0.7661
C = n:
contrast estimate SE df t.ratio p.value
k - l 0.03829656 0.1134942 9994.74 0.337 0.7358
Medium.lsm<-lsmeans(modelMedium,pairwise~B|C) # Memory overload (~70 Gb RAM)
Large.lsm<-lsmeans(modelLarge,pairwise~B|C)
Large.lsm<-lsmeans(modelLarge,pairwise~B|C)
*** caught segfault ***
address 0x7f9de00ff000, cause 'memory not mapped'
Traceback:
1: t(ZZ) %*% EE %*% ZZ
2: t(ZZ) %*% EE %*% ZZ
3: .get_SigmaG(object, details)
4: get_SigmaG.lmerMod(object, details)
5: get_SigmaG(object, details)
6: pbkrtest::vcovAdj.lmerMod(object, 0)
7: lsm.basis.merMod(object, trms, xlev, grid, ...)
8: lsm.basis(object, trms, xlev, grid, ...)
9: ref.grid(object = <S4 object of class "lmerMod">, by = "C", contr = "pairwise")
10: do.call("ref.grid", rgargs)
11: lsmeans.default(object, specs, ...)
12: lsmeans.character.default(object, specs = all.vars(specs[-2]), by = by, contr = contr.spec, ...)
13: lsmeans.character(object, specs = all.vars(specs[-2]), by = by, contr = contr.spec, ...)
14: lsmeans(object, specs = all.vars(specs[-2]), by = by, contr = contr.spec, ...)
15: lsmeans.formula(modelLarge, pairwise ~ B | C)
16: lsmeans(modelLarge, pairwise ~ B | C)
我也尝试过lmerTest包并使用了 difflsmeans函数,它也崩溃了。我可以使用一般线性假设检验(来自multcomp包的glht )获得平均估计值,但此过程不会估计特定对比的自由度。您是否知道适用于大型数据集的这些程序的替代方法?
非常非常感谢你。