1

我有一个大型数据集,我一直在使用线性混合模型(约 600000 个观察值)进行分析。通常,我估计最小二乘均值并使用 lsmeans 包中的lsmeans进行成对对比。我还使用自由度的估计值来测试每个对比的显着性,使用t统计量,因为很多时候我的数字实际上非常低。但是,我不能在我的数据集上应用lsmeans。下面我展示了一个示例,其中我模拟了不同大小的数据集。

library(lme4)
library(lsmeans)

SmallData<-data.frame(R=rnorm(10000,23,4),A=sample(letters[1:2],10000,replace=TRUE),B=sample(letters[11:23],10000,replace=TRUE))
MediumData<-data.frame(R=rnorm(100000,23,4),A=sample(letters[1:2],100000,replace=TRUE),B=sample(letters[11:23],10000,replace=TRUE))
LargeData<-data.frame(R=rnorm(600000,23,4),A=sample(letters[1:2],600000,replace=TRUE),B=sample(letters[11:23],10000,replace=TRUE))

modelSmall<-lmer(R ~ B + (1|A),data=SmallData)
modelMedium<-lmer(R ~ B + (1|A),data=MediumData)
modelLarge<-lmer(R ~ B + (1|A),data=LargeData)

Small.lsm<-lsmeans(modelSmall,~B)

Small.lsm

$lsmeans
 B   lsmean         SE    df lower.CL upper.CL
 k 22.99691 0.05649338 74.85 22.88436 23.10945
 l 22.99451 0.05656906 75.68 22.88184 23.10719

Results are averaged over the levels of: C
Confidence level used: 0.95

$contrasts
 contrast    estimate        SE      df t.ratio p.value
 k - l    0.002397375 0.0799509 9992.92    0.03  0.9761

Results are averaged over the levels of: C

Small.lsm<-lsmeans(modelSmall,pairwise~B|C)

Small.lsm

$lsmeans
C = m:
 B   lsmean         SE     df lower.CL upper.CL
 k 22.98629 0.07980879 294.10 22.82923 23.14336
 l 23.01980 0.07946976 286.64 22.86338 23.17621

C = n:
 B   lsmean         SE     df lower.CL upper.CL
 k 23.00752 0.07996624 292.46 22.85014 23.16490
 l 22.96923 0.08053307 304.92 22.81075 23.12770

Confidence level used: 0.95

$contrasts
C = m:
 contrast    estimate        SE      df t.ratio p.value
 k - l    -0.03350181 0.1126268 9996.00  -0.297  0.7661

C = n:
 contrast    estimate        SE      df t.ratio p.value
 k - l     0.03829656 0.1134942 9994.74   0.337  0.7358


Medium.lsm<-lsmeans(modelMedium,pairwise~B|C) # Memory overload (~70 Gb RAM)

Large.lsm<-lsmeans(modelLarge,pairwise~B|C)

Large.lsm<-lsmeans(modelLarge,pairwise~B|C)
 *** caught segfault ***
address 0x7f9de00ff000, cause 'memory not mapped'

Traceback:
 1: t(ZZ) %*% EE %*% ZZ
 2: t(ZZ) %*% EE %*% ZZ
 3: .get_SigmaG(object, details)
 4: get_SigmaG.lmerMod(object, details)
 5: get_SigmaG(object, details)
 6: pbkrtest::vcovAdj.lmerMod(object, 0)
 7: lsm.basis.merMod(object, trms, xlev, grid, ...)
 8: lsm.basis(object, trms, xlev, grid, ...)
 9: ref.grid(object = <S4 object of class "lmerMod">, by = "C", contr = "pairwise")
10: do.call("ref.grid", rgargs)
11: lsmeans.default(object, specs, ...)
12: lsmeans.character.default(object, specs = all.vars(specs[-2]),     by = by, contr = contr.spec, ...)
13: lsmeans.character(object, specs = all.vars(specs[-2]), by = by,     contr = contr.spec, ...)
14: lsmeans(object, specs = all.vars(specs[-2]), by = by, contr = contr.spec,     ...)
15: lsmeans.formula(modelLarge, pairwise ~ B | C)
16: lsmeans(modelLarge, pairwise ~ B | C)

我也尝试过lmerTest包并使用了 difflsmeans函数,它也崩溃了。我可以使用一般线性假设检验(来自multcomp包的glht )获得平均估计值,但此过程不会估计特定对比的自由度。您是否知道适用于大型数据集的这些程序的替代方法?

非常非常感谢你。

4

0 回答 0