r - How to neatly apply complex model output to a data.table by a factor

Question

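I am using the normalmixEM function (algorithm) in R on a data.table object.

Running it on the whole table is a simple process. It outputs a mixEM list object, of which the $posterior item is of most interest. I can map this back onto the data using cbind, in an admittedly slightly awkward way, as follows: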

library(data.table)
library(ggplot2)
library(mixtools)
set.seed(100)

faithfulDT <- data.table(faithful)
faithfulDT[, factorAB := rep(c('a', 'b'), length.out = .N)]
# Make some data

qplot(data = faithfulDT, x = eruptions, fill = factorAB) + facet_grid(factorAB ~ .)
# graph the distribution

faithfulMix <- faithfulDT[, normalmixEM(eruptions)]
cbind(faithfulDT, data.table(faithfulMix$posterior)) # to join posterior probabilities to values. I'm ASSUMING this is the best way to do it?
plot(faithfulMix, whichplots = 2)
# model and graph without factorAB

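However, I am struggling to work data.table's by argument into this workflow efficiently. My aim is to run the normalmixEM function by factorAB. In effect, I want to run two separate, isolated models on subsets of the data and then, in line with data.table's underlying "split-apply-combine" strategy, put the resulting values for each row into two columns at the end of the table.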

faithfulMixAB <- faithfulDT[, normalmixEM(eruptions), by = factorAB]
# model and graph with factorAB - attempt by

faithfulMixAB <- faithfulDT[, normalmixEM(.SD$eruptions), by = factorAB]
# model and graph with factorAB - attempt by and .SD

faithfulMixAB <- faithfulDT[, normalmixEM(.SD), by = factorAB, .SDcols = "eruptions"]
# model and graph with factorAB - attempt by and .SD and .SDcols

faithfulMixAB <- faithfulDT[, lapply(.SD, normalmixEM), by = factorAB, .SDcols = "eruptions"]
# model and graph by factorAB - lapply
faithfulMixAB
# partial success?

faithfulMixABAssign <- faithfulDT[, mixMDL := lapply(.SD, normalmixEM), by = factorAB, .SDcols = "eruptions"]
# model and graph by factorAB - lapply and try to assign
faithfulMixABAssign
# even more partial success?

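Clearly, I have managed to hack my way to something that seems to contain the right numbers, but in largely arbitrary positions.

What am I missing in this workflow, with the factorAB split included, that would tidy up the output? Obviously I need to find a replacement for the cbind part, but my current output is a mess to begin with. Can I improve the output of faithfulMixAB to make this easier? Or could I skip this entirely and assign the posterior values directly from a function run inside data.table?

Edit

With the help of @eddi and a friend IRL, this is where I am now: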

faithfulDT[, mixPostFull.1 := normalmixEM(eruptions)$posterior[,1]]
faithfulDT[, mixPostFull.2 := normalmixEM(eruptions)$posterior[,2]]

which gives the two posterior columns from running the model without splitting by the factor. And:

faithfulDT[, mixPostAB.1 := normalmixEM(eruptions)$posterior[,1], by = factorAB]
faithfulDT[, mixPostAB.2 := normalmixEM(eruptions)$posterior[,2], by = factorAB]

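which also gives two columns, but this time split by the factor, which is what I actually want to do.

I believe both columns are needed, because the posterior object is really 2 vectors: one gives the probability that a record belongs to the first group, the other the probability that it belongs to the second.

Eddi, your current answer has 2 columns, but I don't think they correspond to the ones listed above. If anything, the values should differ only slightly: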

     eruptions waiting factorAB mixPostFull.1 mixPostFull.2  mixPostAB.1  mixPostAB.2
  1:     3.600      79        a  5.376906e-10  1.000000e+00 1.581467e-11 1.000000e+00
  2:     1.800      54        b  9.999998e-01  1.723648e-07 1.000000e+00 2.112761e-09
  3:     3.333      74        a  1.755506e-06  9.999982e-01 1.405098e-07 9.999999e-01
  4:     2.283      62        b  9.999406e-01  5.939085e-05 9.999974e-01 2.599843e-06
  5:     4.533      85        a  2.215050e-25  1.000000e+00 3.658846e-29 1.000000e+00
 ---                                                                                 
268:     4.117      81        b  6.337730e-18  1.000000e+00 9.658721e-10 1.000000e+00
269:     2.150      46        a  9.999912e-01  8.828998e-06 9.999828e-01 1.724380e-05
270:     4.417      90        b  3.320219e-23  1.000000e+00 1.461450e-12 1.000000e+00
271:     1.817      46        a  9.999998e-01  2.012672e-07 9.999995e-01 4.981589e-07
272:     4.467      74        b  3.912776e-24  1.000000e+00 4.818983e-13 1.000000e+00

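What I really need is a way to avoid running the model twice. I am fairly sure I could fudge something with ':=' somewhere, but I don't have time right now. I will come back to it later.

Some time later: I overlooked earlier that I can't simply re-run the model to get the second column. Besides being obviously inefficient, the algorithm starts from a different point on each run, so unless I set the seed it gives a slightly different answer from one run to the next.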
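To illustrate the point about starting values, here is a minimal sketch, assuming a two-component fit (k = 2) on the eruptions column. The names x, fitA, fitB and same_seed_identical are made up for this illustration. Resetting the seed before each fit should make the two runs coincide, and a single fit already carries both posterior columns:

```r
library(mixtools)

x <- faithful$eruptions

# Reset the seed before each fit so both runs draw the same random start.
set.seed(100)
fitA <- normalmixEM(x, k = 2)
set.seed(100)
fitB <- normalmixEM(x, k = 2)

# With identical seeds the posterior matrices should match exactly.
same_seed_identical <- identical(fitA$posterior, fitB$posterior)

# One fit holds the whole posterior matrix:
# one row per observation, one column per component.
dim(fitA$posterior)
```

So the safer route is to fit once and take both columns from that one $posterior matrix, rather than refitting for the second column.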

score 1 · Accepted Answer

I think you're looking for something like this:

faithfulDT[, {
               result = as.vector(normalmixEM(eruptions)$posterior);
               faithfulDT[, paste0('result.', factorAB) := result];
               NULL
             }
           , by = factorAB]
faithfulDT
#     eruptions waiting factorAB     result.a     result.b
#  1:     3.600      79        a 1.581719e-11 1.000000e+00
#  2:     1.800      54        b 1.405263e-07 9.999974e-01
#  3:     3.333      74        a 3.660230e-29 9.531090e-01
#  4:     2.283      62        b 5.986926e-33 3.630698e-05
#  5:     4.533      85        a 9.999983e-01 6.384911e-12
# ---                                                     
#268:     4.117      81        b 6.545978e-07 1.000000e+00
#269:     2.150      46        a 2.342451e-06 1.562445e-06
#270:     4.417      90        b 1.000000e+00 1.000000e+00
#271:     1.817      46        a 1.724380e-05 1.000000e+00
#272:     4.467      74        b 4.981589e-07 1.000000e+00

After the discussion in the comments and in the OP, it turned out that the desired answer is:

faithfulDT[, c('mixAB.1', 'mixAB.2') := as.data.table(normalmixEM(eruptions)$posterior)
           , by = factorAB]
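For completeness, here is a self-contained sketch of that last form. It assumes a two-component fit (k = 2) and rebuilds faithfulDT with an alternating factorAB via length.out, which is my assumption about the intended setup. The final line is only a sanity check that the two posterior columns sum to 1 in every row:

```r
library(data.table)
library(mixtools)

set.seed(100)
faithfulDT <- data.table(faithful)
faithfulDT[, factorAB := rep(c("a", "b"), length.out = .N)]

# Fit once per group. The posterior matrix has one column per component,
# so as.data.table() yields two columns that := assigns in a single pass.
faithfulDT[, c("mixAB.1", "mixAB.2") :=
             as.data.table(normalmixEM(eruptions, k = 2)$posterior),
           by = factorAB]

# Sanity check: the two posterior probabilities in each row should sum to 1.
faithfulDT[, all.equal(mixAB.1 + mixAB.2, rep(1, .N))]
```

Because the model is fitted only once per group, both columns come from the same run. Note that the component order is not guaranteed to match between groups, so it may be worth checking that mixAB.1 refers to the same cluster in "a" and in "b".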
