r - How to handle empty/incomplete subsets with data.table

Question

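I have a panel of data with year, country, and firm identifiers. I would like to fit a logit model to each year-country subset using data.table. If every year-country subset has enough entries to fit a model, I have no problems, but if a year-country subset does not have enough data, glm throws an error and I can't fit all of the models. (I get essentially the same error with lm.)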

Is there a solution within data.table? Or should I tidy my data upstream to make sure there are no year-country subsets with insufficient data?

Thanks!

library(data.table)

# similar data
DT <- data.table(year=rep(2001:2010, each=100),
                 country=rep(rep(1:10, each=10), 10), 
                 firm=rep(1:100, 10), 
                 y=round(runif(100)), 
                 x=runif(100)
                 )
setkey(DT, year, country)

# no problems if there are enough data per year-country subset
DT2 <- DT[, as.list(coef(glm(y ~ x, family="binomial"))), by="year,country"]

# but `glm` throws an error if a subset has only missing data
DT[(DT$year == 2001) & (DT$country == 1), "y"] <- NA
DT3 <- DT[, as.list(coef(glm(y ~ x, family="binomial"))), by="year,country"]

which yields

> DT3 <- DT[, as.list(coef(glm(y ~ x, family="binomial"))), by="year,country"]
Error in family$linkfun(mustart) : 
  Argument mu must be a nonempty numeric vector

score 4 · Accepted Answer

How about this?

fn <- function(x, y) {
  if (length(na.omit(y)) == 0)
    NULL
  else
    as.list(coef(glm(y ~ x, family="binomial")))
}

DT3 <- DT[, fn(x, y), by="year,country"]

You can obviously customize the error checking in fn for your particular purposes.
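As one possible customization, here is a minimal sketch that catches any fitting error instead of testing for one specific condition. The name safe_fit and the seeded sample data are assumptions made for illustration; they are not from the original post. Note that tryCatch with only an error handler lets warnings (for example, about perfect separation) pass through unchanged.

```r
library(data.table)

# Hypothetical sample data, shaped like the question's panel
set.seed(1)
DT <- data.table(year    = rep(2001:2010, each = 100),
                 country = rep(rep(1:10, each = 10), 10),
                 y       = round(runif(1000)),
                 x       = runif(1000))

# Make one year-country subset unusable
DT[year == 2001 & country == 1, y := NA_real_]

# safe_fit (hypothetical name): return the coefficients,
# or NULL if glm() raises any error for this subset
safe_fit <- function(x, y) {
  tryCatch(
    as.list(coef(glm(y ~ x, family = "binomial"))),
    error = function(e) NULL
  )
}

# Groups for which safe_fit() returns NULL are dropped from the result
DT3 <- DT[, safe_fit(x, y), by = .(year, country)]
```

The trade-off is that a blanket error handler also hides failures you might want to know about, so an explicit check like the one above may be preferable when you know exactly which condition to guard against.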

Update. If you want fn to potentially handle multiple columns of the data table, here is a solution:

fn <- function(df) {
  if (length(na.omit(df$y)) == 0)
    NULL
  else
    as.list(coef(glm(df$y ~ df$x, family="binomial")))
}

DT3 <- DT[, fn(.SD), by="year,country"]

Edit from Matthew

That's not how you should use data.table. There is no need to define a function. Just use the variables directly, like this:

DT3 <- DT[, 
  if (length(na.omit(y)) == 0)
    NULL
  else
    as.list(coef(glm(y ~ x, family="binomial")))
, by="year,country"]

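Repeating df$ inside fn() and calling fn(.SD) isn't recommended in data.table unless you really are using all the columns of .SD, e.g. by using .SDcols. It is quite common to have a fairly large multi-line { ... } block as j.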
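To illustrate the multi-line { ... } form of j, here is a minimal sketch that counts usable rows per subset and skips subsets below a threshold. The threshold of 3 observations, the extra n column, and the seeded sample data are assumptions chosen for illustration, not part of the original answer.

```r
library(data.table)

# Hypothetical sample data, shaped like the question's panel
set.seed(1)
DT <- data.table(year    = rep(2001:2010, each = 100),
                 country = rep(rep(1:10, each = 10), 10),
                 y       = round(runif(1000)),
                 x       = runif(1000))
DT[year == 2001 & country == 1, y := NA_real_]

DT3 <- DT[, {
  # Rows that glm() would actually use (its default na.action drops NAs)
  n_obs <- sum(!is.na(y) & !is.na(x))
  if (n_obs < 3L) {
    NULL  # too little data: this group is dropped from the result
  } else {
    fit <- glm(y ~ x, family = "binomial")
    # Return the coefficients plus the number of rows used
    c(as.list(coef(fit)), list(n = n_obs))
  }
}, by = .(year, country)]
```

Columns are referenced directly by name inside the braces, so no df$ prefix or .SD is needed, and intermediate values such as n_obs stay local to each group's evaluation.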
