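I have a large data frame made up of factor and numeric variables (the numeric variables contain NAs). I want to find the number of observations for several numeric variables across the levels of one of the factor variables. Rather than handle each numeric variable separately, I tried using aggregate with dot notation or cbind to specify the numeric variables I want to group and count with length(). However, when I do this, aggregate gives the same number of observations for every variable, which I know is wrong. Is there something about aggregate and length that doesn't work with multiple variables?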

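Here is a simple example that illustrates the problem. var1 should have n=3 in every group, and it does when I handle it on its own, but with dot notation or cbind it just takes on the n values of var2.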

    df <- data.frame(group=c("a","b","c","a","b","c","a","b","c"), var1=1:9, var2=c(1,2,3,NA,5,6,7,8,9))
    aggregate(var1 ~ group, df, length) 
    aggregate(var2 ~ group, df, length) 
    aggregate(. ~ group, df, length)
    aggregate(cbind(var1,var2) ~ group, df, length)

1 Answer

Perhaps this helps:

df <- data.frame(group=c("a","b","c","a","b","c","a","b","c"),
                 var1=1:9, var2=c(1,2,3,NA,5,6,7,8,9))

with(df, length(cbind(var1, var2)))

> with(df, length(cbind(var1, var2)))
[1] 18

length() treats cbind(var1, var2) as a matrix, which is just a vector with dimensions, hence you get the length reported as prod(nrow(mat), ncol(mat)) where mat is the resulting matrix.

Ideally you'd use nrow() instead of length(), but perhaps more widely applicable is the NROW() function, which treats a vector as a 1-column matrix for the purposes of evaluating the function. nrow() won't work for a vector input:

> nrow(1:10)
NULL
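
As a quick side-by-side of the three functions mentioned above, here is a minimal sketch in base R. The object names `m` and `v` are made up for illustration only.

```r
# Hypothetical example objects: a 3 x 2 matrix and a plain vector
m <- cbind(1:3, 4:6)
v <- 1:10

length(m)  # 6: number of elements, i.e. nrow(m) * ncol(m)
nrow(m)    # 3: number of rows of the matrix
NROW(m)    # 3: same as nrow() for a matrix

length(v)  # 10
nrow(v)    # NULL: a plain vector has no dim attribute
NROW(v)    # 10: the vector is treated as a 1-column matrix
```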

E.g. try these:

aggregate(cbind(var1,var2) ~ group, df, NROW)
aggregate(var1 ~ group, df, NROW)

> aggregate(cbind(var1,var2) ~ group, df, NROW)
  group var1 var2
1     a    2    2
2     b    3    3
3     c    3    3
> aggregate(var1 ~ group, df, NROW)
  group var1
1     a    3
2     b    3
3     c    3

As you have NAs, you probably don't want the incomplete cases removed, which is what happens by default. This is seen above and is why the number of rows for group a is 2. To prevent that, add na.action = na.pass to the call:

aggregate(cbind(var1,var2) ~ group, df, NROW, na.action = na.pass)

> aggregate(cbind(var1,var2) ~ group, df, NROW, na.action = na.pass)
  group var1 var2
1     a    3    3
2     b    3    3
3     c    3    3

The issue is that in building up the data frame to pass to aggregate.data.frame, the usual model frame generation process takes place, and aggregate.formula has the na.action argument set to na.omit by default, which is standard behaviour in modelling functions that use formula interfaces.
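
To see the effect of that default in isolation, here is a rough sketch that calls model.frame() directly on the example data. It only approximates what aggregate does internally, and the names `mf_omit` and `mf_pass` are made up for illustration.

```r
df <- data.frame(group = c("a","b","c","a","b","c","a","b","c"),
                 var1 = 1:9, var2 = c(1,2,3,NA,5,6,7,8,9))

# Model frame built with na.omit (the default na.action of the formula method)
mf_omit <- model.frame(cbind(var1, var2) ~ group, data = df,
                       na.action = na.omit)

# Same formula, but keeping rows that contain NA
mf_pass <- model.frame(cbind(var1, var2) ~ group, data = df,
                       na.action = na.pass)

nrow(mf_omit)  # 8: the whole row with var2 = NA (group a) is dropped
nrow(mf_pass)  # 9: every row is kept

# Rows remaining per group after na.omit
table(mf_omit$group)
```

Because the entire row is dropped, var1 loses an observation in group a as well, even though var1 itself has no missing values.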

If you want to count the number of non-NA values per variable then you need a completely different approach, perhaps using is.na(), as in

foo <- function(x) sum(!is.na(x))
aggregate(cbind(var1,var2) ~ group, df, foo, na.action = na.pass)

> aggregate(cbind(var1,var2) ~ group, df, foo, na.action = na.pass)
  group var1 var2
1     a    3    2
2     b    3    3
3     c    3    3

This works because is.na(x) returns TRUE for each missing value, ! negates that so the non-NA values become TRUE, and sum() then coerces each TRUE to 1 and each FALSE to 0 before adding them up.
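
The same counting function should also work with the dot notation from the question, provided na.action = na.pass is supplied. A minimal sketch, where `count_non_na` and `res` are illustrative names rather than anything from base R:

```r
df <- data.frame(group = c("a","b","c","a","b","c","a","b","c"),
                 var1 = 1:9, var2 = c(1,2,3,NA,5,6,7,8,9))

# Hypothetical helper: number of non-missing values in x
count_non_na <- function(x) sum(!is.na(x))

# "." stands for every column of df other than group
res <- aggregate(. ~ group, data = df, FUN = count_non_na,
                 na.action = na.pass)
res
```

With the example data this gives 3 for var1 in every group and 2, 3, 3 for var2 in groups a, b and c.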

Answered 2013-05-17T20:34:04.547