r - Applying a calculation by group in an R data frame

Question

I have data like this:

object category country
495647 1        RUS  
477462 2        GER  
431567 3        USA  
449136 1        RUS  
367260 1        USA  
495649 1        RUS  
477461 3        GER  
431562 3        USA  
449133 2        RUS  
367264 2        USA  
...

Here an object can appear under different (category, country) pairs, and all countries share a single list of categories.

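I'd like to add another column holding the weight of each category within its country: the number of objects that fall in that category for that country, normalized so that the weights sum to 1 within a country (summing only over unique (category, country) pairs).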

I could do something like:

aggregate(df$object, list(df$category, df$country), length)

and then compute the weights from there, but is there a more efficient and elegant way to do this directly on the original data?
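For comparison, the aggregate() route mentioned above could be completed roughly as follows. This is only a sketch in base R: the sample rows are taken from the desired output further down, and the names `counts`, `n`, and `result` are illustrative, not from the question.

```r
# Sample data: the ten rows shown in the question's desired output.
df <- data.frame(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA"),
  stringsAsFactors = FALSE
)

# Count objects per (category, country) pair.
counts <- aggregate(object ~ category + country, data = df, FUN = length)
names(counts)[names(counts) == "object"] <- "n"  # 'n' is an illustrative name

# Divide each pair's count by the total count of its country.
counts$weight <- counts$n / ave(counts$n, counts$country, FUN = sum)

# Attach the weights to the original rows (merge() may reorder the rows).
result <- merge(df, counts[, c("category", "country", "weight")],
                by = c("category", "country"))
```

The extra merge step is what makes this route less direct than computing the weight on the original rows.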

Desired example output:

object category country weight
495647 1        RUS     .75
477462 2        GER     .5 
431567 3        USA     .5 
449136 1        RUS     .75
367260 1        USA     .25
495649 1        RUS     .75
477461 3        GER     .5
431562 3        USA     .5
449133 2        RUS     .25
367264 2        USA     .25
...

Taken over unique (category, country) pairs, the weights above sum to one within each country.

score 3 · Accepted Answer

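Responding specifically to the last sentence ("what is a more efficient and elegant way of doing this directly on the original data"): it just so happens that data.table has a new feature for exactly this.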

install.packages("data.table", repos="http://R-Forge.R-project.org")
# Needs version 1.8.1 from R-Forge.  Soon to be released to CRAN.

With your data in DT:

> DT[, countcat:=.N, by=list(country,category)]     # add 'countcat' column
    category country countcat
 1:        1     RUS        3
 2:        2     GER        1
 3:        3     USA        2
 4:        1     RUS        3
 5:        1     USA        1
 6:        1     RUS        3
 7:        3     GER        1
 8:        3     USA        2
 9:        2     RUS        1
10:        2     USA        1

> DT[, weight:=countcat/.N, by=country]     # add 'weight' column
    category country countcat weight
 1:        1     RUS        3   0.75
 2:        2     GER        1   0.50
 3:        3     USA        2   0.50
 4:        1     RUS        3   0.75
 5:        1     USA        1   0.25
 6:        1     RUS        3   0.75
 7:        3     GER        1   0.50
 8:        3     USA        2   0.50
 9:        2     RUS        1   0.25
10:        2     USA        1   0.25

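:= adds a column to the data by reference and is an "old" feature. What is new is that it now works by group. .N is a symbol that holds the number of rows in each group.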

These operations are memory efficient and should scale to large data, e.g. 1e8 or 1e9 rows.

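If you don't want to keep the intermediate column countcat, just remove it afterwards. Again, this is an efficient operation that works instantly regardless of the size of the table (it only moves pointers internally).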

> DT[,countcat:=NULL]     # remove 'countcat' column
    category country weight
 1:        1     RUS   0.75
 2:        2     GER   0.50
 3:        3     USA   0.50
 4:        1     RUS   0.75
 5:        1     USA   0.25
 6:        1     RUS   0.75
 7:        3     GER   0.50
 8:        3     USA   0.50
 9:        2     RUS   0.25
10:        2     USA   0.25
>
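The steps in this answer can also be reproduced end to end without the intermediate column. The following is a sketch, assuming a current data.table installed from CRAN and a table built from the ten rows printed above. The `as.numeric()` call is an addition here: it makes the temporary counts double, so the later division does not have to be coerced back into an integer column.

```r
library(data.table)

# Rebuild the example table from the rows printed in this answer.
DT <- data.table(
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA")
)

# First [ ]: rows per (country, category) pair, stored as double in 'weight'.
# Second [ ]: divide by rows per country; .N is re-evaluated for each group.
DT[, weight := as.numeric(.N), by = list(country, category)][
   , weight := weight / .N, by = country]

# Print explicitly, since recent releases do not echo the table after :=.
print(DT)
```

Both steps modify DT by reference, so no copy of the table is made along the way.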

score 2 · Accepted Answer

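I actually asked a similar question a while ago. data.table is a great fit for this, especially now that := by group is implemented and a self join is no longer needed, as shown above. The best base R solution is ave(). tapply() could also be used.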

This is similar to the solution above, but uses ave(). Still, I strongly recommend that you look into data.table.

df$count <- ave(x = df$object, df$country, df$category, FUN = length)
df$weight <- ave(x = df$count, df$country, FUN = function(x) x/length(x))
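If the intermediate count column is not wanted, the two ave() calls can be folded into one expression. This is a small variation on the code above rather than part of the original answer. The helper vector `ones` is an illustrative name, and the sample rows are the ones from the question's desired output.

```r
# Sample data: the ten rows shown in the question's desired output.
df <- data.frame(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA"),
  stringsAsFactors = FALSE
)

# Summing a vector of ones within a group gives that group's row count.
ones <- rep(1, nrow(df))

# Rows per (country, category) pair divided by rows per country.
df$weight <- ave(ones, df$country, df$category, FUN = sum) /
             ave(ones, df$country, FUN = sum)
```

Because ave() returns a vector aligned with the input rows, the row order of df is preserved.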

score 1 · Accepted Answer

I don't see a readable way to do this in one line, but it can be made quite compact.

# Use table to get the counts.
counts <- table(df[,2:3])
# Normalize the table
weights <- t(t(counts)/colSums(counts))
# Use 'matrix' selection by names.
df$weight <- weights[as.matrix(df[,2:3])]
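A variant of the same idea that refers to columns by name and builds the lookup index explicitly is sketched below. It swaps in prop.table() for the manual normalization. The explicit as.character() index is a precaution: as.matrix() on a data frame with mixed column types formats numeric columns as text, which may pad category codes of differing widths and break the name lookup. The sample rows are the ones from the question's desired output.

```r
# Sample data: the ten rows shown in the question's desired output.
df <- data.frame(
  object   = c(495647, 477462, 431567, 449136, 367260,
               495649, 477461, 431562, 449133, 367264),
  category = c(1, 2, 3, 1, 1, 1, 3, 3, 2, 2),
  country  = c("RUS", "GER", "USA", "RUS", "USA",
               "RUS", "GER", "USA", "RUS", "USA"),
  stringsAsFactors = FALSE
)

# Contingency table: rows are categories, columns are countries.
counts <- table(df$category, df$country)

# Normalize each country column so that it sums to 1.
weights <- prop.table(counts, margin = 2)

# Two-column character matrix: one (category, country) name pair per row of df.
idx <- cbind(as.character(df$category), as.character(df$country))
df$weight <- as.numeric(weights[idx])
```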
