
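So I have this data, and what I want to do is create a variable that flags the group with the highest status in each country in each given year. Each group can have one of the following statuses:

* 1 = monopoly
* 2 = dominant
* 3 = senior
* 4 = junior
* 5 = discriminated

A group with status 1 or 2 automatically has the highest status, because each country has only one group holding that status in any given year. However, some countries have several groups with status 3 (and sometimes 3 is the highest status any group in that country reaches that year). In that case, I want the largest group to be coded as having the highest status. How can I do this?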

Data

 D1 <- data.frame(row = c(1, 2, 3, 4, 5, 6, 7 , 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
           country = c("US", "US", "US", "US", "US", "US", "US", "US","US", "US", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada"),
           year = c(1991, 1992, 1993, 1994, 1995, 1991, 1992, 1993, 1994, 1995, 1991, 1992, 1993, 1994, 1995, 1991, 1992, 1993, 1994, 1995), 
           group = c("White", "White", "White", "White", "White", "Latino", "Latino", "Latino", "Latino", "Latino","English", "English", "English", "English", "English", "French", "French", "French", "French", "French"), 
           groupstatus = c("1", "1", "1", "3", "3", "5", "5","5", "3", "3", "2", "2", "2", "3", "3", "3", "3", "3", "3", "4"), 
           groupsize= c(0.7, 0.7, 0.7, 0.7, 0.7, 0.15, 0.15, 0.15, 0.15, 0.15, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2))

Desired output

D1 <- data.frame(row = c(1, 2, 3, 4, 5, 6, 7 , 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20), 
country = c("US", "US", "US", "US", "US", "US", "US", "US","US", "US", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada", "Canada"), 
year = c(1991, 1992, 1993, 1994, 1995, 1991, 1992, 1993, 1994, 1995, 1991, 1992, 1993, 1994, 1995, 1991, 1992, 1993, 1994, 1995), 
group = c("White", "White", "White", "White", "White", "Latino", "Latino", "Latino", "Latino", "Latino","English", "English", "English", "English", "English", "French", "French", "French", "French", "French"),            
groupstatus = c("1", "1", "1", "3", "3", "5", "5","5", "3", "3", "2", "2", "2", "3", "3", "3", "3", "3", "3", "4"), 
groupsize= c(0.7, 0.7, 0.7, 0.7, 0.7, 0.15, 0.15, 0.15, 0.15, 0.15, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2), 
highest= c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0))

1 Answer


Here is one approach using data.table.

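We convert the 'data.frame' to a 'data.table' (setDT(D1)). Grouped by 'country' and 'year', we create a binary column 'highest' based on the presence of the values 1 and 2 in 'groupstatus' (this could also be done in a single step, but I split it up to make it easier to follow).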

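In the next step, grouped by the same columns, we check whether all elements of 'groupstatus' are 3 (all(groupstatus==3)). If so, we take the logical index of the maximum 'groupsize' (groupsize==max(groupsize)). Otherwise, we take the rows where no group in that country-year has been flagged yet (!any(highest)) and 'groupstatus' is 3 (groupstatus==3). The resulting logical vector is converted to numeric row indices with .I. We extract the row index column ($V1) and use it to set 'highest' to 1 for those rows.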

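 library(data.table)
 # Step 1: flag groups holding status 1 or 2 (monopoly or dominant)
 setDT(D1)[, highest := +(groupstatus %in% 1:2), .(country, year)]
 # Step 2: row indices of the status-3 groups to flag in each country-year
 indx <- D1[, .I[if(all(groupstatus==3)) groupsize==max(groupsize)
      else !any(highest) & groupstatus==3], .(country, year)]$V1
 D1[indx, highest := 1L]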
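
The two steps above could also be folded into a single grouped assignment. The following is only a sketch under a few assumptions, not the code from the answer: D2 is a hypothetical name for a small subset of the question's data (Canada, 1993 to 1995), a country-year whose best status is 4 or 5 is assumed to get no flag, and ties for the best status are broken by 'groupsize'.

```r
library(data.table)

# D2 is a hypothetical name: Canada 1993-1995, copied from the question's data
D2 <- data.table(
  country     = "Canada",
  year        = c(1993, 1994, 1995, 1993, 1994, 1995),
  group       = rep(c("English", "French"), each = 3),
  groupstatus = c("2", "3", "3", "3", "3", "4"),
  groupsize   = c(0.1, 0.1, 0.1, 0.2, 0.2, 0.2)
)

D2[, highest := {
  s   <- as.integer(groupstatus)   # statuses are stored as character strings
  # candidates: groups sharing the best status, provided it is 1, 2 or 3
  top <- s == min(s) & s <= 3
  # keep the largest candidate; the -Inf keeps max() quiet when there is none
  +(top & groupsize == max(groupsize[top], -Inf))
}, by = .(country, year)]

print(D2)
```

On this subset the flags agree with the 'highest' column in the desired output (rows 13 to 15 and 18 to 20). Unlike the two-step version, this sketch would also pick only the largest group when several status-3 groups appear alongside lower-status groups in the same country-year.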
answered 2015-08-06T08:42:21.147