r - Frequency of non-unique instances when conditioning on groups within and across data frames

Question

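I am analyzing employment data that records which firm each individual worked for in a given year. Each year is a separate data frame.

I would like to quickly identify individuals who worked for more than one firm within a given year, as well as individuals who worked for different firms across years. My goal is to compute, for each firm, how often it experiences an "exit" (an employee moving to another firm), both within a year (a single data frame) and between years.

The data frames are structured as follows: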

year1 <- data.frame(individual=c("1", "2", "3", "4", "2", "6", "7", "3", "9", "10"),
                firm=c("A", "B", "C", "D", "A", "C", "D", "B", "B", "C"))

year2 <- data.frame(individual=c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"),
                firm=c("A", "B", "D", "D", "A", "C", "D", "A", "B", "C"))

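I am fairly sure how to do this within a single year, by searching for all non-unique associations between individuals and firms, but I do not know how to do it across multiple data objects/years. Again, I am interested in the frequency of "exits" per firm, not per individual.

My ideal output would be the following frequencies, as a proportion of each firm's total number of employees: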

exit(withinyear)_byfirm
exit(betweenyear)_byfirm

score 0 · Accepted Answer

Counts, not proportions:

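within <- function(y) {
  # `data.frame()` stopped creating factors by default in R 4.0.0, so coerce
  # here; `levels()` below needs a factor.
  y$firm <- factor(y$firm)
  # A vector of length > 1 in the aggregate function means that the person has
  # changed jobs; every firm except the last one they worked at was exited.
  # `[` ignores the value 0 if there are other values present, and returns a
  # zero-length vector if not.  Often a source of confusion, but perfect here.
  table(levels(y$firm)[aggregate(firm ~ individual, data = y,
                                 function(x) {
                                   z <- unique(x)
                                   if (length(z) > 1) head(z, -1) else 0
                                 })$firm])
}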

between <- function(year1, year2) {
  # Last place worked in year1 (rows are assumed to be in chronological order)
  y1 <- do.call(rbind, by(year1, year1$individual, FUN = tail, 1))

  # First place worked in year2
  y2 <- do.call(rbind, by(year2, year2$individual, FUN = head, 1))

  # Combine these and look for duplicate individuals with the prior function
  y <- rbind(y1, y2)
  within(y)
}

Results:

> within(year1)

B C 
1 1 

> within(year2)
character(0)

> between(year1, year2)

A B 
1 1
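
Since the question asks for proportions of each firm's employees rather than counts, here is a minimal sketch of one way to get them for the within-year case. The helper name `exit_share_within` is hypothetical and not part of the answer above. It assumes rows within a year are in chronological order, and it takes "total employees" to mean the number of distinct individuals who appear at a firm in that year.

```r
# Example data copied from the question.
year1 <- data.frame(individual = c("1", "2", "3", "4", "2", "6", "7", "3", "9", "10"),
                    firm = c("A", "B", "C", "D", "A", "C", "D", "B", "B", "C"))

# Hypothetical helper: share of each firm's employees who left it within the year.
exit_share_within <- function(y) {
  # Keep one row per individual-firm pair, preserving row order.
  y <- unique(y[, c("individual", "firm")])
  firms <- sort(unique(y$firm))

  # Firms each person left: every firm they worked at except the last one.
  left <- unlist(lapply(split(y$firm, y$individual),
                        function(f) head(f, -1)))

  # Fixing the levels keeps firms with zero exits in the table.
  exits <- table(factor(left, levels = firms))
  staff <- table(factor(y$firm, levels = firms))  # distinct employees per firm

  exits / staff
}

exit_share_within(year1)
```

For `year1` this gives 0 for firms A and D and 1/3 for firms B and C: B and C each had three distinct employees, one of whom moved on. The same idea should carry over to the between-year case by passing in the combined last-row/first-row data frame that `between` builds, though the right denominator there (year1 staff, year2 staff, or both) is a judgment call.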
