I am analyzing employment data that records the firm each individual worked for in a given year; each year is a separate data frame.

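I'd like to quickly identify individuals who worked for more than one firm within a given year, as well as individuals who changed firms between years. My goal is to compute, for each firm, how often it experiences an "exit" (an employee moving to another firm), both within a year (a single data frame) and across years.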

The data frames are structured as follows:

year1 <- data.frame(individual=c("1", "2", "3", "4", "2", "6", "7", "3", "9", "10"),
                firm=c("A", "B", "C", "D", "A", "C", "D", "B", "B", "C"))

year2 <- data.frame(individual=c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"),
                firm=c("A", "B", "D", "D", "A", "C", "D", "A", "B", "C"))

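I'm fairly sure how to do this within a single year, by searching for all non-unique associations between individuals and firms, but I don't know how to do it across multiple data objects/years. Again, I'm interested in the frequency of "exits" per firm, not in particular individuals.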

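My ideal output is the frequency/proportion of exits relative to each firm's total number of employees, with columns like these: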

exit(withinyear)_byfirm
exit(betweenyear)_byfirm

1 Answer

This gives counts rather than proportions:

within <- function(y) {
  # Assumes y$firm is a factor: data.frame() created factors by default before
  # R 4.0.0; on newer versions, build the data with stringsAsFactors = TRUE.
  # A vector of length > 1 in the aggregate function means that the person has
  # changed jobs.
  # `[` ignores the value 0 if there are other values present, and returns a
  # zero-length vector if not.  Often a source of confusion, but perfect here.
  table(levels(y$firm)[aggregate(firm~individual, data=y,
                                 function(x) {z <- unique(x)
                                              if(length(z) > 1) head(z, -1) else 0})$firm])
}

between <- function(year1, year2) {
  # Last place worked in year1
  y1 <- rbind(do.call(rbind, by(year1, year1$individual, FUN=tail, 1)))

  # First place worked in year2
  y2 <- rbind(do.call(rbind, by(year2, year2$individual, FUN=head, 1)))

  # Combine these and look for duplicate individuals with the prior function
  y <- rbind(y1, y2)
  within(y)
}

Results:

> within(year1)

B C 
1 1 

> within(year2)
character(0)

> between(year1, year2)

A B 
1 1 
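
If proportions are wanted instead, one possible sketch is to divide the exit counts by each firm's headcount. The helper name `exit_share` is hypothetical, and the denominator is an assumption: the number of distinct individuals who appear at the firm in that year. The exit counts are typed in by hand here to match the `within(year1)` output above, so the block runs on its own.

```r
# Same year1 data as in the question
year1 <- data.frame(individual=c("1", "2", "3", "4", "2", "6", "7", "3", "9", "10"),
                    firm=c("A", "B", "C", "D", "A", "C", "D", "B", "B", "C"))

# Hypothetical helper (not part of the original answer): convert exit counts
# into a share of each firm's headcount.
exit_share <- function(exits, year) {
  # Assumption: headcount = distinct individuals seen at the firm in `year`
  pairs <- unique(data.frame(individual = as.character(year$individual),
                             firm = as.character(year$firm)))
  headcount <- table(pairs$firm)

  # Start every firm at 0 exits, then fill in the firms that had exits
  counts <- setNames(numeric(length(headcount)), names(headcount))
  counts[names(exits)] <- as.numeric(exits)

  counts / as.numeric(headcount)
}

# Exit counts copied from the within(year1) result: B = 1, C = 1
res <- exit_share(c(B = 1, C = 1), year1)
print(res)  # A = 0, B = 1/3, C = 1/3, D = 0
```

The same helper should accept the table returned by `within(year1)` or `between(year1, year2)` directly, with `year1` supplying the headcount in both cases, since the exits are counted against the firm that was left.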
Answered 2013-04-17T02:51:21.187