
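I'm working with some oddly formatted survey data (collected and recorded by someone else). It records species abundance along survey transects, but each transect lists only the species that were observed there, not every species that could have been recorded. I spent some time working out how to reshape the data with tidyr so that every survey has a column for each species, with unrecorded species filled in as 0. Here is a short, reproducible example: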

# This works:
library(dplyr)  # %>% and group_by()
library(tidyr)  # spread()

Survey <- as.factor(c(rep("Survey 1", 10), rep("Survey 2", 10), rep("Survey 3", 10)))
Species <- as.factor(c(c("A","B","C","D","E","U","V","W","X","Y"),c("A","C","E","G","I","K","M","O","Q","S"),c("B","D","F","H","J","L","N","P","R","T")))
Abundance <- ceiling(runif(30, 1, 50))  # random integer abundances

working.df <- cbind.data.frame(Survey, Species, Abundance)

# One row per survey, one column per species; unobserved species become 0
working.spread <- working.df %>%
  group_by(Survey) %>%
  spread(Species, Abundance, drop = FALSE, fill = 0)

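Unfortunately, the real data are not that simple. In some cases the same species was entered on multiple rows within one survey, so that additional variables I'm not interested in could be recorded. I only care about the total abundance per survey. So here is an example of what the real data might look like (note the double "A" at the start of Species2):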

# This doesn't work:
Species2 <- as.factor(c(c("A","A","C","D","E","U","V","W","X","Y"),c("A","C","E","G","I","K","M","O","Q","S"),c("B","D","F","H","J","L","N","P","R","T")))

not.working.df <- cbind.data.frame(Survey, Species2, Abundance)

not.working.spread <- not.working.df %>%
  group_by(Survey) %>%
  spread(Species2, Abundance, drop = FALSE, fill = 0)

So when the same species is listed twice within a survey, the spread call no longer works and returns the familiar error:

Error: Duplicate identifiers for rows (1, 2)

In the real data set I get a lot of these duplicate errors (and this is only one of several data sets), so I'd rather not fix them by hand:

Error: Duplicate identifiers for rows (206, 216), (1532, 1544), (1052, 1595), (1324, 1330), (191, 212), (194, 211), (1392, 1600), (19, 37), (1404, 1599), (199, 215), (1073, 1596), (1074, 1597), (43, 44, 45), (455, 456), (380, 381, 382, 383), (447, 448), (413, 414, 415, 416, 417, 418), (303, 304), (1015, 1016), (897, 898, 1593), (1306, 1307), (1041, 1594), (1076, 1598), (1425, 1426), (49, 64), (198, 214) 
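
The row indices in that message are hard to read on their own. As a minimal sketch of one way to list the offending Survey/Species combinations instead, assuming dplyr is installed: the data frame `toy.df` and the result name `dup.pairs` below are made up for illustration and stand in for the real data.

```r
library(dplyr)

# Hypothetical toy data: species "A" is entered twice in Survey 1.
toy.df <- data.frame(
  Survey    = c("Survey 1", "Survey 1", "Survey 1", "Survey 2"),
  Species2  = c("A", "A", "C", "A"),
  Abundance = c(3, 5, 2, 7)
)

# Count rows per Survey/Species2 pair and keep only pairs that occur more than once.
dup.pairs <- toy.df %>%
  count(Survey, Species2) %>%
  filter(n > 1)

dup.pairs
```

Each row of `dup.pairs` is a Survey/Species pair that would trigger the duplicate-identifier error, with `n` giving the number of rows involved.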

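What I'd like to do is sum the Abundance field for the duplicated identifiers. I know there are similar questions here, and I've gone through many of them carefully, but I haven't found a solution yet. I've worked hard to get this far with spread, and it feels like I'm only one simple function call away from making it work... Any suggestions would be greatly appreciated. Or, if I've completely missed an existing answer to this question, please point me to it.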

Cheers


1 Answer


Thanks, aosmith, for pointing me toward the summarize thread; that did the trick. Here is the working solution:

not.working.spread<-not.working.df %>%
  group_by(Survey,Species2) %>%
  summarize(Abundance = sum(Abundance)) %>%
  spread(Species2,Abundance,drop=F,fill=0)
answered 2016-10-03T17:44:28.427
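
For readers on a newer tidyr, where spread() is superseded by pivot_wider(), the same sum-then-widen step can probably be done in a single call. This is only a sketch, assuming tidyr 1.1.0 or later (where `values_fn` accepts a bare function and `values_fill` accepts a scalar); the toy data frame `toy.df` and the result name `toy.wide` are made up for illustration.

```r
library(tidyr)

# Hypothetical toy data: species "A" is entered twice in Survey 1.
toy.df <- data.frame(
  Survey    = c("Survey 1", "Survey 1", "Survey 1", "Survey 2"),
  Species2  = c("A", "A", "C", "A"),
  Abundance = c(3, 5, 2, 7)
)

# values_fn = sum collapses duplicate Survey/Species2 rows by adding their abundances;
# values_fill = 0 fills in species that were not recorded in a survey.
toy.wide <- pivot_wider(
  toy.df,
  names_from  = Species2,
  values_from = Abundance,
  values_fn   = sum,
  values_fill = 0
)

toy.wide
```

Here the two "A" rows in Survey 1 are combined into one cell (3 + 5 = 8), and Survey 2 gets a 0 for species "C". Unlike spread(..., drop = FALSE), this sketch only creates columns for species that actually occur in the data.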