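I have a data frame with samples from four habitats over eight months. Ten samples were collected from each habitat each month, and the number of individuals of each species was counted in every sample. The code below generates a smaller data frame with a similar structure.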

# Pseudo data
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2), levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)

df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)

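I want to sum the total number of individuals across all sampled species, by month. I'm using ddply (preferred), but I'm open to other suggestions.

The closest I've gotten is adding up the per-column sums, as shown here.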

library(plyr)
ddply(df, ~ Month, summarize, tot_by_mon = sum(Species1) + sum(Species2) + sum(Species3))

#   Month tot_by_mon
# 1   Jan         84
# 2   Feb         92
# 3   Mar         67

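This works, but I'm wondering whether there is a general way to handle cases with an "unknown" number of species. That is, the first species always starts in column 4, but the last species could be in column 10 or column 42. I don't want to hard-code the actual species names into the summarize function. Note that the species names vary widely, e.g. Doryflav and Pheibica.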

4 Answers

Similar to @user's answer with data.table's melt, you can use tidyr's gather to reshape:

library(tidyr)
library(dplyr)
gather(df, Species, Value, matches("Species")) %>% 
  group_by(Month) %>% summarise(z = sum(Value))

# A tibble: 3 x 2
   Month     z
  <fctr> <int>
1    Jan    90
2    Feb    81
3    Mar    70

If you know the columns by position rather than by a "matches" pattern:

gather(df, Species, Value, -(1:3)) %>% 
  group_by(Month) %>% summarise(z = sum(Value))

(Results shown using @akrun's set.seed(123) example data.)
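
In tidyr 1.0.0 and later, gather is superseded by pivot_longer. Below is a minimal sketch of the same reshape-then-sum idea, assuming a recent tidyr and dplyr and the question's layout (species counts from column 4 onward). The names result, Value and tot_by_mon are illustrative only.

```r
library(tidyr)
library(dplyr)

# Rebuild the question's example data (seeded so it is reproducible)
set.seed(123)
df <- data.frame(
  Habitat  = factor(rep(c("Dry", "Wet"), each = 6), levels = c("Dry", "Wet")),
  Month    = factor(rep(rep(c("Jan", "Feb", "Mar"), each = 2), 2),
                    levels = c("Jan", "Feb", "Mar")),
  Sample   = rep(c(1, 2), 6),
  Species1 = rpois(12, 6),
  Species2 = rpois(12, 6),
  Species3 = rpois(12, 6)
)

# Everything except the first three columns is treated as a species column
result <- df %>%
  pivot_longer(cols = -(1:3), names_to = "Species", values_to = "Value") %>%
  group_by(Month) %>%
  summarise(tot_by_mon = sum(Value))

result
```

Because the selection is by position, no species names appear in the code, however many species columns there are.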

answered 2017-11-01T14:15:34.337

Here is another solution, using data.table, that doesn't require knowing the names of the "Species" columns:

library(data.table)

DT = melt(setDT(df), id.vars = c("Habitat", "Month", "Sample"))    
DT[, .(tot_by_mon=sum(value)), by = "Month"]

Or, if you want it compact, here is a one-liner:

melt(setDT(df), 1:3)[, .(tot_by_mon=sum(value)), by = "Month"]

Result:

   Month tot_by_mon
1:   Jan         90
2:   Feb         81
3:   Mar         70

Data: (a seed is set to make the example reproducible)

set.seed(123)
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2), levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)

df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)
answered 2017-10-31T20:36:33.103

Assuming the species columns all start with Species, you can select them by prefix and sum them using group_by %>% do:

library(tidyverse)
df %>% 
    group_by(Month) %>% 
    do(tot_by_mon = sum(select(., starts_with('Species')))) %>% 
    unnest()

# A tibble: 3 x 2
#   Month tot_by_mon
#  <fctr>      <int>
#1    Jan         63
#2    Feb         67
#3    Mar         58

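If the column names don't follow a pattern, you can select by column position, for example if the species columns run from column 4 to the end of the data frame: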

df %>% 
    group_by(Month) %>% 
    do(tot_by_mon = sum(select(., 4:ncol(.)))) %>% 
    unnest()

# A tibble: 3 x 2
#   Month tot_by_mon
#  <fctr>      <int>
#1    Jan         63
#2    Feb         67
#3    Mar         58
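
The position-based selection can also be written without any packages. The following is a rough base R sketch, assuming (as the question states) that the species counts occupy column 4 through the last column; species_cols, row_totals and result are made-up names for illustration.

```r
# Rebuild the question's example data (seeded so it is reproducible)
set.seed(123)
df <- data.frame(
  Habitat  = factor(rep(c("Dry", "Wet"), each = 6), levels = c("Dry", "Wet")),
  Month    = factor(rep(rep(c("Jan", "Feb", "Mar"), each = 2), 2),
                    levels = c("Jan", "Feb", "Mar")),
  Sample   = rep(c(1, 2), 6),
  Species1 = rpois(12, 6),
  Species2 = rpois(12, 6),
  Species3 = rpois(12, 6)
)

# Assumption: species counts start at column 4 and run to the end
species_cols <- 4:ncol(df)

# Total individuals per sample (row), across all species columns
row_totals <- rowSums(df[species_cols])

# Sum the per-sample totals within each month
tot_by_mon <- tapply(row_totals, df$Month, sum)

result <- data.frame(Month = names(tot_by_mon),
                     tot_by_mon = as.vector(tot_by_mon))
result
```

Summing per row first and then per month gives the same totals as summing each column per month and adding them up, since addition is order-independent.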
answered 2017-10-31T19:26:54.643

Here is another option with data.table, without reshaping to "long" format:

library(data.table)
setDT(df)[, .(tot_by_mon = Reduce(`+`, lapply(.SD, sum))), Month,
          .SDcols = Species1:Species3]
#   Month tot_by_mon
#1:   Jan         90
#2:   Feb         81
#3:   Mar         70
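
The call above still names Species1 and Species3 explicitly. A hedged variant, assuming the species counts occupy column 4 through the last column as the question describes, passes column positions to .SDcols instead. Here dt, species_cols and result are illustrative names.

```r
library(data.table)

# Rebuild the question's example data (seeded so it is reproducible)
set.seed(123)
df <- data.frame(
  Habitat  = factor(rep(c("Dry", "Wet"), each = 6), levels = c("Dry", "Wet")),
  Month    = factor(rep(rep(c("Jan", "Feb", "Mar"), each = 2), 2),
                    levels = c("Jan", "Feb", "Mar")),
  Sample   = rep(c(1, 2), 6),
  Species1 = rpois(12, 6),
  Species2 = rpois(12, 6),
  Species3 = rpois(12, 6)
)

# Work on a copy so the original data frame is left untouched
dt <- as.data.table(df)

# Assumption: species counts start at column 4 and run to the end
species_cols <- 4:ncol(dt)

# .SD holds only the species columns; flatten and sum them within each month
result <- dt[, .(tot_by_mon = sum(unlist(.SD))), by = Month,
             .SDcols = species_cols]
result
```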

Or with tidyverse, we can also use the map functions, which are efficient:

library(dplyr)
library(purrr)
df %>% 
  group_by(Month) %>%
  nest(starts_with('Species')) %>%
  mutate(tot_by_mon = map_int(data, ~sum(unlist(.x)))) %>% 
  select(-data)
# A tibble: 3 x 2
#    Month tot_by_mon
#   <fctr>      <int>
#1    Jan         90
#2    Feb         81
#3    Mar         70

Data

set.seed(123)
Habitat <- factor(c(rep("Dry",6), rep("Wet",6)), levels = c("Dry","Wet"))
Month <- factor(rep(c(rep("Jan",2), rep("Feb",2), rep("Mar",2)),2),
                        levels=c("Jan","Feb","Mar"))
Sample <- rep(c(1,2),6)
Species1 <- rpois(12,6)
Species2 <- rpois(12,6)
Species3 <- rpois(12,6)

df <- data.frame(Habitat,Month, Sample, Species1, Species2, Species3)
answered 2017-11-01T03:01:10.533