r - R: How to aggregate data but keep informations from non-aggregated columns?

Question

Searched for a solution for two days to no avail so far.

I have bird observations from different observation points. The observers write down the species, where they have seen them, and for how long.

Now it happens that from different points, observations are taken from the same area, but we only want to process the maximum value per species in an area.

So first, i aggregated the data by observation point, species and area, and summed up the time.

dt.agg <- aggregate(time ~ observp + species + time, dt, sum)

UUPS: completly wrong command:

should have been:

dt.agg <- aggregate(time ~ observp + species + area, dt, sum)



   observp species area time
1       1a  Rm    A1        43.878488
2       1c  Rm    A1       296.152707
3        2  Rm    A1        29.546790
4       1a Swm    A1        34.127713
5       1b Swm    A1        11.076880
6        2 Swm    A1         8.771703

This worked ok. But now, I only need the maximum value for time for a species in an area, BUT i also need to know from which observation point these numbers were taken.

In my example, row 2 should be kept for Rm in A1, while rows 1 and 3 should be dropped. The same applies to row 4 (keep) and 5 + 6 (drop)

When i just do another aggregate with species and area over time and max, the info for the observation point is lost.

Can someone please show me a way to achieve this?

Cheers

Bernd

(now with a new account and no reputation .. thank you ... google!)

p.s. Please feel free to give this question a better headline

UPDATE: trying to post the dput(head(dt,100))-sample as suggested. The original dataset has over 1300 rows. Hope thats what you want to have.

    structure(list(species = structure(c(3L, 3L, 3L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 
5L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 
5L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 5L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 5L, 5L, 5L, 
3L, 3L, 3L, 3L, 3L, 3L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L), .Label = c("Bf", 
"Gr", "Rm", "Row", "Swm", "Wf", "Wsb", "Wst", "Ww"), class = "factor"), 
    area = structure(c(35L, 19L, 34L, 34L, 32L, 19L, 34L, 35L, 
    10L, 36L, 10L, 14L, 13L, 25L, 27L, 28L, 34L, 19L, 14L, 14L, 
    34L, 1L, 12L, 13L, 15L, 3L, 3L, 34L, 34L, 34L, 14L, 14L, 
    13L, 13L, 1L, 1L, 1L, 11L, 1L, 8L, 21L, 22L, 22L, 9L, 9L, 
    9L, 5L, 9L, 3L, 22L, 27L, 26L, 21L, 26L, 21L, 27L, 3L, 9L, 
    20L, 20L, 9L, 26L, 34L, 30L, 3L, 2L, 3L, 4L, 20L, 3L, 37L, 
    16L, 17L, 18L, 14L, 35L, 34L, 34L, 34L, 36L, 4L, 4L, 3L, 
    3L, 17L, 17L, 38L, 36L, 10L, 38L, 36L, 10L, 38L, 37L, 35L, 
    30L, 16L, 15L, 17L, 5L), .Label = c("A1", "A10", "A11", "A12", 
    "A13", "A14", "A15", "A16", "A17", "A18", "A2", "A3", "A4", 
    "A5", "A6", "A7", "A8", "A9", "O1", "O10", "O11", "O12", 
    "O13", "O14", "O15", "O16", "O17", "O18", "O19", "O2", "O20", 
    "O21", "O22", "O3", "O4", "O5", "O7", "O8", "O9"), class = "factor"), 
    observp = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L), .Label = c("1a", "1b", "1c", "2", "3", "4"), class = "factor"), 
    time = c(36.37086972, 2.730715967, 1.891286914, 3.782573827, 
    4.496276059, 5.461431934, 18.91286914, 13.22577081, 5.823001976, 
    5.392743201, 3.882001317, 16.97305991, 6.094384821, 5.274262222, 
    5.462035947, 2.089427691, 7.565147654, 21.84572774, 25.45958986, 
    16.97305991, 7.565147654, 4.875387532, 8.885792099, 4.062923214, 
    6.636122805, 7.038317277, 10.55747592, 7.565147654, 7.565147654, 
    3.782573827, 25.45958986, 25.45958986, 12.18876964, 12.18876964, 
    19.50155013, 19.50155013, 9.750775065, 39.20627398, 4.875387532, 
    6.423076843, 2.436283538, 1.823249104, 1.823249104, 16.72889022, 
    41.82222555, 33.45778044, 12.30932064, 117.1022315, 3.519158639, 
    1.823249104, 27.31017974, 11.11346598, 4.872567077, 11.11346598, 
    4.872567077, 5.462035947, 3.519158639, 16.72889022, 14.86012871, 
    8.916077225, 25.09333533, 22.22693195, 3.782573827, 5.184879322, 
    10.55747592, 8.509038411, 10.55747592, 17.70988435, 5.944051483, 
    3.519158639, 17.69229328, 34.70586347, 5.966017168, 3.092236431, 
    2.828843318, 6.612885403, 3.782573827, 3.782573827, 7.565147654, 
    5.392743201, 17.70988435, 17.70988435, 3.519158639, 2.346105759, 
    11.93203434, 11.93203434, 2.386548395, 0.898790534, 0.64700022, 
    2.386548395, 0.898790534, 0.64700022, 2.684866944, 6.634609979, 
    1.239916013, 1.944329746, 3.2536747, 3.732819078, 6.711769315, 
    2.307997621)), .Names = c("species", "area", "observp", "time"
), row.names = c(NA, 100L), class = "data.frame")

score 2 · Accepted Answer

您还可以看看另一个base功能，by. 输出是一个列表，其中每个元素是不同组合的结果INDICES。

bb <- by(data = df, INDICES = list(df$species, df$area), function(x) x[which.max(x$time), ])
bb
# : Rm
# : A1
# observp species area     time
# 2      1c      Rm   A1 296.1527
# -------------------------------------------------------------------- 
# : Swm
# : A1
# observp species area     time
# 4      1a     Swm   A1 34.12771

如果要将列表转换为data.frame：

df2 <- do.call(rbind, bb)
df2
# observp species area      time
# 2      1c      Rm   A1 296.15271
# 4      1a     Swm   A1  34.12771

另一种选择：

library(plyr)
ddply(.data = df, .variables = .(species, area), subset,
  time == max(time))

score 0 · Accepted Answer

一个例子是

stulevel_agg_2 <- stulevel[, list(a1=mean(ability, na.rm = TRUE), a2=last(school, na.rm=T)),by = grade]

a1, a2是新的列名。last可以取组内的最后一个元素，但需要先加载xts。

r - R: How to aggregate data but keep informations from non-aggregated columns?

2 回答 2

Related

Reference