
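I have multiple sites, and each site was visited multiple times. I want to subset the data so that it contains only one visit per site (but all observations from that visit), and that visit should be the one closest in time to the median date of all visits across all sites.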

Sample data:

d = data.table(site = c('a', 'a','a','a','b', 'b','b', 'b', 'c', 'c', 'c', 'c'), 
       sex = c('m','f','m','f','m','f','m','f','m','f','m','f'), 
       date = c(127,127, 185, 185, 132,132, 189,189, 119,119, 178, 178), 
       count = c(12, 15, 10, 9, 18, 22,12, 15, 10, 9, 18, 22)) 

What I would like to get:

d = data.table(site = c('a', 'a','b', 'b', 'c', 'c'), 
     sex = c('m','f','m','f','m','f'),
     date = c(127,127, 132,132, 178, 178), 
     count = c(12, 15,18, 22, 18, 22))

2 Answers

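library(data.table)

# Sample data from the question (includes the sex column used in `by` below)
d = data.table(site = c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c'),
               sex = c('m', 'f', 'm', 'f', 'm', 'f', 'm', 'f', 'm', 'f', 'm', 'f'),
               date = c(127, 127, 185, 185, 132, 132, 189, 189, 119, 119, 178, 178),
               count = c(12, 15, 10, 9, 18, 22, 12, 15, 10, 9, 18, 22))

# Median date across all visits at all sites
d.median = d[, median(date)]
# Within each sex/site group, keep the row whose date is closest to that median
d[, {i = which.min(abs(date - d.median))
     list(date = date[i], count = count[i])},
  by = list(sex, site)]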
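
If some visits might not record every sex, grouping by sex could pick rows from different visits. A possible variant, sketched here on the question's sample data (it is an illustration rather than a tested general solution, and `res` is just an illustrative name), picks the closest date per site first and then keeps every row from that visit:

```r
library(data.table)

# Sample data from the question
d <- data.table(site = rep(c('a', 'b', 'c'), each = 4),
                sex = rep(c('m', 'f'), times = 6),
                date = c(127, 127, 185, 185, 132, 132, 189, 189, 119, 119, 178, 178),
                count = c(12, 15, 10, 9, 18, 22, 12, 15, 10, 9, 18, 22))

# Median date across all visits at all sites
d.median <- d[, median(date)]

# For each site, find the date closest to the median and keep all rows with that date
res <- d[, .SD[date == date[which.min(abs(date - d.median))]], by = site]
res
```

Because `which.min` returns the first minimum, a site with two visits equally far from the median keeps the earlier-listed one.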
answered 2013-04-10T20:09:24.793

Here is one approach using ave and rank from base R:

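# Simplified sample data (one row per visit)
mydf <- data.frame(site = c('a', 'a', 'b', 'b', 'c', 'c'),
                   date = c(127, 185, 132, 189, 119, 178),
                   count = c(12, 15, 10, 9, 18, 22))

# Within each site, rank visits by distance from the overall median date;
# median(date) sees the full date column because FUN is defined inside with()
myRanks <- with(mydf, ave(date, site, FUN = function(x) 
  rank(abs(x - median(date)), ties.method = "first")))
mydf[myRanks == 1, ]
#   site date count
# 1    a  127    12
# 3    b  132    10
# 6    c  178    22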

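rank is used to handle the case where more than one value is "closest" to the median: with ties.method = "first", exactly one row per site gets rank 1.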
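
To illustrate why the tie-breaking method matters, here is a small sketch with made-up distances (the vector `dists` and the variable names are hypothetical, not taken from the data above) in which two visits are equally close to the median:

```r
# Hypothetical distances from the median for three visits at one site;
# the first two are tied for closest.
dists <- c(5, 5, 45)

r_first   <- rank(dists, ties.method = "first")  # 1 2 3
r_average <- rank(dists)                         # 1.5 1.5 3.0 (default "average")
r_min     <- rank(dists, ties.method = "min")    # 1 1 3

which(r_first == 1)    # 1: exactly one visit selected
which(r_average == 1)  # integer(0): no visit selected
which(r_min == 1)      # 1 2: both tied visits selected
```

With the default ties.method, a tie for closest would silently drop the site from the result, which is why "first" is used here.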

answered 2013-04-10T20:16:35.603