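I want to identify the unique people who received an apple within a specified time frame. I do this by creating a binary indicator, "apples", as follows.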

names<-c("tom", "mary", "tom", "john", "mary", "tom", "john", "mary", "john", "mary", "tom", "mary", "john", "john")
dates<-as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01", "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01", "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01", "2010-12-01", "2011-01-01"))
fruit<-as.character(c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple", "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"))
age<-as.numeric(c(60,55,60,57,55,60,57,55,57,55,60,55, 57,57))
sex<-as.character(c("m","f","m","m","f","m","m", "f","m","f","m","f","m", "m"))
df<-data.frame(names,dates, age, sex, fruit)
df


df$apples<-ifelse(df$fruit=='apple' & df$dates>="2010-04-01" & df$dates<"2010-10-01",1,0)
df

 names      dates age sex  fruit apples
1    tom 2010-02-01  60   m  apple      0
2   mary 2010-05-01  55   f orange      0
3    tom 2010-03-01  60   m banana      0
4   john 2010-07-01  57   m   kiwi      0
5   mary 2010-07-01  55   f  apple      1
6    tom 2010-06-01  60   m  apple      1
7   john 2010-09-01  57   m  apple      1
8   mary 2010-07-01  55   f orange      0
9   john 2010-11-01  57   m banana      0
10  mary 2010-09-01  55   f  apple      1
11   tom 2010-08-01  60   m   kiwi      0
12  mary 2010-11-01  55   f  apple      0
13  john 2010-12-01  57   m orange      0
14  john 2011-01-01  57   m  apple      0

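My problem is that mary appears twice. I only want the first date on which she got an apple within the specified time frame (and likewise the first date for everyone else in my real data). I would like a second column, "apples1", that flags the first date on which each person got an apple within the defined time frame.

Desired output: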

 names      dates age sex  fruit apples apples1
1    tom 2010-02-01  60   m  apple      0       0
2   mary 2010-05-01  55   f orange      0       0
3    tom 2010-03-01  60   m banana      0       0
4   john 2010-07-01  57   m   kiwi      0       0
5   mary 2010-07-01  55   f  apple      1       1
6    tom 2010-06-01  60   m  apple      1       1
7   john 2010-09-01  57   m  apple      1       1
8   mary 2010-07-01  55   f orange      0       0
9   john 2010-11-01  57   m banana      0       0
10  mary 2010-09-01  55   f  apple      1       0
11   tom 2010-08-01  60   m   kiwi      0       0
12  mary 2010-11-01  55   f  apple      0       0
13  john 2010-12-01  57   m orange      0       0
14  john 2011-01-01  57   m  apple      0       0

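I have been searching, and the closest thing I found is "Select only the first rows for each unique value of a column in R", but that does not account for the unique ID. I have also come across !duplicated, but I don't want to drop mary's other rows, because I need her dates for follow-up. I may well be missing something very basic here, so apologies in advance.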

2 Answers

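library(plyr)
# Sort by date so each person's earliest rows come first
df <- df[order(df$dates), ]
# Within each person, flag the first row where apples == 1
# (apples is the in-window indicator built in the question)
ddply(df, "names", transform,
  apples1 = as.numeric(apples == 1 & !duplicated(apples))
)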

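Note: I am assuming that ddply preserves the row order of the data frame within each piece when it splits on the splitting variable. In my experience that is the case, but you could modify this solution slightly by replacing transform with an inline function that does the sorting itself. I don't think that is necessary.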
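
For what it's worth, the inline-function variant mentioned in the note could look roughly like the sketch below. It sorts each person's rows inside the function, so it does not rely on ddply keeping the order. The helper name flag_first is my own, not part of plyr, and the data frame is rebuilt from the question (without age and sex) only so the snippet runs on its own.

```r
library(plyr)

# Data from the question, rebuilt so the snippet is self-contained
df <- data.frame(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"),
  stringsAsFactors = FALSE
)

# In-window indicator, as in the question
df$apples <- ifelse(df$fruit == "apple" &
                      df$dates >= as.Date("2010-04-01") &
                      df$dates <  as.Date("2010-10-01"), 1, 0)

# flag_first is a hypothetical helper: it receives one person's rows
flag_first <- function(d) {
  d <- d[order(d$dates), ]              # sort within this person's rows
  first_hit <- which(d$apples == 1)[1]  # NA if the person has no in-window apple
  d$apples1 <- 0
  if (!is.na(first_hit)) d$apples1[first_hit] <- 1
  d
}

res <- ddply(df, "names", flag_first)
res[res$apples1 == 1, c("names", "dates")]
```

The trade-off is a little more code in exchange for not depending on how ddply orders rows within each piece.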

answered 2013-06-30T02:50:21.220

Here is a data.table solution. I create both columns at the same time.

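library(data.table)
DT <- data.table(df)
# Sort by person, then date, so the first match per person comes first
setkeyv(DT, c("names", "dates"))
# Only in-window apple rows are updated; all other rows stay NA
DT[fruit == "apple" &
     dates >= as.Date("2010-04-01") &
     dates <  as.Date("2010-10-01"),
   c("apples", "apples1") := list(1, as.numeric(!duplicated(names)))]
DT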

   names      dates age sex  fruit apples apples1
 1:  john 2010-07-01  57   m   kiwi     NA      NA
 2:  john 2010-09-01  57   m  apple      1       1
 3:  john 2010-11-01  57   m banana     NA      NA
 4:  john 2010-12-01  57   m orange     NA      NA
 5:  john 2011-01-01  57   m  apple     NA      NA
 6:  mary 2010-05-01  55   f orange     NA      NA
 7:  mary 2010-07-01  55   f  apple      1       1
 8:  mary 2010-07-01  55   f orange     NA      NA
 9:  mary 2010-09-01  55   f  apple      1       0
10:  mary 2010-11-01  55   f  apple     NA      NA
11:   tom 2010-02-01  60   m  apple     NA      NA
12:   tom 2010-03-01  60   m banana     NA      NA
13:   tom 2010-06-01  60   m  apple      1       1
14:   tom 2010-08-01  60   m   kiwi     NA      NA
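
The result above leaves NA in the rows that did not match, and setkeyv reorders the rows. If you need 0s and the original row order, as in the desired output, one possible sketch is to initialise both columns to 0 and carry a row id through the update. The column name row_id is my own choice, and the data is rebuilt from the question (without age and sex) so the snippet runs on its own.

```r
library(data.table)

# Data from the question, rebuilt so the snippet is self-contained
df <- data.frame(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"),
  stringsAsFactors = FALSE
)

DT <- as.data.table(df)
DT[, row_id := .I]                           # hypothetical helper column: original row position
DT[, c("apples", "apples1") := list(0, 0)]   # start from 0 instead of NA
setkeyv(DT, c("names", "dates"))             # sort by person, then date

# Update only the in-window apple rows; the first one per person gets apples1 = 1
DT[fruit == "apple" &
     dates >= as.Date("2010-04-01") &
     dates <  as.Date("2010-10-01"),
   c("apples", "apples1") := list(1, as.numeric(!duplicated(names)))]

setorder(DT, row_id)   # back to the original row order
DT[, row_id := NULL]   # drop the helper column
DT
```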
answered 2013-06-30T03:20:39.100