
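I want to build a cohort of new drug users (Ray 2003). My original dataset has about 19 million rows, so loops have proved too slow. Here is a dummy dataset (built with fruit instead of drugs):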

    df2

   names      dates age sex  fruit
1    tom 2010-02-01  60   m  apple
2   mary 2010-05-01  55   f orange
3    tom 2010-03-01  60   m banana
4   john 2010-07-01  57   m   kiwi
5   mary 2010-07-01  55   f  apple
6    tom 2010-06-01  60   m  apple
7   john 2010-09-01  57   m  apple
8   mary 2010-07-01  55   f orange
9   john 2010-11-01  57   m banana
10  mary 2010-09-01  55   f  apple
11   tom 2010-08-01  60   m   kiwi
12  mary 2010-11-01  55   f  apple
13  john 2010-12-01  57   m orange
14  john 2011-01-01  57   m  apple

I have identified the people who took an apple between 04-2010 and 10-2010:

temp2

  names      dates age sex fruit
6   tom 2010-06-01  60   m apple
5  mary 2010-07-01  55   f apple
7  john 2010-09-01  57   m apple

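I want to create a new column called "index" in the original data frame, holding the first date within the defined date range on which a person was prescribed the drug. This is how I tried to get the dates from temp2 into df2$index: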

df2$index<-temp2$dates    
df2$index<-df2$dates == temp2$dates
df2$index<-df2$dates %in% temp2$dates
df2$index<-ifelse(as.Date(df$dates)==as.Date(temp2$dates), as.Date(temp2$dates),NA)

I am doing something wrong, because none of these work. This is the desired output:

    df2

   names      dates age sex  fruit      index
1    tom 2010-02-01  60   m  apple       <NA>
2   mary 2010-05-01  55   f orange       <NA>
3    tom 2010-03-01  60   m banana       <NA>
4   john 2010-07-01  57   m   kiwi       <NA>
5   mary 2010-07-01  55   f  apple 2010-07-01
6    tom 2010-06-01  60   m  apple 2010-06-01
7   john 2010-09-01  57   m  apple 2010-09-01
8   mary 2010-07-01  55   f orange       <NA>
9   john 2010-11-01  57   m banana       <NA>
10  mary 2010-09-01  55   f  apple       <NA>
11   tom 2010-08-01  60   m   kiwi       <NA>
12  mary 2010-11-01  55   f  apple       <NA>
13  john 2010-12-01  57   m orange       <NA>
14  john 2011-01-01  57   m  apple       <NA>
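
The attempts above most likely fail because `temp2$dates` has 3 elements while `df2` has 14 rows, so the assignment and the element-wise comparisons cannot line up row by row. A minimal sketch of one way to build the column, assuming the window bounds used for `temp2` in the question and matching on name, date, and fruit together (the helper names `in_window`, `cand`, `first`, `key_df`, `key_first`, and `hit` are illustrative, not from the original post):

```r
# Rebuild the dummy data from the question (age and sex omitted for brevity)
df2 <- data.frame(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"),
  stringsAsFactors = FALSE
)

# Apple rows inside the window (bounds copied from the temp2 code)
in_window <- df2$fruit == "apple" &
  df2$dates >= as.Date("2010-04-01") & df2$dates < as.Date("2010-10-01")

# Earliest in-window apple row per person (equivalent to temp2)
cand  <- df2[in_window, ]
cand  <- cand[order(cand$dates), ]
first <- cand[!duplicated(cand$names), ]

# Match on name + date + fruit so mary's orange on 2010-07-01 is not flagged
key_df    <- paste(df2$names, df2$dates, df2$fruit)
key_first <- paste(first$names, first$dates, first$fruit)
hit       <- key_df %in% key_first

# Start with an all-NA Date column, then fill only the matched rows
df2$index <- as.Date(NA)
df2$index[hit] <- df2$dates[hit]
```

If a person had two apple rows on the same first date, both rows would receive the index date under this sketch.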

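Once I have the desired output, I want to look back 180 days from the index date to see whether the person had an apple. If they did not have an apple, I want to keep them. If they did have one (e.g., tom), I want to discard them. This is the code I tried on the desired output: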

df4<-df2[df2$fruit!='apple' & df2$index-180,]
df4<-df2[df2$fruit!='apple' & df2$dates<=df2$index-180,] ##neither work for me
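
Two things seem to get in the way here: `df2$index-180` is a Date rather than a logical value, and in the desired output `index` is `NA` on every row except one per person, so a row-wise comparison against a person's other rows returns `NA`. A minimal sketch of the lookback, assuming the index date is first spread to every row of the same person (the names `person_index`, `prior_apple`, and `excluded` are illustrative):

```r
# Dummy data from the question (age and sex omitted for brevity)
df2 <- data.frame(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"),
  stringsAsFactors = FALSE
)

# Index date per person: first apple inside the window used for temp2
in_window <- df2$fruit == "apple" &
  df2$dates >= as.Date("2010-04-01") & df2$dates < as.Date("2010-10-01")
cand  <- df2[in_window, ]
cand  <- cand[order(cand$dates), ]
first <- cand[!duplicated(cand$names), c("names", "dates")]

# Spread each person's index date to all of that person's rows
person_index <- first$dates[match(df2$names, first$names)]

# TRUE where an apple was taken in the 180 days before the index date
prior_apple <- df2$fruit == "apple" &
  df2$dates < person_index & df2$dates >= person_index - 180

# People with at least one prior apple are not new users
excluded  <- unique(df2$names[which(prior_apple)])
new_users <- df2[!df2$names %in% excluded, ]
```

On the dummy data this drops all of tom's rows, since his apple on 2010-02-01 falls within 180 days before his index date of 2010-06-01.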

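I would appreciate any guidance on these problems, or even pointers to what I should read to learn how to do this. Perhaps my logic is flawed and my approach cannot work; if so, please tell me! Thanks in advance.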

Here is my df:

names<-c("tom", "mary", "tom", "john", "mary",
 "tom", "john", "mary", "john", "mary", "tom", "mary", "john", "john")
dates<-as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", 
"2010-07-01", "2010-07-01", "2010-06-01", "2010-09-01",
 "2010-07-01", "2010-11-01", "2010-09-01", "2010-08-01", 
"2010-11-01", "2010-12-01", "2011-01-01"))
fruit<-as.character(c("apple", "orange", "banana", "kiwi",
 "apple", "apple", "apple", "orange", "banana", "apple",
 "kiwi", "apple", "orange", "apple"))
age<-as.numeric(c(60,55,60,57,55,60,57,55,57,55,60,55, 57,57))
sex<-as.character(c("m","f","m","m","f","m","m",
 "f","m","f","m","f","m", "m"))
df2<-data.frame(names,dates, age, sex, fruit)
df2

Here is temp2:

data1<-df2[df2$fruit=="apple"& (df2$dates >= "2010-04-01" & df2$dates<  "2010-10-01"), ]
index <- with(data1, order(dates))
temp<-data1[index, ] 
dup<-duplicated(temp$names)
temp1<-cbind(temp,dup)
temp2<-temp1[temp1$dup!=TRUE,]
temp2$dup<-NULL

Solution

df2 <- df2[with(df2, order(names, dates)), ]
## DWin's code: assign the index date for each name and fruit within the window
df2$first.date <- ave(df2$dates, df2$names, df2$fruit,
       FUN=function(dt) dt[dt <= as.Date("2010-10-31") & dt >= as.Date("2010-04-01")][1])

## TRUE on rows showing that the person is not a new user (e.g. tom)
df2$x <- df2$fruit=='apple' & df2$dates > df2$first.date-180 & df2$dates < df2$first.date
## names that have at least one TRUE
ids <- with(df2, unique(names[which(x)]))
## drop every row belonging to those names
new_users <- subset(df2, !names %in% ids)

2 Answers


First, sort by name and date:

df <- df[with(df, order(names, dates)), ]

Then simply pick the first date within each name:

df$first.date <- ave(df$dates, df$names, FUN=function(d) d[1])
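
A small self-contained sketch of this step on a four-row toy data frame (the toy rows are made up for illustration). Because `ave()` treats every unnamed argument after the first as a grouping variable, the sketch passes an anonymous function to take the first element of each group rather than handing an index to `"["`:

```r
# Toy data, deliberately unsorted
df <- data.frame(
  names = c("tom", "mary", "tom", "mary"),
  dates = as.Date(c("2010-03-01", "2010-07-01", "2010-02-01", "2010-05-01")),
  stringsAsFactors = FALSE
)

# Sort by name, then date, so the first element per name is the earliest date
df <- df[order(df$names, df$dates), ]

# ave() returns a vector as long as its input, with the group result recycled
df$first.date <- ave(df$dates, df$names, FUN = function(d) d[1])
```

Each row ends up carrying the earliest date for its name, and the column keeps its Date class.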

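Now you will witness "the power of a fully operational Death Star", er, ave function. You are ready to pick out the first date within each 'name' and 'fruit' that falls inside that date range: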

> df$first.date <- ave(df$date, df$name, df$fruit, 
         FUN=function(dt) dt[dt <="2010-10-31" & dt>="2010-04-01"][1] )
> df
   names      dates age sex  fruit first.date
4   john 2010-07-01  57   m   kiwi 2010-07-01
7   john 2010-09-01  57   m  apple 2010-09-01
9   john 2010-11-01  57   m banana       <NA>
13  john 2010-12-01  57   m orange       <NA>
14  john 2011-01-01  57   m  apple 2010-09-01
2   mary 2010-05-01  55   f orange 2010-05-01
5   mary 2010-07-01  55   f  apple 2010-07-01
8   mary 2010-07-01  55   f orange 2010-05-01
10  mary 2010-09-01  55   f  apple 2010-07-01
12  mary 2010-11-01  55   f  apple 2010-07-01
1    tom 2010-02-01  60   m  apple 2010-06-01
3    tom 2010-03-01  60   m banana       <NA>
6    tom 2010-06-01  60   m  apple 2010-06-01
11   tom 2010-08-01  60   m   kiwi 2010-08-01
answered 2013-07-14T00:28:55.733

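Since you have 19 million rows, I think you should try a data.table solution. Here is my attempt. The result differs slightly from @Dwin's because I first filter the data to the (start, end) range and then create a new index variable holding the minimum date for each (names, fruit) pair within that range.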

library(data.table)
DT <- data.table(df2,key=c('names','dates'))
DT[,dates := as.Date(dates)]
DT[between(dates,as.Date("2010-04-01"),as.Date("2010-10-31")),
   index := as.character(min(dates))
,   by=c('names','fruit')]
##     names      dates age sex  fruit      index
##  1:  john 2010-07-01  57   m   kiwi 2010-07-01
##  2:  john 2010-09-01  57   m  apple 2010-09-01
##  3:  john 2010-11-01  57   m banana         NA
##  4:  john 2010-12-01  57   m orange         NA
##  5:  john 2011-01-01  57   m  apple         NA
##  6:  mary 2010-05-01  55   f orange 2010-05-01
##  7:  mary 2010-07-01  55   f  apple 2010-07-01
##  8:  mary 2010-07-01  55   f orange 2010-05-01
##  9:  mary 2010-09-01  55   f  apple 2010-07-01
## 10:  mary 2010-11-01  55   f  apple         NA
## 11:   tom 2010-02-01  60   m  apple         NA
## 12:   tom 2010-03-01  60   m banana         NA
## 13:   tom 2010-06-01  60   m  apple 2010-06-01
## 14:   tom 2010-08-01  60   m   kiwi 2010-08-01
answered 2013-07-14T01:55:39.870