
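I have a data frame with a patient ID and a date, sorted by date within each ID. Each patient usually has several rows, although there may be only one. For example: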

patid   date
1302    2009-01-27
1302    2009-02-05
1302    2009-08-28
1670    2009-03-12
2073    2009-04-03
2073    2010-11-01
2073    2010-12-19
2073    2011-03-06

From this, I want to generate a data frame or CSV file containing the start and end dates for each patient, so from the above I would get:

patid   start       end
1302    2009-01-27  2009-08-28
1670    2009-03-12  2009-03-12
2073    2009-04-03  2011-03-06

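I have over 30 million rows in the initial file, so I don't want to write a for loop.

Is there an efficient way to do this, perhaps starting by using aggregate to derive the number of rows for each patient?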

3 Answers

Using sqldf

Input data:

df=read.table(text="patid   date
          1302    2009-01-27
          1302    2009-02-05
          1302    2009-08-28
          1670    2009-03-12
          2073    2009-04-03
          2073    2010-11-01
          2073    2010-12-19
          2073    2011-03-06",header=T)

Code:

 library(sqldf)
 sqldf("select patid,min(date) as start, max(date) as end from df group by patid")

Output:

   patid      start        end
1  1302 2009-01-27 2009-08-28
2  1670 2009-03-12 2009-03-12
3  2073 2009-04-03 2011-03-06
Answered 2018-10-01T08:45:37.303

Using tidyverse (dplyr, with lubridate for date parsing):

library(dplyr)

read.table(text="patid   date
           1302    2009-01-27
           1302    2009-02-05
           1302    2009-08-28
           1670    2009-03-12
           2073    2009-04-03
           2073    2010-11-01
           2073    2010-12-19
           2073    2011-03-06",header=T)%>%
   mutate(date=lubridate::ymd(date))%>%  # parse dates once, before grouping
   group_by(patid)%>%
   summarise(start=min(date),
             end=max(date))

Output:

# A tibble: 3 x 3
  patid start      end       
  <int> <date>     <date>    
1  1302 2009-01-27 2009-08-28
2  1670 2009-03-12 2009-03-12
3  2073 2009-04-03 2011-03-06
Answered 2018-10-01T08:47:16.190

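Use the base R function aggregate() with FUN set to a simple custom function that returns a vector holding both min() and max().

As you suggested, you can use aggregate(), and as shown below you can compute both min() and max() for each patid group in a single step: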

# Read in your sample data, being careful to prevent dates from becoming factors
pdates <- 
  read.table( text="patid   date
                    1302    2009-01-27
                    1302    2009-02-05
                    1302    2009-08-28
                    1670    2009-03-12
                    2073    2009-04-03
                    2073    2010-11-01
                    2073    2010-12-19
                    2073    2011-03-06",
                    header=TRUE, 
                    stringsAsFactors=FALSE) # keep date strings from becoming factors!

aggregate( x = pdates["date"],   # dataframe with column(s) to aggregate
           by = pdates["patid"], # passing dataframe with named column "patid" preserves the column name in the output
           FUN = function(vdate) { 
                   c(start=min(vdate), end=max(vdate))
                 }  
         )

  patid date.start   date.end
1  1302 2009-01-27 2009-08-28
2  1670 2009-03-12 2009-03-12
3  2073 2009-04-03 2011-03-06
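
Note that the `date` column returned by `aggregate()` above is a single two-column matrix rather than two ordinary columns, which is why it prints as `date.start` and `date.end`. Below is a minimal sketch, assuming the same sample data, of flattening it into plain `start` and `end` columns and writing the CSV the question asks for. The names `res`, `flat`, and `out_file`, and the output file name, are illustrative and not part of the original answer.

```r
# Sample data from the question; dates are kept as character strings
pdates <- read.table(text = "patid date
1302 2009-01-27
1302 2009-02-05
1302 2009-08-28
1670 2009-03-12
2073 2009-04-03
2073 2010-11-01
2073 2010-12-19
2073 2011-03-06", header = TRUE, stringsAsFactors = FALSE)

# Same aggregation as above: res$date is a matrix with columns "start" and "end"
res <- aggregate(x = pdates["date"],
                 by = pdates["patid"],
                 FUN = function(vdate) c(start = min(vdate), end = max(vdate)))

# Pull the matrix columns out into ordinary data frame columns
flat <- data.frame(patid = res$patid,
                   start = unname(res$date[, "start"]),
                   end   = unname(res$date[, "end"]),
                   stringsAsFactors = FALSE)

# Hypothetical output location; replace with the path you actually want
out_file <- file.path(tempdir(), "patient_date_ranges.csv")
write.csv(flat, out_file, row.names = FALSE)
```

Comparing the dates as strings works here only because they are in ISO `YYYY-MM-DD` form, which sorts the same way alphabetically and chronologically. For other formats, convert with `as.Date()` first.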

Edit: or, more simply, use the very handy base R range() function:

aggregate( pdates["date"], by=pdates["patid"], range)

  patid     date.1     date.2
1  1302 2009-01-27 2009-08-28
2  1670 2009-03-12 2009-03-12
3  2073 2009-04-03 2011-03-06
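
The question states that the rows are already sorted by date within each patient. Under that assumption, plus the further assumption that each patient's rows are contiguous, the first and last row of every patid already hold the start and end dates, so no per-group min() or max() is needed. The following is a minimal base R sketch of that idea using duplicated(). It is an alternative to the aggregate() calls above rather than part of the original answer, and the names `first_row`, `last_row`, and `ranges` are illustrative.

```r
# Sample data from the question; dates are kept as character strings
pdates <- read.table(text = "patid date
1302 2009-01-27
1302 2009-02-05
1302 2009-08-28
1670 2009-03-12
2073 2009-04-03
2073 2010-11-01
2073 2010-12-19
2073 2011-03-06", header = TRUE, stringsAsFactors = FALSE)

# Assumption: rows are grouped by patid and sorted by date within each patid
first_row <- !duplicated(pdates$patid)                   # TRUE on each patient's first row
last_row  <- !duplicated(pdates$patid, fromLast = TRUE)  # TRUE on each patient's last row

# Guard: first and last rows must line up patient by patient,
# which fails if a patient's rows are not contiguous
stopifnot(identical(pdates$patid[first_row], pdates$patid[last_row]))

ranges <- data.frame(patid = pdates$patid[first_row],
                     start = pdates$date[first_row],
                     end   = pdates$date[last_row],
                     stringsAsFactors = FALSE)
print(ranges)
```

Both duplicated() calls are vectorised over the whole column, so this avoids calling an R function once per patient. Whether it is actually faster on 30 million rows would need to be measured on the real data.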
Answered 2018-10-02T11:16:24.793