r - What is an efficient way to extract only the rows with the first occurrence from a dataset?

Question

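I have a data frame of patient encounters and want to extract only the earliest encounter for each patient (this can be done using the sequential encounter ID). The code I came up with works, but I'm sure there is a more efficient way to do this with dplyr. What approach would you recommend?

Example with 10 encounters for 4 patients: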

encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855,721,821,855,423,423,855,721,423,855)
gender <- c(0,0,1,0,1,1,0,0,1,0)
df <- data.frame(encounter_ID, patient_ID, gender)

Result (desired and obtained):

    encounter_ID    patient_ID  gender
    1003            855         0
    1022            721         0
    1013            821         1
    1002            423         1

My approach

1) Extract the list of unique patients

list.patients <- unique(df$patient_ID)

2) Create an empty data frame to receive the first encounter of each patient

one.encounter <- data.frame()

3) Loop over each patient in the list to extract their first encounter and populate the data frame

for (i in 1:length(list.patients)) {
  one.patient <- df %>% filter(patient_ID==list.patients[i])
  one.patient.ordered <- one.patient[order(one.patient$encounter_ID),]
  first.encounter <- head(one.patient.ordered, n=1)
  one.encounter <- rbind(one.encounter, first.encounter)
}

score 4 · Accepted Answer

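Since the OP asked for a method that is efficient in terms of execution time, here is a benchmark of the answers, plus an additional data.table approach.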

#Unit: milliseconds
#            expr        min         lq       mean     median         uq        max neval
#          OP(df) 1354.49200 1398.15245 1481.16068 1467.31151 1531.93056 2124.05586   100
#        Mike(df)  587.33074  606.33194  649.87766  621.65719  658.96548 1076.12302   100
#   Fernandes(df)  177.80735  182.97910  206.64074  185.91444  198.83281  430.96393   100
#       `5th`(df)   60.55170   64.98082   77.55248   67.73171   71.54677  208.47656   100
#       SmitM(df)   52.70000   53.93696   59.05506   54.84035   58.92260  175.24284   100
#   Jan_Boyer(df)   30.70666   33.44665   43.04396   34.46983   35.69736  223.02998   100
#  data_table(df)   11.51547   12.38410   14.60907   13.08038   15.25540   43.71229   100
# Moody_dplyr(df)  234.08792  241.02003  260.19283  245.20301  259.82435  517.03117   100
# Moody_baseR(df)   67.05192   72.00578   89.50914   74.64688   77.58169  299.56125   100

Code and data

library(microbenchmark)
library(tidyverse)
library(data.table)

n <- 1e6
set.seed(1)
df <- data.frame(encounter_ID = sample(1000:1999, size = n, replace = TRUE), 
                 patient_ID = sample(700:900, n, TRUE), 
                 gender = sample(0:1, n, TRUE))

benchmark <- microbenchmark(
  OP(df),
  Mike(df),
  Fernandes(df),
  `5th`(df),
  SmitM(df),
  Jan_Boyer(df),
  data_table(df),
  Moody_dplyr(df),
  Moody_baseR(df)
)

autoplot(benchmark)

The solutions so far.

Mike <- function(df) {
  df %>%  
    arrange(patient_ID, encounter_ID) %>% 
    group_by(patient_ID) %>% 
    filter(row_number()==1)
}

SmitM <- function(df) {
  df %>% 
    group_by(patient_ID, gender) %>% 
    summarise(encounter_ID = min(encounter_ID))
}

Fernandes <- function(df) {
  x <- dplyr::arrange(df, encounter_ID)
  x[!duplicated(x$patient_ID),]
}

`5th` <- function(df) {
  df_ordered <- df[order(df$patient_ID, df$encounter_ID), ]
  df_ordered[match(unique(df_ordered$patient_ID), df_ordered$patient_ID), ]
}

Jan_Boyer <- function(df) {
  df <- df[order(df$encounter_ID),] 
  df[!duplicated(df$patient_ID),]
}

data_table <- function(df) {
  setDT(df, key = 'encounter_ID')
  df[df[, .I[1], by = patient_ID]$V1]
}

OP <- function(df) {
  list.patients <- unique(df$patient_ID)
  one.encounter <- data.frame()

  for (i in 1:length(list.patients)) {
    one.patient <- df %>% filter(patient_ID == list.patients[i])
    one.patient.ordered <- one.patient[order(one.patient$encounter_ID), ]
    first.encounter <- head(one.patient.ordered, n = 1)
    one.encounter <- rbind(one.encounter, first.encounter)
  }
  one.encounter  # return the accumulated first encounters
}

Moody_dplyr <- function(df) {
  df %>% group_by(patient_ID) %>% top_n(-1,encounter_ID)
}

Moody_baseR <- function(df) {
  subset(df, as.logical(ave(encounter_ID, patient_ID, FUN = function(x) x == min(x))))
}
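
One thing to keep in mind when reading the timings: the benchmark data samples encounter_ID with replacement, so a patient can have several rows sharing the same minimum encounter_ID. The functions do not all treat such ties the same way. The following is a minimal base R sketch on hypothetical toy data (the data frame toy is an assumption, not part of the benchmark) that contrasts a duplicated()-based approach with an ave()-based one:

```r
# Hypothetical toy data: patient 1 has a tied minimum encounter_ID (5 appears twice)
toy <- data.frame(encounter_ID = c(5, 5, 7, 6),
                  patient_ID   = c(1, 1, 1, 2))

# duplicated()-based approach: exactly one row per patient
ordered <- toy[order(toy$encounter_ID), ]
one_per_patient <- ordered[!duplicated(ordered$patient_ID), ]

# ave()-based approach: keeps every row that equals the group minimum
with_ties <- subset(toy, as.logical(ave(encounter_ID, patient_ID,
                                        FUN = function(x) x == min(x))))

nrow(one_per_patient)  # 2
nrow(with_ties)        # 3
```

If exactly one row per patient is required, it may be worth checking the row counts of each function on your own data before comparing speeds.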

score 4 · Accepted Answer

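Here is a base R solution that does this efficiently without dplyr.

duplicated marks the first row it encounters with a given patient ID as FALSE and all subsequent rows with the same patient ID as TRUE (here I invert that by putting ! before duplicated), so you can use it to select only the first encounter, provided you have already ordered your data frame by encounter_ID.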

df <- df[order(df$encounter_ID),] #order dataframe by encounter id
#subset to rows that are not duplicates of a previous encounter for that patient
first <- df[!duplicated(df$patient_ID),]
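
To make the behavior of duplicated concrete, here is a small sketch using the patient IDs from the question. It only illustrates the logical vector that the answer relies on:

```r
patient_ID <- c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855)

# TRUE marks every repeat of an ID that has already been seen
dup <- duplicated(patient_ID)
dup
# FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

# Positions of the first occurrence of each patient
which(!dup)
# 1 2 3 5
```

Without sorting by encounter_ID first, these positions are simply the first rows in the original row order, which is why the ordering step comes before the subsetting step.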

score 3 · Accepted Answer

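In general, R works fastest when you vectorize your operations. So when you ask for a more efficient way to solve this, the question is: what exactly do you mean by efficient?

To illustrate, I show you a base R solution and run microbenchmark: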

microbenchmark::microbenchmark(myfun1(),myfun2(),myfun3())
Unit: microseconds
     expr    min      lq     mean  median     uq     max neval
 myfun1() 3997.1 4416.10 6086.848 5129.65 6215.6 64014.4   100
 myfun2()  834.7  993.50 1404.901 1083.95 1247.5 20456.2   100
 myfun3()  133.3  162.75  258.533  193.75  233.8  3561.7   100

Your solution is myfun1(), @SmitM's dplyr version is myfun2(), and my solution (myfun3()) looks like this:

df_ordered=df[order(df$patient_ID,df$encounter_ID),]
df_ordered[match(unique(df_ordered$patient_ID),df_ordered$patient_ID),]

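Now you can pick your favorite: the dplyr solution is very nice to read and, I think, can also be ported to other programming languages. The base R solution is very fast, but base R is often not as nice to read and, as far as I know, cannot be ported to other languages.

I am posting the base R version here because it reads comparatively well, since every function does what its name says, although dplyr still looks nicer.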
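
The match(unique(...)) idiom in this answer works because match returns the position of the first match for each value. A minimal sketch, using a hypothetical vector ids that is already sorted by patient:

```r
# Hypothetical patient IDs, already sorted so each patient's rows are adjacent
ids <- c(423, 423, 423, 721, 721, 821, 855)

# For each distinct ID, match() gives the index of its first occurrence
first_pos <- match(unique(ids), ids)
first_pos
# 1 4 6 7
```

Indexing the sorted data frame with these positions therefore keeps one row per patient, namely the one that sorts first.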

score 3 · Accepted Answer

You can try:

df2 <- df %>% 
          group_by(patient_ID, gender) %>% 
          summarise(encounter_ID = min(encounter_ID))
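
Note that summarise keeps only the grouping columns and the summarised column, so gender is carried along here by adding it to group_by, which relies on gender being constant within each patient. As a possible alternative when other columns vary within a patient, the sketch below uses slice_min, assuming dplyr 1.0.0 or later is installed; it is an illustration, not part of the original answer:

```r
library(dplyr)

encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855)
gender <- c(0, 0, 1, 0, 1, 1, 0, 0, 1, 0)
df <- data.frame(encounter_ID, patient_ID, gender)

# Keep the whole row with the smallest encounter_ID per patient;
# with_ties = FALSE guarantees at most one row per group
first_encounters <- df %>%
  group_by(patient_ID) %>%
  slice_min(encounter_ID, n = 1, with_ties = FALSE) %>%
  ungroup()

first_encounters$encounter_ID
# 1002 1022 1013 1003
```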

score 1 · Accepted Answer

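Like this. In the dplyr code below, I sort by both IDs and then group by patient. Using row_number()==1 in the filter statement picks the row with the smallest encounter_ID for each patient, because the data is sorted by that variable and grouped by patient ID: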

encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855,721,821,855,423,423,855,721,423,855)
gender <- c(0,0,1,0,1,1,0,0,1,0)
df <- data.frame(encounter_ID, patient_ID, gender)

library(dplyr)



df2 <- df %>%  
        arrange(patient_ID, encounter_ID) %>% 
        group_by(patient_ID) %>% 
        filter(row_number()==1)

score 1 · Accepted Answer

Another option

x = dplyr::arrange(df, encounter_ID)
x[!duplicated(x$patient_ID),]
#  encounter_ID patient_ID gender
#1         1002        423      1
#2         1003        855      0
#4         1013        821      1
#6         1022        721      0

score 1 · Accepted Answer

You can use top_n:

library(dplyr)
df %>% group_by(patient_ID) %>% top_n(-1,encounter_ID)
# # A tibble: 4 x 3
# # Groups:   patient_ID [4]
#   encounter_ID patient_ID gender
#          <dbl>      <dbl>  <dbl>
# 1         1022        721      0
# 2         1013        821      1
# 3         1002        423      1
# 4         1003        855      0

It is not super fast, but it is the idiomatic dplyr way.

With base R, this is much faster:

subset(df, as.logical(ave(encounter_ID, patient_ID, FUN = function(x) x == min(x))))
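
To see why as.logical is needed in the line above, it can help to look at what ave returns on its own. In this sketch on the question's data, the logical result of the comparison is written back into a numeric vector, so it comes out as 0/1 rather than TRUE/FALSE:

```r
encounter_ID <- c(1021, 1022, 1013, 1041, 1007, 1002, 1003, 1043, 1085, 1077)
patient_ID <- c(855, 721, 821, 855, 423, 423, 855, 721, 423, 855)

# Per patient, flag the rows whose encounter_ID equals that patient's minimum
flag <- ave(encounter_ID, patient_ID, FUN = function(x) x == min(x))
flag
# 0 1 1 0 0 1 1 0 0 0

# Converting to logical gives a row selector in the original row order
which(as.logical(flag))
# 2 3 6 7
```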
