r - Estimating a density from interval [start, stop] data in R

Question

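Description

This question is motivated by clinical/epidemiological research, where studies often enroll patients and then follow them for varying lengths of time.

The age distribution at study entry is usually of interest and is easy to assess, but occasionally the age distribution at any time during the study is of interest.

My question is: is there a way to estimate such a density from interval data such as [age_start, age_stop] without expanding the data as below? The long-format approach seems inelegant, not to mention its memory usage!

Reproducible example using data from the survival package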

#### Prep Data ###
library(survival)
library(ggplot2)
library(dplyr)

data(colon, package = 'survival')
# example using the colon dataset from the survival package
ccdeath <- colon %>%
  # use data on time to death (not recurrence)
  filter(etype == 2) %>%
  # age at end of follow-up (death or censoring)
  mutate(age_last = age + (time / 365.25))

#### Distribution Using Single Value ####
# age at study entry
ggplot(ccdeath, aes(x = age)) +
  geom_density() +
  labs(title = "Fig 1.",
       x = "Age at Entry (years)",
       y = "Density")

#### Using Person-Month Level Data ####
# create counting-process/person-time dataset
ccdeath_cp <- survSplit(Surv(age, age_last, status) ~ ., 
                        data = ccdeath,
                        cut = seq(from = floor(min(ccdeath$age)),
                                  to = ceiling(max(ccdeath$age_last)),
                                  by = 1/12))

nrow(ccdeath_cp) # over 50,000 rows

# distribution of age at person-month level
ggplot(ccdeath_cp, aes(x = age)) +
  geom_density() +
  labs(title = "Figure 2: Density based on approximate person-months",
       x = "Age (years)",
       y = "Density")

#### Using Person-Day Level Data ####
# create counting-process/person-time dataset
ccdeath_cp <- survSplit(Surv(age, age_last, status) ~ ., 
                        data = ccdeath,
                        cut = seq(from = floor(min(ccdeath$age)),
                                  to = ceiling(max(ccdeath$age_last)),
                                  by = 1/365.25))

nrow(ccdeath_cp) # over 1.5 million rows!

# distribution of age at person-day level
ggplot(ccdeath_cp, aes(x = age)) +
  geom_density() +
  labs(title = "Figure 3: Density based on person-days",
       x = "Age (years)",
       y = "Density")

Figure 2 and Figure 3

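Note: although I have tagged this question 'survival' because I thought it would attract people familiar with this area, I am not interested in time-to-event here, only in the overall age distribution across all time under study.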

score 0 · Accepted Answer

Instead of computing ever finer time intervals, you can keep a cumulative count of the number of patients at each age:

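library(data.table)

setDT(ccdeath)
# +1 when a patient enters (age), -1 when they leave (age_last);
# sum the changes at each distinct age
x <- rbind(
  ccdeath[, .(age = age, num_patients = 1)],
  ccdeath[, .(age = age_last, num_patients = -1)]
)[, .(num_patients = sum(num_patients)), keyby = age]

cccdeath <- x[x[, .(age = unique(age))], on = 'age']
# running total = number of patients under follow-up at each age
cccdeath[, num_patients := cumsum(num_patients)]
ggplot(cccdeath, aes(x = age, y = num_patients)) + geom_step()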
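The step function above counts patients rather than estimating a density. As a minimal sketch (not part of the original answer), the same enter/leave sweep can be rescaled by total person-time so that the area under the steps is 1. The toy intervals and the names `age_start`, `age_stop`, `steps`, and `person_time` are made up for illustration, and base R is used so the block runs on its own:

```r
# Hypothetical toy [start, stop] age intervals (not taken from the colon data)
age_start <- c(50, 52, 60, 61)
age_stop  <- c(55, 53, 62, 70)

# +1 when a patient enters follow-up, -1 when they leave
events <- data.frame(age   = c(age_start, age_stop),
                     delta = rep(c(1, -1), each = length(age_start)))

# Net change at each distinct age; aggregate() returns rows sorted by age
steps <- aggregate(delta ~ age, data = events, FUN = sum)
steps$at_risk <- cumsum(steps$delta)

# Width of each step; the last breakpoint has width 0 (nobody is left)
steps$width <- c(diff(steps$age), 0)

# Dividing by total person-time makes the step function integrate to 1
person_time   <- sum(steps$at_risk * steps$width)
steps$density <- steps$at_risk / person_time
steps
```

The number of rows in `steps` is at most twice the number of patients, regardless of how long each patient is followed.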

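The sawtooth pattern in the step plot arises because every patient is assumed to start at an integer age. I thought about how to smooth this and came up with this idea: assign equal probability to a set of evenly spaced ages between the given `age` and `age + 1`. You get something like this: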

smooth_param <- 12
x <- rbindlist(lapply(
  (seq_len(smooth_param) - 0.5) / smooth_param,  # offsets 0.5/12, 1.5/12, ..., 11.5/12
  function(s) {
    rbind(
      ccdeath[, .(age = age+s, num_patients = 1/smooth_param)],
      ccdeath[, .(age = age_last+s, num_patients = -1/smooth_param)]
    )
  }
))[, .(num_patients = sum(num_patients)), keyby = age]

cccdeath <- x[x[, .(age = sort(unique(age)))], on = 'age']
cccdeath[, num_patients := cumsum(num_patients)]
ggplot(cccdeath, aes(x = age, y = num_patients)) + geom_step()
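If a smooth curve comparable to `geom_density()` is wanted, one possible alternative (a sketch of a different technique, not the method in this answer) is to smooth each interval with a Gaussian kernel in closed form. Spreading a patient's person-time uniformly over [a, b] and convolving with a Gaussian kernel of bandwidth h gives pnorm((x - a) / h) - pnorm((x - b) / h), so no expansion of the data is needed. The function name `interval_density`, the bandwidth `h = 1`, and the toy intervals are assumptions for illustration:

```r
# Hypothetical toy [start, stop] age intervals (not taken from the colon data)
age_start <- c(50, 52, 60, 61)
age_stop  <- c(55, 53, 62, 70)

# Gaussian-smoothed person-time density at ages x, with bandwidth h (in years)
interval_density <- function(x, lo, hi, h = 1) {
  total <- sum(hi - lo)  # total person-time, used to normalise
  vapply(x, function(xi) {
    # each interval contributes the kernel mass it places at xi
    sum(pnorm((xi - lo) / h) - pnorm((xi - hi) / h)) / total
  }, numeric(1))
}

grid <- seq(40, 80, by = 0.1)
dens <- interval_density(grid, age_start, age_stop, h = 1)

# Riemann-sum check that the curve integrates to approximately 1
area <- sum(dens) * 0.1
area
```

The cost is proportional to the number of patients times the number of grid points, and the bandwidth has to be chosen by hand here rather than by the default rule that `geom_density()` applies to point data.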

score 0 · Accepted Answer

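I would proceed along these lines:

If you are interested in the age distribution t days into the study, then each age is simply the age at enrollment plus t days. You need to handle the exceptions: those who have already died or been right-censored. In your example, you appear to "freeze" people's ages at the moment they leave the study. Personally, I think the age distribution of the uncensored survivors is more useful in survival analysis, but I will stick with your setup for this example.

At time t there are two possibilities for each patient: if t is less than the follow-up time, the age is the age at enrollment plus t; otherwise, the age is the age at enrollment plus the follow-up time.

If you wrap this in a function, you can see how the age distribution changes over the course of the study. For completeness, we will always plot a faint density of the ages at enrollment, along with a line showing the current mean age: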
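A minimal sketch of this rule on made-up numbers (the three patients below are hypothetical, not from the colon data). `pmin()` picks the smaller of t and the follow-up time for each patient, which is the same as `ifelse(time > t, t, time)`:

```r
# Hypothetical patients: age at enrollment (years), follow-up time (days)
age  <- c(40, 55, 70)
time <- c(100, 800, 2000)
t    <- 365.25  # look at ages one year into the study

# Elapsed time is capped at each patient's own follow-up time
age_at_t <- age + pmin(time, t) / 365.25
age_at_t
```

The first patient left the study after 100 days, so their age stays frozen at that point; the other two have each aged exactly one year.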

age_distribution <- function(df, t = 0)
{
  df %>% 
    mutate(age_at_t = age + ifelse(time > t, t, time) / 365.25) %>%
    ggplot() +
    geom_density(aes(age), linetype = 2, colour = "gray50") +
    geom_density(aes(age_at_t)) +
    geom_vline(aes(xintercept = mean(age_at_t)), color = "red", linetype = 2) +
    labs(x = paste("Age at day", t, "of study"),
         y = "Density",
         title = paste("Age distribution after", t, "days in study"))
}

So, for example:

age_distribution(ccdeath, 0)

After 1 year:

age_distribution(ccdeath, 365)

After 5 years:

age_distribution(ccdeath, 5 * 365.25)

For completeness, the equivalent function that drops censored/dead patients is as follows:

age_distribution <- function(df, t = 0)
{
  df %>% 
    filter(time > t) %>%
    mutate(age_at_t = age + t / 365.25) %>%
    ggplot() +
    geom_density(data = df, aes(age), linetype = 2, colour = "gray50") +
    geom_density(aes(age_at_t)) +
    geom_vline(aes(xintercept = mean(age_at_t)), color = "red", linetype = 2) +
    labs(x = paste("Age at day", t, "of study"),
         y = "Density",
         title = paste("Age distribution after", t, "days in study"))
}

So we can see how the shape of the distribution changes after 5 years of follow-up:

age_distribution(ccdeath, 5 * 365.25)

This shows more clearly that older people are lost disproportionately from the original cohort.
