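Description
This question is motivated by clinical/epidemiological studies, in which patients are typically enrolled and then followed for varying lengths of time.
The age distribution at study entry is often of interest and is easy to evaluate, but occasionally the age distribution over all time during the study is of interest.
My question is: is there a way to estimate such a density from interval data such as [age_start, age_stop] without expanding the data as below? The long-format approach seems inelegant, not to mention its memory usage!
Reproducible example using data from the survival package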
#### Prep Data ####
library(survival)
library(ggplot2)
library(dplyr)
data(colon, package = 'survival')
# example using the colon dataset from the survival package
ccdeath <- colon %>%
  # use data on time to death (not recurrence)
  filter(etype == 2) %>%
  # age at end of follow-up (death or censoring)
  mutate(age_last = age + (time / 365.25))
#### Distribution Using Single Value ####
# age at study entry
ggplot(ccdeath, aes(x = age)) +
  geom_density() +
  labs(title = "Figure 1: Density of age at study entry",
       x = "Age at Entry (years)",
       y = "Density")
#### Using Person-Month Level Data ####
# create counting-process/person-time dataset
ccdeath_cp <- survSplit(Surv(age, age_last, status) ~ .,
                        data = ccdeath,
                        cut = seq(from = floor(min(ccdeath$age)),
                                  to = ceiling(max(ccdeath$age_last)),
                                  by = 1/12))
nrow(ccdeath_cp) # over 50,000 rows
# distribution of age at person-month level
ggplot(ccdeath_cp, aes(x = age)) +
  geom_density() +
  labs(title = "Figure 2: Density based on approximate person-months",
       x = "Age (years)",
       y = "Density")
#### Using Person-Day Level Data ####
# create counting-process/person-time dataset
ccdeath_cp <- survSplit(Surv(age, age_last, status) ~ .,
                        data = ccdeath,
                        cut = seq(from = floor(min(ccdeath$age)),
                                  to = ceiling(max(ccdeath$age_last)),
                                  by = 1/365.25))
nrow(ccdeath_cp) # over 1.5 million rows!
# distribution of age at person-day level
ggplot(ccdeath_cp, aes(x = age)) +
  geom_density() +
  labs(title = "Figure 3: Density based on person-days",
       x = "Age (years)",
       y = "Density")

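Note: Although I have tagged this question "survival" because I think it will attract people familiar with that area, I am not interested in time-to-event here, only in the overall age distribution across all time under study.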
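
To make the target concrete, here is a rough sketch of one way the person-time density might be computed directly from the intervals, without any expansion. It rests on an assumption about what the expanded-data plots approximate: each patient contributes a uniform distribution of age over [age, age_last], weighted by follow-up length. Smoothing that mixture with a Gaussian kernel of bandwidth h has a closed form, namely the sum over patients of pnorm((a - age) / h) - pnorm((a - age_last) / h), divided by total follow-up time. The helper name `interval_density` and the bandwidth `h = 2` are hypothetical choices made for illustration; they do not come from any package.

```r
library(survival)
library(ggplot2)

# One row per patient (death endpoint), as in the prep step above
ccdeath <- subset(survival::colon, etype == 2)
ccdeath$age_last <- ccdeath$age + ccdeath$time / 365.25

# Hypothetical helper (not part of any package): person-time-weighted
# density of age on `grid`, computed from intervals [start, stop].
# Each interval is a uniform density convolved with a Gaussian kernel
# of bandwidth h, which reduces to a difference of normal CDFs.
interval_density <- function(start, stop, grid, h) {
  total_time <- sum(stop - start)
  dens <- vapply(grid, function(a) {
    sum(pnorm((a - start) / h) - pnorm((a - stop) / h))
  }, numeric(1))
  dens / total_time
}

# Evaluation grid padded by 5 bandwidths so the tails are covered
h <- 2  # assumed bandwidth in years, chosen arbitrarily
grid <- seq(min(ccdeath$age) - 5 * h,
            max(ccdeath$age_last) + 5 * h,
            length.out = 512)
dens <- interval_density(ccdeath$age, ccdeath$age_last, grid, h)

ggplot(data.frame(age = grid, density = dens), aes(x = age, y = density)) +
  geom_line() +
  labs(title = "Density from intervals (no expansion)",
       x = "Age (years)",
       y = "Density")
```

This uses one row per patient and a fixed-size grid, so memory use no longer depends on how finely follow-up time is split. The bandwidth is set by hand here, whereas geom_density() picks its own default, so the curve will not necessarily match Figures 2 and 3 exactly.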





