r - R: Tidy aggregation of sequence data and visualization of a step function

Question

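I have patient data in which individual patients change treatment groups over time. My goal is to visualize the order of these group changes and to aggregate the data into a "sequence profile" for each treatment group.

For each treatment group I want to show when in the course of treatment it typically occurs (for example at the beginning or at the end). To account for the different sequence lengths, I want to normalize these profiles to the range from 0 (very beginning) to 1 (end).

I am looking for an efficient way to prepare and visualize the data.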

Minimal example

Data structure

library(dplyr)
library(purrr)
library(ggplot2)

# minimal data
cj_df_raw <- tibble::tribble(
  ~id, ~group,
    1,    "A",
    1,    "B",
    2,    "A",
    2,    "B",
    2,    "A"
  )

# compute "intervals" for each person [start, end]
cj_df_raw %>% 
  group_by(id) %>% 
  mutate(pos = row_number(),
         len = length(id),
         start = (pos - 1) / len,
         end = pos / len) %>% 
  filter(group == "A")
#> # A tibble: 3 x 6
#> # Groups:   id [2]
#>      id group   pos   len start   end
#>   <dbl> <chr> <int> <int> <dbl> <dbl>
#> 1     1 A         1     2 0     0.5  
#> 2     2 A         1     3 0     0.333
#> 3     2 A         3     3 0.667 1

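(So id 1 is in group A for the first 50% of its sequence, and id 2 is in group A for the first 33% and the last 33% of its sequence. This means that 2 ids are in group A between 0 and 33% of the sequence, 1 between 33% and 50%, 0 between 50% and 66%, and 1 above 66%.)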
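The counts in the parenthetical above can be checked with a few lines of R. This is only a minimal sketch: `count_at()` is a hypothetical helper introduced here for illustration, and treating each interval as half-open `[start, end)` is an assumption of mine, not something stated in the question.

```r
library(dplyr)

# same minimal data as in the question
cj_df_raw <- tibble::tribble(
  ~id, ~group,
    1,    "A",
    1,    "B",
    2,    "A",
    2,    "B",
    2,    "A"
)

# normalized interval [start, end) for every row of every id
intervals <- cj_df_raw %>%
  group_by(id) %>%
  mutate(start = (row_number() - 1) / n(),
         end   = row_number() / n()) %>%
  ungroup()

# hypothetical helper: how many ids are in group `g` at normalized position `t`
count_at <- function(t, g) {
  sum(intervals$group == g & intervals$start <= t & t < intervals$end)
}

# one probe point inside each of the four ranges discussed above
probe <- c(0.2, 0.4, 0.6, 0.8)
counts_a <- vapply(probe, count_at, integer(1), g = "A")
counts_a
#> [1] 2 1 0 1
```

Probing single points like this does not scale, but it makes the target values of the profile explicit before any aggregation is attempted.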

This is the result I want to achieve, but I am missing an efficient way to transform the data.

Desired result

profile_treatmen_a <- tibble::tribble(
    ~x, ~y,
     0, 0L,
  0.33, 2L,
   0.5, 1L,
  0.66, 0L,
     1, 1L,
     1, 0L
  )

profile_treatmen_a %>% 
  ggplot(aes(x, y)) +
  geom_step(direction = "vh") +
  expand_limits(x = c(0, 1), y = 0)

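(Ideally, the area under the curve would be shaded.)

Ideal solution: via ggridges

The goal of the visualization is to compare the "sequence profiles" of many treatment groups at once. If I can prepare the data accordingly, I would like to use the ggridges package for a striking visual comparison of the treatment groups.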

library(ggridges)

data.frame(group = rep(letters[1:2], each=20),
           mean = rep(2, each=20)) %>% 
  mutate(count = runif(nrow(.))) %>% 
  ggplot(aes(x=count, y=group, fill=group)) +
  geom_ridgeline(stat="binline", binwidth=0.5, scale=0.9)

score 2 · Accepted Answer

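You can build helper intervals and then plot a histogram. Since every patient is in either group A or group B at any point of the sequence, the two groups together always sum to 100% of the patients. With these helper intervals you can also easily switch to other geoms.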

library(tidyverse, warn.conflicts = FALSE)
library(ggplot2)

# create sample data
set.seed(42)

id <- 1:10 %>% map(~ rep(x = .x, times = runif(n = 1, min = 1, max = 6))) %>%
  unlist()
group <- sample(x = c("A", "B"), size = length(id), replace = TRUE) %>%
  as_factor()
df <- tibble(id, group)
glimpse(df)
#> Observations: 37
#> Variables: 2
#> $ id    <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5,...
#> $ group <fct> A, B, B, A, A, B, B, A, A, B, B, A, B, B, A, B, A, B, A,...

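# tidy data: compute each row's normalized interval (from, to], then expand it
# into helper points on a 1/60 grid so that a histogram can count them
df <- df %>%
  group_by(id) %>%
  mutate(from = (row_number() - 1) / n(),
         to = row_number() / n()) %>%
  ungroup() %>%
  rowwise() %>%
  mutate(list = seq(from + 1/60, to, 1/60) %>% list()) %>%
  ungroup() %>%
  unnest(list)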

# plot
df %>%
  ggplot(aes(x = list, fill = group)) +
  geom_histogram(binwidth = 1/60) +
  ggthemes::theme_hc()

Created on 2018-09-16 by the [reprex package](http://reprex.tidyverse.org) (v0.2.0).
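To illustrate the remark about switching to other geoms, here is a hedged sketch that feeds the same kind of helper points into ggridges, which the question asked about. It uses the question's minimal data rather than the random sample, and it places the helper points at bin midpoints (`from + step / 2`), which is my own variation on the grid used above. By default the ridge height produced by the "binline" stat is, as far as I know, a density per group, so this shows the shape of each profile rather than absolute patient counts.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggridges)

# minimal data from the question
cj_df_raw <- tibble::tribble(
  ~id, ~group,
    1,    "A",
    1,    "B",
    2,    "A",
    2,    "B",
    2,    "A"
)

step <- 1 / 60  # grid resolution, carried over from the histogram above

# expand every interval into helper points located at bin midpoints
expanded <- cj_df_raw %>%
  group_by(id) %>%
  mutate(from = (row_number() - 1) / n(),
         to   = row_number() / n()) %>%
  ungroup() %>%
  mutate(x = purrr::map2(from, to, ~ seq(.x + step / 2, .y, by = step))) %>%
  unnest(x)

# one ridge per treatment group
p <- ggplot(expanded, aes(x = x, y = group, fill = group)) +
  geom_density_ridges(stat = "binline", binwidth = step, scale = 0.9)
```

Printing `p` draws the plot. With more treatment groups, each group simply adds another ridge.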

score 2 · Accepted Answer

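Here is my attempt. It may not be the best/fastest/most efficient approach, but I think it may help you along.

library(data.table)
library(dplyr)    # for %>%, group_by() and mutate()
library(ggplot2)

# compute "intervals" for each person [start, end]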
df <- cj_df_raw %>% 
  group_by(id) %>% 
  mutate(pos = row_number(),
         len = length(id),
         from = (pos - 1) / len,
         to = pos / len,
         value = 1)

dt <- as.data.table(df)
setkey(dt, from, to)

#create intervals
dt.interval <- data.table(from = seq( from = 0, by = 0.01, length.out = 100),
                          to = seq( from = 0.01, by = 0.01, length.out = 100))

#perform overlap join on intervals
dt2 <- foverlaps( dt.interval, dt, type = "within", nomatch = NA)[, sum(value), by = c("i.from", "group")]
#some melting and casting to fill in '0' on empty intervals
dt3 <- melt( dcast(dt2, ... ~ group, fill = 0), id.vars = 1 )

#plot
ggplot( dt3 ) +
  geom_step( aes( x = i.from, y = value, color = variable ) ) + 
  facet_grid( .~variable )
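If the fixed 0.01 grid is too coarse, an exact profile can also be derived without any grid. The following is only a sketch of an alternative technique (event counting with a cumulative sum), not the approach of the answer above: every interval contributes +1 at its start and -1 at its end, and the running total per group gives the step heights. The names `events`, `profile` and `shaded` are introduced here for illustration.

```r
library(dplyr)
library(ggplot2)

# minimal data from the question
cj_df_raw <- tibble::tribble(
  ~id, ~group,
    1,    "A",
    1,    "B",
    2,    "A",
    2,    "B",
    2,    "A"
)

intervals <- cj_df_raw %>%
  group_by(id) %>%
  mutate(start = (row_number() - 1) / n(),
         end   = row_number() / n()) %>%
  ungroup()

# +1 where an interval starts, -1 where it ends
events <- bind_rows(
  intervals %>% transmute(group, x = start, change =  1L),
  intervals %>% transmute(group, x = end,   change = -1L)
)

# running total per group: y is the count that holds from x until the next x
profile <- events %>%
  group_by(group, x) %>%
  summarise(change = sum(change), .groups = "drop") %>%
  group_by(group) %>%
  arrange(x, .by_group = TRUE) %>%
  mutate(y = cumsum(change)) %>%
  ungroup()

# shaded step profile: one rectangle per constant segment
shaded <- profile %>%
  group_by(group) %>%
  mutate(xend = lead(x)) %>%
  ungroup() %>%
  filter(!is.na(xend))

p <- ggplot(shaded) +
  geom_rect(aes(xmin = x, xmax = xend, ymin = 0, ymax = y, fill = group)) +
  facet_grid(. ~ group)
```

For group A this yields the breakpoints 0, 1/3, 0.5, 2/3 and 1 with counts 2, 1, 0, 1 and 0, which matches the desired result in the question. Note that breakpoints are merged by exact floating-point equality, which works here because coinciding points are computed in the same way.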
