r - R - 请在多状态马尔可夫和块引导方面需要帮助

Question

首先，我必须坚持，我对 R 非常陌生，以前从未使用过马尔可夫分析或 Bootstrap。我一直在研究这些问题，但找不到解决方案，所以决定发布这个问题。

我有一个动物运动数据，它由以 1、2、3 等数字编码的不同状态组成。我想运行多状态马尔可夫来产生转移概率矩阵，但因为我的数据包含每个主题的复制（例如，动物 1 测试 3 次，动物 2 测试 3 次，动物 3 测试 4 次），每个受试者包含一个面板数据（时间 0-2）。这是我的数据的示例：

data <- read.csv("test1.csv", header=T)
data
   Animal Time DV
1    1     0    1
2    1     1    2
3    1     2    3
4    1     0    1
5    1     1    3
6    1     2    2
7    1     0    3
8    1     1    1
9    1     2    1
10   2     0    2
11   2     1    1
12   2     2    2
13   2     0    2
14   2     1    3
15   2     2    1
16   2     0    2
17   2     1    2
18   2     2    1
19   3     0    2
20   3     1    1
21   3     2    1
22   3     0    2
23   3     1    1
24   3     2    2
25   3     0    1
26   3     1    2
27   3     2    1
28   3     0    2
29   3     1    3
30   3     2    3

由于每个主题都包含复制，我想在执行 msm 之前运行引导程序以重新采样主题。我查找了运行引导程序和马尔可夫分析的代码，但是在编写脚本为 qmatrix 创建初始值时，它返回了以下错误：

Q <- rbind(c(0.33, 0.33, 0.33), 
            c(0.33, 0.33, 0.33), 
            c(0.33, 0.33, 0.33))

Q.crude <- crudeinits.msm(DV ~ Time, Animal, data=data, qmatrix=Q)

Error in msm.check.times(time, subject, state) : 
     Observations within subjects 1, 2, 3 are not ordered by time

有人可以建议如何解决这个问题吗？此外，我计划使用以下脚本进行引导，但不确定它是否正确以及应该为“l”放置什么。

boot.f <- function(data){
              msm(DV ~ Time, subject=Animal, data = data, qmatrix = Q.crude, 
              gen.inits=T, death=F, exacttimes=T)}

boot <- tsboot(data, boot.f, R=1000, l=?, sim="fixed")

我的最终目标是获得每个转换的转换概率和 SD 的平均值。请如果有人可以阐明或就如何实现这一目标提出任何建议，我将非常感激。

score 4 · Accepted Answer

由 function 给出的错误消息crudeinits.msm（也将由 function 给出msm）是由于函数期望数据被您传递的第二个参数索引（包称为“主题”）作为Animal. 由于您的数据包含同一动物的复制品，因此格式与包的预期不符。

这是一些代码，实现了您想要的所有内容，包括引导程序。实现是相当初步的，例如使用“for 循环”，为简单起见，并且显然可以优化（例如，通过并行运行引导程序）。请注意，msm执行参数估计的调用包含在 try-catch 中，因为有时估计最终会失败（我猜是因为这里考虑的动物数量很少）。一个重要的细节是我已将选项设置obstype为等于 1，对应于“面板数据”的情况，其中每个时间序列都是在常规时间瞬间观察到的，因为这似乎是您的数据的情况; 见文档msm详情。对于您提供的数据，需要进行一些设置，以添加与“主题”字段对应的标识（如下面的代码中所述）。对于分析，数据是通过对每只动物替换 3 个时间序列进行采样而获得的。

# File containing the example data that was provided in the question
Data <- read.csv("test1.csv", header = TRUE)

# Add the ids of the replicates for each individual
addReplIds = function(D) {
    # Get the indices of the boundaries
    ind_bnds <- which(diff(D$Time) < 0)
    return (cbind(repl = unlist(mapply(
        rep, 
        x = 1:(length(ind_bnds) + 1), 
        length.out = diff(c(0, ind_bnds, nrow(D))), 
        SIMPLIFY = FALSE)), D))
}
library(dplyr)
Data <- as.data.frame(Data %>% group_by(Animal) %>% do(addReplIds(.)))
# Combine the animal and the replicate ids to identify a "sample" (a time-series)
Data <- mutate(Data, sample_id = paste(Animal, repl, sep = "."))

# Pack header data, linking each "sample" to the animal to which it belongs.
Header_data <- subset(Data, Time == 0, select = c("Animal", "sample_id"))

# Number of bootstrap iterations
N_bootp <- 1000
# Number of time-series to be sampled per animal
n_time_series_per_animal <- 3
# The duration of each time-series
t_max <- 2

library(msm)
lst_Bootp_results <- list()
for (i in seq(1, N_bootp)) {
    # Obtain the subject ids to be included in the data sample
    Data_sample <- as.data.frame(
        Header_data %>% group_by(Animal) %>% 
            do(sample_n(., n_time_series_per_animal, replace = TRUE)))
    # Add a column representing the "subject" (index for each time-series in 
    # this data sample)
    Data_sample <- cbind(Data_sample, subject = 1:nrow(Data_sample))
    # Add the actual data
    Data_sample <- merge(Data, Data_sample, by = c("Animal", "sample_id"))
    # Sort the data by time (as required by the `msm` package)
    Data_sample <- arrange(Data_sample, subject, Time)

    P_mat <- tryCatch({
        # Estimation
        Q_0 <- matrix(data = 1 / 3, nrow = 3, ncol = 3)
        model <- msm(DV ~ Time, subject = subject, data = Data_sample, 
                     qmatrix = Q_0, obstype = 1, gen.inits = TRUE)

        # Obtain the estimated transition probability matrix (over one time-unit)
        P_model <- pmatrix.msm(model)
        class(P_model) <- "matrix"
        P_model
    }, error = function(e) {
        warning(sprintf("[ERROR] %s", e), call. = FALSE, immediate. = TRUE)
        return (NULL)
    })

    if (!is.null(P_mat) && all(is.finite(P_mat)) && all(abs(rowSums(P_mat) - 1) < 1e-3))
        lst_Bootp_results[[i]] <- cbind(ind_bootp = i, 
                                        current_state = rownames(P_mat), 
                                        as.data.frame(P_mat))
}
cat(sprintf("Estimation failed in %d / %d of the bootstrap samples\n", 
            sum(sapply(lst_Bootp_results, is.null)), N_bootp))
Bootp_results <- do.call(rbind, lst_Bootp_results)

由于这是一个 3 状态模型，每个状态的转移概率可以用 3 顶点单纯形表示（使用 package ggtern），因此可以使用以下代码绘制结果：

# Generate figure
library(ggtern)
library(ggplot2)

Bootp_plot <- Bootp_results
Bootp_plot[, "current_state"] <- paste("When in ", Bootp_plot[, "current_state"], sep = "")
colnames(Bootp_plot)[3:5] = c("S1", "S2", "S3")
# Filter out points in the boundaries, otherwise the confidence regions 
# cannot be estimated by 'ggtern'
Bootp_plot <- subset(Bootp_plot, (S1 != 0) & (S2 != 0) & (S3 != 0))
cat(sprintf("Plotting %d data points (from %d)\n", nrow(Bootp_plot), nrow(Bootp_results)))

ggtern(data = Bootp_plot, aes(x = S1, y = S2, z = S3)) + 
    geom_point(size = rel(2), alpha = 0.5) +
    geom_confidence(breaks = c(0.5, 0.9, 0.95)) + 
    facet_wrap(~ current_state, nrow = 1) + 
    ggtitle(sprintf("Experimental data (%d time-series per individual, %d bootstrap samples)\n", 
                    n_time_series_per_animal, N_bootp)) + 
    labs(fill = "") + theme_rgbw() + labs(shape = "")
ggsave("bootstrap_results-data.pdf", height = 5, width = 9)

生产：

其中线条对应于 50%、90% 和 95% 的置信区域（参见 package 的文档ggtern）。

最后，如果您想从引导结果中检索统计信息，这里有一些代码。它计算 95% 置信区间的下限值和上限值，以及中值，就像在进行引导时一样；尽管我建议使用置信区间，但修改以获得转移概率的平均值和标准差是微不足道的：

# To calculate summary statistics, melt the data
Bootp_results <- melt(Bootp_results, id.vars = c("ind_bootp", "current_state"), 
                      variable.name = "next_state", value.name = "prob")
Bootp_stats <- as.data.frame(
    Bootp_results %>% group_by(current_state, next_state) %>% 
        summarize(lower_prob = quantile(prob, probs = 0.025, names = FALSE), 
                  median_prob = median(prob), 
                  upper_prob = quantile(prob, probs = 0.975, names = FALSE))
)

生产：

current_state next_state lower_prob median_prob upper_prob
      State 1    State 1 0.27166496   0.4635482  0.7735892
      State 1    State 2 0.12566126   0.3105540  0.4474771
      State 1    State 3 0.05316077   0.2012626  0.3948478
      State 2    State 1 0.24483769   0.4193336  0.6249328
      State 2    State 2 0.15762565   0.3148918  0.4466980
      State 2    State 3 0.06223002   0.2612689  0.5133920
      State 3    State 1 0.17428479   0.4434651  0.7183298
      State 3    State 2 0.06044651   0.2599676  0.4684195
      State 3    State 3 0.06399818   0.2777778  0.5997379

r - R - 请在多状态马尔可夫和块引导方面需要帮助

1 回答 1

Related

Reference