1

首先,我试图对包含许多个人的大型数据集进行二次抽样,但每个人都需要不同的子样本大小。我正在比较两个时间段,所以我想通过两个时间段中每个人的最小数据点对每个人进行二次抽样。其次,我有多个指标(主要是各种方法)来计算每个人、每个时间段(我在下面提供了一个示例)。第三,我想为这些指标引导 1,000 次重复。我也想为人口做这件事(通过对个体进行平均)。我有一个我在下面尝试过的例子,但这可能还有一段路要走。我对函数或 for 循环持开放态度 - 我无法概念化哪个更适合这个问题。(如果我的代码效率不高,我会提前道歉——我是通过谷歌搜索自学的。)

# Example dataset
Data <- data.frame(
ID = sample(c("A", "B", "C", "D"), 50, replace = TRUE),
Act = sample(c("eat", "sleep", "play"), 50, replace = TRUE),
Period = sample(c("pre", "post"), 50, replace = TRUE)
)

# Separate my data by period
DataPre <- as.data.frame(Data[ which(Data $Period == "pre"), ])
DataPost <- as.data.frame(Data[ which(Data $Period == "post"), ])

# Get the minimum # observations for each ID across both periods
Num <- Data %>% 
group_by(ID, Period) %>% 
summarise(number=n()) %>%
group_by(ID) %>%
summarise(min=min(number)) 

# Function to get the mean proportion per ID
meanAct <- function(x){
x %>%
group_by(ID, Act) %>%
summarise (n = n()) %>%
mutate(freq = n / sum(n))
}

以下是如果只有一个 ID(不同的子采样要求不同),我将如何进行子采样。我不知道如何指定按 ID 对不同的数量进行二次抽样,然后复制每个。

# See "8888" Here I want to subsample the Num$Min for each ID
DataResults <- function(x, rep){
reps <- replicate(rep, meanAct(x[sample(1:nrow(x), 8888, replace=FALSE),]))
meanfreq <- apply(simplify2array(reps[3, 1:2]), 1, mean)
sd <- apply(simplify2array(reps[3, 1:2]), 1, sd)
lower <- meanfreq - 1.96*(sd/sqrt(8888))
upper <- meanfreq + 1.96*(sd/sqrt(8888))
meanAct <- as.vector(reps[[1]])
output <- data.frame(meanAct, meanfreq, sd, lower, upper)
print(output)
}

# Print results
DataResults(DataPre, 1000)
DataResults(DataPost, 1000)

# Somehow I get the mean for the population by averaging across all IDs
DataMeanGroup <- DataMean %>%
group_by(Period) %>%
summarise (mean = mean(prop))

我正在寻找的结果是基于子采样(每个人的最小数据点)和引导 1000 次重复的每个人的每个活动的方法。此外,如果可能的话,通过对个体进行平均(再次来自二次抽样和自举)来获得总体平均值。

编辑:附加信息: 最终结果应该允许我比较每个 ID 在两个时间段内进行每项活动的时间比例(例如,比较 A 在前后吃饭的时间百分比等)。但是,由于数据过多,所以在该期间进行了二次抽样,因此我们比较了相同数量的观察值。代码在我脑海中运行的方式是(1)对观察进行二次抽样,以便我们比较两个时期内每个 ID 的相同数量的观察,(2)计算每个 ID 中每个活动的比例时间段,(3)重复该子样本计算 1,000 次,以便我们最终得到的比例代表总观测值。

4

1 回答 1

2

考虑泛化您的子采样函数以接收传入的数据帧的子集,其中可以通过IDPeriodby的每个唯一配对对数据帧进行切片。但首先使用(内联聚合)按每个IDPeriod计算。下面的所有代码都使用base R(即,没有其他包):MinNumave

数据和功能

# Example dataset (WITH MORE ROWS)
set.seed(11919)
Data <- data.frame(
  ID = sample(c("A", "B", "C", "D"), 500, replace = TRUE),
  Act = sample(c("eat", "sleep", "play"), 500, replace = TRUE),
  Period = sample(c("pre", "post"), 500, replace = TRUE)
)

# MIN NUM PER ID AND PERIOD GROUPING (NESTED ave FOR COUNT AND MIN AGGREGATIONS)
Data$Min_Num <- with(Data, ave(ave(1:nrow(Data), ID, Period, FUN=length), ID, FUN=min))

# Function to get the mean proportion per ID
meanAct <- function(x){
  within(x, { 
        n <- ave(1:nrow(x), ID, Act, FUN=length)
        freq <- n / sum(n)
  })
}

DataResults <- function(df, rep){
  reps <- replicate(rep, meanAct(df[sample(1:nrow(df), df$Min_Num[1], replace=FALSE),]))
  mean_freq <- apply(simplify2array(reps["freq", ]), 1, mean)    # ADJUSTED [] INDEXING
  sd <- apply(simplify2array(reps["freq", ]), 1, sd)             # ADJUSTED [] INDEXING
  lower <- mean_freq - 1.96*(sd/sqrt(df$Min_Num[1]))
  upper <- mean_freq + 1.96*(sd/sqrt(df$Min_Num[1]))
  mean_act <- as.vector(reps[[2]])                               # ADJUSTED [[#]] NUMBER 
  id <- df$ID[1]                                                 # ADD GROUP INDICATOR
  period <- df$Period[1]                                         # ADD GROUP INDICATOR

  output <- data.frame(id, period, mean_act, mean_freq, sd, lower, upper)
  return(output)
}

加工

# BY CALL
df_list <- by(Data, Data[c("ID", "Period")], function(sub) DataResults(sub, 1000))

# BIND ALL DFs INTO ONE DF
final_df <- do.call(rbind, df_list)
head(final_df, 10)
#    id period mean_act  mean_freq          sd      lower      upper
# 1   A   post    sleep 0.02157354 0.005704140 0.01992512 0.02322196
# 2   A   post      eat 0.02151701 0.005720058 0.01986399 0.02317003
# 3   A   post    sleep 0.02171393 0.005808156 0.02003546 0.02339241
# 4   A   post      eat 0.02164184 0.005716603 0.01998982 0.02329386
# 5   A   post     play 0.02174095 0.005678416 0.02009996 0.02338193
# 6   A   post      eat 0.02181380 0.005716590 0.02016178 0.02346581
# 7   A   post    sleep 0.02172458 0.005691051 0.02007995 0.02336922
# 8   A   post    sleep 0.02174288 0.005666839 0.02010524 0.02338052
# 9   A   post     play 0.02166234 0.005673047 0.02002291 0.02330177
# 10  A   post     play 0.02185057 0.005813680 0.02017050 0.02353065

总结

# SUMMARIZE FINAL DF (MEAN PROP BY ID AND ACT)
agg_df <- aggregate(mean_freq ~ id + mean_act, final_df, mean)
agg_df
#    id mean_act  mean_freq
# 1   A      eat 0.02172782
# 2   B      eat 0.01469706
# 3   C      eat 0.01814771
# 4   D      eat 0.01696995
# 5   A     play 0.02178283
# 6   B     play 0.01471497
# 7   C     play 0.01819898
# 8   D     play 0.01688828
# 9   A    sleep 0.02169912
# 10  B    sleep 0.01470978
# 11  C    sleep 0.01818944
# 12  D    sleep 0.01697438

# SUMMARIZE FINAL DF (MEAN PROP BY ID AND PERIOD)
agg_df <- aggregate(mean_freq ~ id + period, final_df, mean)
agg_df
#   id period  mean_freq
# 1  A   post 0.02173913
# 2  B   post 0.01470588
# 3  C   post 0.01818182
# 4  D   post 0.01694915
# 5  A    pre 0.02173913
# 6  B    pre 0.01470588
# 7  C    pre 0.01818182
# 8  D    pre 0.01694915

# SUMMARIZE FINAL DF (MEAN PROP BY ID)
agg_df <- aggregate(mean_freq ~ id, final_df, mean)
agg_df
#   id  mean_freq
# 1  A 0.02173913
# 2  B 0.01470588
# 3  C 0.01818182
# 4  D 0.01694915
于 2019-01-19T16:59:53.833 回答