r - Simulating a time series in dplyr instead of using a for loop

Question

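So, while lag and lead in dplyr are great, I want to simulate a time series of something like population growth. My old-school code would look something like: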

tdf <- data.frame(time=1:5, pop=50)
for(i in 2:5){
  tdf$pop[i] = 1.1*tdf$pop[i-1]
}

which produces

  time    pop
1    1 50.000
2    2 55.000
3    3 60.500
4    4 66.550
5    5 73.205

I feel like there has to be a dplyr or tidyverse way to do this (as much as I love my for loop).

But something like

tdf <- data.frame(time=1:5, pop=50) %>%
  mutate(pop = 1.1*lag(pop))

which would have been my first guess, just produces NA in the first row and 55 in every row after it.

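I feel like I'm missing something obvious... what is it?

Note - this is a trivial example - my real examples use multiple parameters, many of which are time-varying (I'm simulating forecasts under different GCM scenarios), so the tidyverse is proving to be a powerful tool for bringing my simulations together.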

score 9 · Accepted Answer

Reduce (or its purrr variants, if you like) is what you want for cumulative functions that don't already have a cum* version written:

data.frame(time = 1:5, pop = 50) %>%
    mutate(pop = Reduce(function(x, y){x * 1.1}, pop, accumulate = TRUE))

##   time    pop
## 1    1 50.000
## 2    2 55.000
## 3    3 60.500
## 4    4 66.550
## 5    5 73.205

or with purrr,

data.frame(time = 1:5, pop = 50) %>%
    mutate(pop = accumulate(pop, ~.x * 1.1))

##   time    pop
## 1    1 50.000
## 2    2 55.000
## 3    3 60.500
## 4    4 66.550
## 5    5 73.205
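
In both versions above, the function ignores its second argument. As a hedged sketch of how that argument could carry a time-varying parameter, base R's Reduce() can take the starting population as init and then walk along one growth rate per transition. The rates vector is made up for illustration and is not from the question.

```r
# Hypothetical growth rates for the transitions 1->2, 2->3, 3->4 and 4->5.
rates <- c(1.1, 1.2, 0.9, 1.05)

# x is the running population, y is the growth rate for the current transition.
# With init supplied and accumulate = TRUE, the result has length(rates) + 1
# elements: the starting value followed by one value per transition.
pop <- Reduce(function(x, y) x * y, rates, accumulate = TRUE, init = 50)

sim <- data.frame(time = 1:5, pop = pop)
```

Here sim$pop starts at 50 and each later value is the previous one multiplied by that step's rate.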

score 5 · Accepted Answer

If the starting value of pop is 50, then pop = 50 * 1.1^(0:4) gives you the starting value followed by the next four values. With your code, you could do:

data.frame(time=1:5, pop=50) %>%
  mutate(pop = pop * 1.1^(1:n() - 1))

Or,

base = 50

data.frame(time=1:5) %>%
  mutate(pop = base * 1.1^(1:n()-1))
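
The closed form only applies because the growth rate is constant. A small base R sketch, assuming the same start value of 50 and rate of 1.1 as in the question, checks it against the original for loop and shows why 1:n - 1 yields the exponents 0 to n - 1:

```r
n <- 5

# Reference: the for loop from the question.
loop_pop <- rep(50, n)
for (i in 2:n) {
  loop_pop[i] <- 1.1 * loop_pop[i - 1]
}

# `:` binds tighter than `-`, so 1:n - 1 is (1:n) - 1, i.e. 0, 1, ..., n - 1.
exponents <- 1:n - 1
closed_pop <- 50 * 1.1^exponents

# Compare with a floating-point tolerance rather than exact equality.
same <- isTRUE(all.equal(loop_pop, closed_pop))
```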

score 4 · Accepted Answer

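Purrr's accumulate() function can handle time-varying parameters if you pass them to your simulation function as a list with all the parameters in it. However, it takes a bit of wrangling to get this working right. The trick here is that accumulate() can work on list columns as well as vector columns. You can use the tidyr function nest() to group columns into a list vector containing the current population state and parameters, then use accumulate() on the resulting list column. This is a bit complicated to explain, so I've included a demo that simulates logistic growth with either a constant growth rate or a time-varying stochastic growth rate. I also included an example of how to use dplyr+purrr+tidyr to simulate multiple replicates of a given model.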

library(dplyr)
library(purrr)
library(ggplot2)
library(tidyr)

# Declare the population growth function. Note: the first two arguments
# have to be .x (the prior vector of populations and parameters) and .y,
# the current parameter value and population vector. 
# This example function is a Ricker population growth model. 
logistic_growth = function(.x, .y, growth, comp) {
  pop = .x$pop[1]
  growth = .y$growth[1]
  comp  = .y$comp[1]
  # Note: this uses the state from .x, and the parameter values from .y.
  # The first observation will use the first entry in the vector for .x and .y
  new_pop = pop*exp(growth - pop*comp)
  .y$pop[1] = new_pop
  return(.y)
}

# Starting parameters: the number of time steps to simulate, initial population size,
# and ecological parameters (growth rate and intraspecific competition rate)
n_steps  = 100
pop_init = 1
growth = 0.5
comp = 0.05

# First test: fixed growth rates
test1 = tibble(time = 1:n_steps, pop = pop_init,
               growth = growth, comp = comp)


# Here, nest() packs the population and parameter columns for each time point
# into a one-row data frame stored in a new list column called state.
# accumulate() then runs the function along that list of states, and
# unnest() transforms it all back to a data frame.
# Note: this uses the tidyr >= 1.0 syntax, nest(new_col = c(...)) and
# unnest(new_col), in place of the older nest(..., .key = ) form.
out1 = test1 %>%
  nest(state = c(pop, growth, comp)) %>%
  mutate(
    state = accumulate(state, logistic_growth)) %>%
  unnest(state)

# This is the same example, except the growth rates are drawn from a normal
# distribution with a mean equal to the mean growth rate and a std. dev. of 0.1
test2 = tibble(time = 1:n_steps, pop = pop_init,
               growth = rnorm(n_steps, growth, 0.1), comp = comp)

# Same pipeline as above, written with the tidyr >= 1.0 nest()/unnest() syntax
out2 = test2 %>%
  nest(state = c(pop, growth, comp)) %>%
  mutate(
    state = accumulate(state, logistic_growth)) %>%
  unnest(state)

# This demonstrates how to use this approach to simulate replicates using dplyr
# Note: the crossing() function creates all combinations of its input values
test3 = crossing(rep = 1:10, time = 1:n_steps, pop = pop_init, comp = comp) %>%
  mutate(growth = rnorm(n_steps*10, growth, 0.1))

# Nest one state per (rep, time) row, then accumulate within each replicate
# (tidyr >= 1.0 nest()/unnest() syntax)
out3 = test3 %>%
  nest(state = c(pop, growth, comp)) %>%
  group_by(rep) %>%
  mutate(
    state = accumulate(state, logistic_growth)) %>%
  ungroup() %>%
  unnest(state)

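# ggplot() is used here because qplot() is deprecated in recent ggplot2 releases
print(ggplot(out1, aes(time, pop)) +
        geom_point() +
        geom_line() +
        geom_point(data = out2, col = "red") +
        geom_line(data = out2, col = "red") +
        geom_point(data = out3, col = "red", alpha = 0.1) +
        geom_line(data = out3, col = "red", alpha = 0.1, aes(group = rep)))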

score 2 · Accepted Answer

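The issue here is that dplyr runs this as a set of vector operations rather than evaluating the expression one element at a time. Here, 1.1*lag(pop) is interpreted as "calculate the lagged values for all of pop, then multiply them all by 1.1". Since you set pop=50, the lagged values for all the steps were 50.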
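
A minimal sketch of that vectorized evaluation, assuming dplyr is attached so that lag() is dplyr's version rather than stats::lag():

```r
library(dplyr)

pop <- rep(50, 5)

# lag() shifts the whole vector at once; the first position has no predecessor,
# so it becomes NA.
lagged <- lag(pop)

# The multiplication is applied element-wise to the already-shifted vector,
# so no element ever sees a previously updated value.
result <- 1.1 * lagged
```

Here result is NA followed by four values of 55, because every step multiplies the original 50 rather than the previous step's output.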

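dplyr does have some helper functions for sequential evaluation; the standard functions cumsum, cumprod, etc. work, and a few new ones (see ?cummean) also work within dplyr. In your example, you could simulate the model with: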

tdf <- data.frame(time=1:5, pop=50, growth_rate = c(1, rep(1.1,times=4))) %>%
    mutate(pop = pop*cumprod(growth_rate))


time    pop     growth_rate
1       50.000  1.0
2       55.000  1.1
3       60.500  1.1
4       66.550  1.1
5       73.205  1.1

Note that I added growth rate as a column here, and I set the first growth rate to 1. You could also specify it like this:

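tdf <- data.frame(time=1:5, pop=50, growth_rate = 1.1) %>%
    mutate(pop = pop*cumprod(lag(growth_rate,default=1)))

This makes the timing explicit: lag() shifts the growth rates down one row and fills the first row with 1, so the first time step keeps the starting population and growth is only applied from one time step to the next. (lag() is used rather than lead() because lead() would apply a growth rate to the first time step and leave the last step without growth.)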

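There are limits to how many different simulations you can do this way, but it should be feasible to construct a lot of discrete-time ecological models using some combination of the cumulative functions and parameters specified in columns.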

score 1 · Accepted Answer

How about the map functions, i.e.

tdf <- tibble(time=1:5)
# time starts at 1, so the exponent is x - 1 to keep pop = 50 at time 1
tdf %>% mutate(pop = map_dbl(.x = tdf$time, .f = (function(x) 50*1.1^(x - 1))))
