
I am implementing a statistical method from an academic paper (see the end for a citation) in R. I think there's a way to do one of the steps without using a loop, but I'm having trouble deciding how to attack it.

This method operates on a data frame with three variables: x, n, and p. It can only operate if p[i] <= p[i+1] for all i. If a pair of points violates that, they are smoothed out by setting both p[i] and p[i+1] equal to their weighted average, (n[i]*p[i] + n[i+1]*p[i+1]) / (n[i] + n[i+1]). This smoothing is iterated until the p values form a nondecreasing sequence.
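For concreteness, here is a direct transcription of that pairwise step (the name naive.smooth and the tolerance/iteration guards are mine, not from the paper, which describes no stopping rule):

# Naive pairwise smooth, transcribed from the description above.
# tol and max.iter are safety guards I added; the paper's step has none.
naive.smooth <- function(df, tol = 1e-12, max.iter = 1e5) {
   for (iter in seq_len(max.iter)) {
      i <- which(diff(df$p) < -tol)[1]   # first pair with p[i] > p[i+1]
      if (is.na(i)) break                # nondecreasing (within tol): done
      # replace the violating pair with its weighted average
      df$p[c(i, i + 1)] <- (df$n[i] * df$p[i] + df$n[i + 1] * df$p[i + 1]) /
                           (df$n[i] + df$n[i + 1])
   }
   df
}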

The problem with this smooth is that (a) loops are bad form in R, and (b) if there are multiple points in a row such that p[i] > p[i+1] >= p[i+2], the method can fail to terminate or take a very long time to converge. For instance, given a sequence like this:

x  n  p
2  10 0.6
5  10 0.5
10 10 0.5

the smooth will set the first two values of p to 0.55, then the second two to 0.525, then the first two to 0.5375, and so on, looping forever (or, if I'm lucky, reaching the limit of floating-point significance after a bajillion iterations). There should be a mathematically equivalent but more efficient way to do this by identifying runs of adjacent decreasing points and averaging them as a group, but I'm not sure how to approach that in R.
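Running the naive.smooth sketch from above on that data illustrates the crawl: the violation gap roughly halves per pass, so the loop only stops once the gap drops below the tolerance.

bad <- data.frame(x = c(2, 5, 10),
                  n = c(10, 10, 10),
                  p = c(0.6, 0.5, 0.5))
naive.smooth(bad)$p   # all three values creep toward 1.6/3 = 0.5333...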

If you need more background, the paper in question is Martin A. Hamilton, Rosemarie C. Russo, Robert V. Thurston. "Trimmed Spearman-Karber method for estimating median lethal concentrations in toxicity bioassays." Environ. Sci. Technol., 1977, 11 (7), pp 714–719. I'm referring to the "first step" section on page 716.


2 Answers


As I understand the algorithm, you need to find the positions where p decreases and, starting from each one, work out how far the (cumulative) weighted average keeps decreasing, so that p can be updated block by block. I don't see how to do that without some kind of loop. Some solutions might hide the loop inside lapply or the like, but IMHO this is one of those algorithms that is complex enough that I prefer a good old-fashioned loop. You may lose a little efficiency, but the code reads nicely. Here is my attempt, using a while loop:

smooth.p <- function(df) {

   while (any(diff(df$p) < 0)) {

      # where does it start decreasing
      idx <- which(diff(df$p) < 0)[1]

      # from there, compute the cumulative weighted average
      sub <- df[idx:nrow(df), ]
      cuml.wavg <- cumsum(sub$n * sub$p) / cumsum(sub$n)

      # and see for how long it is decreasing
      bad.streak.len <- rle(diff(cuml.wavg) <= 0)$lengths[1]

      # these are the indices for the block to average
      block.idx <- seq(from = idx, length.out = bad.streak.len + 1)

      # compute and apply the average p
      df$p[block.idx] <- sum(df$p[block.idx] * df$n[block.idx]) /
                         sum(df$n[block.idx])
   }
   return(df)
}

Here is some data, including the problematic pattern you described:

df <- data.frame(x = 1:9,
                 n = rep(1, 9),
                 p = c(0.1, 0.3, 0.2, 0.6, 0.5, 0.5, 0.8, 1.0, 0.9))
df
#   x n   p
# 1 1 1 0.1
# 2 2 1 0.3
# 3 3 1 0.2
# 4 4 1 0.6
# 5 5 1 0.5
# 6 6 1 0.5
# 7 7 1 0.8
# 8 8 1 1.0
# 9 9 1 0.9

And the output:

smooth.p(df)
#   x n         p
# 1 1 1 0.1000000
# 2 2 1 0.2500000
# 3 3 1 0.2500000
# 4 4 1 0.5333333
# 5 5 1 0.5333333
# 6 6 1 0.5333333
# 7 7 1 0.8000000
# 8 8 1 0.9500000
# 9 9 1 0.9500000
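As a quick sanity check (my addition, not part of the original answer), one can verify the two invariants this pooling should satisfy: the result is nondecreasing, and each block average preserves the weighted total sum(n*p).

out <- smooth.p(df)
all(diff(out$p) >= 0)                   # TRUE: p is nondecreasing
isTRUE(all.equal(sum(out$n * out$p),
                 sum(df$n * df$p)))     # TRUE: weighted sum preserved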

Following Glen_b above: what Hamilton's paper describes is equivalent to gpava in the CRAN package isotone.
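A minimal sketch of that, assuming the gpava() interface from the isotone package (fitted values in $x; check ?gpava for the exact return structure), applied to the data frame from the first answer:

library(isotone)   # install.packages("isotone") if needed

df <- data.frame(x = 1:9,
                 n = rep(1, 9),
                 p = c(0.1, 0.3, 0.2, 0.6, 0.5, 0.5, 0.8, 1.0, 0.9))

# weighted isotonic regression of p on x, with n as the weights;
# the pooled fitted values should match smooth.p(df)$p above
fit <- gpava(z = df$x, y = df$p, weights = df$n)
fit$x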
