Apologies in advance if this has been asked before or if I am missing something obvious.

I have two datasets, "olddata" and "newdata":

set.seed(0)
olddata <- data.frame(x = rnorm(10, 0,5), y = runif(10, 0, 5), z = runif(10,-10,10))
newdata <- data.frame(x = -5:5, z = -5:5)

I fit a model on olddata and want to predict values for newdata:

mymodel <- lm(y ~ x+z, data = olddata)
predict.lm(mymodel, newdata)

However, I want to restrict the variables in "newdata" to the range of the variables the model was trained on.

I could of course do this:

 newnewdata <- subset(newdata, 
                      x < max(olddata$x) & x > min(olddata$x) &
                      z < max(olddata$z) & z > min(olddata$z))

But this gets unwieldy as the number of variables grows. Is there a less repetitive way to do this?

2 Answers

It seems that all the values in your newdata are already within the appropriate range, so there is nothing to subset. If we widen the range of newdata:

set.seed(0)
olddata <- data.frame(x = rnorm(10, 0,5), y = runif(10, 0, 5), z = runif(10,-10,10))
newdata <- data.frame(x = -10:10, z = -10:10)

newdata
     x   z
1  -10 -10
2   -9  -9
3   -8  -8
4   -7  -7
5   -6  -6
6   -5  -5
7   -4  -4
8   -3  -3
9   -2  -2
10  -1  -1
11   0   0
12   1   1
13   2   2
14   3   3
15   4   4
16   5   5
17   6   6
18   7   7
19   8   8
20   9   9
21  10  10

Then all we need to do is determine the range of each variable in olddata and call subset on newdata once for each of its columns:

ranges <- sapply(olddata, range, na.rm = TRUE)

for(i in 1:ncol(newdata)) {
  col_name <- colnames(newdata)[i]

  newdata <- subset(newdata, 
    newdata[,col_name] >= ranges[1, col_name] &
      newdata[,col_name] <= ranges[2, col_name])
}

newdata
    x  z
4  -7 -7
5  -6 -6
6  -5 -5
7  -4 -4
8  -3 -3
9  -2 -2
10 -1 -1
11  0  0
12  1  1
13  2  2
14  3  3
15  4  4
16  5  5
17  6  6
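
The loop above can also be wrapped in a reusable function. The following is only a sketch, not part of the original answer: the name `restrict_to_range` and its arguments are hypothetical, and it assumes the columns to check are those whose names appear in both data frames. It builds one logical mask and subsets once at the end.

```r
# Hypothetical helper (an illustration, not from the original answer).
# Keeps only the rows of `new` whose values fall inside the observed range
# of the same-named column in `old`. Columns are matched by name, so extra
# columns in `old` (such as the response y) are ignored.
restrict_to_range <- function(new, old) {
  shared <- intersect(names(new), names(old))
  keep <- rep(TRUE, nrow(new))
  for (col_name in shared) {
    rng <- range(old[[col_name]], na.rm = TRUE)
    keep <- keep & new[[col_name]] >= rng[1] & new[[col_name]] <= rng[2]
  }
  # which() drops rows where the comparison is NA (missing values in `new`)
  new[which(keep), , drop = FALSE]
}

set.seed(0)
olddata <- data.frame(x = rnorm(10, 0, 5), y = runif(10, 0, 5), z = runif(10, -10, 10))
newdata <- data.frame(x = -10:10, z = -10:10)

restricted <- restrict_to_range(newdata, olddata)
restricted
```

Because the original `newdata` is left untouched, the restricted copy can be passed straight to `predict(mymodel, restricted)`.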
answered 2013-05-02T03:53:27.760

Here is an approach using the *apply family (using SchaunW's newdata):

set.seed(0)
olddata <- data.frame(x = rnorm(10, 0, 5), y = runif(10, 0, 5), z = runif(10,-10,10))
newdata <- data.frame(x = -10:10, z = -10:10)

minmax <- sapply(olddata[-2], range)
newdata[apply(newdata, 1, function(a) all(a > minmax[1,] & a < minmax[2,])), ]

Some care is required: I have assumed that the columns of olddata, after dropping the second column (y), are the same as the columns of newdata and appear in the same order.
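
One way to relax that assumption is to select the training columns by name rather than by position. This is a sketch under the assumption that every column of newdata also exists in olddata; the variable `inside` is introduced here only for illustration.

```r
set.seed(0)
olddata <- data.frame(x = rnorm(10, 0, 5), y = runif(10, 0, 5), z = runif(10, -10, 10))
newdata <- data.frame(x = -10:10, z = -10:10)

# Pick the training columns by name, in newdata's column order,
# instead of dropping the response by position.
minmax <- sapply(olddata[names(newdata)], range)

# Guard against a mismatch between the two sets of columns.
stopifnot(identical(colnames(minmax), names(newdata)))

# TRUE for rows where every variable lies strictly inside the training range.
inside <- apply(newdata, 1, function(a) all(a > minmax[1, ] & a < minmax[2, ]))
newdata[inside, ]
```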

Brevity comes at the cost of speed. After increasing nrow(newdata) to 2000 to emphasize the difference, I found:

       test replications elapsed relative user.self sys.self user.child sys.child
1  orizon()          100   2.193   27.759     2.191    0.002          0         0
2 SchaunW()          100   0.079    1.000     0.075    0.004          0         0

My guess at the main cause is that repeated subsetting never tests the remaining criteria on rows that an earlier criterion has already excluded.
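
For anyone who wants to rerun a comparison like this, here is a rough sketch using base R's `system.time`. The 2000-row newdata and the bodies of the two functions benchmarked above are not shown in this thread, so `row_wise`, `column_wise`, and the random newdata below are assumptions made for illustration; timings will vary by machine.

```r
set.seed(0)
olddata <- data.frame(x = rnorm(10, 0, 5), y = runif(10, 0, 5), z = runif(10, -10, 10))

# Assumption: the 2000-row newdata used for the timings is not shown,
# so this one is made up for illustration.
newdata <- data.frame(x = runif(2000, -15, 15), z = runif(2000, -15, 15))

# Row-by-row test with apply(), as in this answer.
row_wise <- function(new, old) {
  minmax <- sapply(old[names(new)], range)
  new[apply(new, 1, function(a) all(a > minmax[1, ] & a < minmax[2, ])), ]
}

# One vectorized subset per column, as in the looping answer.
column_wise <- function(new, old) {
  ranges <- sapply(old, range, na.rm = TRUE)
  for (col_name in colnames(new)) {
    in_range <- new[[col_name]] >= ranges[1, col_name] &
      new[[col_name]] <= ranges[2, col_name]
    new <- new[in_range, , drop = FALSE]
  }
  new
}

# Time 100 replications of each approach.
system.time(for (i in 1:100) row_wise(newdata, olddata))
system.time(for (i in 1:100) column_wise(newdata, olddata))
```

The two functions differ in strict versus non-strict inequalities, which only matters for values that fall exactly on a range boundary.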

answered 2013-05-02T04:33:12.770