r - Efficient function to return variable-length vectors from a lookup table

Question

I have three data sources:

types<-c(1,3,3)
places<-list(c(1,2,3),1,c(2,3))
lookup.counts<-as.data.frame(matrix(runif(9,min=0,max=10),nrow=3,ncol=3))
assigned.places<-rep.int(0,length(types))

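The numbers in the types vector tell me what "type" a given observation is. The vectors in the places list tell me in which places the observation can be found (some observations are found in only one place, others in all places). By definition, each observation has one entry in types and one list element in places. lookup.counts tells me how many observations of each type there are in each place (generated from another data source).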

I want to randomly assign each observation to a place, with probabilities derived from lookup.counts. Using for loops it looks like this:

for (i in 1:length(types)){
  row<-types[i]
  columns<-places[[i]]
  this.obs<-lookup.counts[row,columns] #the counts of this type in each place
  total<-sum(this.obs)
  this.obs<-this.obs/total #the share of observations of this type in these places
  pick<-runif(1,min=0,max=1)

  #the following should really be a 'while' loop, but regardless it needs help
  for(j in 1:length(this.obs[])){
    if(this.obs[j] > pick){
      #pick is less than this county so assign
      pick<- 100 #just a way of making sure an observation doesn't get assigned twice
      assigned.places[i]<-colnames(lookup.counts)[j]
    }else{
      #pick is greater, move to the next category
      pick<- pick-this.obs[j]
    }
  }
}

I've been trying to vectorize this somehow, but I'm getting hung up on the variable lengths of places and this.obs.

In practice, of course, the lookup.counts table is quite a bit bigger (500 x 40), and I have some 900K observations with place lists of length 1 through 39.

score 2 · Accepted Answer

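To vectorize the inner loop, you can use sample or sample.int to choose among several alternatives with prescribed probabilities. Unless I misread your code, you want something like this: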

assigned.places[i] <- sample(colnames(this.obs), 1, prob = this.obs)
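As a small illustration of why the manual normalisation and the cumulative comparison loop are not needed: the prob argument of sample takes relative weights, which do not have to sum to one. The counts below are made up for the demonstration and are not from the question.

```r
set.seed(123)

# Hypothetical counts for one type across three places (illustrative values).
counts <- c(V1 = 2, V2 = 6, V3 = 12)

# Draw many times using the raw counts as weights; no division by the total.
draws <- sample(names(counts), 10000, replace = TRUE, prob = counts)

# Empirical shares per place, to compare with the expected shares.
freq <- as.numeric(table(factor(draws, levels = names(counts)))) / length(draws)
expected <- as.numeric(counts / sum(counts))
print(round(rbind(empirical = freq, expected = expected), 3))
```

The empirical shares should land close to counts / sum(counts), which is the distribution the original inner loop was trying to draw from.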

I'm a bit surprised that you index with colnames(lookup.counts). Shouldn't that be subset by columns? Either I've missed something, or there is a bug in your code.

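The varying lengths of the lists are a serious obstacle to vectorizing the outer loop. Perhaps you could use the Matrix package to store that information as a sparse matrix. Then you could simply multiply the probabilities by the corresponding row vector to exclude the columns that are not in the place list of a given observation. But since you would probably still use apply to run the sample code above, you might as well keep the list and use some form of apply to iterate over it.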
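A minimal sketch of the sparse-matrix idea, assuming the Matrix package is installed (it is a recommended package in standard R distributions). The names counts, allowed and pick_place are introduced here for illustration only, and lookup.counts is held as a plain matrix rather than a data frame.

```r
library(Matrix)

set.seed(42)
types  <- c(1, 3, 3)
places <- list(c(1, 2, 3), 1, c(2, 3))
counts <- matrix(runif(9, min = 0, max = 10), nrow = 3, ncol = 3,
                 dimnames = list(NULL, paste0("V", 1:3)))

# Sparse 0/1 matrix: one row per observation, one column per place,
# with a 1 wherever that place is in the observation's place list.
allowed <- sparseMatrix(
  i = rep(seq_along(places), lengths(places)),
  j = unlist(places),
  x = 1,
  dims = c(length(places), ncol(counts))
)

# Multiply the counts for the observation's type by its 0/1 row, so places
# outside its list get weight zero, then sample a column index.
pick_place <- function(i) {
  weights <- counts[types[i], ] * allowed[i, ]
  sample.int(ncol(counts), 1, prob = weights)
}

assigned <- colnames(counts)[vapply(seq_along(types), pick_place, integer(1))]
print(assigned)
```

This still iterates over observations, so it mainly changes how the place lists are stored rather than removing the loop.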

The overall result might look something like this:

assigned.places <- colnames(lookup.counts)[
  apply(cbind(types, places), 1, function(x) {
    sample(x[[2]], 1, prob=lookup.counts[x[[1]],x[[2]]])
  })
]

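The use of cbind and apply isn't particularly pretty, but it seems to work. Each x is a list of two items, x[[1]] being the type and x[[2]] being the corresponding places. We use these to index lookup.counts just as you did. Then we use the counts found there as relative probabilities when choosing one of the column indices we used in the subscript. Only after apply has assembled all of these numbers into a single vector are the indices turned into names via colnames.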

You can check whether things are faster if you don't cbind everything together, but instead iterate over the indices only:

assigned.places <- colnames(lookup.counts)[
  sapply(1:length(types), function(i) {
    sample(places[[i]], 1, prob=lookup.counts[types[i],places[[i]]])
  })
]
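One caveat with sample(places[[i]], 1, ...): when its first argument is a single number such as 3, sample draws from 1:3 rather than returning 3, and the length-one prob vector then no longer matches. The following is a hedged variant that samples a position with sample.int and maps it back to the column number. The helper name pick_one and the matrix copy counts are illustrative assumptions, and the second place list is changed to a single 3 to exercise that case.

```r
set.seed(1)
types  <- c(1, 3, 3)
places <- list(c(1, 2, 3), 3, c(2, 3))  # second observation: only place 3
lookup.counts <- as.data.frame(matrix(runif(9, min = 0, max = 10),
                                      nrow = 3, ncol = 3))

# A matrix copy makes row/column indexing return plain numeric vectors.
counts <- as.matrix(lookup.counts)

pick_one <- function(i) {
  cols <- places[[i]]
  # Sample a position within cols, then translate it to the column number.
  cols[sample.int(length(cols), 1, prob = counts[types[i], cols])]
}

assigned.places <- colnames(counts)[
  vapply(seq_along(types), pick_one, numeric(1))
]
print(assigned.places)
```

Whether the matrix conversion makes a noticeable difference at 900K observations would need to be timed on the real data.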

score 1 · Accepted Answer

This also seems to work:

# More convenient if lookup.counts is a matrix.
lookup.counts<-matrix(runif(9,min=0,max=10),nrow=3,ncol=3)
colnames(lookup.counts)<-paste0('V',1:ncol(lookup.counts))

# A function that does what the for loop does for each i
test<-function(i) {
  this.places<-colnames(lookup.counts)[places[[i]]]
  this.obs<-lookup.counts[types[i],this.places]
  sample(this.places,size=1,prob=this.obs)
}

# Applies the function for all i
sapply(1:length(types),test)
