r - Extracting values from a transition matrix using the columns of a data.frame in R

Question

I have a transition matrix giving the cost of moving from one state to another, for example:

cost <- data.frame( a=c("aa","ab"),b=c("ba","bb"))

(Assume the string "aa" is the cost of moving from a to a.)

I also have a data.frame of state transitions:

transitions <- data.frame( from=c("a","a","b"), to=c("a","b","b") )

I would like to add a column to transitions holding the cost of each transition, so that it ends up as:

  from to cost
1    a  a   aa
2    a  b   ab
3    b  b   bb

I'm sure there is an R-ish way to do this. I ended up using a for loop:

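# col1/col2 hold the names of the from/to columns, dist is the cost matrix
# (`with = FALSE` is data.table syntax for selecting columns by name)
n <- nrow(data)
v <- vector("numeric", n)
for (i in seq_len(n)) {
    z <- data[i, c(col1, col2), with = FALSE]
    za <- z[[col1]]
    zb <- z[[col2]]
    v[i] <- dist[za, zb]  # one matrix lookup per row
}
data <- cbind(data, d = v)
names(data)[ncol(data)] <- colName
data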

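But this feels very ugly and is very slow: it takes about 20 minutes on a 2M-row data.frame (whereas computing the distances between elements of the same table takes less than a second).

Is there a simple, fast, one- or two-line command that would give me the cost column above?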

score 3 · Accepted Answer

Update: now takes the known states into account.

A data.table solution:

library(data.table)

## Data generation
N <- 2e6
set.seed(1)
states <- c("a","b")
cost <- data.frame(a=c("aa","ab"),b=c("ba","bb"))
transitions <- data.frame(from=sample(states, N, replace=TRUE), 
                            to=sample(states, N, replace=TRUE))

## Expanded cost matrix construction
f <- expand.grid(states, states)
f <- f[order(f$Var1, f$Var2),]
f$cost <- unlist(cost)

## Prepare data.table
dt <- data.table(transitions)
setkey(dt, from, to)

## Routine itself  
dt[, cost := ""] # You don't need this line if cost is numeric
# One keyed update per (from, to) pair; invisible() suppresses the printed list
invisible(apply(f, 1, function(x) dt[J(x[1], x[2]), cost := x[3]]))

With 2M rows in transitions, this takes about 0.3 seconds.
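
For what it's worth, the same idea can probably be written as a single update join instead of one keyed update per state pair. This is only a sketch, not the answer's method: it assumes a data.table version that supports the `on =` argument, and the `lookup` table name is made up for illustration. It also assumes, as in the question's example, that the columns of cost are the "from" state and the rows the "to" state.

```r
library(data.table)

# Small version of the example data (plain strings, no factors)
cost <- data.frame(a = c("aa", "ab"), b = c("ba", "bb"), stringsAsFactors = FALSE)
transitions <- data.table(from = c("a", "a", "b"), to = c("a", "b", "b"))

# Hypothetical long lookup table: one row per (from, to) pair.
# unlist() walks cost column by column, so "from" changes slowest.
states <- names(cost)
lookup <- data.table(
  from = rep(states, each = length(states)),
  to   = rep(states, times = length(states)),
  cost = unlist(cost, use.names = FALSE)
)

# Update join: copy the matching cost into transitions by reference
transitions[lookup, on = .(from, to), cost := i.cost]
print(transitions)
```

The row order of transitions is left untouched, since the column is added in place.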

score 2 · Accepted Answer

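Here's one way (at least it works for this example, and I believe it will work for larger data as well; if not, please write back with an example):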

# load both cost and transition with stringsAsFactors = FALSE
# so that strings are NOT by default loaded as factors
cost <- data.frame(a = c("aa","ab"), b = c("ba","bb"), stringsAsFactors = FALSE)
transitions <- data.frame(from = c("a","a","b"), to = c("a","b","b"), 
                                      stringsAsFactors = FALSE)

# convert cost to vector: it'll have names a1, a2, b1, b2. we'll exploit that.
cost.vec <- unlist(cost)
# convert "to" to factor and create id column with "from" and as.integer(to)
# the as.integer(to) will convert it into its level index; fixing the levels to
# names(cost) assumes the rows of cost follow the same state order as its columns
transitions$to <- factor(transitions$to, levels = names(cost))
transitions$id <- paste0(transitions$from, as.integer(transitions$to))

# now, you'll have a1, a2 etc.. here as well, so look each id up by name
# (indexing by name also handles repeated transitions)
transitions$val <- unname(cost.vec[transitions$id])

#   from to id val
# 1    a  a a1  aa
# 2    a  b a2  ab
# 3    b  b b2  bb

You can of course drop id afterwards. If this doesn't work in some case, let me know and I'll try to fix it.
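
A possible alternative that skips the id column altogether is to index the cost matrix with a two-column character matrix, which base R matches against the dimnames. This is a minimal sketch rather than part of the answer above. It assumes the rows of cost are in the same state order as its columns, and that the columns are the "from" state and the rows the "to" state, as in the question's example; `cost_mat` is a name chosen here for illustration.

```r
cost <- data.frame(a = c("aa", "ab"), b = c("ba", "bb"), stringsAsFactors = FALSE)
transitions <- data.frame(from = c("a", "a", "b"), to = c("a", "b", "b"),
                          stringsAsFactors = FALSE)

# Turn the cost table into a matrix with state names on both dimensions
cost_mat <- as.matrix(cost)
rownames(cost_mat) <- colnames(cost_mat)  # assumption: rows follow column order

# Each row of the index matrix is one (row, column) = (to, from) lookup.
# The columns must be character, not factor, for name matching to apply.
transitions$cost <- cost_mat[cbind(transitions$to, transitions$from)]
print(transitions)
```

If the real matrix is stored with "from" in the rows instead, the two columns inside `cbind()` would need to be swapped.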

score 2 · Accepted Answer

Starting from Arun's answer, I went with:

library(reshape)
cost <- data.frame( a = c("aa","ab"), b = c("ba","bb") )
transitions <- data.frame(from = c("a","a","b"), to = c("a","b","b") )
row.names(cost) <- c("a","b") #Normally get this from the csv file
# In this example the columns of cost are the "from" state and the rows the "to"
# state, so the row names go into "to" and the melted variable is "from"
cost$to <- row.names(cost)
m <- melt(cost, id.vars = "to")
m$transition <- paste(m$variable, m$to)
transitions$transition <- paste(transitions$from, transitions$to)
merge(m, transitions, by = "transition")

It's a few more lines, but I'm a bit wary of relying on factor ordering as an index. It also means that when they are data.tables, I can do this:

setkey(m,transition)
setkey(transitions,transition)
m[transitions]

I haven't benchmarked it, but on large data sets I'm fairly confident the data.table join will be faster than merge or the vector-scan approach.
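
As a side note, the pasted transition key may not be strictly necessary, since `merge()` can join on both columns directly. The following is a rough base R sketch under the same assumption as the question's example (columns of cost are the "from" state, rows the "to" state); `cost_long` is a hypothetical name, and no timing is claimed for it.

```r
cost <- data.frame(a = c("aa", "ab"), b = c("ba", "bb"), stringsAsFactors = FALSE)
transitions <- data.frame(from = c("a", "a", "b"), to = c("a", "b", "b"),
                          stringsAsFactors = FALSE)

# Long format without reshape: unlist() goes column by column,
# so "from" (the column) changes slowest and "to" (the row) fastest
states <- names(cost)
cost_long <- data.frame(
  from = rep(states, each = length(states)),
  to   = rep(states, times = length(states)),
  cost = unlist(cost, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Join on both key columns; all.x = TRUE keeps transitions with no known cost
result <- merge(transitions, cost_long, by = c("from", "to"), all.x = TRUE)
print(result)
```

Note that `merge()` sorts the result by the key columns, so the original row order of transitions is not preserved.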
