r - R: Counting multiple occurrences of strings across multiple (!) columns

Question

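First of all, thanks for the many posts answering questions I had (and therefore never needed to ask); they have already taken me a long way!

Still, I can't help it, and I have run into another counting problem:

I have a large data set of about 30,000 rows and 5 columns, filled with names. In total there are about 14,000 distinct names in the df. I am interested in how often names co-occur within a row, regardless of whether a name sits in column 1, 2, 3, and so on.

As an example, the matrix looks like this (probably horrible coding):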

testmatrix <- matrix(nrow = 52, ncol = 5)

for (i in 1:5) {
    # draw 52 letters so each column is filled without silent recycling
    testmatrix[, i] <- sample(letters, 52, replace = TRUE)
}

data <- as.data.frame(testmatrix)

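My final matrix should then have 26 rows and 26 columns in the test example (14,000 x 14,000 for the "real" data set) and contain all co-occurrence counts. I could use aggregate (I think), but then I would have to generate a lot of dfs, one for each pair of columns (1-2, 1-3, 1-4, etc.). Maybe there is a single, simpler function for this (perhaps even in the plyr package?).

Thanks in advance, I hope this is an easy one for you ;)

Best, Al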

score 2 · Accepted Answer

Something like this might help you get started...

# an example matrix of letters
m <- matrix(sample(letters, 30, replace = TRUE), nrow = 6, ncol = 5)
m

# the unique values in the matrix
vals <- sort(unique(as.vector(m)))

# rearrange the data so that each value is a column
bigm <- t(apply(m, 1, function(row) match(vals, row, nomatch=0)))
colnames(bigm) <- vals
bigm

# count the co-occurrences of each pair of values
# (the diagonal is the number of rows containing that value)
crossprod(bigm > 0)
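
To make the idea easier to check by hand, here is a minimal sketch of the same indicator-matrix approach on a fixed three-row matrix. The names `ind` and `co` are made up for this illustration, and `%in%` is used in place of `match()` only to build the 0/1 indicator directly.

```r
# Three rows of names; "a" appears in every row, "d" appears twice in row 2.
m <- matrix(c("a", "b", "c",
              "a", "d", "d",
              "b", "a", "e"),
            nrow = 3, byrow = TRUE)

vals <- sort(unique(as.vector(m)))

# One indicator column per name: 1 if the name appears anywhere in the row.
ind <- t(apply(m, 1, function(row) as.integer(vals %in% row)))
colnames(ind) <- vals

# Entry [i, j] is the number of rows that contain both name i and name j.
co <- crossprod(ind)
co
```

Because the indicator only records presence, a name repeated within one row (like "d" above) is counted once for that row.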

score 1 · Accepted Answer

I can't think of a cute functional way to do this, but the loop below is surprisingly fast.

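x <- matrix(as.character(sample(1:14000, 30000 * 5, replace = TRUE)), 30000, 5)
# note: a dense 14,000 x 14,000 numeric matrix needs roughly 1.5 GB of RAM
countmat <- matrix(0, 14000, 14000,
                   dimnames = list(as.character(1:14000), as.character(1:14000)))
for (i in seq_len(nrow(x))) {
    # table(x[i, ], x[i, ]) would only fill the diagonal, so instead take the
    # distinct names in row i and add 1 for every pair of them
    u <- unique(x[i, ])
    countmat[u, u] <- countmat[u, u] + 1
}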

Edit:

Then I realized there is a cute functional way to do this after all, but it uses too much memory for my machine.

library(plyr)  # adply() comes from the plyr package

x <- matrix(as.character(sample(1:14000, 30000 * 5, replace = TRUE)), 30000, 5)
cx <- adply(x, .margins = 1, .fun = function(x) table(x, x))
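
Since memory is the limiting factor here, one option is to store the row-by-name indicator matrix sparsely. The following is only a sketch, assuming the Matrix package is available (it is a recommended package in standard R installations) and using a scaled-down stand-in for the real data; the names `pairs`, `ind` and `co` are hypothetical.

```r
library(Matrix)

set.seed(1)
# Scaled-down stand-in: 3,000 rows, 5 columns, up to 1,400 distinct names.
x <- matrix(as.character(sample(1:1400, 3000 * 5, replace = TRUE)), 3000, 5)

vals <- sort(unique(as.vector(x)))

# One (row, name) index pair per cell. as.vector(x) runs column by column,
# so repeating 1..nrow(x) once per column gives the matching row numbers.
# unique() drops repeats so a name occurring twice in a row counts once.
pairs <- unique(cbind(row = rep(seq_len(nrow(x)), times = ncol(x)),
                      col = match(as.vector(x), vals)))

# Sparse 0/1 indicator matrix: rows of x by distinct names.
ind <- sparseMatrix(i = pairs[, "row"], j = pairs[, "col"],
                    x = rep(1, nrow(pairs)),
                    dims = c(nrow(x), length(vals)),
                    dimnames = list(NULL, vals))

# Sparse co-occurrence counts; the diagonal is the number of rows per name.
co <- crossprod(ind)
dim(co)
```

The result stays sparse, so only pairs that actually co-occur take up memory; individual counts can be looked up by name, for example `co["1", "2"]`.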
