r - Removing duplicates, leaving "cookie crumbs" to record why each row was removed

Question

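I have a dataset with roughly 500,000 rows and 45 columns. I want to remove rows that are duplicates of one another, as R's unique() function does (keep the first occurrence, drop the rest), but for each removed row I also want to record which kept row it was equal to.

Let me say the same thing another way, since I find this a bit hard to explain. For each set of identical rows in my dataset (call it S), I want to keep just one of them in the dataset (call the kept row K). I want to discard the other size(S) - 1 identical rows (call them D). For each element of D, I want to know the index of K.

I can do this with for loops, but I wonder whether there is a more elegant way using unique(), duplicated(), and so on. Note that I use the variable name "pioneers" for K, "dupes" for D, and "dupes.i" for the indices of D.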

pioneers <- unique(genos.varying)
dupes.i <- duplicated(genos.varying)
dupes <- genos.varying[dupes.i,]

# note -- look at the rowname of the pioneer to see where it was in the 'original' dataset
which.pioneer.by.dupes <- matrix(data=NA, nrow=nrow(dupes))

for (d in seq_len(nrow(dupes))) {
    for (p in seq_len(nrow(pioneers))) {
        # compare whole rows; pioneers are unique, so stop at the first match
        if (all(pioneers[p,] == dupes[d,])) {
            which.pioneer.by.dupes[d] <- p
            break
        }
    }
}

Thanks for any suggestions!

Also, here is a practice dataset, in case that makes it easier for people to work with:

genos.varying <- matrix(c(1,2,3,7,6,4,1,2,3,4,3,6), ncol = 3, byrow=TRUE)

The output should be something like the following:

Keep rows 1,2, and 4.  Row 3 is a duplicate of row 1.

score 2 · Accepted Answer

A solution using row-wise hashing:

library(digest)
g <- matrix(c(1,2,3,7,6,4,1,2,3,4,3,6, 1,2,3, 7,6,4), ncol = 3, byrow=TRUE)
df <- as.data.frame(g)
df$digest <- apply(g,1,digest)

keep <- sort(as.integer(by(df, df$digest, function(x) rownames(x)[1])))
cat('keeping rows ', paste0(keep, collapse=', '), '\n')

res <- by(df, df$digest, function(x) {
    set <- sort(as.integer(rownames(x)))
    if (length(set) > 1)
      cat('row(s) ', paste0(set[-1], collapse=', '), ' are duplicates of row ', set[1], '\n')
    set
 })

The input is:

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    7    6    4
[3,]    1    2    3
[4,]    4    3    6
[5,]    1    2    3
[6,]    7    6    4

The output is:

keeping rows  1, 2, 4
row(s)  6  are duplicates of row  2 
row(s)  3, 5  are duplicates of row  1
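
For comparison, the same bookkeeping can probably be done in base R without the digest package. The following is only a sketch, not part of the answer above: it assumes each row can be collapsed into a string key with paste(), that the separator "\r" never occurs in the data, and that the values survive conversion to character unchanged (true for integers like these, not guaranteed for arbitrary floating-point values). The names keys, first.i, keep.i and crumbs are made up for illustration.

```r
# Same practice matrix as in the answer: rows 3 and 5 repeat row 1, row 6 repeats row 2.
g <- matrix(c(1,2,3, 7,6,4, 1,2,3, 4,3,6, 1,2,3, 7,6,4), ncol = 3, byrow = TRUE)

# Collapse every row into one string key (assumption: "\r" never appears in the data).
keys <- do.call(paste, c(as.data.frame(g), sep = "\r"))

# match() returns the position of the FIRST occurrence of each key,
# so first.i[i] is the index of the row that row i is equal to.
first.i <- match(keys, keys)

keep.i  <- which(!duplicated(keys))  # rows to keep (K)
dupes.i <- which(duplicated(keys))   # rows to drop (D)

# The "cookie crumbs": one line per dropped row, pointing at the row that was kept.
crumbs <- data.frame(dupe = dupes.i, pioneer = first.i[dupes.i])
print(crumbs)
```

Here g[keep.i, ] gives the same rows as unique(g), and crumbs pairs each dropped row with the original row number of its pioneer. Because match() and duplicated() are vectorized, this avoids the nested loop from the question, though building the string keys for 500,000 rows and 45 columns still costs some time and memory.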
