r - Find and merge duplicate rows in a data.frame, ignoring column order

Question

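I have a data.frame with 1,000 rows and 3 columns. It contains a large number of duplicates, and I used plyr to combine the duplicate rows and add a count for each combination, as described in this thread.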

Here is an example of what I have now (I still have the original data.frame with all the duplicates, in case I need to start from there):

   name1    name2    name3     total
1  Bob      Fred     Sam       30
2  Bob      Joe      Frank     20
3  Frank    Sam      Tom       25
4  Sam      Tom      Frank     10
5  Fred     Bob      Sam       15

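However, the column order does not matter. I only want to know how many rows contain the same three entries, in any order. How can I combine rows that contain the same entries, ignoring order? In this example, I want to combine rows 1 and 5, and rows 3 and 4.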

score 4 · Accepted Answer

Define another column that's a "sorted paste" of the names, which would have the same value of "Bob~Fred~Sam" for rows 1 and 5. Then aggregate based on that.

Brief code snippet (assumes the original data frame is dd). We create a lookup column holding the sorted, pasted names for each row, get the sums of the total column for each combination, and then filter down to the unique combinations:

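# Order-independent key: sort the three names in each row, then paste them
dd$lookup=apply(dd[,c("name1","name2","name3")],1,
                function(x){paste(sort(x),collapse="~")})
# Sum of total for each key
tab1=tapply(dd$total,dd$lookup,sum)
# Keep the first row seen for each key
ee=dd[match(unique(dd$lookup),dd$lookup),]
# Attach the summed total to each kept row
ee$newtotal=as.numeric(tab1)[match(ee$lookup,names(tab1))]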

You now have in ee a set of unique rows and their corresponding total counts. No external packages are needed, and you can inspect the intermediate result at every stage of the process.

(Minor update to help OP:) And if you want a cleaned-up version of the final answer:

outdf = with(ee,data.frame(name1,name2,name3,
                           total=newtotal,stringsAsFactors=FALSE))

This gives you a neat data frame with the three all-important name columns, and with the aggregated totals in a column called total rather than newtotal.
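
As a quick check, the steps above can be run end to end on the five sample rows from the question. This is only a sketch: dd is rebuilt by hand here (an assumption, since the asker's real data has about 1,000 rows), and the kept rows carry the name order of the first occurrence of each combination.

```r
# Sample data from the question, rebuilt by hand for illustration.
dd <- data.frame(
  name1 = c("Bob", "Bob", "Frank", "Sam", "Fred"),
  name2 = c("Fred", "Joe", "Sam", "Tom", "Bob"),
  name3 = c("Sam", "Frank", "Tom", "Frank", "Sam"),
  total = c(30, 20, 25, 10, 15),
  stringsAsFactors = FALSE
)

# Order-independent key: sort the three names in each row, then paste them.
dd$lookup <- apply(dd[, c("name1", "name2", "name3")], 1,
                   function(x) paste(sort(x), collapse = "~"))

# Sum the totals per key, keep the first row of each key, attach the sums.
tab1 <- tapply(dd$total, dd$lookup, sum)
ee <- dd[match(unique(dd$lookup), dd$lookup), ]
ee$newtotal <- as.numeric(tab1)[match(ee$lookup, names(tab1))]

print(ee[, c("name1", "name2", "name3", "newtotal")])
```

Rows 1 and 5 share the key "Bob~Fred~Sam" and rows 3 and 4 share "Frank~Sam~Tom", so three rows remain and their totals still add up to the original 100.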

score 4 · Accepted Answer

Sort the names within each row, then use ddply to aggregate and sum:

Define the data:

dat <- "   name1    name2    name3     total
1  Bob      Fred     Sam       30
2  Bob      Joe      Frank     20
3  Frank    Sam      Tom       25
4  Sam      Tom      Frank     10
5  Fred     Bob      Sam       15"

x <- read.table(text=dat, header=TRUE)

Create a copy:

xx <- x

Use apply to sort the names within each row, then aggregate:

xx[, -4] <- t(apply(xx[, -4], 1, sort))
library(plyr)
ddply(xx, .(name1, name2, name3), numcolwise(sum))
  name1 name2 name3 total
1   Bob Frank   Joe    20
2   Bob  Fred   Sam    45
3 Frank   Sam   Tom    35
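
If plyr is not available, the same row-sorting idea can be sketched with base R's aggregate. This is a swapped-in alternative, not part of the answer above; the column names are taken from the question, and the row order of the result is not relied on here.

```r
# Sample data from the question.
x <- read.table(text = "name1 name2 name3 total
Bob Fred Sam 30
Bob Joe Frank 20
Frank Sam Tom 25
Sam Tom Frank 10
Fred Bob Sam 15", header = TRUE, stringsAsFactors = FALSE)

name_cols <- c("name1", "name2", "name3")

# Sort the three names within each row so that order no longer matters.
# apply() returns one sorted row per column, hence the transpose.
xx <- x
xx[, name_cols] <- t(apply(x[, name_cols], 1, sort))

# Base R counterpart of the ddply call: sum total per name combination.
res <- aggregate(total ~ name1 + name2 + name3, data = xx, FUN = sum)
print(res)
```

The formula interface groups by all three sorted name columns at once, so each distinct set of names ends up in exactly one row of res.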
