
I have a data.frame with only three columns but many thousands of rows. The first and second columns hold numerical IDs, and each pair of IDs denotes a link whose order does not matter (e.g. A-B is the same link as B-A).

Now I'd like to remove all rows that duplicate a link, keeping only the row with the highest value in the third column.

Below is a short example:

My input data.frame:

1   2    100
102 100  20000
100 102  23131
10  19 124444
10  15   1244
19  10   1242
10  19   5635
2   1    666
1   2     33
100 110     23

What I aim to get:

100 102  23131
10  19 124444
10  15   1244
2   1    666
100 110     23

I'm trying to find a solution in R, but PostgreSQL would be fine too. Thanks a lot!

4 Answers

The idea is similar to this one. You can create two additional columns using pmin and pmax and group by them, as follows:

This is a data.table solution, but the idea carries over if you don't want to use data.table. That said, plain R code is unlikely to be faster than the data.table version.

# assuming your data.frame is DF, with the default column names V1, V2, V3
library(data.table)
DT <- as.data.table(DF)
# get min of V1,V2 on one column and max on other (for grouping)
DT[, `:=`(id1=pmin(V1, V2), id2=pmax(V1, V2))]
# get max of V3
DT.OUT <- DT[, .SD[which.max(V3), ], by=list(id1, id2)]
# remove the id1 and id2 columns
DT.OUT[, c("id1", "id2") := NULL]

#     V1  V2     V3
# 1:   2   1    666
# 2: 100 102  23131
# 3:  10  19 124444
# 4:  10  15   1244
# 5: 100 110     23
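
For readers who prefer to stay in base R, the same pmin/pmax idea can be sketched roughly as follows. This is only an illustration, not a benchmarked alternative: the column names V1, V2, V3 and the helper names id1, id2, ord, sorted, keep and result are assumptions. Instead of grouping, it sorts the rows so that the largest V3 comes first within each link and then drops the later duplicates.

```r
# Example data from the question; the column names V1, V2, V3 are assumed.
DF <- data.frame(
  V1 = c(1, 102, 100, 10, 10, 19, 10, 2, 1, 100),
  V2 = c(2, 100, 102, 19, 15, 10, 19, 1, 2, 110),
  V3 = c(100, 20000, 23131, 124444, 1244, 1242, 5635, 666, 33, 23)
)

# Order-independent key for each link: smaller ID first, larger ID second.
id1 <- pmin(DF$V1, DF$V2)
id2 <- pmax(DF$V1, DF$V2)

# Sort rows so that, within each link, the largest V3 comes first.
ord <- order(id1, id2, -DF$V3)
sorted <- DF[ord, ]

# Keep only the first row of each link, i.e. the one with the highest V3.
keep <- !duplicated(data.frame(id1, id2)[ord, ])
result <- sorted[keep, ]
rownames(result) <- NULL
result
#    V1  V2     V3
# 1   2   1    666
# 2  10  15   1244
# 3  10  19 124444
# 4 100 102  23131
# 5 100 110     23
```

Unlike the data.table output above, the rows come back sorted by the link key rather than in order of first appearance, but each kept row still shows its IDs in their original orientation (2-1 stays 2-1).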
answered 2013-03-19T09:44:13.597

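Here is an option in base R, mostly to share an alternative. Because it transposes and sorts row by row, it is likely to be slow on your "thousands of rows" dataset. It assumes your data.frame is called "mydf". Note that it overwrites the first two columns with the sorted IDs, so every link is reported with the smaller ID first (2-1 becomes 1-2):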

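# sort the two IDs within each row so that A-B and B-A become identical
myAggs <- as.data.frame(t(apply(mydf[1:2], 1, sort)))
# overwrite the ID columns with the sorted pairs
mydf[1:2] <- myAggs
# take the maximum of V3 for each distinct pair
aggregate(V3 ~ ., mydf, max)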
#    V1  V2     V3
# 1   1   2    666
# 2  10  15   1244
# 3  10  19 124444
# 4 100 102  23131
# 5 100 110     23
answered 2013-03-19T09:56:58.450

In PostgreSQL:

If your original table has three integer columns a, b and c, you can use CASE expressions to build an order-independent key (max(a, b), min(a, b)):

-- my_table is a placeholder for your table name
-- (a bare "table" is a reserved word and cannot be used unquoted)
select 
 case when a>=b then a else b end as key1, 
 case when a>=b then b else a end as key2,  
 c from my_table;

You can then use GROUP BY to get the maximum c for each key (key1, key2):

-- my_table is a placeholder for your table name
-- (a bare "table" is a reserved word and cannot be used unquoted)
select 
 key1, 
 key2, 
 max(c) as max_c 
 from (
  select 
  case when a>=b then a else b end as key1, 
  case when a>=b then b else a end as key2,  
  c from my_table
 ) query
 group by key1, key2;
answered 2013-03-19T09:59:14.287

PostgreSQL:

select distinct on (1, 2)
    least(a, b), greatest(a, b), c
from data_frame
order by 1, 2, c desc
answered 2013-03-26T14:14:12.890