2

I have been struggling to work out how to select only the duplicated rows of a data.frame in R. For example, my data.frame is:

age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
Names=c("John","John","John", "Harry", "Paul", "Paul", "Paul", "Khan", "Khan", "Khan", "Sam", "Joe")
village <- data.frame(Names, age, height)

 Names age height
 John  18   76.1
 John  19   77.0
 John  20   78.1
 Harry  21   78.2
 Paul  22   78.8
 Paul  23   79.7
 Paul  24   79.9
 Khan  25   81.1
 Khan  26   81.2
 Khan  27   81.8
 Sam  28   82.8
 Joe  29   83.5

I would like to see the following result:

Names age height
John  18   76.1
John  19   77.0
John  20   78.1
Paul  22   78.8
Paul  23   79.7
Paul  24   79.9
Khan  25   81.1
Khan  26   81.2
Khan  27   81.8

Thanks for your time...

4

5 Answers

6

A solution using duplicated() twice:

village[duplicated(village$Names) | duplicated(village$Names, fromLast = TRUE), ]


   Names age height
1   John  18   76.1
2   John  19   77.0
3   John  20   78.1
5   Paul  22   78.8
6   Paul  23   79.7
7   Paul  24   79.9
8   Khan  25   81.1
9   Khan  26   81.2
10  Khan  27   81.8
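
To see why two calls are needed, it may help to look at the two logical vectors separately. The sketch below rebuilds the `village` data from the question; the names `from_start`, `from_end` and `dups` are illustrative only and do not appear in the original answer.

```r
age <- 18:29
height <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
Names <- c("John", "John", "John", "Harry", "Paul", "Paul", "Paul",
           "Khan", "Khan", "Khan", "Sam", "Joe")
village <- data.frame(Names, age, height)

# duplicated() flags every occurrence of a value except the first one ...
from_start <- duplicated(village$Names)
# ... and with fromLast = TRUE it flags every occurrence except the last one.
from_end <- duplicated(village$Names, fromLast = TRUE)

# A name that occurs more than once is flagged by at least one of the two
# vectors, so OR-ing them keeps every row of every repeated name.
dups <- village[from_start | from_end, ]
rownames(dups)
```

Names that occur once (Harry, Sam, Joe) are FALSE in both vectors, which is why they drop out.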

Another solution, using by():

village[unlist(by(seq(nrow(village)), village$Names, 
                  function(x) if(length(x)-1) x)), ]
Answered 2013-01-11T08:42:43.593
3
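# Caveat: this keeps only rows that repeat an earlier row in every column;
# with the example data (every age differs) it selects no rows.
village[duplicated(village), ]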
Answered 2013-01-11T08:38:49.147
1

I find @Sven's answer using duplicated() the "tidiest", but there are many other ways to do this. Here are two more:

  1. Use table() and subset by matching the names whose tabulated count is greater than 1 against the names in the first column:

    village[village$Names %in% names(which(table(village$Names) > 1)), ]
    
  2. Use ave() to "tabulate" in a slightly different way, but subset in the same way:

    village[with(village, ave(seq_along(Names), Names, FUN = length) > 1), ]
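
The two approaches in the list above can be compared by looking at their intermediate values. This is only a sketch: the variable names (`counts`, `repeated`, `group_size`, `by_table`, `by_ave`) are made up for illustration, and the data frame is cut down to the columns that matter here.

```r
Names <- c("John", "John", "John", "Harry", "Paul", "Paul", "Paul",
           "Khan", "Khan", "Khan", "Sam", "Joe")
village <- data.frame(Names, age = 18:29)

# Approach 1: count each name once, then keep the names seen more than once.
counts <- table(village$Names)
repeated <- names(counts[counts > 1])
by_table <- village$Names %in% repeated

# Approach 2: ave() returns the size of each row's group, aligned with the rows.
group_size <- ave(seq_along(village$Names), village$Names, FUN = length)
by_ave <- group_size > 1

# Both logical vectors select the same rows.
identical(by_table, by_ave)
```

The practical difference is that table() produces one count per distinct name, which then has to be matched back to the rows, whereas ave() already yields one value per row.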
    
Answered 2013-01-11T09:48:36.833
1

Alternatively, you can use grouping and per-group counts in a dplyr pipeline.

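It takes more lines of code and may be computationally more expensive. The advantage, however, is that you can find duplicated rows by a composite key made of several columns, rather than by a single column only.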

library(tidyverse)


a <- c(8, 18, 19, 19, 19, 20, 30, 32, 32)
b <- c(1950, 1965, 1981, 1971, 1981, 1999, 1969, 1994, 1999)
c <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
df <- data.frame(a, b, c)    
df

# Description: df [9 × 3]
#       a     b     c
#   <dbl> <dbl> <dbl>
#       8  1950     1
#      18  1965     2
#      19  1981     3
#      19  1971     4
#      19  1981     5
#      20  1999     6
#      30  1969     7
#      32  1994     8
#      32  1999     9
# 9 rows

df[duplicated(df$a) | duplicated(df$a, fromLast = TRUE), ]

# Description: df [5 × 3]
#         a     b     c
#     <dbl> <dbl> <dbl>
# 3      19  1981     3
# 4      19  1971     4
# 5      19  1981     5
# 8      32  1994     8
# 9      32  1999     9
# 5 rows

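# Note: the second positional argument of duplicated() is `incomparables`,
# not a second key column, so no (a, b) pairs are compared; here this call
# flags the same rows as the single-column call.
df[duplicated(df$a, df$b) | duplicated(df$a, df$b, fromLast = TRUE), ]

# Description: df [5 × 3]
#         a     b     c
#     <dbl> <dbl> <dbl>
# 3      19  1981     3
# 4      19  1971     4
# 5      19  1981     5
# 8      32  1994     8
# 9      32  1999     9
# 5 rows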

df %>%
  group_by(a, b) %>%
  mutate(n = n()) %>%   # size of each (a, b) group, repeated on every row
  filter(n > 1) %>%     # keep only groups that occur more than once
  select(a, b, c)

# A tibble: 2 x 3
# Groups:   a, b [1]
#       a     b     c
#   <dbl> <dbl> <dbl>
#      19  1981     3
#      19  1981     5
# 2 rows


df[duplicated(df, incomparables = c(c)), ]
# Error: argument 'incomparables != FALSE' is not used (yet)
# This error occurs even with no libraries loaded.

I may be missing how the arguments inside the parentheses of duplicated() are meant to be used, but I could not figure it out.
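
One way around this in base R, offered as a sketch rather than as what the answer's author intended: duplicated() has a data.frame method that compares whole rows, so passing a data frame holding only the key columns tests the composite key. The names `key` and `dup_rows` are hypothetical.

```r
df <- data.frame(
  a = c(8, 18, 19, 19, 19, 20, 30, 32, 32),
  b = c(1950, 1965, 1981, 1971, 1981, 1999, 1969, 1994, 1999),
  c = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)

# Restrict the comparison to the key columns; each row of `key` is one (a, b) pair.
key <- df[c("a", "b")]

# Same two-pass trick as for a single column, now applied to row pairs.
dup_rows <- df[duplicated(key) | duplicated(key, fromLast = TRUE), ]
dup_rows
```

Because this is ordinary data.frame subsetting, the original row names (3 and 5) are kept.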

Also, dplyr returns a tibble, which drops the row indices; that may be a drawback for you.
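
If the original row positions matter, one possible workaround is to store them in an ordinary column before grouping. This is a minimal sketch that assumes the dplyr package is installed; the column name `row_id` is made up for illustration.

```r
library(dplyr)

df <- data.frame(
  a = c(8, 18, 19, 19, 19, 20, 30, 32, 32),
  b = c(1950, 1965, 1981, 1971, 1981, 1999, 1969, 1994, 1999),
  c = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
)

dups <- df %>%
  mutate(row_id = row_number()) %>%  # remember each row's original position
  group_by(a, b) %>%
  filter(n() > 1) %>%                # keep (a, b) groups with more than one row
  ungroup()

dups$row_id
```

The positions then travel with the data as a regular column and survive any later dplyr step.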

Answered 2021-12-09T18:23:09.473
0

I came up with a solution using nested sapply:

> village_dups = 
village[unique(unlist(which(sapply(sapply(village$Names,function(x) 
which(village$Names==x)),function(y) length(y)) > 1))),]
> village_dups
   Names age height
1   John  18   76.1
2   John  19   77.0
3   John  20   78.1
5   Paul  22   78.8
6   Paul  23   79.7
7   Paul  24   79.9
8   Khan  25   81.1
9   Khan  26   81.2
10  Khan  27   81.8
Answered 2018-02-28T19:12:47.353