r - Sampling an equal number of each sex within groups in R

Question

First things first. Here are my data:

lat <- c(12, 12, 58, 58, 58, 58, 58, 45, 45, 45, 45, 45, 45, 64, 64, 64, 64, 64, 64, 64)
long <- c(-14, -14, 139, 139, 139, 139, 139, -68, -68, -68, -68, -68, 1, 1, 1, 1, 1, 1, 1, 1)
sex <- c("M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "F", "M", "F", "M", "F", "F", "F", "F", "M")
score <- c(2, 6, 3, 6, 5, 4, 3, 2, 3, 9, 9, 8, 6, 5, 6, 7, 5, 7, 5, 1)

data <- data.frame(lat, long, sex, score)

The data should look like this:

   lat long sex score
1   12  -14   M     2
2   12  -14   M     6
3   58  139   M     3
4   58  139   M     6
5   58  139   F     5
6   58  139   M     4
7   58  139   M     3
8   45  -68   F     2
9   45  -68   M     3
10  45  -68   M     9
11  45  -68   M     9
12  45  -68   F     8
13  45    1   M     6
14  64    1   F     5
15  64    1   M     6
16  64    1   F     7
17  64    1   F     5
18  64    1   F     7
19  64    1   F     5
20  64    1   M     1

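I'm at my wits' end trying to figure this out. The variables are latitude, longitude, sex and score. I want an equal number of males and females at each location (i.e. rows that share the same longitude and latitude). For example, the second location (rows 3 to 7) has only one female. That female should be kept, and one male from the remaining individuals should also be kept (perhaps by random sampling). Some locations have information on only one sex, e.g. the first location (rows 1 and 2) has only male data. The rows for that location should be dropped (because there are no females). If all goes to plan, the final dataset should look something like this: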

   lat2 long2 sex2 score2
1    58   139    F      5
2    58   139    M      4
3    45   -68    F      2
4    45   -68    M      3
5    45   -68    M      9
6    45   -68    F      8
7    64     1    M      6
8    64     1    F      5
9    64     1    F      7
10   64     1    M      1

Any help would be appreciated.

score 5 · Accepted Answer

Here is a solution with lapply:

data[unlist(lapply(with(data, split(seq.int(nrow(data)), paste(lat, long))),
        # 'split' splits the sequence of row numbers (indices) along the unique
        # combinations of 'lat' and 'long'
        # 'lapply' applies the following function to all sub-sequences
        function(x) {
          # row numbers (within 'data') of the males at this location:
          male <- x[data[x, "sex"] == "M"]
          # row numbers of the females at this location:
          female <- x[data[x, "sex"] == "F"]
          # number of rows to keep per sex:
          n <- min(length(male), length(female))
          # sample by position, because sample(v, n) draws from 1:v
          # when 'v' is a single number:
          s_male <- male[sample.int(length(male), n)]
          s_female <- female[sample.int(length(female), n)]
          # combine both sets of sampled row numbers:
          c(s_male, s_female)
        })), ]
# The function 'lapply' returns a list of indices which is transformed to a vector
# using 'unlist'. These indices are used to subset the original data frame.

Result:

   lat long sex score
9   45  -68   M     3
11  45  -68   M     9
12  45  -68   F     8
8   45  -68   F     2
7   58  139   M     3
5   58  139   F     5
20  64    1   M     1
15  64    1   M     6
19  64    1   F     5
16  64    1   F     7
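
Because sample() is random, the selected rows change from run to run. The following is a minimal sketch, not part of the answer above, of how one might make the draw repeatable and confirm that every location ends up balanced. The helper name balanced_rows and the seed value are assumptions chosen for illustration.

```r
# Data from the question
lat <- c(12, 12, 58, 58, 58, 58, 58, 45, 45, 45, 45, 45, 45, 64, 64, 64, 64, 64, 64, 64)
long <- c(-14, -14, 139, 139, 139, 139, 139, -68, -68, -68, -68, -68, 1, 1, 1, 1, 1, 1, 1, 1)
sex <- c("M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "F", "M", "F", "M", "F", "F", "F", "F", "M")
score <- c(2, 6, 3, 6, 5, 4, 3, 2, 3, 9, 9, 8, 6, 5, 6, 7, 5, 7, 5, 1)
data <- data.frame(lat, long, sex, score)

# Hypothetical helper: returns row numbers of 'df' holding an equal number
# of "M" and "F" rows for every lat/long combination.
balanced_rows <- function(df) {
  groups <- split(seq_len(nrow(df)), paste(df$lat, df$long))
  unlist(lapply(groups, function(idx) {
    male <- idx[df$sex[idx] == "M"]
    female <- idx[df$sex[idx] == "F"]
    n <- min(length(male), length(female))
    # Sample positions rather than values, so that a lone row number
    # is never expanded to 1:that_number by sample()
    c(male[sample.int(length(male), n)],
      female[sample.int(length(female), n)])
  }), use.names = FALSE)
}

set.seed(42)  # arbitrary seed, only there to make the draw repeatable
balanced <- data[balanced_rows(data), ]

# Cross-tabulate location against sex; the F and M columns should match
counts <- table(paste(balanced$lat, balanced$long), balanced$sex)
print(counts)
```

Locations with only one sex contribute no rows, so they do not appear in the cross-tabulation at all.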

score 2 · Accepted Answer

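Here is a quick approach that involves creating a temporary column holding the lat-long combination. We split the data frame on this column, count the M/F in each split, sample as appropriate, and then recombine.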

# First, we call the data frame something other than "data" ;)
mydf <- data.frame(lat, long, sex, score)

# create a new data frame with a temporary column, which concatenates the lat & long. 
mydf.new <- data.frame(mydf, latlong=paste(mydf$lat, mydf$long, sep=","))

# Split the data frame according to the lat-long location
mydf.splat <- split(mydf.new, mydf.new$latlong)

# eg, taking a look at one of our tables:
mydf.splat[[4]]

sampled <- 
  lapply(mydf.splat, function(tabl) {
    Ms <- sum(tabl$sex=="M")
    Fs <- sum(tabl$sex=="F")

    if(Fs == 0 || Ms ==0)    # If either is zero, we drop that location
      return(NULL)

    if(Fs == Ms)   # If they are both equal, no need to sample. 
      return(tabl)

    # If the number of Females is less than Males, return all Females
    #    and sample from Males in an amount equal to Females
    if (Fs < Ms)   
      return(tabl[c(which(tabl$sex=="F"), sample(which(tabl$sex=="M"), Fs)),   ])

    if (Ms < Fs)  # same as previous, but for Males < Females
      return(tabl[c(which(tabl$sex=="M"), sample(which(tabl$sex=="F"), Ms)),   ])

    stop("hmmm... something went wrong.")  ## We should never hit this line, but just in case. 
  })

# Flatten into a single table
mydf.new <- do.call(rbind, sampled)

# Clean up
row.names(mydf.new) <- NULL  # remove the row names that were added
mydf.new$latlong <- NULL     # remove the temporary column that we added

Results

mydf.new

#    lat long sex score
# 1   45  -68   F     2
# 2   45  -68   F     8
# 3   45  -68   M     9
# 4   45  -68   M     3
# 5   58  139   F     5
# 6   58  139   M     3
# 7   64    1   M     6
# 8   64    1   M     1
# 9   64    1   F     7
# 10  64    1   F     5
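
Before sampling, it can help to see how many rows each location is going to contribute. This is a minimal sketch assuming the lat, long and sex vectors from the question; the names counts and keep_per_sex are illustrative and do not appear in the answer above.

```r
# Vectors from the question
lat <- c(12, 12, 58, 58, 58, 58, 58, 45, 45, 45, 45, 45, 45, 64, 64, 64, 64, 64, 64, 64)
long <- c(-14, -14, 139, 139, 139, 139, 139, -68, -68, -68, -68, -68, 1, 1, 1, 1, 1, 1, 1, 1)
sex <- c("M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "F", "M", "F", "M", "F", "F", "F", "F", "M")

# Count females and males at every lat/long combination
counts <- table(location = paste(lat, long, sep = ","), sex = sex)
print(counts)

# The smaller of the two counts is the number of rows kept per sex;
# a location where it is 0 is dropped entirely
keep_per_sex <- pmin(counts[, "F"], counts[, "M"])
print(keep_per_sex)

# Total number of rows expected in the balanced data frame
expected_rows <- 2 * sum(keep_per_sex)
```

Comparing expected_rows with nrow(mydf.new) after the sampling step is a cheap sanity check.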

score 2 · Accepted Answer

This returns the values as list elements:

spl <- split(data, interaction(data$lat, data$long), drop = TRUE)
           # interaction creates all the two-way pairs from those two vectors
           # drop is needed to eliminate the data frames with no representation
res <- lapply(spl, function(x) {
   # First find the number of each sex to select. Counting each sex explicitly
   # gives 0 for a missing sex even when 'sex' is a character column.
   N <- min(sum(x$sex == "M"), sum(x$sex == "F"))
   # Then sample each sex separately.
   # One (or both) of these will be "sampling" all of that sex.
   rbind(x[x$sex == "M" & row.names(x) %in% sample(row.names(x[x$sex == "M", ]), N), ],
         x[x$sex == "F" & row.names(x) %in% sample(row.names(x[x$sex == "F", ]), N), ])
})
res
#------------
$`45.-68`  
   lat long sex score
9   45  -68   M     3
11  45  -68   M     9
8   45  -68   F     2
12  45  -68   F     8

$`12.-14` # So there were no women in this group and zero could be matched
[1] lat   long  sex   score
<0 rows> (or 0-length row.names)

$`45.1`
[1] lat   long  sex   score
<0 rows> (or 0-length row.names)

$`64.1`
   lat long sex score
15  64    1   M     6
20  64    1   M     1
16  64    1   F     7
17  64    1   F     5

$`58.139`
  lat long sex score
7  58  139   M     3
5  58  139   F     5

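... but if you want it as a data frame, you can use do.call(rbind, res) (the output below evidently comes from a separate run, so the sampled rows differ from the list above):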

> do.call(rbind, res)
          lat long sex score
45.-68.10  45  -68   M     9
45.-68.11  45  -68   M     9
45.-68.8   45  -68   F     2
45.-68.12  45  -68   F     8
64.1.15    64    1   M     6
64.1.20    64    1   M     1
64.1.17    64    1   F     5
64.1.18    64    1   F     7
58.139.6   58  139   M     4
58.139.5   58  139   F     5
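
If the empty list elements and the compound row names are unwanted, one option is to filter the list before binding. This is only a sketch under assumptions: it rebuilds a list shaped like res from the question's data, samples by position instead of by row name (a swapped-in technique, not the one used above), and the names non_empty and flat are illustrative.

```r
# Data from the question
lat <- c(12, 12, 58, 58, 58, 58, 58, 45, 45, 45, 45, 45, 45, 64, 64, 64, 64, 64, 64, 64)
long <- c(-14, -14, 139, 139, 139, 139, 139, -68, -68, -68, -68, -68, 1, 1, 1, 1, 1, 1, 1, 1)
sex <- c("M", "M", "M", "M", "F", "M", "M", "F", "M", "M", "M", "F", "M", "F", "M", "F", "F", "F", "F", "M")
score <- c(2, 6, 3, 6, 5, 4, 3, 2, 3, 9, 9, 8, 6, 5, 6, 7, 5, 7, 5, 1)
data <- data.frame(lat, long, sex, score)

# One data frame per location, each holding n males and n females
spl <- split(data, interaction(data$lat, data$long), drop = TRUE)
res <- lapply(spl, function(x) {
  m <- which(x$sex == "M")
  f <- which(x$sex == "F")
  n <- min(length(m), length(f))
  x[c(m[sample.int(length(m), n)], f[sample.int(length(f), n)]), ]
})

# Keep only the locations that still have rows
non_empty <- Filter(function(d) nrow(d) > 0, res)

# Bind into one data frame and replace the row names with 1..n
flat <- do.call(rbind, unname(non_empty))
row.names(flat) <- NULL
print(flat)
```

Passing an unnamed list to rbind avoids the location prefix in the row names; resetting them afterwards gives a plain 1..n index.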
