r - How do I calculate the median of a grouped dataset?

Question

My dataset is as follows:

salary  number
1500-1600   110
1600-1700   180
1700-1800   320
1800-1900   460
1900-2000   850
2000-2100   250
2100-2200   130
2200-2300   70
2300-2400   20
2400-2500   10

How can I calculate the median of this dataset? Here is what I have tried:

x <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
colnames <- "numbers"
rownames <- c("[1500-1600]", "(1600-1700]", "(1700-1800]", "(1800-1900]", 
              "(1900-2000]", "(2000,2100]", "(2100-2200]", "(2200-2300]",
              "(2300-2400]", "(2400-2500]")
y <- matrix(x, nrow=length(x), dimnames=list(rownames, colnames))
data.frame(y, "cumsum"=cumsum(y))

            numbers cumsum
[1500-1600]     110    110
(1600-1700]     180    290
(1700-1800]     320    610
(1800-1900]     460   1070
(1900-2000]     850   1920
(2000,2100]     250   2170
(2100-2200]     130   2300
(2200-2300]      70   2370
(2300-2400]      20   2390
(2400-2500]      10   2400

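Here you can see that the halfway frequency is 2400/2 = 1200. It lies between 1070 and 1920, so the median class is the (1900-2000] group. You can use the following formula to get the result:

Median = L + (h / f) * (n/2 - c)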

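where:

L is the lower class boundary of the median class
h is the size of the median class, i.e. the difference between its upper and lower class boundaries
f is the frequency of the median class
c is the cumulative frequency of the classes before the median class
n/2 is the total number of observations divided by 2 (i.e. the sum of f, divided by 2)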
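
As a quick check of how these symbols map onto the cumulative table above, the values can be plugged in directly. This is only an illustrative sketch: the variable names are made up for the example, and `c_prev` stands in for c so that R's built-in `c()` is not masked.

```r
L      <- 1900   # lower boundary of the median class (1900-2000]
h      <- 100    # class size: 2000 - 1900
f      <- 850    # frequency of the median class
c_prev <- 1070   # cumulative frequency of the classes before it
n      <- 2400   # total number of observations

grouped_median <- L + (h / f) * (n / 2 - c_prev)
grouped_median   # about 1915.294
```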

Alternatively, the median class is determined as follows:

Locate n/2 in the cumulative frequency column.

Take the class it falls in.
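
These two steps can be sketched in base R roughly as follows. The names `freq`, `cf`, `half` and `median_class` are chosen for the example only, and the lookup of the previous cumulative frequency assumes the median class is not the very first class.

```r
freq <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)

cf   <- cumsum(freq)     # cumulative frequency column
half <- sum(freq) / 2    # n/2

# First class whose cumulative frequency reaches n/2
median_class <- which(cf >= half)[1]
median_class             # 5, i.e. the (1900-2000] class

# Cumulative frequency just before the median class
cf[median_class - 1]     # 1070
```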

In code:

> 1900 + (1200 - 1070) / (1920 - 1070) * (2000 - 1900)    
[1] 1915.294

What I would like to do now is make the expression above more elegant, i.e. 1900+(1200-1070)/(1920-1070)*(2000-1900). How can I do that?

score 7 · Accepted Answer

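Since you already know the formula, it should be easy enough to create a function to do the calculation for you.

Here, I've created a basic function to get you started. The function takes four arguments:

frequencies: A vector of frequencies ("number" in your first example).
intervals: A 2-row matrix with the same number of columns as the length of frequencies, with the first row being the lower class boundary and the second row being the upper class boundary. Alternatively, "intervals" may be a column in your data.frame, and you may specify sep (and possibly trim) to have the function automatically create the required matrix for you.
sep: The separator character in your "intervals" column in your data.frame.
trim: A regular expression of characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function: trim = "cut". This sets the regular expression pattern to remove (, ), [, and ] from the input.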
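
To make the expected shape of "intervals" concrete, here is a small sketch of what the sep/trim preprocessing produces. The three labels are made up for illustration; the pattern is the one described for trim = "cut".

```r
salary <- c("[1500,1600]", "(1600,1700]", "(1700,1800]")

# Strip the brackets, split on the separator, convert to numeric
cleaned   <- gsub("\\[|\\]|\\(|\\)", "", salary)
intervals <- sapply(strsplit(cleaned, ","), as.numeric)

intervals        # row 1: lower boundaries, row 2: upper boundaries
dim(intervals)   # 2 rows, one column per class
```

Each column of the resulting matrix is one class, which is why the function indexes it as intervals[1, Midrow] for the lower boundary.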

Here's the function (with comments showing how I used your instructions to put it together):

GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  # If "sep" is specified, the function will try to create the 
  #   required "intervals" matrix. "trim" removes any unwanted 
  #   characters before attempting to convert the ranges to numeric.
  if (!is.null(sep)) {
    if (is.null(trim)) pattern <- ""
    else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
    else pattern <- trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }

  Midpoints <- rowMeans(intervals)
  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf)/2, cf) + 1
  L <- intervals[1, Midrow]      # lower class boundary of median class
  h <- diff(intervals[, Midrow]) # size of median class
  f <- frequencies[Midrow]       # frequency of median class
  cf2 <- if (Midrow > 1) cf[Midrow - 1] else 0  # cumulative frequency before median class (0 if it is the first class)
  n_2 <- max(cf)/2               # total observations divided by 2

  unname(L + (n_2 - cf2)/f * h)
}

Here's a sample data.frame to work with:

mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800", 
    "1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300", 
    "2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L, 
    850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"), 
    class = "data.frame", row.names = c(NA, -10L))
mydf
#       salary number
# 1  1500-1600    110
# 2  1600-1700    180
# 3  1700-1800    320
# 4  1800-1900    460
# 5  1900-2000    850
# 6  2000-2100    250
# 7  2100-2200    130
# 8  2200-2300     70
# 9  2300-2400     20
# 10 2400-2500     10

Now, we can simply do:

GroupedMedian(mydf$number, mydf$salary, sep = "-")
# [1] 1915.294

Here's an example of the function in action on some made-up data:

set.seed(1)
x <- sample(100, 100, replace = TRUE)
y <- data.frame(table(cut(x, 10)))
y
#           Var1 Freq
# 1   (1.9,11.7]    8
# 2  (11.7,21.5]    8
# 3  (21.5,31.4]    8
# 4  (31.4,41.2]   15
# 5    (41.2,51]   13
# 6    (51,60.8]    5
# 7  (60.8,70.6]   11
# 8  (70.6,80.5]   15
# 9  (80.5,90.3]   11
# 10  (90.3,100]    6

### Here's GroupedMedian's output on the grouped data.frame...
GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
# [1] 49.49231

### ... and the output of median on the original vector
median(x)
# [1] 49.5

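By the way, for the sample data you provided, I think there is a mistake in one of your ranges (all of them are separated by a dash except one, which is separated by a comma). Since strsplit uses a regular expression by default to split on, you can still use the function like this: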

x<-c(110,180,320,460,850,250,130,70,20,10)
colnames<-c("numbers")
rownames<-c("[1500-1600]","(1600-1700]","(1700-1800]","(1800-1900]",
            "(1900-2000]"," (2000,2100]","(2100-2200]","(2200-2300]",
            "(2300-2400]","(2400-2500]")
y<-matrix(x,nrow=length(x),dimnames=list(rownames,colnames))
GroupedMedian(y[, "numbers"], rownames(y), sep="-|,", trim="cut")
# [1] 1915.294
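
The reason sep = "-|," works is that strsplit treats its split argument as a regular expression unless told otherwise, so the alternation matches either separator. A minimal illustration, with labels made up for the example:

```r
labels <- c("1900-2000", "2000,2100")

# "-|," is a regular expression: split on a dash OR a comma
parts <- strsplit(labels, "-|,")
parts

# With fixed = TRUE the separator is taken literally instead
literal <- strsplit("1900-2000", "-", fixed = TRUE)
literal
```

If a separator happens to be a regex metacharacter (for example "." or "|"), it would need escaping or fixed = TRUE.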

score 4 · Accepted Answer

I've written it out like this to explain clearly how it is worked out. A more compact version is appended.

library(data.table)

#constructing the dataset with the salary range split into low and high
salarydata <- data.table(
  salaries_low = 100*c(15:24),
  salaries_high = 100*c(16:25),
  numbers = c(110,180,320,460,850,250,130,70,20,10)
)

#calculating cumulative number of observations
salarydata <- salarydata[,cumnumbers := cumsum(numbers)]
salarydata
   # salaries_low salaries_high numbers cumnumbers
   # 1:         1500          1600     110        110
   # 2:         1600          1700     180        290
   # 3:         1700          1800     320        610
   # 4:         1800          1900     460       1070
   # 5:         1900          2000     850       1920
   # 6:         2000          2100     250       2170
   # 7:         2100          2200     130       2300
   # 8:         2200          2300      70       2370
   # 9:         2300          2400      20       2390
   # 10:         2400          2500      10       2400

#identifying median group
mediangroup <- salarydata[
  (cumnumbers - numbers) <= (max(cumnumbers)/2) & 
  cumnumbers >= (max(cumnumbers)/2)]
mediangroup
   # salaries_low salaries_high numbers cumnumbers
   # 1:         1900          2000     850       1920

#creating the variables needed to calculate median
mediangroup[,l := salaries_low]
mediangroup[,h := salaries_high - salaries_low]
mediangroup[,f := numbers]
mediangroup[,c := cumnumbers- numbers]
n = salarydata[,sum(numbers)]

#calculating median
median <- mediangroup[,l + ((h/f)*((n/2)-c))]
median
   # [1] 1915.294

Compact version -

EDIT: Changed to a function, as suggested by @AnandaMahto. Also, using more general variable names.

library(data.table)

#Creating function

CalculateMedian <- function(
   LowerBound,
   UpperBound,
   Obs
)
{
   #calculating cumulative number of observations and n
   dataset <- data.table(UpperBound, LowerBound, Obs)

   dataset <- dataset[,cumObs := cumsum(Obs)]
   n = dataset[,max(cumObs)]

   #identifying mediangroup and dynamically calculating l,h,f,c. We already have n.
   median <- dataset[
      (cumObs - Obs) <= (max(cumObs)/2) & 
      cumObs >= (max(cumObs)/2),

      LowerBound + ((UpperBound - LowerBound)/Obs) * ((n/2) - (cumObs- Obs))
   ]

   return(median)
}


# Using function
CalculateMedian(
  LowerBound = 100*c(15:24),
  UpperBound = 100*c(16:25),
  Obs = c(110,180,320,460,850,250,130,70,20,10)
)
# [1] 1915.294

score 3 · Accepted Answer

(Sal <- sapply( strsplit(as.character(dat[[1]]), "-"), 
                                 function(x) mean( as.numeric(x) ) ) )
 [1] 1550 1650 1750 1850 1950 2050 2150 2250 2350 2450
require(Hmisc)
wtd.mean(Sal, weights = dat[[2]])
[1] 1898.75
wtd.quantile(Sal, weights=dat[[2]], probs=0.5)

Further generalization to a weighted median might require looking for a package that provides one.
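
For comparison without extra packages, the same midpoint idea can be sketched in base R. This is a deliberately simple weighted median (the smallest midpoint whose cumulative weight reaches half the total); it is not a reimplementation of Hmisc's wtd.quantile, which may interpolate differently. The vectors `mid` and `freq` are typed in by hand from the question's table.

```r
mid  <- seq(1550, 2450, by = 100)   # class midpoints
freq <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)

# Weighted mean of the midpoints (base R)
w_mean <- weighted.mean(mid, freq)
w_mean      # 1898.75

# Simple weighted median: first midpoint whose cumulative weight reaches n/2
ord      <- order(mid)
cw       <- cumsum(freq[ord])
w_median <- mid[ord][which(cw >= sum(freq) / 2)[1]]
w_median    # 1950
```

Note that a weighted median over midpoints can only ever return one of the midpoints, whereas the interpolation formula in the question locates a value inside the median class.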

score 0 · Accepted Answer


Have you tried median, or apply(yourobject, 2, median) if it is a matrix or data.frame?

Answered 2013-09-19T07:16:06.680

score 0 · Accepted Answer

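How about this way? Create a vector for each salary band, assuming an even distribution within each band. Then make one big vector out of those vectors and take its median. It is similar to your approach, but the result is slightly different. I'm not a mathematician, so the method may not be correct.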

dat <- matrix(c(seq(1500, 2400, 100), seq(1600, 2500, 100), c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)), ncol=3)
median(unlist(apply(dat, 1, function(x) { ((1:x[3])/x[3])*(x[2]-x[1])+x[1] })))

Returns 1915.353
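
The small difference from 1915.294 appears to come from how median handles an even number of points rather than from the even-spread assumption itself. The sketch below rebuilds the same expanded vector (the names `lower`, `upper`, `freq` and `pts` are chosen for the example) and looks at the two middle values.

```r
lower <- seq(1500, 2400, by = 100)
upper <- lower + 100
freq  <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)

# Spread each class's observations evenly over its interval
pts <- unlist(Map(function(lo, hi, f) lo + (seq_len(f) / f) * (hi - lo),
                  lower, upper, freq))
pts <- sort(pts)

n      <- length(pts)                 # 2400, an even count
middle <- pts[c(n / 2, n / 2 + 1)]    # the 1200th and 1201st points
middle
mean(middle)                          # same value as median(pts)
```

The 1200th point coincides with the interpolation formula's result; median averages it with the 1201st point, which sits one step of 100/850 higher, so the result is shifted up by about 0.059.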

score -3 · Accepted Answer

I think this concept should work for you.

$salaries = array(
       array("1500","1600"),
       array("1600","1700"),
       array("1700","1800"),
       array("1800","1900"),
       array("1900","2000"),
       array("2000","2100"),
       array("2100","2200"),
       array("2200","2300"),
       array("2300","2400"),
       array("2400","2500"),
      );
 $numbers = array("110","180","320","460","850","250","130","70","20","10");
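 $cumsum = array();
 $n = 0;
 $count = 0;
 // build the cumulative frequencies and the total n
 foreach ($numbers as $key => $number) {
     $cumsum[$key] = $number;
     $n += $number;
     if ($count > 0) {
         $cumsum[$key] += $cumsum[$key - 1];
     }
     ++$count;
 }

 // the median class is the first class whose cumulative frequency reaches n/2
 $classIndex = 0;
 foreach ($cumsum as $key => $cum) {
     if ($cum < ($n / 2)) {
         $classIndex = $key + 1;
     }
 }
 $classRange = $salaries[$classIndex];
 $L = $classRange[0];
 $h = (float) $classRange[1] - $classRange[0];
 $f = $numbers[$classIndex];
 // c is the cumulative frequency before the median class,
 // not the frequency of the previous class alone
 $c = $classIndex > 0 ? $cumsum[$classIndex - 1] : 0;

 $Median = $L + ($h / $f) * (($n / 2) - $c);
 echo $Median;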
