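Since you already know the formula, it should be easy enough to create a function to do the calculation for you.
Here, I've created a basic function to get you started. The function takes four arguments:
frequencies: A vector of frequencies ("number" in your first example).
intervals: A 2-row matrix with the same number of columns as the length of frequencies, where the first row is the lower class boundary and the second row is the upper class boundary. Alternatively, "intervals" may be a column in your data.frame, and you may specify sep (and possibly trim) to have the function automatically create the required matrix for you.
sep: The separator character in the "intervals" column of your data.frame.
trim: A regular expression matching the characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function: trim = "cut". This sets the regular expression pattern to remove (, ), [, and ] from the input.
Here's the function (with comments showing how I used your instructions to put it together):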
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  # If "sep" is specified, the function will try to create the
  # required "intervals" matrix. "trim" removes any unwanted
  # characters before attempting to convert the ranges to numeric.
  if (!is.null(sep)) {
    if (is.null(trim)) pattern <- ""
    else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
    else pattern <- trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }

  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf)/2, cf) + 1
  L <- intervals[1, Midrow]      # lower class boundary of median class
  h <- diff(intervals[, Midrow]) # size of median class
  f <- frequencies[Midrow]       # frequency of median class
  # cumulative frequency of the class before the median class
  # (0 when the median class is the first class)
  cf2 <- if (Midrow > 1) cf[Midrow - 1] else 0
  n_2 <- max(cf)/2               # total observations divided by 2

  unname(L + (n_2 - cf2)/f * h)
}
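To see what the sep and trim handling produces, the conversion step can be tried on its own. This is only a sketch: the variable names raw, cleaned, and bounds are made up for illustration, and the three labels are copied from the style of output that cut() gives.

```r
# Hypothetical interval labels in the style produced by cut()
raw <- c("(1.9,11.7]", "(11.7,21.5]", "(21.5,31.4]")

# Same pattern that trim = "cut" selects: strip (, ), [ and ]
pattern <- "\\[|\\]|\\(|\\)"
cleaned <- gsub(pattern, "", raw)

# Split each label on the separator and coerce to numeric.
# sapply() binds the results column-wise, giving a 2-row matrix:
# row 1 holds the lower boundaries, row 2 the upper boundaries.
bounds <- sapply(strsplit(cleaned, ","), as.numeric)

dim(bounds)   # 2 rows, one column per class
bounds[1, ]   # lower class boundaries
bounds[2, ]   # upper class boundaries
```

A matrix of this shape is what the function expects for intervals when sep is not given.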
Here's a sample data.frame to work with:
mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800",
"1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300",
"2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L,
850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"),
class = "data.frame", row.names = c(NA, -10L))
mydf
# salary number
# 1 1500-1600 110
# 2 1600-1700 180
# 3 1700-1800 320
# 4 1800-1900 460
# 5 1900-2000 850
# 6 2000-2100 250
# 7 2100-2200 130
# 8 2200-2300 70
# 9 2300-2400 20
# 10 2400-2500 10
Now, we can simply do:
GroupedMedian(mydf$number, mydf$salary, sep = "-")
# [1] 1915.294
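As a check on that result, the same calculation can be traced by hand, using the formula as it appears in the function: median = L + (n/2 - cf) / f * h. The snippet below is a standalone sketch that re-enters the salary data directly; the names freq, lower, upper, k, and med are chosen here for illustration only.

```r
# Salary example, entered directly so the snippet stands alone
freq  <- c(110, 180, 320, 460, 850, 250, 130, 70, 20, 10)
lower <- seq(1500, 2400, by = 100)  # lower class boundaries
upper <- lower + 100                # upper class boundaries

cf  <- cumsum(freq)                 # cumulative frequencies
n_2 <- sum(freq) / 2                # 2400 / 2 = 1200

# Median class: first class whose cumulative frequency exceeds n/2
k <- findInterval(n_2, cf) + 1      # 5, i.e. the 1900-2000 class

L   <- lower[k]                     # 1900
h   <- upper[k] - lower[k]          # 100
f   <- freq[k]                      # 850
cfb <- cf[k - 1]                    # 1070, cumulative frequency before class k

med <- L + (n_2 - cfb) / f * h      # 1900 + 130 / 850 * 100
round(med, 3)
# [1] 1915.294
```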
Here's an example of the function in action on some made-up data:
set.seed(1)
x <- sample(100, 100, replace = TRUE)
y <- data.frame(table(cut(x, 10)))
y
# Var1 Freq
# 1 (1.9,11.7] 8
# 2 (11.7,21.5] 8
# 3 (21.5,31.4] 8
# 4 (31.4,41.2] 15
# 5 (41.2,51] 13
# 6 (51,60.8] 5
# 7 (60.8,70.6] 11
# 8 (70.6,80.5] 15
# 9 (80.5,90.3] 11
# 10 (90.3,100] 6
### Here's GroupedMedian's output on the grouped data.frame...
GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
# [1] 49.49231
### ... and the output of median on the original vector
median(x)
# [1] 49.5
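By the way, I think there was a mistake in one of the ranges in the sample data you provided (all of the ranges were separated by dashes except one, which was separated by a comma). Since strsplit uses a regular expression to split on by default, you can use the function like this: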
x<-c(110,180,320,460,850,250,130,70,20,10)
colnames<-c("numbers")
rownames<-c("[1500-1600]","(1600-1700]","(1700-1800]","(1800-1900]",
"(1900-2000]"," (2000,2100]","(2100-2200]","(2200-2300]",
"(2300-2400]","(2400-2500]")
y<-matrix(x,nrow=length(x),dimnames=list(rownames,colnames))
GroupedMedian(y[, "numbers"], rownames(y), sep="-|,", trim="cut")
# [1] 1915.294