r - How to downsample data while keeping the same frequency distribution for one column

Question

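I am trying to downsample a data set while preserving the frequency distribution of one of its columns. The approach is to 1) determine the baseline frequency distribution, 2) sample row indices using the baseline frequency distribution as sampling weights, 3) use the sampled row indices to select rows from the baseline data frame, and 4) compare the baseline and downsampled frequency distributions. Here are two examples. In both, the events with the highest probability are oversampled, while the remaining events are undersampled.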

Function to resample the data while keeping the frequencies of one column the same

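sampFreq<-function(df,col,ns) {
  x<-as.factor(df[,col])
  freq_x<-table(x)                       # count of each event
  prob_x<-freq_x/sum(freq_x)             # relative frequency of each event
  df_prob = prob_x[as.factor(df[,col])]  # per-row weight: probability of that row's event
  nr=nrow(df)
  # weighted sample of row indices, without replacement
  samp_rows = sample(1:nr,ns,replace=FALSE,prob=df_prob)
  return(df[samp_rows,])
}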

Example 1

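Steps: 1) specify the target frequency distribution, 2) convert it to probabilities, 3) generate data with the target frequency distribution, 4) downsample the data using the function above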

cfreq_1=c(1,2,3,4,5,4,3,2,1)
freq_1 = matrix(cfreq_1, nrow = 1, ncol = length(cfreq_1), byrow = TRUE,
               dimnames = list(c("row1" ),
                               c(as.character(4+(1:length(cfreq_1))))))
pr_1=freq_1/sum(freq_1)
set.seed(31)
ns=5000
df_1a<-data.frame(nbr = sample(4+(1:length(pr_1)),ns,
                               replace=TRUE,prob=pr_1),
                  ord=1:ns)
df_1b<-sampFreq(df_1a, "nbr", 1000)

5) Get the frequencies of the simulated and downsampled data, 6) sort the frequencies by the numeric value of the dimension names

tb_1a<-table(df_1a$nbr)
tb_1b<-table(df_1b$nbr)
s_tb_1a<-tb_1a[order(as.numeric(attr(tb_1a,"dimnames")[[1]]))]
s_tb_1b<-tb_1b[order(as.numeric(attr(tb_1b,"dimnames")[[1]]))]
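A numeric comparison can complement the plot. The following is a minimal sketch, not part of the original post: `compare_props` is a hypothetical helper that aligns the relative frequencies of two vectors on a shared, numerically sorted set of events and reports their ratio. The toy vectors `base` and `samp` are made up for illustration.

```r
# Hypothetical helper (not from the original post): put the relative
# frequencies of two vectors side by side on a common set of events.
compare_props <- function(base, samp) {
  # Sort the events numerically so that 10 comes after 5, 6, 7
  lev <- sort(unique(c(base, samp)))
  p_base <- as.numeric(table(factor(base, levels = lev))) / length(base)
  p_samp <- as.numeric(table(factor(samp, levels = lev))) / length(samp)
  # ratio > 1 means the event is oversampled, ratio < 1 undersampled
  data.frame(event = lev, base = p_base, sampled = p_samp,
             ratio = p_samp / p_base)
}

# Toy data: the sample keeps exactly the same proportions as the base
base <- c(5, 5, 6, 6, 6, 6, 7, 7, 10, 10)
samp <- c(5, 6, 6, 7, 10)
res <- compare_props(base, samp)
print(res)
```

With the data from this example, the call would be `compare_props(df_1a$nbr, df_1b$nbr)`; a `ratio` column that drifts away from 1 shows which events are over- or undersampled.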

7) Plot the prescribed probabilities together with the probabilities from the data and from the downsample

plot(as.numeric(attr(pr_1,"dimnames")[[2]]),pr_1,log="y",ylim=c(.01,.3),
     cex=1.5,pch=15,col="black",type="o", lty=2, 
     xlab='event',ylab='Probability',main="Example 1, Oversample high prob, undersample low")
points(as.numeric(attr(tb_1a,"dimnames")[[1]]),s_tb_1a/sum(s_tb_1a),
       cex=1.5,pch=16,col="blue",type="o", lty=2)
points(as.numeric(attr(tb_1b,"dimnames")[[1]]),s_tb_1b/sum(s_tb_1b),
       cex=1.5,pch=17,col="red",type="o", lty=1)
legend("topleft",c("prescribed", "data", "sampled"),pch=c(15,16,17),
       col=c("black","blue","red"),lty=c(2,2,1))
grid()

Note that the events with the highest probability are oversampled, while the other events are undersampled (red curve).


Example 2

txt = "0.028506949  0.059389476  0.285069486  0.282693907  0.242309063  2.974224967
 0.064140634  0.002375579  0.019004632  0.280318328  0.033258107  0.073642950
  0.007126737  0.007126737 39.045017223  2.261551253  0.052262739  0.045136002
  0.014253474  0.035633686  5.223898325  1.073761729  4.150136596  0.009502316
  5.038603160  1.021498990  4.017104169  0.002375579  0.073642950  1.197291840
  0.501247179  0.052262739  0.776814348  0.071267371  8.416676565  0.026131370
  0.019004632  0.002375579  0.168666112  0.023755790  5.718018767  0.501247179
  0.014253474  0.776814348  0.071267371  8.416676565  0.026131370  0.002375579
  0.002375579  0.168666112  0.023755790  5.718018767  0.194797482  0.028506949
  0.137783585  0.016629053  0.002375579  0.494120442  0.007126737  "

# Here is the target frequency distribution
cfreq_2=scan(text=txt,multi.line =TRUE)
freq_2 = matrix(cfreq_2, nrow = 1, ncol = length(cfreq_2), byrow = TRUE,
              dimnames = list(c("row1" ),
                              c(as.character(4+(1:length(cfreq_2))))))
# Convert to probability
pr_2=freq_2/sum(freq_2)

# Generate some data
ns=42095
df_2a<-data.frame(nbr = sample(4+(1:length(pr_2)),ns,
                               replace=TRUE,prob=pr_2),
                  ord=1:ns)
df_2b<-sampFreq(df_2a, "nbr", 10000)

tb_2a<-table(df_2a$nbr)
tb_2b<-table(df_2b$nbr)
s_tb_2a<-tb_2a[order(as.numeric(attr(tb_2a,"dimnames")[[1]]))]
s_tb_2b<-tb_2b[order(as.numeric(attr(tb_2b,"dimnames")[[1]]))]
plot(as.numeric(attr(pr_2,"dimnames")[[2]]),pr_2,log="y",ylim=c(.00001,.7),
     cex=1.5,pch=15,col="black",type="o", lty=2, 
     xlab='event',ylab='Probability',main="Example 2, Oversampled Point With High Prob, Undersampled Others")
points(as.numeric(attr(tb_2a,"dimnames")[[1]]),s_tb_2a/sum(s_tb_2a),
       cex=1.5,pch=16,col="blue",type="o", lty=2)
points(as.numeric(attr(tb_2b,"dimnames")[[1]]),s_tb_2b/sum(s_tb_2b),
       cex=1.5,pch=17,col="red",type="o", lty=1)
legend("topleft",c("prescribed", "data", "sampled"),pch=c(15,16,17),
       col=c("black","blue","red"),lty=c(2,2,1))
grid()

Here, only one event is oversampled, while the remaining events are undersampled.


The question is why the red line does not lie close to the other lines, and why there appears to be a systematic error.

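Matching the frequencies of the infrequent elements (small frequencies) is important; matching the frequencies of the frequently occurring elements (large frequencies) is less important.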
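One possible reading of the systematic error, offered as a sketch rather than a confirmed diagnosis: in the question's `sampFreq`, every row is weighted by the probability of its own event. An event with probability p already contributes about n*p rows, so the total weight of that event is proportional to p^2 rather than p. The snippet below uses the Example 1 frequencies. The names `target`, `implied`, `weighted`, and `uniform` are introduced here for illustration, and the p^2 form is exact only when sampling with replacement; with `replace=FALSE` it is an approximation.

```r
# Target distribution from Example 1
cfreq <- c(1, 2, 3, 4, 5, 4, 3, 2, 1)
target <- cfreq / sum(cfreq)

# Distribution implied by weighting each row by its event's probability
implied <- target^2 / sum(target^2)
print(round(rbind(target = target, implied = implied), 3))

# Small simulation: weighted row sampling versus plain uniform row sampling
set.seed(31)
x <- sample(seq_along(target), 5000, replace = TRUE, prob = target)
w <- as.numeric(table(x)[as.character(x)]) / length(x)  # per-row weight
weighted <- sample(x, 1000, replace = TRUE, prob = w)    # weights as in the question
uniform  <- sample(x, 1000)                              # no weights at all

# Share of the most probable event (event 5) under each scheme
print(c(target = target[5],
        weighted = mean(weighted == 5),
        uniform = mean(uniform == 5)))
```

Under this reading, drawing rows uniformly already preserves the column's distribution in expectation, because frequent events simply occupy more rows.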

score 1 · Accepted Answer

The function below gives the desired result.

sampFreq<-function(cdf,col,ns) {
  # Stratified downsampling: sample every level of `col` at the same rate
  nr=nrow(cdf)
  sLevels = levels(as.factor(cdf[,col]))
  nLevels = length(sLevels)
  rat = ns/nr                                # overall sampling fraction
  rdata = NULL
  for (is in seq_len(nLevels)) {
    ldata <- cdf[cdf[,col]==sLevels[is],]    # rows belonging to this level
    ndata <- nrow(ldata)
    nsdata = max(floor(ndata*rat),1)         # keep at least one row per level
    srows <- sample(seq_len(ndata),nsdata,replace=rat>1)
    sdata <- ldata[srows,]
    rdata <- rbind(rdata,sdata)
  }
  return(rdata)
}
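The same stratified idea can be written more compactly. The following is a sketch under stated assumptions, not the answerer's code: `samp_strat` is a hypothetical name, and it uses `split()` on row indices instead of growing a data frame with `rbind()` inside a loop. The toy data frame is made up for illustration.

```r
# Hypothetical variant of the stratified sampler (illustrative only)
samp_strat <- function(df, col, ns) {
  rat <- ns / nrow(df)                            # overall sampling fraction
  # Group row indices by the value of the chosen column
  idx_by_level <- split(seq_len(nrow(df)), df[[col]])
  picked <- lapply(idx_by_level, function(idx) {
    k <- max(floor(length(idx) * rat), 1)         # at least one row per level
    idx[sample.int(length(idx), k, replace = rat > 1)]
  })
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

# Toy data with three events in proportions 10% / 30% / 60%
set.seed(1)
df <- data.frame(nbr = rep(c(5, 6, 7), times = c(100, 300, 600)),
                 ord = 1:1000)
out <- samp_strat(df, "nbr", 100)
print(table(out$nbr) / nrow(out))
```

Because each per-level count is rounded down (with a minimum of one), the result can contain slightly fewer or more than `ns` rows when many levels are rare; this is a trade-off of sampling each level separately.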

