r - Discretizing scores relative to the mean

Question

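I have data with a date, a zip code, and a score. I want to discretize the data so that every row whose score is above the average for its month and zip code gets a 1, and all other rows get a 0.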

Example (the data frame is called score_df):

date       zip      score
2014-01-02 12345    10
2014-01-03 12345    20
2014-01-04 12345    2
2014-01-05 99885    15
2014-01-06 99885    12

Output:

date       zip      score    above_avg
2014-01-02 12345    10       0
2014-01-03 12345    20       1
2014-01-04 12345    2        0
2014-01-05 99885    15       1
2014-01-06 99885    12       0

So far I have been using inefficient solutions:

1. Loop over all months and apply the binary condition with an ifelse statement

score_df$above_avg <- rep(0, length(score_df$score))
for (month in (1:12)) {
  score_df$above_avg <- ifelse(
    as.numeric(substring(score_df$date, 6, 7)) == month,
    ifelse(score_df$score > quantile(score_df$score[as.numeric(substring(score_df$date, 6, 7)) == month], (0.5)), 1, 0),
    score_df$above_avg
  )
}

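2. I also tried using aggregate to build a table of means, then joining the mean column onto the original data frame, and then applying the binary condition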

avg_by_month_zip <- aggregate(score~month+zip,data=score_df,FUN=mean)
score_df$mean <- sqldf("select * from score_df join avg_by_month_zip on avg_by_month_zip.zip = score_df.zip and avg_by_month_zip.month = score_df.month")
score_df$discrete <- ifelse(score_df$score>score_df$mean,1,0)

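I would like to do this in a functional way. I know how to do it functionally for one condition (just the date or just the zip), but not for two. I could concatenate the two fields to create a single unique field, which would be a quick fix, but I wonder whether there is a way to do this simply and efficiently with an apply function or plyr.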

score 1 · Accepted Answer

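I did not assume that you had Date-class variables (in fact they are factors), but this goes along basically the same route as MrFlick's answer, which is worth checking: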

> inp$above_avg <- with(inp, ave(score, zip, format(as.Date(date), "%m"), FUN=function(s) as.numeric(s > mean(s)) ) )
> inp
        date   zip score above_avg
1 2014-01-02 12345    10         0
2 2014-01-03 12345    20         1
3 2014-01-04 12345     2         0
4 2014-01-05 99885    15         1
5 2014-01-06 99885    12         0

score 1 · Accepted Answer

Assuming your date values are properly encoded as dates, like this (for example)

score_df <- structure(list(date = structure(c(16072, 16073, 16074, 16075, 
16076), class = "Date"), zip = c(12345L, 12345L, 12345L, 99885L, 
99885L), score = c(10L, 20L, 2L, 15L, 12L)), .Names = c("date", 
"zip", "score"), row.names = c(NA, -5L), class = "data.frame")

then you can do

with(score_df, ave(score, strftime(date, "%m"), zip, 
    FUN=function(x) ifelse(x>mean(x), 1, 0)))
# [1] 0 1 0 1 0

We use ave() to compute the value for every month/zip combination (and we use strftime() to extract the month from the date).
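To see what ave() does in this context, it can help to compute the group means first and compare afterwards. The following is only a minimal sketch in base R: it rebuilds the example data with the date column as a Date, and the names month_key and group_mean are hypothetical. Using "%Y-%m" instead of "%m" for the month key is an assumption on my part, meant to keep the same month of different years in separate groups.

```r
# Rebuild the example data with the date column stored as Date
score_df <- data.frame(
  date  = as.Date(c("2014-01-02", "2014-01-03", "2014-01-04",
                    "2014-01-05", "2014-01-06")),
  zip   = c(12345L, 12345L, 12345L, 99885L, 99885L),
  score = c(10L, 20L, 2L, 15L, 12L)
)

# Month key (assumption: "%Y-%m" so that different years are not pooled)
month_key <- format(score_df$date, "%Y-%m")

# ave() returns a vector as long as its input, holding each row's group mean
group_mean <- ave(as.numeric(score_df$score), month_key, score_df$zip,
                  FUN = mean)

# Flag rows whose score is strictly above the mean of their month/zip group
score_df$above_avg <- as.integer(score_df$score > group_mean)
```

Keeping group_mean as a separate vector makes it easy to inspect the means before trusting the flags.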

score 1 · Accepted Answer

Try using data.table:

library(data.table)
ddt = data.table(score_df)
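# Group by zip and month so each score is compared with the mean of its own group
ddt[, above_avg := ifelse(score > mean(score), 1, 0), by = .(zip, format(as.Date(date), "%m"))]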
ddt
         date   zip score above_avg
1: 2014-01-02 12345    10         0
2: 2014-01-03 12345    20         1
3: 2014-01-04 12345     2         0
4: 2014-01-05 99885    15         1
5: 2014-01-06 99885    12         0
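The sample data happen to give the same flags whether the scores are compared with the overall mean or with the per-group mean, so they cannot show whether grouping is actually applied. A small check with different data can make the difference visible. This is only a sketch in base R; the data frame chk and its values are hypothetical.

```r
# Hypothetical data: two zips in the same month with very different score levels
chk <- data.frame(
  date  = as.Date(c("2014-01-02", "2014-01-03", "2014-01-04",
                    "2014-01-05", "2014-01-06", "2014-01-07")),
  zip   = c(11111L, 11111L, 11111L, 22222L, 22222L, 22222L),
  score = c(1, 2, 3, 100, 200, 300)
)

# Comparison against the overall mean (no grouping)
overall <- as.numeric(chk$score > mean(chk$score))

# Comparison against the mean of each zip/month group
grouped <- with(chk, ave(score, zip, format(date, "%m"),
                         FUN = function(s) as.numeric(s > mean(s))))

# The two vectors differ, so a result equal to `overall` signals missing grouping
print(rbind(overall, grouped))
```

Any solution, whichever package it uses, should reproduce grouped rather than overall on data like this.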
