Say I have a dataset that looks like this, which I am working with in R:

player      at_bat  opponent_name     game  result
Torri_Hunter    1   Pittsburgh Pirates  1   home run
Torri_Hunter    2   Pittsburgh Pirates  1   triple
Torri_Hunter    3   Pittsburgh Pirates  1   strikeout
Torri_Hunter    4   Pittsburgh Pirates  1   strikeout
Torri_Hunter    1   Pittsburgh Pirates  2   groundout
Torri_Hunter    2   Pittsburgh Pirates  2   home run
Torri_Hunter    3   Pittsburgh Pirates  2   flyout
Torri_Hunter    1   Pittsburgh Pirates  2   home run
Torri_Hunter    2   Pittsburgh Pirates  3   triple
Torri_Hunter    3   Pittsburgh Pirates  3   strikeout
Torri_Hunter    4   Pittsburgh Pirates  3   strikeout
Torri_Hunter    1   Detroit Tigers      1   home run
Torri_Hunter    2   Detroit Tigers      1   home run
Torri_Hunter    3   Detroit Tigers      1   home run
Torri_Hunter    4   Detroit Tigers      1   strikeout

(I realize Torii's name is misspelled; bear with me.)

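What I ultimately want is the percentage of at-bats that were home runs in each game of a series, ending up with something like this: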

                opponent_name       game_1s game_2s game_3s
Torri Hunter    Pittsburgh Pirates  25%     50%     0%
Torri Hunter    Detroit Tigers      75%     --      --

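I can filter the results with dplyr::filter, compute the per-game statistics by ID, and then export to .csv, where I get the averages in Excel (that is what I have been doing), but there has to be a faster way to do this entirely in R. Any ideas?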
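To make the example reproducible, the table above can be rebuilt as a data frame. This is only a sketch: the name `df` is an assumption (it matches what two of the answers use, while the data.table answer calls it `dat`), and the rows are copied exactly as posted, including the `at_bat` numbering.

```r
# Rebuild the posted table; `df` is an assumed name, not something fixed by the question.
df <- data.frame(
  player = "Torri_Hunter",
  at_bat = c(1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4),
  opponent_name = rep(c("Pittsburgh Pirates", "Detroit Tigers"), times = c(11, 4)),
  game = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1),
  result = c("home run", "triple", "strikeout", "strikeout",
             "groundout", "home run", "flyout", "home run",
             "triple", "strikeout", "strikeout",
             "home run", "home run", "home run", "strikeout"),
  stringsAsFactors = FALSE
)

# Quick sanity check: number of at-bats per opponent and game
table(df$opponent_name, df$game)
```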

3 Answers

4

You could do this:

library(dplyr)
df %>% 
  group_by(player, opponent_name, game) %>% 
  summarise(p = sum(result == "home run") / n()) 

This gives:

#Source: local data frame [4 x 4]
#Groups: player, opponent_name
#
#        player      opponent_name game    p
#1 Torri_Hunter     Detroit Tigers    1 0.75
#2 Torri_Hunter Pittsburgh Pirates    1 0.25
#3 Torri_Hunter Pittsburgh Pirates    2 0.50
#4 Torri_Hunter Pittsburgh Pirates    3 0.00

To match your desired output, you could also do the following:

df %>% 
  group_by(player, opponent_name, game) %>% 
  summarise(p = mean(result == "home run")) %>%
  tidyr::spread(game, p) %>%
  arrange(desc(opponent_name)) %>%
  setNames(c(names(.)[1:2], paste0("game_", names(.)[3:5], "s"))) %>%
  mutate_each(funs(ifelse(is.na(.), "--", paste0(. * 100, "%"))), -(player:opponent_name))

This gives:

#Source: local data frame [2 x 5]
#
#        player      opponent_name game_1s game_2s game_3s
#1 Torri_Hunter Pittsburgh Pirates     25%     50%      0%
#2 Torri_Hunter     Detroit Tigers     75%      --      --
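In more recent releases of dplyr and tidyr, `spread()`, `mutate_each()` and `funs()` have been superseded or deprecated. The following is a sketch of the same pipeline using `pivot_wider()` and `across()`, assuming roughly dplyr 1.0 and tidyr 1.1 or later. The data frame is rebuilt from the question so the block runs on its own; the name `out` is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Data copied from the question; `df` is an assumed name
df <- data.frame(
  player = "Torri_Hunter",
  opponent_name = rep(c("Pittsburgh Pirates", "Detroit Tigers"), times = c(11, 4)),
  game = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1),
  result = c("home run", "triple", "strikeout", "strikeout",
             "groundout", "home run", "flyout", "home run",
             "triple", "strikeout", "strikeout",
             "home run", "home run", "home run", "strikeout"),
  stringsAsFactors = FALSE
)

out <- df %>%
  group_by(player, opponent_name, game) %>%
  # Share of at-bats in each game that were home runs
  summarise(p = mean(result == "home run"), .groups = "drop") %>%
  # One column per game, named game_1s, game_2s, ...
  pivot_wider(names_from = game, values_from = p, names_glue = "game_{game}s") %>%
  arrange(desc(opponent_name)) %>%
  # Format as percentages, with "--" for games that were not played
  mutate(across(starts_with("game_"),
                ~ ifelse(is.na(.x), "--", paste0(.x * 100, "%"))))

out
```

Unlike the `setNames()` step above, `names_glue` does not hard-code the number of games, so a longer series should not need any changes.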
answered 2015-07-19T16:16:17.713
2

A data.table solution with casting would be:

require(data.table)
setDT(dat)
percentage <- dat[,mean(result == "home run"), by = c("player", "opponent_name", "game")]

Result:

> percentage

         player      opponent_name game   V1
1: Torri_Hunter Pittsburgh Pirates    1 0.25
2: Torri_Hunter Pittsburgh Pirates    2 0.50
3: Torri_Hunter Pittsburgh Pirates    3 0.00
4: Torri_Hunter     Detroit Tigers    1 0.75

To cast it into the output shape requested in the question:

require(reshape2)
dcast(percentage, player + opponent_name ~ game , value.var = "V1")

Result:

        player      opponent_name    1   2  3
1 Torri_Hunter     Detroit Tigers 0.75  NA NA
2 Torri_Hunter Pittsburgh Pirates 0.25 0.5  0
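data.table also ships its own `dcast` method, so the reshape2 dependency is optional. Below is a sketch of the same two steps using data.table alone, with a named summary column instead of the default `V1` and the game columns renamed toward the layout in the question. The names `dat`, `percentage` and `p` follow the code above; `wide` and `game_cols` are made up for illustration.

```r
library(data.table)

# Data copied from the question
dat <- data.table(
  player = "Torri_Hunter",
  opponent_name = rep(c("Pittsburgh Pirates", "Detroit Tigers"), times = c(11, 4)),
  game = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1),
  result = c("home run", "triple", "strikeout", "strikeout",
             "groundout", "home run", "flyout", "home run",
             "triple", "strikeout", "strikeout",
             "home run", "home run", "home run", "strikeout")
)

# Home-run share per player, opponent, and game, stored in a column named p
percentage <- dat[, .(p = mean(result == "home run")),
                  by = .(player, opponent_name, game)]

# Long to wide: one column per game, NA where a game was not played
wide <- dcast(percentage, player + opponent_name ~ game, value.var = "p")

# Rename the game columns ("1", "2", ...) to game_1s, game_2s, ...
game_cols <- setdiff(names(wide), c("player", "opponent_name"))
setnames(wide, game_cols, paste0("game_", game_cols, "s"))

wide
```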
answered 2015-07-19T16:28:27.710
0

How about writing a helper function or two? Assume your data frame is called df.

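# Share of one player's at-bats in a given game against a given opponent that ended in `result`
perc_res <- function(opponent, game = 1, player = "Torri_Hunter", result = "home run") {
  # Rows belonging to this player, opponent, and game
  in_game <- df$player == player & df$opponent_name == opponent & df$game == game
  # Matching results divided by all at-bats in that game (NaN if there were none)
  sum(in_game & df$result == result) / sum(in_game)
}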

Then you can build an output data frame along these lines:

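out.df <- data.frame(Opponent = levels(factor(df$opponent_name)), Player = "Torri_Hunter")
# sapply returns a plain numeric vector; lapply would create a list column
out.df$game1s <- sapply(as.character(out.df$Opponent), perc_res, game = 1, USE.NAMES = FALSE)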

and so on for the other games. If you later want to handle more players, you can use mapply.

PS: I haven't actually run the code, so there may still be some errors in it. But I think this should at least get you started!
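The mapply idea mentioned above could look roughly like the following. It is a sketch under a few assumptions: the data frame is rebuilt from the question, `perc_res` is restated without default arguments so the block runs on its own, and `combos` is a made-up name for the table of player, opponent and game combinations.

```r
# Data copied from the question; `df` is an assumed name
df <- data.frame(
  player = "Torri_Hunter",
  opponent_name = rep(c("Pittsburgh Pirates", "Detroit Tigers"), times = c(11, 4)),
  game = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1),
  result = c("home run", "triple", "strikeout", "strikeout",
             "groundout", "home run", "flyout", "home run",
             "triple", "strikeout", "strikeout",
             "home run", "home run", "home run", "strikeout"),
  stringsAsFactors = FALSE
)

# Share of a player's at-bats in one game against one opponent that ended in `result`
perc_res <- function(opponent, game, player, result = "home run") {
  in_game <- df$player == player & df$opponent_name == opponent & df$game == game
  sum(in_game & df$result == result) / sum(in_game)
}

# One row per player/opponent/game combination that actually occurs in the data
combos <- unique(df[, c("player", "opponent_name", "game")])

# mapply walks the three columns in parallel, calling perc_res once per row
combos$p <- mapply(perc_res,
                   opponent = combos$opponent_name,
                   game = combos$game,
                   player = combos$player,
                   USE.NAMES = FALSE)

combos
```

Because `combos` only contains combinations that occur in the data, games that were never played simply do not show up, rather than appearing as NaN.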

answered 2015-07-19T16:16:22.723