r - Performing calculations on columns created by ddply

Question

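I am using ddply inside subset to compute some metrics and summarise a table as required. Some of the metrics I want to compute depend on summary columns that are themselves created by the ddply operation.

Here is the function with the simple computed columns: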

subset_by_market <- function (q, marketname, dp) {
  subset(ddply(df, .(quarter, R.DMA.NAMES, daypart, station), summarise, 
               spot.count = length(spot.id), 
               station.investment = sum(rate),
               nullspots.male = sum(nullspot.male),
               nullspots.allpersons = sum(nullspot.allpersons),
               total.male.imp = sum(male.imp),
               total.allpersons.imp = sum(allpersons.imp),
               spotvalue.male = sum(spotvalue.male),
               spotvalue.allpersons = sum(spotvalue.allpersons)),
         quarter == q & R.DMA.NAMES == marketname & daypart == dp)
}

I call subset_by_market("Q32013", "Columbus.OH", "primetime") to create a summarised subset. My resulting table looks like this:

      quarter R.DMA.NAMES   daypart           station spot.count station.investment nullspots.male nullspots.allpersons
10186  Q32013 Columbus.OH primetime ADSM COLUMBUS, OH        103               5150             67                   61
10187  Q32013 Columbus.OH primetime              ESYX         49                  0             49                   49
10188  Q32013 Columbus.OH primetime  MTV COLUMBUS, OH         61               4500              7                    1
10189  Q32013 Columbus.OH primetime     WCMH-Retro TV         94                564             93                   93
10190  Q32013 Columbus.OH primetime              WTTE          1                  0              0                    0
10191  Q32013 Columbus.OH primetime              WWHO          9                  0              2                    2
      total.male.imp total.allpersons.imp spotvalue.male spotvalue.allpersons
10186           47.2                127.7       4830.409            4775.1068
10187            0.0                  0.0            NaN                  NaN
10188          157.9                371.1       4649.746            4505.2608
10189            0.3                  0.3       3162.000            3162.0000
10190            3.5                 10.3        570.166             591.0231
10191            3.9                 15.8       7155.000            4356.4162

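Problem 1: I would like to add percentage values to the same data frame, for example: (i) the percentage of spot.count, = spot.count / sum(spot.count) (ii) percent.nullspots.male = nullspots.male / sum(nullspots.male)

However, when I add these to the ddply arguments, I get 1 (100%) in the resulting column. Each value is divided by itself rather than by the sum of the column.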
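The result of 1 is consistent with how ddply works: summarise is evaluated once per group, so within a group sum(spot.count) is just that group's own spot.count. A minimal sketch of the effect and of one possible workaround (computing the shares in a second step on the summarised table), assuming made-up toy data; the data frame `spots` and the column name `percent.spot.count` are hypothetical:

```r
library(plyr)

# Hypothetical toy data: one row per aired spot
spots <- data.frame(
  station       = c("A", "A", "A", "B", "C", "C"),
  spot.id       = 1:6,
  nullspot.male = c(1, 0, 1, 1, 0, 1)
)

# Inside ddply each expression sees a single station at a time,
# so the per-station count is divided by itself.
inside <- ddply(spots, .(station), summarise,
                spot.count = length(spot.id),
                percent.spot.count = spot.count / sum(spot.count))

# Summarise first, then compute the shares on the finished table,
# where sum() runs over all stations.
totals <- ddply(spots, .(station), summarise,
                spot.count = length(spot.id),
                nullspots.male = sum(nullspot.male))
after <- transform(totals,
                   percent.spot.count = spot.count / sum(spot.count),
                   percent.nullspots.male = nullspots.male / sum(nullspots.male))
```

Applied to the function above, the transform step would go around the subset call, so that the sums are taken over the selected quarter, market and daypart only.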

Problem 2: This is slow, and I humbly accept that this may not be the best coding. I am using an i5 2.6 GHz PC with 16 GB of DDR3 RAM and a 64-bit OS. The dataset has 1M rows.

system.time(subset_by_market ("Q32013" , "Albuquerque.Santa.Fe", "late fringe"))
   user  system elapsed 
 228.13  176.84  416.12

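The goal is to visualise all the computed metrics on an online dashboard and let the user pick the arguments of subset_by_market(q, marketname, dp) from drop-down menus. How can I make it faster?

Adding sample data: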

`> structure(list(market = c("Local", "Local", "Local", "Local", 
"Local", "Local", "Local", "NATIONAL CABLE", "Local", "Local"
), spot.id = c(11248955L, 11262196L, 11946349L, 11625265L, 12929889L, 
11259758L, 11517638L, 11599834L, 12527365L, 12930259L), date = structure(c(1375675200, 
1376625600, 1390280400, 1383627600, 1401249600, 1375848000, 1380772800, 
1383019200, 1397102400, 1401163200), class = c("POSIXct", "POSIXt"
), tzone = ""), hour = c(15, 17, 11, 18, 19, 1, 13, 14, 16, 22
), time = structure(c(0.642361111111111, 0.749305555555556, 0.481944444444444, 
0.770138888888889, 0.830555555555556, 0.0597222222222222, 0.582638888888889, 
0.597222222222222, 0.675694444444444, 0.930555555555556), format = "h:m:s", class = "times"), 
    local.date = structure(c(1375675200, 1376625600, 1390280400, 
    1383627600, 1401249600, 1375848000, 1380772800, 1383019200, 
    1397102400, 1401163200), class = c("POSIXct", "POSIXt"), tzone = ""), 
    local.hour = c(15, 17, 11, 18, 18, 0, 13, 14, 15, 22), local.time = structure(c(0.642361111111111, 
    0.749305555555556, 0.481944444444444, 0.770138888888889, 
    0.788888888888889, 0.0180555555555556, 0.582638888888889, 
    0.597222222222222, 0.634027777777778, 0.930555555555556), format = "h:m:s", class = "times"), 
    vendor = c("Time Warner - Myrtle Beach", "WMYD", "WSBK", 
    "WDCA", "Comcast - Memphis", "Charter Media - Birmingham", 
    "WBNA", "G4", "Comcast - Houston", "Comcast - Youngstown"
    ), station = c("VH-1 MYRTLE BEACH", "WMYD", "WSBK", "WDCA", 
    "COM MEMPHIS", "FX BIRMINGHAM", "WBNA", "G4", "SPK HOUSTON", 
    "COM YOUNGSTOWN CC"), male.imp = c(0, 2, 0, 0, 0.6, 0.4, 
    0, 0, 3.9, 0), women.imp = c(0, 2.5, 0, 2.5, 0.2, 0.6, 0, 
    0, 4.6, 0.6), allpersons.imp = c(0, 3.5, 0, 2.5, 0.8, 0.8, 
    0, 0, 7.8, 0.6), hh.imp = c(0, 8.5, 8, 64.5, 1.3, 2.9, 1.3, 
    15, 13.7, 1), isci = c("IT6140MB", "ITCD78DT", "IT6192BS", 
    "IT6170WD", "IT6173ME", "IT6162BI", "IT6155LO", "ITES13410", 
    "IT3917", "IT3921"), creative = c("Eugene Elbert (Bach. Tcom Eng. Tech) :60", 
    "The Problem Solvers (revised) - IET :60", "Murtech/Kinetic/Integra :60", 
    "Kevin Bumper/NTSG/Lifetime :60", "NCR/Schlumberger/Sprint (revised) (Bach) :60", 
    "Skills Gap (revised) /Kevin :60", "Rising Costs60 (Opportunity Scholar - No Nursing)", 
    "Irina Lund (Bach. ISS) :60", "Augustine Lopez (A. CEET) :30 (no loc)", 
    "John Ryan Ellis (B. PM/A. CDD) :30 (no loc)"), program = c(NA, 
    "TYLER PERRY'S MEET THE BROWNS", "THE PEOPLE'S COURT", "Judge Judy", 
    NA, NA, "Meet the Browns/Are We There Yet/News/Wendy Willia", 
    "HEROES", "Spike EF Rotator", NA), rate = c(5, 230, 100, 
    625, 40, 0, 15, 40, 110, 7), R.DMA.NAMES = c("Myrtle.Beach.Florence", 
    "Detroit", "Boston.Manchester.", "Washington.DC.Hagrstwn.", 
    "Memphis", "Birmingham.Ann.and.Tusc.", "Louisville", "NATIONAL CABLE", 
    "Houston", "Youngstown"), date.time = c("2013-08-05 15:25:00", 
    "2013-08-16 17:59:00", "2014-01-21 11:34:00", "2013-11-05 18:29:00", 
    "2014-05-28 19:56:00", "2013-08-07 01:26:00", "2013-10-03 13:59:00", 
    "2013-10-29 14:20:00", "2014-04-10 16:13:00", "2014-05-27 22:20:00"
    ), daypart = c("afternoon", "evening", "morning", "evening", 
    "evening", "late fringe", "afternoon", "afternoon", "afternoon", 
    "primetime"), quarter = structure(c(4L, 4L, 1L, 6L, 3L, 4L, 
    6L, 6L, 3L, 3L), .Label = c("Q12014", "Q22013", "Q22014", 
    "Q32013", "Q32014", "Q42013"), class = "factor"), cpi.allpersons = c(96.2179487179487, 
    79.0114068441065, 35.1219512195122, 82.3322348711803, 30, 
    0, 138.721804511278, 28.3135215453195, 28.2384088854449, 
    86.6666666666667), cpi.male = c(750.5, 188.882673751923, 
    115.959004392387, 144.492639327024, 38.9847715736041, 0, 
    595.161290322581, 34.7402005469462, 62.010777084515, 156.712328767123
    ), spotvalue.allpersons = c(0, 276.539923954373, 0, 205.830587177951, 
    24, 0, 0, 0, 220.25958930647, 52), spotvalue.male = c(0, 
    377.765347503846, 0, 0, 23.3908629441624, 0, 0, 0, 241.842030629609, 
    0), nullspot.allpersons = c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0), 
    nullspot.male = c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1)), .Names = c("market", 
"spot.id", "date", "hour", "time", "local.date", "local.hour", 
"local.time", "vendor", "station", "male.imp", "women.imp", "allpersons.imp", 
"hh.imp", "isci", "creative", "program", "rate", "R.DMA.NAMES", 
"date.time", "daypart", "quarter", "cpi.allpersons", "cpi.male", 
"spotvalue.allpersons", "spotvalue.male", "nullspot.allpersons", 
"nullspot.male"), row.names = c(561147L, 261262L, 89888L, 941010L, 
500366L, 65954L, 484053L, 598996L, 380976L, 968615L), class = "data.frame")`

Apologies for the ugly dput.

score 0 · Accepted Answer

This only answers the second question, about making the function faster. Following @beginneR's suggestion, I converted the function to dplyr.

subset_by_market <- function (q, marketname, dp) {

  subset(df %>% group_by(quarter, R.DMA.NAMES, daypart, station) %>%
  summarize (spot.count = length(spot.id), station.investment = sum(rate),
             nullspots.male = sum(nullspot.male),
             nullspots.allpersons = sum(nullspot.allpersons),
             total.male.imp = sum(male.imp),
             total.allpersons.imp = sum(allpersons.imp),
             spotvalue.male = sum(spotvalue.male),
             spotvalue.allpersons = sum(spotvalue.allpersons),
             male.imp.per.spot = total.male.imp / spot.count,
             allpersons.imp.per.spot = total.allpersons.imp / spot.count,
             cost.per.spot = station.investment / spot.count,
             male.value.per.spot = spotvalue.male / spot.count,
             allpersons.value.per.spot = spotvalue.allpersons / spot.count), 
  quarter == q & R.DMA.NAMES == marketname & daypart == dp) }

This cut the run time dramatically:

> system.time(subset_by_market ("Q32013" , "Albuquerque.Santa.Fe", "late fringe"))
   user  system elapsed 
   1.06    0.00    1.09
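A further change that may help, though it was not timed here: the function above still summarises every quarter, market and daypart before discarding all but one combination. Filtering first means only the matching rows are grouped and summed. A minimal sketch, assuming the same column names as above; the function name `subset_by_market_filtered`, the `data` argument (used instead of the global df) and the toy data are hypothetical, and only two of the metrics are shown:

```r
library(dplyr)

# Hypothetical variant: filter first, then summarise only the matching rows.
subset_by_market_filtered <- function(data, q, marketname, dp) {
  data %>%
    filter(quarter == q, R.DMA.NAMES == marketname, daypart == dp) %>%
    group_by(quarter, R.DMA.NAMES, daypart, station) %>%
    summarise(spot.count = n(),
              station.investment = sum(rate)) %>%
    ungroup()
}

# Made-up toy data with the columns the sketch needs
toy <- data.frame(
  quarter     = c("Q32013", "Q32013", "Q32013", "Q42013"),
  R.DMA.NAMES = c("Columbus.OH", "Columbus.OH", "Detroit", "Columbus.OH"),
  daypart     = "primetime",
  station     = c("WTTE", "WTTE", "WMYD", "WTTE"),
  rate        = c(10, 20, 30, 40)
)

res <- subset_by_market_filtered(toy, "Q32013", "Columbus.OH", "primetime")
```

For a dashboard where many selections are requested, the opposite trade-off could also be reasonable: summarise the full table once, keep the result, and only filter it per request.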

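The glitch I ran into with dplyr was a column in my data named "time", which is of class times from the chron package. I kept getting the error Error: column 'local.time' has unsupported type. I could not work out the exact fix, so I simply converted the column to POSIXct using df$time <- as.POSIXct(as.character(df$time), format = "%H:%M:%S"). This is not optimal, because the reason I had converted it to times with chron in the first place was to keep the chronology of times without needing a date or time zone. More information is in the related question on working around specifying a date range in an ifelse statement in R. However, it solves the immediate problem.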
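One alternative that keeps the time of day without attaching a date, offered only as a sketch: as the dput in the question shows, a times value is stored as a fraction of a day, so the column can be reduced to plain numbers, which dplyr has no trouble with. The vector `time_col` below is a hypothetical stand-in built from the first two sample values; whether this suits the rest of the workflow is an assumption.

```r
# Stand-in for a chron "times" column, built like the dput in the question
time_col <- structure(c(0.642361111111111, 0.749305555555556),
                      format = "h:m:s", class = "times")

# Drop the class and attributes so only the fraction of a day remains
day_fraction <- as.numeric(unclass(time_col))

# Whole seconds since midnight
secs <- as.integer(round(day_fraction * 86400))

# Rebuild an "HH:MM:SS" label when one is needed for display
label <- sprintf("%02d:%02d:%02d",
                 secs %/% 3600L, (secs %% 3600L) %/% 60L, secs %% 60L)
```

The numeric seconds sort and compare in time order, so the chronology is preserved, and the label is only needed for presentation.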
