1

Background: I am running a little A/B test with a 2x2 factorial design (foreground black and background white, each off-color vs normal color). Analytics reports the number of hits for each of the 4 conditions and the rate at which they 'converted' (a binary variable, which I define as spending at least 40 seconds on the page). It's easy enough to do a little editing and get this into a nice R dataframe:

rates <- read.csv(stdin(),header=TRUE)
Black,White,N,Rate
TRUE,FALSE,512,0.2344
FALSE,TRUE,529,0.2098
TRUE,TRUE,495,0.1919
FALSE,FALSE,510,0.1882

Naturally, I'd like to look at a logistic regression on something like Rate ~ Black * White but R's glm wants a dataframe of 2046 rows each reporting a TRUE or FALSE conversion value & the values of Black and White. This... is a little more tricky. I googled around and checked SO but while I found some clunky code on how to convert a table of contingency counts to a dataframe, I didn't find anything about percentages/rates.

After a lot of trouble, I came up with a loop over the 4 conditions in which I repeat a dataframe rate * n times with the relevant condition values and the result True and then do the same thing but for (1 - rate) * n and the result False, and then stitch together all 8 dataframes into one giant dataframe:

ground <- NULL
for (i in 1:nrow(rates)) {
        x <- rates[i,]
        y <- do.call("rbind", replicate((x$N * x$Rate),     data.frame(Black=c(x$Black),White=c(x$White),Conversion=c(TRUE)),  simplify = FALSE))
        z <- do.call("rbind", replicate((x$N * (1-x$Rate)), data.frame(Black=c(x$Black),White=c(x$White),Conversion=c(FALSE)), simplify = FALSE))
        ground <- rbind(ground,y,z)
}

The resulting dataframe ground looks right:

sum(rates$N)
[1] 2046
nrow(ground)
[1] 2042
# the missing 4 are probably from the rounding-off of the reported conversion rate
summary(ground); head(ground, n=20)
   Black           White         Conversion     
 Mode :logical   Mode :logical   Mode :logical  
 FALSE:1037      FALSE:1020      FALSE:1623     
 TRUE :1005      TRUE :1022      TRUE :419      
 NA's :0         NA's :0         NA's :0        
   Black White Conversion
1   TRUE FALSE       TRUE
2   TRUE FALSE       TRUE
3   TRUE FALSE       TRUE
4   TRUE FALSE       TRUE
5   TRUE FALSE       TRUE
6   TRUE FALSE       TRUE
7   TRUE FALSE       TRUE
8   TRUE FALSE       TRUE
9   TRUE FALSE       TRUE
10  TRUE FALSE       TRUE
11  TRUE FALSE       TRUE
12  TRUE FALSE       TRUE
13  TRUE FALSE       TRUE
14  TRUE FALSE       TRUE
15  TRUE FALSE       TRUE
16  TRUE FALSE       TRUE
17  TRUE FALSE       TRUE
18  TRUE FALSE       TRUE
19  TRUE FALSE       TRUE
20  TRUE FALSE       TRUE
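
A quick check of where the 4 missing rows go, as a sketch: it assumes replicate() drops the fractional part of its count (as integer() does), so each of the 8 groups loses its fraction. The variable names below are only illustrative.

```r
# Counts as reported in the question
N    <- c(512, 529, 495, 510)
Rate <- c(0.2344, 0.2098, 0.1919, 0.1882)

# What the loop effectively builds: fractional counts truncated toward zero
conv_trunc    <- floor(N * Rate)
nonconv_trunc <- floor(N * (1 - Rate))
total_trunc   <- sum(conv_trunc) + sum(nonconv_trunc)   # 2042, matching nrow(ground)

# Rounding the conversions and taking the remainder keeps every hit
conv_round    <- round(N * Rate)
nonconv_round <- N - conv_round
total_round   <- sum(conv_round) + sum(nonconv_round)   # 2046, matching sum(rates$N)
```

So the shortfall comes from truncating N * Rate and N * (1 - Rate) separately; the rounded Rate is only the reason those products are not whole numbers.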

And likewise, the logistic regression spits out a sane-looking answer:

g <- glm(Conversion ~ Black*White, family=binomial, data=ground); summary(g)
...
Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-0.732  -0.683  -0.650  -0.643   1.832  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)           -1.472      0.114  -12.94   <2e-16
BlackTRUE              0.291      0.154    1.88    0.060
WhiteTRUE              0.137      0.156    0.88    0.381
BlackTRUE:WhiteTRUE   -0.404      0.220   -1.84    0.066

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2072.7  on 2041  degrees of freedom
Residual deviance: 2068.2  on 2038  degrees of freedom
AIC: 2076

Number of Fisher Scoring iterations: 4

So my question is: is there any more elegant way of turning my Analytics's rate data into glm input than that awful loop?

3 Answers

1

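One thing is how to convert your data. Another is why. From ?glm: "[f]or binomial [...] famil[y] the response can [...] be specified as a factor (when the first level denotes failure and all others success) or as a two-column matrix with the columns giving the numbers of successes and failures." The first method corresponds to your "R's glm wants a dataframe of 2046 rows each reporting a TRUE or FALSE conversion". The second method basically corresponds to your original data set, where the number of 'successes' can easily be calculated from Rate and N. A third way is to use the proportion of successes per treatment combination as the response variable, in which case the number of trials has to be provided as the weights argument.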

set.seed(1)
# one row per observation
df1 <- data.frame(x = sample(c("yes", "no"), 40, replace = TRUE),
                  y = sample(c("yes", "no"), 40, replace = TRUE),
                  z = rbinom(n = 40, size = 1, prob = 0.5))
df1

library(plyr)
# aggregated data with one row per treatment combination
df2 <- ddply(.data = df1, .variables = .(x, y), summarize,
             n = length(z),
             rate = sum(z)/n,
             success = n*rate,
             failure = n - success)  
df2

# three different ways to specify the models,
# which all give the same parameter estimates for x, y and x*y
mod1 <- glm(z ~ x * y, data = df1, family = binomial) 
mod2 <- glm(cbind(success, failure) ~ x * y, data = df2, family = binomial)
mod3 <- glm(rate ~ x * y, data = df2, weights = n, family = binomial)

summary(mod1)
summary(mod2)
summary(mod3) 
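
Applied to the data in the question, the second and third forms might look like the sketch below. It assumes the reported rates are rounded, so the success counts are recovered with round(); the column names success, failure and prop are introduced here only for illustration.

```r
# Data as reported in the question
rates <- data.frame(
  Black = c(TRUE, FALSE, TRUE, FALSE),
  White = c(FALSE, TRUE, TRUE, FALSE),
  N     = c(512, 529, 495, 510),
  Rate  = c(0.2344, 0.2098, 0.1919, 0.1882)
)

# Assumption: recover whole-number success counts from the rounded rates
rates$success <- round(rates$N * rates$Rate)
rates$failure <- rates$N - rates$success
rates$prop    <- rates$success / rates$N

# Second form: two-column matrix of successes and failures
fit_counts <- glm(cbind(success, failure) ~ Black * White,
                  family = binomial, data = rates)

# Third form: proportion as response, number of trials as weights
fit_prop <- glm(prop ~ Black * White, weights = N,
                family = binomial, data = rates)

# Both forms should give the same coefficients
all.equal(coef(fit_counts), coef(fit_prop))
```

Using the raw Rate column as the response instead of prop should give nearly the same estimates, but glm is likely to warn about non-integer numbers of successes because N * Rate is not a whole number.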
answered 2013-09-13T19:27:56.070
1
rates$counts <- rates$N*rates$Rate
rates$counts <- round(rates$counts,0)
 rates
#----------
  Black White   N   Rate counts
1  TRUE FALSE 512 0.2344    120
2 FALSE  TRUE 529 0.2098    111
3  TRUE  TRUE 495 0.1919     95
4 FALSE FALSE 510 0.1882     96

rates$failures <- rates$N - rates$counts
glm(cbind(counts,failures)~Black*White, data=rates, family="binomial")

Call:  glm(formula = cbind(counts, failures) ~ Black * White, family = "binomial", 
    data = rates)

Coefficients:
        (Intercept)            BlackTRUE            WhiteTRUE  
            -1.4615               0.2777               0.1356  
BlackTRUE:WhiteTRUE  
            -0.3894  

Degrees of Freedom: 3 Total (i.e. Null);  0 Residual
Null Deviance:      4.104 
Residual Deviance: -7.461e-14   AIC: 33.05 
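
The 0 residual degrees of freedom and the essentially zero residual deviance are what one would expect here: with 4 cells and 4 coefficients the model is saturated, so it reproduces the observed rates exactly. A minimal sketch of that check, plus one possible way to test the interaction (a likelihood-ratio comparison against the additive model, which is a suggestion and not part of the output above):

```r
# Data as reported in the question, with counts recovered by rounding
rates <- data.frame(
  Black = c(TRUE, FALSE, TRUE, FALSE),
  White = c(FALSE, TRUE, TRUE, FALSE),
  N     = c(512, 529, 495, 510),
  Rate  = c(0.2344, 0.2098, 0.1919, 0.1882)
)
rates$counts   <- round(rates$N * rates$Rate)
rates$failures <- rates$N - rates$counts

full     <- glm(cbind(counts, failures) ~ Black * White,
                data = rates, family = binomial)
additive <- glm(cbind(counts, failures) ~ Black + White,
                data = rates, family = binomial)

# Saturated model: fitted probabilities equal the observed proportions
observed <- rates$counts / rates$N
all.equal(unname(fitted(full)), observed)

# Likelihood-ratio test of the Black:White interaction (1 degree of freedom)
lrt <- anova(additive, full, test = "Chisq")
lrt
```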
answered 2013-09-13T19:20:59.143
0

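It's not too clear what you are converting, but if all you need is n rows for each value n in column N, then (edit: I was sloppy) the first thing is to convert all factors in your original data frame to numerics or characters as appropriate. Then,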

# just put in placeholder values
newdf<-data.frame(Black="n",White="n",Rate=0,stringsAsFactors=FALSE) 
newdf[1:rates[1,3],] <- rates[1,c(1,2,4)]
# the block for the second condition starts right after the first one
newdf[(rates[1,3]+1):(rates[1,3]+rates[2,3]),] <- rates[2,c(1,2,4)]

and so on for each row in the original data frame rates.
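
The same row-by-row replication can probably be written without placeholders by indexing with rep(). This is only a sketch: the Conversion column at the end is an added assumption (the first round(N * Rate) rows of each condition are marked as conversions), and idx and successes are illustrative names.

```r
# Data as reported in the question
rates <- data.frame(
  Black = c(TRUE, FALSE, TRUE, FALSE),
  White = c(FALSE, TRUE, TRUE, FALSE),
  N     = c(512, 529, 495, 510),
  Rate  = c(0.2344, 0.2098, 0.1919, 0.1882)
)

# Repeat row i of rates N[i] times
idx   <- rep(seq_len(nrow(rates)), times = rates$N)
newdf <- rates[idx, c("Black", "White", "Rate")]
rownames(newdf) <- NULL

# Assumption: mark the first round(N * Rate) rows of each block as conversions
successes <- round(rates$N * rates$Rate)
newdf$Conversion <- unlist(Map(
  function(s, n) rep(c(TRUE, FALSE), times = c(s, n - s)),
  successes, rates$N
))

nrow(newdf)   # one row per hit, i.e. sum(rates$N)
```

The resulting newdf can then be passed to glm(Conversion ~ Black * White, family = binomial, data = newdf) in the one-row-per-observation form.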

answered 2013-09-13T18:02:52.113