r - inaccurate percentages in expss table

Question

I'm analyzing some survey data and using expss to create tables.

One of our questions is about brand awareness. I have 3 types of brands: BrandA is a brand that a large subset of the sample sees, BrandB is a brand that a smaller (mutually exclusive!) subset of the sample sees, and BrandC is a brand that every respondent sees.

I'd like to treat this awareness question as a multiple response question and report the % of people (who actually saw the brand) who are aware of each brand. (In this case, a value of 1 means that the respondent was aware of the brand.)

The closest I can get is by using the code below, but tab_stat_cpct() is not reporting accurate percentages or # of cases, as you can see in the attached table. When you compare the Total % listed in the table to the total % computed manually (i.e., via mean(data$BrandA, na.rm = TRUE)), it is reporting values that are too low for BrandA and BrandB, and a value that is too high for BrandC. (Not to mention that the total # of cases should be 25.)

I've read over the documentation, and I understand that this issue is due to how tab_stat_cpct() defines a "case" for the purposes of computing the percentage, but I don't see an argument that will adjust that definition to do what I need. Am I missing something? Or is there some other way of reporting accurate percentages? Thanks!

library(expss)
set.seed(123)

data <- data.frame(
    Age = sample(c("25-34", "35-54", "55+"), 25, replace = TRUE),
    BrandA = c(1, 0, 0, 1, 0, 1, NA, NA, NA, NA, NA, NA, NA, 1, 
               0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1),
    BrandB = c(NA, NA, NA, NA, NA, NA, 1, 1, 0, 1, 0, 1, 1, NA, 
               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
    BrandC = c(1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 
               1, 1, 1, 0, 1, 0, 1, 0, 1)
)

data %>%
    tab_cells(mrset(as.category(BrandA %to% BrandC))) %>%
    tab_cols(total(), Age) %>%
    tab_stat_cpct() %>%
    tab_last_sig_cpct() %>%
    tab_pivot()

##    |              | #Total |     Age |       |      |
##    |              |        |   25-34 | 35-54 |  55+ |
##    |              |        |       A |     B |    C |
##    | ------------ | ------ | ------- | ----- | ---- |
##    |       BrandA |   52.4 |  83.3 B |  28.6 | 50.0 |
##    |       BrandB |   23.8 |         |  42.9 | 25.0 |
##    |       BrandC |   71.4 | 100.0 C |  71.4 | 50.0 |
##    | #Total cases |     21 |     6   |     7 |    8 |

score 2 · Accepted Answer

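expss assumes that all items in a multiple-response set share the same base. For an mdset, the base is the number of cases with at least one non-empty item (an item with value 1). That is why the base for your brands is 21. If we treat each item separately, we need to show a total for each item in order to calculate significance, which is quite inconvenient in many cases.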
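The base of 21 can be checked by hand. The following is a minimal base-R sketch, not expss code: it counts the rows with at least one brand equal to 1 and divides each brand's count of 1s by that common base. The names `brands`, `common_base`, and `pct_common` are made up for this illustration, and `Age` is left out because it is not needed for the totals.

```r
# Brand columns copied from the question (Age omitted: not needed for totals)
brands <- data.frame(
    BrandA = c(1, 0, 0, 1, 0, 1, NA, NA, NA, NA, NA, NA, NA, 1,
               0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1),
    BrandB = c(NA, NA, NA, NA, NA, NA, 1, 1, 0, 1, 0, 1, 1, NA,
               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
    BrandC = c(1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0,
               1, 1, 1, 0, 1, 0, 1, 0, 1)
)

# A row counts toward the common base if at least one brand equals 1
common_base <- sum(rowSums(brands == 1, na.rm = TRUE) > 0)

# Percent of the common base that has a 1 for each brand
pct_common <- round(100 * colSums(brands == 1, na.rm = TRUE) / common_base, 1)

common_base   # 21
pct_common    # BrandA 52.4, BrandB 23.8, BrandC 71.4
```

These values match the #Total column of the table in the question, which is why they differ from `mean(data$BrandA, na.rm = TRUE)`.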

In your case, you can use the following function:

library(expss)
tab_stat_dich = function(data, total_label = NULL, total_statistic = "u_cases", 
                         label = NULL){

    if (missing(total_label) && !is.null(data[["total_label"]])) {
        total_label = data[["total_label"]]
    } 
    if(is.null(total_label)){
        total_label = "#Total"
    }

    # calculate means
    res = eval.parent(
        substitute(
            tab_stat_mean_sd_n(data, weighted_valid_n = "w_cases" %in% total_statistic,
                               labels = c("|", "@@@@@", total_label),
                               label = label)
        )
    )
    curr_tab = res[["result"]][[length(res[["result"]])]]
    # drop standard deviation
    curr_tab = curr_tab[c(TRUE, FALSE, TRUE), ]

    # convert means to percent
    curr_tab[c(TRUE, FALSE), -1] = curr_tab[c(TRUE, FALSE), -1] * 100
    ## clear row labels
    curr_tab[[1]] = gsub("^(.+?)\\|(.+)$", "\\2", curr_tab[[1]], perl = TRUE )

    res[["result"]][[length(res[["result"]])]] = curr_tab
    res
}

set.seed(123)
data <- data.frame(
    Age = sample(c("25-34", "35-54", "55+"), 25, replace = TRUE),
    BrandA = c(1, 0, 0, 1, 0, 1, NA, NA, NA, NA, NA, NA, NA, 1, 
               0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1),
    BrandB = c(NA, NA, NA, NA, NA, NA, 1, 1, 0, 1, 0, 1, 1, NA, 
               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
    BrandC = c(1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 
               1, 1, 1, 0, 1, 0, 1, 0, 1)
)

res = data %>%
    tab_cells(BrandA %to% BrandC) %>%
    tab_cols(total(), Age) %>%
    tab_stat_dich() %>%
    tab_last_sig_cpct() %>%
    tab_pivot() 

res
# |        | #Total |   Age |        |      |
# |        |        | 25-34 |  35-54 |  55+ |
# |        |        |     A |      B |    C |
# | ------ | ------ | ----- | ------ | ---- |
# | BrandA |   61.1 |  71.4 | 83.3 C | 20.0 |
# | #Total |     18 |     7 |    6   |    5 |
# | BrandB |   71.4 | 100.0 | 66.7   | 50.0 |
# | #Total |      7 |     2 |    3   |    2 |
# | BrandC |   60.0 |  55.6 | 66.7   | 57.1 |
# | #Total |     25 |     9 |    9   |    7 |

# if we want to drop totals
where(res, !grepl("#", row_labels))
# |        | #Total |   Age |        |      |
# |        |        | 25-34 |  35-54 |  55+ |
# |        |        |     A |      B |    C |
# | ------ | ------ | ----- | ------ | ---- |
# | BrandA |   61.1 |  71.4 | 83.3 C | 20.0 |
# | BrandB |   71.4 | 100.0 | 66.7   | 50.0 |
# | BrandC |   60.0 |  55.6 | 66.7   | 57.1 |
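As a cross-check that does not depend on expss, the #Total column above can be reproduced in base R. This is only a sketch of the same idea (mean of each 0/1 item times 100, with the count of non-missing values as the base), not the internals of `tab_stat_dich()`. The names `brands`, `valid_n`, and `pct_valid` are made up for this illustration, and `Age` is omitted because only the totals are checked.

```r
# Brand columns copied from the question (Age omitted: only totals are checked)
brands <- data.frame(
    BrandA = c(1, 0, 0, 1, 0, 1, NA, NA, NA, NA, NA, NA, NA, 1,
               0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1),
    BrandB = c(NA, NA, NA, NA, NA, NA, 1, 1, 0, 1, 0, 1, 1, NA,
               NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA),
    BrandC = c(1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0,
               1, 1, 1, 0, 1, 0, 1, 0, 1)
)

# Per-item base: respondents who actually saw the brand (non-missing values)
valid_n <- colSums(!is.na(brands))

# Per-item percentage: mean of the 0/1 indicator among those respondents
pct_valid <- round(100 * colMeans(brands, na.rm = TRUE), 1)

valid_n     # BrandA 18, BrandB 7, BrandC 25
pct_valid   # BrandA 61.1, BrandB 71.4, BrandC 60.0
```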
