r - 将 R 中的列联表与第一列作为字符相加

Question

我的销售数据集包括 3 列：国家、销售类型/方法、季度总收入。这是前几行的显示，以便更好地了解：

   Retailer.country Order.method.type    Qtr.Rev
         <fctr>            <fctr>      <dbl>
 1        Australia            E-mail  171407.28
 2        Australia       Sales visit 2013909.18
 3        Australia           Special  158795.34
 4        Australia         Telephone 2289201.87
 5        Australia               Web 1738303.59
 6          Austria       Sales visit   66926.18
 7          Austria         Telephone 1671887.40
 8          Austria               Web 7050164.50
 9          Belgium       Sales visit 1655507.05
10          Belgium               Web 6222440.26
etc.........

这是此数据的输出：

    structure(list(Retailer.country = structure(c(1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 
7L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 
11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 14L, 
14L, 14L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 17L, 17L, 17L, 
17L, 18L, 18L, 19L, 19L, 19L, 19L, 19L, 20L, 20L, 20L, 20L, 20L, 
21L, 21L, 21L, 21L, 21L, 21L), .Label = c("Australia", "Austria", 
"Belgium", "Brazil", "Canada", "China", "Denmark", "Finland", 
"France", "Germany", "Italy", "Japan", "Korea", "Mexico", 
"Netherlands", 
"Singapore", "Spain", "Sweden", "Switzerland", "United Kingdom", 
"United States"), class = "factor"), Order.method.type = 
structure(c(1L, 
4L, 5L, 6L, 7L, 4L, 6L, 7L, 4L, 7L, 7L, 1L, 2L, 4L, 7L, 2L, 4L, 
6L, 7L, 4L, 7L, 4L, 7L, 2L, 4L, 6L, 7L, 1L, 3L, 4L, 7L, 1L, 2L, 
4L, 5L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 4L, 6L, 7L, 4L, 5L, 7L, 
2L, 3L, 6L, 7L, 2L, 5L, 6L, 7L, 2L, 3L, 6L, 7L, 1L, 7L, 2L, 4L, 
5L, 6L, 7L, 1L, 2L, 4L, 6L, 7L, 2L, 3L, 4L, 5L, 6L, 7L), .Label = 
c("E-mail", 
"Fax", "Mail", "Sales visit", "Special", "Telephone", "Web"), class = 
"factor"), 
    Qtr.Rev = c(171407.28, 2013909.18, 158795.34, 2289201.87, 
    1738303.59, 66926.18, 1671887.4, 7050164.5, 1655507.05, 
    6222440.26, 
    7746789.52, 6864270.12, 195549.5, 450628.79, 12376528.53, 
    415128.31, 1453194.14, 2735416.3, 15777880.11, 413978.16, 
    3776833.13, 308638.6, 12328172.97, 709194.65, 1304167.86, 
    5897377.14, 11048160.97, 1546079.43, 1247170.05, 2373591.15, 
    12102240.99, 2461322.51, 165800.42, 1397604.56, 198705.05, 
    7413833.64, 2662351.94, 289704.5, 680467.87, 87186.72, 343708.86, 
    1802166.73, 16990817.52, 2821127.32, 431860.34, 10144353.75, 
    5063353.42, 1725508.54, 3571760.87, 593828.88, 1074860.66, 
    2981026.86, 5254137.56, 469627.61, 908725.05, 1625096.56, 
    9677070.09, 88788.41, 337710.73, 254360.21, 7835117.44, 
    1292812.39, 
    4818848.86, 217936.39, 792168.42, 790344.28, 109161.04, 
    4565896.64, 
    697619.35, 264500.2, 189218.02, 2022968.96, 13756025.4, 
    1357389.56, 
    2352483.29, 2842600.85, 685752.21, 13437403.28, 29573813.7
    )), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-79L), .Names = c("Retailer.country", "Order.method.type", "Qtr.Rev"
))

我正在 R 中创建一个列联表，显示每个国家/地区每种销售方法产生的季度收入。最终输出应与此类似：

Retailer.country     E-mail        Fax      Mail Sales visit   Special  Telephone       Web  TOTAL.cn
1         Australia   171407.3       0.00       0.0  2013909.18  158795.3  2289201.9   1738304   6371617
2           Austria        0.0       0.00       0.0    66926.18       0.0  1671887.4   7050164   8788978
3           Belgium        0.0       0.00       0.0  1655507.05       0.0        0.0   6222440   7877947
4            Brazil        0.0       0.00       0.0        0.00       0.0        0.0   7746790   7746790
5            Canada  6864270.1  195549.50       0.0   450628.79       0.0        0.0  12376529  19886977
6             China        0.0  415128.31       0.0  1453194.14       0.0  2735416.3  15777880  20381619
7           Denmark        0.0       0.00       0.0   413978.16       
...
20   United Kingdom   697619.3  264500.20       0.0   189218.02       0.0  2022969.0  13756025  16930332
21    United States        0.0 1357389.56 2352483.3  2842600.85  685752.2 13437403.3  29573814  50249443
22       TOTAL.type 15695863.0 4767448.43 5692692.6 23233800.42 4811539.3 35257926.7 203769190 293228461

reshape 库中的 cast() 函数完成了大部分工作，只留下汇总列和行以供计算所有值。

cast(sales.by.country, Retailer.country ~ Order.method.type, 
fill=0) -> sales.by.country

将这些行汇总到一个名为“TOTAL.cn”的新列中非常简单：

sales.by.country$TOTAL.cn <- rowSums(sales.by.country[,c(2:8)])

但是对列求和是一件很头疼的事情，因为最后一行的第一个组件必须是一个因子或字符。我将第一列“Retailer.country”转换为字符类型，因为它实际上只是一个视觉标签。

在对几个函数大惊小怪之后，这是我能够创建以实现预期的行求和的最佳代码：

# Sum the numeric columns, which is everything *except* column 1
total.by.ordertype <- (colSums(sales.by.country[,-1]))

# Create the Total by Order row
total.by.ordertype.row <- list("TOTAL.type", total.by.ordertype[1], 
total.by.ordertype[2], total.by.ordertype[3], total.by.ordertype[4], 
total.by.ordertype[5], total.by.ordertype[6], total.by.ordertype[7], 
total.by.ordertype[8])

# Add the Total by Order row to the bottom of the table
sales.by.country[22, ] <- total.by.ordertype.row

它可以在所有列之间工作并维护正确的数据类型......但我认为必须有一种更有效的方法，也许是通过使用 apply 系列函数，来自 dplyr 等。也许唯一的方法是编写我自己的功能？

例如，未来的数据集可能有 50 多种不同的销售方法。在为“按顺序汇总”行（上图）创建列表时，我必须调用向量中的每个单元格，用逗号分隔，以便成功地将其添加到我现有的表中。其他努力将所有其他列的数据类型转换为字符，这把一切都搞砸了。

我不介意复制/粘贴“total.by.ordertype”8 次。但是当我处理 50-100 个订单类型时会发生什么？有没有更清洁的方法来重现这些？

谢谢！

score 3 · Accepted Answer

库中的cast()函数reshape可以完成整个工作。使用参数margin = TRUE，将计算所有行和列的总计：

reshape::cast(sales.by.country, Retailer.country ~ Order.method.type, fun.aggregate = sum, 
     fill = 0, margins = TRUE)

   Retailer.country     E-mail        Fax      Mail Sales visit   Special  Telephone       Web     (all)
1         Australia   171407.3       0.00       0.0  2013909.18  158795.3  2289201.9   1738304   6371617
2           Austria        0.0       0.00       0.0    66926.18       0.0  1671887.4   7050164   8788978
3           Belgium        0.0       0.00       0.0  1655507.05       0.0        0.0   6222440   7877947
4            Brazil        0.0       0.00       0.0        0.00       0.0        0.0   7746790   7746790
5            Canada  6864270.1  195549.50       0.0   450628.79       0.0        0.0  12376529  19886977
6             China        0.0  415128.31       0.0  1453194.14       0.0  2735416.3  15777880  20381619
7           Denmark        0.0       0.00       0.0   413978.16       0.0        0.0   3776833   4190811
8           Finland        0.0       0.00       0.0   308638.60       0.0        0.0  12328173  12636812
9            France        0.0  709194.65       0.0  1304167.86       0.0  5897377.1  11048161  18958901
10          Germany  1546079.4       0.00 1247170.1  2373591.15       0.0        0.0  12102241  17269082
11            Italy  2461322.5  165800.42       0.0  1397604.56  198705.0        0.0   7413834  11637266
12            Japan  2662351.9  289704.50  680467.9    87186.72  343708.9  1802166.7  16990818  22856404
13            Korea        0.0       0.00       0.0  2821127.32       0.0   431860.3  10144354  13397341
14           Mexico        0.0       0.00       0.0  5063353.42 1725508.5        0.0   3571761  10360623
15      Netherlands        0.0  593828.88 1074860.7        0.00       0.0  2981026.9   5254138   9903854
16        Singapore        0.0  469627.61       0.0        0.00  908725.1  1625096.6   9677070  12680519
17            Spain        0.0   88788.41  337710.7        0.00       0.0   254360.2   7835117   8515977
18           Sweden  1292812.4       0.00       0.0        0.00       0.0        0.0   4818849   6111661
19      Switzerland        0.0  217936.39       0.0   792168.42  790344.3   109161.0   4565897   6475507
20   United Kingdom   697619.3  264500.20       0.0   189218.02       0.0  2022969.0  13756025  16930332
21    United States        0.0 1357389.56 2352483.3  2842600.85  685752.2 13437403.3  29573814  50249443
22            (all) 15695863.0 4767448.43 5692692.6 23233800.42 4811539.3 35257926.7 203769190 293228461

当然，fun.aggregate也必须指定。

包中也提供了相同的功能，它是reshape2的继任者reshape，但对于这个小样本量来说，速度快了大约 4 倍。

reshape2::dcast(sales.by.country, Retailer.country ~ Order.method.type, fun.aggregate = sum, 
                fill = 0, margins = TRUE)

dcast()也可以从data.table声称比reshape2::dcast(). 不幸的是，该margins参数尚未实现（当前 CRAN 版本 1.10.4）。因此，边距必须单独计算并与原始数据结合：

DT2 <- rbind(
  DT,
  DT[, .(Qtr.Rev = sum(Qtr.Rev)), by = Retailer.country],
  DT[, .(Qtr.Rev = sum(Qtr.Rev)), by = Order.method.type],
  DT[, .(Qtr.Rev = sum(Qtr.Rev))], 
  fill = TRUE
)
dcast(DT2, Retailer.country ~ Order.method.type, fill = 0)

    Retailer.country     E-mail        Fax      Mail Sales visit   Special  Telephone       Web        NA
 1:        Australia   171407.3       0.00       0.0  2013909.18  158795.3  2289201.9   1738304   6371617
 2:          Austria        0.0       0.00       0.0    66926.18       0.0  1671887.4   7050164   8788978
 3:          Belgium        0.0       0.00       0.0  1655507.05       0.0        0.0   6222440   7877947
 4:           Brazil        0.0       0.00       0.0        0.00       0.0        0.0   7746790   7746790
 5:           Canada  6864270.1  195549.50       0.0   450628.79       0.0        0.0  12376529  19886977
 6:            China        0.0  415128.31       0.0  1453194.14       0.0  2735416.3  15777880  20381619
 7:          Denmark        0.0       0.00       0.0   413978.16       0.0        0.0   3776833   4190811
 8:          Finland        0.0       0.00       0.0   308638.60       0.0        0.0  12328173  12636812
 9:           France        0.0  709194.65       0.0  1304167.86       0.0  5897377.1  11048161  18958901
10:          Germany  1546079.4       0.00 1247170.1  2373591.15       0.0        0.0  12102241  17269082
11:            Italy  2461322.5  165800.42       0.0  1397604.56  198705.0        0.0   7413834  11637266
12:            Japan  2662351.9  289704.50  680467.9    87186.72  343708.9  1802166.7  16990818  22856404
13:            Korea        0.0       0.00       0.0  2821127.32       0.0   431860.3  10144354  13397341
14:           Mexico        0.0       0.00       0.0  5063353.42 1725508.5        0.0   3571761  10360623
15:      Netherlands        0.0  593828.88 1074860.7        0.00       0.0  2981026.9   5254138   9903854
16:        Singapore        0.0  469627.61       0.0        0.00  908725.1  1625096.6   9677070  12680519
17:            Spain        0.0   88788.41  337710.7        0.00       0.0   254360.2   7835117   8515977
18:           Sweden  1292812.4       0.00       0.0        0.00       0.0        0.0   4818849   6111661
19:      Switzerland        0.0  217936.39       0.0   792168.42  790344.3   109161.0   4565897   6475507
20:   United Kingdom   697619.3  264500.20       0.0   189218.02       0.0  2022969.0  13756025  16930332
21:    United States        0.0 1357389.56 2352483.3  2842600.85  685752.2 13437403.3  29573814  50249443
22:               NA 15695863.0 4767448.43 5692692.6 23233800.42 4811539.3 35257926.7 203769190 293228461
    Retailer.country     E-mail        Fax      Mail Sales visit   Special  Telephone       Web        NA

score 0 · Accepted Answer

用于tidyr传播和janitor添加总计列和行：

library(janitor)
library(tidyr)
sales.by.country %>%
  spread(Order.method.type, Qtr.Rev, fill = 0) %>%
  adorn_totals(c("row", "col"))

score 0 · Accepted Answer

使用dplyr和的函数的解决方案tidyr。dt4是最终的输出。注意使用summarise_if. 当我们只想对符合预定条件的列应用函数时，它很有用。在这种情况下，我们只能将该sum函数应用于数字列。

# Create example data frame
library(dplyr)
library(tidyr)

# sales.by.country is created by OP's dput dataset
dt2 <- sales.by.country %>% 
  mutate(Retailer.country = as.character(Retailer.country)) %>%
  # Spread the data frame
  spread(Order.method.type, Qtr.Rev, fill = 0) %>%
  # Calcualte Total.cn by rowSums
  mutate(TOTAL.cn = rowSums(.[, 2:ncol(.)])) 

# Calculate the sum of each column if it is numeric
dt3 <- dt2 %>% summarise_if(is.numeric, sum)

# Combine dt3 (the summary) to dt2
dt4 <- dt2 %>%
  bind_rows(dt3) %>%
  # Replace the na in Retailer.country to be "TOTAL.type"
  replace_na(list(Retailer.country = "TOTAL.type"))

r - 将 R 中的列联表与第一列作为字符相加

3 回答 3

Related

Reference