我的销售数据集包括 3 列:国家、销售类型/方法、季度总收入。这是前几行的显示,以便更好地了解:
Retailer.country Order.method.type Qtr.Rev
<fctr> <fctr> <dbl>
1 Australia E-mail 171407.28
2 Australia Sales visit 2013909.18
3 Australia Special 158795.34
4 Australia Telephone 2289201.87
5 Australia Web 1738303.59
6 Austria Sales visit 66926.18
7 Austria Telephone 1671887.40
8 Austria Web 7050164.50
9 Belgium Sales visit 1655507.05
10 Belgium Web 6222440.26
etc.........
这是此数据的输出:
structure(list(Retailer.country = structure(c(1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L,
7L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L,
11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 14L,
14L, 14L, 15L, 15L, 15L, 15L, 16L, 16L, 16L, 16L, 17L, 17L, 17L,
17L, 18L, 18L, 19L, 19L, 19L, 19L, 19L, 20L, 20L, 20L, 20L, 20L,
21L, 21L, 21L, 21L, 21L, 21L), .Label = c("Australia", "Austria",
"Belgium", "Brazil", "Canada", "China", "Denmark", "Finland",
"France", "Germany", "Italy", "Japan", "Korea", "Mexico",
"Netherlands",
"Singapore", "Spain", "Sweden", "Switzerland", "United Kingdom",
"United States"), class = "factor"), Order.method.type =
structure(c(1L,
4L, 5L, 6L, 7L, 4L, 6L, 7L, 4L, 7L, 7L, 1L, 2L, 4L, 7L, 2L, 4L,
6L, 7L, 4L, 7L, 4L, 7L, 2L, 4L, 6L, 7L, 1L, 3L, 4L, 7L, 1L, 2L,
4L, 5L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 4L, 6L, 7L, 4L, 5L, 7L,
2L, 3L, 6L, 7L, 2L, 5L, 6L, 7L, 2L, 3L, 6L, 7L, 1L, 7L, 2L, 4L,
5L, 6L, 7L, 1L, 2L, 4L, 6L, 7L, 2L, 3L, 4L, 5L, 6L, 7L), .Label =
c("E-mail",
"Fax", "Mail", "Sales visit", "Special", "Telephone", "Web"), class =
"factor"),
Qtr.Rev = c(171407.28, 2013909.18, 158795.34, 2289201.87,
1738303.59, 66926.18, 1671887.4, 7050164.5, 1655507.05,
6222440.26,
7746789.52, 6864270.12, 195549.5, 450628.79, 12376528.53,
415128.31, 1453194.14, 2735416.3, 15777880.11, 413978.16,
3776833.13, 308638.6, 12328172.97, 709194.65, 1304167.86,
5897377.14, 11048160.97, 1546079.43, 1247170.05, 2373591.15,
12102240.99, 2461322.51, 165800.42, 1397604.56, 198705.05,
7413833.64, 2662351.94, 289704.5, 680467.87, 87186.72, 343708.86,
1802166.73, 16990817.52, 2821127.32, 431860.34, 10144353.75,
5063353.42, 1725508.54, 3571760.87, 593828.88, 1074860.66,
2981026.86, 5254137.56, 469627.61, 908725.05, 1625096.56,
9677070.09, 88788.41, 337710.73, 254360.21, 7835117.44,
1292812.39,
4818848.86, 217936.39, 792168.42, 790344.28, 109161.04,
4565896.64,
697619.35, 264500.2, 189218.02, 2022968.96, 13756025.4,
1357389.56,
2352483.29, 2842600.85, 685752.21, 13437403.28, 29573813.7
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-79L), .Names = c("Retailer.country", "Order.method.type", "Qtr.Rev"
))
我正在 R 中创建一个列联表,显示每个国家/地区每种销售方法产生的季度收入。最终输出应与此类似:
Retailer.country E-mail Fax Mail Sales visit Special Telephone Web TOTAL.cn
1 Australia 171407.3 0.00 0.0 2013909.18 158795.3 2289201.9 1738304 6371617
2 Austria 0.0 0.00 0.0 66926.18 0.0 1671887.4 7050164 8788978
3 Belgium 0.0 0.00 0.0 1655507.05 0.0 0.0 6222440 7877947
4 Brazil 0.0 0.00 0.0 0.00 0.0 0.0 7746790 7746790
5 Canada 6864270.1 195549.50 0.0 450628.79 0.0 0.0 12376529 19886977
6 China 0.0 415128.31 0.0 1453194.14 0.0 2735416.3 15777880 20381619
7 Denmark 0.0 0.00 0.0 413978.16
...
20 United Kingdom 697619.3 264500.20 0.0 189218.02 0.0 2022969.0 13756025 16930332
21 United States 0.0 1357389.56 2352483.3 2842600.85 685752.2 13437403.3 29573814 50249443
22 TOTAL.type 15695863.0 4767448.43 5692692.6 23233800.42 4811539.3 35257926.7 203769190 293228461
reshape 库中的 cast() 函数完成了大部分工作,只留下汇总列和行以供计算所有值。
cast(sales.by.country, Retailer.country ~ Order.method.type,
fill=0) -> sales.by.country
将这些行汇总到一个名为“TOTAL.cn”的新列中非常简单:
sales.by.country$TOTAL.cn <- rowSums(sales.by.country[,c(2:8)])
但是对列求和是一件很头疼的事情,因为最后一行的第一个组件必须是一个因子或字符。我将第一列“Retailer.country”转换为字符类型,因为它实际上只是一个视觉标签。
在对几个函数大惊小怪之后,这是我能够创建以实现预期的行求和的最佳代码:
# Sum the numeric columns, which is everything *except* column 1
total.by.ordertype <- (colSums(sales.by.country[,-1]))
# Create the Total by Order row
total.by.ordertype.row <- list("TOTAL.type", total.by.ordertype[1],
total.by.ordertype[2], total.by.ordertype[3], total.by.ordertype[4],
total.by.ordertype[5], total.by.ordertype[6], total.by.ordertype[7],
total.by.ordertype[8])
# Add the Total by Order row to the bottom of the table
sales.by.country[22, ] <- total.by.ordertype.row
它可以在所有列之间工作并维护正确的数据类型......但我认为必须有一种更有效的方法,也许是通过使用 apply 系列函数,来自 dplyr 等。也许唯一的方法是编写我自己的功能?
例如,未来的数据集可能有 50 多种不同的销售方法。在为“按顺序汇总”行(上图)创建列表时,我必须调用向量中的每个单元格,用逗号分隔,以便成功地将其添加到我现有的表中。其他努力将所有其他列的数据类型转换为字符,这把一切都搞砸了。
我不介意复制/粘贴“total.by.ordertype”8 次。但是当我处理 50-100 个订单类型时会发生什么?有没有更清洁的方法来重现这些?
谢谢!