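I have a data frame with 185,686 rows and 11 variables, but I am only interested in two of them: Order.ID and Product.
Each row of the original data frame holds a unique combination of ID, product, quantity, address, and so on. From this df I created a new one that contains only the Order.ID and Product columns, restricted to orders in which more than one product was purchased.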
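I am trying to find out which products are frequently sold together. I have made sure the original data frame has no identical rows and no empty rows, and everything looks fine, except that R reports 21 levels for Product. Two of those levels are wrong, so the data frame really has only 19 product levels. Still, if I type nlevels(venda.id$Product), I get 21.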
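One possible explanation for the 21 versus 19 mismatch, assuming the two extra levels are simply left over from filtering and no longer occur in the data: nlevels() counts every declared level of a factor, whether or not it is used. The sketch below uses a made-up factor (the "Ghost Product" level is a placeholder, not a value from the real data) to show how droplevels() removes unused levels.

```r
# Toy factor: three declared levels, only two of them occur in the data.
# "Ghost Product" is a hypothetical stale level used for illustration.
product <- factor(c("iPhone", "Google Phone", "iPhone"),
                  levels = c("Google Phone", "iPhone", "Ghost Product"))

n_declared <- nlevels(product)              # counts every declared level
n_used     <- nlevels(droplevels(product))  # counts only levels that occur

# Names of the levels that are declared but never used
unused <- setdiff(levels(product), unique(as.character(product)))

print(c(declared = n_declared, used = n_used))
print(unused)
```

If the two wrong levels do occur in some rows (for example a header line read in as data), droplevels() will not remove them, and those rows would need to be filtered out first.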
Order.ID Product
1 176560 Google Phone
2 176560 Wired Headphones
3 176574 Google Phone
4 176574 USB-C Charging Cable
5 176586 AAA Batteries (4-pack)
6 176586 Google Phone
7 176672 Lightning Charging Cable
8 176672 USB-C Charging Cable
9 176681 Apple Airpods Headphones
10 176681 ThinkPad Laptop
11 176689 Bose SoundSport Headphones
12 176689 AAA Batteries (4-pack)
13 176739 34in Ultrawide Monitor
14 176739 Google Phone
15 176774 Lightning Charging Cable
16 176774 USB-C Charging Cable
17 176781 iPhone
18 176781 Lightning Charging Cable
structure(list(Order.ID = structure(c(1L, 1L, 2L, 2L, 3L, 3L,
4L, 4L, 5L, 5L, 6L, 6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L), .Label = c("176560",
"176574", "176586", "176672", "176681", "176689", "176739", "176774",
"176781", "176797"), class = "factor"), Product = structure(c(5L,
10L, 5L, 9L, 2L, 5L, 7L, 9L, 3L, 8L, 4L, 2L, 1L, 5L, 7L, 9L,
6L, 7L, 5L, 4L), .Label = c("34in Ultrawide Monitor", "AAA Batteries (4-pack)",
"Apple Airpods Headphones", "Bose SoundSport Headphones", "Google Phone",
"iPhone", "Lightning Charging Cable", "ThinkPad Laptop", "USB-C Charging Cable",
"Wired Headphones"), class = "factor")), row.names = c(NA, 20L
), class = "data.frame")
The problem occurs when I try to get the top 2 combinations:
tail(sort(table(unlist(tapply(as.character(venda.id$Product), venda.id$Order.ID, FUN=function(x) combn(unique(x), 2, paste, collapse=" and "))))), 2)
Error in combn(unique(x), 2, paste, collapse = " and ") : n < m
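For context, combn(x, m) stops with "n < m" when x has fewer than m elements, so this error is what one would expect if at least one Order.ID has only one distinct product after unique(). A minimal sketch of a guarded version follows. The data are made up, and pair_up() is a hypothetical helper name, not something from the original code. It also sorts the products inside each order, on the assumption that "A and B" and "B and A" should count as the same pair.

```r
# Small illustrative data set; order "3" deliberately repeats one product.
orders <- data.frame(
  Order.ID = c("1", "1", "2", "2", "3", "3"),
  Product  = c("iPhone", "Lightning Charging Cable",
               "Lightning Charging Cable", "iPhone",
               "Google Phone", "Google Phone"),
  stringsAsFactors = FALSE
)

# pair_up() is a hypothetical helper: it returns all product pairs of one
# order, or nothing when the order has fewer than 2 distinct products.
pair_up <- function(x) {
  # Radix sort uses C-locale ordering, so labels do not depend on the locale
  x <- sort(unique(x), method = "radix")
  if (length(x) < 2) return(character(0))
  combn(x, 2, paste, collapse = " and ")
}

# simplify = FALSE keeps a list, since orders yield different pair counts
pairs <- unlist(tapply(orders$Product, orders$Order.ID, pair_up,
                       simplify = FALSE))
pair_counts <- sort(table(pairs), decreasing = TRUE)
print(head(pair_counts, 2))
```

Without the sort step, orders "1" and "2" would produce two differently worded labels for the same pair, which would make each count look lower than it is.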
The code should produce something like this (I do not know what the actual answer is):
Lightning Charging Cable and iPhone Wired Headphones and USB-C Charging
x y
where x and y are the frequencies computed by table.
If I do not use as.character on the Product column, I get a different error:
Error in class(out) <- class(x0) : adding class "factor" to an invalid object
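I tried alternative code, but I got the same error.
The first time I ran it, it worked, but the result looked wrong: the counts were as low as 16, while the data has 14,128 rows. Now it no longer runs at all.
Does anyone know how to solve this?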
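Update: I found that the error occurs at rows 783 and 784, where the same product appears twice under the same ID, even though this should not happen in the original data.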
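A quick way to locate such cases across the whole data frame, sketched on made-up data (the IDs "A" and "B" and the name `orders` are placeholders, not values from venda.id): duplicated() flags repeated (Order.ID, Product) pairs, and counting distinct products per order shows which orders would make combn(x, 2) fail.

```r
# Toy data: order "B" lists the same product twice.
orders <- data.frame(
  Order.ID = c("A", "A", "B", "B"),
  Product  = c("Google Phone", "Wired Headphones", "iPhone", "iPhone"),
  stringsAsFactors = FALSE
)

# Row numbers that repeat an earlier (Order.ID, Product) pair
dup_rows <- which(duplicated(orders[, c("Order.ID", "Product")]))

# Distinct products per order; fewer than 2 means no pair can be formed
n_distinct <- tapply(orders$Product, orders$Order.ID,
                     function(x) length(unique(x)))
bad_orders <- names(n_distinct)[n_distinct < 2]

# Keep one row per pair, then drop orders with fewer than 2 products
clean <- unique(orders[, c("Order.ID", "Product")])
clean <- clean[!clean$Order.ID %in% bad_orders, ]

print(dup_rows)
print(bad_orders)
print(clean)
```

If the duplicates are not present in the original data frame, comparing the rows for one affected Order.ID before and after the subsetting step may show where they are introduced.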
R version 4.0.4 (2021-02-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] xts_0.12.1 zoo_1.8-9 lubridate_1.7.10 viridis_0.5.1
[5] viridisLite_0.3.0 hrbrthemes_0.8.0 forcats_0.5.1 stringr_1.4.0
[9] purrr_0.3.4 readr_1.4.0 tidyr_1.1.3 tibble_3.0.6
[13] tidyverse_1.3.0 dygraphs_1.1.1.6 ggplot2_3.3.3 dplyr_1.0.5
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 lattice_0.20-41 assertthat_0.2.1 digest_0.6.27
[5] utf8_1.1.4 R6_2.5.0 cellranger_1.1.0 backports_1.2.1
[9] reprex_1.0.0 evaluate_0.14 httr_1.4.2 pillar_1.5.0
[13] gdtools_0.2.3 rlang_0.4.10 readxl_1.3.1 rstudioapi_0.13
[17] extrafontdb_1.0 rmarkdown_2.7 labeling_0.4.2 extrafont_0.17
[21] htmlwidgets_1.5.3 munsell_0.5.0 tinytex_0.30 broom_0.7.5
[25] compiler_4.0.4 modelr_0.1.8 xfun_0.21 systemfonts_1.0.1
[29] pkgconfig_2.0.3 htmltools_0.5.1.1 tidyselect_1.1.0 gridExtra_2.3
[33] fansi_0.4.2 crayon_1.4.1 dbplyr_2.1.0 withr_2.4.1
[37] grid_4.0.4 jsonlite_1.7.2 Rttf2pt1_1.3.8 gtable_0.3.0
[41] lifecycle_1.0.0 DBI_1.1.1 magrittr_2.0.1 scales_1.1.1
[45] cli_2.3.1 stringi_1.5.3 farver_2.1.0 fs_1.5.0
[49] xml2_1.3.2 ellipsis_0.3.1 generics_0.1.0 vctrs_0.3.6
[53] tools_4.0.4 glue_1.4.2 hms_1.0.0 yaml_2.2.1
[57] colorspace_2.0-0 rvest_1.0.0 knitr_1.31 haven_2.3.1