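For a course I'm currently taking, I'm trying to build a dummy transaction, customer and product dataset to showcase a machine learning use case in a webshop environment as well as a financial dashboard; unfortunately, we were not given any dummy data. I figured this would be a nice way to improve my R knowledge, but I'm running into serious difficulties implementing it.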
# Define Customer Types & Respective probabilities
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseekers")
PropCustTypes <- c(.10, .45, .30, .15) # Probability of being in each group.
set.seed(1) # Set seed to make reproducible
Customers <- data.frame(ID = 1:10000,
  CustomerType = sample(CustomerTypes, size = 10000,
                        replace = TRUE, prob = PropCustTypes),
  NumBought = rnorm(10000, 3, 2)) # Number of transactions to generate; open to alternative solutions?
Customers$NumBought[Customers$NumBought < 0] <- 0 # Floor NumBought at 0
Products <- data.frame(
  DateReleased = rep(as.Date("2012-12-12"), 50) + rnorm(50, 0, 8000),
  SuggestedPrice = rnorm(50, 50, 30))
Products$SuggestedPrice[Products$SuggestedPrice < 10] <- 10 # Floor SuggestedPrice at $10
Products$DateReleased[Products$DateReleased < as.Date("2013-04-10")] <- as.Date("2013-04-10") # Floor DateReleased at 1 year ago
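On the "open to alternative solutions?" comment for NumBought: rnorm() returns fractional values, while a number of transactions has to be a whole number. A minimal sketch of two alternatives follows. The Poisson draw and the names n_customers, NumBoughtPois and NumBoughtNorm are assumptions for illustration, not part of the code above.

```r
set.seed(1)
n_customers <- 10000  # same size as the Customers table above

# Assumption: a Poisson with mean 3 stands in for rnorm(10000, 3, 2).
# It yields whole, non-negative counts, so no flooring at 0 is needed.
NumBoughtPois <- rpois(n_customers, lambda = 3)

# Alternative that stays closer to the normal draw: round, then floor at 0.
NumBoughtNorm <- pmax(0L, as.integer(round(rnorm(n_customers, mean = 3, sd = 2))))

summary(NumBoughtPois)
summary(NumBoughtNorm)
```

Either vector could replace the NumBought column; the Poisson version has a smaller spread (variance 3 rather than 4), so which one fits depends on how varied the customers should be.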
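Now I would like to generate n transactions per customer (the number is in the Customers table above), based on the following parameters for each of the variables that are currently relevant.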
Parameters <- data.frame(
  CustomerType = c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"), # Must match CustomerTypes above
  BySearchEngine = c(0.10, 0.40, 0.50, 0.60), # Probability of coming through channel X
  ByDirectCustomer = c(0.60, 0.30, 0.15, 0.05),
  ByPartnerBlog = c(0.30, 0.30, 0.35, 0.35),
  Timeliness = c(1, 6, 12, 12), # Average # of months between purchase & release date.
  Discount = c(0, 0, 0.05, 0.10)) # Average discount incurred when purchasing.
   CustomerType BySearchEngine ByDirectCustomer ByPartnerBlog Timeliness Discount
1  EarlyAdopter            0.1             0.60          0.30          1     0.00
2   Pragmatists            0.4             0.30          0.30          6     0.00
3 Conservatives            0.5             0.15          0.35         12     0.05
4   Dealseekers            0.6             0.05          0.35         12     0.10
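The idea is that "EarlyAdopters" would have (on average, normally distributed) 10% of their transactions labelled "BySearchEngine", 60% "ByDirectCustomer" and 30% "ByPartnerBlog"; these values need to be mutually exclusive: in the final dataset a transaction cannot be obtained via both PartnerBlog and SearchEngine. The options are: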
ObtainedBy <- c("SearchEngine","DirectCustomer","PartnerBlog")
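One way to get mutually exclusive labels is to draw exactly one channel per transaction with sample() and the type's probabilities. The sketch below assumes the row names follow CustomerTypes; ChannelProbs and draw_channel are hypothetical names introduced for illustration. Note that the resulting shares vary multinomially around the targets, which is only approximately the "normally distributed" behaviour described above.

```r
set.seed(1)
ObtainedBy <- c("SearchEngine", "DirectCustomer", "PartnerBlog")

# Channel probabilities per customer type, copied from the Parameters table.
# Each row sums to 1, so every transaction gets exactly one channel.
ChannelProbs <- rbind(
  EarlyAdopter  = c(0.10, 0.60, 0.30),
  Pragmatists   = c(0.40, 0.30, 0.30),
  Conservatives = c(0.50, 0.15, 0.35),
  Dealseekers   = c(0.60, 0.05, 0.35))
colnames(ChannelProbs) <- ObtainedBy

# Hypothetical helper: draw one channel per transaction for a given type.
draw_channel <- function(type, n) {
  sample(ObtainedBy, size = n, replace = TRUE, prob = ChannelProbs[type, ])
}

early <- draw_channel("EarlyAdopter", 10000)
round(prop.table(table(early)), 2)  # observed shares, close to 0.10/0.60/0.30
```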
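- Purchase dates are distributed fairly evenly across days, perhaps slightly more on weekends;
- Purchase dates are spread between 2006 and 2014;
- Each customer's transactions are spread out across the years;
- Customers cannot buy products that have not been released yet.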
YearlyMax <- 1 # ? How would I specify this, a growing number would be even nicer?
DailyMax <- 1 # Same question? Likely dependent on YearlyMax
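The date constraints could be approached by weighting every day in the window and sampling from the days on or after a product's release. This is only a sketch: the weekend factor of 1.2, the 10% per-year growth, and the names AllDays, DayWeight and draw_dates are assumptions, and it does not yet enforce YearlyMax or DailyMax.

```r
set.seed(1)
# All candidate purchase days in the 2006-2014 window.
AllDays <- seq(as.Date("2006-01-01"), as.Date("2014-12-31"), by = "day")

# Assumption: weekend days are 1.2 times as likely as weekdays.
# format(x, "%u") gives the ISO weekday, "1" (Monday) to "7" (Sunday).
IsWeekend <- format(AllDays, "%u") %in% c("6", "7")

# Assumption: volume grows by 10% of the 2006 level per year.
YearsIn <- as.integer(format(AllDays, "%Y")) - 2006L
DayWeight <- ifelse(IsWeekend, 1.2, 1) * (1 + 0.1 * YearsIn)

# Hypothetical helper: draw n purchase dates on or after a release date.
# The release date must fall inside the window, otherwise no day qualifies.
draw_dates <- function(n, released) {
  idx <- which(AllDays >= released)
  AllDays[idx[sample.int(length(idx), size = n, replace = TRUE,
                         prob = DayWeight[idx])]]
}

dates <- draw_dates(5, as.Date("2013-04-10"))
dates
```

sample.int() is used on the positions rather than sample() on the values, which avoids sample()'s special handling of a length-one numeric vector when only a single day qualifies.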
The result for CustomerID 2 would be:
Transactions <- data.frame(
ID = c(1,2),
CustomerID = c(2,2), # The customer that bought the item.
ProductID = c(51,100), # Products chosen to approach customer type's Timeliness average
DateOfPurchase = c("2013-01-02", "2012-12-03"), # Date chosen to mimic timeliness average
ReferredBy = c("DirectCustomer", "SearchEngine"), # See above, follows proportions previously identified.
GrossPrice = c(50,52.99), # based on Product Price, no real restrictions other than using it for my financial dashboard.
Discount = c(0.02, 0.0)) # Chosen to mimic customer type's discount behavior.
  ID CustomerID ProductID DateOfPurchase     ReferredBy GrossPrice Discount
1  1          2        51     2013-01-02 DirectCustomer      50.00     0.02
2  2          2       100     2012-12-03   SearchEngine      52.99     0.00
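I'm getting more and more confident writing R code, but I'm having trouble writing code that keeps the global parameters (the daily distribution of transactions, the maximum # of transactions per customer per year) as well as the various linkages consistent:
- Timeliness: how quickly people buy after a release
- ReferredBy: how did this customer reach my website?
- Discount: how much discount the customer received (indicating how discount-sensitive a person is)
This leaves me unsure whether I should write a for loop over the Customers table, generating transactions for each customer, or whether I should take a different route. Any contributions are greatly appreciated. Alternative dummy datasets are welcome too, although I'm eager to solve this problem in R. I will keep updating this post as I make progress.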
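As an alternative to a for loop, rep() can expand the customer table into one row per transaction in a single vectorised step, after which per-type parameters are looked up by name. The sketch below uses a four-row stand-in table with integer NumBought; Cust, AvgDiscount, Tx and the standard deviation of 0.02 are assumptions for illustration.

```r
set.seed(1)
# Small stand-in for the Customers table (assumed integer NumBought).
Cust <- data.frame(ID = 1:4,
                   CustomerType = c("EarlyAdopter", "Pragmatists",
                                    "Conservatives", "Dealseekers"),
                   NumBought = c(2L, 0L, 3L, 1L))

# Average discount per type, copied from the Parameters table.
AvgDiscount <- c(EarlyAdopter = 0, Pragmatists = 0,
                 Conservatives = 0.05, Dealseekers = 0.10)

# rep() repeats each customer ID NumBought times: one row per transaction.
Tx <- data.frame(CustomerID = rep(Cust$ID, times = Cust$NumBought))
Tx$ID <- seq_len(nrow(Tx))

# Look up each transaction's customer type through the customer ID.
Tx$CustomerType <- as.character(Cust$CustomerType)[match(Tx$CustomerID, Cust$ID)]

# Assumption: discounts scatter around the type average (sd 0.02),
# clamped to the range [0, 1].
Tx$Discount <- pmin(1, pmax(0, rnorm(nrow(Tx), AvgDiscount[Tx$CustomerType], 0.02)))
Tx
```

Customers with NumBought equal to 0 simply contribute no rows, and the same lookup pattern would work for Timeliness or the channel probabilities.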
- Assigned customers to customer types using sample()
- Generated Customers$NumBought transactions
- ... still thinking?
Tr <- data.frame(
ID = 1:sum(Customers$NumBought),
CustomerID = NA,
DateOfPurchase = NA,
ReferredBy = NA,