r - Generating dummy webshop data in R: incorporating parameters when randomly generating transactions

Question

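For a course I am currently taking, I am trying to build a dummy dataset of transactions, customers, and products to showcase a machine learning use case in a webshop environment as well as a financial dashboard; unfortunately, we were not given any dummy data. I figured this would be a nice way to improve my R knowledge, but I am running into serious difficulties implementing it.

The idea is that I specify some parameters/rules (arbitrary/fictitious, but suitable for demonstrating some kind of clustering algorithm). I am basically trying to hide a pattern and then find it again using machine learning (not part of this question). The pattern I am hiding is based on the product adoption life cycle, in an attempt to show how different customer types can be identified for targeted marketing purposes.

I will show what I am looking for. I would like it to be as realistic as possible. I tried to do so by drawing the number of transactions per customer and other characteristics from normal distributions; I am completely open to other ways of doing this.

Below is how far I got. First, build a table of customers: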

# Define Customer Types & Respective probabilities
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseekers")
PropCustTypes <- c(.10, .45, .30, .15)   # Probability of being in each group.

set.seed(1)   # Set seed to make reproducible
Customers <- data.frame(ID=(1:10000), 
  CustomerType = sample(CustomerTypes, size=10000,
                                  replace=TRUE, prob=PropCustTypes),
  NumBought = rnorm(10000,3,2)   # Number of Transactions to Generate, open to alternative solutions?
)
Customers$NumBought[Customers$NumBought < 0] <- 0   # Floor NumBought at 0
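On the "open to alternative solutions" comment: one common alternative to a truncated normal is a Poisson draw, which yields whole, non-negative counts directly. This is only a sketch, not part of the original code; the mean of 3 is an assumption carried over from the rnorm() call above.

```r
# Sketch: draw integer transaction counts instead of truncated normal values.
set.seed(1)
n_customers <- 10000                            # same size as the Customers table
num_bought  <- rpois(n_customers, lambda = 3)   # lambda = 3 is an assumed mean

# Counts are whole numbers and never negative, so no flooring is needed.
summary(num_bought)
```

A side effect is that sum(num_bought) is an integer, which is convenient when it is later used to size a transactions table.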

Next, generate a table of products to choose from:

Products <- data.frame(
  ID=(1:50),
  DateReleased = rep(as.Date("2012-12-12"),50)+rnorm(50,0,8000),
  SuggestedPrice = rnorm(50, 50, 30))
Products$SuggestedPrice[Products$SuggestedPrice < 10] <- 10   # Floor the product price at $10
Products$DateReleased[Products$DateReleased < as.Date("2013-04-10")] <- as.Date("2013-04-10")   # Move earlier release dates up to 2013-04-10 (1 year ago)

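Now I would like to generate n transactions per customer (the number is in the Customers table above), based on the following parameters for each relevant variable.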

Parameters <- data.frame(
  CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"), # Same labels as CustomerTypes
  BySearchEngine   = c(0.10, .40, 0.50, 0.6), # Probability of coming through channel X
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
  ByPartnerBlog    = c(0.30, .30,  0.35, 0.35),
  Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
  Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
  stringsAsFactors=FALSE)

Parameters
   CustomerType BySearchEngine ByDirectCustomer ByPartnerBlog Timeliness Discount
1  EarlyAdopter            0.1             0.60          0.30          1     0.00
2   Pragmatists            0.4             0.30          0.30          6     0.00
3 Conservatives            0.5             0.15          0.35         12     0.05
4   Dealseekers            0.6             0.05          0.35         12     0.10

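The idea is that "EarlyAdopters" will have (on average, normally distributed) 10% of their transactions labelled "BySearchEngine", 60% "ByDirectCustomer", and 30% "ByPartnerBlog"; these values need to be mutually exclusive: in the final dataset a transaction cannot be obtained through both PartnerBlog and SearchEngine. The options are: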

ObtainedBy <- c("SearchEngine","DirectCustomer","PartnerBlog")

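In addition, I would like to generate a normally distributed Discount variable in the same way. For simplicity, the standard deviation can be mean/5.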
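As a rough illustration of the two requirements above (one channel per transaction, and a normal discount with sd = mean/5), here is a minimal sketch for a single customer type. The variable names and the number of transactions are made up for the example; the probabilities and mean discount are the Conservatives values from Parameters.

```r
set.seed(1)
channels <- c("SearchEngine", "DirectCustomer", "PartnerBlog")
# Channel probabilities and mean discount for the Conservatives row of Parameters.
channel_prob  <- c(0.50, 0.15, 0.35)
mean_discount <- 0.05
n <- 1000   # hypothetical number of transactions for this customer type

# sample() returns exactly one label per draw, so the channels are mutually exclusive.
referred_by <- sample(channels, size = n, replace = TRUE, prob = channel_prob)

# Discount: normal around the type's mean with sd = mean / 5, floored at 0.
discount <- pmax(rnorm(n, mean = mean_discount, sd = mean_discount / 5), 0)

# Observed channel shares should be close to channel_prob.
round(prop.table(table(referred_by)), 2)
```

For types with a mean discount of 0, the standard deviation is also 0, so rnorm() simply returns 0 for every draw.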

Next, the trickiest part for me: I would like to generate these transactions according to a few rules:

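Distributed roughly evenly across days, perhaps slightly more on weekends;
Spread out between 2006 and 2014;
Each customer's transactions spread out over the years;
Customers cannot buy products that have not been released yet.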
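Two of these rules (a weekend uplift, and no purchases before release) can be sketched with weighted sampling of dates followed by a per-purchase filter on release dates. Everything here is illustrative: the 50% weekend uplift, the number of transactions, and the release-date range are assumptions, and the first product is pinned to 2005-01-01 so that at least one product is always available.

```r
set.seed(1)
all_days <- seq(as.Date("2006-01-01"), as.Date("2014-12-31"), by = "day")
# as.POSIXlt()$wday is 0 for Sunday and 6 for Saturday.
is_weekend <- as.POSIXlt(all_days)$wday %in% c(0, 6)
day_weight <- ifelse(is_weekend, 1.5, 1)   # assumed weekend uplift of 50%

n <- 500   # hypothetical number of transactions
purchase_date <- sample(all_days, size = n, replace = TRUE, prob = day_weight)

# 50 products; the first is released before any purchase can happen.
release_pool <- seq(as.Date("2005-01-02"), as.Date("2013-12-31"), by = "day")
release_date <- sort(c(as.Date("2005-01-01"), sample(release_pool, 49)))

# For each purchase, pick uniformly among products released on or before that day.
product_id <- vapply(seq_len(n), function(i) {
  available <- which(release_date <= purchase_date[i])
  available[sample.int(length(available), 1)]
}, integer(1))
```

Indexing with sample.int() rather than calling sample(available, 1) avoids the well-known pitfall where sample() treats a single number k as the range 1:k.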

Other parameters:

YearlyMax <- 1 # ? How would I specify this, a growing number would be even nicer?
DailyMax <-  1 # Same question? Likely dependent on YearlyMax

The result for CustomerID 2 would be:

Transactions <- data.frame(
    ID        = c(1,2),
    CustomerID = c(2,2), # The customer that bought the item.
    ProductID = c(51,100), # Products chosen to approach customer type's Timeliness average
    DateOfPurchase = c("2013-01-02", "2012-12-03"), # Date chosen to mimic timeliness average
    ReferredBy = c("DirectCustomer", "SearchEngine"), # See above, follows proportions previously identified.
    GrossPrice = c(50,52.99), # based on Product Price, no real restrictions other than using it for my financial dashboard.
    Discount = c(0.02, 0.0)) # Chosen to mimic customer type's discount behavior.    

Transactions
  ID CustomerID ProductID DateOfPurchase     ReferredBy GrossPrice Discount
1  1          2        51     2013-01-02 DirectCustomer      50.00     0.02
2  2          2       100     2012-12-03   SearchEngine      52.99     0.00

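I am becoming more confident at writing R code, but I am having difficulty writing code that keeps the global parameters (daily distribution of transactions, maximum number of transactions per customer per year) and the various links consistent:

Timeliness: how quickly people buy after release
ReferredBy: how did this customer reach my website?
Discount: how much discount the customer got (indicating how sensitive someone is to discounts)

This leaves me unsure whether I should write a for loop over the Customers table, generating transactions per customer, or whether I should take a different route. Any contributions are greatly appreciated. Alternative dummy datasets are welcome too, although I am eager to solve this problem in R. I will keep updating this post as I progress.

My current pseudocode:

Assign customers to customer types with sample()
Generate Customers$NumBought transactions
... still thinking?

Edit: I generated the Transactions table; now I "just" need to fill it with the right data: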

Tr <- data.frame(
  ID = 1:sum(Customers$NumBought),
  CustomerID = NA,
  DateOfPurchase = NA,
  ReferredBy = NA,
  GrossPrice=NA,
  Discount=NA)

score 2 · Accepted Answer

Very roughly: set up a database of days, with the number of visits on each day:

days <- data.frame(day=1:8000, customerRate = XtotalNumberOfVisits/8000) # expected visits per day
# you could change the customerRate to reflect promotions, time since launch, ...
days$nVisits <- rpois(8000, days$customerRate)

Then catalogue the visits:

    visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
    visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
    visits$nPurchases <- rpois(nrow(visits), XpurchaseRate[visits$customerType])

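Anything with an X in front of it is a parameter of your process. Depending on which other columns you have, you would continue in the same way to generate the transaction database, parameterising the relative likelihood among the available objects. Alternatively, you could generate a visits database that includes a key for each product available on that day: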

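   productRelease <- data.frame(id=X, releaseDay=sort(X)) # ie df is sorted by releaseDay
   visits <- data.frame(id=1:sum(days$nVisits), day=rep(days$day, times=days$nVisits))
   visits$customerType <- sample(4, nrow(visits), replace=TRUE, prob=XmyWeights)
   # number of products available on each day (assumes the first product is released on day 1)
   days$productsAvailable <- rep(1:nrow(productRelease), times=diff(c(productRelease$releaseDay, nrow(days)+1)))
   # repeat each visit once per product available on its day
   visits <- visits[rep(1:nrow(visits), times=days$productsAvailable[visits$day]),]
   # number the copies of each visit 1..k to get the product key
   visits$prodID <- ave(rep(1L, nrow(visits)), visits$id, FUN=cumsum)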

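You can then decide on a function that gives, for each row, the probability that the customer purchases that item (based on date, customer, and product). Then fill in the purchases with `visits$didTheyPurchase <- runif(nrow(visits)) < XmyProbability`.

Apologies, there are probably typos throughout as I typed this straight in, but hopefully it gives you the idea.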
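To make the "probability function" step concrete, here is one hypothetical shape such a function could take. The visits table, the column daysSinceRelease, the delay values, and the decay formula are all invented for illustration; the only part taken from the answer is the final runif() comparison.

```r
set.seed(1)
# Hypothetical visits table: one row per (visit, available product) pair.
visits <- data.frame(
  customerType     = sample(4, 1000, replace = TRUE),
  daysSinceRelease = sample(0:720, 1000, replace = TRUE)
)
# Assumed typical purchase delay per customer type, in days (roughly 1, 6, 12, 12 months).
typicalDelay <- c(30, 180, 360, 360)

# Hypothetical probability: highest (0.2) when the product's age matches the
# type's typical delay, decaying exponentially as the gap grows.
purchaseProbability <- function(type, age) {
  0.2 * exp(-abs(age - typicalDelay[type]) / 90)
}

visits$didTheyPurchase <- runif(nrow(visits)) <
  purchaseProbability(visits$customerType, visits$daysSinceRelease)
mean(visits$didTheyPurchase)   # overall share of rows that end in a purchase
```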

score 0 · Accepted Answer

Following Gavin, I solved the problem with the following code:

First, instantiate the CustomerTypes:

require(lubridate)
CustomerTypes <- c("EarlyAdopter","Pragmatists","Conservatives","Dealseekers")
PropCustTypes <- c(.10, .45, .30, .15)   # Probability for being in each group.

Set the parameters for my customer types:

set.seed(1)   # Set seed to make reproducible
Parameters <- data.frame(
  CustomerType= c("EarlyAdopter", "Pragmatists", "Conservatives", "Dealseekers"), # Must match CustomerTypes, or the merge below drops these rows
  BySearchEngine   = c(0.10, .40, 0.50, 0.6), # Probability of choosing channel X
  ByDirectCustomer = c(0.60, .30, 0.15, 0.05),
  ByPartnerBlog    = c(0.30, .30,  0.35, 0.35),
  Timeliness = c(1,6,12,12), # Average # of months between purchase & releasedate.
  Discount = c(0,0,0.05,0.10), # Average Discount incurred when purchasing.
  stringsAsFactors=FALSE)

Describe the number of visitors:

TotalVisits <- 20000
NumDays <- 100
StartDate <- as.Date("2009-01-04")
NumProducts <- 100
StartProductRelease <- as.Date("2007-01-04") # As products will be selected based on this, make sure
                                             # we include a few years prior, as people will buy products older than 2 years?
AnnualGrowth <- 0.15

Now, as suggested, build a dataset of days. I added DaysSinceStart so it can be used to grow the business over time.

days <- data.frame(
  day            = StartDate+1:NumDays, 
  DaysSinceStart = StartDate+1:NumDays - StartDate,
  CustomerRate = TotalVisits/NumDays)

days$nPurchases <- rpois(NumDays, days$CustomerRate)
days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)] <- # Increase sales in weekends
  as.integer(days$nPurchases[as.POSIXlt(days$day)$wday %in% c(0,6)]*1.5)

Now build the transactions from these days.

Transactions <- data.frame(
  ID           = 1:sum(days$nPurchases),
  Date         = rep(days$day, times=days$nPurchases),
  CustomerType = sample(CustomerTypes, sum(days$nPurchases), replace=TRUE, prob=PropCustTypes),
  NewCustomer  = sample(c(0,1), sum(days$nPurchases),replace=TRUE, prob=c(.8,.2)),
  CustomerID   = NA,
  ProductID = NA,
  ReferredBy = NA)
Transactions$CustomerType <- as.character(Transactions$CustomerType)

Transactions <- merge(Transactions,Parameters, by="CustomerType") # Append probabilities to table for use in 'sample', haven't found a better way to vlookup?

Initialize some customers that we can choose from when the transaction is not with a new customer.

Customers <- data.frame(ID=(1:100), 
                        CustomerType = sample(CustomerTypes, size=100,
                                              replace=TRUE, prob=PropCustTypes)
); Customers$CustomerType <- as.character(Customers$CustomerType)
# Now make a new customer if transaction is with new customer, otherwise choose one with the right type.

Plenty of products to choose from, with release dates evenly spread:

ReleaseRange <- StartProductRelease + c(1:(StartDate+NumDays-StartProductRelease))
Upper <- max(ReleaseRange)
Lower <- min(ReleaseRange)
Products <- data.frame(
  ID = 1:NumProducts,
  DateReleased = as.Date(StartProductRelease+c(seq(as.numeric(Upper-Lower)/NumProducts,
                                         as.numeric(Upper-Lower),
                                         as.numeric(Upper-Lower)/NumProducts))),
  SuggestedPrice = rnorm(NumProducts, 50, 30))
Products$SuggestedPrice[Products$SuggestedPrice < 10] <- 10   # Floor the product price at $10

ReferredByOptions <- c("BySearchEngine", "Direct Customer", "Partner Blog")

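Now I loop over the newly created Transactions data.frame, choosing among the available products (measured as purchase date minus the average timeliness (in months) * 30 days, +/- 15 days). I also assign new customers a new CustomerID and, when the customer is not new, pick one from the existing customers. The other fields are determined by the parameters above.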

Start.time <- Sys.time()
for (i in 1:length(Transactions$ID)){

  if (Transactions[i,]$NewCustomer==1){
    NewCustomerID <- max(Customers$ID, na.rm=T)+1
    Customers[NewCustomerID,]$ID = NewCustomerID
    Transactions[i,]$CustomerID <- NewCustomerID
    Customers[NewCustomerID,]$CustomerType <- Transactions[i,]$CustomerType
  }
  if (Transactions[i,]$NewCustomer==0){
    Transactions[i,]$CustomerID <- sample(Customers[Customers$CustomerType==Transactions[i,]$CustomerType,]$ID,
                                          1,replace=FALSE)
  }
  Transactions[i,]$Discount <- rnorm(1,Transactions[i,]$Discount,Transactions[i,]$Discount/20)
  Transactions[i,]$Timeliness <- rnorm(1,Transactions[i,]$Timeliness, Transactions[i,]$Timeliness/6)
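  # Use the channel probabilities merged into this row (the original referenced an undefined 'Current')
  Transactions[i,]$ReferredBy <- sample(ReferredByOptions,1,replace=FALSE,
                               prob=unlist(Transactions[i,c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")]))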

  CenteredAround <- as.Date(Transactions[i,]$Date - Transactions[i,]$Timeliness*30)
  ProductReleaseRange <- as.Date(CenteredAround+c(-15:15))
  Transactions[i,]$ProductID <- sample(Products[as.character(Products$DateReleased) %in% as.character(ProductReleaseRange),]$ID,1,replace=FALSE)
}
Elapsed <- Sys.time()-Start.time
length(Transactions$ID)

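It works! Unfortunately, on a dataset of 20,000 products sold over 100 days, this takes about 22 minutes. Not necessarily a problem, but I am very interested in potential improvements.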
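One possible direction for speeding this up, offered as a sketch rather than a tested replacement for the loop: rnorm() accepts vectors for mean and sd, so Discount and Timeliness can be drawn in a single call, and ReferredBy can be sampled once per customer type instead of once per row. The small Transactions table below is a stand-in with the same column names as the merged table above; the customer and product assignment steps are not covered here.

```r
set.seed(1)
# Small stand-in for the merged Transactions table (column names as in the answer).
Transactions <- data.frame(
  CustomerType     = rep(c("EarlyAdopter", "Dealseekers"), each = 5),
  BySearchEngine   = rep(c(0.10, 0.60), each = 5),
  ByDirectCustomer = rep(c(0.60, 0.05), each = 5),
  ByPartnerBlog    = rep(c(0.30, 0.35), each = 5),
  Timeliness       = rep(c(1, 12), each = 5),
  Discount         = rep(c(0, 0.10), each = 5),
  stringsAsFactors = FALSE)
ReferredByOptions <- c("BySearchEngine", "Direct Customer", "Partner Blog")
n <- nrow(Transactions)

# rnorm() is vectorised over mean and sd, so one call replaces n loop iterations.
Transactions$Discount   <- rnorm(n, Transactions$Discount,   Transactions$Discount / 20)
Transactions$Timeliness <- rnorm(n, Transactions$Timeliness, Transactions$Timeliness / 6)

# Sample the channel once per customer type rather than once per row.
probCols <- c("BySearchEngine", "ByDirectCustomer", "ByPartnerBlog")
Transactions$ReferredBy <- NA_character_
for (type in unique(Transactions$CustomerType)) {
  rows <- which(Transactions$CustomerType == type)
  Transactions$ReferredBy[rows] <- sample(ReferredByOptions, length(rows), replace = TRUE,
                                          prob = unlist(Transactions[rows[1], probCols]))
}
```

Much of the original runtime likely comes from the repeated Transactions[i,]$col <- value assignments, each of which copies parts of the data frame; filling whole columns at once avoids that.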
