r - Creating new variables in grouped data

Question

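My data looks like this (basically it gives the sales of different brands by customer trip; a blank means the customer did not buy that brand on that particular trip, and store is the store location where the purchase was made):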

customerid  date    store   brand1  brand2  brand3  brand4
1   01-03-2012  a    $3.00   $-      $-      $2.00 
1   06-03-2012  a    $2.00   $-      $-      $3.00 
1   11-03-2012  b    $2.00   $1.00   $1.00   $1.00 
1   26-03-2012  a    $2.00   $-      $-      $-   
2   16-03-2012  d    $2.00   $1.00   $1.00   $2.00 
2   21-03-2012  a    $-      $-      $1.00   $2.00 
2   26-03-2012  a    $2.00   $1.00   $3.00   $1.00

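I want to create a separate data frame for each brand, containing only the rows where that brand's sales are > 0. So I thought I could put brand1-brand4 in a list called colnames_df, like this: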

 colnames_df<- colnames(myDf)

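Once I have done that, I could loop over its contents to generate the brand-level data sets. From the data above I need 4 separate data sets, each containing the relevant brand column plus the other columns such as custID and date. The data sets I want look like this:

Data set for brand1: (expected output)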

customerid  date    store   brand1
1   01-03-2012  a    $3.00 
1   06-03-2012  a    $2.00 
1   11-03-2012  b    $2.00 
1   26-03-2012  a    $2.00 
2   16-03-2012  d    $2.00 
2   26-03-2012  a    $2.00

Data set for brand2: (expected output)

   customerid   store   date    brand2
1   b   11-03-2012   $1.00 
2   d   16-03-2012   $1.00 
2   a   26-03-2012   $1.00

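Similarly, there would be data frames for brand3 and brand4. For this part, should I write something like for( i in length(colnames_df) { paste("Brand",i)<-}...? I don't know how to write this. I need to create brand-level data frames from the raw data above. If I use lapply and similar functions, I can figure out how to get a list/data frame of all the columns in the resulting data, but how do I do what I need above?

In addition to the above, I have another requirement:

Once the brand-level data sets are created, I also need to create lag and counter variables on each of them, as shown below.

Step 1: Create a counter variable for each customer trip (after the data set is sorted by custID and date).

Expected output for brand1 (with counter):

The code I am using (I am having trouble putting this code in a loop so that every brand-level data set gets the new variables automatically; instead of brand1 below, it should automatically be brand1, 2, 3, 4, etc.):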

brand1$counter <- with(brand1, ave(customerID, customerID, FUN = seq_along))

customerid  date    store   brand1  counter_custtrip
1   01-03-2012  a    $3.00  1
1   06-03-2012  a    $2.00  2
1   11-03-2012  b    $2.00  3
1   26-03-2012  a    $2.00  4
2   16-03-2012  d    $2.00  1
2   26-03-2012  a    $2.00  2

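Step 2: Create a lag variable, as in the expected output below.

I can use code like this (my problem is that I can do these operations for each data set individually, but how do I make all of this happen as each brand-level data set is created?):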

ddply(.data = df, .variables = .(customerID), mutate,
   lagdate = c(NA, head(date, -1)))

The expected output is: (for the brand1 data set)

  customerid    date    store   brand1  counter_custtrip    laggedtripdate
1   01-03-2012  a    $3.00  1   -
1   06-03-2012  a    $2.00  2   01-03-2012
1   11-03-2012  b    $2.00  3   06-03-2012
1   26-03-2012  a    $2.00  4   11-03-2012
2   16-03-2012  d    $2.00  1   -
2   26-03-2012  a    $2.00  2   16-03-2012

Step 3: Create the number of days between trips, by store

See the expected output for brand1 (the same applies to all brands):

customerid  date    store   brand1  counter_custtrip    laggedtripdate  daysbetweentrips
1   01-03-2012  a    $3.00  1   -   -
1   06-03-2012  a    $2.00  2   01-03-2012  5
1   11-03-2012  b    $2.00  3       -
1   26-03-2012  a    $2.00  4   06-03-2012  20
2   16-03-2012  d    $2.00  1   -   -
2   26-03-2012  a    $2.00  2   16-03-2012  -

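As we can see, customer 1 visited store a on 3/1, then again on 3/6 (5 days later), and then on 3/26 (20 days later). That is the logic. How do I do this for each customer and each store they visit?

I know this is a lot, and I am almost there. I just need a few lines of advice on how to put the whole structure together so that I can run it in a loop and create the new brand-level data sets, each containing all the new variables created along the way.

Let me know if I have missed anything.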

score 1 · Accepted Answer

Try the following answer; it converts to long format and uses data.table:

library(data.table)

# Your data:
data <- structure(list(customerid = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), date = structure(c(1325566800, 
1338696000, 1351915200, 1332734400, 1331870400, 1332302400, 1332734400
), class = c("POSIXct", "POSIXt"), tzone = ""), store = c("a", 
"a", "b", "a", "d", "a", "a"), brand1 = c(3L, 2L, 2L, 2L, 2L, 
NA, 2L), brand2 = c(NA, NA, 1L, NA, 1L, NA, 1L), brand3 = c(NA, 
NA, 1L, NA, 1L, 1L, 3L), brand4 = c(2L, 3L, 1L, NA, 2L, 2L, 1L
)), .Names = c("customerid", "date", "store", "brand1", "brand2", 
"brand3", "brand4"), row.names = c(NA, -7L), class = c("data.table", 
"data.frame"))

# Convert from wide format to long, and subset to records with sales > 0.
# Each id row is repeated once per brand so that it lines up with the
# row-major sales vector produced by c(t(...)):
brands <- names(data)[4:7]
data.long <- data.table(
  data[rep(seq_len(nrow(data)), each = length(brands)),
       list(customerid, store, date, laggedtripdate = as.POSIXct(NA))],
  brand = brands,
  sales = c(t(as.matrix(data[, brands, with = FALSE]))),
  key = c("customerid", "date"))[sales > 0]

# Add the lagged date, by customerid (head(date, -1) keeps the lagged
# vector the same length as each group):
data.long[data.long[, .N, by = list(customerid, date)][, laggedtripdate := c(as.POSIXct(NA), head(date, -1)), by = customerid], laggedtripdate := i.laggedtripdate]

# Add daysbetweentrips:
data.long[, daysbetweentrips := date - laggedtripdate]

# Add counter_custtrip:
data.long[, counter_custtrip := 1:.N, by = list(customerid, brand)]

# Subset of results for brand==1:
data.long[brand == "brand1"]
#   customerid store       date laggedtripdate  brand sales daysbetweentrips counter_custtrip
#1:          1     a 2012-01-03           <NA> brand1     3          NA days                1
#2:          1     a 2012-03-26     2012-01-03 brand1     2    82.95833 days                2
#3:          1     a 2012-06-03     2012-03-26 brand1     2    69.00000 days                3
#4:          1     b 2012-11-03     2012-06-03 brand1     2   153.00000 days                4
#5:          2     d 2012-03-16           <NA> brand1     2          NA days                1
#6:          2     a 2012-03-26     2012-03-21 brand1     2     5.00000 days                2
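
Since the question asks for one data set per brand, a long table can also be cut into a named list with split(). The following is only a minimal base R sketch: the toy table `long` and its values are made up for illustration and merely reuse the column names of data.long.

```r
# Toy long-format table (hypothetical values, same column names as data.long)
long <- data.frame(
  customerid = c(1L, 1L, 1L, 2L, 2L),
  brand      = c("brand1", "brand4", "brand1", "brand1", "brand2"),
  sales      = c(3, 2, 2, 2, 1),
  stringsAsFactors = FALSE
)

# split() returns a named list holding one data frame per brand
by_brand <- split(long, long$brand)

names(by_brand)        # "brand1" "brand2" "brand4"
nrow(by_brand$brand1)  # 3
```

Keeping the pieces in a list rather than in separate variables called brand1, brand2, and so on makes it easy to apply the same step to every brand with lapply().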

score 0 · Accepted Answer

Here is an example that uses the data in long data frame format.

library(reshape2)
library(plyr)


# Prepare data
# melt data
# measured variables given as a vector of variable names
df2 <- melt(data = df,
            measure.vars = paste0("brand", 1:4),
            variable.name = "brand",
            value.name = "sale")

Update of the melt step, following the comment from @kaos1511 below:

# handling brand names that are not of the form brand1, brand2, ..., brandn

# add some fake brand names to df
names(df) <- c("customerid", "date", "store", "Mazda", "Toyota", "Nissan", "Volvo")

# If data for different brands always come after customerid, date and store
# you can melt data by specifying 'measure variables' by position, like this
# melt data
df2 <- melt(data = df,
            measure.vars = 4:(ncol(df)),
            variable.name = "brand",
            value.name = "sale")

# alternatively, you can specify customerid, date and store as 'id variables'
# melt will then assume that all remaining variables, i.e. all 'brand columns', are measure variables
df2 <- melt(data = df,
            id.vars = c("customerid", "date", "store"),
            variable.name = "brand",
            value.name = "sale")

# remove $ and surrounding whitespace, replace - with 0, then convert to numeric
df2$sale <- with(df2, trimws(gsub(pattern = "$", replacement = "", sale, fixed = TRUE)))
df2$sale[df2$sale == "-"] <- 0
df2$sale <- as.numeric(df2$sale)

# convert to date 
df2$date <- as.Date(df2$date, format = "%d-%m-%Y")

# select rows with sale > 0
df3 <- df2[df2$sale > 0, ]


# Create new variables
# per brand and customerid, create counter and lagdate
# nb, in your last two 'expected output', lagdate does not match.
# my lagdate matches the first of them.
df4 <- ddply(.data = df3, .variables = .(brand, customerid), mutate,
             counter = as.numeric(as.factor(date)),
             lagdate = c(NA, as.character(head(date, -1))))

# order by brand, customerid, store and date
df4 <- arrange(df4, brand, customerid, store, date)

# per brand, customerid and store, calculate days between trips
df5 <- ddply(.data = df4, .variables = .(brand, customerid, store), mutate,
             daysbetweentrips = c(NA, diff(date)))

# order by brand, customerid and date
df5 <- arrange(df5, brand, customerid, date)
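
If you would rather end up with one data frame per brand, as asked in the question, the same steps can be wrapped in a function and applied to each brand with lapply(). This is only a rough base R sketch under some assumptions: `long` is a hand-typed copy of the question's brand1 and brand2 rows with sale > 0, and `add_trip_vars` is a made-up helper name, not a function from plyr or reshape2.

```r
# Hypothetical long-format data: the question's rows for brand1 and brand2,
# already filtered to sale > 0, with dates parsed as day-month-year
long <- data.frame(
  customerid = c(1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L),
  date  = as.Date(c("01-03-2012", "06-03-2012", "11-03-2012", "26-03-2012",
                    "16-03-2012", "26-03-2012",
                    "11-03-2012", "16-03-2012", "26-03-2012"),
                  format = "%d-%m-%Y"),
  store = c("a", "a", "b", "a", "d", "a", "b", "d", "a"),
  brand = rep(c("brand1", "brand2"), times = c(6, 3)),
  sale  = c(3, 2, 2, 2, 2, 2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# add_trip_vars() is an illustrative helper, not part of any package
add_trip_vars <- function(d) {
  d <- d[order(d$customerid, d$date), ]
  days <- as.numeric(d$date)  # days since 1970-01-01
  # trip counter within each customer
  d$counter <- ave(days, d$customerid, FUN = seq_along)
  # previous trip date of the same customer (NA for the first trip)
  lag_days <- ave(days, d$customerid, FUN = function(x) c(NA, head(x, -1)))
  d$lagdate <- as.Date(lag_days, origin = "1970-01-01")
  # days since the same customer's previous trip to the same store
  d$daysbetweentrips <- ave(days, paste(d$customerid, d$store),
                            FUN = function(x) c(NA, diff(x)))
  d
}

# one data frame per brand, each with the new variables
brand_list <- lapply(split(long, long$brand), add_trip_vars)

brand_list$brand1$daysbetweentrips  # NA 5 NA 20 NA NA
```

Each element of brand_list can then be used on its own, for example brand_list$brand2, and a new brand column in the raw data needs no change to the code.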
