I am trying to convert data that show sales as cumulative total sales for the year to date. I want to show sales as they occur by day, not the cumulative figure.

Here is an example of the data:

Product, Geography, Date, SalesThisYear
Prod_1, Area_A, 20130501, 10
Prod_2, Area_B, 20130501, 5
Prod_1, Area_B, 20130501, 3
Prod_1, Area_A, 20130502, 12
Prod_2, Area_B, 20120502, 5
Prod_1, Area_B, 20130502, 4
...

So the transformed data would look like:

Product, Geography, Date, SalesThisYear*, DailySales
Prod_1, Area_A, 20130501, 10, 10
Prod_2, Area_B, 20130501, 5, 5
Prod_1, Area_B, 20130501, 3, 3
Prod_1, Area_A, 20130502, 12, 2
Prod_2, Area_B, 20120502, 5, 0
Prod_1, Area_B, 20130502, 4, 1

This can then be used in later analysis.

* In case this makes any difference to the approach: I receive a new data file each day with the latest sales information, so I need to append the new data to the existing data and then work out the daily sales figure. This is why I have kept the SalesThisYear field in the transformed data: it can be used to calculate the new DailySales figures when the next data file arrives.

I'm new to R, so I'm still working out the best way to solve this problem. I recognize that I have two categorical fields, so I anticipated that one approach would be to factor on these fields. My overall thinking was to write a function and then use an apply command to run it against the entire data set. As an overview, my thinking is:

(First load data file into R. Append second data file into R using rbind.)

Create a function that does the following:

  1. Identify products and geographies using factor/similar
  2. Identify largest date and second largest date
  3. For each product and geography combination, find the SalesThisYear value in the appended data and in the original data, using the date values obtained in step 2. I'm thinking of using the subset function here. Subtract the two values: the result becomes the DailySales value. (There would need to be error-checking logic in case a new geography or product was introduced.)
  4. Append this new DailySales value to the results.
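
For illustration, the steps above could be sketched without an explicit loop by pairing the two days with merge(). This is only a rough sketch under assumptions: the data frames `old` and `new` are hypothetical stand-ins for yesterday's and today's files, every row in `new` is assumed to carry the date 20130502, and a product/geography combination that is new today is assumed to have a previous total of 0.

```r
# Hypothetical stand-ins for the two daily files (sample values from above)
old <- data.frame(Product = c("Prod_1", "Prod_2", "Prod_1"),
                  Geography = c("Area_A", "Area_B", "Area_B"),
                  Date = 20130501L,
                  SalesThisYear = c(10, 5, 3))
new <- data.frame(Product = c("Prod_1", "Prod_2", "Prod_1"),
                  Geography = c("Area_A", "Area_B", "Area_B"),
                  Date = 20130502L,
                  SalesThisYear = c(12, 5, 4))

# Step 3: attach yesterday's cumulative total to each of today's rows.
# all.x = TRUE keeps combinations that did not exist yesterday.
m <- merge(new, old[, c("Product", "Geography", "SalesThisYear")],
           by = c("Product", "Geography"), all.x = TRUE,
           suffixes = c("", "_prev"))

# Assumption: a brand-new product/geography starts from a previous total of 0
m$SalesThisYear_prev[is.na(m$SalesThisYear_prev)] <- 0

# Step 4: daily sales is the difference between the two cumulative totals
m$DailySales <- m$SalesThisYear - m$SalesThisYear_prev
```

Note that merge() returns the rows sorted by the key columns rather than in the original file order.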

Data volume is about 120k rows per day, so the standard route of using a for loop in step 3 may not be advisable.

Is the above approach appropriate? Or is there an unknown unknown I need to learn? :)

1 Answer

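# Within each Product/Geography group, subtract the previous cumulative total
# (0 for the first row). Assumes the rows of d are already in date order.
transform(d,
    SalesThisDay = ave(SalesThisYear, Product, Geography,
                       FUN=function(x) x - c(0, head(x, -1))))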

#   Product Geography     Date SalesThisYear SalesThisDay
# 1  prod_1    area_a 20130501            10           10
# 2  prod_2    area_b 20130501             5            5
# 3  prod_1    area_b 20130501             3            3
# 4  prod_1    area_a 20130502            12            2
# 5  prod_2    area_b 20120502             5            0
# 6  prod_1    area_b 20130502             4            1
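
To cover the daily-append workflow from the question, here is a minimal self-contained sketch of the same idea. It makes several assumptions: `day1` and `day2` are hypothetical names for two daily files, the product and area names have been normalised to lower case, and every row in the second file is assumed to be dated 20130502. The explicit sort is there because the subtraction only gives daily figures when each group's rows are in date order.

```r
# Hypothetical daily files built from the sample data in the question
day1 <- data.frame(Product = c("prod_1", "prod_2", "prod_1"),
                   Geography = c("area_a", "area_b", "area_b"),
                   Date = 20130501L,
                   SalesThisYear = c(10, 5, 3))
day2 <- data.frame(Product = c("prod_1", "prod_2", "prod_1"),
                   Geography = c("area_a", "area_b", "area_b"),
                   Date = 20130502L,
                   SalesThisYear = c(12, 5, 4))

# Append the new file to the existing data, then make sure rows are in date order
d <- rbind(day1, day2)
d <- d[order(d$Date), ]

# Same per-group differencing as above: current total minus previous total
d$SalesThisDay <- ave(d$SalesThisYear, d$Product, d$Geography,
                      FUN = function(x) x - c(0, head(x, -1)))
```

Because ave() is vectorised per group, this recomputes the whole column on each run rather than looping over rows; whether that is fast enough at 120k rows per day would need to be checked on the real data.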
answered 2013-05-20T16:45:21.320