I am trying to convert data that show sales as cumulative total sales for the year to date. I want to show sales as they occur by day, not the cumulative figure.
Here is an example of the data:
Product, Geography, Date, SalesThisYear
Prod_1, Area_A, 20130501, 10
Prod_2, Area_B, 20130501, 5
Prod_1, Area_B, 20130501, 3
Prod_1, Area_A, 20130502, 12
Prod_2, Area_B, 20130502, 5
Prod_1, Area_B, 20130502, 4
...
So the transformed data would look like:
Product, Geography, Date, SalesThisYear*, DailySales
Prod_1, Area_A, 20130501, 10, 10
Prod_2, Area_B, 20130501, 5, 5
Prod_1, Area_B, 20130501, 3, 3
Prod_1, Area_A, 20130502, 12, 2
Prod_2, Area_B, 20130502, 5, 0
Prod_1, Area_B, 20130502, 4, 1
This can then be used in later analysis.
* In case this makes any difference to the approach: I receive a new data file each day with the latest sales information, so I need to append the new data to the existing data and work out the daily sales figure. This is why I have kept the SalesThisYear field in the transformed data: it can be used to calculate the new DailySales figures when the next data file arrives.
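To make the target calculation concrete, here is a rough sketch in base R of the transformation on the sample rows. It assumes the sample values are normalised (consistent area names and dates), that there is one row per product and geography per day, and that the first day on file counts in full as that day's sales. The data frame name `sales` is just for illustration.

```r
# Sample rows from the example, typed in by hand (values assumed normalised).
sales <- data.frame(
  Product       = c("Prod_1", "Prod_2", "Prod_1", "Prod_1", "Prod_2", "Prod_1"),
  Geography     = c("Area_A", "Area_B", "Area_B", "Area_A", "Area_B", "Area_B"),
  Date          = c(20130501, 20130501, 20130501, 20130502, 20130502, 20130502),
  SalesThisYear = c(10, 5, 3, 12, 5, 4),
  stringsAsFactors = FALSE
)

# Sort by date so that rows within each group are in time order.
sales <- sales[order(sales$Date), ]

# Within each Product/Geography group, subtract the previous row's
# cumulative value (taken as 0 for the first row of the group).
sales$DailySales <- ave(
  sales$SalesThisYear, sales$Product, sales$Geography,
  FUN = function(x) x - c(0, head(x, -1))
)

print(sales)
```

This recomputes DailySales over the whole appended history in one vectorised pass, which may or may not be preferable to only processing the newest file.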
I'm new to R, so I'm still working out the best way to solve this problem. I recognize that I have two categorical fields, so I was anticipating that one approach could be to factor on these fields. My overall thinking was to write a function and then use an apply command to run it against the entire data set. As an overview, my thinking is:
(First, load the data file into R. Then append the second data file using rbind.)
Create a function that does the following:
1. Identify products and geographies using factor or similar.
2. Identify the largest date and the second-largest date.
3. For each product and geography combination, find the SalesThisYear value in the appended data and in the original data, using the date values obtained in step 2 (I'm thinking of using the subset function here). Subtract the two values: this becomes the DailySales value. (There would need to be error-checking logic in case a new geography or product was introduced.)
4. Append this new DailySales value to the results.
Data volume is about 120k rows per day, so the standard route of using a for loop in step 3 may not be advisable.
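As a rough sketch of how steps 2 to 4 might look without an explicit loop, the latest file can be joined to the previous day's rows with base R's merge. The function name `add_daily_sales` and the data frames `day1` and `day2` are hypothetical, and the sketch assumes that every product and geography combination appears once per file and that a combination not seen on the previous day starts from zero.

```r
# Hypothetical helper: `history` holds all rows loaded so far and `new_day`
# is the latest file. Both are assumed to have the columns Product,
# Geography, Date and SalesThisYear.
add_daily_sales <- function(history, new_day) {
  keys <- c("Product", "Geography")

  # Cumulative figures from the most recent date already on file.
  prev <- history[history$Date == max(history$Date), c(keys, "SalesThisYear")]
  names(prev)[names(prev) == "SalesThisYear"] <- "PrevSales"

  # Left join: combinations that are new today get NA for PrevSales.
  out <- merge(new_day, prev, by = keys, all.x = TRUE)

  # Assumption: a new product or geography starts from zero.
  out$PrevSales[is.na(out$PrevSales)] <- 0

  out$DailySales <- out$SalesThisYear - out$PrevSales
  out[, c("Product", "Geography", "Date", "SalesThisYear", "DailySales")]
}

# Usage with the two sample days (values assumed normalised).
day1 <- data.frame(
  Product = c("Prod_1", "Prod_2", "Prod_1"),
  Geography = c("Area_A", "Area_B", "Area_B"),
  Date = 20130501,
  SalesThisYear = c(10, 5, 3),
  stringsAsFactors = FALSE
)
day1$DailySales <- day1$SalesThisYear  # first day: nothing to subtract

day2 <- data.frame(
  Product = c("Prod_1", "Prod_2", "Prod_1"),
  Geography = c("Area_A", "Area_B", "Area_B"),
  Date = 20130502,
  SalesThisYear = c(12, 5, 4),
  stringsAsFactors = FALSE
)

history <- rbind(day1, add_daily_sales(day1, day2))
print(history)
```

Note that merge sorts its result by the key columns by default, so the row order of the new day may differ from the order in the file.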
Is the above approach appropriate? Or is there an unknown unknown I need to learn? :)