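On Cross Validated, I asked a question about analyzing data by date without producing spurious peaks and troughs from binning the data by month. For example, if someone pays a bill on the last day of every month but on one occasion pays a few days late, then one month will show zero expense and the next month will show double the usual expense. All of it is spurious junk.
One of the answers to my question explained the concept of interpolation, using a linear spline smooth on the cumulative sum to overcome the hiccups caused by binning. I am intrigued by it and want to implement it in R, but I cannot find any examples online. I do not just want to print plots. I want to get the instantaneous slope at each time point (perhaps each day), but that slope should come from a spline that takes in points from a few days (or perhaps a few weeks or months) before the time point to a few days after it. In other words, at the end of the day I want something like a data frame in which one column is money per day or patients per week, but which is not subject to vagaries such as whether I paid a few days late or whether a month happened to have 5 operating days (rather than the usual 4).
Here is a simplified simulation with some plots to show what I am up against.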
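To make the request concrete, here is a rough sketch of the kind of computation I have in mind. It is only my guess at the approach, not something taken from the answer: it assumes linear interpolation of the cumulative sum with base R's approx() (a degree-one spline through the payment dates), followed by a slope taken over a centered window. The name half.window, its value of 45 days, and the result data frame are placeholders of my own.

```r
# Hypothetical ledger: $50 on the last day of each month, one payment 3 days late
dates <- seq(as.Date("2010-02-01"), length.out = 24, by = "1 month") - 1
dates[5] <- dates[5] + 3
amounts <- rep(50, 24)
cumamount <- cumsum(amounts)

# One grid point per day, from the first payment to the last
days <- seq(min(dates), max(dates), by = "day")

# Linearly interpolate the cumulative sum onto the daily grid
cum.daily <- approx(x = as.numeric(dates), y = cumamount,
                    xout = as.numeric(days))$y

# Slope over a centered window of +/- half.window days (assumed value),
# truncated at the two ends of the series
half.window <- 45
n <- length(days)
lo <- pmax(seq_len(n) - half.window, 1)
hi <- pmin(seq_len(n) + half.window, n)
interpolate.daily <- (cum.daily[hi] - cum.daily[lo]) / (hi - lo)  # dollars per day

result <- data.frame(date = days, cum.daily, interpolate.daily)
head(result)
```

With these settings the late payment should show up only as a mild dip and bump in interpolate.daily, rather than a zero followed by a double amount. A wider window smooths more, at the cost of responding more slowly to real trends.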
library(plyr) #needed for ddply() below
library(lubridate)
library(ggplot2)
library(reshape2)
dates <- seq(as.Date("2010-02-01"), length=24, by="1 month") - 1
dates[5] <- dates[5]+3 #we are making one payment date that is 3 days late
dates#look how the payment date is the last day of every month except for
#2010-05 where it takes place on 2010-06-03 - naughty boy!
amounts <- rep(50,each=24)# pay $50 every month
register <- data.frame(dates,amounts)#this is the starting register or ledger
ggplot(data=register,aes(dates,amounts))+geom_point()#look carefully and you will see that 2010-05 has no dots in it and 2010-06 has two dots
#create a summary of totals by month, but it ends up omitting a month in which nothing happened.
#Further badness is that it creates a new data frame where one is not needed.
register.by.month <- ddply(register,.(y=year(dates),month=month(dates)),summarise,month.tot=sum(amounts))
#Instead I created a new variable that allocates each date into a particular "zone" such as month or
register$cutmonth <- as.Date(cut(register$dates, breaks = "month"))#until recently I did not know that the cut function can handle dates
table(register$cutmonth)#see how there are two payments in the month of 2010-06
#now let's look at what we paid each month. What is the total for each month?
ggplot(register, aes(cutmonth, amounts))+ stat_summary(fun = sum, geom = "bar")#that is the truth but it is a useless truth (fun replaces the deprecated fun.y in ggplot2 >= 3.3.0)
#so let's use cumulative expense over time
register$cumamount <- cumsum(register$amounts)
cum <- ggplot(data=register,aes(dates,cumamount))+geom_point()
cum+stat_smooth()
#That was for everything the same every month, now lets introduce a situation where there is a trend that in the second year the amounts start to go up,
increase <- c(rep(1,each=12),seq(from=1.01,to=1.9,length.out=12))
amounts.up <- round(amounts*increase,digits=2)#this is the monthly amount with a growth of amount in each month of the second year
register <- cbind(register,amounts.up)#add the variable to the data frame
register$cumamount.up <- cumsum(register$amounts.up) #work out the cumulative sum for the new scenario
ggplot(data=register,aes(x=dates))+
geom_point(aes(y=amounts, colour="amounts",shape="amounts"))+
geom_point(aes(y=amounts.up, colour="amounts.up",shape="amounts.up"))# the plot of amount by date
#I am now going to plot the cumulative amount over time but now that I have two scenarios it is easier to deal with the data frame in long format (melted) rather than wide format (casted)
#before I can melt, the reshape2 package unfortunately can't handle the Date class, so I will have to turn the dates into characters and then back again.
register[,c("dates","cutmonth")] <- lapply(register[,c("dates","cutmonth")],as.character)
register.long <- melt(register,measure.vars=c("amounts","amounts.up"))#reshape2 exports melt(), not melt.data.frame()
register.long[,c("dates","cutmonth")] <- lapply(register.long[,c("dates","cutmonth")],as.Date)
ggplot(register.long, aes(cutmonth,value))+ stat_summary(fun = sum, geom = "bar")+facet_grid(. ~ variable)
#that is the truth but it is a useless truth. Furthermore it appears as if 2010-06 is similar to what is going on in 2011-12
#that is patently absurd. All that happened was that the 2010-05 payment was delayed by 3 days.
#so let's use cumulative expense over time, with one running total per scenario
register.long$cumvalue <- ave(register.long$value, register.long$variable, FUN = cumsum)
ggplot(data=register.long,aes(dates,cumvalue,colour=variable))+geom_point() + scale_y_continuous(name='cumulative sum of amounts ($)')
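So for the simple plot, the variable interpolate.daily would be about $50/30.4 = $1.64 per day for every day of the year. For the second plot, where the amount paid each month starts to rise every month in the second year, it would show a daily rate of $1.64 per day for every day in the first year, and for dates in the second year the daily rate would rise gradually from $1.64 per day to about $3.12 per day.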
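Those figures are just back-of-the-envelope arithmetic, which can be checked as follows. The average month length of 365.25/12 days and the variable names are my own assumptions for illustration.

```r
# Average number of days in a month (assumed: 365.25 / 12, about 30.4)
days.per.month <- 365.25 / 12

# Flat scenario: $50 per month spread evenly over the month
flat.rate <- 50 / days.per.month

# Rising scenario: the last month of the second year is 1.9 times the base amount
final.rate <- 50 * 1.9 / days.per.month

round(c(flat.rate, final.rate), 2)
```

These are the values I would expect a properly smoothed daily rate to sit close to at the start and at the end of the series.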
Thank you very much for reading all the way to the end. You must be as intrigued as I am!