7

I am new to R, but after taking an intro course and playing with it a bit, I'm hopeful that it can 1) more elegantly solve my modelling objectives (compared to Excel, which is my backup plan) and 2) be a useful skill to take away from this project.

The task/objective:

I am attempting to use driving diary data to simulate and model potential energy and GHG emissions from electric cars. Specifically:

  1. I have driving diary data (start and end time stamps, plus other data of thousands of drivers -- basic sample below) that I want to translate into:
  2. 24-hour time series data, such that for each minute of a 24-hour period, I know exactly who is driving a vehicle, and what 'trip' that it belongs to (for that driver). My problem here focuses on this issue.

The type of output I would like: NOTE: this output is NOT related to the sample data provided below. I used the first ten minutes of a certain day with some theoretical trips just as an example

enter image description here

Not essential to this problem, but may be useful to know: I will use the above output to cross-reference other driver-specific data to calculate minute-by-minute consumption of gasoline (or electricity) based on things associated with that trip, such as parking location or trip distance. I would like to do this in R but must first figure out the above problem before I move onto this step.

The solution I have so far is based on:

The problem:

Example simplified data:

a <- c("A","A","A","B","B","B","C","C","C")
b <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
c <- as.POSIXct(c(0.29167, 0.59375, 0.83333, 0.45833, 0.55347, 0.27083, 0.34375, 0.39236, 0.35417)*24*3600 + as.POSIXct("2013-1-1 00:00") )
d <- as.POSIXct(c(0.334027778, 0.614583333, 0.875, 0.461805556, 0.563888889, 0.295138889, 0.375, 0.503472222, 0.364583333)*24*3600 + as.POSIXct("2013-1-1 00:00"))
e <- c(2, 8, 2, 5, 5, 2, 5, 5, 2)
f <- as.POSIXct(c(0, 0.875, 0, 0.479166666666667, 0.580555555555556, 0.489583333333333, 0.430555555555556, 0.541666666666667, 0.711805555555555)*24*3600 + as.POSIXct("2013-1-1 00:00"))
g <- as.POSIXct(c(0, 0.885, 0, 0.482638888888889, 0.588194444444444, 0.496527777777778, 0.454861111111111, 0.559027777777778, 0.753472222222222)*24*3600 + as.POSIXct("2013-1-1 00:00"))
h <- c(0, 1, 0, 1, 4, 8, 8, 1, 5)
i <- as.POSIXct(c(0, 0, 0, 0.729166666666667, 0.595833333333333, 0.534722222222222, 0.59375, 0.779861111111111, 0.753472222222222)*24*3600 + as.POSIXct("2013-1-1 00:00"))
j <- as.POSIXct(c(0, 0, 0, 0.736111111111111, 0.605555555555556, 0.541666666666667, 0.611111111111111, 0.788194444444445, 0.75625)*24*3600 + as.POSIXct("2013-1-1 00:00"))
k <- c(0, 0, 0, 4, 4, 2, 5, 8,1)
testdata <- data.frame(a,b,c,d,e,f,g,h,i,j,k)
names(testdata) <- c("id", "Day", "trip1_start", "trip1_end", "trip1_purpose", "trip2_start", "trip2_end", "trip2_purpose", "trip3_start", "trip3_end", "trip3_purpose")

In this example data, I have three drivers (id = A, B, C) who each drive on three different days (day = 1, 2, 3). Note that some drivers may have different numbers of trips. The time stamps indicate start and end time of driving activities.

I then create minute intervals for a entire day (January 1, 2013)

start.min <- as.POSIXct("2013-01-01 00:00:00 PST")
end.max <- as.POSIXct("2013-01-01 23:59:59 PST")
tinterval <- seq.POSIXt(start.min, end.max, na.rm=T, by = "mins")

Insert "1" during minutes where a given user is driving:

out1 <- xts(,align.time(tinterval,60))
# loop over each user
for(i in 1:NROW(testdata)) {
  # paste the start / end times into an xts-style range
  timeRange <- paste(format(testdata[i,c("trip1_start","trip1_end")]),collapse="/")
  # add the minute "by parameter" for timeBasedSeq
  timeRange <- paste(timeRange,"M",sep="/")
  # create the by-minute sequence and align to minutes to match "out"
  timeSeq <- align.time(timeBasedSeq(timeRange),60)
  # create xts object with "1" entries for times between start and end
  temp1 <- xts(rep(1,length(timeSeq)),timeSeq)
  # merge temp1 with out and fill non-matching timestamps with "0"
  out1 <- merge(out1, temp1, fill=0)
}
# add column names
colnames(out1) <- paste(testdata[,1], testdata[,2], sep = ".")

The idea is to then repeat this for each trip, e.g. out2, out3, etc. wherein I would fill any driving periods with "2", "3", etc., and then sum/merge all of the resulting outx dataframes, and eventually get the desired result.

Unfortunately when I try to repeat this for out2...

out2 <- xts(,align.time(tinterval,60))
for(i in 1:NROW(testdata)) {
  timeRange2 <- paste(format(testdata[i,c("trip2_start","trip2_end")]),collapse="/")
  timeRange2 <- paste(timeRange2,"M",sep="/")
  timeSeq2 <- align.time(timeBasedSeq(timeRange2),60)
  temp2 <- xts(rep(2,length(timeSeq2)),timeSeq2)
  out2 <- merge(out2, temp2, fill=0)
}
colnames(out2) <- paste(testdata[,1], testdata[,2], sep = ".")
head(out2)

I get the following errors:

  • Error in UseMethod("align.time") : no applicable method for 'align.time' applied to an object of class "Date"
  • Error in colnames<-(*tmp*, value = c("A.1", "A.2", "A.3", "B.1", "B.2", : attempt to set 'colnames' on an object with less than two dimensions

What is wrong with my code for out2?

Are there any other better solutions or packages I can learn about?

I realize this is probably a very roundabout way to get to my desired output.

Any help would be much appreciated.

4

2 回答 2

1

In this solution I read your original data and format it to get the generated data of my previous answer. The data provided is limited to 22 trips by driver, but the reshaping here is not limited by the number of trips. The idea is similar to the one used to generate sample data. I am using data.table since it is handy to manipulate data per group.

So for each(day,driver) I do the following:

  1. create a vector of zeros of length the number of minutes
  2. read start and end position using XXXstrip_start and XXXstrip_end.
  3. create sequence seq(start,end)
  4. use this sequence to change zeros by a sequence of number

Here my code:

start.min <- as.POSIXct("2013-01-01 00:00:00 PST")
hours.min <- format(seq(start.min, 
                        length.out=24*60, by = "mins"),
                    '%H:%M')
library(data.table)
diary <- read.csv("samplediary.csv",
                  stringsAsFactors=FALSE)
DT <- data.table(diary,key=c('id','veh_assigned','day'))

dat <- DT[, as.list({ .SD;nb.trip=sum_trips
           tripv <- vector(mode='integer',length(hours.min))
           if(sum_trips>0){
             starts = mget(paste0('X',seq(nb.trip),'_trip_start'))
             ends = mget(paste0('X',seq(nb.trip),'_trip_end'))
             ids <- mapply(function(x,y){
                                        seq(as.integer(x),as.integer(y))},
                           starts,ends,SIMPLIFY = FALSE)
             for (x in seq_along(ids))tripv[ids[[x]]] <- x
             }
            tripv
           }),
   by=c('id','day')]
setnames(x=dat,old=paste0('V',seq(hours.min)),hours.min)

Here what you get for if you subset the 10 first variables :

dat[1:10,1:10,with=FALSE]


       id day 00:00 00:01 00:02 00:03 00:04 00:05 00:06 00:07
 1: 3847339   1     0     0     0     0     0     0     0     0
 2: 3847384   1     0     0     0     0     0     0     0     0
 3: 3847436   1     0     0     0     0     0     0     0     0
 4: 3847439   1     0     0     0     0     0     0     0     0
 5: 3847510   1     0     0     0     0     0     0     0     0
 6: 3847536   1     0     0     0     0     0     0     0     0
 7: 3847614   1     0     0     0     0     0     0     0     0
 8: 3847683   1     0     0     0     0     0     0     0     0
 9: 3847841   1     0     0     0     0     0     0     0     0
10: 3847850   1     0     0     0     0     0     0     0     0

One idea is to create a heatmap of your data ( at least per day) to get some intutions and see overlapping drivers for example. Here 2 ways to do this using lattice and ggplot2 but first I will reshape the data in the long format using reshape2

library(reshape2)
dat.m <- melt(dat,id.vars=c('id','day'))

Then I plot my heatmap to see which drivers are overlapping with others for example:

library(lattice)
levelplot(value~as.numeric(variable)*factor(id),data=dat.m)

enter image description here

library(ggplot2)
ggplot(dat.m, aes(x=as.numeric(variable),y=factor(id)))+ 
        geom_tile(aes(fill = value)) +
  scale_fill_gradient(low="grey",high="blue")

enter image description here

于 2013-06-29T04:01:13.890 回答
0

This is not an answer to your problem. Honestly it is not clear for me the transition between the data you show in the image and the example of your data. It seems like that you can't reproduce this data. So Here a function to generate a reproducible example of your data. I think it can be at least useful, to validate your model.

function to sample data

library(reshape2)
start.min <- as.POSIXct("2013-01-01 00:00:00 PST")
hours.min <- format(seq(start.min, 
                        length.out=24*60, by = "mins"),
                    '%H:%M')

## function to generate a trip sample
## min.dur : minimal duration of a trip
## max.dur : maximal duration of a trip
## min.trip : minimal number of trips that a user can do 

gen.Trip <- function(min.dur=3,max.dur=10,min.trip=100){
  ## gen number of trip
  n.trip <- sample(seq(min.trip,20),1)
  ## for each trip generate the durations
  durations <- rep(seq(1,n.trip),
                   times=sample(seq(min.dur,max.dur),
                                max(min.dur,n.trip),rep=TRUE))
  ## generate a vector of positions
  rr <- rle(durations)
  mm <- cumsum(rr$lengths)
  ## idrty part here
  pos <- sort(sample(seq(1,length(hours.min)-2*max(mm)),
              n.trip,rep=FALSE)) + mm
  ## assign each trip to each posistion  
  val <- vector(mode='integer',length(hours.min))
  for(x in seq_along(pos))
    val[seq(pos[x],length.out=rr$len[x])] <- rr$val[x]
  val
}

generate trips for 100 drivers

set.seed(1234)
nb.drivers <- 100
res <- replicate(nb.drivers,gen.Trip(),simplify=FALSE)
res <- do.call(rbind,res)
colnames(res) <- hours.min
rownames(res) <- paste0('driv',seq(nb.drivers))

wide format

head(res[,10:30])
  ##       00:09 00:10 00:11 00:12 00:13 00:14 00:15 00:16 00:17 00:18 00:19
## driv1     0     0     0     0     0     0     1     1     1     1     1
## driv2     0     1     1     1     1     1     1     2     2     2     1
## driv3     0     0     0     0     0     0     0     0     0     0     0
## driv4     1     1     1     0     0     0     0     0     0     0     0
## driv5     0     0     0     0     0     0     0     0     0     0     1
## driv6     0     0     0     0     0     0     0     0     0     0     0
##       00:20 00:21 00:22 00:23 00:24 00:25 00:26 00:27 00:28 00:29
## driv1     1     1     0     0     2     2     2     2     2     2
## driv2     0     0     0     0     0     0     3     3     3     3
## driv3     0     0     0     0     0     0     0     0     0     0
## driv4     0     0     0     0     0     0     0     0     0     0
## driv5     1     1     1     1     1     1     1     1     0     0
## driv6     0     0     0     0     0     0     0     0     0     0

long format

res.m <- melt(res)
head(res.m)
##    Var1  Var2 value
## 1 driv1 00:00     0
## 2 driv2 00:00     0
## 3 driv3 00:00     0
## 4 driv4 00:00     0
## 5 driv5 00:00     0
## 6 driv6 00:00     0
于 2013-06-28T16:56:59.657 回答