I have a sequence object created as follows:
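library(TraMineR)

subsequences <- function(data){
  # Longest time index; used as the upper limit when converting spells
  slmax <- max(data$time)
  # Event-sequence object (created here but not used further below)
  sequences.seqe <- seqecreate(data)
  # Convert SPELL data to distinct-successive-state (DSS) format
  sequences.sts <- seqformat(data, from="SPELL", to="DSS", begin="time", end="end", id="id", status="event", limit=slmax)
  # Build the state sequence object, deleting left, right and gap missings
  sequences.sts <- seqdef(sequences.sts, right = "DEL", left = "DEL", gaps = "DEL")
  (sequences.sts)
}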
data <- subsequences(data)
head(data)
This gives the output:
Sequence
[1] discussed-subscribed-*-discussed-*-discussed-*-discussed-*-discussed-*-closed
[2] *-opened-*-reviewed-*-discussed-*-discussed-*-discussed-*-merged
[3] *-discussed-*-discussed-*-discussed-*-discussed
[4] *-opened-*-discussed-merged-discussed
[5] *-discussed-*-referenced-discussed-closed-discussed-referenced-discussed
[6] *-referenced-*-referenced-*-referenced-assigned-*-closed
But when I count the subsequences, I get seemingly absurd answers:
seqsubsn(head(data))
[!] found missing state in the sequence(s), adding missing state to the alphabet
Subseq.
[1] 1036
[2] 1248
[3] 88
[4] 49
[5] 294
[6] 240
How can the number of subsequences be so much larger than the number of events in each sequence?
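For reference, these numbers can be reproduced by hand. The sketch below assumes that `seqsubsn` counts distinct subsequences that need not be contiguous, that the empty subsequence is included, and that the missing state `*` is treated as an ordinary state. The helper `count_distinct_subseq` is a hypothetical name written for this illustration; it is not part of TraMineR.

```r
# Count distinct (not necessarily contiguous) subsequences of a vector of
# states, including the empty subsequence. Illustrative helper only,
# not a TraMineR function.
count_distinct_subseq <- function(states) {
  n <- length(states)
  dp <- numeric(n + 1)    # dp[i + 1] = count for the first i elements
  dp[1] <- 1              # the empty prefix has only the empty subsequence
  last_pos <- integer(0)  # named vector: last position of each state
  for (i in seq_len(n)) {
    s <- states[i]
    # Each existing subsequence can be kept as is or extended with s
    dp[i + 1] <- 2 * dp[i]
    if (s %in% names(last_pos)) {
      # Remove the subsequences already produced at the previous occurrence
      dp[i + 1] <- dp[i + 1] - dp[last_pos[[s]]]
    }
    last_pos[s] <- i
  }
  dp[n + 1]
}

# Sequences [4], [3] and [6] from the output above, split on "-"
seq4 <- strsplit("*-opened-*-discussed-merged-discussed", "-", fixed = TRUE)[[1]]
seq3 <- strsplit("*-discussed-*-discussed-*-discussed-*-discussed", "-", fixed = TRUE)[[1]]
seq6 <- strsplit("*-referenced-*-referenced-*-referenced-assigned-*-closed", "-", fixed = TRUE)[[1]]

count_distinct_subseq(seq4)  # 49
count_distinct_subseq(seq3)  # 88
count_distinct_subseq(seq6)  # 240
```

Under these assumptions the counts match the `seqsubsn` output for those three sequences, which suggests the values grow roughly exponentially with sequence length rather than linearly with the number of events.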
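The `dput()` of the dataset can be found here. The problem seems to be that the original data has timestamps in seconds. However, I use the function below to convert the timestamps into simple ordinal positions: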
library(sqldf)

read_seqdata <- function(data, startdate, stopdate){
  data <- read.table(data, sep = ",", header = TRUE)
  data <- subset(data, select = c("pull_req_id", "action", "created_at"))
  colnames(data) <- c("id", "event", "time")
  # Keep only events whose date falls between startdate and stopdate
  data <- sqldf(paste0("SELECT * FROM data WHERE strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') >= '",startdate,"' AND strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') <= '",stopdate,"'"))
  data$end <- data$time
  data <- data[with(data, order(time)), ]
  # Replace each timestamp with its rank among the unique sorted timestamps
  data$time <- match( data$time , unique( data$time ) )
  data$end <- match( data$end , unique( data$end ) )
  slmax <- max(data$time)
  (data)
}
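This makes it possible to compute appropriate measures such as entropy and sequence length, but the number of subsequences is still a problem.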
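To illustrate what the renumbering step in `read_seqdata` does, here is a minimal sketch using made-up timestamps (the values in `ts` are hypothetical and not taken from the dataset). It applies the same `match(x, unique(x))` idiom to an already sorted vector.

```r
# Hypothetical Unix timestamps in seconds, already sorted as in read_seqdata
ts <- c(1300000000, 1300000050, 1300000050, 1300009000)

# Map each timestamp to its position among the unique sorted values
pos <- match(ts, unique(ts))
pos  # 1 2 2 3
```

Events that share a timestamp receive the same position, and the gaps between timestamps collapse to steps of one.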