I have a sequence object created like this:

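library(TraMineR)

subsequences <- function(data){
  slmax <- max(data$time)
  # Event sequence object; not used further in this function
  sequences.seqe <- seqecreate(data)
  # Convert the spell data to distinct successive states (DSS)
  sequences.sts <- seqformat(data, from="SPELL", to="DSS", begin="time", end="end", id="id", status="event", limit=slmax)
  # Build the state sequence object
  sequences.sts <- seqdef(sequences.sts, right = "DEL", left = "DEL", gaps = "DEL")
  (sequences.sts)
}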

data <- subsequences(data)

head(data)

This gives the output:

    Sequence                                                                     
[1] discussed-subscribed-*-discussed-*-discussed-*-discussed-*-discussed-*-closed
[2] *-opened-*-reviewed-*-discussed-*-discussed-*-discussed-*-merged             
[3] *-discussed-*-discussed-*-discussed-*-discussed                              
[4] *-opened-*-discussed-merged-discussed                                        
[5] *-discussed-*-referenced-discussed-closed-discussed-referenced-discussed     
[6] *-referenced-*-referenced-*-referenced-assigned-*-closed

But when I count the subsequences, I get answers that seem absurd:

seqsubsn(head(data))
 [!] found missing state in the sequence(s), adding missing state to the alphabet
    Subseq.
[1]    1036
[2]    1248
[3]      88
[4]      49
[5]     294
[6]     240

How can the number of subsequences be so much larger than the number of events in each sequence?

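A dput() of the dataset can be found here. The problem seemed to be that the raw data has timestamps in seconds. However, I converted the timestamps to a simple ordering with the function below: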

library(sqldf)

read_seqdata <- function(data, startdate, stopdate){
  data <- read.table(data, sep = ",", header = TRUE)
  data <- subset(data, select = c("pull_req_id", "action", "created_at"))
  colnames(data) <- c("id", "event", "time")
  # Keep only events whose local date falls between startdate and stopdate
  data <- sqldf(paste0("SELECT * FROM data WHERE strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') >= '",startdate,"' AND strftime('%Y-%m-%d', time,
    'unixepoch', 'localtime') <= '",stopdate,"'"))
  data$end <- data$time
  data <- data[with(data, order(time)), ]
  # Replace each timestamp by its rank among the distinct timestamps
  data$time <- match( data$time , unique( data$time ) )
  data$end <- match( data$end , unique( data$end ) )
  slmax <- max(data$time)
  (data)
}

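This makes it possible to compute sensible measures of entropy, sequence length and so on, but the number of subsequences is still a problem.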
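For readers unfamiliar with the idiom, the renumbering step in read_seqdata can be sketched on a few made-up timestamps. The values below are hypothetical and not taken from the dataset; the sketch assumes the vector is already sorted, as it is in the function after the order() call.

```r
# Hypothetical timestamps in seconds (not from the actual dataset), already sorted
time <- c(1387000000, 1387000000, 1387000450, 1387003600, 1387003600, 1387090000)

# match() against the unique values turns each timestamp into its dense rank:
# tied timestamps share a position and the real-time gaps disappear
ranks <- match(time, unique(time))
ranks
# 1 1 2 3 3 4
```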

1 Answer

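The number of subsequences returned is not surprising at all. It comes down to the definition of a "subsequence", which should not be confused with a "substring".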

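A sequence $x = (x_1, x_2, \ldots, x_n)$ is a subsequence of $y$ if all its elements $x_i$ are in $y$ and appear in the same order as in $y$; they do not have to be adjacent in $y$. For example, ABA is a subsequence of CADBCDAD.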
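The definition can be checked mechanically. The following is a minimal base R sketch, not part of TraMineR; the helper name is_subsequence is made up for illustration.

```r
# Check whether x is a subsequence of y: the elements of x must appear in y
# in the same order, but not necessarily next to each other.
is_subsequence <- function(x, y) {
  pos <- 0L  # position in y of the last matched element
  for (el in x) {
    hits <- which(y == el)
    hits <- hits[hits > pos]          # only look to the right of the last match
    if (length(hits) == 0L) return(FALSE)
    pos <- hits[1L]                   # greedily take the earliest match
  }
  TRUE
}

y <- strsplit("CADBCDAD", "")[[1]]
is_subsequence(strsplit("ABA", "")[[1]], y)   # TRUE: A, B, A at positions 2, 4, 7
is_subsequence(strsplit("ABB", "")[[1]], y)   # FALSE: y contains only one B
grepl("ABA", "CADBCDAD", fixed = TRUE)        # FALSE: ABA is not a substring
```

The last line shows the contrast with substrings: ABA never occurs contiguously in CADBCDAD, yet it is a subsequence.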

To illustrate, consider the "mvad" example from the TraMineR package.

library(TraMineR)
data(mvad)
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")
mvad.seq <- seqdef(mvad, 17:86, states = mvad.scodes)
print(mvad.seq[1:3,], format="SPS")

##    Sequence                      
##[1] (EM,4)-(TR,2)-(EM,64)         
##[2] (FE,36)-(HE,34)               
##[3] (TR,24)-(FE,34)-(EM,10)-(JL,2)

seqsubsn(mvad.seq)[1:3]

##[1]  7  4 16

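By default, seqsubsn counts the subsequences of the distinct successive states (DSS), that is, of the sequence with consecutive repeats of a state collapsed. For example, the DSS of the first sequence is EM-TR-EM. The seven subsequences of EM-TR-EM are: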

  • the empty sequence
  • the two one-element sequences: EM and TR
  • the three length-two subsequences: EM-TR, EM-EM, TR-EM
  • the length-three sequence: EM-TR-EM
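The list above can be reproduced by brute force. This is a rough base R sketch rather than what seqsubsn does internally; the helper names dss and distinct_subsequences are made up for illustration, and the enumeration is only practical for short sequences.

```r
# Collapse runs of identical states: the distinct successive states (DSS)
dss <- function(states) rle(states)$values

# Enumerate the distinct subsequences of a state vector by trying every
# subset of positions (2^n of them) and dropping duplicates
distinct_subsequences <- function(states) {
  n <- length(states)
  subs <- vapply(0:(2^n - 1), function(mask) {
    keep <- (mask %/% 2^(0:(n - 1))) %% 2 == 1  # bits of mask select positions
    paste(states[keep], collapse = "-")
  }, character(1))
  unique(subs)
}

first <- rep(c("EM", "TR", "EM"), times = c(4, 2, 64))  # (EM,4)-(TR,2)-(EM,64)
dss(first)                                # "EM" "TR" "EM"
subs <- distinct_subsequences(dss(first))
subs                                      # the empty subsequence shows up as ""
length(subs)                              # 7
```

Note that EM alone can be obtained from either the first or the last position, but it is counted only once.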

Proceeding the same way, you can verify that your fourth sequence (which is equal to its DSS)

*-opened-*-discussed-merged-discussed

has 49 subsequences, 10 of which have length two:

*-opened, *-*, *-discussed, *-merged, opened-*, opened-discussed, opened-merged, discussed-merged, discussed-discussed, merged-discussed
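For longer sequences the total can be checked without enumerating anything. Below is a minimal base R sketch of the standard dynamic programme for counting distinct subsequences; it is not TraMineR's implementation, the function names are made up, and it assumes the missing state * is treated like any other state, which is what the message printed by seqsubsn suggests.

```r
# Count the distinct subsequences (empty one included) of a state vector.
# Each new element doubles the count, minus the subsequences that were
# already counted when the same element last appeared.
count_distinct_subsequences <- function(states) {
  n <- length(states)
  counts <- numeric(n + 1)   # counts[i + 1]: count for the first i elements
  counts[1] <- 1             # the empty subsequence
  last_seen <- list()        # last position of each state, keyed by name
  for (i in seq_len(n)) {
    s <- states[i]
    counts[i + 1] <- 2 * counts[i]
    prev <- last_seen[[s]]
    if (!is.null(prev)) counts[i + 1] <- counts[i + 1] - counts[prev]
    last_seen[[s]] <- i
  }
  counts[n + 1]
}

split_states <- function(s) strsplit(s, "-", fixed = TRUE)[[1]]

count_distinct_subsequences(split_states("*-opened-*-discussed-merged-discussed"))            # 49
count_distinct_subsequences(split_states("*-discussed-*-discussed-*-discussed-*-discussed"))  # 88
count_distinct_subsequences(c("EM", "TR", "EM"))                                              # 7
```

The first two calls reproduce the values reported for sequences [4] and [3] in the question, both of which contain no consecutive repeats and so equal their own DSS.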

Hope this helps.

Answered 2013-12-23T07:46:46.263