r - R: Extracting times from an srt (subtitle) file

Question

I need to calculate the speaking rate for each subtitle line. The contents of an srt (subtitle) file look like this:

1
00:00:19,000 --> 00:00:21,989
I'm Annita McVeigh and welcome to Election Today where we'll bring you

2
00:00:22,000 --> 00:00:23,989
the latest from the campaign trail, plus debate and analysis.

3
00:00:24,000 --> 00:00:28,989
The Liberal Democrats promise to protect the pay of millions

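For example, it takes 4 seconds and 989 milliseconds to say the 10 words "The Liberal Democrats promise to protect the pay of millions". The average speaking rate for these 10 words is 498.9 milliseconds per word.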

How can I read the srt file so that I get a data frame with startTime, endTime, textString and wordCount as columns and the subtitle lines as rows, like this?

startTime<-c("00:00:19,000", "00:00:22,000", "00:00:24,000")

endTime<-c("00:00:21,989", "00:00:23,989", "00:00:28,989")

textString<-c("I'm Annita McVeigh and welcome to Election Today where we'll bring you", "the latest from the campaign trail, plus debate and analysis.", "The Liberal Democrats promise to protect the pay of millions")

wordCount<-c(12,10,10)

rate.df<-data.frame(startTime, endTime, textString, wordCount)

How do I subtract startTime from endTime in R when the times are given as hours:minutes:seconds,milliseconds?

score 2 · Accepted Answer

Here is a possible solution (the code is fairly self-explanatory):

text="

1
00:00:19,000 --> 00:00:21,989
I'm Annita McVeigh and welcome to Election Today where we'll bring you

2
00:00:22,000 --> 00:00:23,989
the latest from the campaign trail, 
plus debate 
and analysis.



3
00:00:24,000 --> 00:00:28,989
The Liberal Democrats promise to protect 
the pay of millions"

con<-textConnection(text)
lines <- readLines(con) 

# the previous lines of code are just to replicate your case, and
# they should be replaced by the following single line in the real case
# lines <- readLines(srtFileName)

listOfEntries <- 
lapply(split(seq_along(lines),cumsum(grepl("^\\s*$",lines))),function(blockIdx){
    block <- lines[blockIdx]
    block <- block[!grepl("^\\s*$",block)]
    if(length(block) == 0){
      return(NULL)
    }
    if(length(block) < 3){
      warning("a block not respecting srt standards has been found")
    }
    return(data.frame(id=block[1], 
                      times=block[2], 
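                      # block[-(1:2)] drops id and times; unlike block[3:length(block)]
                      # it stays empty (not NA) when a malformed block has no text lines
                      textString=paste0(block[-(1:2)],collapse="\n"),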
                      stringsAsFactors = FALSE))
  })
m <- do.call(rbind,listOfEntries)


# split start and end times
tmp <- do.call(rbind,strsplit(m[,'times'],' --> '))
m$startTime <- tmp[,1]
m$endTime <- tmp[,2]

# parse times
tmp <- do.call(rbind,lapply(strsplit(m$startTime,':|,'),as.numeric))
m$fromSeconds  <- tmp %*% c(60*60,60,1,1/1000)

tmp <- do.call(rbind,lapply(strsplit(m$endTime,':|,'),as.numeric))
m$toSeconds  <- tmp %*% c(60*60,60,1,1/1000)

# compute time difference in seconds
m$timeDiffInSecs <- m$toSeconds - m$fromSeconds

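# word count (number of runs of non-word characters plus one, so
# apostrophes and trailing punctuation add to the total)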
m$wordCount <- vapply(gregexpr("\\W+",m$textString),length,0) + 1

# or if you consider "I'm" a single word you can remove the occurrences of ', e.g.:
#m$wordCount <- vapply(gregexpr("\\W+",gsub("'","",m$textString)),length,0) + 1

m$millisecsPerWord <- m$timeDiffInSecs * 1000 / m$wordCount

Result:

> m
  id                         times                                                             textString
2  1 00:00:19,000 --> 00:00:21,989 I'm Annita McVeigh and welcome to Election Today where we'll bring you
3  2 00:00:22,000 --> 00:00:23,989      the latest from the campaign trail, \nplus debate \nand analysis.
6  3 00:00:24,000 --> 00:00:28,989         The Liberal Democrats promise to protect \nthe pay of millions
     startTime      endTime fromSeconds toSeconds timeDiffInSecs wordCount millisecsPerWord
2 00:00:19,000 00:00:21,989          19    21.989          2.989        14         213.5000
3 00:00:22,000 00:00:23,989          22    23.989          1.989        11         180.8182
6 00:00:24,000 00:00:28,989          24    28.989          4.989        10         498.9000
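
The wordCount column above shows 14 and 11 for the first two entries, while the question expects 12 and 10. The gap comes from counting runs of non-word characters: the apostrophes in "I'm" and "we'll" and the trailing period each add one. If a word is taken to mean a whitespace-separated token, a sketch along the following lines may match the question's numbers more closely. The helper name countWords is hypothetical and not part of the original answer, and the sample vectors simply repeat the values from the result above.

```r
# Hypothetical helper (an assumption, not from the original answer):
# count words as whitespace-separated tokens.
countWords <- function(x) {
  # trimws() removes leading/trailing whitespace so strsplit() does not
  # produce an empty first token; "\\s+" splits on runs of spaces,
  # tabs or newlines; lengths() gives the number of tokens per string
  lengths(strsplit(trimws(x), "\\s+"))
}

# Sample data copied from the result above
textString <- c(
  "I'm Annita McVeigh and welcome to Election Today where we'll bring you",
  "the latest from the campaign trail, \nplus debate \nand analysis.",
  "The Liberal Democrats promise to protect \nthe pay of millions"
)
timeDiffInSecs <- c(2.989, 1.989, 4.989)

wordCount <- countWords(textString)
millisecsPerWord <- timeDiffInSecs * 1000 / wordCount
```

With the data frame from the answer, the same idea would be applied as m$wordCount <- countWords(m$textString). Whether punctuation-only tokens or hyphenated words should count as words is a judgment call that this sketch does not address.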
