Introduction

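Given a string in R, is there a vectorized solution (i.e., one without loops) for splitting the string into chunks, where each chunk is delimited by the nth occurrence of a substring in the string?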

What I have done, with a reproducible example

Suppose we have several paragraphs of the well-known Lorem Ipsum text.

library(strex)
# devtools::install_github("aakosm/lipsum")
library(lipsum)

my.string = capture.output(lipsum(5))
my.string = paste(my.string, collapse = " ")

> my.string # (partial output)
# [1] "Lorem ipsum dolor ... id est laborum. "

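We wish to split this text into segments at every 3rd occurrence of the word " in" (the leading space is included to distinguish it from words that contain "in" as a part, such as "min").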
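To make the matching rule concrete, here is a small base R check on a made-up string (toy is a hypothetical example, not part of the Lorem Ipsum text). It suggests that " in" skips "minim" but still matches the start of words such as "incididunt", which appears consistent with where the first chunk ends in the output shown further down.

```r
# Hypothetical toy string, used only to illustrate what " in" matches
toy <- "minim in dolor incididunt in culpa"

# Start positions of every fixed (non-regex) match of " in"
hits <- as.integer(gregexpr(" in", toy, fixed = TRUE)[[1]])
hits
# 6 15 26

# The second match is the start of " incididunt", not a standalone "in"
substring(toy, hits[2], hits[2] + 10)
```

If only the standalone word should count, a pattern with a space on both sides (" in ") would be one option.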

I have the following solution, which uses a while loop:

# We wish to break up the string at every
# 3rd occurrence of the word "in"

break.character = " in"
break.occurrence = 3
string.list = list()
i = 1

# initialize string to send into the loop
current.string = my.string

while(length(current.string) > 0){

  # Store the segment that occurs BEFORE the nth occurrence of the break string
  string.list[[i]] = str_before_nth(current.string, break.character, break.occurrence)

  # Update the next string to examine.
  # It is the part of the current string AFTER the nth occurrence of the break string
  current.string = str_after_nth(current.string, break.character, break.occurrence)

  i = i + 1
}

We get the desired output in a list, along with warnings (warnings not shown):

> string.list (#partial output shown)
[[1]]
[1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit"

[[2]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.  Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"
...

[[6]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.  Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"

Goal

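Can this solution be improved through vectorization (i.e., using apply(), lapply(), mapply(), etc.)? In addition, my current solution cuts off the last occurrence of the substring in each chunk.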

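The current solution may not scale to extremely long strings (for example, DNA sequences in which we are looking for chunks delimited by the nth occurrence of a nucleotide substring).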

2 Answers

Try this:

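# Split on the delimiter; my.string is the text from the question
text_split = strsplit(my.string, " in ")[[1]]

l = length(text_split)
n = floor(l/3)
# Start index of each group of 3 pieces (step of 3 so that groups do not overlap)
Seq = seq(1, by = 3, length.out = n)

# Rejoin each group of 3 pieces and append the delimiter that closed it
L = sapply(Seq, function(x){
  paste0(paste(text_split[x:(x+2)], collapse = " in "), " in ")
})
# Leftover pieces when the number of pieces is not a multiple of 3
if (l > (n*3)){
  L = c(L, paste(text_split[(n*3+1):l], collapse = " in "))
}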

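The final condition handles the case where the number of pieces produced by splitting on " in " is not divisible by 3. Also, the trailing " in " pasted inside sapply() is there because I did not know what you wanted to do with the " in " that separates your chunks.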
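The same grouping idea can also be written without computing start indices by hand. The following is only a sketch in base R, under the assumption that the delimiter separating two chunks may be dropped; the toy input and the names pieces, group, and chunks are made up for illustration.

```r
# Hypothetical toy input: five pieces separated by " in "
toy <- "a in b in c in d in e"

# Split on the fixed delimiter
pieces <- strsplit(toy, " in ", fixed = TRUE)[[1]]

# Assign every run of 3 consecutive pieces to the same group: 1 1 1 2 2
group <- ceiling(seq_along(pieces) / 3)

# Rejoin the pieces within each group; a short final group is handled automatically
chunks <- vapply(split(pieces, group), paste, character(1),
                 collapse = " in ", USE.NAMES = FALSE)
chunks
# "a in b in c" "d in e"
```

Because the group index comes from ceiling(), no separate branch is needed for leftover pieces.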

answered 2019-04-04T16:27:02.393

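Let me know if this works. I will work on making it faster. It keeps the third " in" inside the chunk. If it works, I will also comment it more thoroughly.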

library(lipsum)
library(stringi)

my.string = capture.output(lipsum(5))
my.string = paste(my.string, collapse = " ")

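# End position of every " in " match
end_of_in <- stri_locate_all(fixed = " in ", my.string)[[1]][, 2]
# Chunks start at 1 and at the end position of every 3rd match
start_of_strings <- c(1, end_of_in[c(FALSE, FALSE, TRUE)])
# Each chunk ends one character before the next start; the last ends with the string
end_of_strings <- c(end_of_in[c(FALSE, FALSE, TRUE)] - 1, nchar(my.string))
end_of_strings <- end_of_strings[!duplicated(end_of_strings)]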


stri_sub(my.string, start_of_strings, end_of_strings)
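For readers without stringi, the same position-based idea can be sketched in base R with gregexpr() and substring(). This is an illustration under my own assumptions rather than the code benchmarked in this answer: split_every_nth is a hypothetical helper name, the pattern is treated as a fixed string, and each chunk keeps the nth match at its end.

```r
# Hypothetical helper: cut x right after every nth fixed match of pattern
split_every_nth <- function(x, pattern, n) {
  hits <- gregexpr(pattern, x, fixed = TRUE)[[1]]
  if (hits[1] == -1L) return(x)                  # no match at all
  # Position of the last character of every match
  ends <- as.integer(hits) + attr(hits, "match.length") - 1L
  # Keep only the end of every nth match
  cuts <- ends[seq_along(ends) %% n == 0L]
  starts <- c(1L, cuts + 1L)
  stops <- c(cuts, nchar(x))
  keep <- starts <= stops                        # drop an empty trailing chunk
  substring(x, starts[keep], stops[keep])        # vectorized extraction
}

# Toy input made up for illustration
toy <- "a in b in c in d in e in f in g"
chunks <- split_every_nth(toy, " in", 3)
chunks
# "a in b in c in" " d in e in f in" " g"
```

Pasting the chunks back together with paste0(chunks, collapse = "") should reproduce the input, since no characters are dropped.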

Edit: Actually, use stri_sub from stringi. It should perform better than substring on long strings. See:

my.string <- paste(rep(my.string, 10000), collapse = " ")
nchar(my.string)
[1] 22349999

microbenchmark::microbenchmark(
  sol1 = {
    text_split=strsplit(my.string," in ")[[1]]

    l=length(text_split)
    n = floor(l/3)
    Seq = seq(1,by=2,length.out = n)

    L= list()
    L=sapply(Seq, function(x){
      paste0(paste(text_split[x:(x+2)],collapse=" in ")," in ")
    })
    if (l>(n*3)){
      L = c(L,paste(text_split[(n*3+1):l],collapse=" in "))
    }
  },
  sol2 = {
    end_of_in <- stri_locate_all(fixed = " in ", my.string)[[1]][,2]
    start_of_strings <- c(1, end_of_in[c(F, F, T)]) 
    end_of_strings <- c(end_of_in[c(F, F, T)] - 1, nchar(my.string))
    end_of_strings <- end_of_strings[!duplicated(end_of_strings)]
    stri_sub(my.string, start_of_strings, end_of_strings)
  },
  times = 10
)

Unit: milliseconds
 expr      min        lq      mean    median        uq       max neval
 sol1 914.1268 927.45958 941.36117 939.80361 950.18099 980.86941    10
 sol2  55.4163  56.40759  58.53444  56.86043  57.03707  71.02974    10
answered 2019-04-04T17:07:26.190