regex - Split a text vector into chunks in R using strsplit(...)

Question

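Please help me with my small project.

I have a large list of text elements. Each element should be split into a small list of sentences. Each small list should be "saved" as one element in a new column of the initial large list, at the same position ("row") as the original text element.

The split criteria are "/$", "und/KON", "oder/KON". Each of these should be kept at the head of the new small-list element it starts.

I have tried regular expressions such as "/$|und/KON|oder/KON", with "$", "|", "/" escaped. I have also tried changing the arguments perl = TRUE, fixed = TRUE and FALSE. Every time I try, nothing happens. It seems that | is not interpreted correctly. Do you have any suggestions to solve this?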

library(stringr) # don't know if it's required

# Input list to be split at each of
#      "/$", "und/KON", "oder/KON"
#      but should keep the expression at the start of the next list element
#      
#      Would be nice but not necessary: The small-list to be named after the ID in the first column

r <- list(ID=c(01, 02, 03),
          elements=c("This should become my first small-list :/$. the first element ,/$, the second element ,/$, and the third element ./$.",
                     "This should become my second small-list :/$. Element eins und/KON Element zwei oder/KON Element drei ./$.",
                     "This should become my third small-list :/$. Element Alpha und/KON Element Beta oder/KON Element Gamma ./$."))

# Would look something like 
r$small_lists <- sapply(r$elements, function(x) as.list(strsplit(x, "/$|und/KON|oder/KON", fixed=TRUE)))
> r$small_lists

$01
[1] "This should become my first small-list "
[2] ":/$. the first element "
[3] ",/$, the second element "
[4] ",/$, and the third element "
[5] "./$."

$02 
[1] "This should become my second small-list "
[2] ":/$. Element eins "
[3] "und/KON Element zwei "
[4] "oder/KON Element drei"
[5] "./$."

$03
[1] "This should become my third small-list "
[2] ":/$. Element Alpha "
[3] "und/KON Element Beta "
[4] "oder/KON Element Gamma "
[5] "./$."

> class(r)
[1] "list"
> class(r$small_lists)
[1] "list"

score 3 · Accepted Answer

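If that is your desired output, you actually have more patterns to split on than you indicated. Note that my patterns differ from yours: all special characters are escaped with \\.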
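To illustrate why the escaping matters, here is a small sketch on a made-up string `x` (not part of the original data). With fixed = TRUE the whole pattern, including the |, is taken literally; without escaping, $ acts as an end-of-string anchor, so "/$" only means "a slash at the very end".

```r
# Hypothetical sample string, for illustration only
x <- "first :/$. eins und/KON zwei ./$."

# fixed = TRUE: "|" is a literal character, so the whole string
# "/$|und/KON|oder/KON" would have to occur verbatim in x
grepl("/$|und/KON|oder/KON", x, fixed = TRUE)
# [1] FALSE

# Unescaped "$" anchors to the end of the string, so only "und/KON" is found
unescaped <- regmatches(x, gregexpr("/$|und/KON|oder/KON", x))[[1]]
unescaped
# [1] "und/KON"

# Escaped "\\$" matches the literal dollar sign
escaped <- regmatches(x, gregexpr("/\\$|und/KON|oder/KON", x))[[1]]
length(escaped)
# [1] 3
```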

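To keep things manageable, I would create a separate vector of the patterns you want to split on, paste them together into one master pattern, search for them and prepend some string that you know does not occur in your text, and then split on that string.

Here are the "patterns" I identified: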

Pattern <- c(":/\\$", ",/\\$", "\\./\\$",
             "und/KON", "oder/KON")

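We can paste these patterns together to get the master pattern. The collapse argument of the inner paste is the pipe symbol, which matches any one of the alternatives. The whole pattern is wrapped in parentheses, ( and ), so that we can refer to the match later.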

Pattern <- paste("(", paste(Pattern, collapse = "|"), ")", sep = "")
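As a quick check, the resulting master pattern can be printed with cat, which shows the regex as the engine sees it (single backslashes). The sketch below repeats the two steps so that it runs on its own; the name `Pattern0` is introduced here only for illustration, and paste0 is shown as an equivalent of paste(..., sep = "").

```r
# Individual patterns, with regex metacharacters escaped
Pattern0 <- c(":/\\$", ",/\\$", "\\./\\$", "und/KON", "oder/KON")

# Join the alternatives with "|" and wrap them in one capture group
Pattern <- paste("(", paste(Pattern0, collapse = "|"), ")", sep = "")

cat(Pattern, "\n", sep = "")
# (:/\$|,/\$|\./\$|und/KON|oder/KON)

# paste0() is shorthand for paste(..., sep = "")
identical(Pattern, paste0("(", paste(Pattern0, collapse = "|"), ")"))
# [1] TRUE
```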

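We can now use gsub to put a "prefix" in front of each match (the match itself is what \\1 refers to). We need the prefix because you want to keep the matched expressions.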

## Insert some text pattern you know doesn't occur in your text
## Here, I've prepended the matched patterns with "^&*"
## You now have something on which you can split
strsplit(gsub(Pattern, "^&*\\1", r$elements), "^&*", fixed = TRUE)
# [[1]]
# [1] "This should become my first small-list "
# [2] ":/$. the first element "                
# [3] ",/$, the second element "               
# [4] ",/$, and the third element "            
# [5] "./$."                                   
# 
# [[2]]
# [1] "This should become my second small-list "
# [2] ":/$. Element eins "                      
# [3] "und/KON Element zwei "                   
# [4] "oder/KON Element drei "                  
# [5] "./$."                                    
# 
# [[3]]
# [1] "This should become my third small-list "
# [2] ":/$. Element Alpha "                    
# [3] "und/KON Element Beta "                  
# [4] "oder/KON Element Gamma "                
# [5] "./$."
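The approach relies on the marker string not occurring anywhere in the text. One way to guard against that is to wrap the two steps in a small function that checks first. This is only a sketch: `split_keep` and its `sentinel` argument are hypothetical names, not part of the original answer, and the sample string is made up.

```r
# Hypothetical helper: split x in front of each match of `pattern`.
# `pattern` must contain one capture group around the alternatives.
# `sentinel` must not contain backslashes, since it is used inside
# the gsub() replacement string.
split_keep <- function(x, pattern, sentinel = "^&*") {
  # Stop early if the sentinel already occurs in the input
  stopifnot(!any(grepl(sentinel, x, fixed = TRUE)))
  marked <- gsub(pattern, paste0(sentinel, "\\1"), x)
  strsplit(marked, sentinel, fixed = TRUE)
}

Pattern <- "(:/\\$|,/\\$|\\./\\$|und/KON|oder/KON)"
res <- split_keep("Intro :/$. eins und/KON zwei ./$.", Pattern)
res[[1]]
# four pieces: "Intro ", ":/$. eins ", "und/KON zwei ", "./$."
```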

Continuing from above, to get a named list as you described:

out <- strsplit(gsub(Pattern, "^&*\\1", r$elements), "^&*", fixed = TRUE)
setNames(lapply(out, `[`, -1), lapply(out, `[`, 1))
# $`This should become my first small-list `
# [1] ":/$. the first element "    
# [2] ",/$, the second element "   
# [3] ",/$, and the third element "
# [4] "./$."                       
# 
# $`This should become my second small-list `
# [1] ":/$. Element eins "    
# [2] "und/KON Element zwei " 
# [3] "oder/KON Element drei "
# [4] "./$."                  
# 
# $`This should become my third small-list `
# [1] ":/$. Element Alpha "    
# [2] "und/KON Element Beta "  
# [3] "oder/KON Element Gamma "
# [4] "./$."
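If the small lists should instead be named after the ID and stored back in the original list, as the question asks, something along these lines might work. It is a sketch on shortened, made-up input; the zero-padded name format ("01", "02", ...) is an assumption based on the desired output in the question.

```r
# Shortened, hypothetical input with the same structure as in the question
r <- list(
  ID = c(1, 2, 3),
  elements = c("One :/$. a ,/$, b ./$.",
               "Two :/$. a und/KON b ./$.",
               "Three :/$. a oder/KON b ./$.")
)

Pattern <- "(:/\\$|,/\\$|\\./\\$|und/KON|oder/KON)"
out <- strsplit(gsub(Pattern, "^&*\\1", r$elements), "^&*", fixed = TRUE)

# Name each small list after its ID (zero-padded width is an assumption)
r$small_lists <- setNames(out, sprintf("%02d", r$ID))

names(r$small_lists)
# [1] "01" "02" "03"
length(r$small_lists[["02"]])
# [1] 4
```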
