r - Extracting string between words using logical operators in rm_between function

Question

I am trying to extract a string between words. Consider this example -

x <-  "There are 2.3 million species in the world"

This may also take another form, namely

x <-  "There are 2.3 billion species in the world"

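I need the text between 'There' and either 'million' or 'billion', including both markers. Whether million or billion is present is decided at run time, not beforehand. So the output I need from this sentence is

[1] There are 2.3 million

or

[2] There are 2.3 billion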

I am using the rm_between function from the qdapRegex package. With this command I can only extract one of them at a time.

library(qdapRegex)
rm_between(x, 'There', 'million', extract=TRUE, include.markers = TRUE)

Or I have to use

rm_between(x, 'There', 'billion', extract=TRUE, include.markers = TRUE)

How can I write a single command that checks for the presence of either million or billion in the same sentence? Something like this -

rm_between(x, 'There', 'billion' || 'million', extract=TRUE, include.markers = TRUE)

I hope this is clear. Any help would be appreciated.

score 3 · Accepted Answer

You may use str_extract_all (for global matching) or str_extract (for a single match):

library(stringr)
str_extract_all(x, "\\bThere\\b.*?\\b(?:million|billion)\\b")

or

str_extract_all(x, perl("(?<!\\S)There(?=\\s+).*?\\s(?:million|billion)(?!\\S)"))
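For readers who do not have stringr installed, the same alternation idea can be sketched in base R. This is only an illustration, not part of the original answer: the vector `x` is assumed to hold the two example sentences from the question, and the names `pattern` and `hits` are made up for the example.

```r
# Base R sketch; `x` mirrors the example sentences from the question.
x <- c(
    "There are 2.3 million species in the world",
    "There are 2.3 billion species in the world"
)

# The non-capturing group (?:million|billion) accepts either closing word;
# the lazy `.*?` stops at the first one it reaches.
pattern <- "\\bThere\\b.*?\\b(?:million|billion)\\b"

# regexpr() locates the first match in each element and regmatches()
# extracts it. perl = TRUE selects the PCRE engine for this pattern.
hits <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(hits)
#> [1] "There are 2.3 million" "There are 2.3 billion"
```

Note that regmatches() combined with regexpr() silently drops elements that have no match, so the result can be shorter than the input.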

score 3 · Accepted Answer

The left and right arguments of rm_between take a vector of character/numeric symbols. So you can use vectors of equal length in both the left and right arguments.

 library(qdapRegex)
 unlist(rm_between(x, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million" "There are 2.3 billion"
 unlist(rm_between(x1, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million"

 unlist(rm_between(x2, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 billion"

Or

  sub('\\s*species.*', '', x)

data

 x <-  c("There are 2.3 million species in the world", 
   "There are 2.3 billion species in the world")
 x1 <- "There are 2.3 million species in the world"
 x2 <- "There are 2.3 billion species in the world"

score 2 · Accepted Answer

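~~You can supply a vector for multiple markers of equal length to rm_between, as stated in the documentation.~~

Edit

See @TylerRinker's answer regarding rm_between.

That said, another approach that lets you use a user-defined regular expression is rm_default: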

rm_default(x, pattern='There.*?[bm]illion', extract=TRUE)

Example:

library(qdapRegex)

x <-  c(
    'There are 2.3 million species in the world',
    'There are 2.3 billion species in the world'
)

rm_default(x, pattern = 'There.*?[bm]illion', extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"

## [[2]]
## [1] "There are 2.3 billion"

score 2 · Accepted Answer

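The response from @hwnd (my qdapRegex co-author) started a discussion that led to a new argument, fixed, for rm_between. The following description is from the development version:

rm_between and rm_between_multiple pick up a fixed argument. Previously, left and right boundaries containing regular expression special characters were fixed (escaped) by default. This did not allow for the powerful use of regular expressions for the left/right boundaries. The fixed = TRUE behavior is still the default, but users can now set fixed = FALSE to use regex boundaries. This new feature was inspired by @Ronak Shah's StackOverflow question: Extracting string between words using logical operators in rm_between function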
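The difference between a fixed (literal) boundary and a regex boundary can be seen with base R's grepl, which has its own fixed argument. This is only an analogy under that assumption: it does not call qdapRegex and says nothing about how rm_between is implemented. The names `marker`, `literal_hit` and `regex_hit` are made up for the example.

```r
x <- c(
    "There are 2.3 million species in the world",
    "There are 2.3 billion species in the world"
)

marker <- "[mb]illion"

# fixed = TRUE: the marker is searched for as the literal text "[mb]illion",
# which appears in neither sentence.
literal_hit <- grepl(marker, x, fixed = TRUE)

# fixed = FALSE (grepl's default): "[mb]" is a character class, so the
# marker matches both "million" and "billion".
regex_hit <- grepl(marker, x, fixed = FALSE)

print(literal_hit)
#> [1] FALSE FALSE
print(regex_hit)
#> [1] TRUE TRUE
```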

To install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_install_gh("trinker/qdapRegex")

Using qdapRegex version >= 4.1 you can do the following.

x <-  c(
    "There are 2.3 million species in the world",
    "There are 2.3 billion species in the world"
)

rm_between(x, left='There', right = '[mb]illion', fixed = FALSE,
    include=TRUE, extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"
## 
## [[2]]
## [1] "There are 2.3 billion"
