I am trying to extract a string between words. Consider this example -

x <-  "There are 2.3 million species in the world"

This could also take another form, namely

x <-  "There are 2.3 billion species in the world"

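I need the text between 'There' and 'million' or 'billion', including both markers. Whether million or billion is present is decided at run time, not in advance. So the output I need from this sentence is

[1] There are 2.3 million
or
[2] There are 2.3 billion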

I am using the rm_between function from the qdapRegex package. With this command I can extract only one of them at a time.

library(qdapRegex)
rm_between(x, 'There', 'million', extract=TRUE, include.markers = TRUE) 

Or I have to use

rm_between(x, 'There', 'billion', extract=TRUE, include.markers = TRUE)

How can I write a single command that checks for either million or billion in the same sentence? Something like this -

rm_between(x, 'There', 'billion' || 'million', extract=TRUE, include.markers = TRUE)

I hope this is clear. Any help would be appreciated.

4 Answers

You may use str_extract_all (for global matching) or str_extract (single match)

library(stringr)
str_extract_all(x, "\\bThere\\b.*?\\b(?:million|billion)\\b")

or

# perl() was deprecated in stringr 1.0.0; regex() takes its place
str_extract_all(x, regex("(?<!\\S)There(?=\\s+).*?\\s(?:million|billion)(?!\\S)"))
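
For readers who would rather not load a package, the first pattern above can also be tried in base R. This is only a minimal sketch, not part of the original answer; it assumes `x` holds the two example sentences from the question and uses `perl = TRUE` so that the lazy quantifier and non-capturing group behave as in PCRE.

```r
# Example data, copied from the question.
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

# Lazy match from "There" up to the first "million" or "billion".
pattern <- "\\bThere\\b.*?\\b(?:million|billion)\\b"

# regexpr() locates the first match in each element;
# regmatches() extracts the matched text (non-matching elements are dropped).
first_match <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(first_match)

# gregexpr() locates every match and yields one list entry per element.
all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
print(all_matches)
```

Note that `regmatches()` combined with `regexpr()` silently drops elements with no match, so the result can be shorter than the input.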
Answered 2015-07-25T04:55:22.163

The left and right arguments in rm_between take a vector of character/numeric symbols. So you can pass vectors of equal length to both the left and right arguments.

 library(qdapRegex)
 unlist(rm_between(x, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million" "There are 2.3 billion"
 unlist(rm_between(x1, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 million"

 unlist(rm_between(x2, rep('There',2), c('million', 'billion'),
                         extract=TRUE, include.markers=TRUE))
 #[1] "There are 2.3 billion"

Or

  sub('\\s*species.*', '', x)

data

 x <-  c("There are 2.3 million species in the world", 
   "There are 2.3 billion species in the world")
 x1 <- "There are 2.3 million species in the world"
 x2 <- "There are 2.3 billion species in the world"
Answered 2015-07-25T04:56:24.530

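As the documentation states, you can supply rm_between with vectors of multiple markers of equal length.

EDIT

See @TylerRinker's answer on rm_between.

Alternatively, rm_default lets you supply your own regular expression: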

rm_default(x, pattern='There.*?[bm]illion', extract=TRUE)

Example

library(qdapRegex)

x <-  c(
    'There are 2.3 million species in the world',
    'There are 2.3 billion species in the world'
)

rm_default(x, pattern = 'There.*?[bm]illion', extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"

## [[2]]
## [1] "There are 2.3 billion"
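
The `.*?` in that pattern is lazy, which matters once a sentence contains more than one possible end marker. The base R sketch below illustrates the difference; the sentence `y` is a made-up example, not taken from the question, and `perl = TRUE` is used only to make the quantifier semantics explicit.

```r
# Hypothetical sentence with two possible end markers (not from the question).
y <- "There are 2.3 million species and 1 billion insects"

# Lazy quantifier: stops at the first "million"/"billion" after "There".
lazy <- regmatches(y, regexpr("There.*?[bm]illion", y, perl = TRUE))

# Greedy quantifier: runs on to the last "million"/"billion" in the string.
greedy <- regmatches(y, regexpr("There.*[bm]illion", y, perl = TRUE))

print(lazy)    # "There are 2.3 million"
print(greedy)  # "There are 2.3 million species and 1 billion"
```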
Answered 2015-07-25T05:35:24.917

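The response from @hwnd (my qdapRegex co-author) sparked a discussion that led to a new argument, fixed, for rm_between. The following description is from the development version:

rm_between and rm_between_multiple pick up a fixed argument. Previously, left and right boundaries containing regex special characters were fixed (escaped) by default, which did not allow for powerful regex use on the left/right boundaries. The fixed = TRUE behavior is still the default, but users can now set fixed = FALSE to work with regex boundaries. This new feature was inspired by @Ronak Shah's StackOverflow question: Extracting string between words using logical operators in rm_between function

To install the development version: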

if (!require("pacman")) install.packages("pacman")
pacman::p_install_gh("trinker/qdapRegex")

Using qdapRegex version >= 4.1, you can do the following.

x <-  c(
    "There are 2.3 million species in the world",
    "There are 2.3 billion species in the world"
)

rm_between(x, left = 'There', right = '[mb]illion', fixed = FALSE,
    include.markers = TRUE, extract = TRUE)

## [[1]]
## [1] "There are 2.3 million"
## 
## [[2]]
## [1] "There are 2.3 billion"
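
The difference between a fixed (literal) boundary and a regex boundary can be seen with base R's `grepl()`, whose own `fixed` argument plays a comparable role. This is only an analogy, not a description of how qdapRegex implements `fixed` internally; the variable names are illustrative.

```r
# Example data, copied from the answer above.
x <- c("There are 2.3 million species in the world",
       "There are 2.3 billion species in the world")

marker <- "[mb]illion"

# Treated literally, the marker only matches the exact text "[mb]illion",
# which does not occur in either sentence.
literal_hit <- grepl(marker, x, fixed = TRUE)

# Treated as a regex, the character class matches "million" or "billion".
regex_hit <- grepl(marker, x)

print(literal_hit)  # FALSE FALSE
print(regex_hit)    # TRUE TRUE
```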
Answered 2015-07-25T16:45:09.983