regex - Extracting n-grams in R with regular expressions

Question

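Karl Broman's post (https://kbroman.wordpress.com/2015/06/22/randomized-hobbit-2/) got me playing with regular expressions and n-grams just for fun. I tried to use a regex to extract 2-grams. I know there are parsers that do this, but I am interested in the regex logic (i.e., it was a self-challenge that I failed to meet).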

Below I give a minimal example and the desired output. The problem with my attempt is twofold:

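The grams (words) get consumed and are not available for the next pass. How can I make them available for the second pass? (For example, I want like to be available for like toast after it has already been used in I like.)
I could not keep the space between words from being captured (note the trailing whitespace in the output, even though I used (?:\\s*)). How can I avoid capturing the trailing space after the nth (in this case the second) word? I know this can be done simply with "(\\b[A-Za-z']+\\s)(\\b[A-Za-z']+)" for a 2-gram, but I want to extend the solution to n-grams. PS: I know about \\w, but I do not consider underscores and digits to be part of a word, whereas I do consider ' to be part of a word.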

MWE：

library(stringi)

x <- "I like toast and jam."

stringi::stri_extract_all_regex(
    x,
    pattern = "((\\b[A-Za-z']+\\b)(?:\\s*)){2}"
)

## [[1]]
## [1] "I like "    "toast and "

Desired output:

## [[1]]
## [1] "I like"  "like toast"    "toast and"  "and jam"

score 8 · Accepted Answer

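Here is one way to do it using base R regex. It can easily be extended to handle arbitrary n-grams. The trick is to put the capture group inside a positive look-ahead assertion, e.g., (?=(my_overlapping_pattern)). Because a look-ahead consumes no characters, the regex engine can match again at the start of the next word, so the captured grams overlap.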

x <- "I like toast and jam."
pattern <- "(?=(\\b[A-Za-z']+\\b \\b[A-Za-z']+\\b))"
matches <- gregexpr(pattern, x, perl = TRUE)
# a little post-processing needed to get the capture groups with regmatches
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
regmatches(x, matches)

# [[1]]
# [1] "I like"     "like toast" "toast and"  "and jam"
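
As a rough sketch of how the look-ahead trick above might be extended to arbitrary n, the pattern can be built by repeating the word sub-pattern n times. The helper names ngram_pattern and extract_ngrams are hypothetical (they are not from the answer), and the sketch assumes words are separated by exactly one space, as in the 2-gram pattern above.

```r
# Hypothetical helper: wrap n copies of the word pattern, joined by a single
# space, in a capturing group inside a positive look-ahead.
ngram_pattern <- function(n, word = "\\b[A-Za-z']+\\b") {
  paste0("(?=(", paste(rep(word, n), collapse = " "), "))")
}

# Hypothetical helper: return a list with one character vector of n-grams
# per element of x.
extract_ngrams <- function(x, n) {
  matches <- gregexpr(ngram_pattern(n), x, perl = TRUE)
  lapply(seq_along(x), function(i) {
    m <- matches[[i]]
    if (m[1] == -1) return(character(0))  # no n-gram found in this string
    # The look-ahead itself has length zero, so read the capture group's
    # position and length instead of the match's.
    starts <- attr(m, "capture.start")[, 1]
    lens <- attr(m, "capture.length")[, 1]
    substring(x[i], starts, starts + lens - 1)
  })
}

x <- "I like toast and jam."
extract_ngrams(x, 2)
extract_ngrams(x, 3)
```

One caveat with this word pattern: \\b treats ' as a non-word character, so for a contraction such as don't the engine can also start a match at the t after the apostrophe and produce a spurious gram. Input containing apostrophes would need a stricter word-start condition.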

score 2 · Accepted Answer

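Actually, there's an app for that: the quanteda package (for the quantitative analysis of textual data). My co-author Paul Nulty and I are working hard to improve it, but it already handles the use case you describe with ease.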

install.packages("quanteda")
require(quanteda)
x <- "I like toast and jam."
ngrams(x, 2)
## [[1]]
## [1] "i_like"     "like_toast" "toast_and"  "and_jam"   
ngrams(x, n = 2, concatenator = " ", toLower = FALSE)
## [[1]]
## [1] "I like"     "like toast" "toast and"  "and jam"

No painful regex needed!
