regex - Extract a string up to and including the first colon

Question

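I have a data set of strings and want to extract a substring up to and including the first colon. Earlier I posted here asking how to extract only the portion after the first colon: "Split string on first colon". Below I list a few of my attempts at solving the current problem.

I know that ^[^:]+: matches the portion I want to keep, but I cannot figure out how to extract that portion.

Here is an example data set and the desired result.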

my.data <- "here is: some text
here is some more.
even: more text
still more text
this text keeps: going."

my.data2 <- readLines(textConnection(my.data))

desired.result <- "here is:
0
even:
0
this text keeps:"

desired.result2 <- readLines(textConnection(desired.result))

# Here are some of my attempts

# discards line 2 and 4 but does not extract portion from lines 1,3, and 5.
ifelse( my.data2 == gsub("^[^:]+:", "", my.data2), '', my.data2)

# returns the portion I do not want rather than the portion I do want
sub("^[^:]+:", "\\1", my.data2, perl=TRUE)

# returns an entire line if it contains a colon
grep("^[^:]+:", my.data2, value=TRUE)

# identifies which rows contain a match
regexpr("^[^:]+:", my.data2)

# my attempt at anchoring the right end instead of the left end
regexpr("[^:]+:$", my.data2)

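This earlier question concerns returning the opposite of a match: "Regular expression opposite". Starting from the solution to my earlier question linked above, I have not figured out how to implement that approach in R.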

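I recently obtained RegexBuddy to study regular expressions. That is how I know that ^[^:]+: matches what I want. I just have not been able to use that information to extract the match.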

I am aware of the stringr package. Perhaps it can help, but I would prefer a solution in base R.

Thank you for any advice.

score 6 · Accepted Answer

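"I know that ^[^:]+: matches the portion I want to keep, but I cannot figure out how to extract that portion."

So just wrap it in parentheses, add ".+$" at the end, and use sub with a backreference: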

sub("(^[^:]+:).+$", "\\1", vec)

 step1 <- sub("^([^:]+:).+$", "\\1", my.data2)
 step2 <- ifelse(grepl(":", step1), step1, 0)
 step2
#[1] "here is:"         "0"                "even:"            "0"               
#[5] "this text keeps:"
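
To see why the trailing ".+$" matters, here is a minimal sketch on a single line from the sample data. The variable names group.only and group.and.rest are made up for illustration and are not part of the answer above.

```r
x <- "even: more text"   # one line from the sample data

# Group only: the matched prefix is replaced by itself, so the string is unchanged
group.only <- sub("(^[^:]+:)", "\\1", x)

# Group plus ".+$": the whole line is matched, and only the group is written back
group.and.rest <- sub("(^[^:]+:).+$", "\\1", x)

group.only
group.and.rest
```

The first call leaves the line as it was, while the second keeps only the text up to and including the first colon.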

It was not clear whether you wanted them as separate vector elements or pasted together with newlines:

> step3 <- paste0(step2, collapse="\n")
> step3
[1] "here is:\n0\neven:\n0\nthis text keeps:"
> cat(step3)
here is:
0
even:
0
this text keeps:
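
As an alternative to sub, base R can also pull out the matched text directly with regexpr and regmatches. This is only a sketch of that idea, not the method used in this answer. The sample vector is retyped here so the block runs on its own, and the name result is made up for illustration.

```r
my.data2 <- c("here is: some text", "here is some more.", "even: more text",
              "still more text", "this text keeps: going.")

# regexpr() gives the match position per element (-1 where there is no match)
m <- regexpr("^[^:]+:", my.data2)

# Start with "0" everywhere, then fill in the matched text where a match exists.
# regmatches() returns the matched substrings for the matching elements only.
result <- rep("0", length(my.data2))
result[m != -1] <- regmatches(my.data2, m)
result
```

This keeps the original pattern untouched, so no capture group or ".+$" suffix is needed.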

score 4 · Accepted Answer

This seems to produce what you are looking for (although it only returns the lines that contain a colon):

grep(":",gsub("(^[^:]+:).*$","\\1",my.data2 ),value=TRUE)
[1] "here is:"         "even:"            "this text keeps:"

As I was typing this, I saw @DWin's answer, which also suggests the parentheses, and whose ifelse also gives you the "0".

score 2 · Accepted Answer

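Another, less elegant approach using strsplit:

x <- strsplit(my.data2, ":")
lens <- sapply(x, length)
# strsplit drops the delimiter, so paste the colon back onto the first piece
y <- paste0(sapply(x, "[", 1), ":")
# lines that did not split contain no colon
y[lens==1] <- "0"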
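
One caveat with counting pieces: strsplit drops the delimiter and also a trailing empty piece, so a line that ends in a colon splits into a single piece, just like a line with no colon. A hedged sketch of a variant that detects colons with grepl instead follows. The vector lines and the names parts, has.colon and out are made up for illustration.

```r
lines <- c("here is: some text", "no colon here", "ends with a colon:")

parts <- strsplit(lines, ":", fixed = TRUE)
lengths(parts)   # the trailing-colon line yields one piece, like the colon-free line

# Detect the colon on the original strings rather than by counting pieces
has.colon <- grepl(":", lines, fixed = TRUE)

# Take the first piece and paste the colon back on, or "0" where there is none
out <- ifelse(has.colon, paste0(sapply(parts, "[", 1), ":"), "0")
out
```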
