15

Suppose I have some text like this:

text<-c("[McCain]: We need tax policies that respect the wage earners and job creators. [Obama]: It's harder to save. It's harder to retire. [McCain]: The biggest problem with American healthcare system is that it costs too much. [Obama]: We will have a healthcare system, not a disease-care system. We have the chance to solve problems that we've been talking about... [Text on screen]: Senators McCain and Obama are talking about your healthcare and financial security. We need more than talk. [Obama]: ...year after year after year after year. [Announcer]: Call and make sure their talk turns into real solutions. AARP is responsible for the content of this advertising.")

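I want to remove (edit: get rid of) all the text between [ and ] (and the brackets themselves). What is the best way to do this? Here is my feeble attempt using regex and the stringr package: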

str_extract(text, "\\[[a-z]*\\]")

Thanks for your help!


5 Answers

28

With this:

gsub("\\[[^\\]]*\\]", "", subject, perl=TRUE);

What the regex means:

  \[                       # '['
  [^\]]*                   # any character except: '\]' (0 or more
                           # times (matching the most amount possible))
  \]                       # ']'
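
To make the breakdown above concrete, here is a minimal sketch that runs the same pattern on a shortened sample. The sample string and the name `cleaned` are illustrative assumptions, not part of the original answer.

```r
# Shortened, illustrative sample of the question's text (an assumption for brevity)
subject <- "[McCain]: We need tax policies. [Obama]: It's harder to save."

# perl = TRUE selects the PCRE engine, where \] inside the class is an escaped ']'
cleaned <- gsub("\\[[^\\]]*\\]", "", subject, perl = TRUE)
print(cleaned)
#> [1] ": We need tax policies. : It's harder to save."
```

Note that the colon and the space after each speaker label are left behind, because the pattern only covers the bracketed part.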
answered 2014-05-31T05:25:10.063
10

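The following should do the trick. The ? forces a lazy (non-greedy) match, so .*? consumes as few characters as possible and stops at the first ] that follows.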

gsub('\\[.*?\\]', '', text)
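
As a rough illustration of why the ? matters, the sketch below compares the greedy and lazy forms on a short string with two bracketed parts. The sample string and the variable names are assumptions made for this example.

```r
# Illustrative input with two bracketed parts (not from the original post)
x <- "Some [string] here and [there]"

# Greedy: .* runs from the first '[' to the LAST ']', swallowing the text in between
greedy <- gsub("\\[.*\\]", "", x)

# Lazy: .*? stops at the first ']' after each '[', so each bracket pair is removed separately
lazy <- gsub("\\[.*?\\]", "", x)

print(greedy)
#> [1] "Some "
print(lazy)
#> [1] "Some  here and "
```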
answered 2014-05-31T05:26:19.023
3

Here is another approach:

library(qdap)
bracketX(text, "square")
answered 2014-05-31T07:42:25.010
3

There is no need to use a PCRE regex here; a "classic" TRE regex with a negated character class (bracket expression) works as well:

subject <- "Some [string] here and [there]"
gsub("\\[[^][]*]", "", subject)
## => [1] "Some  here and "

See the online R demo

Details

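  • \\[ - a literal [ (it must be escaped, or placed inside a bracket expression such as [[], to be parsed as a literal [)
  • [^][]* - a negated bracket expression that matches 0+ characters other than [ and ] (note that a ] at the start of a bracket expression is treated as a literal ])
  • ] - a literal ] (this character is not special in either PCRE or TRE regex and does not have to be escaped).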
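
One way to check the points in this list is to look at what the pattern actually matches before deleting anything. This is only a sketch; the variable names are illustrative.

```r
subject <- "Some [string] here and [there]"

# List every substring matched by the TRE pattern instead of removing it
matched <- regmatches(subject, gregexpr("\\[[^][]*]", subject))[[1]]
print(matched)
#> [1] "[string]" "[there]"

# A ']' placed first inside a bracket expression is a literal ']'
has_close <- grepl("[]]", "a]b")
print(has_close)
#> [1] TRUE
```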

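If you only want to replace the square brackets with other delimiters, use a capturing group in the pattern and a backreference to it in the replacement: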

gsub("\\[([^][]*)\\]", "{\\1}", subject)
## => [1] "Some {string} here and {there}"

See another demo

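The parenthesized construct (...) forms a capturing group, and its contents can be accessed with the backreference \1 (since the group is the first one in the pattern, its ID is 1).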
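
The same capturing-group idea could also be used to keep the speaker names from the question while dropping the brackets. The sketch below assumes a shortened sample string and an arbitrary replacement wording ("said:"), neither of which comes from the original post.

```r
# Shortened, illustrative sample of the question's text
x <- "[McCain]: We need tax policies. [Obama]: It's harder to save."

# Group 1 captures the name inside the brackets; \\1 re-inserts it in the replacement
relabelled <- gsub("\\[([^][]*)\\]:", "\\1 said:", x)
print(relabelled)
#> [1] "McCain said: We need tax policies. Obama said: It's harder to save."
```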

answered 2016-12-14T18:34:21.410
3

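I think this technically answers your question, but you may want to add \\: and a space to the end of the regex for prettier text (this also removes the colon and the space).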

library(stringr)
str_replace_all(text, "\\[.+?\\]", "")

#> [1] ": We need tax policies that respect the wage earners..."

versus...

str_replace_all(text, "\\[.+?\\]\\: ", "")
#> [1] "We need tax policies that respect the wage earners..." 

Created on 2018-08-16 by the reprex package (v0.2.0).
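
If some bracketed labels might not be followed by a colon, a possible base R variant makes the colon and the space optional. This is a sketch under that assumption; it swaps in the negated bracket expression instead of the lazy .+? used above, and the sample string is illustrative.

```r
# Shortened, illustrative sample of the question's text
x <- "[McCain]: We need tax policies. [Obama]: It's harder to save."

# ':?' and ' ?' make the trailing colon and space optional
tidy <- gsub("\\[[^][]*\\]:? ?", "", x)
print(tidy)
#> [1] "We need tax policies. It's harder to save."
```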

answered 2018-08-16T19:46:09.190