r - Replace a string in a tibble with part of that string

Question

I have searched through many regex answers here, but could not find a solution to this kind of problem.

My dataset is a tibble with Wikipedia links:

library(dplyr)     # provides %>% and mutate(), used further down
library(tidytext)
library(stringr)
text.raw <- "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]]."

I am trying to strip the link markup from my text. This:

str_extract_all(text.raw, "[a-zA-Z\\s]+(?=\\])")
# [1] "Duits"     "architect"

selects the words I need from inside the brackets.

This:

str_replace_all(text.raw, "\\[\\[.*?\\]\\]", str_extract(text.raw, "[a-zA-Z\\s]+(?=\\])"))
# [1] "Berthold Speer was een Duits Duits."

works as expected, but it is not what I need. This:

str_replace_all(text.raw, "\\[\\[.*?\\]\\]", str_extract_all(text.raw, "[a-zA-Z\\s]+(?=\\])"))
# Error: `replacement` must be a character vector

gives an error, whereas I expected "Berthold Speer was een Duits architect".

Currently my code looks like this:

text.clean <- data_frame(text = text.raw) %>%
  mutate(text = str_replace_all(text, "\\[\\[.*?\\]\\]", str_extract_all(text, "[a-zA-Z\\s]+(?=\\])")))

I hope someone knows a solution, or can point me to a duplicate question if one exists. My desired output is "Berthold Speer was een Duits architect".

score 5 · Accepted Answer

You can use a single gsub operation:

text <- "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]]."
gsub("\\[{2}(?:[^]|]*\\|)?([^]]*)]{2}", "\\1", text)

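The pattern matches:

  \[{2}          two literal [ characters
  (?:[^]|]*\|)?  an optional non-capturing group: any run of characters other than ] and |, followed by a literal |
  ([^]]*)        capturing group 1: any run of characters other than ]
  ]{2}           two literal ] characters

The replacement \1 keeps only the text captured by group 1, so each link is reduced to its label. (Inside the R string literal each backslash is doubled.)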
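The attempt in the question fails because str_extract_all() returns a list, while the replacement argument has to be a character vector. A backreference avoids the separate extraction step altogether. Below is a minimal sketch of the same idea applied to a whole column, using base R only. The helper name strip_wiki_links and the second row of sample data are made up for illustration and are not part of the original answer. The pattern is a slightly more explicit variant with every bracket escaped, which assumes perl = TRUE.

```r
# Hypothetical helper (not from the original answer): removes [[...]] link
# markup, keeping the label after the pipe if there is one, otherwise the
# link target itself.
strip_wiki_links <- function(x) {
  # Every bracket is escaped, so the PCRE engine (perl = TRUE) is assumed.
  gsub("\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]", "\\1", x, perl = TRUE)
}

# Sample data: the first row comes from the question, the second is invented
# to show that text without links is left untouched.
wiki <- data.frame(
  text = c(
    "Berthold Speer was een [[Duitsland (hoofdbetekenis)|Duits]] [[architect]].",
    "Geen links hier."
  ),
  stringsAsFactors = FALSE
)

# gsub() is vectorised, so the whole column is cleaned in one call.
wiki$text <- strip_wiki_links(wiki$text)
print(wiki$text)
```

Because the helper is vectorised, it should also drop into a dplyr pipeline as mutate(text = strip_wiki_links(text)). stringr's str_replace_all() likewise accepts "\\1" as a replacement, so the same pattern can be used there instead of gsub.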
