regex - Converting a parsed corpus to a data frame using stringr and regex

Question

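I am trying to convert a parsed corpus to a data frame in R using stringr and regular expressions (I have read that maybe I shouldn't use regular expressions for this kind of work, but I have spent a lot of time on this and would like to know whether there is a solution). The corpus looks like this: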

text <- paste("<w type=\"NP0\" lemma=\"dorothy\">Dorothy</w><c type=\"PUN\">, </c><w type=\"PRP\" lemma=\"in\">in </w><w type=\"DPS\" lemma=\"she\">her </w><w type=\"NN1\" lemma=\"time\">time</w><c type=\"PUN\">, </c><w type=\"VHD\" lemma=\"have\">had </w><w type=\"VBN\" lemma=\"be\">been </w><w type=\"AT0\" lemma=\"an\">an </w><w type=\"AJ0\" lemma=\"active\">active </w><w type=\"NN1\" lemma=\"member\">member </w><w type=\"PRF\" lemma=\"of\">of </w><w type=\"AT0\" lemma=\"an\">an </w><w type=\"NN1\" lemma=\"organisation\">organisation </w><w type=\"VVN-VVD\" lemma=\"call\">called </w><w type=\"AT0\" lemma=\"the\">the </w><w type=\"NN1\" lemma=\"noise\">Noise </w><w type=\"NN1\" lemma=\"reduction\">Reduction </w><w type=\"NN1\" lemma=\"society\">Society</w><c type=\"PUN\">, </c>")

I have gotten close to what I want with this:

library("stringr")

# Extract type
type <- str_extract_all(text, "<. type=\\\"(.*?)\\\"") %>%
    unlist()

# Extract word
word <- str_extract_all(text, ">(.*?)<\\/.>") %>%
    unlist()

# Convert to data frame
df <- data.frame(
    type = type, 
    word = word)

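The problem is that I only want what appears between delimiters such as <w type=\" and \", not those characters themselves, so something like this (for the first two words):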

df2 <- data.frame(type = c("NP0", "PUN"), word = c("Dorothy", ","))

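Again, I understand that I should learn a package meant for this kind of data, such as XML, but can I get what I want with regular expressions?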

score 2 · Accepted Answer

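You can use lookarounds to extract only the text between the delimiters. I have also added str_trim to remove the unwanted whitespace around the words.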

data.frame(
  type = str_extract_all(text , '(?<=type=\\")(.*?)(?=\\")')[[1]],
  word = str_trim(str_extract_all(text , '(?<=\\">)(.*?)(?=<)')[[1]], side = "both")
)    

#       type         word
# 1      NP0      Dorothy
# 2      PUN            ,
# 3      PRP           in
# 4      DPS          her
# 5      NN1         time
# 6      PUN            ,
# 7      VHD          had
# 8      VBN         been
# 9      AT0           an
# 10     AJ0       active
# 11     NN1       member
# 12     PRF           of
# 13     AT0           an
# 14     NN1 organisation
# 15 VVN-VVD       called
# 16     AT0          the
# 17     NN1        Noise
# 18     NN1    Reduction
# 19     NN1      Society
# 20     PUN            ,
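
A possible variation, offered as a sketch rather than as part of the accepted answer: the two lookaround patterns above are matched independently, so the type and word columns line up only if every element yields exactly one match for each pattern. Capturing both pieces with a single pattern via str_match_all keeps each type paired with its own word. The shortened sample string and the exact pattern below are assumptions for illustration; the pattern expects every element to be a <w> or <c> tag whose first attribute is type.

```r
library(stringr)

# Shortened sample in the same format as the corpus in the question
text <- "<w type=\"NP0\" lemma=\"dorothy\">Dorothy</w><c type=\"PUN\">, </c><w type=\"PRP\" lemma=\"in\">in </w>"

# One pattern with two capture groups:
#   group 1 = value of the type attribute, group 2 = text inside the element
m <- str_match_all(text, "<[wc] type=\"([^\"]*)\"[^>]*>([^<]*)</[wc]>")[[1]]

# Column 1 of the match matrix is the full match; columns 2 and 3 are the groups
df <- data.frame(
  type = m[, 2],
  word = str_trim(m[, 3]),
  stringsAsFactors = FALSE
)

print(df)
```

Because both columns come from the same match matrix, they always have the same number of rows, so data.frame() cannot fail on mismatched lengths.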
