我正在尝试在 R 中进行文本分析。我有一个具有以下结构的文本文件。
HD A YEAR Oxxxx
WC 244 words
PD 28 February 2018
SN XYZ
SC hydt
LA English
CY Copyright 2018
LP Rio de Janeiro, Feb 28
TD
With recreational cannabis only months away from legalization in Canada, companies are racing to
prepare for the new market. For many, this means partnerships, supply agreements,
我想在 R 中提取以下元素(PD 和 TD),并保存到表中。
我已经尝试过了,但我无法正确。
提取 PD
library(stringr)
library(tidyverse)
pd <- unlist(str_extract_all(txt, "\\bPD\\b\t[0-9]+?\\s[A-Za-z]+?\\s[0-9]+\\s"))
pd <- str_replace_all(pd, "\\bPD\\b\t", "")
if (length(pd) == 0) {
pd <- as.character(NA)
}
pd <- str_trim(pd)
pd <- as.Date(strptime(pd, format = "%d %B %Y"))
提取 TD
td <- unlist(str_extract_all(txt, "\\bTD\\b[\\t\\s]*?.+?\\bCO\\b"))
td <- str_replace_all(td, "\\bTD\\b[\\t\\s]+?", "")
td <- str_replace_all(td, "\\bCO\\b", "")
td <- str_replace_all(td, "\\s+", " ")
if (length(td) == 0) {
td <- as.character(NA)
我想要如下表格:
PD TD
28 February 2018 With recreational cannabis only months away from
legalization in Canada, companies are racing to
prepare for the new market. For many, this means
partnerships, supply agreements, Production hit a
record 366.5Mt
任何帮助,将不胜感激。谢谢