
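I am trying to identify a pattern that spans multiple lines, two lines to be precise. I am taking this approach because the pattern on either line is not unique by itself.

So far I have tried the function "grep", but I think I am missing the right regular expression.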

grep("^Item\\s{0,}2[^A]", f.text, ignore.case = TRUE)

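This is part of a modified version of the edgar package function "getFilings"; it tries to extract only management's commentary (Item 2) from quarterly reports. If possible, I would like to add something after the ...2[^A] part of the pattern that reacts to the line break, followed by the string "Management...".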

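The pattern in the plain txt file I have looks like this:

Item 2
. Management's Discussion and Analysis of Financial Condition and Results of Operations

I would appreciate any help on how best to capture this with a regular expression in R.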

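A sample input looks like this:

21 Item 2.
Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.
Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk

and the desired output is

Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10

I need to match "Item 2. ... Management Discussion" because "Item 2" alone is not unique. How can I formulate a regular expression that spans two lines?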

2 Answers

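This is not very sophisticated, since I am no expert in string manipulation, but the tidyverse package provides some powerful tools to get the desired output.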

text <- "21 Item 2.
Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.
Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk Item 4.
Fluffy Text example Item 5.
Lorem ipsum dolor sit amet, consectetur adipisici elit"

Now

library(tidyverse)  # provides %>% and the stringr functions used below

text %>%
  str_extract_all("(?<=Item\\s\\d[[:punct:]]\\n).*", simplify = TRUE) %>%
  str_remove("\\s+Item\\s\\d[[:punct:]]")

gives you

[1] "Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10"
[2] "Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"                           
[3] "Fluffy Text example"                                                                                                                                 
[4] "Lorem ipsum dolor sit amet, consectetur adipisici elit" 

If you only want to extract Item 2, replace the \\d inside str_extract_all with 2.
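A minimal sketch of that substitution, assuming stringr is installed and using the sample text shortened to its first two items. The variable name `item2` and the `$` anchor in the clean-up step are additions for illustration and are not part of the code above.

```r
library(stringr)

# Sample text from the question, shortened to the first two items.
text <- "21 Item 2.\nManagement Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.\nQuantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"

# Look behind for the literal "Item 2", one punctuation character and the
# line break, then take the rest of the line that follows.
item2 <- str_extract(text, "(?<=Item\\s2[[:punct:]]\\n).*")

# Drop the next item's header ("Item 3.") that shares the end of that line.
item2 <- str_remove(item2, "\\s+Item\\s\\d[[:punct:]]$")

item2
```

Because the lookbehind contains the line break itself, the match only starts on the line directly after an "Item 2." header, which is what makes the two-line combination unique.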

answered 2020-05-25T10:53:25.243

You could simply remove the line breaks:

gsub("\\n", "", text)
[1] "21 Item 2.Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"

Now you have everything on one long line and can extract whatever pattern you have in mind, for example with str_extract from the package stringr:

library(stringr)
str_extract(gsub("\\n", "", text), "Management.*on Form 10")
[1] "Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10"
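If the text arrives as a character vector with one element per line (an assumption here, for instance the result of `readLines()`), a base R variant of the same idea might look like the sketch below. The names `txt_lines`, `flat`, `pattern`, `hit` and `item2` are hypothetical. Instead of deleting the line breaks, it joins the lines with "\n" and lets `\\s+` in the pattern absorb the break after the "Item 2." header.

```r
# Hypothetical input: one element per line, as readLines() would return it.
txt_lines <- c(
  "21 Item 2.",
  "Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.",
  "Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"
)

# Join the lines into one string so a single pattern can span the line break.
flat <- paste(txt_lines, collapse = "\n")

# "Item 2." followed by whitespace (including the line break), then capture
# lazily from "Management" up to the next "Item <number>." header.
pattern <- "Item\\s*2\\.\\s+(Management.*?)\\s+Item\\s*\\d+\\."

# regexec() returns the full match plus the capture group; element 2 is the
# captured section text (NA if the pattern does not match).
hit <- regmatches(flat, regexec(pattern, flat, perl = TRUE))[[1]]
item2 <- hit[2]

item2
```

With `perl = TRUE` the dot does not match a line break by default, so this only works when the section body sits on a single line, as in the sample. For bodies that span several lines, prefixing the pattern with `(?s)` lets the dot cross line breaks.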

Data:

text <- "21 Item 2.
Management Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.
Quantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"

text
[1] "21 Item 2.\nManagement Discussion and Analysis of Financial Condition and Results of Operations This section and other parts of this Quarterly Report on Form 10 Item 3.\nQuantitative and Qualitative Disclosures About Market Risk There have been no material changes to the Company market risk"
answered 2020-05-25T11:07:10.847