r - Using the tabulizer package to extract a list based on a string

Question

I am using the tabulizer package to extract quarterly income statements and convert them to tabular form.

library(tabulizer)

# 2017 Q3 report
telia_url = "http://www.teliacompany.com/globalassets/telia-company/documents/reports/2017/q3/telia-company-q3-2017-en"
telialists = extract_tables(telia_url)
teliatest1 = as.data.frame(telialists[[22]])

# 2009 Q3 report
telia_url2009 = "http://www.teliacompany.com/globalassets/telia-company/documents/reports/2009/q3/teliasonera-q3-2009-report-en.pdf"
telialists2009 = extract_tables(telia_url2009)
teliatest2 = as.data.frame(telialists2009[[9]])

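I am only interested in the Condensed Consolidated Statements of Comprehensive Income. This string is identical or very similar across all historical reports.

Above, for the 2017 report, list element #22 is the correct table. However, since the 2009 report has a different layout, #9 is the correct element for that particular report.

What would be a clever way to make this function dynamic, based on where the string (or a substring of) "Condensed Consolidated Statements of Comprehensive Income" appears?

Perhaps using the tm package to find the relative position?

Thanks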

score 1 · Accepted Answer

You can use pdftools to find the page you are interested in.

For instance, a function like this one should do the job:

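get_table <- function(url) {
  # Read every page of the PDF into a character vector (one element per page)
  txt <- pdftools::pdf_text(url)
  # Index of the first page whose text looks like the statement title
  p <- grep("condensed consolidated statements.{0,10}comprehensive income",
            txt,
            ignore.case = TRUE)[1]
  # grep() found nothing, so there is no page to extract from
  if (is.na(p)) stop("no page matches the statement title")
  # Extract only the tables located on that page
  L <- tabulizer::extract_tables(url, pages = p)
  # Keep the table with the most cells
  i <- which.max(lengths(L))
  data.frame(L[[i]])
}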

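The first step is to read all the pages into the character vector txt. Then grep lets you find the first page that looks like the one you want (I inserted .{0,10} to allow up to ten characters, such as spaces or newlines, in the middle of the title).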
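The page lookup can be tried without downloading a PDF. The sketch below is only an illustration: the `pages` vector is made-up text standing in for the output of `pdftools::pdf_text()`, and it assumes R's default regex engine, where `.` also matches a newline.

```r
# Hypothetical page texts standing in for pdftools::pdf_text(url);
# the real reports will of course contain much more text per page.
pages <- c(
  "Third quarter summary\nNet sales and earnings",
  "Condensed Consolidated Statements of\nComprehensive Income\nSEK in millions",
  "Condensed Consolidated Statements of Cash Flows"
)

pattern <- "condensed consolidated statements.{0,10}comprehensive income"

# Indices of all pages whose text matches the title, ignoring case
hits <- grep(pattern, pages, ignore.case = TRUE)

# First matching page; this is NA when nothing matches,
# because indexing an empty integer vector with [1] yields NA
first_page <- hits[1]
no_match <- grep(pattern, "unrelated text", ignore.case = TRUE)[1]
```

Here only the second element matches: the " of" plus the line break between "Statements" and "Comprehensive" fits inside the ten-character allowance, while the cash flow page is skipped.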

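With tabulizer, you can extract the list L of all the tables located on this page, which should be much faster than extracting every table in the document, as you did. Your table is probably the largest one on that page, hence the which.max.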
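The "largest table" selection can be sketched on its own as well. The list `L` below is invented data standing in for what `extract_tables()` returns for one page, assuming its matrix output (a list of character matrices); `lengths()` then counts the cells of each matrix.

```r
# Hypothetical stand-in for tabulizer::extract_tables(url, pages = p)
L <- list(
  matrix("note", nrow = 1, ncol = 2),   # a small header fragment
  matrix("x",    nrow = 12, ncol = 5),  # the statement itself
  matrix("y",    nrow = 3, ncol = 4)    # a footnote table
)

# Number of cells in each table (rows * columns)
sizes <- lengths(L)

# Position of the table with the most cells
i <- which.max(sizes)

# Convert the chosen matrix to a data frame
tbl <- data.frame(L[[i]], stringsAsFactors = FALSE)
```

This heuristic is only a guess about the page layout: if a report places a larger unrelated table on the same page, a different rule (for example, matching on a row label) would be needed.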
