r - Extracting strings from a PDF using R

Question

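I have this PDF file from the European Parliament, which you can download here. I have downloaded it and loaded it into R. It contains lists of the Members of the European Parliament (MEPs) who voted.

I only want to extract part of these lists. Specifically, I want to extract the names between "AVGIVNA RÖSTER" and "0" and put them in a table; see the highlighted text in this screenshot.

Similar series of names recur throughout the PDF, each referring to a specific vote. I want them all in one table. The MEP names change, but the structure stays the same: they are always located between "AVGIVNA RÖSTER" and "0".

I thought about using a startswith function and a for loop, but I'm having trouble writing it.

This is what I have done so far: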

library(pdftools)
library(tidyverse)

votetext <- pdftools::pdf_text("MEP.pdf") %>%
  readr::read_lines()

score 1 · Accepted Answer

You can try something like this:

votetext <- pdftools::pdf_text("MEP.pdf") %>%
  readr::read_lines()

a <- which(grepl("AVGIVNA RÖSTER", votetext)) #beginning of string
b <- which(grepl("^\\s*0\\s*$", votetext)) #end of string

sapply(a, function(x){paste(votetext[x:(min(b[b > x]))], collapse = ". ")})
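To see how the start and end markers get paired, here is a minimal sketch on a made-up character vector (the lines and names are placeholders, not taken from the PDF). For each start index in a, min(b[b > x]) picks the first end marker that comes after it.

```r
# Toy lines standing in for the output of pdf_text() + read_lines();
# the names are hypothetical and not taken from the real PDF.
lines <- c(
  "intro",
  "  AVGIVNA RÖSTER",
  "Name One",
  "   0  ",
  "middle",
  "AVGIVNA RÖSTER ",
  "Name Two",
  "Name Three",
  " 0"
)

a <- which(grepl("AVGIVNA RÖSTER", lines))   # indices of the start markers
b <- which(grepl("^\\s*0\\s*$", lines))      # indices of lines holding only "0"

# For every start index, take the first end index that comes after it
ends <- sapply(a, function(x) min(b[b > x]))

print(data.frame(start = a, end = ends))
```

If a start marker has no "0" line after it, min() is called on an empty vector and returns Inf with a warning, so real input may need a check for that case.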

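Note that in the definition of b I use \\s* to match whitespace in the string. In general, you could remove leading and trailing whitespace first; see this question.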
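As a small illustration of that idea, the sketch below normalizes whitespace with base R only, on made-up lines (the sample strings are assumptions, not real PDF content). Once the lines are trimmed, the markers can be matched with plain equality instead of a regular expression.

```r
# Hypothetical raw lines with stray whitespace, as PDF text extraction often produces
raw <- c("   AVGIVNA RÖSTER  ", "  0 ", "Anna   Alpha,   Bo Beta")

# trimws() drops leading/trailing whitespace; gsub() collapses inner runs to one space
clean <- gsub("\\s+", " ", trimws(raw))

print(clean)
print(which(clean == "0"))   # exact matching now works for the end marker
```

stringr::str_squish() performs the same trimming and collapsing in a single call, if the tidyverse is already loaded.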

In your case, you could do this:

votetext2 <- pdftools::pdf_text("MEP.pdf") %>%
  readr::read_lines() %>%
  str_remove("^\\s*") %>% #remove whitespace at the beginning
  str_remove("\\s*$") %>% #remove whitespace at the end
  str_replace_all("\\s+", " ") #replace multiple whitespace characters with a single space

a2 <- which(votetext2 == "AVGIVNA RÖSTER")
b2 <- which(votetext2 == "0")

result <- sapply(a2, function(x){paste(votetext2[x:(min(b2[b2 > x]))], collapse = ". ")})

result then looks like this:

"AVGIVNA RÖSTER. Martin Hojsík, Naomi Long, Margarida Marques, Pedro Marques, Manu Pineda, Ramona Strugariu, Marie Toussaint,. + Dragoş Tudorache, Marie-Pierre Vedrenne. -. Agnès Evren. 0"
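Since the goal is a table rather than one long string per vote, the same start and end indices could also be used to build a data frame with one row per name. The following is only a sketch in base R on made-up lines. The toy names, the vote column, and the assumptions that names are comma-separated and that a leading "+" or "-" marker can simply be dropped are mine and have not been checked against the actual PDF.

```r
# Toy lines standing in for the cleaned text (hypothetical names, not real PDF data)
lines <- c(
  "Some header",
  "AVGIVNA RÖSTER",
  "Anna Alpha, Bo Beta,",
  "+ Cy Gamma",
  "-",
  "Di Delta",
  "0",
  "Other text",
  "AVGIVNA RÖSTER",
  "Ed Epsilon",
  "0"
)

starts <- which(lines == "AVGIVNA RÖSTER")
ends   <- which(lines == "0")

blocks <- lapply(seq_along(starts), function(i) {
  s <- starts[i]
  e <- min(ends[ends > s])                  # first "0" after this start marker
  body <- lines[s + seq_len(e - s - 1)]     # lines strictly between the two markers
  body <- sub("^[+-]\\s*", "", body)        # drop a leading "+" or "-" marker
  nm <- trimws(unlist(strsplit(body, ",")))  # split on commas, trim each piece
  nm <- nm[nzchar(nm)]                      # discard empty pieces
  data.frame(vote = rep(i, length(nm)), name = nm, stringsAsFactors = FALSE)
})

# One row per name; 'vote' numbers the blocks in order of appearance
votes <- do.call(rbind, blocks)
print(votes)
```

This version discards the "+" and "-" markers instead of recording them, because in the output shown above they do not line up cleanly with the names that follow. Whether a name belongs to "+", "-" or "0" would have to be worked out from the real layout.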
