r - How to systematically extract data from a textbook

Question

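{Edit} Hi everyone!

I am trying to systematically extract data from a textbook (PDF). Because this task is not easy to turn into a reproducible example, I have provided 2 pages of the book here as a sample. These two pages contain a list of species scientific names (Genus species) and a series of 2-character codes. I want to extract the scientific name and the codes of every species on the 2 sample pages.

Here is an example of what I want to extract (species = green, codes = blue):

So far I have been able to recover the scientific names quite reliably, but the codes are not being extracted the way I want: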

library(pdftools)
library(tidyverse)

plants <- pdf_text("World_Checklist_of_Useful_Plant_Species_2020-pages-12-13.pdf") %>% 
  str_split("\n") # splitting up the document by pages: result is a list of length = # pages (689)

species_full <- list()
taxa_full <- list()
use_full <- list()

for(i in 1:length(plants)){ 
  # for loop to search for species names across all subsetted pages
  species_full[[i]] <- plants[[i]] %>%
    str_extract("[A-Z]+[a-z]+ [a-z]+\\b") # extracting words with upper and lower case letters between margins and abbr. words
  
  use_full[[i]] <- plants[[i]] %>%
    str_extract("(?<=\\|).+(?=\\|)") %>% # extracting use codes
    str_split("\n") %>%
    str_extract_all("[A-Z]+[A-Z]")
  
}

species_full_df <- species_full %>%
  unlist() %>% # unlisting
  as.data.frame() %>%
  drop_na() %>%
  rename(species = ".") %>%
  filter(!species %in% c("Checklist of", "Database developed")) # removing artifacts from page headers

use_full_df <- use_full %>% 
  unlist() %>% # unlisting
  as.data.frame() %>%
  rename(code = ".") %>%
  filter(!code == "<NA>") %>%
  as.data.frame()

From this code, I get the following for species_full_df:

> head(species_full_df)
                     species
1      Encephalartos cupidus
2 Encephalartos cycadifolius
3       Encephalartos eugene
4    Encephalartos friderici
5     Encephalartos heenanii
6                 Cycas apoa

(Note that the order is not preserved, but most of the species names are there.)

And I get these results for use_full_df:

> head(use_full_df)
  code
1  RBG
2   EU
3   EU
4   MA
5   ME
6   ME

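Problem: the extraction picks up 3-character codes (I only want the 2-character use codes), and it returns only one code per line (many species have multiple codes).

Can you suggest how to improve this process? Presumably my use of regular expressions is atrocious.

Thanks in advance!

-Alex.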

score 2 · Accepted Answer

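I would approach it differently. First, I would rely on the tabulizer package, which can parse the columns in the PDF into single lines of text. Then I would convert the raw lines into a tibble/data.frame so the transformations are vectorized, rather than looping over the lines.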

library(tabulizer)
library(splitstackshape)
library(tidyverse)

text_plants <- tabulizer::extract_text(file = "World_Checklist_of_Useful_Plant_Species_2020-pages-12-13.pdf")

df_plants <- 
  read.delim(file = textConnection(text_plants), header = FALSE) %>% as_tibble() %>% #as_tibble is optional, but helps a lot for exploring the results of the read.delim and the following mutations.
  filter(grepl("^\\s?(World.Checklist.of.Useful.Plant|m.diazgranados@kew.org|Page *\\d+ of \\d+|\\s*$)", V1) == FALSE) %>% # Optional. Removes header lines, footer lines and blank lines.
  mutate(V1 = trimws(V1), 
         is_metadata = grepl('^\\s?\\d+.*[|]', V1), #Starts by checking those lines that have metadata, and which are always below a plant
         is_plant = lead(is_metadata), #Identifies those lines with the plant name, which seems to be always above a metadata line
         plant_metadata = if_else(is_plant == TRUE, true = trimws(lead(V1)), false = NA_character_)) %>% #moves the metadata signal into the same row but different variable of the plant signal.
  filter(is_plant == TRUE) %>% # Removes all lines not listing a plant.
  rename(plant = V1) %>% 
  mutate(usage_codes = str_extract(string = plant_metadata, pattern = "(?<=\\|).+(?=\\|)") %>% trimws()) %>% # Extracts the "usage codes"
  select(plant, usage_codes) %>% 
  splitstackshape::cSplit(splitCols = "usage_codes", sep = " ", direction = "long") %>% # Splits the usage codes into a tidy (long) table with plants as ID
  filter(!is.na(usage_codes)) %>% 
  mutate(exists = TRUE) %>%
  pivot_wider(id_cols = plant, names_from = usage_codes, values_from = exists, values_fill = FALSE) # pivots the tidy table into a wide format.

df_plants
# A tibble: 114 x 10
   plant                      ME    HF    PO    SU    EU    GS    MA    IF    AF   
   <chr>                      <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
 1 Cycas apoa K.D.Hill        TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 2 Cycas circinalis L.        TRUE  TRUE  TRUE  TRUE  FALSE FALSE FALSE FALSE FALSE
 3 Cycas inermis Lour.        TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 4 Cycas media R.Br.          TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 5 Cycas micronesica K.D.Hill TRUE  TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 6 Cycas pectinata Buch.-Ham. TRUE  TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 7 Cycas revoluta Thunb.      TRUE  TRUE  FALSE FALSE TRUE  TRUE  TRUE  FALSE FALSE
 8 Cycas rumphii Miq.         TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  FALSE
 9 Cycas siamensis Miq.       TRUE  TRUE  FALSE FALSE TRUE  FALSE FALSE FALSE FALSE
10 Cycas taiwaniana Carruth.  FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE FALSE
# … with 104 more rows
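
As a side note on the two symptoms described in the question (3-character codes slipping through, and only one code per line), here is a minimal sketch of a stricter pattern. The two `meta` strings are made-up stand-ins for the metadata lines, so the real layout in the PDF may differ; only `str_extract()` and `str_extract_all()` from stringr are used.

```r
library(stringr)

# Hypothetical metadata lines (an assumption, not copied from the PDF).
meta <- c("1234 | ME HF PO SU | RBG", "5678 | EU | RBG")

# Keep only the text between the first and the last pipe, as in the answer above.
between_pipes <- str_extract(meta, "(?<=\\|).+(?=\\|)")

# \\b[A-Z]{2}\\b matches exactly two capital letters standing alone,
# so a three-letter token such as "RBG" cannot match.
# str_extract_all() returns every match per line, not just the first one.
codes <- str_extract_all(between_pipes, "\\b[A-Z]{2}\\b")
codes
```

`codes` is a list with one character vector per input line, so a species with several codes keeps all of them.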
