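I extracted a table from a PDF using pdftools in R. The table in the PDF has columns whose text spans multiple lines. I replaced every run of two or more spaces with "|" to make this easier. The problem is that, because of the multi-line cells and the way the table is laid out in the PDF, the data comes out in the wrong order. The original looks like this:

[image: the table as it appears in the PDF]

The data I extracted looks like this: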

    scale_definitions <- c("", "                                        to lack passion                        easily annoyed", 
"      Excitable", "                                        to lack a sense of urgency             emotionally volatile", 
"", "                                        naive                                  mistrustful", 
"      Skeptical", "                                        gullible                               cynical", 
"", "                                        overly confident                       too conservative", 
"      Cautious", "                                        to make risky decisions                risk averse", 
"", "                                        to avoid conflict                      aloof and remote", 
"      Reserved", "                                        too sensitive                          indifferent to others' feelings", 
"", "                                        unengaged                              uncooperative", 
"      Leisurely", "                                        self-absorbed                          stubborn", 
"", "                                        unduly modest                          arrogant", 
"      Bold", "                                        self-doubting                          entitled and self-promoting", 
"", "                                        over controlled                        charming and fun", 
"      Mischievous", "                                        inflexible                             careless about commitments", 
"", "                                        repressed                              dramatic", 
"      Colorful", "                                        apathetic                              noisy", 
"", "                                        too tactical                           impractical", 
"      Imaginative", "                                        to lack vision                         eccentric", 
"", "                                        careless about details                 perfectionistic", 
"      Diligent", "                                        easily distracted                      micromanaging", 
"", "                                        possibly insubordinate                 respectful and deferential", 
"      Dutiful", "                                        too independent                        eager to please"
)

library(tidyverse)  # provides %>% and str_replace_all()

scale_definitions <- scale_definitions %>% str_replace_all("\\s{2,}", "|")

How can I best get this into a data frame?

1 Answer

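Unfortunately a reprex would be complicated, so here is a description of how to get a structured df:

I'm afraid you will have to use pdftools::pdf_data() instead of pdftools::pdf_text().

That gives you a list with one df per page. In each df you get one row for every word on the page, along with its exact position (x and y, plus width and height IIRC). With that you can write a parser to do your task... it will be some work, but it is the only way I know of to solve this kind of problem.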
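To give a rough idea of what such a parser could look like, here is a minimal sketch. It does not read a real PDF: the `words` data frame is a hand-made mock with the `x`, `y` and `text` columns that `pdf_data()` returns, and the coordinates, the `col_breaks` cut points and the `col1`/`col2`/`col3` labels are all assumptions made up for illustration. The idea is to assign each word to a table column by its `x` position and then glue the words of each line and column back together.

```r
library(dplyr)

# Mock of one page from pdftools::pdf_data(); real code would start from
# pdftools::pdf_data("your_file.pdf")[[1]] (the file name is a placeholder).
# The coordinates below are invented for illustration.
words <- data.frame(
  x = c(200, 215, 245, 400, 440, 50, 200, 215, 245, 255, 290, 305, 400, 470),
  y = c(100, 100, 100, 100, 100, 106, 112, 112, 112, 112, 112, 112, 112, 112),
  text = c("to", "lack", "passion", "easily", "annoyed", "Excitable",
           "to", "lack", "a", "sense", "of", "urgency", "emotionally", "volatile"),
  stringsAsFactors = FALSE
)

# Hypothetical x positions that separate the three table columns.
col_breaks <- c(0, 150, 350, Inf)

parsed <- words %>%
  # assign every word to a table column based on its x position
  mutate(column = cut(x, breaks = col_breaks,
                      labels = c("col1", "col2", "col3"), right = FALSE)) %>%
  # restore reading order: top to bottom, then left to right
  arrange(y, x) %>%
  # glue the words of each line/column combination back together
  group_by(y, column) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop")

parsed
```

In a real PDF, words on the same visual line can have slightly different `y` values, so you would probably need to round or bin `y` first, and the cut points have to be read off the actual page.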

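Update:

I found a readr function that helps in your case, because we can assume a fixed width (check with nchar()) for each column position. Note that this has to run on the raw scale_definitions vector, before the runs of spaces are replaced with "|":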

library(tidyverse)

# I() marks the character vector as literal data, not file paths (readr >= 2.0)
I(scale_definitions) %>%
    # parse into columns by width, and therefore implicitly by start position
    readr::read_fwf(fwf_widths(c(39, 40, 40), c("col1", "col2", "col3"))) %>%
    # build group ID from row number (each scale spans 3 non-empty lines)
    dplyr::mutate(grp = (dplyr::row_number() - 1) %/% 3) %>%
    # group by that ID
    dplyr::group_by(grp) %>%
    # impute missing values in col1 within each group
    tidyr::fill(col1, .direction = "downup") %>%
    # remove grouping to prevent unwanted behaviour downstream
    dplyr::ungroup() %>%
    # remove auxiliary variable
    dplyr::select(-grp) %>%
    # convert to long format (safer for removing NAs)
    tidyr::pivot_longer(-col1, names_to = "cols", values_to = "vals") %>%
    # remove NAs
    dplyr::filter(!is.na(vals))

# A tibble: 44 x 3
   col1      cols  vals
   <chr>     <chr> <chr>
 1 Excitable col2  to lack passion
 2 Excitable col3  easily annoyed
 3 Excitable col2  to lack a sense of urgency
 4 Excitable col3  emotionally volatile
 5 Skeptical col2  naive
 6 Skeptical col3  mistrustful
 7 Skeptical col2  gullible
 8 Skeptical col3  cynical
 9 Cautious  col2  overly confident
10 Cautious  col3  too conservative
# ... with 34 more rows
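
The widths above rest on the assumption that every line is padded to the same column positions. A quick way to check this is to look at where each field starts. The snippet below is only a sketch: `sample_lines` is synthetic text padded to the assumed 39/40/40 layout (not the real PDF output), and `field_starts()` is a helper name I made up. Run the helper on your real vector to see whether the start positions line up.

```r
# Synthetic lines padded to the assumed 39/40/40 layout (not the real PDF text).
sample_lines <- c(
  sprintf("%-39s%-40s%s", "", "to lack passion", "easily annoyed"),
  sprintf("%-39s", "      Excitable"),
  sprintf("%-39s%-40s%s", "", "to lack a sense of urgency", "emotionally volatile")
)

# Hypothetical helper: start position of every field, where a field is a run
# of words separated by single spaces (fields are separated by 2+ spaces).
field_starts <- function(line) {
  m <- gregexpr("\\S+( \\S+)*", line, perl = TRUE)[[1]]
  as.integer(m[m > 0])
}

starts <- lapply(sample_lines, field_starts)

# Distinct start positions across all lines.
sort(unique(unlist(starts)))
```

For these synthetic lines the distinct start positions are 7, 40 and 80. Fields starting at 40 and 80 correspond to a first column of width 39 and a second of width 40. If the real data shows more than one start position for the same column, the fixed-width approach will not be reliable.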
answered 2021-08-25T19:28:31.737