I have PDF text that I need to convert to a "tidy" format, but I'm not sure how to read the PDF text without mangling the information I need. For example:

# install pacman package if you require it
if (!require("pacman")) install.packages("pacman")

# p_load installs and loads packages

pacman::p_load(tidyverse, pdftools, tabulizer)

pdf_txt_raw <- pdf_text("https://www.statcan.gc.ca/eng/statistical-programs/document/5027_D1_V10-eng.pdf") %>% 
               read_lines()

pdf_txt_raw 

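Using read_lines() seems to go wrong: whenever the "Legal Name" column wraps onto two lines, it breaks the tidy format I'm looking for. For example, Loblaw Inc [4] should clean up nicely, because each operating name is separated by a comma and sits on the same line as Loblaws, which gives me a clean category.

But because of the line breaks in the PDF, the first legal name category is wrong. "Buy-Low Foods Limited Partnership" should be the legal name, and the operating names in that category should be "AG Foods, Buy-Low Foods, Buy & Save Foods, Fine Foods, G&H Shop N' Save, Nesters Market".

Any tips on how to clean this up properly and get the tidy format I'm looking for?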

1 Answer

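I just found a way to cheat, so let me lay it out for you. On Windows, I copied each page of the PDF file, pasted it into Excel 2013, and saved the result as a file. That file holds the data in a very clean layout, and I think it is your lifesaver. First, I imported it into R with read_xlsx(). Then I renamed the Legal Name column to legal_name; you can use a different name if you like. Next, I removed the rows that contain headers from the PDF file, such as Operating Name, The Food Retailers, and so on. Then I created two columns. One is id: each company's information sits in two consecutive rows, and I used that pattern to create the variable. The other is type, which holds the two kinds of names. Finally, I converted the data to wide format, which gives the result below. I'm not sure this approach will always work, but at least for your PDF data it does.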

library(readxl)
library(dplyr)
library(tidyr)

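# Read the workbook created by pasting the PDF pages into Excel
mydf <- read_xlsx("SO_pdf.xlsx")

mydf %>%
  rename(legal_name = "Legal Name") %>%
  # Drop the header rows that were copied over from the PDF
  filter(!grepl(pattern = "Operating Name|The Food Retailers|The Department Stores|The Other Non-Food Retailers", x = legal_name)) %>%
  # Each company spans two consecutive rows: legal name first, operating names second
  mutate(id = rep(1:(n() / 2), each = 2),
         type = rep(c("legal_name", "operating_name"), times = n() / 2)) %>%
  pivot_wider(id_cols = "id", names_from = "type", values_from = "legal_name")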

#      id legal_name                        operating_name                                                                            
#   <int> <chr>                             <chr>                                                                                     
# 1     1 Buy-Low Foods Limited Partnership AG Foods, Buy-Low Foods, Buy & Save Foods, Fine Foods, G&H Shop N' Save, Nesters Market   
# 2     2 Loblaws Inc.                      At the Pumps, Atlantic Gas Bars, Dominion, Extra Foods, Joe Fresh, Loblaws, Loblaws à Ple~
# 3     3 Metro Ontario Inc.                Drug Basics, Food Basics, Metro, Super C, The Pharmacy                                    
# 4     4 Overwaitea Food Group Limited Pa~ Cooper's Foods, Overwaitea Foods, PriceSmart Foods, Save-On-Foods, Urban Fare             
# 5     5 Sobeys Capital Incorporated       Candico Food Markets, Canada Safeway, Canada Safeway Liquor Store, Fast Fuel, Foodland, F~
# 6     6 Hudson's Bay Company              Home Outfitters/Déco Découverte, The Bay/ La Baie, Zellers                                
# 7     7 Sears Canada Inc.                 Sears, Sears Home Stores, Sears Hometown Stores, Sears Outlet                             
# 8     8 Wal-Mart Canada Corp              Walmart                                                                                   
# 9     9 American Eagle Outfitters Canada~ Aerie, American Eagle Outfitters                                                          
#10    10 Apple Canada Inc.                 Apple Store   
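
If you want to try the pairing step without the Excel file, here is a minimal sketch on made-up data. The company and shop names and the objects `toy` and `toy_wide` are invented for illustration and are not taken from your PDF. It assumes, as above, that every company occupies exactly two consecutive rows, and it stops early if the row count is odd, since the id/type pattern would otherwise be misaligned.

```r
library(dplyr)
library(tidyr)

# Made-up data: one column, two consecutive rows per company
# (legal name first, comma-separated operating names second).
toy <- tibble(
  legal_name = c(
    "Company A Ltd.", "Shop A1, Shop A2",
    "Company B Inc.", "Shop B1"
  )
)

# The pairing trick only works if the rows really come in pairs.
stopifnot(nrow(toy) %% 2 == 0)

toy_wide <- toy %>%
  mutate(
    # 1, 1, 2, 2, ...: one id per pair of rows
    id = rep(seq_len(n() / 2), each = 2),
    # alternate the two kinds of names within each pair
    type = rep(c("legal_name", "operating_name"), times = n() / 2)
  ) %>%
  pivot_wider(id_cols = id, names_from = type, values_from = legal_name)

toy_wide
```

The result has one row per company with the columns id, legal_name and operating_name. If a legal name in your real data ever spans three rows instead of two, this check will not catch it unless the total happens to become odd, so it is worth eyeballing the output.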
Answered 2020-01-29T02:37:23.947