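I just found a way to cheat. Let me leave it here for you. I copied and pasted each page of the PDF file into Excel 2013 on my Windows machine and created one file. The file held the data in a pretty nice way, so I think this is your lifesaver. First, I imported it into R with `read_xlsx()`. Then, I renamed the column `Legal Name` to `legal_name`. You can use something else if you want. Then, I removed the rows that contain headings from the PDF file, such as `Operating Name`, `The Food Retailers`, and so on. Then, I created two columns. One is `id`. Basically, one company's information sits in two consecutive rows, and I used this pattern to create the variable. I also created `type`, which contains the two types of names. Finally, I converted the data to wide format. This gave us the following result. I am not sure if this approach works all the time, but at least for your PDF data, it does.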
library(readxl)
library(dplyr)
library(tidyr)
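# Import the spreadsheet built by pasting the PDF pages into Excel
mydf <- read_xlsx("SO_pdf.xlsx")

rename(mydf, legal_name = "Legal Name") %>%
  # Drop the heading rows carried over from the PDF
  filter(!grepl(x = legal_name, pattern = "Operating Name|The Food Retailers|The Department Stores|The Other Non-Food Retailers")) %>%
  # Each company spans two consecutive rows: legal name, then operating name
  mutate(id = rep(1:(n()/2), each = 2),
         type = rep(c("legal_name", "operating_name"), times = (n()/2))) %>%
  # Spread the pairs into one row per company
  pivot_wider(id_cols = "id", names_from = "type", values_from = "legal_name")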
# id legal_name operating_name
# <int> <chr> <chr>
# 1 1 Buy-Low Foods Limited Partnership AG Foods, Buy-Low Foods, Buy & Save Foods, Fine Foods, G&H Shop N' Save, Nesters Market
# 2 2 Loblaws Inc. At the Pumps, Atlantic Gas Bars, Dominion, Extra Foods, Joe Fresh, Loblaws, Loblaws à Ple~
# 3 3 Metro Ontario Inc. Drug Basics, Food Basics, Metro, Super C, The Pharmacy
# 4 4 Overwaitea Food Group Limited Pa~ Cooper's Foods, Overwaitea Foods, PriceSmart Foods, Save-On-Foods, Urban Fare
# 5 5 Sobeys Capital Incorporated Candico Food Markets, Canada Safeway, Canada Safeway Liquor Store, Fast Fuel, Foodland, F~
# 6 6 Hudson's Bay Company Home Outfitters/Déco Découverte, The Bay/ La Baie, Zellers
# 7 7 Sears Canada Inc. Sears, Sears Home Stores, Sears Hometown Stores, Sears Outlet
# 8 8 Wal-Mart Canada Corp Walmart
# 9 9 American Eagle Outfitters Canada~ Aerie, American Eagle Outfitters
#10 10 Apple Canada Inc. Apple Store
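
If you want to try the pairing step without the Excel file, here is a minimal sketch on made-up input. The `toy` tibble and the `cleaned` and `wide` names are my own assumptions for illustration, not part of your data. The approach relies on every company taking exactly two rows, so I added a check that stops if the row count is odd after the heading rows are removed.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the imported spreadsheet: a single column holding
# one heading row followed by legal name / operating name pairs.
toy <- tibble(
  legal_name = c(
    "Operating Name",        # heading row that should be dropped
    "Sears Canada Inc.",
    "Sears, Sears Outlet",
    "Apple Canada Inc.",
    "Apple Store"
  )
)

# Remove the heading rows
cleaned <- toy %>%
  filter(!grepl("Operating Name", legal_name, fixed = TRUE))

# The pairing assumes exactly two rows per company, so stop early otherwise
stopifnot(nrow(cleaned) %% 2 == 0)

wide <- cleaned %>%
  mutate(
    id   = rep(seq_len(n() / 2), each = 2),
    type = rep(c("legal_name", "operating_name"), times = n() / 2)
  ) %>%
  pivot_wider(id_cols = id, names_from = type, values_from = legal_name)

print(wide)
```

If the check fails on real data, some company probably spans one row or three, and the `id` and `type` columns would be misaligned from that point on.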