r - How do I mutate a large number of columns into a data frame at once in R, using a custom function with pdftools and HTML links?

Question

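Apologies if this is long or poorly structured; this is my first question and my first major R side project. Let me know if I should change anything about how I ask questions in the future.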

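I am currently working with some city traffic data that is stored rather oddly. Instead of storing each intersection's data in a downloadable CSV file, the city provides a link to a web page that contains HTML links to all previous traffic surveys, each in PDF format.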

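I have already extracted the most recent PDF for each traffic intersection, but I cannot get my function, which reads a PDF and returns its data, to work when it is fed into mapply and mutate. I previously wrote a function that takes the PDF's format (its survey type) and the PDF link as input and returns a data frame of 1 row and 55 columns containing all the traffic data for that intersection.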

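Now I cannot seem to get that function to work in mapply, with or without mutate. The function is shown below; it takes two inputs, the survey type of the PDF and the link to the relevant PDF, and returns the data frame described above.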

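When I use mapply on its own, I believe the function receives the entire survey type column and the entire PDF link column as its arguments, rather than looping over the columns element by element.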

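When I use mapply inside mutate, it seems to loop over the two columns correctly, but I am not sure how to use mutate to add a large number of columns at once. Ideally, I would simply make a vector of the column names in the correct order and use mutate, or some mutate-like function, to assign the results of mapply to that list of columns, i.e.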

something like mutate(traffic_westwood, col-names = mapply(get_data, SURVEY_TYPE, PDF))

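So I have two questions:

How does mapply interact with my code below? Am I misunderstanding how mapply loops over the two columns, or is my function not vectorized correctly? Should I be using a different looping function than mapply for this task?
Is there a function I can use to loop over the two columns of interest, assign the extracted PDF data to a list of column names, and then add those columns to my data frame?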

Previously, I tried putting the mapply call inside mutate, as described above, but that still requires a way to assign the 55 columns to a list of names.

Also, I switched from lapply to mapply once I realized that I needed to loop over just the two columns rather than run my custom function on entire rows.

Note also that the SURVEY_TYPE and PDF columns used in the mapply call both have the correct dimensions and hold acceptable values in every slot: I have checked.

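Note that my code should ideally work on any finite data frame, of any dimensions, that contains the two columns SURVEY_TYPE and PDF. SURVEY_TYPE contains "Auto" to indicate an automatic survey and "Manual" to indicate a manual survey. PDF contains a vector of strings (PDF HTML links) for rows where SURVEY_TYPE = 'Auto', and a single string (PDF HTML link) where SURVEY_TYPE = 'Manual'.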

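For examples of links to the City of LA's manual traffic survey PDFs, see the closed-connection error message printed below, which should contain several such links.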

library(pdftools) #Provides pdf_text().
library(dplyr)    #Provides mutate().

#This function obtains a data frame with rows of 55 entries for each intersection with manual data.
#Automatic data only intersections have blank rows for now.
get_data <- function(survey_type, pdf_link){

  #Return data frame of empty row if type was auto. I'm fine on this part.
  if(survey_type == 'Auto'){
    #...
    #Whatever I want for auto, empty 1 row 55 cols for now.
    return(as.data.frame(matrix(NA, nrow = 1, ncol = 55)))
  }

  #Now we read in this entries manual pdf data.

  #Read in the first pdf.
  pdf1 <- pdf_text(toString(pdf_link))


  #Get the handle for this file..
  file1 <- file("Traffic_Data_Files/pdf1.txt", 'w')

  #Write the PDF contents to the file.
  write(pdf1, file = file1, sep = '\t')

  #Close the file and reopen it.
  close(file1)
  file1 <- file("Traffic_Data_Files/pdf1.txt", 'r')

  #Here is where I had code to get the data frame we will return and extract the 
  #data to it. Note that #the data frame will be 1 row and 55 columns and will  
  #be called 'intersection1'.
  #I have tested this code, which works on a single file to return such data.

  #Return the info for this intersection.
  return(intersection1)

}

#Use mapply on the traffic data.
manual_intersections <- mutate(traffic_westwood, Data = mapply(get_data, SURVEY_TYPE, PDF))
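
As a side note on the write-then-reopen step inside get_data, the text returned by pdf_text() is already a character vector with one string per page, so it can probably be split and parsed in memory without an intermediate file or an open connection. The sketch below uses an invented two-page stand-in (`pages`) rather than a real PDF; a real survey's layout will differ and will need its own parsing rules.

```r
# Invented stand-in for the output of pdftools::pdf_text():
# a character vector with one string per page.
pages <- c("Street NB SB\nMain 10 12\n", "Street NB SB\nElm 7 9\n")

# Split every page into individual lines, entirely in memory.
page_lines <- unlist(strsplit(pages, "\n", fixed = TRUE))

# Parse the first page's whitespace-separated table directly from the string.
first_page <- read.table(text = pages[1], header = TRUE, stringsAsFactors = FALSE)
```

Because no file() connection is opened, there is nothing left to close afterwards.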

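Ideally, I would like to append the rows of 55 filled-in entries directly to the current data frame. As an intermediate step, if necessary, I could of course build a new data frame containing those rows and attach it to the old one.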

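When I keep the mapply call outside of the mutate call in the last line above, I get the following error message (to me, it looks like mapply cannot tell that the columns are vectors, so the SURVEY_TYPE == 'Auto' check is applied to the column as a whole and uses only its first component instead of running element-wise):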

Error in open.connection(con, "rb") : cannot open the connection
In addition: Warning messages:
1: In if (survey_type == "Auto") { :
  the condition has length > 1 and only the first element will be used
2: In open.connection(con, "rb") :
  cannot open URL 'http://navigatela.lacity.org/dot/traffic_data/manual_counts/46972_WESWEY96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/46974_KINWES96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/Gayley.Weyburn.180927-NDSMAN.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12599_GAYVET96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/gaylec06.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12690_BROLEC95.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12691_BROXTON.WEYBURN07.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12995_HILLEC95.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/13531_LECWES96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/13997_VETWIL081021.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/SELWIL080319.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/16896_GLELIN96.pdf, http [... truncated]
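
The post does not show how mapply was called outside mutate, so the following is only one possible explanation. If the columns were passed as one-column data frames (single-bracket indexing), mapply would treat each as a list of length one and hand the whole column to the function in a single call. That would match both the "length > 1" warning and the comma-joined URL, since toString() collapses a vector into one string. The toy data frame and the `count_inputs` helper below are hypothetical.

```r
toy <- data.frame(
  SURVEY_TYPE = c("Manual", "Auto", "Manual"),
  PDF = c("a.pdf", "b.pdf", "c.pdf"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for get_data(): reports how many values it received.
count_inputs <- function(survey_type, pdf_link) length(survey_type)

# Atomic vectors: the function is called once per row with one value each time.
per_row <- mapply(count_inputs, toy$SURVEY_TYPE, toy$PDF, USE.NAMES = FALSE)

# One-column data frames: each is a list of length one, so the function is
# called once and receives the entire column.
whole_column <- mapply(count_inputs, toy["SURVEY_TYPE"], toy["PDF"],
                       USE.NAMES = FALSE)

# toString() on a whole column yields a single comma-separated string.
joined <- toString(toy$PDF)
```

Here `per_row` has three elements while `whole_column` has one, which is the difference between element-wise and whole-column evaluation.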

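When I call mapply inside mutate as shown in the code and try to assign the result to a single variable, my first impression is that it cannot assign an entire result row to a single column entry. Hence my question about how to use an arbitrary custom function inside mutate, or a mutate-like function, to assign a large number of columns to a list of names: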

Error in eval(substitute(expr), envir, enclos) : 
  more columns than column names
In addition: Warning message:
closing unused connection 3 (http://navigatela.lacity.org/dot/traffic_data/manual_counts/46972_WESWEY96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/46974_KINWES96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/Gayley.Weyburn.180927-NDSMAN.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12599_GAYVET96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/gaylec06.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12690_BROLEC95.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12691_BROXTON.WEYBURN07.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/12995_HILLEC95.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/13531_LECWES96.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/13997_VETWIL081021.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/SELWIL080319.pdf, http://navigatela.lacity.org/dot/traffic_data/manual_counts/16896_GLELIN [... truncated]
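
For the second question, one common pattern is to skip mutate entirely: collect the one-row results in a list, stack them, and bind them beside the original columns. The sketch below is a minimal base R illustration, assuming every call returns one row with identical column names; `get_data_stub` is a hypothetical stand-in with three columns instead of 55, and the toy links are invented.

```r
toy <- data.frame(
  SURVEY_TYPE = c("Manual", "Auto"),
  PDF = c("a.pdf", "b.pdf"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for get_data(): always returns one row with the
# same column names.
get_data_stub <- function(survey_type, pdf_link) {
  if (survey_type == "Auto") {
    return(data.frame(north = NA_real_, south = NA_real_, total = NA_real_))
  }
  data.frame(north = 10, south = 12, total = 22)
}

# Map() calls the function once per row and returns a list of data frames.
rows <- Map(get_data_stub, toy$SURVEY_TYPE, toy$PDF)

# Stack the one-row results, then attach them beside the original columns.
extracted <- do.call(rbind, unname(rows))
combined <- cbind(toy, extracted)
```

With dplyr loaded, bind_rows() and bind_cols() could likely replace the rbind and cbind steps; either way the column names come from the returned data frames, so no separate vector of names is needed.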

EDIT: Further examination of my code while creating a minimal example revealed other problems. I will update when I am further along.

score 0 · Accepted Answer

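I figured out why my code was not working. It turns out that although the PDFs are written in the same format, pdftools reads them differently, because there are several distinct spacing differences between table and header layouts that look identical. That is what caused the error message in the second part of the question: the number of lines I assumed, inside my function, when reading the CSV from the file was wrong for these differently formatted PDFs.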

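So I need to identify which PDFs my function fails on, create a second (or possibly more) manual survey type, and adjust the function that reads data from the PDFs so that it handles all the different manual survey types. Thanks for the suggestion in the comments to find a minimal reproducible dataset: running the function on a toy dataset instead of the full one is what led me to discover this.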
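
One way to find the failing PDFs is to wrap each call in tryCatch and record the error message instead of stopping at the first failure. This is only a sketch: `parse_stub` is a hypothetical stand-in for the real parsing function, and the link names are invented.

```r
links <- c("good1.pdf", "odd_layout.pdf", "good2.pdf")

# Hypothetical stand-in for the real parser: fails on one layout.
parse_stub <- function(pdf_link) {
  if (grepl("odd", pdf_link, fixed = TRUE)) stop("unexpected number of lines")
  data.frame(total = 22)
}

# Run the parser on every link; keep NA on success or the error message on failure.
errors <- vapply(links, function(link) {
  tryCatch({
    parse_stub(link)
    NA_character_
  }, error = function(e) conditionMessage(e))
}, character(1))

# Links whose parse raised an error.
failed_links <- names(errors)[!is.na(errors)]
```

The links in `failed_links` are then the candidates for a second manual survey type.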
