r - Read a partitioned parquet directory (all files) into one R data frame with Apache Arrow

Question

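How do I read a partitioned parquet file into R with arrow (without any Spark)?

Situation

Parquet files are created with a Spark pipeline and saved on S3
One column is read with RStudio/RShiny as an index for further analysis

The parquet file structure

The parquet file created by my Spark job consists of several parts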

tree component_mapping.parquet/
component_mapping.parquet/
├── _SUCCESS
├── part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00001-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00002-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00003-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── part-00004-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet
├── etc

How do I read this component_mapping.parquet into R?

What I tried

install.packages("arrow")
library(arrow)
my_df<-read_parquet("component_mapping.parquet")

But this fails with the error

IOError: Cannot open for reading: path 'component_mapping.parquet' is a directory

If I read just one file of the directory, it works

install.packages("arrow")
library(arrow)
my_df<-read_parquet("component_mapping.parquet/part-00000-e30f9734-71b8-4367-99c4-65096143cc17-c000.snappy.parquet")

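But I need to load all of them in order to query the data

What I found in the documentation

In the Apache Arrow documentation at https://arrow.apache.org/docs/r/reference/read_parquet.html and https://arrow.apache.org/docs/r/reference/ParquetReaderProperties.html I found that the read_parquet() command has some properties, but I can't get it working and could not find any examples.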

read_parquet(file, col_select = NULL, as_data_frame = TRUE, props = ParquetReaderProperties$create(), ...)

How do I set the properties correctly to read the full directory?

# should be these methods
$read_dictionary(column_index)
or
$set_read_dictionary(column_index, read_dict)

Help would be much appreciated

score 9 · Accepted Answer

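As @neal-richardson alluded to in his answer, more work has been done on this, and it is possible with the current arrow package (I'm running 4.0.0 at the moment).

I noticed your files use snappy compression, which requires a special build flag before installation. (Installation documentation here: https://arrow.apache.org/docs/r/articles/install.html)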

Sys.setenv("ARROW_WITH_SNAPPY" = "ON")
install.packages("arrow",force = TRUE)

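The Dataset API implements the functionality you are looking for through multi-file datasets. The documentation does not yet include a wide variety of examples, but it does provide a clear starting point. https://arrow.apache.org/docs/r/reference/Dataset.html

Below is a minimal example of reading a multi-file dataset from a given directory and converting it to an in-memory R data frame. The API also supports filtering criteria and selecting a subset of columns, though I'm still trying to figure out the syntax myself.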

library(arrow)

## Define the dataset
DS <- arrow::open_dataset(sources = "/path/to/directory")
## Create a scanner
SO <- Scanner$create(DS)
## Load it as an Arrow Table in memory
AT <- SO$ToTable()
## Convert it to an R data frame
DF <- as.data.frame(AT)
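
The filtering and column selection mentioned above can be written with dplyr verbs on the dataset before it is collected into memory. This is a minimal sketch, assuming the dplyr package is installed. The temporary directory, the column names id and component, and the sample values are made up for illustration; the code first writes a small throwaway directory of part files so that it runs on its own.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for the Spark output: two part files in one directory
# (directory, column names and values are assumptions for this sketch)
demo_dir <- tempfile("component_mapping_")
dir.create(demo_dir)
write_parquet(
  data.frame(id = 1:3, component = c("a", "b", "a"), stringsAsFactors = FALSE),
  file.path(demo_dir, "part-00000.parquet")
)
write_parquet(
  data.frame(id = 4:6, component = c("c", "b", "a"), stringsAsFactors = FALSE),
  file.path(demo_dir, "part-00001.parquet")
)

# Open the whole directory as one dataset; nothing is loaded yet
ds <- open_dataset(demo_dir)

# Filter rows and pick columns, then collect() to pull the result into R
result <- ds %>%
  filter(component == "a") %>%
  select(id, component) %>%
  collect()
```

Only the rows and columns that survive the filter() and select() steps are materialized by collect(), which matters when the full directory does not fit in memory.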

score 9 · Accepted Answer

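Solution: read partitioned parquet files from the local file system into an R data frame with arrow

Because I want to avoid using any Spark or Python on the RShiny server, I can't use the other libraries such as sparklyr, SparkR, or reticulate and dplyr, as described in "How do I read a Parquet in R and convert it to an R DataFrame?"

I have now solved my task with your proposal, using arrow together with lapply and rbindlist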

my_df <-data.table::rbindlist(lapply(Sys.glob("component_mapping.parquet/part-*.parquet"), arrow::read_parquet))

Looking forward to the Apache Arrow functionality becoming available. Thanks

score 5 · Accepted Answer

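You can't read a directory of files by setting an option on the (single) file reader. If memory isn't a problem, today you can lapply/map over the directory listing and rbind/bind_rows the results into a single data.frame. There is probably a purrr function that does this cleanly. While iterating over the files, you can also select/filter on each file if you only need a known subset of the data.

In the Arrow project, we're actively developing a multi-file dataset API that will let you do what you're trying to do, as well as push row and column selection down to the individual files, and much more. Stay tuned.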
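
The per-file approach described above could look roughly like this. It is a sketch, assuming dplyr is installed. The temporary directory, the columns id and value, and the filter threshold are made up for illustration; the code writes two small part files first so that it is self-contained.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for the Spark output directory
parts_dir <- tempfile("parts_")
dir.create(parts_dir)
write_parquet(data.frame(id = 1:2, value = c(10, 20)),
              file.path(parts_dir, "part-00000.parquet"))
write_parquet(data.frame(id = 3:4, value = c(30, 40)),
              file.path(parts_dir, "part-00001.parquet"))

# List only the part files, which skips markers such as _SUCCESS
files <- list.files(parts_dir, pattern = "^part-.*\\.parquet$", full.names = TRUE)

# Read each file, keeping only the needed columns and rows
parts <- lapply(files, function(f) {
  read_parquet(f, col_select = c("id", "value")) %>%
    filter(value > 10)
})

# Bind the per-file pieces into a single data frame
my_df <- bind_rows(parts)
```

Selecting and filtering inside the loop keeps each intermediate piece small, so peak memory is closer to the size of the final result than to the size of the whole directory.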

score 2 · Accepted Answer

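Solution: read partitioned parquet files from S3 into an R data frame with arrow

It took me a very long time to find a solution and I could not find anything on the web, so I want to share how to read partitioned parquet files from S3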

library(arrow)
library(aws.s3)

bucket="mybucket"
prefix="my_prefix"

# using aws.s3 library to get all "part-" files (Key) for one parquet folder from a bucket for a given prefix pattern for a given component
files <- data.table::rbindlist(get_bucket(bucket = bucket, prefix = prefix))$Key

# apply the aws.s3::s3read_using function to each file using the arrow::read_parquet function to decode the parquet format
data <- lapply(files, function(x) {s3read_using(FUN = arrow::read_parquet, object = x, bucket = bucket)})

# concatenate all data together into one data.frame
data <- do.call(rbind, data)

What a mess, but it works.
@neal-richardson is there a way of using arrow directly to read from S3? I couldn't find anything in the documentation for R
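
On reading from S3 with arrow directly: later arrow releases can be built with S3 support, and in such a build open_dataset() accepts an s3:// URI. The following is a sketch under those assumptions, not something confirmed in this thread. It also assumes that AWS credentials are available in the environment. The helper name read_s3_parquet_dir is made up for illustration, and the bucket and prefix are the placeholders used above.

```r
library(arrow)

# Hypothetical helper: read every parquet file under s3://<bucket>/<prefix>
read_s3_parquet_dir <- function(bucket, prefix) {
  # S3 support is optional at build time, so check for it first
  if (!arrow_with_s3()) {
    stop("this arrow build has no S3 support")
  }
  uri <- paste0("s3://", bucket, "/", prefix)
  # Open the prefix as a multi-file dataset
  ds <- open_dataset(uri, format = "parquet")
  # Scan everything into an Arrow Table, then convert to an R data frame
  as.data.frame(Scanner$create(ds)$ToTable())
}

# Usage (needs network access and credentials, so it is not run here):
# my_df <- read_s3_parquet_dir("mybucket", "my_prefix")
```

Compared with the aws.s3 loop above, this would leave file discovery and decoding to arrow, provided the installed build reports TRUE for arrow_with_s3().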
