
It is possible to read parquet files from S3 as shown here or here.

I am working with S3 access points. Given an S3 access point ARN, is it possible to read parquet files through it?

I am trying with the following sample code:

import s3fs
import pyarrow.parquet as pq

S3_ACCESS_POINT_ARN = "..."

s3_filesystem = s3fs.S3FileSystem()
s3_file_uri = f"{S3_ACCESS_POINT_ARN}/examples/example1.parquet"
example1_df = pq.ParquetDataset(s3_file_uri, s3_filesystem).read_pandas().to_pandas()

Executing it results in:

ParamValidationError: Parameter validation failed:
Invalid bucket name S3_ACCESS_POINT_ARN: Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:s3:[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
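
To see which strings this validation accepts, here is a minimal sketch that applies the two regexes quoted in the error message. The ARN below is a made-up placeholder (hypothetical region, account ID, and access point name), not my real one, and the helper name `is_valid_bucket_param` is only for illustration:

```python
import re

# Both patterns are copied verbatim from the ParamValidationError message.
PLAIN_BUCKET = re.compile(r"^[a-zA-Z0-9.\-_]{1,255}$")
ACCESS_POINT_ARN = re.compile(
    r"^arn:(aws).*:s3:[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
)


def is_valid_bucket_param(value: str) -> bool:
    """Return True if value matches either pattern from the error message."""
    return bool(PLAIN_BUCKET.match(value) or ACCESS_POINT_ARN.match(value))


# Hypothetical ARN, used only to exercise the regexes.
example_arn = "arn:aws:s3:us-east-1:123456789012:accesspoint/my-access-point"

print(is_valid_bucket_param(example_arn))                                  # True
print(is_valid_bucket_param(f"{example_arn}/examples/example1.parquet"))   # False
print(is_valid_bucket_param(example_arn.replace("/", ":")))                # True
```

So the bare ARN passes validation in either the `/` or the `:` form, while anything that still carries the object key after the access point name does not.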

I have also tried replacing / with : in S3_ACCESS_POINT_ARN which results in:

PermissionError: AccessDenied

Finally I tried using:

pq.read_table(S3_ACCESS_POINT_ARN, s3_filesystem).to_pandas()

which resulted in:

OSError: Passed non-file path: S3_ACCESS_POINT_ARN

It is worth mentioning that there are no access issues with reading files from this access point; the code below works:

import boto3
import pyarrow.parquet as pq

S3_ACCESS_POINT_ARN = "..."

s3 = boto3.resource('s3')
# boto3 accepts the access point ARN in place of a bucket name
bucket = s3.Bucket(S3_ACCESS_POINT_ARN)
# First argument is the object key, second is the local destination path
bucket.download_file("examples/example1.parquet", "/tmp/examples/example1.parquet")
example1_df = pq.read_table("/tmp/examples/example1.parquet").to_pandas()

UPDATE: The S3 access point does not allow list-objects operations on anything other than the top level:

An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

But I cannot see any parameter that would make pyarrow treat the parquet file as a single file, which could potentially avoid this issue.
