It is possible to read parquet files from S3 as shown here or here.
I am working with S3 access points. Given an S3 access point ARN, is it possible to read Parquet files through it?
I am trying with the following sample code:
import s3fs
import pyarrow.parquet as pq
S3_ACCESS_POINT_ARN = "..."
s3_filesystem = s3fs.S3FileSystem()
s3_file_uri = f"{S3_ACCESS_POINT_ARN}/examples/example1.parquet"
example1_df = pq.ParquetDataset(s3_file_uri, s3_filesystem).read_pandas().to_pandas()
Executing it results in:
ParamValidationError: Parameter validation failed:
Invalid bucket name S3_ACCESS_POINT_ARN: Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:s3:[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
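Reading the two patterns in that message, it looks like the bucket parameter has to be either a plain bucket name or the bare access point ARN, so an ARN with the object key appended would be rejected. A small check of that reading, using a made-up ARN (the region, account ID and access point name below are hypothetical, and `is_valid_bucket_param` is just a helper name for this sketch):

```python
import re

# Patterns copied from the ParamValidationError message above.
BUCKET_NAME = re.compile(r"^[a-zA-Z0-9.\-_]{1,255}$")
ACCESS_POINT_ARN = re.compile(
    r"^arn:(aws).*:s3:[a-z\-0-9]+:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
)


def is_valid_bucket_param(value):
    """Return True if value matches either pattern from the error message."""
    return bool(BUCKET_NAME.match(value) or ACCESS_POINT_ARN.match(value))


# Made-up ARN, for illustration only.
example_arn = "arn:aws:s3:us-east-1:123456789012:accesspoint/my-access-point"

print(is_valid_bucket_param(example_arn))                                 # True
print(is_valid_bucket_param(f"{example_arn}/examples/example1.parquet"))  # False
```

So the ARN on its own passes validation, while the ARN with the key appended does not, which seems consistent with the error.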
I have also tried replacing / with : in S3_ACCESS_POINT_ARN, which results in:
PermissionError: AccessDenied
Finally I tried using:
pq.read_table(S3_ACCESS_POINT_ARN, s3_filesystem).to_pandas()
which resulted in:
OSError: Passed non-file path: S3_ACCESS_POINT_ARN
It is worth mentioning that there are no access issues when reading files from this access point, as the code below works:
import boto3
import pyarrow.parquet as pq
S3_ACCESS_POINT_ARN = "..."
s3 = boto3.resource('s3')
# The access point ARN is used in place of the bucket name
bucket = s3.Bucket(S3_ACCESS_POINT_ARN)
# The key is relative to the access point; /tmp/examples must already exist
bucket.download_file("examples/example1.parquet", "/tmp/examples/example1.parquet")
example1_df = pq.read_table("/tmp/examples/example1.parquet").to_pandas()
UPDATE: The S3 access point does not allow list-objects operations below the top level:
An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied
But I cannot see any parameter that would make pyarrow treat the Parquet file as a single file, which could potentially avoid this issue.