
I have a Python script running on an AWS EC2 instance (Amazon Linux) that pulls a parquet file from S3 into a pandas DataFrame. I'm now migrating to a new AWS account and setting up a new EC2 instance. This time, when executing the same script in a Python virtual environment, I get "Segmentation fault" and execution ends.

import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import s3fs
import boto3
from fastparquet import write
from fastparquet import ParquetFile

print("loading...")
df = pd.read_parquet('<my_s3_path.parquet>', engine='fastparquet')

All packages import without error, and all S3 and AWS configurations are set.

When executing the full script, I get:

loading...
Segmentation fault

As you can see, there's not much to work with. I've been googling for a few hours and have seen many speculations about the cause of this symptom. I'd appreciate any help here.


1 Answer


I was able to solve this by changing the engine parameter. According to the official pandas documentation, these are the engine options:

engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'

So simply dropping the explicit engine and letting it default to 'auto' solved the problem:

df = pd.read_parquet('<my_s3_path.parquet>')
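
For context, per the pandas documentation engine='auto' tries pyarrow first and falls back to fastparquet, so the fix above effectively routes the read through pyarrow. A minimal sketch of pinning the engine explicitly, assuming pyarrow and s3fs are installed and the placeholder path is replaced with the real S3 URI:

import pandas as pd

# '<my_s3_path.parquet>' is the same placeholder path as above; replace with the real S3 URI.
# engine='pyarrow' is what engine='auto' selects first when pyarrow is installed.
df = pd.read_parquet('<my_s3_path.parquet>', engine='pyarrow')
print(df.head())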
Answered 2019-09-05T07:04:24.270