I have a TensorFlow model that I would like to feed with Parquet files stored on S3. I'm using petastorm to query these files from S3, and the result of the query is exposed as a TensorFlow dataset thanks to petastorm.tf_utils.make_petastorm_dataset.

Here's the code I used (mainly inspired by this thread: Tensorflow Dataset API: input pipeline with parquet files):

import s3fs
from pyarrow.filesystem import S3FSWrapper
from petastorm.reader import Reader
from petastorm.tf_utils import make_petastorm_dataset

dataset_url = "analytics.xxx.xxx"  # S3 bucket name

# Wrap the s3fs filesystem so pyarrow (and hence petastorm) can read from S3
fs = s3fs.S3FileSystem()
wrapped_fs = S3FSWrapper(fs)

# Open a petastorm Reader on the dataset and expose it as a tf.data.Dataset
with Reader(pyarrow_filesystem=wrapped_fs, dataset_path=dataset_url) as reader:
    dataset = make_petastorm_dataset(reader)
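
For reference, the dataset then has to be consumed while the reader is still open (TF 1.x graph mode, roughly like this):

import tensorflow as tf

with Reader(pyarrow_filesystem=wrapped_fs, dataset_path=dataset_url) as reader:
    dataset = make_petastorm_dataset(reader)
    # The reader must stay open while the dataset is iterated
    iterator = dataset.make_one_shot_iterator()
    tensor = iterator.get_next()
    with tf.Session() as sess:
        # Each element is a namedtuple of tensors, one field per Parquet column
        sample = sess.run(tensor)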

This works pretty well, except that it generates 20+ lines of connection warnings:

W0514 18:56:42.779965 140231344908032 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
W0514 18:56:42.782773 140231311337216 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
W0514 18:56:42.854569 140232468973312 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
W0514 18:56:42.868761 140231328122624 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
W0514 18:56:42.885518 140230816429824 connectionpool.py:274] Connection pool is full, discarding connection: s3.eu-west-1.amazonaws.com
...

According to this thread, urllib3 connectionpool - Connection pool is full, discarding connection, it's certainly related to urllib3, but I can't figure out a way to get rid of these warnings.
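
I could presumably silence them by raising the urllib3 log level (the messages are standard Python logging records that TensorFlow's handler is formatting), but that only hides the symptom rather than fixing whatever is exhausting the pool:

import logging

# Drop WARNING-level records from urllib3, including the connection pool messages
logging.getLogger('urllib3').setLevel(logging.ERROR)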

Has anyone encountered this issue?

1 Answer

Got the answer on GitHub: https://github.com/uber/petastorm/issues/376. Configure boto3's connection pool settings and increase max_pool_connections:

# config_kwargs is forwarded to botocore's Config; the default
# max_pool_connections is 10, which petastorm's parallel reads exceed
fs = s3fs.S3FileSystem(config_kwargs={'max_pool_connections': 50})
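
Putting it together with the snippet from the question (50 is an arbitrary value; size it to at least the number of concurrent reader workers, since botocore's default pool holds only 10 connections per host):

import s3fs
from pyarrow.filesystem import S3FSWrapper
from petastorm.reader import Reader
from petastorm.tf_utils import make_petastorm_dataset

# A larger connection pool keeps petastorm's parallel S3 reads from overflowing it
fs = s3fs.S3FileSystem(config_kwargs={'max_pool_connections': 50})
wrapped_fs = S3FSWrapper(fs)

with Reader(pyarrow_filesystem=wrapped_fs, dataset_path="analytics.xxx.xxx") as reader:
    dataset = make_petastorm_dataset(reader)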

answered 2019-07-11T09:22:05.667