The following code reads from and writes to S3 on the fly, following the approach discussed here:
from smart_open import open
import os

bucket_dir = "s3://my-bucket/annotations/"

# Stream the gzipped TSV from S3 and write the cleaned copy back while reading.
with open(os.path.join(bucket_dir, "in.tsv.gz"), "rb") as fin:
    with open(
        os.path.join(bucket_dir, "out.tsv.gz"), "wb"
    ) as fout:
        for line in fin:
            # Strip whitespace around each tab-separated field.
            l = [i.strip() for i in line.decode().split("\t")]
            string = "\t".join(l) + "\n"
            fout.write(string.encode())
The problem is that after processing a few thousand lines (a few minutes), I get a "connection reset by peer" error:
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
What can I do? I tried calling fout.flush() after every fout.write(string.encode()), but it did not help. Is there a better way to process a .tsv file with about 200 million lines?
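For reference, here is a minimal sketch of the per-line transform pulled out into a function that takes any pair of binary file objects. The name clean_tsv_stream and the in-memory demo data are hypothetical, not part of my real script. This does not fix the reset by itself. It only makes it possible to run the same logic against a local copy of the file (assuming there is enough disk space to download it first) instead of holding an S3 stream open for the whole run.

```python
import gzip
import io


def clean_tsv_stream(fin, fout):
    """Strip whitespace around every tab-separated field, line by line.

    fin and fout are binary file objects (hypothetical helper, for illustration).
    Returns the number of lines processed.
    """
    count = 0
    for raw in fin:
        fields = [field.strip() for field in raw.decode().split("\t")]
        fout.write(("\t".join(fields) + "\n").encode())
        count += 1
    return count


# Demo on in-memory gzip data, standing in for a local copy of in.tsv.gz.
src = io.BytesIO()
with gzip.open(src, "wb") as f:
    f.write(b" a \t b\n c\td \n")
src.seek(0)

dst = io.BytesIO()
with gzip.open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
    n_lines = clean_tsv_stream(fin, fout)

# Decompress the output to inspect the cleaned rows.
result = gzip.decompress(dst.getvalue())
```

Because the function only relies on iteration and write(), the same call should also work with the file objects returned by smart_open's open, so the I/O strategy can be changed without touching the transform.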