apache-spark - How do I manually load spark-redshift AVRO files into Redshift?

Question

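I have a Spark job that failed during the COPY portion of the write. All of the output has already been processed and is sitting in S3, but I'm having trouble figuring out how to load it manually.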

COPY table
FROM 's3://bucket/a7da09eb-4220-4ebe-8794-e71bd53b11bd/part-'
CREDENTIALS 'aws_access_key_id=XXX;aws_secret_access_key=XXX'
format as AVRO 'auto'

In my folder there are _SUCCESS, _committedxxx, and _startedxxx files, followed by 99 files that all start with the prefix part-. When I run this, I get an stl_load_error -> Invalid AVRO file found. Unexpected end of AVRO file. If I remove that prefix, I get:

[XX000] ERROR: Invalid AVRO file
  Detail:
  -----------------------------------------------
  error:     Invalid AVRO file
  code:      8001
  context:   Cannot init avro reader from s3 file Incorrect Avro container file magic number
  query:     10882709
  location:  avropath_request.cpp:432
  process:   query23_27 [pid=10653]
  -----------------------------------------------

Is this possible? It would be nice not to have to redo the processing.

score 2 · Accepted Answer

I ran into the same error from Redshift.

After I deleted the _committedxxx and _startedxxx files (the _SUCCESS file is fine), the COPY worked.
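
The "Incorrect Avro container file magic number" message in the question is consistent with COPY trying to read a file that is not Avro at all. An Avro object container file begins with the four bytes `Obj` followed by 0x01, so a quick check of the leading bytes can tell a real part file from a marker file. The following is only a minimal sketch: the function names are made up for illustration, and it assumes you have downloaded a file locally or already have its first bytes in hand.

```python
# First four bytes of every Avro object container file: 'O', 'b', 'j', 0x01.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro(first_bytes: bytes) -> bool:
    """Return True if the given leading bytes start with the Avro magic number."""
    return first_bytes[:4] == AVRO_MAGIC


def check_local_file(path: str) -> bool:
    """Read the first four bytes of a local file and test them (hypothetical helper)."""
    with open(path, "rb") as f:
        return looks_like_avro(f.read(4))
```

A file that fails this check would be a candidate for removal before running COPY. Passing the check does not prove the file is complete, so it would not rule out the "Unexpected end of AVRO file" case.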

If there are many directories in S3, you can use the aws cli to clean up these files:

aws s3 rm s3://my_bucket/my/dir/ --include "_comm*" --exclude "*.avro" --exclude "*_SUCCESS" --recursive

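Note that the cli seems to have a bug: --include "_comm*" did not work for me, so it tried to delete all files. Using --exclude "*.avro" works around the problem. Be careful and run the command with --dryrun first!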
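
As far as I can tell from the AWS CLI documentation, all files are included by default and --include only re-includes files that an earlier --exclude filtered out. A lone --include "_comm*" would therefore not narrow the selection, which may be what happened above rather than a bug. A pattern along the lines of --exclude "*" --include "*_committed*" --include "*_started*" (still with --dryrun first) should be closer to the intent.

If you would rather double-check which keys are markers before deleting anything, you can classify the key names yourself. This is a rough sketch, not part of spark-redshift or the AWS CLI: `is_marker`, `split_keys`, and the example keys are hypothetical, and it assumes the marker files are named `_committed...` and `_started...` as in the question.

```python
from fnmatch import fnmatchcase

# Assumed marker-file name patterns, based on the file names in the question.
MARKER_PATTERNS = ("_committed*", "_started*")


def is_marker(key: str) -> bool:
    """Return True if the last path component of an S3 key matches a marker pattern."""
    name = key.rsplit("/", 1)[-1]
    return any(fnmatchcase(name, pattern) for pattern in MARKER_PATTERNS)


def split_keys(keys):
    """Split keys into (markers to delete, everything else to keep), preserving order."""
    to_delete = [k for k in keys if is_marker(k)]
    to_keep = [k for k in keys if not is_marker(k)]
    return to_delete, to_keep


if __name__ == "__main__":
    # Hypothetical key names for illustration only.
    example_keys = [
        "my/dir/_SUCCESS",
        "my/dir/_committed_123",
        "my/dir/_started_123",
        "my/dir/part-00000.avro",
    ]
    delete, keep = split_keys(example_keys)
    print("would delete:", delete)
    print("would keep:", keep)
```

The list of keys could come from the output of aws s3 ls --recursive or any S3 client. The sketch only decides what to delete and leaves the actual deletion to whatever tool you trust.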
