我们在内部暂存节点中配置了一个带有端点的雪球http://10.91.16.213:8080
。一切正常,我什至可以通过 s3 cli 命令列出这个雪球中的文件
aws s3 ls my-bucket/data/ --endpoint-url=http://10.91.16.213:8080`
现在我正在尝试使用 hadoop distcp 命令将数据从 hdfs 复制到 s3 雪球。首先,我测试了 hadoop distcp 命令以将一些文件复制到我的 aws 帐户中的实际 s3 test-bucket,
hadoop distcp \
-Dfs.s3a.fast.upload=true \
-Dfs.s3a.access.key=AKIAUPWDYDZTSGWUWJWN \
-Dfs.s3a.secret.key=<my-secret> \
hdfs://path/to/data/ \
s3a://test-bucket-anum/
上述命令执行良好,并在 hadoop 集群中启动复制作业。现在,为了复制到我内部的雪球,我所要做的就是更改端点。这就是我尝试的;
hadoop distcp \
-Dfs.s3a.endpoint=http://10.91.16.213:8080 \
-Dfs.s3a.fast.upload=true \
-Dfs.s3a.access.key=AKIACEMGMYDQNJXGQ2DEOBXG42SQCFR2ZJFTDED3HX3KLVTLOIN6AH3FSDHUF \
-Dfs.s3a.secret.key=<snowball-secret> \
hdfs://path/to/data/ \
s3a://my-bucket/
上述命令失败并出现以下错误;
20/09/02 19:20:22 INFO s3a.S3AFileSystem: Caught an AmazonClientException, which means the client encountered a serious internal problem while trying to communicate with S3, such as not being able to access the network.
20/09/02 19:20:22 INFO s3a.S3AFileSystem: Error Message: {}com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:604)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:962)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:217)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:116)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:430)
Caused by: com.amazonaws.AmazonClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:279)
at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:75)
at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:72)
at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
... 12 more
Caused by: java.lang.RuntimeException: Invalid value for IsTruncated field:
true
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler.endElement(XmlResponsesSaxParser.java:647)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
... 18 more
20/09/02 19:20:22 ERROR tools.DistCp: Invalid arguments:
com.amazonaws.AmazonClientException: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:738)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:399)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3480)
at com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:604)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:962)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at org.apache.hadoop.tools.DistCp.setTargetPathExists(DistCp.java:217)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:116)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:430)
Caused by: com.amazonaws.AmazonClientException: Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:150)
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:279)
at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:75)
at com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:72)
at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
at com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
at com.amazonaws.http.AmazonHttpClient.handleResponse(AmazonHttpClient.java:712)
... 12 more
Caused by: java.lang.RuntimeException: Invalid value for IsTruncated field:
true
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler.endElement(XmlResponsesSaxParser.java:647)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseXmlInputStream(XmlResponsesSaxParser.java:141)
... 18 more
Invalid arguments: Unable to unmarshall response (Failed to parse XML document with handler class com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler). Response Code: 200, Response Text: OK
usage: distcp OPTIONS [source_path...] <target_path>
OPTIONS
-append Reuse existing data in target files and append new
data to them if possible
-async Should distcp execution be blocking
-atomic Commit all changes or none
-bandwidth <arg> Specify bandwidth per map in MB
-delete Delete from target, files missing in source
-diff <arg> Use snapshot diff report to identify the
difference between source and target
-f <arg> List of files that need to be copied
-filelimit <arg> (Deprecated!) Limit number of files copied to <= n
-i Ignore failures during copy
-log <arg> Folder on DFS where distcp execution logs are
saved
-m <arg> Max number of concurrent maps to use for copy
-mapredSslConf <arg> Configuration for ssl config file, to use with
hftps://
-overwrite Choose to overwrite target files unconditionally,
even if they exist.
-p <arg> preserve status (rbugpcaxt)(replication,
block-size, user, group, permission,
checksum-type, ACL, XATTR, timestamps). If -p is
specified with no <arg>, then preserves
replication, block size, user, group, permission,
checksum type and timestamps. raw.* xattrs are
preserved when both the source and destination
paths are in the /.reserved/raw hierarchy (HDFS
only). raw.* xattrpreservation is independent of
the -p flag. Refer to the DistCp documentation for
more details.
-sizelimit <arg> (Deprecated!) Limit number of files copied to <= n
bytes
-skipcrccheck Whether to skip CRC checks between source and
target paths.
-strategy <arg> Copy strategy to use. Default is dividing work
based on file sizes
-tmp <arg> Intermediate work path to be used for atomic
commit
-update Update target, copying only missingfiles or
directories
这是我也尝试过的其他一些 hadoop 配置,但没有运气。
-Dfs.s3a.connection.ssl.enabled=false
:因为我的端点是http。
-Dfs.s3a.region=eu-west-1
我错过了什么吗?
更新:
由于错误消息还包括Invalid arguments:
我想我可能在 args 中提供了一些无效字符,所以我尝试将这些选项编写/etc/hadoop/conf/core-site.xml
如下;
<property>
<name>fs.s3a.endpoint</name>
<value>http://10.91.16.213:8080</value>
</property>
<property>
<name>fs.s3a.fast.upload</name>
<value>true</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value>AKIACEMGMYDQNJXGQ2DEOBXG42SQCFR2ZJFTDED3HX3KLVTLOIN6AH3FSDHUF</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value><snowball-secret></value>
</property>
但得到相同的错误信息:(
更新 2:
读完后,在做 ListObjects 时看起来像是 s3 xml 解析问题。AWS 客户端有这个选项.withEncodingType("url");
,但找不到与 hadoop distcp 类似的东西。