I have a 30 GB LZO file on S3, and I'm indexing it with hadoop-lzo on Amazon EMR (AMI v2.4.2) in region us-east-1.
elastic-mapreduce --create --enable-debugging \
--ami-version "latest" \
--log-uri s3n://mybucket/mylogs \
--name "lzo index test" \
--num-instances 2 \
--master-instance-type "m1.xlarge" --slave-instance-type "cc2.8xlarge" \
--jar s3n://mybucket/hadoop-lzo-0.4.17-SNAPSHOT.jar \
--arg com.hadoop.compression.lzo.DistributedLzoIndexer \
--arg s3://mybucket/my-30gb-file.lzo \
--step-name "Index LZO files"
Each 1% of progress takes about 10 minutes, so finishing the one file would take roughly 16 hours. The progress counter shows that only 80 MB has been read.
By comparison, using the same cluster (while the job above was still running), I can copy the file from S3 to local disk, then into HDFS, and finally run the indexer, all in about 10 minutes. Similarly, my local cluster handles this in about 7 minutes.
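For reference, the manual comparison I ran looked roughly like this (a sketch only; the local path and the jar location on the master node are assumptions, not part of the original job):

```shell
# Rough sketch of the manual S3 -> local -> HDFS -> index run.
# /mnt/ and the jar path are assumptions for illustration.
hadoop fs -get s3://mybucket/my-30gb-file.lzo /mnt/my-30gb-file.lzo   # S3 to local disk
hadoop fs -put /mnt/my-30gb-file.lzo hdfs:///my-30gb-file.lzo         # local to HDFS
hadoop jar /home/hadoop/hadoop-lzo-0.4.17-SNAPSHOT.jar \
    com.hadoop.compression.lzo.DistributedLzoIndexer \
    hdfs:///my-30gb-file.lzo                                          # finishes in ~10 min
```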
In the past, I believe I ran LZO indexing directly against S3 without this delay, although that was on an earlier AMI version. I don't know which AMI I was using, because I always specify "latest". (Update: I tried AMI v2.2.4 with the same result, so perhaps I'm misremembering, or something else is causing the slowness.)
Any ideas what might be going on?
Here is a copy of the step's log output:
Task Logs: 'attempt_201401011330_0001_m_000000_0'
syslog logs
2014-01-01 13:32:39,764 INFO org.apache.hadoop.util.NativeCodeLoader (main): Loaded the native-hadoop library
2014-01-01 13:32:40,043 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl (main): Source name ugi already exists!
2014-01-01 13:32:40,120 INFO org.apache.hadoop.mapred.MapTask (main): Host name: ip-10-7-132-249.ec2.internal
2014-01-01 13:32:40,134 INFO org.apache.hadoop.util.ProcessTree (main): setsid exited with exit code 0
2014-01-01 13:32:40,138 INFO org.apache.hadoop.mapred.Task (main): Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5c785f0b
2014-01-01 13:32:40,943 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2014-01-01 13:32:41,104 WARN com.hadoop.compression.lzo.LzoCodec (main): Could not find build properties file with revision hash
2014-01-01 13:32:41,104 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
2014-01-01 13:32:41,121 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library is available
2014-01-01 13:32:41,121 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library loaded
2014-01-01 13:32:41,314 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3://mybucket/my-30gb-file.lzo' for reading
2014-01-01 13:32:41,478 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '63624'
2014-01-01 13:32:41,773 INFO com.hadoop.mapreduce.LzoIndexRecordWriter (main): Setting up output stream to write index file for s3://mybucket/my-30gb-file.lzo
2014-01-01 13:32:41,885 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Delete called for 's3://mybucket/my-30gb-file.lzo.index.tmp' but file does not exist, so returning false
2014-01-01 13:32:41,928 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Delete called for 's3://mybucket/my-30gb-file.lzo.index' but file does not exist, so returning false
2014-01-01 13:32:41,967 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Creating new file 's3://mybucket/my-30gb-file.lzo.index.tmp' in S3
2014-01-01 13:32:42,017 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '125908'
2014-01-01 13:32:42,227 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '187143'
2014-01-01 13:32:42,516 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '249733'
... (repeat of same "Stream for key" message)
2014-01-01 13:34:14,991 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '62004474'
2014-01-01 13:34:15,077 INFO com.hadoop.mapreduce.LzoSplitRecordReader (main): Reading block 1000 at pos 61941702 of 39082185217. Read is 0.15865149907767773% done.
2014-01-01 13:34:15,077 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '62067843'
... (repeat of same "Stream for key" message)
2014-01-01 13:35:37,849 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '123946504'
2014-01-01 13:35:37,911 INFO com.hadoop.mapreduce.LzoSplitRecordReader (main): Reading block 2000 at pos 123882460 of 39082185217. Read is 0.31714322976768017% done.
2014-01-01 13:35:37,911 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '124008849'
... (repeat of same "Stream for key" message)
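The log pattern above suggests what may be happening: the "seeking to position" lines are each roughly 62 KB apart, which matches one seek per compressed LZO block, so NativeS3FileSystem appears to be re-opening the S3 stream for every block instead of reading sequentially. A back-of-envelope estimate of what that implies (the ~62 KB block size is inferred from consecutive log positions; this is my reading of the logs, not a confirmed explanation):

```shell
# Estimate how many separate ranged S3 reads a seek-per-block pattern implies.
# BLOCK_BYTES is inferred from the gap between consecutive log positions (an assumption).
FILE_BYTES=39082185217
BLOCK_BYTES=62000
echo $(( FILE_BYTES / BLOCK_BYTES ))   # on the order of 630,000 S3 requests
```

At even a few milliseconds of round-trip overhead per request, hundreds of thousands of re-opens would account for hours of wall-clock time.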
My workaround
FWIW, my workaround is to copy the file to HDFS via distcp (see below). In my view, this slowness is something AWS could improve. In the job below, copying from S3 to HDFS takes 17 minutes, while the indexing itself takes only 1 minute.
elastic-mapreduce --create --enable-debugging --alive \
--ami-version "latest" \
--log-uri s3n://mybucket/logs/dailyUpdater \
--name "daily updater test" \
--num-instances 2 \
--master-instance-type "m1.xlarge" --slave-instance-type "cc2.8xlarge" \
--jar s3://elasticmapreduce/samples/distcp/distcp.jar \
--arg s3://mybucket/my-30gb-file.lzo \
--arg hdfs:///my-30gb-file.lzo \
--step-name "Upload input file to HDFS" \
--jar s3n://mybucket/hadoop-lzo-0.4.17-SNAPSHOT.jar \
--arg com.hadoop.compression.lzo.DistributedLzoIndexer \
--arg hdfs:///my-30gb-file.lzo \
--step-name "Index LZO files" \
--jar s3://elasticmapreduce/samples/distcp/distcp.jar \
--arg hdfs:///my-30gb-file.lzo.index \
--arg s3://mybucket/my-30gb-file.lzo.index \
--step-name "Upload index to S3"