
I have a 30 GB LZO file on S3, and I'm using hadoop-lzo to index it on Amazon EMR (AMI v2.4.2), in region us-east-1.

elastic-mapreduce --create --enable-debugging \
    --ami-version "latest" \
    --log-uri s3n://mybucket/mylogs \
    --name "lzo index test" \
    --num-instances 2 \
    --master-instance-type "m1.xlarge"  --slave-instance-type "cc2.8xlarge" \
    --jar s3n://mybucket/hadoop-lzo-0.4.17-SNAPSHOT.jar \
      --arg com.hadoop.compression.lzo.DistributedLzoIndexer \
      --arg s3://mybucket/my-30gb-file.lzo \
      --step-name "Index LZO files"

Each 1% of progress takes roughly 10 minutes, so the whole file would take about 16 hours to finish. The progress counter shows that only 80 MB has been read so far.

By comparison, on the same cluster (while the job above was still running), I can copy the file from S3 to local disk, then into HDFS, and finally run the indexer, all in about 10 minutes. Likewise, my local cluster can handle this in about 7 minutes.

In the past I believe I've run LZO indexing directly against S3 without this kind of delay, though that was on an earlier AMI version. I don't know which AMI I was using, since I always use "latest". (Update: I tried AMI v2.2.4 with the same result, so perhaps I'm misremembering, or something else is causing the slowness.)

Any ideas what might be going on?

Here is a copy of the step's log output:

Task Logs: 'attempt_201401011330_0001_m_000000_0'


stdout logs



stderr logs



syslog logs

2014-01-01 13:32:39,764 INFO org.apache.hadoop.util.NativeCodeLoader (main): Loaded the native-hadoop library
2014-01-01 13:32:40,043 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl (main): Source name ugi already exists!
2014-01-01 13:32:40,120 INFO org.apache.hadoop.mapred.MapTask (main): Host name: ip-10-7-132-249.ec2.internal
2014-01-01 13:32:40,134 INFO org.apache.hadoop.util.ProcessTree (main): setsid exited with exit code 0
2014-01-01 13:32:40,138 INFO org.apache.hadoop.mapred.Task (main):  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5c785f0b
2014-01-01 13:32:40,943 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2014-01-01 13:32:41,104 WARN com.hadoop.compression.lzo.LzoCodec (main): Could not find build properties file with revision hash
2014-01-01 13:32:41,104 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
2014-01-01 13:32:41,121 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library is available
2014-01-01 13:32:41,121 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library loaded
2014-01-01 13:32:41,314 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3://mybucket/my-30gb-file.lzo' for reading
2014-01-01 13:32:41,478 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '63624'
2014-01-01 13:32:41,773 INFO com.hadoop.mapreduce.LzoIndexRecordWriter (main): Setting up output stream to write index file for s3://mybucket/my-30gb-file.lzo
2014-01-01 13:32:41,885 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Delete called for 's3://mybucket/my-30gb-file.lzo.index.tmp' but file does not exist, so returning false
2014-01-01 13:32:41,928 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Delete called for 's3://mybucket/my-30gb-file.lzo.index' but file does not exist, so returning false
2014-01-01 13:32:41,967 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Creating new file 's3://mybucket/my-30gb-file.lzo.index.tmp' in S3
2014-01-01 13:32:42,017 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '125908'
2014-01-01 13:32:42,227 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '187143'
2014-01-01 13:32:42,516 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '249733'
  ... (repeat of same "Stream for key" message)
2014-01-01 13:34:14,991 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '62004474'
2014-01-01 13:34:15,077 INFO com.hadoop.mapreduce.LzoSplitRecordReader (main): Reading block 1000 at pos 61941702 of 39082185217. Read is 0.15865149907767773% done. 
2014-01-01 13:34:15,077 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '62067843'
  ... (repeat of same "Stream for key" message)
2014-01-01 13:35:37,849 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '123946504'
2014-01-01 13:35:37,911 INFO com.hadoop.mapreduce.LzoSplitRecordReader (main): Reading block 2000 at pos 123882460 of 39082185217. Read is 0.31714322976768017% done. 
2014-01-01 13:35:37,911 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key 'my-30gb-file.lzo' seeking to position '124008849'
  ... (repeat of same "Stream for key" message)
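Two checkpoints in the syslog above are enough to extrapolate the total run time. This is illustrative arithmetic only (the timestamps, byte offsets, and file size are copied from the log; nothing else is assumed):

```python
from datetime import datetime

# Two "Reading block ..." checkpoints taken from the syslog above.
t1 = datetime.strptime("2014-01-01 13:34:15,077", "%Y-%m-%d %H:%M:%S,%f")
t2 = datetime.strptime("2014-01-01 13:35:37,911", "%Y-%m-%d %H:%M:%S,%f")
pos1, pos2 = 61941702, 123882460    # byte offsets at block 1000 and block 2000
total_bytes = 39082185217           # file size reported in the log

secs_per_1000_blocks = (t2 - t1).total_seconds()   # ~83 s per 1000 blocks
bytes_per_block = (pos2 - pos1) / 1000             # ~60 KB per LZO block
total_blocks = total_bytes / bytes_per_block       # ~630,000 blocks in the file
eta_hours = total_blocks / 1000 * secs_per_1000_blocks / 3600

print(f"~{bytes_per_block / 1024:.0f} KB/block, "
      f"~{total_blocks:,.0f} blocks, ETA ~{eta_hours:.1f} h")
```

At ~83 seconds per 1000 blocks this works out to roughly 14–15 hours, which lines up with the "about 16 hours" figure above.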

My workaround

FWIW, my workaround is to copy the file to HDFS via distcp (see below). To my mind, this slowness is something AWS could stand to improve. In the job below, copying from S3 to HDFS takes 17 minutes, while the indexing itself takes only 1 minute.

elastic-mapreduce --create --enable-debugging --alive \
    --ami-version "latest" \
    --log-uri s3n://mybucket/logs/dailyUpdater \
    --name "daily updater test" \
    --num-instances 2 \
    --master-instance-type "m1.xlarge"  --slave-instance-type "cc2.8xlarge" \
    --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
      --arg s3://mybucket/my-30gb-file.lzo \
      --arg hdfs:///my-30gb-file.lzo \
      --step-name "Upload input file to HDFS" \
    --jar s3n://mybucket/hadoop-lzo-0.4.17-SNAPSHOT.jar \
      --arg com.hadoop.compression.lzo.DistributedLzoIndexer \
      --arg hdfs:///my-30gb-file.lzo \
      --step-name "Index LZO files" \
    --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
      --arg hdfs:///my-30gb-file.lzo.index \
      --arg s3://mybucket/my-30gb-file.lzo.index \
      --step-name "Upload index to S3"
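The trade is a one-time sequential copy for fast local reads. A rough comparison, using only the numbers quoted above (illustrative arithmetic, not measurements):

```python
# Numbers from the post: ~16 h indexing straight off S3, vs.
# 17 min distcp to HDFS + 1 min indexing against HDFS.
direct_s3_hours = 16.0
copy_min, index_min = 17, 1

workaround_hours = (copy_min + index_min) / 60
speedup = direct_s3_hours / workaround_hours
print(f"workaround: {workaround_hours:.1f} h total, ~{speedup:.0f}x faster")
```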

1 Answer


A seek on an S3 stream is implemented as a GET with a byte-range header. It's entirely reasonable for such a call to take a few hundred milliseconds. Since the indexing process apparently performs a large number of seeks, even though they're all forward, you are effectively re-opening the file thousands of times.
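A minimal sketch of what each of those seeks turns into. The real client is the S3 library inside Hadoop's NativeS3FileSystem, not this code; the helper function and the per-request latency figure are illustrative assumptions:

```python
def range_header(offset, length=None):
    """Build the HTTP Range header that a seek-to-`offset` becomes.

    An open-ended range ("bytes=63624-") reads from the offset to EOF,
    which is what a forward seek on an S3 input stream amounts to.
    """
    end = "" if length is None else str(offset + length - 1)
    return {"Range": f"bytes={offset}-{end}"}

# The seek to position 63624 seen in the question's log:
print(range_header(63624))   # {'Range': 'bytes=63624-'}

# At ~100 ms of request latency apiece, one ranged GET per ~60 KB
# LZO block (~630,000 blocks in a 39 GB file) adds up fast:
seeks, latency_s = 630_000, 0.1
print(f"~{seeks * latency_s / 3600:.1f} hours spent just re-opening the file")
```

That overhead, not the data transfer itself, is what dominates the 16-hour estimate.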

Your workaround is the right approach. S3 is optimized for sequential access, not random access.

Answered 2014-01-01T16:36:07.383