hadoop - Issue with LZO files on S3

Question

I have 3 LZO-compressed files in HDFS, along with their corresponding index files.

Permission  Owner  Group       Size       Replication  Block Size  Name
-rw-r--r--  alum   supergroup  0 B        3            128 MB      _SUCCESS
-rw-r--r--  alum   supergroup  192.29 MB  3            128 MB      part-00000.lzo
-rw-r--r--  alum   supergroup  89.56 KB   3            128 MB      part-00000.lzo.index
-rw-r--r--  alum   supergroup  243.09 MB  3            128 MB      part-00001.lzo
-rw-r--r--  alum   supergroup  106.67 KB  3            128 MB      part-00001.lzo.index
-rw-r--r--  alum   supergroup  163.99 MB  3            128 MB      part-00002.lzo
-rw-r--r--  alum   supergroup  70.54 KB   3            128 MB      part-00002.lzo.index

We copied these files to Amazon S3 and created a Hive external table over them for analysis.

These are the problems we are facing:

1) The LZO index files are also treated as data files, so meaningless data appears in the Hive tables.
2) A "count(*)" query on the table runs with only 4 mappers, which suggests a problem with splitting.

Can you tell me what is happening on S3? The same setup works fine on our YARN cluster.

score 0 · Accepted Answer

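S3 is handled differently from HDFS. You do not need to apply splitting logic the way you do in HDFS. Remember that S3 is cloud storage, whereas HDFS is storage local to the cluster. Your files are not stored as blocks in S3. This behavior is expected.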
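
To make the splitting question concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the question or from any Hadoop API: the helper names (`is_lzo_data_file`, `estimate_splits`) are hypothetical, the split size is assumed to equal the 128 MB block size shown in the listing, and real LZO splits are aligned to compressed-block boundaries from the index rather than to exact 128 MB offsets. It only separates the data files from the index files and compares the mapper count you would expect when the index is used with the count when each file is read whole.

```python
import math

# File sizes in MB, copied from the HDFS listing in the question.
LISTING_MB = {
    "_SUCCESS": 0.0,
    "part-00000.lzo": 192.29,
    "part-00000.lzo.index": 89.56 / 1024,
    "part-00001.lzo": 243.09,
    "part-00001.lzo.index": 106.67 / 1024,
    "part-00002.lzo": 163.99,
    "part-00002.lzo.index": 70.54 / 1024,
}

SPLIT_MB = 128.0  # assumption: split size equals the 128 MB block size


def is_lzo_data_file(name):
    """True for LZO data files; index files and the _SUCCESS marker are excluded."""
    return name.endswith(".lzo")


def estimate_splits(size_mb, split_mb=SPLIT_MB):
    """Rough split count for one splittable file (simple ceiling division)."""
    return math.ceil(size_mb / split_mb) if size_mb > 0 else 0


data_files = sorted(n for n in LISTING_MB if is_lzo_data_file(n))

# If the index files are honored, each data file can be split.
splits_if_indexed = sum(estimate_splits(LISTING_MB[n]) for n in data_files)

# If the files are treated as unsplittable, each one goes to a single mapper.
splits_if_unsplittable = len(data_files)

print(data_files)
print(splits_if_indexed, splits_if_unsplittable)
```

Under these assumptions the estimate is 6 splits when the index files are used for splitting and 3 when each `.lzo` file is read whole. Treat both numbers as approximations; the actual mapper count also depends on the input format configured for the Hive table and on whether Hive combines splits.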
