hadoop - Choosing a file format in Hadoop

Question

Guys, what are the recommended file formats for the different stages of Hadoop processing?

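Processing: I have been using text format / JSON SerDe in Hive for processing. Is this a good format for the staging tables where I perform my ETL (transformation) operations? Should I use a better format? I know Parquet / ORC / Avro are specialized formats, but are they a good fit for ETL (transformation) operations? Also, would using a compression codec such as Snappy or Zlib be a recommended approach? (I don't want to lose performance to the extra CPU utilization that compression causes; please correct me if compression actually gives better performance.)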

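Reporting: depends on my query needs.
Aggregation: Columnar storage seems like the logical solution. Is Parquet with Snappy compression a good fit (assuming my Hadoop distribution is Cloudera)?
Full row fetch: If my query pattern needs all the columns in a row, is columnar storage still a wise choice, or should I go with the Avro file format?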

Archive: For archived data I plan to use Avro, since it handles schema evolution and compresses well.

score 0 · Accepted Answer

Choosing the file format depends on the use case. Since you are processing the data in Hive, here are my recommendations.

Processing: Use ORC for processing, since you are doing aggregations and other column-level operations. A columnar format lets Hive read only the columns a query touches, which can improve performance many times over.
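A minimal HiveQL sketch of that setup, assuming a hypothetical text/JSON staging table `staging_events` (the table and column names here are made up for illustration, not taken from your schema):

```sql
-- Hypothetical ORC table mirroring the columns of the staging table.
CREATE TABLE events_orc (
  event_id   BIGINT,
  user_id    BIGINT,
  event_type STRING,
  amount     DOUBLE
)
STORED AS ORC;

-- Rewrite the staging data into ORC once, then run the
-- transformations against the ORC table instead of the text/JSON one.
INSERT OVERWRITE TABLE events_orc
SELECT event_id, user_id, event_type, amount
FROM staging_events;

-- This aggregation only needs the event_type and amount columns,
-- which is the access pattern a columnar format is designed for.
SELECT event_type, SUM(amount) AS total_amount
FROM events_orc
GROUP BY event_type;
```

Keeping the raw text/JSON table as the landing layer and converting right after ingest is one common arrangement; whether it pays off depends on how many times the data is read afterwards.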

Compression: Used wisely, on a case-by-case basis, compression improves performance by cutting the time spent on expensive I/O operations, which often outweighs the extra CPU cost.
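In Hive the codec is typically chosen per table through `TBLPROPERTIES`. The sketch below uses hypothetical table and column names; as a general rule of thumb Snappy costs less CPU while Zlib produces smaller files, so it is worth measuring both on your own data:

```sql
-- ORC compresses with ZLIB by default; this hypothetical table
-- switches to SNAPPY to trade some file size for less CPU work.
CREATE TABLE events_orc_snappy (
  event_id   BIGINT,
  user_id    BIGINT,
  event_type STRING,
  amount     DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- The Parquet equivalent, as asked about for the reporting layer.
CREATE TABLE events_parquet_snappy (
  event_id   BIGINT,
  user_id    BIGINT,
  event_type STRING,
  amount     DOUBLE
)
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression" = "SNAPPY");
```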

If the use case is row-based, for example queries that fetch every column of a row, then Avro is recommended.
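A rough sketch of an Avro-backed table for that row-oriented case, again with hypothetical table and column names and assuming a Hive version that supports `STORED AS AVRO` (0.14 or later):

```sql
-- Hypothetical row-oriented table; Hive derives the Avro schema
-- from the column definitions.
CREATE TABLE events_avro (
  event_id   BIGINT,
  user_id    BIGINT,
  event_type STRING,
  amount     DOUBLE
)
STORED AS AVRO;

-- Ask Hive to compress the Avro files it writes with Snappy.
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

INSERT OVERWRITE TABLE events_avro
SELECT event_id, user_id, event_type, amount
FROM staging_events;

-- Full-row lookup: every column is needed, so a row format
-- does not have to reassemble the row from separate column chunks.
SELECT *
FROM events_avro
WHERE event_id = 42;
```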

Hope this helps you decide.
