hadoop - Hadoop Ingestion automation techniques

Question

My context is:

10 CSV files are uploaded to my server during the night.

My process is:

Ingestion:
- Put the files on HDFS
- Create ORC Hive tables and load the data into them
Processing:
- Spark processing: transformation, cleaning, joins, ...
- A lot of chained steps (Spark jobs)

I am looking for best practices to automate the first part and trigger the second part.

I also looked at https://kylo.io/. It looks perfect, but I think it is still too young to put into production.

Thanks in advance.

score 2 · Accepted Answer

Both Oozie and NiFi will work in combination with Flume, Hive, and Spark actions.

So your (Oozie or NiFi) workflow should work like this:

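- A cron job (or a time schedule) starts the workflow.
- The first step in the workflow is a Flume process that loads the data into the desired HDFS directories. You can do this without Flume, using just HDFS commands, but Flume will help keep your solution scalable in the future.
- A Hive action creates/updates the tables.
- Spark actions execute your custom Spark programs.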
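
To make the order of the steps concrete, here is a minimal sketch in plain Python rather than an actual Oozie or NiFi definition. It only builds the command lines a scheduler would run in sequence, using the "just HDFS commands" variant instead of Flume. The function name `build_steps` and all paths in the usage note are hypothetical, and the Hive script is assumed to contain the ORC table DDL and load statements.

```python
def build_steps(csv_files, hdfs_dir, hive_script, spark_app):
    """Return the ordered (name, command) pairs for one nightly run.

    All arguments are placeholders supplied by the caller; nothing here
    is executed, the function only assembles the command lines.
    """
    if not csv_files:
        raise ValueError("no CSV files to ingest")
    return [
        # Step 1: copy the uploaded files into HDFS (-f overwrites existing files).
        ("put", ["hdfs", "dfs", "-put", "-f", *csv_files, hdfs_dir]),
        # Step 2: run a Hive script assumed to create/update the ORC tables.
        ("hive", ["hive", "-f", hive_script]),
        # Step 3: submit the custom Spark program on YARN.
        ("spark", ["spark-submit", "--master", "yarn", spark_app]),
    ]
```

For example, `build_steps(["/data/in/a.csv"], "/landing/csv", "load_orc.hql", "transform.py")` returns three commands in ingestion, Hive, Spark order. In a real Oozie workflow each of these would be its own action (shell or fs, hive, spark) with explicit ok/error transitions.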

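Make sure you take care of error handling in the workflow, with proper logging and notifications, so that you can optimize the workflow in production.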
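
As a rough illustration of that advice outside of Oozie or NiFi, the sketch below runs a list of steps in order, logs each one, and stops and notifies on the first failure. The names `run_steps`, `notify`, and `runner` are hypothetical; `notify` stands in for whatever alerting you use (email, chat webhook, and so on).

```python
import logging
import subprocess

log = logging.getLogger("nightly_ingest")


def run_steps(steps, notify=print, runner=subprocess.run):
    """Run (name, command) pairs in order; stop at the first failure.

    Returns True if every step exited with code 0, otherwise False.
    `notify` is called once with a short message when a step fails.
    """
    for name, cmd in steps:
        log.info("starting step %s: %s", name, " ".join(cmd))
        try:
            result = runner(cmd, capture_output=True, text=True)
        except OSError as exc:
            # The executable could not be started at all (e.g. not on PATH).
            log.error("step %s could not start: %s", name, exc)
            notify(f"Step '{name}' could not start: {exc}")
            return False
        if result.returncode != 0:
            log.error("step %s failed (exit %d): %s",
                      name, result.returncode, result.stderr)
            notify(f"Step '{name}' failed with exit code {result.returncode}")
            return False
        log.info("finished step %s", name)
    return True
```

Stopping at the first failure means the Spark jobs never run on partially loaded tables. A workflow engine gives you the same behavior through its error transitions, plus retries and a run history.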
