My context: 10 CSV files are uploaded to my server during the night.
My process is:
Ingestion:
- Put the files on HDFS
- Create ORC Hive tables and load the data into them
Processing:
- Spark processing: transformations, cleaning, joins, ...
- A lot of chained steps (Spark jobs)
I am looking for best practices to automate the first part and trigger the second part. Options I am considering:
- Cron + shell script + hdfs dfs -put
- Oozie?
- Apache NiFi?
- Flume?
- Talend :(
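To make the first option concrete, here is a minimal sketch of a nightly driver that cron could launch: push the landed CSVs to HDFS, load them into the ORC Hive table, then trigger the Spark chain only if ingestion succeeded. All paths, file names (`load_orc.hql`, `process.py`), and the schedule are assumptions for illustration, not settled choices.

```python
#!/usr/bin/env python3
"""Hypothetical nightly driver for the cron + shell option.
Paths and script names below are assumptions, not part of the question."""
import glob
import subprocess
import sys
from datetime import date

LANDING_DIR = "/data/landing"        # where the 10 CSVs arrive overnight (assumed)
HDFS_STAGING = "/user/etl/staging"   # HDFS staging area (assumed)


def target_dir(batch_date: date) -> str:
    """Dated HDFS directory, so each night's batch is isolated and re-runnable."""
    return f"{HDFS_STAGING}/{batch_date.isoformat()}"


def run(cmd: list) -> None:
    """Run one step; check=True aborts the whole chain on the first failure."""
    subprocess.run(cmd, check=True)


def main() -> None:
    dest = target_dir(date.today())
    run(["hdfs", "dfs", "-mkdir", "-p", dest])
    # Ingestion, step 1: push the landed CSVs to HDFS.
    run(["hdfs", "dfs", "-put", *glob.glob(f"{LANDING_DIR}/*.csv"), dest])
    # Ingestion, step 2: load the staged CSVs into the ORC Hive table.
    # load_orc.hql is a hypothetical script (e.g. an external table over the
    # staging dir plus INSERT ... SELECT into the ORC table).
    run(["hive", "--hivevar", f"staging_dir={dest}", "-f", "load_orc.hql"])
    # Trigger the Spark processing chain only after ingestion succeeded
    # (run() would have raised on any earlier failure).
    run(["spark-submit", "--master", "yarn", "process.py", dest])


if __name__ == "__main__" and "--run" in sys.argv:
    # Guarded behind an explicit flag so importing the file never launches jobs.
    main()
```

Cron would then invoke it with something like `0 6 * * * python3 /opt/etl/nightly_ingest.py --run >> /var/log/etl.log 2>&1` (the time is an assumption). The obvious weakness of this option is error handling and retries, which is exactly what the tools below are supposed to provide.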
I also looked at https://kylo.io/ ; it seems perfect, but I think it is still too young to put into production.
Thanks in advance.