I have an app that exports files to an S3 bucket at regular intervals. I need to develop a Spark Streaming app that streams from this bucket and processes the lines of any new files every 30 seconds.
I have read this post, which helped me understand the credentials side, but it still doesn't address my needs.
Q1. Could anyone provide some code or a hint on how to do this? I've seen the Twitter example, but I couldn't figure out how to apply it to my scenario.
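For reference, here is roughly what I have so far, based on the file-stream examples in the docs. The bucket name and prefix are placeholders, and I'm assuming the S3 credentials are already configured in the Hadoop configuration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3FileStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("S3FileStream")
    // 30-second batch interval, matching my requirement
    val ssc = new StreamingContext(conf, Seconds(30))

    // textFileStream monitors the directory and should pick up files
    // that appear after the stream starts ("my-bucket" is a placeholder)
    val lines = ssc.textFileStream("s3n://my-bucket/exports/")

    lines.foreachRDD { rdd =>
      // process the lines of each new batch of files here
      rdd.foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Is `textFileStream` the right approach here, or is there something more appropriate for S3?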
Q2. How does Spark Streaming know which file was the last one it streamed before picking up the next one? Is this based on the file's LastModified header or some sort of timestamp?
Q3. If the cluster goes down, how can I resume streaming from where I left off?
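My guess is that checkpointing is involved here. Something like the sketch below is what I imagine, with `getOrCreate` recovering the context from a checkpoint directory after a restart, but I'm not sure this is enough to avoid missing or reprocessing files (the checkpoint path is a placeholder):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3FileStreamWithCheckpoint {
  // Paths are placeholders for illustration
  val checkpointDir = "s3n://my-bucket/checkpoints"

  // Builds a fresh context; only called when no checkpoint exists
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("S3FileStreamWithCheckpoint")
    val ssc = new StreamingContext(conf, Seconds(30))
    ssc.checkpoint(checkpointDir)
    ssc.textFileStream("s3n://my-bucket/exports/").print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if one exists, otherwise start fresh
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Would this actually resume from the correct file after a crash, or does the file stream start over from "now"?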
Thanks in advance!!