5

I've setup a Flink 1.2 standalone cluster with 2 JobManagers and 3 TaskManagers and I'm using JMeter to load-test it by producing Kafka messages / events which are then processed. The processing job runs on a TaskManager and it usually takes ~15K events/s.
The job has set EXACTLY_ONCE checkpointing and is persisting state and checkpoints to Amazon S3. If I shutdown the TaskManager running the job it takes a bit, a few seconds, then the job is resumed on a different TaskManager. The job mainly logs the event ids which are consecutive integers (e.g. from 0 to 1200000).
When I check the output on the TaskManager I shut down the last count is for example 500000, then when I check the output on the resumed job on a different TaskManager it starts with ~ 400000. This means ~100K of duplicated events. This number is dependent on the speed of the test can be higher or lower.
Not sure if I'm missing something but I would expect the job to display the next consecutive number (like 500001) after resuming on the different TaskManager.
Does anyone know why this is happening / extra settings I have to configure to obtain the exactly once?

4

1 回答 1

11

您只看到了一次的预期行为。Flink 通过检查点和故障重放的组合来实现容错。保证不是每个事件都会被发送到管道中一次,而是每个事件都会影响管道的状态一次。

检查点在整个​​集群中创建一致的快照。在恢复期间,操作员状态被恢复并且源从最近的检查点重放。

如需更详尽的解释,请参阅此数据 Artisans 博客文章:High-throughput, low-latency, and exact-once stream processing with Apache Flink™</a>,或Flink 文档

于 2017-04-16T21:55:33.977 回答