apache-flink - Flink exactly-once message processing

Question

I've setup a Flink 1.2 standalone cluster with 2 JobManagers and 3 TaskManagers and I'm using JMeter to load-test it by producing Kafka messages / events which are then processed. The processing job runs on a TaskManager and it usually takes ~15K events/s.
The job has set EXACTLY_ONCE checkpointing and is persisting state and checkpoints to Amazon S3. If I shutdown the TaskManager running the job it takes a bit, a few seconds, then the job is resumed on a different TaskManager. The job mainly logs the event ids which are consecutive integers (e.g. from 0 to 1200000).
When I check the output on the TaskManager I shut down the last count is for example 500000, then when I check the output on the resumed job on a different TaskManager it starts with ~ 400000. This means ~100K of duplicated events. This number is dependent on the speed of the test can be higher or lower.
Not sure if I'm missing something but I would expect the job to display the next consecutive number (like 500001) after resuming on the different TaskManager.
Does anyone know why this is happening / extra settings I have to configure to obtain the exactly once?

score 11 · Accepted Answer

您只看到了一次的预期行为。Flink 通过检查点和故障重放的组合来实现容错。保证不是每个事件都会被发送到管道中一次，而是每个事件都会影响管道的状态一次。

检查点在整个集群中创建一致的快照。在恢复期间，操作员状态被恢复并且源从最近的检查点重放。

如需更详尽的解释，请参阅此数据 Artisans 博客文章：High-throughput, low-latency, and exact-once stream processing with Apache Flink™</a>，或 Flink 文档。

apache-flink - Flink exactly-once message processing

1 回答 1

Related

Reference