We are using Apache Kafka as an event sourcing mechanism for efficiently distributing events to multiple repositories, each of which builds application state suited to the use case of the service that will serve the data. As an example, say we have Service A and Service B. Both have a repository of the same people, but the view of the data differs based on the use case. The idea is that when we need a new view of the data for a separate use case, we can replay the event stream into a new or existing repository. This is based on the ideas outlined here and here regarding stream processing and event sourcing, with rebuilding application state from events.
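
To illustrate the replay idea, here is a stripped-down sketch. It assumes a String-keyed topic named person-events and the plain Java consumer client; the topic name, group id, and view shapes are placeholders rather than our actual schema:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.*;

public class ReplayIntoViews {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "person-view-rebuild");   // fresh group id so the replay starts from the beginning
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Two independent "repositories" built from the same event stream.
        Map<String, String> serviceAView = new HashMap<>();        // latest record per person
        Map<String, List<String>> serviceBView = new HashMap<>();  // full change history per person

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("person-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) break;                       // simplistic end-of-replay check for a sketch
                for (ConsumerRecord<String, String> record : records) {
                    String personId = record.key();
                    serviceAView.put(personId, record.value());     // overwrite: "current state" view
                    serviceBView.computeIfAbsent(personId, k -> new ArrayList<>())
                                .add(record.value());               // append: history view
                }
            }
        }
    }
}
```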

As part of our business model, we may process frequent changes against those people (e.g. names, addresses, and dates change somewhat frequently). One new use case that has come up is the need to perform temporal queries (i.e. show me what a person's data looked like at a specific date and time). The obvious answer is to replay all of the events up to that specific moment in time to rebuild application state. While this sounds reasonable on paper, I don't think it scales well when you potentially have billions of events accumulated over the course of years.
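
For reference, the naive replay would look roughly like the sketch below (it assumes the consumer is already positioned at the beginning of the topic and that record timestamps are trustworthy); the problem is that it scans every event for every person just to answer a single query:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

public class NaivePointInTimeReplay {
    // Rebuilds "person id -> latest value" as of the cutoff by replaying the whole stream
    // and ignoring anything newer than the cutoff. Cost is O(total events) per query.
    static Map<String, String> rebuildAsOf(KafkaConsumer<String, String> consumer, long cutoffEpochMillis) {
        Map<String, String> state = new HashMap<>();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty()) break;                    // simplistic end-of-stream check for a sketch
            for (ConsumerRecord<String, String> record : records) {
                if (record.timestamp() <= cutoffEpochMillis) {
                    state.put(record.key(), record.value());
                }
            }
        }
        return state;
    }
}
```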

Current Solution

Today, the general concept is that our Kafka consumers process these events from a single topic based on activity (e.g. address updates, name updates, etc.) and update a master copy of the person data to reflect what it should look like "right now". They also store a copy of the delta of each change in a separate store, keyed so that we can relate changes to a specific person. We are thinking a key/value store (NoSQL) is the right store for this data. The approach is to replay all of the deltas for a person from that store to serve the temporal query need.
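
In outline, with in-memory maps standing in for the master copy and the key/value delta store, and a field-level map standing in for whatever the real event payload looks like, the consumer-side logic and the temporal lookup would be roughly:

```java
import java.time.Instant;
import java.util.*;

public class PersonDeltaStore {
    // Stand-in for the master copy: personId -> current field values ("right now").
    private final Map<String, Map<String, String>> master = new HashMap<>();
    // Stand-in for the key/value delta store: personId -> time-ordered field changes.
    private final Map<String, NavigableMap<Instant, Map<String, String>>> deltas = new HashMap<>();

    // Called by the Kafka consumer for each change event (address update, name update, ...).
    public void apply(String personId, Instant eventTime, Map<String, String> changedFields) {
        master.computeIfAbsent(personId, k -> new HashMap<>()).putAll(changedFields);
        deltas.computeIfAbsent(personId, k -> new TreeMap<>()).put(eventTime, changedFields);
    }

    // Temporal query: fold only this person's deltas up to the requested instant,
    // instead of replaying the entire topic.
    public Map<String, String> asOf(String personId, Instant pointInTime) {
        Map<String, String> result = new HashMap<>();
        NavigableMap<Instant, Map<String, String>> history =
                deltas.getOrDefault(personId, new TreeMap<>());
        for (Map<String, String> delta : history.headMap(pointInTime, true).values()) {
            result.putAll(delta);
        }
        return result;
    }

    public static void main(String[] args) {
        PersonDeltaStore store = new PersonDeltaStore();
        store.apply("person-1", Instant.parse("2020-01-01T00:00:00Z"), Map.of("name", "A. Smith"));
        store.apply("person-1", Instant.parse("2021-06-01T00:00:00Z"), Map.of("address", "1 Main St"));
        System.out.println(store.asOf("person-1", Instant.parse("2020-12-31T00:00:00Z"))); // {name=A. Smith}
    }
}
```

The point of the split is that a point-in-time lookup only touches one person's deltas rather than the whole topic.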

The Ask

Is this approach correct for building out stores that serve temporal queries of the data? Are there other approaches, or even tools, that have solved this need when dealing with massive amounts of data? Kafka seems right for the event sourcing part of the equation, but my concern is the need to retain events for years for the purposes of auditing, disaster recovery, and temporal (point-in-time) queries.
