In the post, How to handle retry messages and failures when using event streaming platforms like Apache Kafka, I walked through how to retry events when using Apache Kafka.
Now in this post, I will discuss an approach you can implement to ensure processors process events in order even when you’ve implemented a retry solution.
After reading this article, please feel free to also watch the video or vice versa.
Ordering is Critical
Inconsistent data will lead to incorrect information being displayed and can cause problems for future processing. For example, suppose an IoT device’s status is affected by out-of-order event processing. In that case, it may give false positives or negatives when checking on its condition later down the line, which could be very costly!
For example, let’s take a valve sensor; when it is open, it will communicate its status to the server as open, and then when it is closed, it will communicate its status as closed. If events get processed out of order, this could lead to devastating consequences.
Ensuring event ordering in event-driven systems when implementing retries
Let’s get right into it.
So let’s say we have a service, a processor processing customer events.
And the processor consumes an event for customer.created.
Now, if that fails, you could publish it to a retry topic without doing anything, but what happens if you get another event for the same customer, the event could be billing-details.added.
But the failed event customer.created for the same customer has not been processed. So now, if the event billing-details.added gets processed before the customer.created, then you could end up with data inconsistency that will affect data consistency.
You can’t process that until you’ve fixed and remedied the first event.
So what do you do with the current event? Well, you need to hold it for later.
And here is what you do.
1 – On event failure
Before publishing the event to the retry topic, you want to log the event into a failure log in a persistent data store that processors will query.
The inserted record, along with the event payload, must contain the event identifiers, especially the event index and the entity object’s identifier, which in our example is the customer ID.
The object entity identifier needs to be searchable so you can perform a lookup for each event’s related object entity to check if there are any older events in a failed or held status.
I advise you to look at the outbox pattern and use the failure log table as the outbox. See (https://danieltammadge.com/2022/06/the-outbox-pattern-the-guaranteed-event-publishing-pattern/) for more information
2 – Before processing an event
Step 2 comes before step 1 because when the processor consumes an event, it needs to look up the entity object’s identifier to see if there is an older event that has not been processed.
3 – Hold events if they should not be processed yet
If there are events in a failed state for the same customer, the processor should not process the event but instead, send it to a holding queue table.
When inserting a record, ensure you store the event payload, event index and the object’s entity identifier.
The entity identifier, in this case, the customer id, will be used later to republish all held events for a particular entity. You will use the event index to publish events in the correct order.
(In the absence or in addition to the event index, if sequence identifiers or timestamps are available, then I’d advise storing them as well.)
4 – Retry failed events and publish held events once successful
Again it is vital to ensure that the processor checks if it should process an event by performing a lookup to check if an earlier event for the same entity object is in a failed state.
When the retry process successfully processes a failed event, You can then republish all the events for that particular customer held using the customer identifier to the first retry topic.
Ensure order by using the event index so that ordering is constant.
I would recommend using an atomic transaction to:
- Update the corresponding failure record in the failure table that it is no longer in a failed state (or remove it)
- Move the held events in the holding table to a retry outbox table to publish the events for retry (again, see the outbox pattern).
- Before processing events, check the failure log table to see if there are any earlier failed events for the corresponding entity
- And if there are failed events, then hold the event for later
- On failure, log the failing event into a failure log and republish for retry
- Hold for later processing any later events which get consumed, which corresponds to the failed event
- Once a failed event is processed successfully, then mark the event as successful and republish the held events