What is Orchestration and how to implement orchestration in an event-driven system?

There is also a YouTube version of this post. I have linked to the video at the bottom of this post.

Outline

In this blog post, I will explain:

What orchestration is.
Seven key points, including one key difference between using orchestration in an asynchronous event-driven system and in a synchronous request-driven (point-to-point) system.
An example showing the flow of data between an orchestration service and the services it needs to trigger.
How to handle unsuccessful events and rollbacks.
How you can prevent race conditions, before wrapping up with the trade-offs of orchestration.

What is Orchestration

The first thing to explain is what orchestration is.

A common analogy is an orchestra’s conductor, whose role is to direct the musicians so that they play in time and together.

This analogy is used because, in an orchestrated system, you have a central service or process that issues commands to worker services (the event processors) and then waits for the processing outcome.

http://3.8.172.178/2020/12/event-driven-architecture-commands-vs-events/

When an event processor has processed the event, it publishes its outcome. The orchestration service handles that outcome and, based on the result, triggers the next step in the workflow.

Seven Important points

The seven important points are:

Orchestration is also known as the mediator pattern.
An orchestration service should not know about the implementation or business logic of the worker processes (the event processors). It should only know the workflow logic and what to trigger next based on the workers’ results.
Event processors (the workers) should be self-contained and independent of each other. The scope a worker manages can range from a single function to a full-blown business process with multiple downstream services, and it could even be another orchestration service.
A worker should only publish the absolute, final outcome of its processing, because the orchestration service should not be responsible for triggering retries when a worker fails. It is the worker’s responsibility to manage its own retries and to publish its outcome only when it has either succeeded or exhausted its configured number of retry attempts.
The previous point applies to event-driven systems. In the request-driven, point-to-point world, the responsibility for re-triggering processors after failures falls to the orchestration service, because of the nature of point-to-point systems.
Orchestration services are required to keep track of and persist the state of processing.
An orchestration service may also need to report the processing status of items to interested parties, such as a monitoring dashboard.
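
To make the fourth point concrete, here is a minimal sketch of a worker that manages its own retries and only hands back the final outcome for publishing. The names (Outcome, process_with_retries, max_attempts) are hypothetical and chosen for illustration; a real worker would also add back-off between attempts and publish the outcome to a topic rather than return it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Outcome:
    item_id: str
    status: str    # "SUCCEEDED" or "FAILED"
    attempts: int  # how many attempts were made


def process_with_retries(item_id: str,
                         handler: Callable[[str], None],
                         max_attempts: int = 3) -> Outcome:
    """Run handler until it succeeds or the attempts are exhausted.

    Intermediate failures are swallowed here, so the orchestration
    service only ever sees one final outcome per item.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            handler(item_id)
            return Outcome(item_id, "SUCCEEDED", attempt)
        except Exception:
            continue  # assumption: every error is retryable in this sketch
    return Outcome(item_id, "FAILED", max_attempts)


# Usage: a handler that fails twice and then succeeds.
calls = {"count": 0}


def flaky_handler(item_id: str) -> None:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")


def broken_handler(item_id: str) -> None:
    raise RuntimeError("permanent failure")


recovered = process_with_retries("item-1", flaky_handler)
exhausted = process_with_retries("item-2", broken_handler)
```

In both cases only one outcome leaves the worker, which is what keeps retry logic out of the orchestration service.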

A successful orchestration process example

Orchestration example: Showing all communication between services

Don’t be put off by this diagram showing all the connections; I will walk you through it step by step shortly. First, I want to show at a high level what a simple orchestration process could look like.

Here we have one orchestration service with a persistent data store and four worker services, each with its own database or database schema.

We also have an initial command event topic for the processing request that triggers the process, multiple event topics for the processing events, and a final topic where the outcome of the full process is published to notify interested subscribers of the result.

(Steps 1-3)

So let’s dive into it.

The first step is for the orchestration service to consume the initial request. This may have been published by a public customer experience API or by a back-office service, which could be a function or an upstream event processing worker triggered by another orchestration service.

Once the orchestration service has consumed the request, it would create a new record in a persistent database table, and the service would then publish the processing command to the first service, Service A.

(Steps 4-5) In the fourth step, Service A consumes the event and processes it, and then in step five it publishes the message synchronously.

As the diagram shows, the orchestration service now acknowledges that it has successfully processed the initial event. Service A does the same after it has successfully published the processing event message.
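
The ordering described here, acknowledging the consumed message only after the next message has been published, can be sketched in plain Python. Inbox and handle_next are hypothetical names, and the list-based inbox is only a stand-in for a real broker subscription; the assumption is that publish raises an error when the broker does not confirm the message.

```python
class Inbox:
    """Stand-in queue where a message stays pending until acknowledged."""

    def __init__(self, messages):
        self.pending = list(messages)

    def peek(self):
        return self.pending[0] if self.pending else None

    def ack(self):
        self.pending.pop(0)


def handle_next(inbox, process, publish):
    """Consume one message, publish its result, then acknowledge it."""
    message = inbox.peek()
    if message is None:
        return False
    result = process(message)
    publish(result)  # assumed to raise if the publish is not confirmed
    inbox.ack()      # only reached after a successful publish
    return True


def mark_succeeded(message):
    return {**message, "status": "SUCCEEDED"}


def failing_publish(result):
    raise ConnectionError("broker unavailable")


# Usage: the first publish fails, so the message is not acknowledged.
published = []
inbox = Inbox([{"item_id": "item-1"}])
try:
    handle_next(inbox, mark_succeeded, failing_publish)
except ConnectionError:
    pass
pending_after_failure = len(inbox.pending)

# The second attempt publishes successfully and then acknowledges.
handle_next(inbox, mark_succeeded, published.append)
```

Because the acknowledgement comes last, a failed publish leaves the message available to be consumed again, at the cost of possible duplicates that downstream consumers need to tolerate.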

(Steps 6-8) The Orchestration service would consume the event, store the processing state in the database, and trigger the next step of the process, which could be processed in parallel by two different services that handle other business processes.

However, what happens if one of them is unavailable or only runs at certain times of the day, like a batch process?

(Steps 9-10) In this example, Service C is not processing events. Service B is, and it consumes & processes the event and publishes the outcome.

(Steps 11-13) The orchestration service will consume the result and update the state, but now it needs to wait for Service C’s result before triggering the next step.

(Steps 14-15) Service C then resumes processing.

(Steps 16-18) It processes the event and publishes its result, causing the orchestration service to trigger Service D.

(Steps 19-20) And Service D would process the request.

(Steps 21-23) And in this example, Service D is publishing an event stating that processing was completed successfully.

(Step 24) Because there are no more steps in the workflow, the orchestration service can mark the processing item as completed in the database table for future reference.
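
The successful flow above can be sketched as a small state machine. This is only an illustration under stated assumptions: WORKFLOW, ProcessState and Orchestrator are hypothetical names, a dictionary stands in for the persistent database table, and the publish callback stands in for publishing a command to a worker's topic.

```python
from dataclasses import dataclass, field

# Each stage lists the services triggered together; B and C run in parallel.
WORKFLOW = [{"A"}, {"B", "C"}, {"D"}]


@dataclass
class ProcessState:
    stage: int = 0
    pending: set = field(default_factory=set)  # services not yet reported
    status: str = "IN_PROGRESS"


class Orchestrator:
    def __init__(self, publish):
        self.publish = publish  # callable(service, item_id)
        self.store = {}         # stand-in for the persistent table

    def start(self, item_id):
        """Steps 1-3: record the new request and trigger the first stage."""
        self.store[item_id] = ProcessState()
        self._trigger_stage(item_id)

    def _trigger_stage(self, item_id):
        state = self.store[item_id]
        state.pending = set(WORKFLOW[state.stage])
        for service in sorted(state.pending):
            self.publish(service, item_id)

    def on_success(self, item_id, service):
        """Update state and move on only when the whole stage has reported."""
        state = self.store[item_id]
        state.pending.discard(service)
        if state.pending:
            return  # e.g. B has finished but C has not resumed yet
        if state.stage + 1 < len(WORKFLOW):
            state.stage += 1
            self._trigger_stage(item_id)
        else:
            state.status = "COMPLETED"


# Usage: replay the walkthrough, with Service C reporting after Service B.
commands = []
orchestrator = Orchestrator(lambda service, item_id: commands.append(service))
orchestrator.start("item-1")
orchestrator.on_success("item-1", "A")
orchestrator.on_success("item-1", "B")
commands_while_waiting = list(commands)
orchestrator.on_success("item-1", "C")
orchestrator.on_success("item-1", "D")
```

Note that the orchestrator holds only workflow logic: which services belong to which stage and what to trigger next. It knows nothing about what each service does.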

Handling failures in orchestration example

(Failure steps 19-20)

However, what happens if Service D is unable to process the request successfully? As I mentioned in the key points, the worker (the processor) must manage its own retries, failures and errors, and only publish the absolute result.

And in this case, Service D has tried multiple times and has published the processing outcome stating it has failed.

I am not going to go through how to handle retries in event-driven systems here. I have a video in the pipeline that explains this in detail.

(Failure steps 21-24)

When the orchestration service consumes Service D’s failure result, it publishes an event stating that the item’s processing has failed. Depending on your requirements, you may need to publish this message only after any rollbacks have happened.

In this example, the orchestration service would publish messages to trigger Services A, B and C.

You would do this if the services were required to handle failures and roll back operations.

Finally, once all the workflow steps for rolling back and handling failures have been completed, the orchestration service marks the process as completed successfully. The request item itself would be recorded as failed, but the orchestration service handled the processing correctly, unless of course there was an unhandled exception.
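
The failure path above can be sketched as follows. It shows the variant where the failure event is published only after all rollbacks have been confirmed. ProcessRecord, FailureHandler and the topic names ("A.rollback", "process.outcome") are hypothetical, and the sketch keeps the item's status separate from the workflow's status, as described above.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessRecord:
    completed: list = field(default_factory=list)  # services that succeeded
    awaiting_rollback: set = field(default_factory=set)
    failed_service: str = ""
    item_status: str = "IN_PROGRESS"      # outcome of the request item
    workflow_status: str = "IN_PROGRESS"  # did the orchestration finish?


class FailureHandler:
    def __init__(self, publish):
        self.publish = publish  # callable(topic, message)

    def on_failure(self, item_id, record, failed_service):
        """Record the failure and ask each completed service to roll back."""
        record.item_status = "FAILED"
        record.failed_service = failed_service
        record.awaiting_rollback = set(record.completed)
        for service in record.completed:
            self.publish(f"{service}.rollback", {"item_id": item_id})
        self._finish_if_done(item_id, record)

    def on_rollback_done(self, item_id, record, service):
        record.awaiting_rollback.discard(service)
        self._finish_if_done(item_id, record)

    def _finish_if_done(self, item_id, record):
        if record.awaiting_rollback or record.workflow_status == "COMPLETED":
            return
        # The workflow completed correctly even though the item failed.
        record.workflow_status = "COMPLETED"
        self.publish("process.outcome", {
            "item_id": item_id,
            "status": record.item_status,
            "failed_service": record.failed_service,
        })


# Usage: Service D fails after A, B and C have already succeeded.
messages = []
handler = FailureHandler(lambda topic, message: messages.append((topic, message)))
record = ProcessRecord(completed=["A", "B", "C"])
handler.on_failure("item-1", record, "D")
for service in ["B", "A", "C"]:
    handler.on_rollback_done("item-1", record, service)
```

If your requirements allow subscribers to learn about the failure immediately, the outcome event could instead be published at the start of on_failure, before the rollback commands.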

Preventing race conditions

In the example above, there is one problem with having the orchestration service consume multiple outcome event topics: a race condition could occur.

When a service consumes from different topics, there is no guarantee that it will consume the messages from those topics in the order they were published.

The exception is when the messages are published to a single queue that implements FIFO (first in, first out). This is why I love Apache Kafka: it is log-based, and order is preserved within a partition.

But how can you have all the workers publishing to the same topic without coupling services together, especially if other non-orchestrated processes use the worker services?

Let’s look at two possible solutions.

Preventing race conditions with callback topics and event streaming

One – Preventing Race conditions using the Callback Topic pattern

The first is to implement the callback topic approach where the message publisher informs the processor which topic it should publish the processing outcome to.

If you want to know more about this pattern, see my post about callback topics, which is linked below.

Besides publishing to the callback topic, you can also have the services publish to an outcome event topic, which can be consumed by other services and used for monitoring.

http://3.8.172.178/2020/11/what-are-callback-topics-and-when-you-would-use-them-asynchronous-design-patterns/
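
As a rough illustration of the callback topic approach, the sketch below uses an in-memory stand-in for a broker. InMemoryBroker, the worker function, the callback_topic field and all topic names are hypothetical; the point is only that the command tells the worker where to publish its outcome, so the worker never needs to know who is orchestrating it.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Stand-in broker: each topic is a simple FIFO queue."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic):
        queue = self.topics[topic]
        return queue.popleft() if queue else None


def worker(broker, name, command_topic, outcome_topic):
    """Process one command and publish the outcome where the command asks."""
    command = broker.consume(command_topic)
    if command is None:
        return
    outcome = {"item_id": command["item_id"], "service": name,
               "status": "SUCCEEDED"}
    # The publisher of the command chose the callback topic.
    broker.publish(command["callback_topic"], outcome)
    # The worker's own outcome topic remains available for monitoring.
    broker.publish(outcome_topic, outcome)


# Usage: the orchestrator asks B and C to reply on the same callback topic.
broker = InMemoryBroker()
for service in ("b", "c"):
    broker.publish(f"service-{service}.commands",
                   {"item_id": "item-1",
                    "callback_topic": "orchestrator.callbacks"})

worker(broker, "C", "service-c.commands", "service-c.outcomes")
worker(broker, "B", "service-b.commands", "service-b.outcomes")

callbacks = [message["service"]
             for message in broker.topics["orchestrator.callbacks"]]
```

The orchestration service now reads every outcome from one queue in the order the outcomes were published, while non-orchestrated consumers can keep using the per-service outcome topics.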

Two – Stream processing

As food for thought, if you are using Apache Kafka, the second approach is stream processing. It allows you to process the events from multiple topics using unions, joins, grouping, aggregation and data filtering, and to output the result to a single topic.
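
To give a feel for the idea, the sketch below performs a union and a filter over two outcome streams in plain Python. It does not use the Kafka Streams API, and it is not a statement about how Kafka orders merged records: ordering the union by a "ts" timestamp field is an assumption of this sketch, as are the function and field names.

```python
import heapq


def union_by_timestamp(*streams):
    """Union several streams, each already ordered by "ts", into one."""
    return heapq.merge(*streams, key=lambda event: event["ts"])


def outcomes_for(item_id, *streams):
    """Filter the unioned stream down to the events for a single item."""
    return [event for event in union_by_timestamp(*streams)
            if event["item_id"] == item_id]


# Usage: outcome events from two worker topics.
service_b_outcomes = [
    {"ts": 1, "service": "B", "item_id": "item-1"},
    {"ts": 4, "service": "B", "item_id": "item-2"},
]
service_c_outcomes = [
    {"ts": 2, "service": "C", "item_id": "item-2"},
    {"ts": 3, "service": "C", "item_id": "item-1"},
]

single_topic = list(union_by_timestamp(service_b_outcomes, service_c_outcomes))
item_1_outcomes = outcomes_for("item-1", service_b_outcomes, service_c_outcomes)
```

The orchestration service would then consume only the single output stream, rather than one topic per worker.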

Trade-offs and advantages & disadvantages

Now let us wrap up with the trade-offs of having a centralised service orchestrate multiple services.

It introduces a higher degree of coupling between services. And if different teams maintain those services, then what could be seen as a simple change could cause development and release planning headaches, and priority juggling if multiple development teams are required to get involved in making changes.

For example, you could have a development team owning the orchestration service and different teams for the various worker services.

And depending on your organisation structure, it might be hard to coordinate changes.

But on the flip side, if the orchestration service and workers are maintained by one team or teams that can easily coordinate changes because of their structure and deployment tools, you may not have any issues.

Because you have a centralised process, you know exactly what the workflow is, and it can be simple to monitor, track and report on because the orchestration service keeps state. Don’t mistake simple for easy, though.

With the non-centralised approach, called choreography and also known as the broker pattern, the flow is designed at a global level without a centralised process managing it. In that case you generally need to pull all the processing outcomes from multiple processes into a centralised data store for reporting.

With Apache Kafka, you can do that with its streaming functionality.

And finally. Should you use orchestration? Yes.

Should you use it for all solutions? No.

Why? Because architecture is about understanding the trade-offs. You need to look at your company’s requirements, development process, experience and a range of other factors, including time, before determining whether you should use orchestration in your next project.