In the world of microservices, it’s important to ensure that data consistency is maintained across distributed transactions.
In his 1987 paper “Sagas”, Hector Garcia-Molina described an approach for handling system failures in long-running database transactions.
Hector described a Saga as a sequence of related small transactions. In a Saga, the coordinator (the database in their case) makes sure that all of the involved transactions complete successfully. If any of the transactions fails, the coordinator runs compensating transactions to amend the partial execution.
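To make the coordinator's job concrete, here is a minimal in-memory sketch. It is not taken from the paper: `run_saga`, the `Step` tuple and the step names are hypothetical, and this version only compensates the steps that had already completed.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with the compensating action that undoes it.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
            log.append(f"{name}: ok")
            completed.append(step)
        except Exception:
            log.append(f"{name}: failed")
            # Undo the work done so far, most recent step first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"{done_name}: compensated")
            break
    return log


def noop() -> None:
    """Stands in for a transaction that succeeds."""


def fail() -> None:
    """Stands in for a transaction that fails."""
    raise RuntimeError("simulated failure")


example_log = run_saga([
    ("T1", noop, noop),
    ("T2", noop, noop),
    ("T3", fail, noop),
])
print(example_log)
```

Here the third transaction fails, so the log shows T2 and then T1 being compensated.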
This approach is increasingly relevant in the world of microservices as application logic often needs to transact across multiple bounded contexts — each encapsulated by its own microservice with independent databases.
Caitie McCaffrey recently shared a great presentation that summarizes her experience using the Saga pattern in distributed systems.
During the presentation, Caitie uses the following example set of related transactions — or Saga — to illustrate the pattern.
We can use a Lambda function to model each of the actions (and their compensating actions) and use a state machine in Step Functions as the coordinator for the saga.
Each action and compensating action is modelled as a Lambda function.
Since the compensating actions can also fail, we need to be able to retry them until they succeed, which means they have to be idempotent. We’ll also implement backward recovery in the event of a system failure.
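Here is a small sketch of why idempotency matters for retries. A dictionary stands in for the bookings table, and `cancel_booking`, `retry_until_success` and `flaky_cancel` are hypothetical names for illustration. The cancel does its work and then fails twice, as if the response had been lost, so the same delete is repeated three times.

```python
from typing import Callable, Dict

bookings: Dict[str, str] = {"trip-1": "hotel reserved"}  # stand-in for a table


def cancel_booking(trip_id: str) -> None:
    """Idempotent: deleting a missing key is a no-op, so repeats are safe."""
    bookings.pop(trip_id, None)


def retry_until_success(operation: Callable[[], None]) -> int:
    """Call operation until it stops raising; return the number of attempts."""
    attempts = 0
    while True:
        attempts += 1
        try:
            operation()
            return attempts
        except Exception:
            continue  # unbounded on purpose, to mirror "retry until success"


calls = {"count": 0}


def flaky_cancel() -> None:
    """Performs the cancel, then fails twice as if the response was lost."""
    calls["count"] += 1
    cancel_booking("trip-1")
    if calls["count"] < 3:
        raise TimeoutError("simulated lost response")


attempts = retry_until_success(flaky_cancel)
print(attempts, bookings)
```

Because the cancel is idempotent, the repeated attempts leave the data in the same state as a single successful one.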
Below is the state machine that represents our saga. Each of the actions (BookHotel, BookFlight and BookRental) has a compensating action, and the actions are performed in order. The recursive arrows indicate that the compensating actions are retried until they succeed.
Each Lambda function expects the input to be in the following shape:
Inside each of the functions is a simple PutItem request against a different DynamoDB table. The corresponding compensating function performs a DeleteItem against the corresponding table to roll back the change.
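As a rough sketch of what one such pair of functions might look like in Python with boto3: the field names `trip_id` and `hotel` are assumptions (not the demo's actual input shape), and the handlers take the table as a parameter so they can be exercised with an in-memory stand-in instead of a real DynamoDB table.

```python
from typing import Any, Dict


def get_table(name: str) -> Any:
    """Return a DynamoDB Table resource (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the handlers can be tried without AWS

    return boto3.resource("dynamodb").Table(name)


def book_hotel(event: Dict[str, Any], table: Any) -> Dict[str, str]:
    """Action: record the booking with a PutItem request."""
    # "trip_id" and "hotel" are assumed field names for illustration only.
    table.put_item(Item={"trip_id": event["trip_id"], "hotel": event["hotel"]})
    return {"status": "booked"}


def cancel_hotel(event: Dict[str, Any], table: Any) -> Dict[str, str]:
    """Compensating action: remove the booking with a DeleteItem request."""
    table.delete_item(Key={"trip_id": event["trip_id"]})
    return {"status": "cancelled"}


class FakeTable:
    """In-memory stand-in exposing the two Table methods used above."""

    def __init__(self) -> None:
        self.items: Dict[str, Dict[str, Any]] = {}

    def put_item(self, Item: Dict[str, Any]) -> None:
        self.items[Item["trip_id"]] = Item

    def delete_item(self, Key: Dict[str, Any]) -> None:
        self.items.pop(Key["trip_id"], None)


table = FakeTable()
event = {"trip_id": "trip-1", "hotel": "Example Inn"}
book_hotel(event, table)
cancel_hotel(event, table)
cancel_hotel(event, table)  # repeating the cancel is harmless
print(table.items)
```

An unconditional DeleteItem succeeds even when the item is already gone, which is what makes the compensating function safe to retry.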
The state machine passes the same input to each action in turn (BookHotel → BookFlight → BookRental) and records their results at a specific path. This avoids overwriting the input $ that will be passed to the next function.
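To see why writing each result to its own path preserves the input, here is a small Python imitation of how a result path behaves for simple paths. It is a sketch and not the Step Functions implementation; `apply_result_path` and the path names `$.BookHotelResult` and `$.BookFlightResult` are assumptions.

```python
import copy
from typing import Any, Dict


def apply_result_path(state_input: Dict[str, Any], result: Any, result_path: str) -> Any:
    """Mimic ResultPath for simple paths such as "$.BookHotelResult"."""
    if result_path == "$":
        return result  # the result replaces the whole input
    if not result_path.startswith("$."):
        raise ValueError("only '$' and '$.a.b' style paths are handled here")
    output = copy.deepcopy(state_input)  # leave the caller's input untouched
    node = output
    keys = result_path[2:].split(".")
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = result
    return output


trip = {"trip_id": "trip-1"}
after_hotel = apply_result_path(trip, {"status": "booked"}, "$.BookHotelResult")
after_flight = apply_result_path(after_hotel, {"status": "booked"}, "$.BookFlightResult")
print(after_flight)
```

The original fields survive both steps. With a result path of `$`, the first result would have replaced the input, and the next function would no longer see `trip_id`.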
In this naive implementation, we’ll apply the compensating action for any failure, hence the States.ALL below. In practice, you should consider giving certain error types a retry, e.g. transient errors such as DynamoDB’s provisioned throughput exceeded exceptions.
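A hedged sketch of what one action's state could look like, written as a Python dictionary and printed as Amazon States Language JSON. The resource ARN is a placeholder, the result paths are assumed names, and the error name in the Retry block is an assumption, since the exact name depends on how the Lambda function surfaces the DynamoDB error.

```python
import json

book_hotel_state = {
    "Type": "Task",
    # Placeholder ARN; substitute the real function ARN.
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:book-hotel",
    "ResultPath": "$.BookHotelResult",
    # Retry transient errors a few times before giving up.
    "Retry": [
        {
            "ErrorEquals": ["ProvisionedThroughputExceededException"],  # assumed name
            "IntervalSeconds": 1,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    # Anything not handled by Retry (or still failing after it) triggers the rollback.
    "Catch": [
        {
            "ErrorEquals": ["States.ALL"],
            "ResultPath": "$.BookHotelError",
            "Next": "CancelHotel",
        }
    ],
    "Next": "BookFlight",
}

print(json.dumps(book_hotel_state, indent=2))
```

Retriers are evaluated before catchers, so the catch-all only fires once the retries for the named error are exhausted, or straight away for any other error.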
The output and error from each action and compensating action are stored at a specific path. This avoids overwriting the input value $ for the rest of the actions.
Following the happy path, each of the actions is performed in turn and the state machine completes successfully.
When failures strike, we need to apply the corresponding compensating actions in turn depending on where the failure occurs.
In the examples below, if the failure happens at BookFlight, then both of the compensating actions up to that point, ending with CancelHotel, will be executed to roll back any changes performed thus far.
Similarly, if the failure happens at BookRental, then all three compensating actions will be executed in reverse order, ending with CancelHotel, to roll back all the state changes from the transaction.
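The failure paths described above can be traced with a short simulation. It assumes the compensating action of the failed step itself is included in the rollback, as “both” and “all three” suggest. CancelHotel is named in this post; CancelFlight and CancelRental are assumed names that follow the same pattern.

```python
from typing import List, Optional

ACTIONS = ["BookHotel", "BookFlight", "BookRental"]

# CancelHotel appears in the post; the other two names are assumptions.
COMPENSATIONS = {
    "BookHotel": "CancelHotel",
    "BookFlight": "CancelFlight",
    "BookRental": "CancelRental",
}


def execution_path(fail_at: Optional[str] = None) -> List[str]:
    """List the states visited when the action named fail_at fails."""
    path: List[str] = []
    for index, action in enumerate(ACTIONS):
        path.append(action)
        if action == fail_at:
            # Roll back from the failed action down to the first one.
            for earlier in reversed(ACTIONS[: index + 1]):
                path.append(COMPENSATIONS[earlier])
            break
    return path


for failure in (None, "BookFlight", "BookRental"):
    print(failure, "->", " → ".join(execution_path(failure)))
```

Running the failed step's own compensation is safe here only because the compensating actions are idempotent: cancelling a booking that was never written changes nothing.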
Each compensating action also has an infinite retry loop! In practice, there should be a reasonable upper limit on the number of retries before you alert for human intervention.
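One way to put a ceiling on the retries is sketched below in plain Python. `compensate_with_limit` and `alert_operator` are hypothetical names; a real system would likely add a delay with backoff between attempts and page someone rather than append to a list.

```python
from typing import Callable, List, Optional

alerts: List[str] = []


def alert_operator(message: str) -> None:
    """Hypothetical hook; a real system might page an on-call engineer here."""
    alerts.append(message)


def compensate_with_limit(
    name: str, compensate: Callable[[], None], max_attempts: int = 5
) -> bool:
    """Retry a compensating action up to max_attempts times, then alert."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            compensate()
            return True
        except Exception as error:
            last_error = error  # remember why the latest attempt failed
    alert_operator(f"{name} failed after {max_attempts} attempts: {last_error}")
    return False


def always_fails() -> None:
    """Stands in for a compensating action that cannot succeed."""
    raise ConnectionError("table unavailable")


succeeded = compensate_with_limit("CancelHotel", always_fails, max_attempts=3)
print(succeeded, alerts)
```

In a Step Functions state machine, a similar effect could come from a retry limit on the compensating state plus a catch that routes to a notification state.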
If you’d like to experiment on your own with the Saga Pattern using this example, the source code for this demo can be found here.
I’d be interested in your thoughts on the benefits or drawbacks of using the Saga Pattern with a microservices architecture … please drop a comment below. Thanks for reading!