How the Saga Pattern manages failures with AWS Lambda and Step Functions

In the world of microservices, it’s important to ensure that data consistency is maintained across distributed transactions.

In his 1987 paper “Sagas”, Hector Garcia-Molina described an approach for handling system failures in long-running database transactions.

Hector described a Saga as a sequence of related small transactions. In a Saga, the coordinator (the database, in their case) makes sure that all of the involved transactions complete successfully. If any transaction fails, the coordinator runs compensating transactions to undo the partial execution.

This approach is increasingly relevant in the world of microservices as application logic often needs to transact across multiple bounded contexts — each encapsulated by its own microservice with independent databases.

Caitie McCaffrey recently shared a great presentation that summarizes her experience using the Saga pattern in distributed systems.

During the presentation, Caitie uses the following example set of related transactions — or Saga — to illustrate the pattern.

Begin transaction
Start book hotel request
End book hotel request
Start book flight request
End book flight request
Start book car rental request
End book car rental request
End transaction

We can use a Lambda function to model each of the actions, and each of their compensating actions, and use a state machine in Step Functions as the coordinator for the saga.

Each action and compensating action is modelled as a Lambda function.

Since the compensating actions can also fail, we need to be able to retry them until they succeed, which means they have to be idempotent. We’ll also implement backward recovery in the event of a system failure.

Below is the state machine that represents our saga. Each of the actions (BookHotel, BookFlight and BookRental) has a compensating action, and the actions are performed in order. The looping arrows indicate that the compensating actions are retried until they succeed.

Each Lambda function expects the input to be in the following shape:

{
  "trip_id": "5c12d94a-ee6a-40d9-889b-1d49142248b7",
  "depart": "London",
  "depart_at": "2017-07-10T06:00:00.000Z",
  "arrive": "Dublin",
  "arrive_at": "2017-07-12T08:00:00.000Z",
  "hotel": "holiday inn",
  "check_in": "2017-07-10T12:00:00.000Z",
  "check_out": "2017-07-12T14:00:00.000Z",
  "rental": "Volvo",
  "rental_from": "2017-07-10T00:00:00.000Z",
  "rental_to": "2017-07-12T00:00:00.000Z"
}

Inside each of the functions is a simple PutItem request against a different DynamoDB table. The corresponding compensating function performs a DeleteItem against the same table to roll back the PutItem action.
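As a rough illustration of one action and its compensating action, the sketch below writes and then deletes a booking record. It is written in Python, which is not necessarily the language of the demo, and the table is passed in as a parameter so the example runs without AWS. `InMemoryTable` is a hypothetical stand-in that mimics only the `put_item(Item=...)` and `delete_item(Key=...)` calls of a boto3 DynamoDB `Table` resource. The key name `trip_id`, the stored attributes, and the return values are assumptions based on the input shape shown earlier.

```python
class InMemoryTable:
    """Hypothetical stand-in for a boto3 DynamoDB Table (put_item/delete_item only)."""

    def __init__(self):
        self.items = {}

    def put_item(self, Item):
        self.items[Item["trip_id"]] = dict(Item)

    def delete_item(self, Key):
        # Like DynamoDB's DeleteItem, deleting a missing key is not an error.
        self.items.pop(Key["trip_id"], None)


def book_hotel(event, table):
    """Action: record the hotel booking for this trip."""
    table.put_item(Item={
        "trip_id": event["trip_id"],
        "hotel": event["hotel"],
        "check_in": event["check_in"],
        "check_out": event["check_out"],
    })
    return {"booked": True}


def cancel_hotel(event, table):
    """Compensating action: remove the booking; safe to call repeatedly."""
    table.delete_item(Key={"trip_id": event["trip_id"]})
    return {"cancelled": True}


event = {
    "trip_id": "5c12d94a-ee6a-40d9-889b-1d49142248b7",
    "hotel": "holiday inn",
    "check_in": "2017-07-10T12:00:00.000Z",
    "check_out": "2017-07-12T14:00:00.000Z",
}
hotels = InMemoryTable()
book_hotel(event, hotels)
cancel_hotel(event, hotels)
cancel_hotel(event, hotels)  # a retry leaves the table in the same state
```

Because the delete is a no-op when the item is already gone, retrying the compensating action any number of times yields the same end state, which is the idempotency property the saga relies on.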

The state machine passes the same input to each action in turn (BookHotel → BookFlight → BookRental) and records each result at a specific path. This avoids overwriting the input $ that will be passed to the next function.
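To make the effect of recording results at a specific path concrete, here is a small Python sketch that mimics what a `ResultPath` such as `$.BookHotelResult` does to a state's input. The function `apply_result_path` and the path names are illustrative assumptions, and the sketch only handles `$` and simple top-level paths, not the full path syntax.

```python
import copy


def apply_result_path(state_input, result_path, result):
    """Return the state's output: the original input with `result` added
    under a top-level key such as "$.BookHotelResult"."""
    if result_path == "$":
        # A path of "$" replaces the whole input with the result.
        return result
    if not result_path.startswith("$."):
        raise ValueError("only '$' and '$.<key>' paths are supported in this sketch")
    output = copy.deepcopy(state_input)
    output[result_path[2:]] = result
    return output


trip = {"trip_id": "5c12d94a-ee6a-40d9-889b-1d49142248b7", "hotel": "holiday inn"}
after_hotel = apply_result_path(trip, "$.BookHotelResult", {"booked": True})
after_flight = apply_result_path(after_hotel, "$.BookFlightResult", {"booked": True})
# The original fields are still present for the next function to read.
print(after_flight["trip_id"], after_flight["hotel"])
```

With a path of `$`, the result would replace the input entirely, and the next function would no longer see fields such as `trip_id`.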

In this naive implementation, we’ll apply the compensating action for any failure, hence the use of States.ALL as the error matcher. In practice, you should consider retrying certain error types first, e.g. transient errors such as DynamoDB’s provisioned throughput exceeded exceptions.
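One possible shape for such a state, expressed as a Python dictionary that serializes to Amazon States Language, is sketched below. The Lambda ARN, the result paths, and the retry settings are placeholders, and the error name in `Retry` assumes the function surfaces DynamoDB's throttling error under that name.

```python
import json

book_hotel_state = {
    "Type": "Task",
    # Placeholder ARN; substitute the ARN of your own function.
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:book-hotel",
    "ResultPath": "$.BookHotelResult",
    # Assumed error name: retry throttling a few times with backoff first.
    "Retry": [{
        "ErrorEquals": ["ProvisionedThroughputExceededException"],
        "IntervalSeconds": 1,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    }],
    # Anything else (or exhausted retries) starts the compensation chain.
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.BookHotelError",
        "Next": "CancelHotel",
    }],
    "Next": "BookFlight",
}

print(json.dumps({"BookHotel": book_hotel_state}, indent=2))
```

Retriers are evaluated before catchers, so in this sketch the compensating action only runs once the retries for the transient error are exhausted or a different error occurs.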

The output and error from each action and compensating action are stored at a specific path. This avoids overwriting the input value $ for the rest of the actions.

The Happy Path Flow

Following the happy path, each of the actions is performed in turn and the state machine completes successfully.

Failure Cases

When failures strike, we need to apply the corresponding compensating actions in turn depending on where the failure occurs.

For example, if the failure happens at BookFlight, then both CancelFlight and CancelHotel will be executed to roll back any changes performed thus far.

Similarly, if the failure happens at BookRental, then all three compensating actions (CancelRental, CancelFlight and CancelHotel) will be executed in that order to roll back all the state changes from the transaction.
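The order of compensation described above can be simulated locally. The following Python sketch is not how Step Functions executes the state machine; it is a hypothetical in-process coordinator (`run_saga`, `make_step` and the log format are invented for illustration) that runs the actions in order and, on failure, runs the compensating actions backwards starting from the step that failed.

```python
def run_saga(steps, trip):
    """Run (name, action, compensation) steps in order. On failure,
    compensate the failed step and every earlier step, in reverse."""
    log = []
    for index, (name, action, _) in enumerate(steps):
        try:
            action(trip)
            log.append(name)
        except Exception:
            log.append(name + " failed")
            # The failed step is compensated too, since it may have
            # partially applied its change before raising.
            for _, _, compensate in reversed(steps[: index + 1]):
                log.append(compensate(trip))
            return log
    return log


def make_step(name, cancel_name, fail=False):
    """Build a fake step whose action optionally raises."""
    def action(trip):
        if fail:
            raise RuntimeError(name + " is unavailable")

    def compensate(trip):
        return cancel_name

    return (name, action, compensate)


steps = [
    make_step("BookHotel", "CancelHotel"),
    make_step("BookFlight", "CancelFlight"),
    make_step("BookRental", "CancelRental", fail=True),
]
print(run_saga(steps, {"trip_id": "5c12d94a-ee6a-40d9-889b-1d49142248b7"}))
# Logs BookHotel, BookFlight, "BookRental failed", then
# CancelRental, CancelFlight, CancelHotel.
```

Compensating the failed step as well mirrors the flow described here, where a failure at BookFlight triggers CancelFlight before CancelHotel. That is only safe because the compensating actions are idempotent.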

Each compensating action also has an infinite retry loop! In practice, there should be a reasonable upper limit on the number of retries before you alert for human intervention.
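One way to cap the retries is sketched below as a Python dictionary that serializes to Amazon States Language. The ARN, the retry numbers, and the `NotifyOperator` and `Fail` state names are hypothetical. The idea is simply that a `Retry` with a finite `MaxAttempts` is followed by a `Catch` that hands over to a state which raises an alert. The helper `retry_delays` is also an assumption, added only to show the resulting wait schedule.

```python
cancel_hotel_state = {
    "Type": "Task",
    # Placeholder ARN; substitute the ARN of your own function.
    "Resource": "arn:aws:lambda:REGION:ACCOUNT_ID:function:cancel-hotel",
    "ResultPath": "$.CancelHotelResult",
    "Retry": [{
        "ErrorEquals": ["States.ALL"],
        "IntervalSeconds": 2,
        "MaxAttempts": 5,     # give up after five retries
        "BackoffRate": 2.0,   # each wait is double the previous one
    }],
    "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "ResultPath": "$.CancelHotelError",
        "Next": "NotifyOperator",  # hypothetical state that alerts a human
    }],
    "Next": "Fail",  # hypothetical terminal state marking the saga as failed
}


def retry_delays(retrier):
    """Seconds waited before each retry attempt under this retrier."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** attempt
        for attempt in range(retrier["MaxAttempts"])
    ]


print(retry_delays(cancel_hotel_state["Retry"][0]))
# [2.0, 4.0, 8.0, 16.0, 32.0]
```

With these example numbers the state machine waits about a minute in total before giving up on the compensating action and escalating.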

If you’d like to experiment on your own with the Saga pattern using this example, the source code for this demo is available in the theburningmonk/lambda-saga-pattern repository.

I’d be interested in your thoughts on the benefits or drawbacks of using the Saga pattern with a microservices architecture. Please drop a comment below. Thanks for reading!