If you’re exploring serverless architecture on AWS then you’ll quickly run into DynamoDB. DynamoDB is AWS’s managed NoSQL solution, and commonly the first choice for providing database services when working with AWS Lambda.
But if your system isn’t well architected for its usage pattern or you’re not using best practices when working with DynamoDB, then your new serverless application is going to wake you up at night.
DynamoDB is a highly scalable, serverless NoSQL service from AWS. As a NoSQL database, its tables are independent of one another, in contrast with the tables in a relational database, and it’s typically used as a key/value store or a document store. Once you’ve decided on the shape of your data, you define the table’s hash key and, optionally, a range key, and you can begin storing data.
It’s also serverless, which means it’s a fully managed service. AWS is responsible for managing the service for you. Picture all the work you would do with other NoSQL solutions to manage service software deployment and patching, scaling, fault tolerance, data replication, etc. This work is done for you.
So how can your serverless application fail when using DynamoDB? What sorts of things can cause it to go bump in the night and wake you from a peaceful night’s sleep? With any cloud managed service there’s the general problem of cloud instability. I’m not saying the cloud is unstable, but services and parts of it do fail on occasion, just as they do on premises. Cloud systems architecture needs to account for this concern by, for example, using multiple AWS regions.
Another class of issue is performance. You should design your data schema around the sort of queries the table should answer.
However, the sort of failure you’re most likely to deal with when starting off is related to DynamoDB’s read and write capacity throughput. AWS manages the work of scaling DynamoDB’s storage capacity for you. AWS also does the work of adding read and write capacity to your table, but you need to tell it how many read and write capacity units there should be.
What are read and write capacity units?
In order to properly size read and write capacity for a DynamoDB table, you’ll have to start by making projections on the expected amount of read and write operations as well as the size of the data expected in each operation. Once in production, you’ll need to measure, verify, and adjust your capacity configuration accordingly.
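To make the sizing exercise concrete: per the AWS documentation, one read capacity unit covers one strongly consistent read per second of an item up to 4 KB (an eventually consistent read costs half as much), and one write capacity unit covers one write per second of an item up to 1 KB. Larger items consume additional units, rounded up. The sketch below turns a traffic projection into a starting capacity figure. The function names and the example traffic numbers are made up for illustration.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # one read unit covers up to 4 KB per read
WRITE_UNIT_BYTES = 1024      # one write unit covers up to 1 KB per write


def read_capacity_units(reads_per_second, item_bytes, strongly_consistent=True):
    """Estimate the read capacity units needed for a projected read rate."""
    units_per_read = math.ceil(item_bytes / READ_UNIT_BYTES)
    total = reads_per_second * units_per_read
    if not strongly_consistent:
        total = total / 2  # eventually consistent reads cost half
    return math.ceil(total)


def write_capacity_units(writes_per_second, item_bytes):
    """Estimate the write capacity units needed for a projected write rate."""
    units_per_write = math.ceil(item_bytes / WRITE_UNIT_BYTES)
    return math.ceil(writes_per_second * units_per_write)


# Hypothetical projection: 80 reads/s of 6 KB items, 20 writes/s of 2.5 KB items.
print(read_capacity_units(80, 6 * 1024))                             # 160
print(read_capacity_units(80, 6 * 1024, strongly_consistent=False))  # 80
print(write_capacity_units(20, 2560))                                # 60
```

Treat the result as a first guess to verify against real traffic, not as a final configuration.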
Your DynamoDB read and write capacity unit configurations are only soft limits, in the sense that you can burst above them for a short time with no issue. But consume more than your configured capacity for a sustained period and you’ll receive a ProvisionedThroughputExceededException.
This is what’s going to wake you up at night when you first start using DynamoDB and building serverless applications–unless you prepare for it.
How do we prevent, or at least catch and gracefully handle, issues with exceeded read and write capacity in DynamoDB? We’re going to discuss three approaches to making your serverless application more resilient to failure, each involving a different area of the application.
Let’s see how all three will improve the reliability of our application. To do that, we’re going to work with a simple web API consisting of API Gateway, AWS Lambda, and DynamoDB. The service is called aws-api-dynamodb-example and is available on GitHub. We’re going to make changes to our web service to make it more resilient to failure.
Adding retries to your AWS Lambda function is the first place to start. I’m not talking about Lambda’s own invocation retry behavior, but catching failures in your code and retrying operations appropriately. In fact, any code in a cloud or distributed system should be doing this already. (You can’t blindly assume the cloud is reliable!)
Let’s take the following Python code from handlers/put_item.py as an example.
The function named handler() is our Lambda’s entry point, which means when our Lambda is invoked, the handler() function will be called. In that function there is a call to a _put_item() function. The _put_item() function takes the data passed to it, in this case the body of the API Gateway event, and writes it to a DynamoDB table. If the write to DynamoDB fails, then normally the entire function fails. And in our case, when using API Gateway, we’ll end up returning an error to the client. How can we help avoid that?
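As a rough sketch of the shape just described (not the project’s actual file), a handler like this might look as follows. The DDB_TABLE_NAME environment variable, the 201 status code, and the optional table argument (there so the code can be exercised without AWS access) are assumptions made for illustration.

```python
import json
import os
from decimal import Decimal
from functools import lru_cache

# Hypothetical environment variable; the real project may name it differently.
TABLE_ENV_VAR = "DDB_TABLE_NAME"


@lru_cache(maxsize=1)
def _get_table():
    """Create the DynamoDB Table resource once and reuse it across invocations."""
    import boto3  # imported lazily so the module loads without AWS credentials
    return boto3.resource("dynamodb").Table(os.environ[TABLE_ENV_VAR])


def _put_item(item, table=None):
    """Write one item to the table; any failure propagates to the caller."""
    if table is None:
        table = _get_table()
    table.put_item(Item=item)


def handler(event, context, table=None):
    """Lambda entry point: store the API Gateway request body as an item."""
    # boto3's DynamoDB resource rejects Python floats, so parse them as Decimal.
    item = json.loads(event["body"], parse_float=Decimal)
    _put_item(item, table)
    return {"statusCode": 201, "body": event["body"]}
```

Nothing here catches a failed write, so an exception from put_item() fails the whole invocation, which is exactly the situation the rest of this section addresses.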
To start, every AWS SDK implements retries and exponential backoff when working with DynamoDB, which is the suggested best practice from AWS. To put it another way, when using the AWS SDKs, you get this best practice for free. Why mention retries then? Because I didn’t know this initially, and upon asking around I found others did not know this either.
Since we’re using the official AWS SDK for Python, boto3, in our example, by default we’d make 10 attempts (an initial attempt and 9 retries), backing off exponentially starting at 50 milliseconds and doubling with each retry. That adds up to 25.55 seconds of waiting in which the operation can succeed. The default retry behavior is documented here.
Here are a few things to keep in mind. API Gateway has a 29-second request limit, so in our case the defaults are fine. But what happens if we had a time-intensive operation before we tried to write to DynamoDB and can’t afford as many retries? Or what if we weren’t using API Gateway and wanted to let the function run longer to increase the likelihood of success?
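The 25.55 second figure is easy to check by hand. The short sketch below just does the arithmetic, taking the numbers above at face value: a 50 millisecond first delay that doubles before each subsequent retry. The function name is made up, and the time spent on the requests themselves is not counted.

```python
def backoff_schedule_ms(retries, base_ms=50, factor=2):
    """Return the wait before each retry, in milliseconds."""
    return [base_ms * factor ** n for n in range(retries)]


delays = backoff_schedule_ms(9)
print(delays)       # [50, 100, 200, 400, 800, 1600, 3200, 6400, 12800]
print(sum(delays))  # 25550 ms, i.e. 25.55 seconds
```

Notice that the last wait alone is 12.8 seconds, so dropping a single retry roughly halves the worst case.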
Take a look at the snippet of code below.
In our code, we create an AWS configuration object, AWS_CONFIG, which will try up to only 9 times. Then 2 lines below we create a DynamoDB resource object, named dynamodb, with that configuration. This affords us 9 attempts in the span of 12.75 seconds to successfully complete any operation against DynamoDB. If we weren’t using API Gateway and wanted to let the function run longer, then we could increase the number of attempts.
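Here is one way this kind of configuration could be sketched, under a few assumptions. The helper attempts_within_budget() is hypothetical and only counts backoff waits (50 ms, doubling), not the time the requests themselves take. One detail worth knowing: in botocore’s Config, the retries key max_attempts counts retries after the initial call, while total_max_attempts counts the initial call as well, so the sketch uses the latter to express “9 attempts in total”.

```python
from functools import lru_cache


def attempts_within_budget(budget_ms, base_ms=50, factor=2):
    """Largest number of total attempts whose backoff waits fit in budget_ms."""
    attempts = 1          # the initial attempt needs no wait
    waited = 0
    next_delay = base_ms
    while waited + next_delay <= budget_ms:
        waited += next_delay
        next_delay *= factor
        attempts += 1
    return attempts


@lru_cache(maxsize=None)
def dynamodb_resource(total_attempts):
    """Build a DynamoDB resource capped at total_attempts tries per call."""
    import boto3  # imported lazily so the arithmetic above runs without AWS
    from botocore.config import Config
    config = Config(retries={"total_max_attempts": total_attempts})
    return boto3.resource("dynamodb", config=config)


print(attempts_within_budget(13_000))  # 9 attempts (12.75 s of waiting)
print(attempts_within_budget(29_000))  # 10 attempts (25.55 s of waiting)
```

In a Lambda function you would call something like dynamodb_resource(attempts_within_budget(13_000)) once and reuse the result across invocations.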
But wait, there’s more. Requests to AWS APIs are rate limited on a per region basis, and those failed DynamoDB operation retries will count against that limit. For that reason, you may opt to disable boto3’s retry behavior and implement your own with longer intervals between retries. Look at the code below.
In the AWS_CONFIG object we set max_attempts to 0. We then use the tenacity module’s retry decorator and its wait_random_exponential() function. The default behavior of tenacity will significantly reduce the number of retries. This is preferable if you’re worried about hitting API limits.
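To show the idea behind a random exponential wait without depending on any third-party package, here is a standard-library stand-in. It is not tenacity’s implementation, just a sketch of the same technique: each retry sleeps for a random time drawn from a window that doubles with every failure. ThrottledError is a hypothetical placeholder; with boto3 you would instead catch botocore’s ClientError and check that the error code is ProvisionedThroughputExceededException.

```python
import random
import time
from functools import wraps


class ThrottledError(Exception):
    """Hypothetical stand-in for a DynamoDB throttling error."""


def retry_with_jitter(max_attempts=5, base=1.0, cap=60.0,
                      retry_on=(ThrottledError,), sleep=time.sleep):
    """Retry the wrapped function, sleeping a random, growing interval."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the caller see the error
                    # The window doubles each time, up to `cap` seconds.
                    window = min(cap, base * 2 ** (attempt - 1))
                    sleep(random.uniform(0, window))
        return wrapper
    return decorator


@retry_with_jitter(max_attempts=4, base=1.0)
def write_item(table, item):
    """Write one item, retrying on ThrottledError."""
    table.put_item(Item=item)
```

Because the waits start at up to a second rather than 50 milliseconds, a burst of throttled writes produces far fewer API calls than the SDK defaults would.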
Our next spot to look at is DynamoDB’s configuration. I’ve mentioned that DynamoDB is highly scalable and, as a managed service, AWS handles the scaling behind the scenes for you based on your read and write capacity units. But not only is DynamoDB scalable, it’s autoscalable too!
When you need to scale DynamoDB to handle read or write load, you don’t need to change capacity by hand in the AWS console or, better yet, in your automation tool’s configuration, such as AWS CloudFormation, AWS SAM CLI, or Serverless Framework. Instead, you can configure autoscaling to increase or decrease capacity according to load.
The minimum and maximum provisioned capacity for read and write capacity will determine the range of available capacity for the table. If this is a new DynamoDB table, start with generous amounts and tune them after observing the activity pattern on the table for a while.
Your target utilization should reflect the size of the activity spikes your table sees. That means you can increase the target utilization if your activity spikes are gradual, and you should decrease it if they are more sudden. It takes time for DynamoDB scaling operations to complete, so the earlier a scale-up is triggered, the lower the likelihood of throttling.
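A little arithmetic shows why a lower target buys more room for spikes. Target tracking tries to keep consumed capacity divided by provisioned capacity near the target percentage, within the minimum and maximum you set. The sketch below is a simplification that ignores scaling delays and cooldowns, and the function name and numbers are illustrative only.

```python
def provisioned_for_target(consumed_units, target_percent, minimum, maximum):
    """Capacity that would put utilization at target_percent, within bounds."""
    # Integer ceiling of consumed_units * 100 / target_percent.
    wanted = -(-consumed_units * 100 // target_percent)
    return max(minimum, min(maximum, wanted))


# A table steadily consuming 350 write units, with bounds of 5 and 1000:
print(provisioned_for_target(350, 70, 5, 1000))  # 500, leaving 150 units spare
print(provisioned_for_target(350, 50, 5, 1000))  # 700, leaving 350 units spare
```

At a 50% target the table can absorb a doubling of traffic before it is throttled, at the price of paying for capacity that sits idle most of the time.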
If you’re using AWS CloudFormation or AWS SAM CLI, then have a look at this piece on how to use CloudFormation for DynamoDB autoscaling. If you’re using Serverless Framework to manage your serverless application, then take a look at the serverless-dynamodb-autoscaling plugin. This makes setting up autoscaling for DynamoDB incredibly easy. Take a look at GitHub to see how we configured our project for DynamoDB autoscaling using serverless-dynamodb-autoscaling.
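Whichever tool you use, it ends up setting the same handful of values through the Application Auto Scaling service: a scalable target (minimum and maximum capacity) and a target tracking policy (the target utilization). As a hedged illustration of what those values look like, here is a sketch using boto3 directly. The table name, policy name, and numbers are invented, and this is not how the example project is configured.

```python
def autoscaling_requests(table_name, dimension, min_units, max_units,
                         target_percent):
    """Build the two Application Auto Scaling requests for one table.

    dimension is "Read" or "Write".
    """
    common = {
        "ServiceNamespace": "dynamodb",
        "ResourceId": f"table/{table_name}",
        "ScalableDimension": f"dynamodb:table:{dimension}CapacityUnits",
    }
    target = dict(common, MinCapacity=min_units, MaxCapacity=max_units)
    policy = dict(
        common,
        PolicyName=f"{table_name}-{dimension.lower()}-target-tracking",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": float(target_percent),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": f"DynamoDB{dimension}CapacityUtilization",
            },
        },
    )
    return target, policy


def apply_autoscaling(target, policy):
    """Send both requests; requires AWS credentials and permissions."""
    import boto3  # imported lazily so the builder above runs without AWS
    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)


target, policy = autoscaling_requests("example-table", "Write", 5, 100, 70)
```

In practice you would keep this in your CloudFormation or Serverless Framework configuration rather than in application code, so the settings are versioned with the rest of the stack.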
Keep in mind that provisioning new capacity takes time. For periodic spikes of reads or writes, DynamoDB’s burst capacity should be fine. If you expect to or are seeing extended spikes in operations on your table, then you need to tune your target utilization appropriately.
Because DynamoDB scaling can be slow and not fit for all load patterns, let’s discuss a potential architecture change. An SQS queue is a good way to control the rate of writes to your table. But this comes at the cost of increased system complexity and potentially other changes, such as API changes to your application.
The beauty of SQS queues is you can control the rate of DynamoDB writes by controlling the number of queue consumers fetching messages and attempting to write to the table. Additionally, if the table write fails, the message will be placed back on the queue for processing.
You can control the number of consumers by controlling both the Lambda function concurrency (how many simultaneous invocations of a function can occur at once) and the SQS event source, batchSize, which determines how many messages a queue consumer will process at once.
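A queue consumer along these lines could be sketched as below. This is an illustration rather than the code in the example project: the DDB_TABLE_NAME environment variable, the optional table argument (for exercising the code without AWS), and the max_write_rate() helper are all assumptions. The handler lets a failed write raise, so Lambda reports the batch as failed and SQS makes its messages visible again for another try.

```python
import json
import os
from decimal import Decimal
from functools import lru_cache


@lru_cache(maxsize=1)
def _get_table():
    """Create the DynamoDB Table resource once per Lambda container."""
    import boto3  # imported lazily so the module loads without AWS access
    return boto3.resource("dynamodb").Table(os.environ["DDB_TABLE_NAME"])


def consumer_handler(event, context, table=None):
    """Write every message in an SQS batch to DynamoDB."""
    if table is None:
        table = _get_table()
    written = 0
    for record in event["Records"]:
        item = json.loads(record["body"], parse_float=Decimal)
        table.put_item(Item=item)  # an exception here fails the whole batch
        written += 1
    return {"written": written}


def max_write_rate(concurrency, batch_size, seconds_per_batch):
    """Rough upper bound on table writes per second from the consumers."""
    return concurrency * batch_size / seconds_per_batch


# Two concurrent consumers, 10 messages each, about half a second per batch:
print(max_write_rate(2, 10, 0.5))  # 40.0 writes per second at most
```

Because a failed batch is redelivered in full, some items may be written twice. With put_item() that simply overwrites the same key, but it is worth keeping in mind if your writes are not idempotent.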
Take a look at a branch of aws-api-dynamodb-example where we’ve implemented an SQS queue.
In the event you decide to go down this path, you may need to make some additional application changes. If you’re adding a queue to a batch processing pipeline that writes to DynamoDB, then you should be fine. But what about our case, a RESTful API? When a request is made to our service to write data to DynamoDB, we can’t guarantee the operation was completed or successful. All we can do is inform the client that the request was queued. The client will have to make additional requests to the service to get the status of the write operation.
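On the API side, that change might look something like the sketch below: the handler puts the request body on the queue and answers with 202 Accepted instead of reporting a completed write. The WRITE_QUEUE_URL environment variable, the response shape, and the optional sqs and queue_url arguments are assumptions for illustration, and the status endpoint the client would poll is not shown.

```python
import json
import os
from functools import lru_cache

# Hypothetical environment variable holding the queue's URL.
QUEUE_ENV_VAR = "WRITE_QUEUE_URL"


@lru_cache(maxsize=1)
def _get_sqs():
    """Create the SQS client once per Lambda container."""
    import boto3  # imported lazily so the module loads without AWS access
    return boto3.client("sqs")


def enqueue_handler(event, context, sqs=None, queue_url=None):
    """Queue the write and tell the client it was accepted, not completed."""
    if sqs is None:
        sqs = _get_sqs()
    if queue_url is None:
        queue_url = os.environ[QUEUE_ENV_VAR]
    result = sqs.send_message(QueueUrl=queue_url, MessageBody=event["body"])
    body = {"status": "queued", "requestId": result["MessageId"]}
    return {"statusCode": 202, "body": json.dumps(body)}
```

Returning the SQS message ID gives the client something to refer to later, though you would still need to record it alongside the item for a status lookup to work.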
When you’re starting to build serverless applications you’ll quickly find yourself working with DynamoDB. Sooner or later you’ll reach read or write throughput capacity issues. Following these steps will help reduce the likelihood that VictorOps will interrupt your peaceful night’s sleep because your application failed. Come back another time to learn how to handle other issues such as cloud stability and performance issues.
Tom McLaughlin is the founder of ServerlessOps, a DevOps transformation and AWS cloud adoption advisory company with a specialty focus on serverless. He’s a cloud infrastructure engineer by background and writes regularly on serverless operations.