- Wasn’t this code tested?
- Yes, it was, but…
- Well, going forward, we need to focus on quality
<crickets>
I think every engineering team has had a conversation similar to this.
It inevitably happens on the journey of continuous delivery. It's the definitive end of the DevOps honeymoon, where speed is put head-to-head with quality. Once this happens, you risk resetting progress and slowing down development.
But in the spirit of fearless deployments, we want to introduce a healthy and popular framework for dealing with the tradeoff of speed and reliability - namely, error budgets.
Before we get into error budgets, let's look at two pitfalls developers tend to fall into when they either slow down or speed up deployments.
We know that most outages and customer-impacting failures come from change.
It's natural to see change as the root cause of failure, and to some degree, it is. So, instinctively, you might try to minimize change as a way to manage the risk.
But if you slow down deployments every time you hit a bump, which you inevitably will in the world of software, you'll end up in a vicious loop of accumulating change. At some point, you'll need to change your software to keep it running optimally, but by then the changes will be larger than before and more likely to cause failure.
A better strategy: make smaller changes more frequently. Unfortunately, this will only take you so far if you don't have a framework for deciding when to make these changes.
Instead of slow deployments, you might end up having the opposite problem. In an effort to get rid of all of your application's known issues, you might try to push new code and improvements too quickly.
Unfortunately, this approach doesn't work. Just as you resolve one error, another is bound to pop up, and more will follow if you rush through the debugging process.
As you can see, blindly moving forward and increasing deployments is no guarantee of eventual stability. In fact, speeding through software issues is a recipe for continuous failures, a rise in technical debt, and a loss of customer trust.
According to Atlassian,
“An error budget is the maximum amount of time that a technical system can fail without contractual consequences. ... If your SLA promises 99.95% uptime, your error budget is four hours, 22 minutes, and 48 seconds. And with an SLA promise of 99.9% uptime, your error budget is eight hours, 46 minutes, and 12 seconds.”
In other words, if your target is 99.9% availability, your error budget is the remaining 0.1%. Like any budget, it's meant to be used. Spend that 0.1% on calculated risks that increase velocity and ensure future reliability.
Essentially, if you have budget left, you have no reason to fear deployment.
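Let's say an application has a 99.9% availability requirement. The remaining 0.1% translates into the time the service can be unavailable: about 1.5 minutes per day, 10 minutes per week, 43 minutes per 30-day month, or roughly 8 hours 46 minutes per year.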
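To see how an availability target converts into allowed downtime, here is a minimal Python sketch. The function name `allowed_downtime` and the window lengths (a 30-day month, a 365-day year) are assumptions made for illustration; your SLA may define its measurement window differently.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, window: timedelta) -> timedelta:
    """Return the downtime an availability target permits within `window`."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The error budget is whatever the target leaves over, e.g. 0.1% for 99.9%.
    budget_fraction = (100 - availability_pct) / 100
    return window * budget_fraction


# Assumed window lengths; adjust them to match how your SLA is measured.
WINDOWS = {
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "30-day month": timedelta(days=30),
    "365-day year": timedelta(days=365),
}

for label, window in WINDOWS.items():
    print(f"99.9% over one {label}: {allowed_downtime(99.9, window)}")
```

Calling the same function with 99.95 and the 365-day window gives a yearly budget of four hours, 22 minutes, and 48 seconds, matching the first figure in the quote above.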
If the team has budget left, they can:
On the other hand, if you find that you’re consistently running out of your error budget, then:
Once the budget is back, it's important to get back on the horse again and resume fearless deployments.
In the next section, we’ll go over how to set up error budget metrics that are right for your application.
To set error budgets, you first need to define the service level agreements (SLAs) and trickle them down to service level objectives (SLOs) and service level indicators (SLIs).
Service level agreements, also known as SLAs, are the promises your company makes to customers about the quality and availability of a service. They also serve as guidelines on how to respond to failures.
The drawback: SLAs are usually not very actionable on a technical level.
That's where service level objectives (SLOs) come in.
SLOs are technical and measurable metrics that a team has selected to ensure they reach their SLA.
The first step in creating an SLO is to determine what constitutes "availability" for the service. The "availability" of a service is not just about when a service is available, but what it takes to complete its tasks satisfactorily.
For example, a service that receives and stores messages in a database might have SLOs that look something like this: 99.9% of messages are received and stored with HTTP status 200, and 95% of messages are processed within 1000 ms.
Together, these two SLOs ensure that most messages are received and stored in a timely manner. If the service conforms to these SLOs, the company is in good shape and can continue to chip away at delivering improvements and value.
When creating SLOs, you need to develop metrics known as Service Level Indicators (SLIs).
Measure and monitor the current value of your SLOs using SLIs.
For example, if your SLO is that "99.9% of messages are received and stored with status 200" in a month, and you receive 1,000,000 messages, that means you can drop 1,000 messages. However, dropping more than 1,000 messages means that you are out of budget.
Here is an example of an SLO/SLI report:
Microservice 1
January 3 - January 10
SLO | SLI for Jan 3 - Jan 10 | Status |
99.9% of messages are received and stored with HTTP status 200 | 98.1% (981,000 / 1,000,000 messages successfully received with HTTP status 200) | 1,900% of budget consumed (19,000 failures against a budget of 1,000) |
95% of messages are processed within 1000 ms. | 100% (981,000 / 981,000 are processed within 1000 ms) | 0% of budget consumed |
Current status | Out of budget |
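A report like this can be derived from raw event counts. The sketch below is one possible way to do it in Python, using the counts from the report above; the `SloMeasurement` class and `overall_status` function are hypothetical names, not part of any monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloMeasurement:
    description: str
    target_pct: float   # the SLO, e.g. 99.9
    good_events: int    # events that met the objective
    total_events: int   # all events in the reporting window

    def sli_pct(self) -> float:
        """Measured share of good events (the SLI), in percent."""
        if self.total_events == 0:
            return 100.0
        return 100 * self.good_events / self.total_events

    def budget_consumed_pct(self) -> float:
        """Bad events as a percentage of the bad events the SLO allows."""
        bad_events = self.total_events - self.good_events
        allowed_bad = self.total_events * (100 - self.target_pct) / 100
        if allowed_bad == 0:
            return 0.0 if bad_events == 0 else float("inf")
        return 100 * bad_events / allowed_bad


def overall_status(measurements) -> str:
    """The service is out of budget if any single SLO overspent its budget."""
    overspent = any(m.budget_consumed_pct() > 100 for m in measurements)
    return "Out of budget" if overspent else "Within budget"


REPORT = [
    SloMeasurement("Received and stored with HTTP status 200", 99.9, 981_000, 1_000_000),
    SloMeasurement("Processed within 1000 ms", 95.0, 981_000, 981_000),
]

for m in REPORT:
    print(f"{m.description}: SLI {m.sli_pct():.1f}%, "
          f"{m.budget_consumed_pct():.0f}% of budget consumed")
print("Current status:", overall_status(REPORT))
```

Expressing consumption as a percentage of the allowed failures, rather than as a raw failure count, keeps the number comparable across SLOs with different targets and traffic volumes.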
By using performance monitoring solutions, it's possible to graph these metrics and set up alerts.
Now that you know how to create metrics for an error budget, let's dive into how to use these budgets.
You might think a large unspent error budget at the end of the month is a good thing, but it isn't. Leftover budget at the end of the month means your team is not moving fast enough. In cases like these, your team should increase deployments.
If you exhaust your error budget within the month (for example, you drop 5% of a month's messages), you need to discuss this with your team. Violations such as these are serious and could have future consequences. Next steps might include writing post mortems, informing your customers, and freezing features until your fixes are deployed and verified.
If you find that you're consistently using all of your error budget, you should look into making systematic improvements to how you and your team work. You might even consider an SRE team. They have the know-how to make systematic improvements to build and deployment pipelines.
If you end the month having used most of your budget without exceeding it, congratulations! You are balancing speed and availability! Keep up the good work.
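The outcomes described above can be captured as a small decision rule. This is only a sketch: the function name `recommend`, the 50% "low usage" threshold, and the three-month window for "consistently" are all assumptions chosen for illustration, and each team should pick its own values.

```python
from typing import Sequence


def recommend(history: Sequence[float],
              low_usage_pct: float = 50.0,
              repeat_months: int = 3) -> str:
    """Suggest a next step from monthly budget consumption (percent, oldest first).

    Both thresholds are illustrative assumptions, not established standards.
    """
    if not history:
        raise ValueError("history must contain at least one month")

    latest = history[-1]
    recent = history[-repeat_months:]

    # Budget exhausted several months in a row: look at how the team works.
    if len(recent) == repeat_months and all(m >= 100 for m in recent):
        return "make systematic improvements"
    # Budget exhausted this month: stop shipping features until fixes land.
    if latest >= 100:
        return "freeze features and fix"
    # Lots of budget left over: the team can afford to take more risks.
    if latest < low_usage_pct:
        return "increase deployments"
    return "on track"


print(recommend([20]))             # plenty of budget left
print(recommend([80]))             # budget mostly used, not exceeded
print(recommend([80, 150]))        # overspent this month
print(recommend([120, 130, 150]))  # overspent three months running
```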
Error budgets are an effective way to help teams understand when they should speed up deployments and take risks and when they should slow down.
And it's great for customers too: they get exactly what they are paying for, namely high availability and continuous improvements.
If you want to learn more about error budgets and practices for balancing speed and reliability, check out these SRE books.
Another great tool you can use along with error budgets: Airbrake Error Monitoring and Performance Monitoring. Our product gives you all the tools you need to find and fix bugs in your code quickly before they have a chance to impact your customers. Discover the power of Airbrake today with a free 14-day trial.