O11Y: Observability Best Practices for Developers

Jul 28, 2021 10:22:12 AM

When devs have poor visibility into production, errors are bound to happen. That's where these o11y best practices come in.

Lacking observability, or "o11y," into your application can be stressful, and it definitely contributes to a fear of deployments.

Yet, we hear that lots of developers have poor visibility into production. With poor visibility, developers are unable to see: 

  • Where code is running
  • Errors within code
  • What users are experiencing 

… unless they go through tickets, permissions, and dashboards that don't tell the right story. 

But it doesn't have to be like that.

This article shares some o11y best practices that dev teams can apply to improve visibility into their code.

O11Y Best Practices

Monitor Increments of Code, Not Computers

Conventional monitoring tools tend to present issues and performance through the lens of infrastructure resources, not code, which severely limits developers' visibility into their application.

So, the best thing devs can do to improve observability (o11y) into their product is to use a code-centric monitoring tool. 

Here are a few simple questions to gauge whether or not your monitoring tool is code-centric:

  • Is a software release a primary entity within the monitoring tool? In other words, can you see how this release performs? 
  • Will the tool allow you to see the actual offending code? When you have an error, does it show you the code snippet where that error exists?  
  • Can you immediately start using this tool on your own? Or does it require training and configuration before you can get going?
  • Does the tool show you the urgency of the issue from the user's perspective, i.e., how users are experiencing the issue? Without this, it will be difficult for you to know how to prioritize errors within code.
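
To make the first and last questions concrete, here is a minimal Python sketch of what treating a release as a primary entity could look like. It is not how any particular monitoring tool works: the ErrorEvent record, its fields, and the release tags are all hypothetical, chosen only to show errors grouped by release and measured by how many users they reached.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical error record; every field name is an assumption."""
    release: str     # release tag, e.g. a version string or commit SHA
    error_type: str  # exception class name
    file: str        # source file where the error was raised
    line: int        # line number of the offending code
    user_id: str     # user who experienced the error


def errors_per_release(events):
    """Count error events for each release."""
    return Counter(event.release for event in events)


def affected_users(events, release):
    """Return the distinct users who hit an error in the given release."""
    return {event.user_id for event in events if event.release == release}


# Made-up sample data for illustration only.
events = [
    ErrorEvent("v1.4.0", "KeyError", "checkout.py", 88, "u1"),
    ErrorEvent("v1.4.1", "TimeoutError", "payments.py", 42, "u2"),
    ErrorEvent("v1.4.1", "TimeoutError", "payments.py", 42, "u3"),
    ErrorEvent("v1.4.1", "KeyError", "checkout.py", 88, "u2"),
]

counts = errors_per_release(events)
users_hit = affected_users(events, "v1.4.1")
```

Grouping by release answers "how is this release performing?", while the distinct-user count gives a rough sense of urgency from the user's perspective.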

A code-centric tool is just one way to improve developer visibility into a product. Insight into the user experience is also critical. 

Receive Alerts When the User Experience Is Bad

We understand that the last thing you want from a monitoring tool is a ton of alerts. Not only can they be disruptive, but they also require a lot of configuration. Still, there are a couple of important alerts you need to be aware of for the sake of the user experience.

You don't need alerts about every little thing when it comes to the user experience. Instead, focus on what is business-critical and nothing else. With this in mind, pick one or two metrics that your users deeply care about and alert on those. Here are a couple of examples:

  • 95th percentile latency is > 1s
  • Request failure rate is > 0.1%

That's it. Numbers such as these tell you that users cannot use your service. Take it a step further and automatically page on-call developers whenever your application breaches one of these business-critical thresholds.
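
The two example thresholds above can be checked with very little code. The sketch below is one possible approach in Python, assuming you already collect request latencies (in seconds) and failure counts for some time window. The function names are hypothetical, and the percentile uses the simple nearest-rank method, which may differ slightly from what your monitoring tool computes.

```python
import math

LATENCY_P95_LIMIT_S = 1.0   # alert when 95th percentile latency is > 1s
FAILURE_RATE_LIMIT = 0.001  # alert when request failure rate is > 0.1%


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_alerts(latencies_s, failed, total):
    """Return a list of alert messages for one time window."""
    alerts = []
    if latencies_s and percentile(latencies_s, 95) > LATENCY_P95_LIMIT_S:
        alerts.append("p95 latency above 1s")
    if total and failed / total > FAILURE_RATE_LIMIT:
        alerts.append("failure rate above 0.1%")
    return alerts
```

A non-empty result from check_alerts is the signal that would be wired to whatever paging mechanism your team uses. That wiring is deliberately left out here.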

This proactive o11y best practice will help minimize bad user experiences. Once you have your metrics in place, it’s time to get proactive about errors.  

Don't Leave the System to Rot; Proactively Weed Out Errors

It's essential to complement alerts critical to the user experience with a healthy curiosity about anomalies and errors that are not yet critical.

By paying attention to non-critical errors during your work hours, you'll reduce the risk of a severe incident. A great way to see non-critical errors is with an error-monitoring tool.

This is what o11y is about: understanding the system from its internal signals so that you have an informed mental model of how it works, not some fantasy that looks great on a chart.
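
A routine review of non-critical errors can start from something as simple as a frequency ranking. The Python sketch below assumes each error is a dictionary with hypothetical "type", "location", and "severity" keys. An error-monitoring tool would normally do this grouping for you, so treat this as an illustration of the idea rather than a replacement for one.

```python
from collections import Counter


def top_noncritical(errors, limit=3):
    """Rank non-critical errors by how often each (type, location) occurred."""
    counts = Counter(
        (error["type"], error["location"])
        for error in errors
        if error["severity"] != "critical"
    )
    return counts.most_common(limit)


# Made-up sample data for illustration only.
errors = [
    {"type": "TimeoutError", "location": "payments.py:42", "severity": "warning"},
    {"type": "KeyError", "location": "checkout.py:88", "severity": "warning"},
    {"type": "TimeoutError", "location": "payments.py:42", "severity": "warning"},
    {"type": "MemoryError", "location": "worker.py:7", "severity": "critical"},
]

ranking = top_noncritical(errors)
```

The most frequent entries are good candidates to investigate during work hours, before they grow into an incident.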

Have Supporting Data for Deep Troubleshooting

When an incident occurs, you need data, especially if it spans different teams, systems, and services. 

Severe outages are black swans: they do not happen in the ways we expect them to.

Here are a couple of things you will likely need to fix a completely unknown situation: 

  • Turn on verbose logging and query it to understand an issue
  • See interaction patterns between services
  • Correlate with infrastructure behavior to understand IO errors  
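
As an illustration of the first item, the sketch below uses Python's standard logging module to raise verbosity at runtime and then query the captured records. The in-memory ListHandler, the "checkout" logger name, and the log messages are all hypothetical. In practice the records would go to a centralized log store rather than a Python list.

```python
import logging


class ListHandler(logging.Handler):
    """Keeps log records in memory so they can be filtered later."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger("checkout")  # hypothetical service name
logger.propagate = False                # keep the example's output self-contained
handler = ListHandler()
logger.addHandler(handler)

# Normal operation: only warnings and above are recorded.
logger.setLevel(logging.WARNING)
logger.debug("cache miss for cart 17")  # dropped at WARNING level

# During an incident: turn on verbose logging.
logger.setLevel(logging.DEBUG)
logger.debug("cache miss for cart 18")
logger.warning("retrying payment call")


def query(records, text):
    """Return the messages of all records that contain the given text."""
    return [r.getMessage() for r in records if text in r.getMessage()]


matches = query(handler.records, "cache miss")
```

Only the debug message emitted after the level change is captured, which is why verbose logging usually has to be switched on before an issue can be reproduced and queried.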

Many of these use cases are rare for a single team but relatively frequent for an entire organization. Satisfy them with a centralized platform designed with these cases in mind.

Improve O11Y With Airbrake

Airbrake Error Monitoring and Performance Monitoring embody these o11y principles for developer-centric monitoring. In as little as three minutes, you'll have access to an error and performance monitoring tool that provides in-depth background information on errors within your code and how they impact your users. See for yourself with a free 14-day trial.

Written By: Alexandra Lindenmuth