In its latest annual letter, Stripe emphasized the importance of its services' reliability. During the highest-load period, from Black Friday to Cyber Monday, Stripe maintained its target of 99.999% uptime while processing 300 million transactions worldwide. Wow!
But what does it mean for our services to be reliable?
SRE (Site Reliability Engineering) is a distinct branch of engineering practices, often a full-time role for multiple engineers or entire SRE teams. These teams monitor applications, track health indicators, and set reliability goals, among other things. This approach is common in large companies like Google, Netflix, and Amazon and is also gaining traction in fast-growing startup organizations.
However, many organizations lack mature observability practices and measure reliability only indirectly, through the number of issues in ticketing systems or customer requests escalated by CS/Ops agents.
Today, I will explore how to bring reliability to life, set your SLOs and SLIs, understand error budgets, and more. This is not a comprehensive guide to SLA/SLO/SLI practices, as there are entire books, online publications, and workshops about SRE (e.g., visit Google's Site Reliability Engineering).
Instead, I want to share some ideas on how to kick-start these practices pragmatically so you can later iterate on these foundations.
What is Reliability?
Historically, reliability measures were mainly used by telecoms through SLAs (Service-Level Agreements) as quality guarantees. For example, for a certain price, the provider guarantees either a minimum Internet connection bandwidth or a certain uptime. If SLAs are not met, the customer can get a refund or a discount on the services.
Wikipedia says, "Reliability describes the ability of a system or component to function under stated conditions for a specified period." I prefer a simpler version: reliability is a set of measures describing whether our service (e.g., app, feature, API) can be used by customers and meets their expectations.
If Stripe aims for 99.999% uptime, it means, for example (see the quick check after this list):
Out of 300 million transactions, up to 3000 could fail unexpectedly
Stripe's APIs could be degraded for a total of ~24 seconds in a 28-day window (unavailable, returning unexpected responses, or responding slower than a given latency threshold).
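A quick back-of-the-envelope check of those numbers in Python (assuming a 28-day window and that the same five-nines target applies to both transactions and uptime):

TARGET = 0.99999
transactions = 300_000_000
window_seconds = 28 * 24 * 60 * 60  # 2,419,200 seconds in a 28-day window

allowed_failures = transactions * (1 - TARGET)    # ~3,000 transactions
allowed_downtime = window_seconds * (1 - TARGET)  # ~24.2 seconds

print(f"Allowed failed transactions: {allowed_failures:,.0f}")
print(f"Allowed downtime: {allowed_downtime:.1f} seconds")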
Targets for reliability usually define what should be enough to keep customers satisfied.
Reliable Enough
So, where do all those “nines” come from?
Due to complexity and real-world dynamics, achieving 100% reliability is impossible. Making a service more reliable requires extra commitment from engineering teams (refactoring, re-architecting, bug fixing). Sometimes, you have to add redundancy to the system (e.g., fallback to backup vendors, more failover infrastructure - more instances, DB replicas, etc.).
Achieving 100% uptime means no time for maintenance, so you need to develop solutions that keep the system alive even during maintenance, such as while performing a DB migration. Not to mention fighting entropy, a force of physics that affects both organizations and tech stacks.
All of this comes at an extra cost.
A rule of thumb says that every additional nine in reliability level costs 10x more for 10x less benefit.
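To make the "10x less benefit" part concrete, here is a minimal Python sketch of how much downtime each extra nine leaves you in a 28-day window; the allowance shrinks by 10x with every nine, while the engineering effort to get there keeps growing:

window_minutes = 28 * 24 * 60  # 40,320 minutes in a 28-day window

for target in (0.99, 0.999, 0.9999, 0.99999):
    allowed = window_minutes * (1 - target)
    print(f"{target:.3%} uptime -> {allowed:.1f} minutes of downtime allowed")

# 99%: ~403 min, 99.9%: ~40 min, 99.99%: ~4 min, 99.999%: ~0.4 min (~24 s)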
Error Budget
That's why we must clearly define an acceptable level of unreliability. This is called an error budget, an agreement that helps decide when to invest in a service's reliability and when to focus on something else (delivering new features, operational work, experimentation, etc.).
For example, if your reliability level is 99.9%, it means that (depending on the case):
1 out of 1000 user sessions can crash the app
The total outage duration over a 28-day window can be up to 40 minutes and 19 seconds
1 out of 1000 requests can fail, timeout, or return unexpected responses
Once you have your error budget, the rule of thumb is as follows (see the sketch after this list):
If you have enough budget to spare, move faster or experiment more,
If you are nearing the budget line, pay attention to quality or fix some outstanding bugs,
If you are below the line, prioritize reliability work over product velocity.
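Here is a minimal Python sketch of that rule of thumb; the function name, thresholds, and sample numbers are illustrative, not a standard:

# Illustrative error-budget check: compare failures so far against the budget
# implied by the SLO for the current window.
def error_budget_status(slo_target: float, total_events: int, bad_events: int) -> str:
    budget = total_events * (1 - slo_target)  # events allowed to fail in this window
    consumed = bad_events / budget if budget else float("inf")
    if consumed < 0.75:
        return "budget to spare: move faster, experiment more"
    if consumed <= 1.0:
        return "nearing the line: pay attention to quality, fix outstanding bugs"
    return "over budget: prioritize reliability work over product velocity"

# Example: 99.9% SLO, 2,000,000 requests this window, 1,500 of them failed.
print(error_budget_status(0.999, 2_000_000, 1_500))  # nearing the line (75% of budget consumed)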

SLA, SLO, SLI
To organize your reliability targets, keep these three terms in mind (a short example follows these definitions):
SLI (Service Level Indicator) - a metric that measures a service's reliability. Multiple such measures can exist for a single service, e.g., availability, quality, latency, throughput, etc.
SLO (Service Level Objective) - your target for SLIs, e.g., 99.9% crash-free sessions, 99.99% uptime, 90% of requests under 500ms. Targets are often defined with specific measurement windows (day, week, four weeks, a quarter).
SLA (Service Level Agreement) - similar to an SLO, but external. An SLA is a promise to customers about the minimum level of service below which there are consequences (e.g., discount, refund, etc.).
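For example, the "90% of requests under 500ms" target can be checked as a latency SLI against its SLO in a few lines of Python (the sample latencies below are made up):

# SLI: share of requests served under 500 ms; SLO: at least 90% of them.
latencies_ms = [120, 340, 95, 780, 210, 460, 1500, 330, 280, 400]  # one sample window

good = sum(1 for latency in latencies_ms if latency < 500)
sli = good / len(latencies_ms)  # 0.8 -> 80% of requests were "fast enough"

SLO = 0.90
print(f"SLI = {sli:.1%}, SLO met: {sli >= SLO}")  # SLI = 80.0%, SLO met: False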
There are dedicated workshops and materials where you can practice picking good SLIs and SLOs, and this process can be very nuanced.
I personally recommend Google's deck, The Art of SLOs, if you want to explore this process in detail. However, if you are at the beginning of implementing reliability practices, I recommend doing something simpler.
Set the Baseline
Start by identifying the critical services and apps that directly impact your customers. These are the services where reliability and performance are most crucial.
Then follow these action steps:
List all the services your team is responsible for.
Prioritize them based on customer impact.
Select the top 3-5 services to focus on initially.
For each critical service or app, define the SLIs. These should be specific and measurable metrics that reflect the service’s performance and reliability. Don't overcomplicate this at this point.
Use off-the-shelf solutions, e.g.:
Measure mobile app stability with Firebase Crashlytics.
Measure web app stability and performance with Bugsnag, Sentry, or Instabug.
Monitor your backend services with cloud provider solutions (e.g., Google Cloud Observability) or separate tools like Grafana, New Relic, DataDog, Coralogix.
Pick the simplest SLIs, like crash-free users or sessions, request latency, and the rate of requests failing with 5xx errors. Before setting ambitious SLOs, see how far you are from a baseline such as 99.0% or 99.9%.
Involve Product and Business
Once you have your SLIs and SLOs, ensure they are known by Ops, Product, Business, and the rest of the organization. Reliability is not just a technical problem - it influences customers’ trust.
Additionally, having access to empirical data will help you:
Discuss priorities for building new features vs. addressing reliability needs.
Track quality improvement or degradation over time for your services and apps (this can influence the need for better testing practices, for example).
Build a common language with Ops teams to clarify what level of reliability you guarantee vs. where the error budget allows you to defer some immediate bug fixes.
Measuring SLIs
There are several ways to measure SLIs, each with its own pros and cons and suited to specific cases.
These methods are:
Application-level Metrics - This involves monitoring the performance and behavior of the application itself. Exporting metrics directly from the code is usually fast and simple. However, the application cannot see requests that don't reach it, so coverage is partial. It's also more challenging to measure complex user journeys involving multiple requests.
Example tools: Micrometer, Prometheus, Grafana
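As an illustration, here is a minimal application-level sketch using the Python prometheus_client library (the metric names and handler are made up); the exported counter and histogram can later be queried in Prometheus or Grafana as availability and latency SLIs:

# Minimal application-level metrics with prometheus_client (Python).
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("checkout_requests_total", "All checkout requests", ["status"])
LATENCY = Histogram("checkout_request_seconds", "Checkout request latency")

def handle_checkout():
    with LATENCY.time():                       # record latency for the latency SLI
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
        status = "500" if random.random() < 0.001 else "200"
    REQUESTS.labels(status=status).inc()       # 5xx vs. total feeds the availability SLI

if __name__ == "__main__":
    start_http_server(8000)                    # metrics exposed at :8000/metrics
    while True:
        handle_checkout()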
Logs Processing - This involves analyzing logs generated by applications and infrastructure to extract meaningful metrics. This approach can help with retroactively backfilling SLI metrics and can simplify processing complex user flows.
Example tools: ELK Stack (Elasticsearch, Logstash, Kibana)
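A minimal log-processing sketch in pure Python, assuming access logs in the common/combined format with the HTTP status code as the ninth whitespace-separated field, which backfills an availability SLI from logs you already have:

# Backfill an availability SLI from an access log: good requests / all requests.
# Assumes one request per line with the status code in the 9th field (common log
# format); adjust the index for your own log layout.
def availability_from_log(path: str) -> float:
    total = good = 0
    with open(path) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 9 or not fields[8].isdigit():
                continue                      # skip malformed lines
            total += 1
            if int(fields[8]) < 500:          # anything below 5xx counts as "good"
                good += 1
    return good / total if total else 1.0

print(f"{availability_from_log('access.log'):.3%}")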
Infrastructure Metrics - These metrics focus on the performance of the infrastructure components that support the application, such as servers, databases, or network devices. In many cases, these metrics are already available in load-balancing infrastructure, requiring the least engineering effort to start measuring SLIs.
Synthetic Clients/Data - This method involves using simulated users to monitor the performance and availability of services. These synthetic transactions can help detect issues before they impact real users. SLIs can cover the full request path, including requests that never reach the service. However, building such tests is costly, it is hard to cover all corner cases, and frequent probing is required for better accuracy.
Example tools: End-to-end tests with extra measurements, Uptrends, DataDog Synthetic Testing
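And a minimal synthetic-probe sketch in Python (the endpoint URL, latency threshold, and probing interval are placeholders), which periodically hits an endpoint and records whether each probe counts as "good" for the SLI:

# Minimal synthetic probe: hit an endpoint every minute and record good/bad results.
import time
import urllib.request
from urllib.error import URLError

PROBE_URL = "https://api.example.com/health"   # placeholder endpoint
LATENCY_BUDGET_S = 0.5                         # placeholder "fast enough" threshold

def probe() -> bool:
    started = time.monotonic()
    try:
        with urllib.request.urlopen(PROBE_URL, timeout=5) as response:
            ok = response.status == 200
    except URLError:
        ok = False
    return ok and (time.monotonic() - started) <= LATENCY_BUDGET_S

if __name__ == "__main__":
    while True:
        print("good" if probe() else "bad")    # feed these results into your SLI store
        time.sleep(60)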
Client-side Instrumentation - This involves embedding monitoring code within the application itself to collect data on how the application is being used from the user's perspective. This is the most accurate measure of user experience, as it can also cover third-party integrations. However, these SLIs often include factors outside our control (e.g., the user's device or network), and user consent is usually required for data collection.
Example tools: Firebase Crashlytics, analytics tools, Bugsnag
Below, you will find more content for paid subscribers (my personal experience, ways of improving reliability, FigJam and Notion templates for defining SLIs and SLOs)