It’s been almost five years since we launched Uncapped, and we have never had an outage. However, good things can’t last forever, and last week, for the first time, we experienced a technical issue preventing customers from registering.
As we strive for complete transparency and a world-class engineering culture, we want to share what has happened and what we learned openly.
Architecture overview
Our system is based on a few microservices that run in a Google Kubernetes Engine cluster. Each service is available in 3 instances distributed to each zone. Each microservice is written in Java 21 and uses Spring underneath. Having multiple services has its pros and cons. One of the problems we used to have early on is keeping dependencies up to date. To deliver value and keep services up to date we rely on the opinionated starter-like (fans of Spring Boot will know the concept) library that provides capabilities like Observability or Messaging Autoconfigurations. A snapshot version of the Bill of Material provides the starter library version, which is later pulled by all services. That way, we can quickly update all services and introduce new capabilities without a hassle.
Timeline of what has happened
- At 3.04, around 8 AM GMT, one of our engineers started working on updating the version of the Open telemetry library we use to provide our system’s observability.
- The new starter was released on 3.04 at 10:34 AM and tested on 1 app—the application worked fine.
- After testing, as our apps are nearly the same, on 3.04 at 1:19 PM, the newest version of the starter was made available for the other services.
- At 10:12 PM, the User app received an updated starter and a new version of it landed in our production environment at 10:22 PM.
- At 3:14 PM, the team changed the value of the Open telemetry property OTEL_SDK_DISABLED=true. The change is system-wide and applied in Vault. Unfortunately, The User service wasn’t restarted to pick up the new property.
- At 9:52 PM, the first instance of the app crashing with out-of-memory errors. Kubernetes scheduler tried to recover it but failed to start up a new instance due to misconfiguration of properties introduced in the newest release.
- Over the next 10 minutes, 2 more instances died, making the whole app unavailable to users.
- At 10:02 PM, our external monitoring provider (Better Uptime) detected that the app didn’t work and informed us about downtime in 3 different places:
- Incident update posted to Slack channel
- Email to the on-call officer
- SMS to the on-call officer
- Unfortunately, the SMS didn’t reach the on-call person’s mobile phone, and no one saw other updates.
- At 05.04 5:09 AM CET, one of the employees reported issues to our Head of engineering via Slack.
- At 5.04 8:00 AM, other engineers noticed information about issues and started investigating issues.
- Engineers examining issues only see this message, which indicates property set on a per-service basis
Property: otel.resource.attributes
Value: "service.name=users"
Origin: System Environment Property "OTEL_RESOURCE_ATTRIBUTES"
Reason: org.springframework.core.convert.ConverterNotFoundException: No converter found capable of converting from type [java.lang.String] to type [java.util.Map<java.lang.String, java.lang.String>]
What went wrong, and what have we done to avoid it in the future
After the system had been stable, we started to find the root cause. After digging deep, we realized that it contained a bug. Disabling the SDK also disabled registering necessary converters, which prevented other beans from correctly initialising and the App from starting.
As you probably already realised, many things didn’t go as we could expect.
We were too dependent on the single-person on-call and non-formal incident management process. We must also improve internal communication so changes impacting multiple apps are communicated. Also, we found that the User app crashed regularly in the few days before downtime. But as OOM Kills was pushed to the general alert channel, it disappeared in the burden of messages.
As the next steps, we immediately added more people to the on-call rotation and created an escalation policy in case the downtime was not acknowledged. We are also improving internal, automatic communication about newly released changes in the Starter library and moving away from the snapshot version of BOM. Because we use the Renovate plugin to keep other dependencies up to date, it should help us achieve the pace we need and keep things running without breaking things. Another action point is to reiterate our approach to alerts and alert hell. OOM errors and app crashes will become the highest priority and be assigned per team.