Don't prevent risk, build the ability to recover

As an engineering leader, you must give your teammates space to try and fail. There is no better way to learn, grow and take on new responsibilities.

Don't try to prevent mistakes. Build an environment that guarantees a soft landing.

Okay, what does that mean in practice?

  • Clean up noise in logs and monitoring systems
  • Create solutions that catch mistakes
  • Build an automatic alerting system
  • Set up post-mortem sessions
  • Introduce rollback strategies
  • Introduce rollout strategies
  • Cultivate a blameless culture

Noise in logs, irrelevant crashes, failing/skipped unit tests - they all blind you to the real issues happening in the system.

Create solutions that catch mistakes at each level of development and delivery - test automation, code scanning (dependencies, security vulnerabilities), and a code review process.

Build an automatic alerting system - make critical errors come to you automatically (SMS, email, chat alert) rather than through the customer support team.
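
To make the alerting idea concrete, here is a minimal Python sketch, assuming your services already log through the standard `logging` module. `AlertHandler` and the `notify` callback are hypothetical names; in a real setup `notify` would wrap whatever SMS, email, or chat integration you use. Here it only appends to a list so the example stays self-contained.

```python
import logging
from typing import Callable


class AlertHandler(logging.Handler):
    """Forward ERROR-and-above log records to a notification callback.

    `notify` is a hypothetical hook: plug in your own SMS, email, or chat sender.
    """

    def __init__(self, notify: Callable[[str], None], level: int = logging.ERROR) -> None:
        super().__init__(level=level)
        self._notify = notify

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self._notify(self.format(record))
        except Exception:
            # A broken alert channel must never crash the application itself.
            self.handleError(record)


# Stand-in for a real channel: collect alerts in memory.
sent: list[str] = []

logger = logging.getLogger("payments")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(AlertHandler(sent.append))

logger.info("checkout started")             # below ERROR: no alert
logger.error("payment provider timed out")  # forwarded to `notify`
```

Only the error reaches the alert channel; routine messages stay in the regular logs, which also keeps the alerts themselves free of noise.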

Set up post-mortem sessions - discuss every single disaster: what you learned from it and how to prevent it in the future.

Introduce a rollback strategy. Things will go wrong, no matter how good your tech stack is. Build a plan to roll back broken changes.
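
One possible shape for such a plan, sketched in Python under loud assumptions: `activate` and `healthy` are hypothetical hooks standing in for your own deployment tooling and health checks. This is not a real deployment API, only an illustration of "switch, verify, and revert if the check fails".

```python
from typing import Callable


def deploy_with_rollback(
    current: str,
    candidate: str,
    activate: Callable[[str], None],  # hypothetical: points traffic at a version
    healthy: Callable[[], bool],      # hypothetical: smoke test or error-rate check
) -> str:
    """Activate `candidate`; return to `current` if the health check fails.

    Returns the version that is live when the function finishes.
    """
    activate(candidate)
    if healthy():
        return candidate
    # The soft landing: go back to the last known-good version.
    activate(current)
    return current


# Usage with in-memory stand-ins for real infrastructure.
activations: list[str] = []
live = deploy_with_rollback("v1", "v2", activations.append, healthy=lambda: False)
```

In this run the health check fails, so `live` ends up as `"v1"` and `activations` records both the attempt and the revert. The point of the design is that rolling back is an ordinary, rehearsed code path rather than an improvised emergency procedure.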

Introduce a rollout strategy - don't go all-in. Start with a few percent of customers, measure, then increase adoption step by step.
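
A common way to implement "a few percent of customers" is stable hash bucketing. The Python sketch below is one possible approach, not a prescribed one; `in_rollout`, the feature name, and the customer ID format are all hypothetical.

```python
import hashlib


def in_rollout(customer_id: str, feature: str, percent: float) -> bool:
    """Return True if this customer falls inside the rollout percentage.

    The same customer always gets the same answer for a given feature, and
    raising `percent` only adds customers; nobody already included is dropped.
    """
    # Hash feature + customer so each feature gets its own stable bucketing.
    digest = hashlib.sha256(f"{feature}:{customer_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # e.g. percent=5 covers buckets 0..499


# Hypothetical usage: gate a new checkout flow for roughly 5% of customers.
if in_rollout("customer-42", "new-checkout", percent=5):
    pass  # serve the new flow
else:
    pass  # serve the existing flow
```

Because the bucket depends only on the feature and the customer, you can move from 5% to 20% to 100% by changing a single number, and drop back to 0% just as quickly if the measurements look bad.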

Cultivate a blameless culture. A team needs to feel safe when you ask, "Why have we had a spike in crashes?" Rather than looking for who is guilty, you try to learn what went wrong - where to invest more focus, which process failed, and what to do to prevent it in the future.