Don't just prevent risk, build the ability to recover

As an engineering leader, your role goes beyond just preventing mistakes; it's about creating an environment that encourages growth and rapid recovery.

What does this mean in practice? Here are some examples:

  1. Clean Up Noise in Logs and Monitoring Systems:
    • Problem: Noise in logs and monitoring systems can distract your team from real issues.
    • Solution: Eliminate false positives so that every alert and logged error points to a real issue.
  1. Create Solutions That Catch Mistakes:
    • Problem: Mistakes often go unnoticed until it's too late.
    • Solution: Implement testing automation, code scanning, and robust code review processes to catch mistakes early in the development cycle.
  1. Build an Automatic Alerting System:
    • Problem: Critical errors can slip through the cracks, causing issues for your users.
    • Solution: Set up an automatic alerting system that notifies your team of critical errors via SMS, email, or chat.
  1. Set Up Post-Mortem Sessions:
    • Problem: Failures happen, but without analysis, they become repeated mistakes.
    • Solution: Hold post-mortem sessions to review past failures, learn from them, and strategize on how to prevent them in the future.
  1. Introduce Rollback Strategies:
    • Problem: Even with the best tech stack, a change can introduce issues that only show up in production.
    • Solution: Develop a rollback strategy to quickly revert problematic changes and maintain system stability.
  1. Introduce Rollout Strategies:
    • Problem: Rushing changes to all users can lead to widespread issues.
    • Solution: Gradually roll out changes to a small percentage of users, measure their impact, and incrementally increase adoption to ensure a smoother deployment.
  1. Cultivate a Blameless Culture:
    • Problem: Blaming individuals stifles growth and innovation.
    • Solution: Foster a blameless culture where the focus is on understanding what went wrong, improving processes, and preventing future issues.
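
To make the rollout and rollback items above more concrete, here is a minimal Python sketch rather than a definitive implementation. It assumes users are identified by a string ID and that you can count requests and errors for the new version. The function names (in_rollout, next_rollout_percent), the step schedule (1, 5, 25, 50, 100), and the 1% error threshold are all hypothetical choices made for illustration.

```python
import hashlib
from typing import Sequence


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Decide whether a user sees the new change.

    Hashing feature + user ID puts each user in a stable bucket (0-99),
    so the same user gets the same answer on every request, and users
    admitted at 5% stay admitted when the rollout grows to 25%.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def next_rollout_percent(
    current: int,
    errors: int,
    requests: int,
    max_error_rate: float = 0.01,               # assumed threshold
    steps: Sequence[int] = (1, 5, 25, 50, 100),  # assumed schedule
) -> int:
    """Return the next rollout percentage, or 0 to roll back."""
    if requests == 0:
        return current  # no data yet, so hold at the current stage
    if errors / requests > max_error_rate:
        return 0  # roll back: stop serving the change to everyone
    for step in steps:
        if step > current:
            return step  # healthy, so advance to the next stage
    return current  # already fully rolled out


if __name__ == "__main__":
    # Healthy at 5% (0.2% errors): advance to the next stage.
    print(next_rollout_percent(5, errors=2, requests=1000))    # 25
    # Unhealthy at 25% (3% errors): roll back.
    print(next_rollout_percent(25, errors=30, requests=1000))  # 0
```

The design choice worth noting is that rollback is just "set the percentage to 0". Recovering from a bad change then takes one configuration update instead of an emergency deployment.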