You will own reliability for core services across multiple clouds, drive automation, and mentor more junior engineers. You will partner with developer teams to embed resilience into feature delivery.
Responsibilities
Define and maintain SLIs/SLOs, and monitor SLO compliance and error budget usage
Lead incident response and postmortems, and implement corrective measures
Automate operational tasks with tooling (e.g., auto-remediation, scaling rules)
Build, improve, and maintain CI/CD pipelines, including canary and blue/green deployment strategies
Lead technical discussions with customers to align on reliability, scalability, and performance requirements
Drive continuous platform improvements across the service lifecycle, including architecture, monitoring, and operational processes
Implement and extend observability systems (metrics, tracing, log aggregation)