Be at the forefront of infrastructure reliability as an AI Infrastructure Site Reliability Engineer. Focus on maintaining system performance, security, and incident management to support our growing platform.
You'll collaborate with a small but passionate infrastructure team, working closely with DevOps and leadership to improve the reliability of AI systems. This hands-on role calls for a proactive approach to automating processes, improving observability, and keeping production services cost-efficient.
Key Responsibilities:
• Sustain platform uptime and availability metrics
• Optimize and secure infrastructure
• Resolve scaling issues proactively
• Collaborate on troubleshooting with product engineers
• Build and maintain observability systems
Requirements:
• Proven experience in Site Reliability Engineering or related field
• Familiarity with Elixir desirable
• Operating experience with Kubernetes clusters
• Competence with Terraform
• Expertise...