About the role
We are looking for a Senior Site Reliability Engineer to join our Site Reliability Engineering (SRE) team. In this role, you'll drive the reliability, scalability, and performance of our platform, ensuring our systems remain stable as we grow. We value innovation and are seeking someone eager to bring fresh ideas, especially around building automation that reduces manual effort and improving the resilience of our distributed systems.
This isn't a top-down organization; our engineers are the ones who flag technical challenges and design the solutions. You will collaborate closely with Platform Engineering, Security, AI Platform, and Product teams to design durable systems and make data-driven operational decisions.
What you'll do
- Collaborate with Engineering, Platform, and Security teams to embed SRE best practices early in system design.
- Lead improvements in observability, monitoring, alerting, and incident-response workflows.
- Analyze platform performance to contribute to cost optimization, performance tuning, and resilience planning.
- Build infrastructure and automation tooling that improves platform reliability and enhances deployment safety.
- Diagnose and resolve complex production issues across distributed systems, and drive open post-incident reviews so failures translate into durable improvements.
- Strengthen system consistency and author clear, concise documentation for runbooks and operational processes.
Who we're looking for
- 4+ years of experience in SRE, DevOps, platform engineering, or similar production-facing roles.
- Strong problem-solving and debugging skills in distributed systems, used to keep improving platform stability.
- Eager to share operational guidelines, champion SRE practices across teams, and openly discuss what we can learn from system failures.
- Excellent communication skills (English is our default language) with a genuine, collaborative approach to working across diverse engineering teams.
- Strong hands-on experience with cloud environments (AWS, GCP, or similar) and proficiency with infrastructure-as-code and CI/CD pipelines.
- Familiarity with Kubernetes (or other container orchestration platforms), event-driven architectures, or supporting ML/AI workloads and GPU infrastructure.
Benefits
- Flexible working models with a base in vibrant Prague and the option of a hybrid setup.
- Competitive benefits designed to support your well-being, growth, and work-life harmony.
- 5 weeks of vacation, 5 sick/personal days, and an extra 2 weeks of paternity leave.
- A budget for personal development, education, and language courses.
- High-end tech (MacBook, external monitor, keyboard of your choice) and a MultiSport card.
- Team offsites, regular meetups, and a friendly, ambitious team.