[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Mozilla's Thunderbird is a trusted open-source email application, and the team is seeking a Senior Site Reliability Engineer to build and maintain the infrastructure that users depend on. The role involves designing and developing CI/CD systems, diagnosing production incidents, and implementing improvements to system reliability.
Responsibilities
- Operate and evolve our EKS-based Kubernetes platform, supporting service migrations, platform improvements, and reliability initiatives
- Design and develop CI/CD systems supporting websites, services, and Thunderbird desktop releases, contributing to pipeline reliability and OIDC-based authentication across GitHub Actions workflows
- Write and maintain infrastructure in Pulumi and/or Terraform/OpenTofu across multiple AWS accounts
- Operate and evolve our observability stack (VictoriaMetrics, VictoriaLogs, Grafana, Vector) and partner with engineering teams to incorporate instrumentation and monitoring into service design
- Apply security-conscious infrastructure practices, including least-privilege IAM, secrets management via AWS Secrets Manager and External Secrets Operator, and network segmentation
- Diagnose and debug production incidents; drive root-cause analysis and post-incident improvements to prevent recurring problems
- Participate in the on-call rotation and collaborate with SDEs and fellow SREs to ship, maintain, and monitor new builds and to support service onboarding
- Contribute to runbooks, architecture documentation, and team processes
Skills
- 7+ years of experience in infrastructure, platform engineering, or site reliability roles, including hands-on production Kubernetes experience in workload operations, troubleshooting, and cluster management
- Hands-on experience with infrastructure-as-code on AWS using Terraform, OpenTofu, or Pulumi
- Security awareness in day-to-day infrastructure work: identity, least privilege, secrets hygiene, and network controls
- Demonstrated ownership mindset with the ability to proactively identify issues, drive work to completion, and communicate risks early
- Excellent async written communication skills; comfortable working with a geographically distributed team
- Ability to collaborate effectively with software engineers and non-engineering stakeholders to improve platform reliability and operational efficiency
- Ability to learn, evaluate, and responsibly use emerging technologies, including AI-enabled tools, to improve work processes
- Experience with GitOps workflows (ArgoCD or Flux)
- Familiarity with Keycloak or similar identity platforms (OIDC, SAML, federation)
- Knowledge of email protocols and/or experience operating email infrastructure (SMTP, IMAP)
- Prior work in or alongside an open-source community
- French, German, Japanese, or other language proficiency in addition to English
Benefits
- Fully remote work & schedule flexibility
- Company-provided laptop
- Annual bonus program
- Monthly remote work stipend
- Annual professional development stipend
- Industry conferences
- Company all-hands and team gatherings
- 24 days PTO per year (prorated)
- Your birthday off
- Year-end company shutdown
- 9 wellbeing days
- Public holidays
- Other paid leave
- Quarterly wellbeing stipend for personal / family activities
- 401(k) / RRSP contributions
- Health, dental, & vision insurance
- Disability insurance
- Life insurance
- Employee assistance program
- Paid parental leave
- Paid sick days
Company Overview