
Lead Site Reliability Engineer - Infrastructure

Work from home, full-time role

Job Description

We are seeking a Lead Site Reliability Engineer (Infrastructure) to act as technical lead for our Infrastructure SRE team in a fast-moving VSaaS engineering organization. In this role, you will own the team's technical direction and execution across reliability, scalability, and operability of our shared platform and production systems, combining hands-on technical leadership with responsibility for team outcomes. You will define SRE strategy and guide architecture across our GCP and Kubernetes ecosystem, setting standards for reliability, scalability, GitOps, and observability. You will also mentor senior and staff engineers, and lead incident response and high-impact operational work, contributing hands-on when needed.

Role Overview: Site Reliability Engineer - Infrastructure

In this role, you will translate product and business needs into scalable infrastructure and clear technical direction. With a system-wide view of the platform, you will guide architectural decisions, surface non-obvious risks, and drive long-term improvements to system reliability and operability. Working closely with product and platform teams, you will shape the developer experience and ensure engineering teams can ship with speed and confidence. You will set engineering standards and continuously evolve our GitOps and observability practices.

This role requires strong expertise in cloud infrastructure, distributed systems, and CI/CD, along with hands-on experience in Golang and/or Python to support automation and long-term system reliability.

Responsibilities

As a Lead Site Reliability Engineer, you will:

  • Team Leadership & Execution Ownership: Own technical direction and execution of the Infrastructure SRE team. Translate platform goals into actionable plans, ensuring alignment on priorities, reliability outcomes, and operational excellence across production systems.
  • Production Operations & Incident Management: Operate and evolve large-scale distributed systems in production, proactively identifying failure modes and mitigating risk. Own day-to-day operations including monitoring, alerting, incident response, coordination, post-incident analysis, and continuous improvement.
  • Architecture, Standards & Platform Governance: Provide architectural leadership across platform and infrastructure changes, identifying scalability constraints, system design risks, and long-term reliability gaps. Define and enforce engineering standards for GCP, Kubernetes, and ArgoCD, ensuring consistent, secure, GitOps-based delivery.
  • Reliability Engineering & Observability: Lead strategy for monitoring, alerting, and system observability, driving a shift from reactive incidents to proactive reliability engineering.
  • Enablement, CI/CD & Collaboration: Guide CI/CD and cloud-native delivery practices at scale to ensure safe, scalable releases. Mentor senior and staff engineers, conduct high-impact design and code reviews (Golang/Python), and partner with product and engineering teams to embed system-level thinking across development.
  • Hands-on Technical Contribution: Provide hands-on technical contribution where needed, including debugging production issues, reviewing and contributing to code, and supporting critical incident resolution to ensure system reliability and team effectiveness.
  • Other duties as assigned are absorbed into the above ownership and operational responsibilities.

Minimum Qualifications

  • Leadership & Experience: 10+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure Engineering, including demonstrated experience leading technical engineering teams, driving roadmaps, and owning delivery of large-scale production systems.
  • Cloud & Distributed Systems Expertise: Deep experience with cloud-native architectures and distributed systems at scale, particularly in GCP and Kubernetes environments. Ability to reason about system design, identify failure modes, and evaluate scalability and reliability risks.
  • GitOps & Delivery Engineering: Strong experience with GitOps-based delivery workflows, particularly ArgoCD, and CI/CD pipeline design. Ability to ensure safe, repeatable, and observable production deployments.
  • Infrastructure & Automation: Strong hands-on background in infrastructure-as-code (Terraform preferred), automation, and operational tooling. Proficiency in Golang and/or Python for building and reviewing production systems. Strong Linux systems knowledge and production troubleshooting experience.
  • Observability & Reliability Engineering: Experience designing or operating observability systems (logging, monitoring, alerting) and applying SRE principles such as SLOs, incident management, postmortems, and reliability engineering practices.
  • Technical Oversight & Engineering Quality: Ability to review and critique system design and production code, ensuring engineering quality across backend systems and infrastructure components.
  • Communication & Leadership Influence: Ability to influence technical direction, communicate trade-offs to stakeholders, and drive alignment across product and engineering teams on reliability and platform priorities.

Why Milestone?

Milestone offers not only great benefits but also a great culture. Employees here have flexible work environments, opportunities for further education, and the ability to effect change in our organization directly.

The annual salary for this position ranges from $160,000 to $180,000. Pay is based on the level, location, complexity, responsibility, and job duties of the specific position and is just one component of Milestone's total compensation package. Additionally, we offer an attractive benefits package that includes medical/dental benefits, FSA or HSA, 401k with 6% Safe Harbor employer match, paid parental leave, generous PTO (20 days of vacation, 10 days of paid sick time, and 12 company holidays), a fully paid short-term disability policy, a fully paid long-term disability policy, and life insurance. If you are selected for an interview, please feel welcome to speak to our Talent Partner about our compensation philosophy.

All employees must complete a background check. Employees in fiscal roles are also required to undergo a credit check. All information obtained during these checks is handled confidentially and shared only with authorized personnel.

Milestone is committed to creating a diverse and inclusive workplace and is proud to be an equal opportunity employer.

Contact and application

Please apply at our website: www.milestonesys.com

We are looking forward to receiving your application.
