[Remote] Site Reliability Engineer

Work from home, full-time role

Note: This is a remote position open to candidates in the USA. Runpod is a rapidly growing company that provides a foundational platform for developers to build and run custom AI systems. As a Site Reliability Engineer, you will ensure the stability and resilience of Runpod’s distributed platform by partnering with engineering teams, improving system design, and enhancing observability to prevent incidents.

Responsibilities

  • Define and implement SLIs/SLOs for critical services
  • Lead incident response and coordinate cross-team mitigation efforts
  • Conduct blameless postmortems and ensure corrective actions are completed
  • Perform production readiness reviews for new services and features
  • Identify systemic risks and drive preventative improvements
  • Design and improve monitoring, alerting, and dashboards (Prometheus, Grafana, etc.)
  • Improve signal-to-noise ratio in alerts and reduce alert fatigue
  • Build internal tooling for reliability tracking and reporting
  • Improve visibility into GPU performance and distributed systems health
  • Automate recurring operational workflows
  • Build tools and scripts (Python, Go, Bash) to eliminate manual processes
  • Improve deployment safety through automation and guardrails
  • Strengthen CI/CD reliability and release processes
  • Partner with engineering teams to improve system resilience
  • Provide guidance on fault tolerance, scalability, and failure handling
  • Contribute to architectural discussions with a reliability-first mindset

Skills

  • 5+ years of experience in SRE, Reliability Engineering, or Production Engineering
  • Strong Linux systems and networking expertise
  • Experience managing containerized production systems
  • Strong understanding of distributed systems and failure modes
  • Experience defining and managing SLIs/SLOs
  • Proven incident response and postmortem leadership experience
  • Strong scripting or programming skills
  • Experience with monitoring and alerting systems
  • Excellent written communication skills
  • Successful completion of a background check
  • Experience with GPU infrastructure or AI/ML platforms
  • Experience improving reliability in high-growth or large-scale environments
  • Familiarity with GPU observability tooling
  • Experience with Infrastructure as Code
  • Experience working in startup environments
  • Experience building internal reliability platforms or frameworks

Benefits

  • Meaningful equity in a fast-growing company: everyone on the team receives stock options, so your impact drives our growth and you share in the upside.
  • Generous medical, dental & vision plans
  • Flexible PTO: take the time you need to recharge
  • Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication
  • Join a passionate team on the cutting edge of AI infrastructure — where culture, learning, and ownership are at the heart of how we scale.

Company Overview

  • Runpod is a cloud platform designed for GPUs, enabling developers to deploy customized full-stack AI applications. It was founded in 2022, and is headquartered in Mount Laurel, New Jersey, USA, with a workforce of 51-200 employees. Its website is https://www.runpod.io.

Company H1B Sponsorship

  • Runpod has a track record of offering H1B sponsorships, with 4 in 2025 and 3 in 2024. Please note that this does not guarantee sponsorship for this specific role.