Consultant HPC Infrastructure Engineer

Work from home | Full-time role | Hiring

We are looking for a curious and driven engineer eager to step into the world of high-performance computing and AI infrastructure. In this role, you’ll gain hands-on experience supporting NVIDIA GPU clusters and automation pipelines that power some of the world’s most advanced AI workloads. Working alongside seasoned engineers, you’ll learn to apply Linux, Kubernetes, Terraform, and Prometheus in real-world environments where precision and scale truly matter.

If you’re passionate about technology that defines the future of computing, this is your chance to grow within a team shaping that frontier.

Office Travel: Frequent on-site work is required for this position (2–3 days/week) at our Santa Clara, CA office.

Job responsibilities

  • You will act as the initial responder to monitoring alerts, ensuring timely acknowledgment and preliminary triage of operational issues.
  • You will automate operational procedures and diagnostics using established scripting and Infrastructure as Code (IaC) tools, including Bash, Python, Ansible, Terraform, and Helm, under the guidance of senior engineers.
  • You will execute foundational diagnostics such as NCCL tests, DCGM (Data Center GPU Manager) diagnostics, fabric diagnostics, and designated test workloads for training and inference, following standard procedures.
  • You will apply a proactive and action-oriented mindset, resolving documented issues efficiently and suggesting improvements to runbooks or automation scripts based on recurring patterns.
  • You will analyze and interpret diagnostic outputs to assess system health and identify early signs of degradation or instability.
  • You will document all operational activities, system status changes, and troubleshooting steps with accuracy, clarity, and timeliness.
  • You will use observability tools such as Prometheus and Grafana to analyze logs and metrics, supporting senior engineers in the root cause isolation process.
  • You will develop hands-on familiarity with HPC workload management tools, including Slurm and/or Kubernetes.
  • You will actively participate in training sessions and knowledge-sharing initiatives to deepen your understanding of the GB200/GB300 architecture and operational best practices.
  • You will maintain a high level of discipline, attention to detail, and consistency across all operational tasks. 

Job qualifications

Technical Skills

  • You have foundational knowledge of Linux operating systems and are comfortable with the Unix command line, including using awk, Bash, and Python for log parsing and basic automation.
  • You are familiar with or have exposure to HPC systems, including HPC schedulers (e.g., Slurm) or container orchestration tools (e.g., Kubernetes).
  • You are comfortable using observability platforms such as Prometheus and Grafana for log and metric visualization.
  • You are familiar with Infrastructure as Code (IaC) concepts and can execute automation using tools like Ansible or Terraform.
  • You have familiarity with GPU-based workloads and are eager to deepen your understanding of AI and HPC operations.

Professional Skills

  • You demonstrate strong analytical ability and can follow complex procedures while interpreting technical results (e.g., NCCL tests).
  • You communicate with clarity and accuracy, producing clear documentation and reports for both peers and senior engineers.
  • You collaborate effectively with cross-functional teams, embracing mentorship and continuous feedback.
  • You bring curiosity, persistence, and discipline, with a strong desire to learn and grow in advanced HPC operations.
  • You work with attention to detail, ensuring consistency and accuracy in every task you undertake.
  • You thrive in an environment that values learning, precision, and shared ownership.

Growth Expectation

We value curiosity and a growth mindset. Candidates are expected to bring a strong foundation in Linux and scripting from academic or prior professional experience. Proficiency in advanced scripting, IaC practices, and observability tooling (e.g., Prometheus, Grafana) may be developed within the first six months through structured on-the-job training and mentorship from senior engineers.

Other things to know

Learning & Development

There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you. But we also balance autonomy with the strength of our cultivation culture. This means your career is supported by interactive tools, numerous development programs and teammates who want to help you grow. We see value in helping each other be our best and that extends to empowering our employees in their career journeys.

About Thoughtworks

Thoughtworks is a dynamic and inclusive community of bright and supportive colleagues who are revolutionizing tech. As a leading technology consultancy, we’re pushing boundaries through our purposeful and impactful work. For 30+ years, we’ve delivered extraordinary impact together with our clients by helping them solve complex business problems with technology as the differentiator. Bring your brilliant expertise and commitment to continuous learning to Thoughtworks. Together, let’s be extraordinary.

Salary

Benefits: https://www.thoughtworks.com/en-us/careers/benefits

The annual salary range posted is subject to many factors and may vary depending on experience, geographic location, job responsibilities, performance, skills and/or training.

Salary: $108,100–$162,000 USD
