Andromeda

Where technology meets empathy – pioneering the future of human-robot interaction.

Performance Engineer – AI Infrastructure

Full Time | Remote | Team 11-50

Location

California

Posted

6 days ago

Salary

Not specified

Bachelor Degree | English | Cloud | Kubernetes | Python | PyTorch | Rust | TensorFlow

Job Description

  • Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O
  • Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution
  • Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime
  • Design technical processes that help the team operate effectively and avoid repeating performance regressions

Job Requirements

  • Proven experience running distributed training jobs on multi-GPU systems or HPC clusters
  • Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus)
  • Solid understanding of PyTorch, JAX, or TensorFlow, and large-scale training loops
  • Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code
  • Passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements

Benefits

  • Ownership and autonomy to shape how systems run
  • An inclusive environment that celebrates diversity
