Where technology meets empathy – pioneering the future of human-robot interaction.

Performance Engineer – AI Infrastructure

Full Time, Remote, Team 11-50

Location

California

Posted

6 days ago

Salary

Not specified

Bachelor Degree, English, Cloud, Kubernetes, Python, PyTorch, Rust, TensorFlow

Job Description

• Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O
• Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution
• Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime
• Design technical processes that help the team operate effectively and avoid repeating performance regressions

Job Requirements

Proven experience running distributed training jobs on multi-GPU systems or HPC clusters
Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus)
Solid understanding of PyTorch, JAX, or TensorFlow, and large-scale training loops
Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code
Passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements

Benefits

Ownership and autonomy to shape how systems run
A culture that celebrates diversity and fosters an inclusive environment
