Andromeda Cluster

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved for hyperscalers. We began with a single managed cluster, but it filled almost instantly. Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our long-term vision is to build the liquidity layer for global AI compute. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

Performance Engineer - AI Infrastructure

Infrastructure Engineer · Full Time · Remote · Team 11-50

Location: United States

Posted: 7 days ago

Salary: Not specified

Job Description

Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers.

We began with a single managed cluster, but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that makes the world’s AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

The Opportunity

We are hiring a Performance Engineer to join our Growth team. In this role, your "product" is the efficiency and throughput of our massive-scale AI clusters. As we scale our network, the difference between a "working" cluster and an "optimized" one represents millions of dollars in value and weeks of saved research time for our customers.

You will sit at the intersection of systems engineering and research, profiling end-to-end training runs to hunt down bottlenecks in compute, communication, and storage.


What You’ll Do

  • Profile & Optimize: Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O.

  • System Refinement: Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution.

  • Observability: Build and maintain high-fidelity tooling to monitor and visualize model FLOPs utilization (MFU), throughput, and cluster uptime.

  • Process Design: Design technical processes (e.g., postmortem reviews, incident response) that help the team operate effectively and avoid repeating performance regressions.

What We’re Looking For

  • Systems Intuition: You love optimizing performance and digging into systems to understand how every layer interacts—from the training loop to the hardware.

  • Distributed Training Experience: Proven experience running distributed training jobs on multi-GPU systems or HPC clusters.

  • Coding Proficiency: Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus).

  • ML Framework Depth: Solid understanding of PyTorch, JAX, or TensorFlow, and how large-scale training loops are built.

  • Infrastructure Knowledge: Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code.

  • Rigor: A passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements.

Strong Candidates May Have

  • Low-Level Mastery: Experience with Linux kernel tuning and eBPF, and an understanding of systems design tradeoffs at the hardware level.

  • Specialized AI Infra: Hands-on experience with GPUs, TPUs, or Trainium, and the networking libraries that power them (NCCL, MPI, UCX).

  • Security & Privacy: Expertise in security best practices for high-scale infrastructure.

  • Observability: Familiarity with monitoring tools like Prometheus and Grafana.

Why You’ll Love It Here

This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.

Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
