Hydra Host

A distributed marketplace for compute

AI Infrastructure Engineer

Infrastructure Engineer | Full Time | Remote | Team 11-50 | Since 2021 | H1B No Sponsor

Location

United States

Posted

32 days ago

Salary

$150K - $225K / year

Bachelor Degree | English | Ansible | Cloud | Kubernetes | Linux | PyTorch | TCP/IP | Terraform

Job Description

  • Get AI Platform customers production-ready on Hydra — standing up Kubernetes clusters, configuring GPU drivers, validating networking, and troubleshooting the issues that surface when real workloads hit real hardware.
  • Own the bare metal ←→ platform layer — bridging GPU infrastructure (NCCL, InfiniBand, NVLink, storage) with orchestration layers (Kubernetes, SLURM) and MLOps tooling that customers actually use.
  • Configure, benchmark, and debug NVIDIA driver stacks — firmware versions, CUDA compatibility, NCCL tuning, MIG configurations.
  • Run quality benchmarks and diagnostics to validate performance for inference and training workloads across chip types.
  • Identify gaps before customers do — pressure-testing Hydra's infrastructure, APIs, and workflows to find what's missing or broken.
  • Turn customer learnings into product — working with Product and Engineering to build reusable templates, default configurations, and automated workflows that eliminate manual onboarding.
  • Advise customers on chip selection and tokenomics — helping AI platform customers understand price/performance trade-offs across GPU types, cost-per-token economics, and which hardware fits their inference or training workloads.

Job Requirements

  • Bare metal Linux depth — you've administered GPU servers at the metal: driver stacks, kernel tuning, firmware, storage configuration. Not just managed K8s.
  • NVIDIA GPU stack expertise — drivers, CUDA, NCCL, NVLink, nvidia-smi profiling. You understand how stack compatibility affects performance.
  • Kubernetes and orchestration — production experience with K8s, SLURM, or similar. You know how to stand up clusters, not just deploy to them.
  • AI Networking fundamentals — TCP/IP, VLANs, bonding, and high-speed interconnects (InfiniBand, RoCE) for distributed workloads.
  • Customer-facing communication — you can work directly with engineers at AI platform companies, understand their constraints, and translate that into clear requirements for your team.
  • Bias toward scalable solutions — you'd rather build a feature that helps 10 customers than a custom deployment that helps 1.

Nice to Have

  • HPC or large-scale distributed training environments.
  • AI workload experience (vLLM, PyTorch, inference frameworks).
  • Storage systems (NVMe, distributed filesystems, Ceph, WEKA).
  • IaC and provisioning tools (Terraform, Ansible, cloud-init, MAAS).

Benefits

  • Competitive salary
  • Equity ownership
  • Healthcare — medical, dental, vision for you and your family
  • Remote-first — with hubs in Phoenix, Boulder, and Miami
  • Direct impact — your work shapes how GPU infrastructure gets deployed across the AI ecosystem
