L3 Hardware Support Lead

Full Time, Remote, Team 473

Location

United States

Posted

4 days ago

Salary

$125k–$180k base, plus quarterly performance bonuses

Job Description

This is a summary of our understanding of the job description.

Role Description

We are looking for a Lead Hardware Support Engineer to build and lead a production-grade L3 hardware support and escalation function for large-scale, GPU-dense datacenter infrastructure. This role owns high-severity incident response, complex hardware and firmware investigations, and enterprise customer escalations under contractual SLAs.

  • Building and leading the L3 and escalation support function for datacenter server infrastructure across multiple regions
  • Acting as Incident Commander for high-severity production incidents, driving structured mitigation and communication
  • Owning incident response, problem management, and cross-team escalation workflows end-to-end
  • Supporting enterprise bare metal customers under contractual SLAs, including executive-level stakeholder communication
  • Driving root cause analysis for hardware, firmware, and platform-level failures with clear corrective actions
  • Managing vendor escalations with ODMs and OEMs through formal support channels and direct engagement
  • Partnering with datacenter operations, hardware engineering, and infrastructure teams to improve reliability at fleet scale
  • Establishing KPIs, escalation standards, and operational playbooks for production hardware support
  • Hiring, coaching, and scaling a high-performing support engineering team
  • Ensuring continuous improvement of response times, incident quality, and customer experience

Qualifications

  • Experience building or leading an L3 and escalation support function for datacenter server infrastructure in distributed, multi-region environments
  • Experience supporting enterprise bare metal customers under contractual SLAs
  • Strong incident management leadership experience, including serving as Incident Commander
  • Proven ability to build and formalize incident response, problem management, and cross-team escalation processes from scratch
  • People management experience, including hiring, coaching, and performance management
  • Strong English communication skills, written and verbal

Requirements

  • Deep troubleshooting capability across Linux, server hardware, and firmware (BIOS/BMC), with ability to guide investigations at a systems engineer level
  • Strong familiarity with GPU server platforms and common diagnostics (for example: nvidia-smi, dcgmi, Linux log correlation)
  • Experience driving ODM and OEM vendor escalations through support portals and direct channels
  • Scripting skills (bash and basic Python) for troubleshooting and lightweight analytics
  • Exposure to OCP-based hardware platforms

Benefits

  • Health insurance
  • 401(k) plan
  • Paid time off
  • Sick leave

Compensation

We offer competitive salaries, ranging from $125k to $180k base, plus quarterly performance bonuses.

What we offer

  • Competitive salary and comprehensive benefits package
  • Opportunities for professional growth within Nebius
  • Flexible working arrangements
  • A dynamic and collaborative work environment that values initiative and innovation
