L3 Hardware Support Lead
Location
United States
Posted
4 days ago
Salary
$125k–$180k base + quarterly performance bonuses
Job Description
This description summarizes our understanding of the job posting. Click the 'Apply' button to find out more.
Role Description
We are looking for a Lead Hardware Support Engineer to build and lead a production-grade L3 hardware support and escalation function for large-scale, GPU-dense datacenter infrastructure. This role owns high-severity incident response, complex hardware and firmware investigations, and enterprise customer escalations under contractual SLAs.
Responsibilities
- Building and leading the L3 and escalation support function for datacenter server infrastructure across multiple regions
- Acting as Incident Commander for high-severity production incidents, driving structured mitigation and communication
- Owning incident response, problem management, and cross-team escalation workflows end-to-end
- Supporting enterprise bare metal customers under contractual SLAs, including executive-level stakeholder communication
- Driving root cause analysis for hardware, firmware, and platform-level failures with clear corrective actions
- Managing vendor escalations with ODMs and OEMs through formal support channels and direct engagement
- Partnering with datacenter operations, hardware engineering, and infrastructure teams to improve reliability at fleet scale
- Establishing KPIs, escalation standards, and operational playbooks for production hardware support
- Hiring, coaching, and scaling a high-performing support engineering team
- Ensuring continuous improvement of response times, incident quality, and customer experience
Qualifications
- Experience building or leading an L3 and escalation support function for datacenter server infrastructure in distributed, multi-region environments
- Experience supporting enterprise bare metal customers under contractual SLAs
- Strong incident management leadership experience, including serving as Incident Commander
- Proven ability to build and formalize incident response, problem management, and cross-team escalation processes from scratch
- People management experience, including hiring, coaching, and performance management
- Strong English communication skills, written and verbal
Requirements
- Deep troubleshooting capability across Linux, server hardware, and firmware (BIOS/BMC), with ability to guide investigations at a systems engineer level
- Strong familiarity with GPU server platforms and common diagnostics (for example: nvidia-smi, dcgmi, Linux log correlation)
- Experience driving ODM and OEM vendor escalations through support portals and direct channels
- Scripting skills (bash and basic Python) for troubleshooting and lightweight analytics
- Exposure to OCP-based hardware platforms
Benefits
- Health insurance
- 401(k) plan
- Paid time off
- Sick leave
Compensation
We offer a competitive base salary of $125k–$180k, plus quarterly performance bonuses.
What we offer
- Competitive salary and comprehensive benefits package
- Opportunities for professional growth within Nebius
- Flexible working arrangements
- A dynamic and collaborative work environment that values initiative and innovation