Manager of Site Reliability Engineering

Full Time, Remote

Location

United States

Posted

19 hours ago

Salary

Not specified

Job Description

Role Description

We’re looking for a Manager of Site Reliability Engineering (SRE) who is passionate about building resilient systems and leading teams that keep critical services running smoothly. In this role, you’ll guide a team responsible for the reliability, performance, and operational health of our production environments.

You’ll partner closely with engineering leaders to ensure our systems remain secure, scalable, and available for the organizations and communities who depend on them.

As the Manager of Site Reliability Engineering, you will lead a team responsible for the operational reliability of Daxko’s production platforms. Your work will focus on creating stable, high-performing systems while empowering your team to continuously improve how we operate and support our products.

  • Lead and support a team responsible for the reliability and performance of production systems, which includes:
    • Setting clear performance expectations and goals for team members
    • Providing ongoing coaching and real-time feedback
    • Ensuring team members have the training and resources they need to succeed
    • Coordinating on-call rotations and operational coverage
    • Supporting the team during critical incidents and outages
    • Managing team staffing, including hiring and headcount planning
  • Prioritize and coordinate work across operational initiatives, deployments, upgrades, and infrastructure improvements
  • Ensure high levels of system uptime, data integrity, and operational stability
  • Partner with Engineering Leads to align platform operations with product development needs
  • Maintain business continuity across all production assets
  • Monitor system health, performance, and capacity to proactively identify and resolve issues
  • Serve as a technical escalation point for complex infrastructure or platform challenges
  • Provide regular reporting on system availability, response times, and capacity trends
  • Ensure operations meet security, compliance, and regulatory requirements
  • Support and coordinate the team’s on-call rotation and incident response processes
  • Continuously improve operational practices through automation, tooling, and monitoring

Qualifications

  • Bachelor’s degree in a technical discipline or equivalent professional experience
  • 3–5 years of experience leading or managing globally distributed engineering teams
  • 3–5 years of experience in a Site Reliability Engineering or similar infrastructure-focused role

Requirements

  • Strong analytical and problem-solving skills
  • Clear communication and collaboration skills
  • Experience leading teams in fast-moving technical environments
  • The ability to balance multiple priorities and make thoughtful decisions under pressure
  • Strong organizational and time management skills
  • A customer-focused mindset and commitment to system reliability

Preferred Experience

  • Experience serving as a technical lead on infrastructure or platform teams
  • Experience with modern observability and monitoring tools, such as OpenTelemetry, Instana, LogicMonitor, PagerDuty, or OpsGenie
  • Experience with infrastructure and automation tooling such as GitLab CI, Jenkins, Chef, Terraform, Elasticsearch, Kubernetes, or Rancher
  • Scripting experience in Ruby, Python, or Bash
  • Familiarity with SOC, PCI, or GDPR compliance standards
  • Experience working with issue tracking and collaboration tools such as the Atlassian suite
  • Experience supporting or developing applications built with Java, PHP, or Node.js
  • Experience automating operational processes and repetitive tasks
