Senior System Software Engineer, NCCL – Partner Enablement
Full-stack Engineer | Software Engineer | Full Time | Remote | Team 10,001+ | Since 1993 | H1B Sponsor
Location
California, Texas
Posted
56 days ago
Salary
$152K - $218.5K / year
Bachelor Degree | 5 yrs exp | English | Ansible | AWS | Azure | Cloud | Docker | Google Cloud Platform | Kubernetes | Linux | Node.js | Python
Job Description
• Engage with our partners and customers to root-cause functional and performance issues reported with NCCL
• Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters
• Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)
• Guide our customers and support teams on HPC knowledge and standard methodologies for running applications on multi-node clusters
• Document and conduct trainings/webinars for NCCL
• Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure and support
Job Requirements
- B.S./M.S. degree in CS/CE or equivalent experience with 5+ years of relevant experience
- Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
- Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design
- Experience working with engineering or academic research community supporting HPC or AI
- Practical experience with high-performance networking: InfiniBand/RoCE/Ethernet networks, RDMA, topologies, congestion control
- Expert in Linux fundamentals and a scripting language, preferably Python
- Familiar with containers, cloud provisioning and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible)
- Adaptability and passion to learn new areas and tools
- Flexibility to work and communicate effectively across different teams and time zones
Benefits
- Equity
- Benefits