Train AI on distributed data
Founding ML Engineer – Flower Frontier Model Team
Location
California, New York
Posted
94 days ago
Salary
Not specified
Job Description
Job Requirements
- Exceptional software engineering skills (Python, deep learning frameworks, testing, profiling, refactoring, reproducibility)
- Expertise with modern ML training stacks: PyTorch, JAX or equivalent; experience implementing model architectures from scratch and working within libraries like DeepSpeed, Megatron or equivalent
- Ability to tune, debug, and profile large-scale training runs
- Hands-on experience working with large GPU clusters, including job orchestration, scheduling, multi-node runs, NCCL/RDMA issues, and GPU performance optimization
- Ability to collaborate effectively with both research-oriented and engineering-oriented colleagues; comfortable turning research ideas into robust, maintainable implementations
- Good engineering hygiene: modular design, code reviews, documentation, reproducibility, versioning of data/models/configurations
- Familiarity with common tools (Linux command line, git, Docker, …)
- Openness to adopting new tooling
- Solid understanding of distributed systems and networking
- Strong written English
- Open, honest, and transparent communication skills
Benefits
- Flexible working hours
- Professional development opportunities