Palo Alto, California, United States
• Spearheaded the creation of scalable machine learning platforms, enabling ML engineers to efficiently deploy distributed ML workloads.
• Used PyTorch (including DDP), Ray, PyTorch Lightning, and Horovod, alongside Kubernetes, AWS, and HPC, to run distributed ML workloads.
• Integrated tools such as W&B for logging, visualizing, and tracking machine learning metrics, and LakeFS for managing and versioning training datasets, improving platform efficiency and workflow management.
• Conceptualized and built new software systems with a focus on scalable and efficient architecture, driving innovation in ML infrastructure.
• Collaborated with cross-functional teams to understand requirements, provide technical guidance, and promote best practices in distributed machine learning.
• Combined programming skills in Python, Node.js, TypeScript, Java, and C# with expertise in designing large-scale backend and frontend systems to develop robust, high-performance applications that meet complex business requirements.
• Implemented cost-saving measures and performance optimizations that significantly reduced experiment runtime and cloud expenditure.