Spearheaded the development of a Kubernetes-native compute platform to run large-scale ML training and computer vision workloads, powering AR map generation and supporting a variety of GenAI and data processing use cases across the organization.
Drove core infrastructure enhancements, including:
• Migrating the platform to Ray
• Integrating Volcano for gang scheduling and optimized resource allocation
• Implementing zero-downtime node pool deployments
• Designing job queuing systems to ensure fair and effective resource usage
Solved critical scaling challenges for clusters exceeding 7,000 nodes, enabling large-scale ML training, massive data processing pipelines, and real-time AR operations with high reliability.
Led the design and execution of a high-throughput data pipeline processing 100M+ images daily, building a global 3D map to power advanced AR experiences.
Tech Stack: Kubernetes, Ray, Argo Workflows, Volcano, Cloud Infrastructure (AWS/GCP).