Experience

2020 — Now

San Francisco Bay Area

As an ML Infrastructure Engineer at Snap, I design, build, and maintain scalable machine learning infrastructure that supports model training across the company. My focus is on building reliable, cost-effective systems on Google Cloud Platform (GCP) that help internal teams develop and deploy models faster.

I specialize in:

Automated ML pipelines using tools like Kubeflow and Temporal, enabling rapid iteration and deployment of ML models.

Kubernetes (GKE) and Vertex AI for orchestrating large-scale training jobs, with a focus on performance tuning and cost optimization.

GPU resource management with systems like Kueue to maximize utilization and ensure smooth job scheduling.

Infrastructure support for Dataflow pipelines, focusing on resource optimization, operational reliability, and Cloud Spanner-based metadata management.

Building robust internal services including API layers, GCP IAM-based permission systems, and online training infrastructure for real-time model updates.

I'm also responsible for:

Supporting Snap’s internal ML platform users and ensuring high reliability of our systems.

Maintaining critical infrastructure components to streamline the end-to-end ML lifecycle.

Innovating on infrastructure to improve efficiency, reduce costs, and accelerate ML development timelines.

With a strong background in Docker, GCP, and open-source ML infrastructure tools, I'm passionate about empowering ML teams to scale their workflows, from experimentation to production.