Experience
2024 — Now
Palo Alto, California, United States
Focused on Core Product, Applied Evals, and Task Learning
2022 — 2024
Technical Lead for Offline Orchestration. Modernizing our orchestration stack at LI. As part of this modernization, enabling Python+SQL-first interfaces & event-driven data triggering on 4+ exabytes of data for Metrics, Tracking, Machine Learning, and products with $10B+ revenue streams. Collaborating across security & foundations to meet DMA requirements.
Technical Lead for Offline Data Foundations. Migrated and standardized LinkedIn’s exabyte-scale data lake onto a modern compute/deployment/datacenter-fabric across areas of orchestration, compute, metadata, and storage. Collaborated cross-functionally with SREs, program managers, and multiple platform organizations to stabilize our data lake amidst accelerating adoption of LLMs, Azure, and K8s.
Code Contributions:
* Introduced GroundHog Day, which snapshots a production data lake and deploys over 20,000 instances per year, all on K8s, to drive operational excellence. Accepted to KubeCon 2024 @ Paris to discuss our journey.
* Introduced and integrated OpenTelemetry+Azure Time Series+Grafana for ML & Data orgs in the company, then worked with observability teams to standardize it across the company.
* Updated Airflow, Flyte, and internal workflow systems to support machine learning needs.
2020 — 2022
Founding member of LLM distributed training infra team on Kubernetes at LI. We started with Feed & Ads to enable distributed training on Horovod and supported billion-parameter models on over 1000 GPUs in 2021, migrating all use cases from YARN to K8s. The same compute infrastructure is used today for experiments, distributed training, and inference for LLMs.
Code Contributions:
* Designed and implemented an ML training platform on top of 1000+ state-of-the-art GPUs for several distributed training models with 1 billion+ parameters.
* Scaled training cluster to 1000+ GPUs on K8s, moving everything over from YARN.
* Sped up ML training by introducing data parallelism to the company for rec, search, and ads ranking models (MPI, Horovod).
Founding Member of Multi-Cluster, Multi-Cloud Platform at LinkedIn. I was a founding member of the Terraform Platform at LI for our hybrid Azure & on-prem cloud. Played a pivotal role in an auto-fabric build system for entire data centers at LI on Azure, and in the platformization of infrastructure as code across the compute, storage, and networking stack at LI.
Code Contributions:
* Initiated a self-service resource management system hosting, auditing, and remediating 300,000+ infrastructure resources on-prem & on Azure within 6 months of launch, hosting Azure's largest customer for cloud resources.
* Filed a patent to enable IaC @ LI. Presented at HashiConf.
2018 — 2020
San Francisco Bay Area
Worked on optimizing system performance of CI/CD. I identified and implemented solutions that improved compute scheduling reliability for the entire company by 27%. I also worked on reducing costs in CI, which involved using VMs for resource efficiency and improving queuing systems for higher throughput.
2016 — 2018
San Francisco Bay Area
2016: Maharbiz Lab on motion tracking of nanoparticles
2017: Biomechanics Labs on ConvNets
2018: RISE Lab on self-driving car platform and optimization
Technology Used: TensorFlow, Keras, Python, OpenCV, Scikit, Pandas
Education
UC Berkeley College of Engineering
Bachelor of Science - BS