New York, New York, United States
Ads Machine Learning Infrastructure - Training Data Services
Launched a data pipeline benchmarking tool that automatically A/B tests code changes at commit time and detects CPU and memory utilization regressions. Reduced operational costs by 2 engineer-months per year and validated a key service scaling initiative that increased label processing capacity from 20M to 100M.
Designed and implemented a data filter optimization library that increased the output rate of CPU-based training data readers, achieving a 50% reduction in runtime for lazy-loaded training data pipelines and improving GPU reading efficiency.
Championed reliability for a service that mutates training data at training time, moving it from ~2 major outages per quarter to meeting 99.95% SLOs across all service endpoints. Managed service scaling and re-sharding, established SLOs, closed oncall playbook coverage gaps for critical and major alerts, and raised integration test coverage from 45% to 100%. Evangelized alerting best practices, helping reduce the team's oncall toil score.