Menlo Park, California, United States
* [AI Agents] Architected and led the deployment of agentic AI solutions for proactive failure detection and root cause analysis across a large-scale, critical infrastructure environment. The initiative reduced Mean Time To Mitigation (MTTM) by 50% and substantially increased overall service uptime.
* [Traffic Telemetry Platform] Designed and deployed an Active Traffic Telemetry Platform that monitors Client-to-Edge availability for a substantial subset of Meta's Family of Apps (FoA) customers. The system sends real-time, actionable signals to infrastructure owners when customer reachability issues are detected. It also automatically root-causes and categorizes each issue, distinguishing problems that originate within Meta infrastructure from those caused by general Internet connectivity or upstream network conditions.
* [Large Scale Anomaly Clustering] Led and delivered a large-scale anomaly detection program covering 2M monitored metrics/servers, reducing on-call alert noise by ~40% and significantly improving the signal-to-noise ratio for engineering teams.