Architected an automated alert routing system across 400+ engineering teams, eliminating manual operator triage for hundreds of daily alerts
Built a modernized Alert Management UI in React for infrastructure outage response
Implemented alert correlation logic to link related triggers across manual and automated resolutions
Reduced open ticket backlog from 170 to ~30 by restructuring triage with a tagging and prioritization system
Built a GPT-based tool during a company-wide stability effort, deflecting ~500 engineer questions and freeing 14 engineers from full-time support duty
Launched an engineering-wide Reliability Podcast surfacing lessons from historical outages (9 episodes, up to 800 views each)
Co-presented two Bloomberg Developer Conference talks, on Effective Alerting and on Improving Documentation