Palo Alto, California, United States
• Built an automated eval-driven release gating system for LLM serving changes, enforcing promotion/rollback decisions via side-by-side evaluation of quality, latency, and behavioral drift metrics under production load.
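The promotion/rollback logic in the bullet above can be sketched roughly as follows; the metric names, tolerances, and decision rules here are illustrative assumptions, not the actual system:

```python
def gate_release(baseline: dict, candidate: dict,
                 max_quality_drop: float = 0.01,
                 max_latency_regression_ms: float = 20.0) -> str:
    """Release-gating sketch: promote a candidate serving change only if it
    does not regress quality or p90 latency beyond tolerance versus the
    baseline measured side by side. Metric keys are hypothetical."""
    if candidate["quality"] < baseline["quality"] - max_quality_drop:
        return "rollback"
    if candidate["p90_latency_ms"] > baseline["p90_latency_ms"] + max_latency_regression_ms:
        return "rollback"
    return "promote"
```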
• Designed and built a real-time semantic observability service for LLM drift detection, comparing online logprob distributions against calibration baselines using non-parametric statistical methods to flag silent hardware faults and behavioral changes.
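A minimal sketch of the kind of non-parametric comparison described above, using a two-sample Kolmogorov–Smirnov statistic on logprob samples (the specific test and threshold are assumptions; the bullet does not name them):

```python
import bisect

def ks_statistic(sample_a: list, sample_b: list) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    the two samples' empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def drift_alert(baseline_logprobs: list, online_logprobs: list,
                threshold: float = 0.15) -> bool:
    # Flag when the online logprob distribution has shifted away from
    # the calibration baseline (threshold is illustrative).
    return ks_statistic(baseline_logprobs, online_logprobs) > threshold
```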
• Built a low-latency LLM inference platform using SGLang on Kubernetes, using sticky routing to exploit prefix overlap and maximize KV-cache hit rates across a multi-cloud GPU federation.
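Sticky routing of the sort described above can be sketched as hashing a fixed-length prompt prefix to pick a replica, so requests sharing a prefix reuse the same KV cache; the prefix length and hash choice here are assumptions:

```python
import hashlib

def route_by_prefix(prompt: str, replicas: list, prefix_tokens: int = 32) -> str:
    """Sticky-routing sketch: hash the first `prefix_tokens` whitespace tokens
    of the prompt so requests with a shared prefix land on the same replica
    and hit its warm KV cache (parameters are illustrative)."""
    prefix = " ".join(prompt.split()[:prefix_tokens])
    h = int(hashlib.sha256(prefix.encode()).hexdigest(), 16)
    return replicas[h % len(replicas)]
```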
• Designed a forecast-driven proactive scaling control plane to align H200 GPU capacity with projected demand, virtually eliminating cold starts to meet a 400 ms p90 conversational latency SLA and enabling the company to scale from 100k to 2M phone calls per month.
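The proactive-scaling decision above reduces to provisioning for forecast demand plus headroom before traffic arrives; this sketch assumes a per-GPU throughput figure and headroom fraction, neither of which is stated in the bullet:

```python
import math

def target_gpu_count(forecast_calls_per_min: float,
                     calls_per_gpu_per_min: float,
                     headroom: float = 0.2,
                     min_gpus: int = 1) -> int:
    """Forecast-driven scaling sketch: size the warm GPU pool to projected
    demand plus a headroom buffer so cold starts are avoided. All
    parameters are illustrative assumptions."""
    needed = forecast_calls_per_min * (1 + headroom) / calls_per_gpu_per_min
    return max(min_gpus, math.ceil(needed))
```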