# Zaoyi Zheng

Machine Learning Engineer | LLM Serving & Optimization | Distributed Systems | AI Infrastructure & Model Serving | Kubernetes & Docker

Location: San Francisco Bay Area, United States
Profile: https://flows.cv/zaoyi

I am a Machine Learning Infrastructure Engineer passionate about building high-stakes, large-scale production systems that bridge the gap between complex AI models and real-world reliability. Currently at SS&C Technologies, I architect the inference backbone for a mission-critical financial platform, handling 50M+ daily events with a focus on P99 latency (<50ms) and predictive capacity planning. I believe that in finance, reliability isn't just a metric—it's a requirement.

Previously, as a Founding Engineer at Ampfie, I led the end-to-end GenAI lifecycle. I'm particularly proud of designing a cost-aware semantic router that cut API costs by 40% and building an automated "LLM-as-a-judge" evaluation framework to ensure model groundedness and safety.

My core toolkit includes:

- 🚀 Serving: Triton Inference Server, vLLM, TensorRT, Ray Serve
- 🏗️ Infrastructure: Kubernetes (EKS), Kafka, gRPC, Redis, OpenTelemetry
- 🧠 ML/GenAI: Agentic Orchestration, RAG Pipelines, LoRA/PEFT, LLM Evaluation

I thrive at the intersection of distributed systems and machine learning. Always open to discussing model serving, MLOps, or the future of agentic workflows.

## Work Experience

### Software Engineer @ SS&C Technologies
Jan 2024 – Present | San Francisco Bay Area

Worked on production inference infrastructure for real-time financial ML models.
- Designed an inference gateway serving 300–600 QPS (1000+ peak capacity) with adaptive batching, improving GPU utilization from ~40% to ~70%
- Reduced Triton model initialization latency from ~45s to ~8s via TensorRT engine caching and optimized loading strategies
- Implemented queue-depth-based autoscaling to stabilize performance during 2–3× traffic surges
- Built a low-latency Redis feature cache (~90% hit rate, <5ms P95) for real-time model features
- Improved sustained throughput by ~60% through batch-size × concurrency tuning

### Member @ AI Frontier Network
Jan 2025 – Present | San Francisco Bay Area

- Supported a real-time recommendation engine for 100M+ DAUs, maintaining sub-4ms P99 latency under high concurrency.
- Built a distributed key-value store handling 1.5M QPS, boosting throughput by 38% and ensuring high availability.
- Designed zero-downtime rolling updates on Kubernetes, cutting peak memory usage by 50% and minimizing service disruption.
- Improved gRPC client performance with in-process caching and resilient retries, raising cache hit ratio by 65%.
- Deployed monitoring and automated recovery workflows with Prometheus/Grafana, reducing MTTR for SLO violations by 75%.

### Founding Software Engineer @ Ampfie
Jan 2023 – Jan 2024 | San Francisco Bay Area

Mission: Architecting a cost-effective, production-grade GenAI platform for multimodal video understanding.

- Agentic Routing: Designed a latency-aware router orchestrating queries between Gemini and self-hosted vLLM (Llama 3); reduced operational costs by 40% without sacrificing quality.
- Automated Evaluation: Developed an LLM-as-a-judge pipeline to quantify groundedness and safety, reducing manual review cycles from days to minutes.
- Multimodal RAG: Built an event-driven pipeline leveraging TensorRT-optimized vision models for real-time video metadata generation and cataloging.
- Model Adaptation: Streamlined PEFT (LoRA) workflows, enabling 10x faster experimental iterations.
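The cost-aware routing idea above can be illustrated with a toy sketch. Everything here is an assumption for illustration only: the backend names, the per-token prices, the keyword heuristic, and the 0.4 threshold are placeholders, not Ampfie's actual implementation (which routed on latency and quality signals, not a lexical score).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    cost_per_1k_tokens: float  # USD; illustrative placeholder pricing


# Hypothetical backends standing in for a hosted API and a self-hosted vLLM.
GEMINI = Backend("gemini-pro", 0.50)
SELF_HOSTED = Backend("vllm-llama3", 0.05)


def estimate_complexity(query: str) -> float:
    """Cheap lexical heuristic standing in for a learned difficulty scorer."""
    score = min(len(query) / 500.0, 1.0)  # longer queries score higher
    if any(k in query.lower() for k in ("why", "explain", "compare", "prove")):
        score += 0.3  # reasoning-style keywords bump the score
    return min(score, 1.0)


def route(query: str, threshold: float = 0.4) -> Backend:
    """Send hard queries to the stronger paid model, easy ones to the
    cheap self-hosted backend."""
    return GEMINI if estimate_complexity(query) >= threshold else SELF_HOSTED
```

A production router would replace the keyword heuristic with a small classifier and fold in live latency and cost telemetry per backend, but the routing decision itself stays this simple: score the query, compare against a threshold, pick the cheapest backend that clears the quality bar.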
### Hacker @ TreeHacks
Jan 2023 – Jan 2023 | Stanford, California, United States

- Built a prototype music platform enabling users to upload tracks securely on a blockchain database (Estuary) to ensure immutability and copyright protection.
- Designed a music similarity detection algorithm analyzing rhythm, melody, and harmony to identify plagiarism risks and discover related tracks.
- Developed a user dashboard and music player with features for file management, playback, and similarity insights.
- Focused on scalability, algorithm accuracy, and security to deliver a reliable and privacy-first music sharing experience.

### Software Engineer @ Socotra
Jan 2022 – Jan 2022 | San Francisco Bay Area

- DevOps & CI/CD: Migrated legacy monolith services to containerized Docker environments, reducing build-deploy times by 40%.
- Modernization: Refactored legacy codebases to TypeScript, significantly improving type safety and reducing runtime errors for production APIs.

### Machine Learning Engineer @ Media Computing Lab in Nankai University
Jan 2018 – Jan 2020 | Jinnan District, Tianjin, China

Designed and implemented optimizations using FlashAttention and speculative decoding to improve LLM serving performance and reduce latency.

## Education

### Master of Science - MS in Computer Science
University of California, Davis

### Bachelor of Science - BS in Applied Mathematics
Nankai University

### Exchange Student in Computer Science
University of Cambridge

## Contact & Social

- LinkedIn: https://linkedin.com/in/zaoyi-zheng

---

Source: https://flows.cv/zaoyi
JSON Resume: https://flows.cv/zaoyi/resume.json
Last updated: 2026-04-10