# Zaoyi Zheng

Machine Learning Engineer | LLM Serving & Optimization | Distributed Systems | AI Infrastructure & Model Serving | Kubernetes & Docker

Location: San Francisco Bay Area, United States
Profile: https://flows.cv/zaoyi

I am a Machine Learning Infrastructure Engineer passionate about building high-stakes, large-scale production systems that bridge the gap between complex AI models and real-world reliability. Currently at SS&C Technologies, I architect the inference backbone for a mission-critical financial platform, handling 50M+ daily events with a focus on P99 latency (<50ms) and predictive capacity planning. I believe that in finance, reliability isn't just a metric—it's a requirement.

Previously, as a Founding Engineer at Ampfie, I led the end-to-end GenAI lifecycle. I'm particularly proud of designing a cost-aware semantic router that cut API costs by 40% and building an automated "LLM-as-a-judge" evaluation framework to ensure model groundedness and safety.

My core toolkit includes:

- 🚀 Serving: Triton Inference Server, vLLM, TensorRT, Ray Serve
- 🏗️ Infrastructure: Kubernetes (EKS), Kafka, gRPC, Redis, OpenTelemetry
- 🧠 ML/GenAI: Agentic Orchestration, RAG Pipelines, LoRA/PEFT, LLM Evaluation

I thrive at the intersection of distributed systems and machine learning. Always open to discussing model serving, MLOps, or the future of agentic workflows.

## Work Experience

### Software Engineer @ SS&C Technologies
Jan 2024 – Present | San Francisco Bay Area

Worked on production inference infrastructure for real-time financial ML models.
- Designed an inference gateway serving 300–600 QPS (1000+ peak capacity) with adaptive batching, improving GPU utilization from ~40% to ~70%
- Reduced Triton model initialization latency from ~45s to ~8s via TensorRT engine caching and optimized loading strategies
- Implemented queue-depth-based autoscaling to stabilize performance during 2–3× traffic surges
- Built a low-latency Redis feature cache (~90% hit rate, <5ms P95) for real-time model features
- Improved sustained throughput by ~60% through batch-size × concurrency tuning

### Member @ AI Frontier Network
Jan 2025 – Present | San Francisco Bay Area

- Supported a real-time recommendation engine for 100M+ DAUs, maintaining sub-4ms P99 latency under high concurrency.
- Built a distributed key-value store handling 1.5M QPS, boosting throughput by 38% and ensuring high availability.
- Designed zero-downtime rolling updates on Kubernetes, cutting peak memory usage by 50% and minimizing service disruption.
- Improved gRPC client performance with in-process caching and resilient retries, raising cache hit ratio by 65%.
- Deployed monitoring and automated recovery workflows with Prometheus/Grafana, reducing MTTR for SLO violations by 75%.

### Founding Software Engineer @ Ampfie
Jan 2023 – Jan 2024 | San Francisco Bay Area

Mission: Architecting a cost-effective, production-grade GenAI platform for multimodal video understanding.

- Agentic Routing: Designed a latency-aware router orchestrating queries between Gemini and self-hosted vLLM (Llama 3); reduced operational costs by 40% without sacrificing quality.
- Automated Evaluation: Developed an LLM-as-a-judge pipeline to quantify groundedness and safety, reducing manual review cycles from days to minutes.
- Multimodal RAG: Built an event-driven pipeline leveraging TensorRT-optimized vision models for real-time video metadata generation and cataloging.
- Model Adaptation: Streamlined PEFT (LoRA) workflows, enabling 10x faster experimental iterations.
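The cost-aware routing idea above can be illustrated with a toy sketch. Everything here is an assumption for illustration only: the backend names, the per-token prices, the keyword heuristic, and the 0.4 threshold are placeholders, not Ampfie's actual implementation (which routed on latency and quality signals, not a lexical score).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    cost_per_1k_tokens: float  # USD; illustrative placeholder pricing


# Hypothetical backends standing in for a hosted API and a self-hosted vLLM.
GEMINI = Backend("gemini-pro", 0.50)
SELF_HOSTED = Backend("vllm-llama3", 0.05)


def estimate_complexity(query: str) -> float:
    """Cheap lexical heuristic standing in for a learned difficulty scorer."""
    score = min(len(query) / 500.0, 1.0)  # longer queries score higher
    if any(k in query.lower() for k in ("why", "explain", "compare", "prove")):
        score += 0.3  # reasoning-style keywords bump the score
    return min(score, 1.0)


def route(query: str, threshold: float = 0.4) -> Backend:
    """Send hard queries to the stronger paid model, easy ones to the
    cheap self-hosted backend."""
    return GEMINI if estimate_complexity(query) >= threshold else SELF_HOSTED
```

A production router would replace the keyword heuristic with a small classifier and fold in live latency and cost telemetry per backend, but the routing decision itself stays this simple: score the query, compare against a threshold, pick the cheapest backend that clears the quality bar.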
### Hacker @ TreeHacks
Jan 2023 – Jan 2023 | Stanford, California, United States

- Built a prototype music platform enabling users to upload tracks securely on a blockchain database (Estuary) to ensure immutability and copyright protection.
- Designed a music similarity detection algorithm analyzing rhythm, melody, and harmony to identify plagiarism risks and discover related tracks.
- Developed a user dashboard and music player with features for file management, playback, and similarity insights.
- Focused on scalability, algorithm accuracy, and security to deliver a reliable and privacy-first music sharing experience.

### Software Engineer @ Socotra
Jan 2022 – Jan 2022 | San Francisco Bay Area

- DevOps & CI/CD: Migrated legacy monolith services to containerized Docker environments, reducing build-deploy times by 40%.
- Modernization: Refactored legacy codebases to TypeScript, significantly improving type safety and reducing runtime errors for production APIs.

### Machine Learning Engineer @ Media Computing Lab in Nankai University
Jan 2018 – Jan 2020 | Jinnan District, Tianjin, China

Designed and implemented optimizations using FlashAttention and speculative decoding to improve LLM serving performance and reduce latency.

## Education

### Master of Science - MS in Computer Science
University of California, Davis

### Bachelor of Science - BS in Applied Mathematics
Nankai University

### Exchange Student in Computer Science
University of Cambridge

## Contact & Social

- LinkedIn: https://linkedin.com/in/zaoyi-zheng

---

Source: https://flows.cv/zaoyi
JSON Resume: https://flows.cv/zaoyi/resume.json
Last updated: 2026-04-10