# 1. Inference Optimization (Speculative Decoding)
• Led end-to-end research and development of speculative decoding systems for production LLM inference
• Architected and implemented online draft-model training, updating draft models continuously during real-time serving
• Designed a hybrid speculator (model-based + model-free) with score-based dynamic routing between the two approaches
• Built and deployed multiple model-based and model-free speculators
• Trained 50+ draft models across Friendli Serverless and customer endpoints, optimized for diverse workload patterns
# 2. Inference Optimization (Kernel-Level)
• Developed high-performance kernels for core LLM operations and sampling
• Implemented specialized kernels for speculative decoding, improving end-to-end inference efficiency
# 3. Inference Runtime Development
• Contributed to core inference runtime systems, including memory management, scheduling, and API server
# 4. Distributed Training (PeriFlow)
• Led initial product development of PeriFlow, a distributed training platform for multi-cloud GPU environments
• Architected fault-tolerance and resource-management systems for reliable large-scale training
• Led the training and release of FAI-13B, shipped ahead of Meta’s Llama 2
# 5. Solutions Architecture Leadership (US)
• Led the US Solutions Architect team, supporting 100+ customer PoCs
• Managed strategic partnerships with cloud providers including AWS
# 6. Open Source & Ecosystem
• Contributed to major open-source ecosystems including LangChain and LlamaIndex