• Designed and deployed identity resolution microservices in Python/C++, unifying 50M+ Entra ID and HR records with versioned audit trails, self-healing workflows, and high availability.
• Refactored a monolithic system into a scalable microservices architecture, implementing contract testing and phased rollouts to improve system throughput by 45% without regressions.
• Built high-throughput data pipelines using Kafka and Spark Streaming, processing 200M+ events/day with secure archival to S3/Glacier and compliance with FINRA/SEC retention policies.
• Built and deployed LLM inference services using Triton Inference Server and vLLM, enabling scalable, production-grade model serving for enterprise workloads.
• Optimized large language model performance (GPT, BERT, LLaMA) through quantization and efficient inference strategies, reducing memory usage by 60% and lowering latency.
• Developed and optimized multi-modal inference pipelines with advanced decoding strategies (greedy, beam search, sampling), improving token generation performance by 20%.