🚀 AI/ML Engineer with 5+ years of experience designing, building, and optimizing Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) systems, and distributed MLOps pipelines across AWS, Azure, and GCP.
Experience
2025 — Now
San Francisco, California, United States
• Engineered next-generation LLM backend leveraging Llama 3.3-70B and speculative decoding, enhancing factual Q&A reliability and substantially increasing inference throughput across Cerebras clusters.
• Designed and implemented Mixture-of-Experts (MoE) routing and FP8 quantization pipelines, reducing inference latency and optimizing compute efficiency across large-scale distributed GPU instances.
• Architected Retrieval-Augmented Generation (RAG) stack with FAISS, BM25, and LangChain, boosting contextual precision by 20% and reducing hallucination rate by 18% on LM Arena benchmarks.
• Automated benchmark evaluation using MMLU, TruthfulQA, and GSM8K, improving F1 scores by 13% and cutting operational cost by 38% through optimized AWS SageMaker workloads.
• Deployed Sonar Pro across AWS EKS and Triton Inference Server, ensuring high availability and driving significant growth in API adoption across 1,200+ enterprise developers.
• Leveraged PyTorch, DeepSpeed, vLLM, and FlashAttention-3 for distributed fine-tuning, implementing LoRA, SFT, and RLHF pipelines to ensure efficient multi-node model deployment on AWS Cloud (see the LoRA sketch after this list).
• Built hybrid retrieval systems integrating FAISS, ElasticSearch, and LangChain, optimizing response latency and contextual recall for large-scale factual Q&A applications (hybrid retrieval sketch after this list).
• Utilized AWS Lambda, Step Functions, and EFS for orchestration of inference workflows, improving scalability and fault-tolerance across multi-cluster LLM serving pipelines.
• Monitored distributed inference with Prometheus, Datadog, and Weights & Biases, enabling automated alerting and 25% improvement in observability and cost-performance balance.
• Applied speculative decoding, MoE routing, RAG orchestration, FlashAttention, and self-verification frameworks to enhance model reasoning reliability and drive enterprise-grade AI adoption.
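A minimal LoRA sketch, assuming Hugging Face PEFT and Transformers; the base model name, rank, and target modules are illustrative placeholders, not the production configuration:

```python
# Hedged LoRA setup sketch with Hugging Face PEFT; model name, rank,
# and target modules are placeholder assumptions.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # assumed base model for the sketch
    torch_dtype=torch.bfloat16,
)
lora_cfg = LoraConfig(
    r=16,                                 # low-rank dimension (assumption)
    lora_alpha=32,                        # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights train
```

A minimal hybrid-retrieval sketch, assuming FAISS for dense search, rank_bm25 for sparse scoring, and sentence-transformers for embeddings; the corpus, embedding model, and reciprocal-rank-fusion constant are illustrative assumptions:

```python
# Dense (FAISS) + sparse (BM25) retrieval fused with reciprocal rank fusion.
import faiss
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Llama 3.3 supports long-context factual Q&A.",
    "FAISS performs efficient dense vector similarity search.",
    "BM25 ranks documents by lexical term overlap.",
]

# Dense index: embed documents, search by inner product on unit vectors.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
emb = np.asarray(encoder.encode(docs, normalize_embeddings=True), dtype="float32")
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

# Sparse index: whitespace-tokenized BM25 over the same corpus.
bm25 = BM25Okapi([d.lower().split() for d in docs])

def hybrid_search(query: str, k: int = 3, rrf_k: int = 60) -> list[str]:
    """Fuse dense and sparse rankings with reciprocal rank fusion."""
    q = np.asarray(encoder.encode([query], normalize_embeddings=True), dtype="float32")
    _, dense_ids = index.search(q, k)
    sparse_ids = np.argsort(bm25.get_scores(query.lower().split()))[::-1][:k]

    scores: dict[int, float] = {}
    for ranking in (dense_ids[0], sparse_ids):
        for rank, doc_id in enumerate(ranking):
            scores[int(doc_id)] = scores.get(int(doc_id), 0.0) + 1.0 / (rrf_k + rank + 1)
    return [docs[i] for i, _ in sorted(scores.items(), key=lambda x: -x[1])]

print(hybrid_search("vector similarity search"))
```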
2024 — 2024
San Francisco, California, United States
• Engineered large-scale training pipelines for LLaMA 3 (70B–130B) using PyTorch 2.2, DeepSpeed, and AWS Cloud, enabling higher throughput and reduced compute cost across 12K-GPU clusters.
• Optimized multimodal architecture by integrating text-vision encoders with FP8 quantization on AWS infrastructure, enhancing inference efficiency and improving cross-modal accuracy on internal benchmarks.
• Implemented RLHF + DPO fine-tuning pipelines with human feedback curation, enhancing alignment safety metrics by 27% and reducing hallucination rate in production responses by 33% (see the DPO sketch after this list).
• Collaborated on the open-source LLaMA 3.2 release, developing deployment scripts and evaluation suites on AWS, improving reproducibility and accelerating global community adoption beyond 1.7M downloads.
• Built distributed data-processing workflows on AWS EMR and S3 using Spark, Ray, PyArrow, and Hydra for trillion-token multilingual datasets, ensuring high reliability in continuous large-scale ingestion pipelines.
• Deployed optimized inference stack via Triton Inference Server, ONNX Runtime, and TorchServe, leveraging MTIA v2 and H100 GPUs for latency-aware serving across Meta AI Assistant endpoints (export-and-serve sketch after this list).
• Automated monitoring and performance analytics with Prometheus, Grafana, and Meta AI Dashboards, enabling predictive scaling and reducing infrastructure incidents by 22% in training clusters.
• Applied MLOps and DevOps practices using Kubernetes, Docker, and CI/CD (GitHub Actions), accelerating experimentation cycles by 35% and improving model deployment stability across environments.
• Utilized an advanced AI stack (Transformers, PyTorch Lightning, LangChain, Weights & Biases, TensorRT, FAISS, and LLMOps frameworks), driving reproducible, scalable, and explainable model delivery.
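A minimal DPO fine-tuning sketch, assuming the TRL library (constructor arguments vary across trl versions; this follows trl ≥ 0.12); the model, preference pairs, and beta are illustrative placeholders:

```python
# Hedged DPO sketch with TRL; model, dataset, and beta are toy values.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.2-1B"  # assumed small model for the sketch
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data: each row pairs a prompt with a preferred and a rejected reply.
pairs = Dataset.from_dict({
    "prompt": ["What is the capital of France?"],
    "chosen": ["Paris is the capital of France."],
    "rejected": ["France does not have a capital."],
})

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", beta=0.1, per_device_train_batch_size=1),
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()
```

A small export-and-serve sketch using torch.onnx and ONNX Runtime; the toy module and file name are placeholders standing in for the real model graph and its input signature:

```python
# Export a PyTorch module to ONNX, then run it under ONNX Runtime.
import numpy as np
import onnxruntime as ort
import torch

class TinyClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4)
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()
dummy = torch.randn(1, 16)
torch.onnx.export(
    model, dummy, "tiny.onnx",
    input_names=["x"], output_names=["logits"],
    dynamic_axes={"x": {0: "batch"}},  # allow variable batch size
)

# CPUExecutionProvider keeps the sketch runnable anywhere; swap in
# CUDAExecutionProvider (or a TensorRT EP) on GPU hosts.
sess = ort.InferenceSession("tiny.onnx", providers=["CPUExecutionProvider"])
logits = sess.run(["logits"], {"x": np.random.randn(8, 16).astype(np.float32)})[0]
print(logits.shape)  # (8, 4)
```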
2019 — 2023
India
• Built scalable knowledge graph embeddings on AWS Cloud using TensorFlow and Keras, assisting research teams in optimizing TransE and ComplEx models, boosting link prediction accuracy by 32% (TransE scoring sketch after this list).
• Developed automated ML pipelines for training, evaluation, and deployment using MLflow and Docker, supporting faster experimentation and improving reproducibility by 45% across distributed cloud environments.
• Contributed APIs and integration modules to AmpliGraph’s open-source release on AWS, assisting enterprise teams in seamless adoption for large-scale graph analytics and AI-driven insights.
• Tuned GPU-based distributed training setups and analyzed performance metrics to enhance convergence rates, supporting cost-efficient optimization that reduced cloud compute expenses by 18%.
• Applied advanced Graph ML and embedding techniques using Python, TensorFlow, PyTorch, and Keras, assisting data scientists in extracting relational patterns from structured and semi-structured datasets.
• Designed and implemented MLOps pipelines leveraging Docker, Airflow, and MLflow, supporting CI/CD automation, reproducible experiments, and robust retraining workflows in production (MLflow tracking sketch after this list).
• Integrated AWS SageMaker, Azure ML, and Vertex AI for model lifecycle management, analyzing Generative AI and LLM capabilities to support knowledge graph reasoning and infrastructure scalability.
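A plain-PyTorch TransE scoring sketch (the original work used AmpliGraph on TensorFlow; PyTorch is swapped in here for brevity); entity counts, dimensions, and the margin are toy values:

```python
# TransE: a triple (h, r, t) is plausible when h + r is close to t.
import torch
import torch.nn.functional as F

n_entities, n_relations, dim = 1000, 50, 64  # toy sizes
ent = torch.nn.Embedding(n_entities, dim)
rel = torch.nn.Embedding(n_relations, dim)

def transe_score(h, r, t):
    """TransE plausibility: -||h + r - t||_2, higher is more plausible."""
    return -torch.norm(ent(h) + rel(r) - ent(t), p=2, dim=-1)

# Margin ranking loss: a true triple should outscore a corrupted one.
h, r, t = torch.tensor([1]), torch.tensor([3]), torch.tensor([7])
t_neg = torch.randint(0, n_entities, (1,))  # corrupted tail (negative sample)
loss = F.relu(1.0 - transe_score(h, r, t) + transe_score(h, r, t_neg)).mean()
loss.backward()
```

A minimal MLflow tracking sketch; the experiment name, stand-in scikit-learn model, and logged values are placeholders for the embedding-training runs described above:

```python
# Log params, metrics, and a versioned model artifact to MLflow.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

mlflow.set_experiment("kg-embeddings")  # assumed experiment name
with mlflow.start_run():
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    mlflow.log_param("n_estimators", 50)
    mlflow.log_metric("train_accuracy", clf.score(X, y))
    mlflow.sklearn.log_model(clf, artifact_path="model")  # versioned artifact
```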
Education
University of North Texas