# Chetan Anand

AI/ML Engineer | LLMs, RAG, and MLOps | Scaling Generative AI with PyTorch, DeepSpeed, Kubernetes, and AWS | Building Production-Grade AI Systems

Location: United States
Profile: https://flows.cv/chetananand

- 🚀 AI/ML Engineer with 5+ years of experience designing, building, and optimizing Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) systems, and distributed MLOps pipelines across AWS, Azure, and GCP.
- ⚙️ Proficient in PyTorch, DeepSpeed, TensorFlow, vLLM, and Transformers, delivering scalable, high-performance AI solutions for enterprise-grade deployments.
- 🧠 Specialized in LLM fine-tuning, LoRA, RLHF, DPO, MoE routing, speculative decoding, and FP8 quantization to optimize model efficiency and inference speed.
- ☁️ Experienced in cloud-native AI architecture, deploying Triton Inference Server, SageMaker, and EKS clusters to support high-availability, low-latency model-serving pipelines.
- 🔍 Architected RAG frameworks using FAISS, BM25, and LangChain, improving contextual accuracy, retrieval precision, and factual grounding in production LLMs.
- 📊 Implemented end-to-end MLOps automation with MLflow, Weights & Biases, Airflow, and CI/CD (GitHub Actions), enabling reproducible training and continuous model delivery.
- 🧩 Skilled in vector databases, semantic search, and knowledge graph embeddings using FAISS, Pinecone, and Elasticsearch to enhance retrieval and reasoning systems.
- 🔬 Proven success in distributed AI training pipelines on AWS and Meta-scale clusters, leveraging Ray, Hydra, Spark, and DeepSpeed for trillion-token dataset processing.
- 📈 Focused on scalable AI infrastructure, performance benchmarking, and cost-efficient compute optimization for large-scale LLM workloads.
- 💡 Passionate about advancing LLMOps, Generative AI, and AI infrastructure engineering, aligning research innovation with real-world enterprise adoption.
## Work Experience

### Backend Software Engineer @ Perplexity
Jan 2025 – Present | San Francisco, California, United States

- Engineered a next-generation LLM backend leveraging Llama 3.3-70B and speculative decoding, improving factual Q&A reliability and scaling inference throughput across Cerebras clusters.
- Designed and implemented Mixture-of-Experts (MoE) routing and FP8 quantization pipelines, reducing inference latency and optimizing compute efficiency across large-scale distributed GPU instances.
- Architected a Retrieval-Augmented Generation (RAG) stack with FAISS, BM25, and LangChain, boosting contextual precision by 20% and reducing hallucination rate by 18% on LM Arena benchmarks.
- Automated benchmark evaluation using MMLU, TruthfulQA, and GSM8K, improving F1 scores by 13% and cutting operational cost by 38% through optimized AWS SageMaker workloads.
- Deployed Sonar Pro across AWS EKS and Triton Inference Server, ensuring high availability and driving significant growth in API adoption across 1,200+ enterprise developers.
- Leveraged PyTorch, DeepSpeed, vLLM, and FlashAttention-3 for distributed fine-tuning, implementing LoRA, SFT, and RLHF pipelines for efficient multi-node model deployment on AWS.
- Built hybrid retrieval systems integrating FAISS, Elasticsearch, and LangChain, optimizing response latency and contextual recall for large-scale factual Q&A applications.
- Utilized AWS Lambda, Step Functions, and EFS to orchestrate inference workflows, improving scalability and fault tolerance across multi-cluster LLM serving pipelines.
- Monitored distributed inference with Prometheus, Datadog, and Weights & Biases, enabling automated alerting and a 25% improvement in observability and cost-performance balance.
- Applied speculative decoding, MoE routing, RAG orchestration, FlashAttention, and self-verification frameworks to enhance model reasoning reliability and drive enterprise-grade AI adoption.
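The hybrid retrieval work above combines a sparse ranking (BM25) with a dense one (FAISS) for the same query. A minimal sketch of one common way to merge such rankings, Reciprocal Rank Fusion, in pure Python (the document IDs, rankings, and `k` constant are illustrative assumptions, not the production stack):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked document-ID lists from
    multiple retrievers (e.g. BM25 and a FAISS dense index) into one
    hybrid ranking. Each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["doc3", "doc1", "doc7"]    # hypothetical sparse ranking
dense_hits = ["doc1", "doc9", "doc3"]   # hypothetical dense ranking
print(rrf_fuse([bm25_hits, dense_hits]))  # → ['doc1', 'doc3', 'doc9', 'doc7']
```

Documents that appear near the top of both lists (here `doc1` and `doc3`) accumulate score from each retriever and dominate the fused ranking, which is why this kind of fusion tends to improve recall over either retriever alone.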
### Software Engineer - Machine Learning @ Meta
Jan 2024 – Jan 2024 | San Francisco, California, United States

- Engineered large-scale training pipelines for LLaMA 3 (70B–130B) using PyTorch 2.2, DeepSpeed, and AWS Cloud, enabling faster throughput and reduced compute cost across 12K-GPU clusters.
- Optimized a multimodal architecture by integrating text-vision encoders with FP8 quantization on AWS infrastructure, enhancing inference efficiency and improving cross-modal accuracy on internal benchmarks.
- Implemented RLHF + DPO fine-tuning pipelines with human-feedback curation, enhancing alignment safety metrics by 27% and reducing hallucination rate in production responses by 33%.
- Collaborated on the open-source LLaMA 3.2 release, developing deployment scripts and evaluation suites on AWS, improving reproducibility and accelerating global community adoption beyond 1.7M downloads.
- Built distributed data-processing workflows on AWS EMR and S3 using Spark, Ray, PyArrow, and Hydra for trillion-token multilingual datasets, ensuring high reliability in continuous large-scale ingestion pipelines.
- Deployed an optimized inference stack via Triton Inference Server, ONNX Runtime, and TorchServe, leveraging MTIA v2 and H100 GPUs for latency-aware serving across Meta AI Assistant endpoints.
- Automated monitoring and performance analytics with Prometheus, Grafana, and Meta AI dashboards, enabling predictive scaling and reducing infrastructure incidents by 22% in training clusters.
- Applied MLOps and DevOps practices using Kubernetes, Docker, and CI/CD (GitHub Actions), accelerating experimentation cycles by 35% and improving model deployment stability across environments.
- Utilized an advanced AI stack (Transformers, PyTorch Lightning, LangChain, Weights & Biases, TensorRT, FAISS, and LLMOps frameworks), driving reproducible, scalable, and explainable model delivery.
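The RLHF + DPO fine-tuning described above optimizes a preference loss over (chosen, rejected) response pairs. A minimal sketch of the standard per-pair DPO loss in pure Python (the log-probability values and `beta` are illustrative assumptions; in a real pipeline they come from the policy and a frozen reference model):

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    pi_logp_w / pi_logp_l: policy log-probs of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Illustrative numbers: the policy prefers the chosen response more strongly
# than the reference does, so the loss drops below log(2) ≈ 0.693.
loss = dpo_loss(pi_logp_w=-1.0, pi_logp_l=-2.0, ref_logp_w=-1.5, ref_logp_l=-1.5)
print(round(loss, 4))  # → 0.6444
```

When policy and reference agree exactly, the margin is zero and the loss sits at log(2); training pushes the policy's chosen-over-rejected margin above the reference's, driving the loss down.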
### Software Engineer @ Accenture
Jan 2019 – Jan 2023 | India

- Built scalable knowledge graph embeddings on AWS Cloud using TensorFlow and Keras, assisting research teams in optimizing TransE and ComplEx models and boosting link prediction accuracy by 32%.
- Developed automated ML pipelines for training, evaluation, and deployment using MLflow and Docker, supporting faster experimentation and improving reproducibility by 45% across distributed cloud environments.
- Contributed APIs and integration modules to AmpliGraph's open-source release on AWS, assisting enterprise teams in seamless adoption for large-scale graph analytics and AI-driven insights.
- Tuned GPU-based distributed training setups and analyzed performance metrics to enhance convergence rates, supporting cost-efficient optimization that reduced cloud compute expenses by 18%.
- Applied advanced graph ML and embedding techniques using Python, TensorFlow, PyTorch, and Keras, assisting data scientists in extracting relational patterns from structured and semi-structured datasets.
- Designed and implemented MLOps pipelines leveraging Docker, Airflow, and MLflow, supporting CI/CD automation, reproducible experiments, and robust retraining workflows in production.
- Integrated AWS SageMaker, Azure ML, and Vertex AI for model lifecycle management, analyzing Generative AI and LLM capabilities to support knowledge graph reasoning and infrastructure scalability.

## Education

### Master's Degree in Advanced Data Analytics
University of North Texas

## Contact & Social

- LinkedIn: https://linkedin.com/in/chetananand9

---

Source: https://flows.cv/chetananand
JSON Resume: https://flows.cv/chetananand/resume.json
Last updated: 2026-04-11