# Chetan Anand

AI/ML Engineer | LLMs, RAG, and MLOps | Scaling Generative AI with PyTorch, DeepSpeed, Kubernetes, and AWS | Building Production-Grade AI Systems

Location: United States
Profile: https://flows.cv/chetananand

- 🚀 AI/ML Engineer with 5+ years of experience designing, building, and optimizing Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) systems, and distributed MLOps pipelines across AWS, Azure, and GCP.
- ⚙️ Proficient in PyTorch, DeepSpeed, TensorFlow, vLLM, and Transformers, delivering scalable, high-performance AI solutions for enterprise-grade deployments.
- 🧠 Specialized in LLM fine-tuning, LoRA, RLHF, DPO, MoE routing, speculative decoding, and FP8 quantization to optimize model efficiency and inference speed.
- ☁️ Experienced in cloud-native AI architecture, deploying Triton Inference Server, SageMaker, and EKS clusters to support high-availability, low-latency model-serving pipelines.
- 🔍 Architected RAG frameworks using FAISS, BM25, and LangChain, improving contextual accuracy, retrieval precision, and factual grounding in production LLMs.
- 📊 Implemented end-to-end MLOps automation with MLflow, Weights & Biases, Airflow, and CI/CD (GitHub Actions), enabling reproducible training and continuous model delivery.
- 🧩 Skilled in vector databases, semantic search, and knowledge graph embeddings using FAISS, Pinecone, and Elasticsearch to enhance retrieval and reasoning systems.
- 🔬 Proven success in distributed AI training pipelines on AWS and Meta-scale clusters, leveraging Ray, Hydra, Spark, and DeepSpeed for trillion-token dataset processing.
- 📈 Focused on scalable AI infrastructure, performance benchmarking, and cost-efficient compute optimization for large-scale LLM workloads.
- 💡 Passionate about advancing LLMOps, Generative AI, and AI infrastructure engineering, aligning research innovation with real-world enterprise adoption.
## Work Experience

### Backend Software Engineer @ Perplexity
Jan 2025 – Present | San Francisco, California, United States

- Engineered a next-generation LLM backend leveraging Llama 3.3-70B and speculative decoding, improving factual Q&A reliability and scaling inference throughput across Cerebras clusters.
- Designed and implemented Mixture-of-Experts (MoE) routing and FP8 quantization pipelines, reducing inference latency and optimizing compute efficiency across large-scale distributed GPU instances.
- Architected a Retrieval-Augmented Generation (RAG) stack with FAISS, BM25, and LangChain, boosting contextual precision by 20% and reducing hallucination rate by 18% on LM Arena benchmarks.
- Automated benchmark evaluation using MMLU, TruthfulQA, and GSM8K, improving F1 scores by 13% and cutting operational cost by 38% through optimized AWS SageMaker workloads.
- Deployed Sonar Pro across AWS EKS and Triton Inference Server, ensuring high availability and driving significant growth in API adoption across 1,200+ enterprise developers.
- Leveraged PyTorch, DeepSpeed, vLLM, and FlashAttention-3 for distributed fine-tuning, implementing LoRA, SFT, and RLHF pipelines for efficient multi-node model deployment on AWS.
- Built hybrid retrieval systems integrating FAISS, Elasticsearch, and LangChain, optimizing response latency and contextual recall for large-scale factual Q&A applications.
- Utilized AWS Lambda, Step Functions, and EFS to orchestrate inference workflows, improving scalability and fault tolerance across multi-cluster LLM serving pipelines.
- Monitored distributed inference with Prometheus, Datadog, and Weights & Biases, enabling automated alerting and a 25% improvement in observability and cost-performance balance.
- Applied speculative decoding, MoE routing, RAG orchestration, FlashAttention, and self-verification frameworks to enhance model reasoning reliability and drive enterprise-grade AI adoption.
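The hybrid retrieval work above combines a sparse ranking (BM25) with a dense one (FAISS) for the same query. A minimal sketch of one common way to merge such rankings, Reciprocal Rank Fusion, in pure Python (the document IDs, rankings, and `k` constant are illustrative assumptions, not the production stack):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked document-ID lists from
    multiple retrievers (e.g. BM25 and a FAISS dense index) into one
    hybrid ranking. Each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["doc3", "doc1", "doc7"]    # hypothetical sparse ranking
dense_hits = ["doc1", "doc9", "doc3"]   # hypothetical dense ranking
print(rrf_fuse([bm25_hits, dense_hits]))  # → ['doc1', 'doc3', 'doc9', 'doc7']
```

Documents that appear near the top of both lists (here `doc1` and `doc3`) accumulate score from each retriever and dominate the fused ranking, which is why this kind of fusion tends to improve recall over either retriever alone.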
### Software Engineer - Machine Learning @ Meta
Jan 2024 – Jan 2024 | San Francisco, California, United States

- Engineered large-scale training pipelines for LLaMA 3 (70B–130B) using PyTorch 2.2, DeepSpeed, and AWS Cloud, enabling faster throughput and reduced compute cost across 12K-GPU clusters.
- Optimized a multimodal architecture by integrating text-vision encoders with FP8 quantization on AWS infrastructure, enhancing inference efficiency and improving cross-modal accuracy on internal benchmarks.
- Implemented RLHF + DPO fine-tuning pipelines with human-feedback curation, enhancing alignment safety metrics by 27% and reducing hallucination rate in production responses by 33%.
- Collaborated on the open-source LLaMA 3.2 release, developing deployment scripts and evaluation suites on AWS, improving reproducibility and accelerating global community adoption beyond 1.7M downloads.
- Built distributed data-processing workflows on AWS EMR and S3 using Spark, Ray, PyArrow, and Hydra for trillion-token multilingual datasets, ensuring high reliability in continuous large-scale ingestion pipelines.
- Deployed an optimized inference stack via Triton Inference Server, ONNX Runtime, and TorchServe, leveraging MTIA v2 and H100 GPUs for latency-aware serving across Meta AI Assistant endpoints.
- Automated monitoring and performance analytics with Prometheus, Grafana, and Meta AI dashboards, enabling predictive scaling and reducing infrastructure incidents by 22% in training clusters.
- Applied MLOps and DevOps practices using Kubernetes, Docker, and CI/CD (GitHub Actions), accelerating experimentation cycles by 35% and improving model deployment stability across environments.
- Utilized an advanced AI stack (Transformers, PyTorch Lightning, LangChain, Weights & Biases, TensorRT, FAISS, and LLMOps frameworks), driving reproducible, scalable, and explainable model delivery.
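The RLHF + DPO fine-tuning described above optimizes a preference loss over (chosen, rejected) response pairs. A minimal sketch of the standard per-pair DPO loss in pure Python (the log-probability values and `beta` are illustrative assumptions; in a real pipeline they come from the policy and a frozen reference model):

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    pi_logp_w / pi_logp_l: policy log-probs of the chosen / rejected response.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    """
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Illustrative numbers: the policy prefers the chosen response more strongly
# than the reference does, so the loss drops below log(2) ≈ 0.693.
loss = dpo_loss(pi_logp_w=-1.0, pi_logp_l=-2.0, ref_logp_w=-1.5, ref_logp_l=-1.5)
print(round(loss, 4))  # → 0.6444
```

When policy and reference agree exactly, the margin is zero and the loss sits at log(2); training pushes the policy's chosen-over-rejected margin above the reference's, driving the loss down.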
### Software Engineer @ Accenture
Jan 2019 – Jan 2023 | India

- Built scalable knowledge graph embeddings on AWS Cloud using TensorFlow and Keras, assisting research teams in optimizing TransE and ComplEx models and boosting link prediction accuracy by 32%.
- Developed automated ML pipelines for training, evaluation, and deployment using MLflow and Docker, supporting faster experimentation and improving reproducibility by 45% across distributed cloud environments.
- Contributed APIs and integration modules to AmpliGraph's open-source release on AWS, assisting enterprise teams in seamless adoption for large-scale graph analytics and AI-driven insights.
- Tuned GPU-based distributed training setups and analyzed performance metrics to enhance convergence rates, supporting cost-efficient optimization that reduced cloud compute expenses by 18%.
- Applied advanced graph ML and embedding techniques using Python, TensorFlow, PyTorch, and Keras, assisting data scientists in extracting relational patterns from structured and semi-structured datasets.
- Designed and implemented MLOps pipelines leveraging Docker, Airflow, and MLflow, supporting CI/CD automation, reproducible experiments, and robust retraining workflows in production.
- Integrated AWS SageMaker, Azure ML, and Vertex AI for model lifecycle management, analyzing Generative AI and LLM capabilities to support knowledge graph reasoning and infrastructure scalability.

## Education

### Master's Degree in Advanced Data Analytics
University of North Texas

## Contact & Social

- LinkedIn: https://linkedin.com/in/chetananand9

---

Source: https://flows.cv/chetananand
JSON Resume: https://flows.cv/chetananand/resume.json
Last updated: 2026-04-11