# Rajeev P.

> Principal Engineer, AI @ AMD

Location: San Francisco Bay Area, United States
Profile: https://flows.cv/rajeevp

I architect GenAI inference systems at AMD, enabling state-of-the-art LLM, VLM, and Stable Diffusion model optimization and efficient inference on NPUs and GPUs. My work spans architecture, software, kernels, and runtime, building the infrastructure and frameworks that enable AMD to deploy next-generation AI models at scale. I lead cross-disciplinary teams, collaborate with top-tier AI companies (Meta, Google, emerging startups), and partner with AMD executives to define the roadmap for GenAI inference innovation.

Key contributions:

• Architected AMD’s LLM inference backend for NPUs/iGPUs, now foundational for deploying LLMs, Stable Diffusion, and VLMs efficiently across PyTorch and ONNX frameworks.
• Pioneered low-bit inference innovations (3- and 4-bit operators, multi-LoRA, LoRA-based fine-tuning), redefining the performance/efficiency frontier.
• Filed 20+ patents in AI inference, operator design, and system-level optimizations.
• Published and submitted research papers on generative AI inference, bridging industry and academic systems research.
• Mentored and led cross-org innovation teams, ensuring solutions scale beyond individual projects.

My mission: build systems, frameworks, and reproducible artifacts that shape AMD’s position as the leader in generative AI inference, from kernel-level operators to industry-scale deployment.

Full research record: https://scholar.google.com/citations?hl=en&user=6BI01aMAAAAJ

## Work Experience

### Principal Software Engineer - LLMs, GenAI @ AMD
Jan 2023 – Present

• Principal Architect for LLM inference on AMD NPUs (Ryzen AI), defining the end-to-end system architecture enabling production deployment of state-of-the-art LLMs on client AI platforms.
• Technical owner of the Ryzen AI LLM inference stack, spanning operator/runtime frameworks, performance optimization, the PyTorch backend, ONNX Runtime enablement, and Day-0 model deployment.
• Pioneered advanced inference techniques (3/4-bit operators, multi-LoRA, speculative decoding, inference forecasting), delivering industry-leading gains in efficiency, throughput, and latency while preserving model fidelity.
• Primary technical interface to AMD executive leadership (CVP/EVP) and external AI labs, driving adoption of LLMs, agents, LoRA-fine-tuned models, and VLMs across Ryzen AI.
• Recognized by CEO Lisa Su (2024) for delivering the first LLM on AMD Ryzen at Computex 2024; work later presented by the CEO at Computex 2025.
• Inventor on 20+ patents and published researcher in ML systems and inference optimization; built and led applied AI teams, mentoring PhDs, hiring interns, and scaling research into production systems.

### Machine Learning Architect @ Rivian
Jan 2022 – Jan 2023

• Architected silicon-aware ML inference optimization pipelines for autonomous driving, spanning fine-tuning, quantization, and model transformation to enable efficient execution on power- and compute-constrained hardware.
• Drove hardware–software co-design for perception and fusion ML models, aligning compute constraints for real-time inference.
• Invented and deployed a novel low-power inference optimization technique for self-driving ML models; patent filed, demonstrating early leadership in hardware-aware ML optimization.
• Influenced early processor architecture (Rivian RAP) by translating ML workload characteristics into system-level compute requirements, shaping silicon design decisions upstream.

### Senior Member of Technical Staff @ AMD
Jan 2021 – Jan 2022

* AMD acquired Xilinx in Feb 2021.
* Promoted to Senior Staff in 2022.
* Architected and implemented FP32 Super-Resolution CNN and Perceptron accelerators on AI-Engines and FPGA fabric, including configurable compute kernels and programmable non-linear activation engines, enabling efficient mapping of large NLP models (BERT, Transformers) to custom hardware.
* Developed end-to-end ML model deployment tools for heterogeneous CPU+FPGA+AI-Engine systems, including automatic kernel generation, performance analysis, and codegen pipelines; presented innovations at TVMCon 2021.
* Analyzed and optimized state-of-the-art ML models (DLRM, Transformer, Transformer-Transducer, Depth Estimation, SRCNN); architected and developed operators and software for custom accelerators.

### Staff Design Engineer @ Xilinx
Jan 2016 – Jan 2021 | San Jose, California

* Architected and delivered high-performance heterogeneous accelerators for 5G beamformer and HPC applications, including a 16-antenna Massive MIMO beamformer (1 GBps) and an FP32 N-body solver achieving 2 TFLOPS on a single ACAP. Demonstrated at Xilinx Developers Forum; solutions deployed in the field and released on GitHub for the HPC community. (Patent granted for on-chip memory access optimization.)
* Pioneered sparse neural network inference techniques, including structured sparse data compression/decompression for block-sparse ResNet-50 on ACAP, enabling competitive performance and efficient FPGA utilization. (Patent pending.)
* Developed FPGA prototyping systems and automated software tooling for MAC/FEC IP validation, including multi-SLR FPGA integration, C++ runtime software, and Python-based codegen and verification tools, reducing design and analysis time 3x; adopted across multiple hard-IP teams.
* Led micro-architecture design, system integration, and runtime software development for AI-Engine and FPGA IPs; mentored junior engineers and interns, establishing design and verification best practices.
* Published and presented at ISSCC 2020: co-authored “A Versatile 7nm Adaptive Compute Acceleration Platform Processor.”

Skills & Tools: FPGA/ACAP Architecture, AI Engine, Python, C++, Verilog, HLS, Vivado, Vitis, RTL Linting, CDC, codegen pipelines, system integration, sparse NN inference

### Senior Member of Technical Staff @ Inphi Corporation
Jan 2010 – Jan 2016 | Westlake Village, California

Rambus acquired Inphi’s memory division. Promoted to Senior Staff Engineer; led software/validation responsibilities.

### Staff Engineer @ Inphi Corporation
Jan 2010 – Jan 2016 | Westlake Village, CA

• Operated in a fast-paced start-up, scaling from pre-IPO execution to full-scale production while delivering critical memory subsystem platforms.
• Architected and delivered memory interface training software and systems for enterprise servers using Load-Reduced DIMM SoCs, enabling reliable operation of up to 784 GB of DDR memory at scale.
• Led end-to-end architecture, validation, and deployment of software platforms for characterization and functional validation of high-speed memory buffer ICs, accelerating silicon bring-up and product readiness.
• Designed and owned Zynq MPSoC-based FPGA validation platforms for DDR4/DDR5 subsystems, spanning RTL design, embedded software, and customer-facing tooling.
• Built high-performance signal-processing and parallel compute frameworks for time/frequency-domain analysis, anomaly detection, and hardware-in-the-loop memory training, reducing field bring-up time and failure rates by ~4× while mentoring and scaling engineering talent.

### Graduate Teaching Assistant @ Dept. of ECE, North Carolina State University
Jan 2009 – Jan 2009 | Raleigh, NC

Teaching assistant for "Fundamentals of Logic Design" for sophomore students at the Department of Electrical and Computer Engineering, North Carolina State University.
### Independent Research @ Centre for Efficient, Scalable & Reliable Computing, North Carolina State University
Jan 2009 – Jan 2009 | Raleigh, NC

1. Investigated retention-time-aware allocation in DRAMs using software profiling. Developed scheduling algorithms in C for optimal memory allocation.
2. Designed and implemented an FPGA-based SDRAM memory controller to prototype the hypothesis.

### Intern @ Infineon Technologies
Jan 2007 – Jan 2008 | Warstein, Germany

• Architected and built systems for analysis of Insulated Gate Bipolar Transistors (IGBTs) under special operating conditions.

### Intern @ ECIL
Jan 2006 – Jan 2006 | Hyderabad, India

• Designed PIC microcontroller assembly code for position-measurement sensor systems.

## Education

### Graduate Coursework (SCPD) in Artificial Intelligence
Stanford University

### Master of Science in Computer Engineering
North Carolina State University

### Bachelor of Technology in Electronics & Communication Engineering (Honors)
SASTRA University

## Contact & Social

- LinkedIn: https://linkedin.com/in/rajpatwari

---

Source: https://flows.cv/rajeevp
JSON Resume: https://flows.cv/rajeevp/resume.json
Last updated: 2026-04-12