I architect GenAI inference systems at AMD, enabling optimization and efficient inference of state-of-the-art LLMs, VLMs, and Stable Diffusion models on NPUs and GPUs.
Experience
2023 — Now
• Principal Architect for LLM inference on AMD NPUs (Ryzen AI), defining the end-to-end system architecture enabling production deployment of state-of-the-art LLMs on client AI platforms.
• Technical owner of the Ryzen AI LLM inference stack, spanning operator/runtime frameworks, performance optimization, PyTorch backend, ONNX Runtime enablement, and Day-0 model deployment.
• Pioneered advanced inference techniques (3/4-bit operators, multi-LoRA, speculative decoding, inference forecasting), delivering industry-leading efficiency, throughput, and latency while preserving model fidelity.
• Primary technical interface to AMD executive leadership (CVP/EVP) and external AI labs, driving adoption of LLMs, agents, LoRA-fine-tuned models, and VLMs across Ryzen AI.
• Recognized by CEO Lisa Su (2024) for delivering the first LLM on AMD Ryzen at Computex 2024; work later presented by the CEO at Computex 2025.
• Inventor on 20+ patents and published researcher in ML systems and inference optimization; built and led applied AI teams, mentoring PhDs, hiring interns, and scaling research into production systems.
2022 — 2023
• Architected silicon-aware ML inference optimization pipelines for autonomous driving, spanning fine-tuning, quantization, and model transformation to enable efficient execution on power- and compute-constrained hardware.
• Drove hardware–software co-design for perception and fusion ML models, aligning model design with compute constraints for real-time inference.
• Invented and deployed a novel low-power inference optimization technique for self-driving ML models (patent filed), demonstrating early leadership in hardware-aware ML optimization.
• Influenced early processor architecture (Rivian RAP) by translating ML workload characteristics into system-level compute requirements, shaping silicon design decisions upstream.
2021 — 2022
* AMD acquires Xilinx in February 2022.
* Promoted to Senior Staff in 2022.
* Architected and implemented FP32 Super-Resolution CNN and Perceptron accelerators on AI-Engines and FPGA fabric, including configurable compute kernels and programmable non-linear activation engines, enabling efficient mapping of large NLP models (BERT, Transformers) to custom hardware.
* Developed end-to-end ML model deployment tools for heterogeneous CPU+FPGA+AI-Engine systems, including automatic kernel generation, performance analysis, and codegen pipelines; presented innovations at TVMCon 2021.
* Analyzed and optimized state-of-the-art ML models (DLRM, Transformer, Transformer-Transducer, Depth Estimation, SRCNN); architected and developed operators and software for custom accelerators.
2016 — 2021
San Jose, California
* Architected and delivered high-performance heterogeneous accelerators for 5G beamformer and HPC applications, including a 16-antenna Massive MIMO beamformer (1 GBps) and an FP32 N-Body solver achieving 2 TFLOPS on a single ACAP. Demonstrated at Xilinx Developers Forum; solutions deployed in the field and released on GitHub for the HPC community. (Patent granted for on-chip memory access optimization)
* Pioneered sparse neural network inference techniques, including structured sparse data compression/decompression for block-sparse ResNet-50 on ACAP, enabling competitive performance and efficient FPGA utilization. (Patent pending)
* Developed FPGA prototyping systems and automated software tooling for MAC/FEC IP validation, including multi-SLR FPGA integration, C++ runtime software, and Python-based codegen and verification tools—reducing design and analysis time by 3x; adopted across multiple hard-IP teams.
* Led micro-architecture design, system integration, and runtime software development for AI-Engine and FPGA IPs; mentored junior engineers and interns, establishing design and verification best practices.
* Published and presented at ISSCC 2020: co-authored “A Versatile 7nm Adaptive Compute Acceleration Platform Processor.”
Skills & Tools: FPGA/ACAP Architecture, AI Engine, Python, C++, Verilog, HLS, Vivado, Vitis, RTL Linting, CDC, codegen pipelines, system integration, sparse NN inference
2010 — 2016
Westlake Village, California
Rambus acquires Inphi’s memory division.
Promoted to Senior Staff Engineer; led software and validation responsibilities.
Education
Stanford University
Graduate Coursework (SCPD)
North Carolina State University
Master of Science
SASTRA UNIVERSITY
Bachelor of Technology