I architect GenAI inference systems at AMD, enabling optimization and efficient inference of state-of-the-art LLMs, VLMs, and Stable Diffusion models on NPUs and GPUs.
Experience
2023 — Now
• Principal Architect for LLM inference on AMD NPUs (Ryzen AI), defining the end-to-end system architecture enabling production deployment of state-of-the-art LLMs on client AI platforms.
• Technical owner of the Ryzen AI LLM inference stack, spanning operator/runtime frameworks, performance optimization, PyTorch backend, ONNX Runtime enablement, and Day-0 model deployment.
• Pioneered advanced inference techniques (3/4-bit operators, multi-LoRA, speculative decoding, inference forecasting), delivering industry-leading efficiency, throughput, and latency while preserving model fidelity.
• Primary technical interface to AMD executive leadership (CVP/EVP) and external AI labs, driving adoption of LLMs, agents, LoRA-fine-tuned models, and VLMs across Ryzen AI.
• Recognized by CEO Lisa Su (2024) for delivering the first LLM on AMD Ryzen at Computex 2024; work later presented by the CEO at Computex 2025.
• Inventor on 20+ patents and published researcher in ML systems and inference optimization; built and led applied AI teams, mentoring PhDs, hiring interns, and scaling research into production systems.
2022 — 2023
• Architected silicon-aware ML inference optimization pipelines for autonomous driving, spanning fine-tuning, quantization, and model transformation to enable efficient execution on power- and compute-constrained hardware.
• Drove hardware–software co-design for perception and fusion ML models, aligning model design with compute constraints for real-time inference.
• Invented and deployed a novel low-power inference optimization technique for self-driving ML models (patent filed), demonstrating early leadership in hardware-aware ML optimization.
• Influenced early processor architecture (Rivian RAP) by translating ML workload characteristics into system-level compute requirements, shaping silicon design decisions upstream.
2021 — 2022
* AMD acquires Xilinx in February 2022.
* Promoted to Senior Staff in 2022.
* Architected and implemented FP32 Super-Resolution CNN and Perceptron accelerators on AI-Engines and FPGA fabric, including configurable compute kernels and programmable non-linear activation engines, enabling efficient mapping of large NLP models (BERT, Transformers) to custom hardware.
* Developed end-to-end ML model deployment tools for heterogeneous CPU+FPGA+AI-Engine systems, including automatic kernel generation, performance analysis, and codegen pipelines; presented innovations at TVMCon 2021.
* Analyzed and optimized state-of-the-art ML models (DLRM, Transformer, Transformer-Transducer, Depth Estimation, SRCNN); architected and developed operators and software for custom accelerators.
2016 — 2021
San Jose, California
* Architected and delivered high-performance heterogeneous accelerators for 5G beamformer and HPC applications, including a 16-antenna Massive MIMO beamformer (1 GBps) and an FP32 N-Body solver achieving 2 TFLOPS on a single ACAP. Demonstrated at Xilinx Developers Forum; solutions deployed in the field and released on GitHub for the HPC community. (Patent granted for on-chip memory access optimization)
* Pioneered sparse neural network inference techniques, including structured sparse data compression/decompression for block-sparse ResNet-50 on ACAP, enabling competitive performance and efficient FPGA utilization. (Patent pending)
* Developed FPGA prototyping systems and automated software tooling for MAC/FEC IP validation, including multi-SLR FPGA integration, C++ runtime software, and Python-based codegen and verification tools—reducing design and analysis time by 3x; adopted across multiple hard-IP teams.
* Led micro-architecture design, system integration, and runtime software development for AI-Engine and FPGA IPs; mentored junior engineers and interns, establishing design and verification best practices.
* Published and presented at ISSCC 2020: co-authored “A Versatile 7nm Adaptive Compute Acceleration Platform Processor.”
Skills & Tools: FPGA/ACAP Architecture, AI Engine, Python, C++, Verilog, HLS, Vivado, Vitis, RTL Linting, CDC, codegen pipelines, system integration, sparse NN inference
2010 — 2016
Westlake Village, California
Rambus acquires Inphi’s memory division.
Promoted to Senior Staff Engineer; led software and validation responsibilities.
Education
Stanford University
Graduate Coursework (SCPD)
North Carolina State University
Master of Science
SASTRA UNIVERSITY
Bachelor of Technology