Principal Systems Architect with 20+ years of expertise in AI Infrastructure, Compiler Engineering, and System Software. Specializing in Software-Hardware Co-design, bridging the gap between high-level AI frameworks and silicon performance.

Experience

AMD, Principal Software Engineer

2019 — Now

San Jose, California

Architected and implemented an end-to-end MLIR-based compiler (AIEHLC) for Spatial Computing NPU Architectures (AIE). Designed a robust progressive lowering pipeline leveraging custom hierarchical dialects and specialized transformation passes to automate tiling, routing, and scheduling.

Apache TVM Contributor: Developed and upstreamed the Heterogeneous Pipeline Runtime to enable parallel execution across CPU, GPU, and FPGA, resulting in 3 published papers and a 30% throughput increase for YOLO on AMD Ultra96. Additionally, spearheaded HW/SW co-design for the TVM VTA accelerator by customizing HLS hardware logic and implementing corresponding runtime HAL extensions.

Architected and engineered a low-latency bare-metal runtime (AEG API) for NPU accelerators (AIE), designed to bypass OS kernel mode-switching overhead for real-time inference and to enhance security by eliminating dependencies on complex system layers.

Delivered a 500% (5x) improvement in kernel dispatch performance by re-architecting the kernel binary format and inventing a specialized loader that maximizes SRAM utilization and data locality, removing DRAM bandwidth bottlenecks during startup.

Led driver development (AIE Driver) for multiple generations of NPU architectures (AIE), ensuring robustness and high throughput across Linux, bare-metal, and RTOS environments.

Riverbed Technology, MTS

2014 — 2019

Sunnyvale, California, USA

Achieved a 300% (3x) throughput improvement in critical data-path processing for SD-WAN by architecting a next-generation data plane using DPDK (kernel bypass) and high-concurrency lockless queues, combined with deep system-level tuning including NUMA awareness, VM scheduling/affinity orchestration, and micro-architectural optimizations for cache and branch efficiency.

Innovated hardware acceleration workflows by developing an internal source-to-source compiler, enabling the automatic conversion of C code into FPGA HLS code for rapid deployment on accelerated hardware.

Engineered critical performance enhancements within the Linux kernel network subsystem. Integrated advanced protocol features such as TCP Fast Open (TFO) directly into the kernel model, significantly reducing connection establishment latency for network-intensive applications.

Juniper Networks, Staff Software Engineer

2013 — 2014

Sunnyvale, CA, USA

Enhanced system robustness and reliability for Juniper SRX security platforms by developing and debugging critical low-level components, including a microkernel architecture, bootloaders, and the FreeBSD-based Junos OS kernel.

Dell, Senior Software Engineer

2013 — 2013

San Jose, CA, USA

Optimized the performance of virtualized desktop environments, achieving a 70% reduction in boot time and filing 1 patent.

Fluke Corporation, Staff Software Engineer

2011 — 2013

Santa Clara, California, USA

Designed and implemented a source-to-source compiler to automate the migration of Linux kernel drivers to the Windows platform.

Developed translation rules to automatically convert Linux-specific kernel APIs and data structures into their Windows equivalents, significantly reducing manual porting effort and ensuring code consistency.

Architected and implemented a high-performance OSPort runtime layer to bridge kernel execution models, dynamically mapping Linux Bottom Halves and performing real-time translation of kernel data structures.

Education

Nanjing University of Aeronautics and Astronautics

BS