Principal Systems Architect with 20+ years of expertise in AI Infrastructure, Compiler Engineering, and System Software. Specializing in Software-Hardware Co-design, bridging the gap between high-level AI frameworks and silicon performance.
Experience
2019 — Now
San Jose, California
Architected and implemented an end-to-end MLIR-based compiler (AIEHLC) for Spatial Computing NPU Architectures (AIE). Designed a robust progressive lowering pipeline leveraging custom hierarchical dialects and specialized transformation passes to automate tiling, routing, and scheduling.
Apache TVM Contributor: Developed and upstreamed the Heterogeneous Pipeline Runtime to enable parallel execution across CPU, GPU, and FPGA, resulting in 3 published papers and a 30% throughput increase for YOLO on AMD Ultra96. Additionally, spearheaded HW/SW co-design for the TVM VTA accelerator by customizing HLS hardware logic and implementing corresponding runtime HAL extensions.
Architected and engineered a low-latency bare-metal runtime (AEG API) for NPU accelerators (AIE), designed to bypass OS kernel mode-switching overhead for real-time inference and to enhance security by eliminating dependencies on complex system layers.
Delivered a 5x improvement in kernel dispatch performance by re-architecting the kernel binary format and inventing a specialized loader that maximizes SRAM utilization and data locality, removing DRAM bandwidth bottlenecks during startup.
Led driver development (AIE Driver) for multiple generations of NPU architectures (AIE), ensuring robustness and high throughput across Linux, bare-metal, and RTOS environments.
2014 — 2019
Sunnyvale, California, USA
Achieved a 3x throughput improvement in critical data-path processing for SD-WAN by architecting a next-generation data plane using DPDK (kernel bypass) and high-concurrency lockless queues, combined with deep system-level tuning including NUMA awareness, VM scheduling/affinity orchestration, and micro-architectural optimizations for cache and branch efficiency.
Innovated hardware acceleration workflows by developing an internal source-to-source compiler, enabling the automatic conversion of C code into FPGA HLS code for rapid deployment on accelerated hardware.
Engineered critical performance enhancements within the Linux kernel network subsystem. Integrated advanced protocol features such as TCP Fast Open (TFO) directly into the kernel model, significantly reducing connection establishment latency for network-intensive applications.
2013 — 2014
Sunnyvale, CA, USA
Enhanced system robustness and reliability for Juniper SRX security platforms by developing and debugging critical low-level components, including a microkernel architecture, bootloaders, and the FreeBSD-based Junos OS kernel.
2013 — 2013
San Jose, CA, USA
Optimized the performance of virtualized desktop environments, achieving a 70% reduction in boot time and resulting in 1 patent filing.
2011 — 2013
Santa Clara, California, USA
Designed and implemented a source-to-source compiler to automate the migration of Linux kernel drivers to the Windows platform.
Developed translation rules to automatically convert Linux-specific kernel APIs and data structures into their Windows equivalents, significantly reducing manual porting effort and ensuring code consistency.
Architected and implemented a high-performance OSPort runtime layer to bridge kernel execution models, dynamically mapping Linux Bottom Halves and performing real-time translation of kernel data structures.
Education
Nanjing University of Aeronautics and Astronautics