I was part of NVIDIA's core Deep Learning Architecture group, working on HPC and ML kernel performance.
Before that, I was an architect in the SM Architecture group, working on improving energy efficiency. My contributions involved studying how Deep Learning applications map onto the SM and Tensor Cores, and working on micro-architectural strong-scaling features to reduce energy consumption. I also spent time prototyping JIT translators for generating performant GPU assembly.
May 2020 - Jun 2021
• Delivered high-performance ML kernels for CUDA libraries
• Contributed to CUTLASS OSS: https://github.com/NVIDIA/cutlass
Jun 2017 - May 2020
• Worked on the architecture and design of the GPU SM and Tensor Cores
• Developed binary translators for generating performant GPU assembly