Owned and authored highly parallel, performance-optimized kernels implementing number-theoretic algorithms on a tile-based accelerator.
Kernels covered both compute-bound and memory-bound algorithms, including FFTs, Merkle trees, bit reversals, and field arithmetic hash functions, in high-throughput and low-latency settings.
Optimized kernels at the workload, algorithm, and instruction levels by profiling data layouts, occupancy, instruction stalls, and synchronization barriers, and by maximizing hardware resource utilization.
Optimizations included zero-copy techniques, reduced DRAM accesses, memory coalescing, and hypercube swaps.
Worked with the infrastructure team to build the acceleration stack from the bottom up, including profilers, debuggers, and kernel features such as dynamic input sizes, input/output metadata, declarative specifications of memory access patterns, and sub-kernel calls.
Worked with the hardware team to support chip bring-up through RTL test suites and on-silicon debugging.