Contributor to NVIDIA (FlashInfer, CUTLASS) and Meta (PyTorch) open-source projects:
• PyTorch: Rewrote the CUDA upsample_bicubic2d kernel to parallelize across batch and channel dimensions: 4.3-43x speedup for VLM position embedding resizing - https://github.com/pytorch/pytorch/pull/174578
• FlashInfer: Upstreamed FP8 Groupwise GEMM optimization for small-M decode shapes: 10-40% faster at batch ≤32 - https://github.com/flashinfer-ai/flashinfer/pull/2327
• CUTLASS: Fixed SM100 (Blackwell) FP8 profiler bug: corrected epilogue shape_div divisibility condition for non-multiple-of-64 N tiles - https://github.com/NVIDIA/cutlass/pull/2946