2025: Gemma 3n launch; led model co-design efforts, focusing on efficient on-device model performance.
2023-2024: Spearheaded hardware-software co-design for TPU, optimizing inference and training efficiency; deployed efficiency techniques for a wide set of Gemini models.
2022-2023: First generative AI for mobile; led model compression to achieve the optimal memory/latency-quality trade-off for the model.
2022: Developed efficient inference for transformers, quantized PaLM, and co-led the effort to launch it in its first product.
2019-2024: Conducted research in quantization/sparsity for transformers, launching inference-efficient techniques for Translate models. Developed a compression library to enable research at production scale. Contributed to hardware/software co-design, implementing multiple efficiency features across several TPU generations.
2017-2019: Core developer for TensorFlow and efficient TPU runtime infrastructure.