I am authorized to work in the US and do not require sponsorship. I am a Staff Software Engineer at Meta, experienced in building large-scale software systems.
Experience
2023 — Now
New York, New York, United States
AI Infrastructure
Improving fleet-wide GPU utilization at Meta by optimizing frequently used CUDA kernels
PyTorch Domains
Media decoding, transforms, and preprocessing libraries for ML workloads
Performance analysis and optimization
C++, Python, CUDA, NVDEC
2010 — 2023
New York, New York, United States
Projects in reverse-chronological order:
1. Time series data ingestion and serving at scale for searches with commercial intent: near-real-time data (e.g., stock prices or currency rates) that needs to be ingested and served to hundreds of millions of users worldwide.
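A minimal sketch of the serving side of such a system, keeping only the freshest observation per key for low-latency reads. All names and data shapes here are illustrative assumptions, not Google's actual systems.

```python
# Illustrative latest-value store for near-real-time quotes.
# The class name, method names, and data shapes are hypothetical.

class QuoteStore:
    """Keeps only the freshest observation per symbol for fast reads."""

    def __init__(self):
        self._latest = {}  # symbol -> (timestamp, value)

    def ingest(self, symbol: str, timestamp: float, value: float) -> None:
        # Out-of-order or stale updates are dropped; only newer data wins.
        current = self._latest.get(symbol)
        if current is None or timestamp > current[0]:
            self._latest[symbol] = (timestamp, value)

    def serve(self, symbol: str):
        # Returns the most recent value, or None if never ingested.
        entry = self._latest.get(symbol)
        return entry[1] if entry else None
```

In practice the ingestion path would be sharded and replicated; the invariant shown (newest timestamp wins per key) is the core of latest-value serving.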
2. Structured search and indexing on Bigtable and Spanner.
Think of a corpus like Google Drive (petabytes of data), and searching for a token within that corpus, except the search should only return documents that you (the searcher) have access to. Documents can be granted to a user directly or through a group (and groups can contain other groups). We used Zanzibar [1], Google's planet-scale authorization system, and built token and partitioning systems on top of it.
We reduced latency by a double-digit percentage while keeping serving costs minimal.
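The nested-group access check described above can be sketched as a graph traversal: a user has access if the document's ACL reaches them through any chain of group memberships. This is a toy illustration in the spirit of Zanzibar-style ACL resolution; the group data and function names are made up, not Google's API.

```python
from collections import deque

# Hypothetical direct-membership edges: group -> members (users or groups).
GROUPS = {
    "eng": {"alice", "eng-leads"},
    "eng-leads": {"bob"},
    "docs-readers": {"eng", "carol"},
}

def has_access(doc_acl: set, user: str) -> bool:
    """True if `user` is granted access directly or via any nested group."""
    frontier = deque(doc_acl)
    seen = set()
    while frontier:
        principal = frontier.popleft()
        if principal == user:
            return True
        if principal in seen:
            continue  # groups can form cycles; visit each once
        seen.add(principal)
        # Expand group memberships transitively.
        frontier.extend(GROUPS.get(principal, ()))
    return False
```

At corpus scale the interesting work is precomputing and partitioning these reachability sets so the check never happens per-document at query time, but the traversal above is the semantic being served.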
3. ChromeOS profile-guided optimization using compiler techniques. Chrome is one of the most widely used apps in the world. We profiled it using the same sample-based, extremely low-overhead tools that we use to profile workloads in Google datacenters [2]. These tools profiled Chrome in the wild on real ChromeOS devices. The collected data (instruction pointers, not user content) was anonymized and sent to Google only for opted-in users, being careful to respect their privacy settings. We then symbolized billions of samples using internal symbol servers and fed that data back into the compiler. The result was a double-digit performance improvement for Chrome.
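The symbolize-and-aggregate step of a pipeline like this can be sketched as mapping raw instruction-pointer samples onto function ranges and counting hits per function, which is what a compiler's PGO input ultimately encodes. The symbol table and sample format below are invented for illustration.

```python
from collections import Counter

# Hypothetical symbol table: (start, end, function name) address ranges.
SYMBOLS = [
    (0x1000, 0x1FFF, "DecodeFrame"),
    (0x2000, 0x2FFF, "ParseHTML"),
    (0x3000, 0x3FFF, "RunScript"),
]

def symbolize(ip: int) -> str:
    """Map an instruction pointer to its enclosing function, if any."""
    for start, end, name in SYMBOLS:
        if start <= ip <= end:
            return name
    return "<unknown>"

def build_profile(samples):
    """Aggregate raw IP samples into per-function hit counts."""
    return Counter(symbolize(ip) for ip in samples)
```

A real pipeline does this over billions of samples against versioned binaries on a symbol server, but the output has the same shape: hot-function weights the compiler uses to guide inlining and code layout.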
[1] https://research.google/pubs/pub48190/
[2] https://research.google/pubs/pub36575/
2009 — 2010
Wrote kernels that ran on GPUs for image denoising, graphics shading, etc.
These were used to do low-level (instruction or cycle-level) performance analysis to guide the architecture of the next generation mobile GPU.
Wrote many kernels in OpenCL and CUDA, tuning them for maximum occupancy and throughput.
Predicted the performance of kernels using ML and other models.
2008 — 2008
Worked on micro-architecture performance analysis. The specific project was to work on the MMU/TLB simulation for the Fermi GPU architecture.
2007 — 2007
Captured GPU and CPU workloads for replay and performance analysis.
Performed very low-level (instruction- and cycle-level) performance simulation and analysis for next-generation CPU and GPU architectures. Predicted performance for GPU kernels and tuned workloads to perform well on Intel's next-generation architectures.
Education
Georgia Institute of Technology