I am authorized to work in the US and do not require sponsorship. I am a Staff Software Engineer at Meta, experienced in building large-scale software systems.
Experience
2023 — Now
New York, New York, United States
AI Infrastructure
Improving fleet-wide GPU utilization at Meta by optimizing frequently used CUDA kernels
PyTorch Domains
Media decoding, transforms, and preprocessing libraries for ML workloads
Performance analysis and optimization
C++, Python, CUDA, NVDEC
2010 — 2023
New York, New York, United States
Projects in reverse-chronological order:
1. Time series data ingestion and serving at scale for searches with commercial intent: near-real-time data (e.g., stock prices or currency rates) that needs to be ingested and served to hundreds of millions of users worldwide.
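A minimal sketch of the serving side of such a system, keeping only the freshest observation per key for low-latency reads. All names and data shapes here are illustrative assumptions, not Google's actual systems.

```python
# Illustrative latest-value store for near-real-time quotes.
# The class name, method names, and data shapes are hypothetical.

class QuoteStore:
    """Keeps only the freshest observation per symbol for fast reads."""

    def __init__(self):
        self._latest = {}  # symbol -> (timestamp, value)

    def ingest(self, symbol: str, timestamp: float, value: float) -> None:
        # Out-of-order or stale updates are dropped; only newer data wins.
        current = self._latest.get(symbol)
        if current is None or timestamp > current[0]:
            self._latest[symbol] = (timestamp, value)

    def serve(self, symbol: str):
        # Returns the most recent value, or None if never ingested.
        entry = self._latest.get(symbol)
        return entry[1] if entry else None
```

In practice the ingestion path would be sharded and replicated; the invariant shown (newest timestamp wins per key) is the core of latest-value serving.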
2. Structured search and indexing on Bigtable and Spanner.
Think of a corpus like Google Drive (petabytes of data), and searching for a token within that corpus, except the search should only return documents that you (the searcher) have access to. Documents can be granted to a user directly or through a group (and groups can contain other groups). We used Zanzibar [1], Google's planet-scale authorization system, and built token and partitioning systems on top of it.
We reduced latency by a double-digit percentage while keeping serving costs minimal.
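The nested-group access check described above can be sketched as a graph traversal: a user has access if the document's ACL reaches them through any chain of group memberships. This is a toy illustration in the spirit of Zanzibar-style ACL resolution; the group data and function names are made up, not Google's API.

```python
from collections import deque

# Hypothetical direct-membership edges: group -> members (users or groups).
GROUPS = {
    "eng": {"alice", "eng-leads"},
    "eng-leads": {"bob"},
    "docs-readers": {"eng", "carol"},
}

def has_access(doc_acl: set, user: str) -> bool:
    """True if `user` is granted access directly or via any nested group."""
    frontier = deque(doc_acl)
    seen = set()
    while frontier:
        principal = frontier.popleft()
        if principal == user:
            return True
        if principal in seen:
            continue  # groups can form cycles; visit each once
        seen.add(principal)
        # Expand group memberships transitively.
        frontier.extend(GROUPS.get(principal, ()))
    return False
```

At corpus scale the interesting work is precomputing and partitioning these reachability sets so the check never happens per-document at query time, but the traversal above is the semantic being served.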
3. ChromeOS profile-guided optimization using compiler techniques. Chrome is one of the most widely used apps in the world. We profiled it using the same sample-based, extremely low-overhead tools that we use to profile workloads in Google datacenters [2]. These tools profiled Chrome in the wild on real ChromeOS devices. The collected data (instruction pointers, not user content) was anonymized and sent to Google only for opted-in users, being careful to respect their privacy settings. We then symbolized billions of samples using internal symbol servers and fed that data back into the compiler. The result was a double-digit performance improvement for Chrome.
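The symbolize-and-aggregate step of a pipeline like this can be sketched as mapping raw instruction-pointer samples onto function ranges and counting hits per function, which is what a compiler's PGO input ultimately encodes. The symbol table and sample format below are invented for illustration.

```python
from collections import Counter

# Hypothetical symbol table: (start, end, function name) address ranges.
SYMBOLS = [
    (0x1000, 0x1FFF, "DecodeFrame"),
    (0x2000, 0x2FFF, "ParseHTML"),
    (0x3000, 0x3FFF, "RunScript"),
]

def symbolize(ip: int) -> str:
    """Map an instruction pointer to its enclosing function, if any."""
    for start, end, name in SYMBOLS:
        if start <= ip <= end:
            return name
    return "<unknown>"

def build_profile(samples):
    """Aggregate raw IP samples into per-function hit counts."""
    return Counter(symbolize(ip) for ip in samples)
```

A real pipeline does this over billions of samples against versioned binaries on a symbol server, but the output has the same shape: hot-function weights the compiler uses to guide inlining and code layout.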
[1] https://research.google/pubs/pub48190/
[2] https://research.google/pubs/pub36575/
2009 — 2010
Wrote kernels that ran on GPUs for image denoising, graphics shading, etc.
These were used to do low-level (instruction or cycle-level) performance analysis to guide the architecture of the next generation mobile GPU.
Wrote many kernels in OpenCL and CUDA, tuning them for maximum occupancy and throughput.
Predicted the performance of kernels using ML and other models.
2008 — 2008
Worked on micro-architecture performance analysis. The specific project was to work on the MMU/TLB simulation for the Fermi GPU architecture.
2007 — 2007
Captured GPU and CPU workloads for replay and performance analysis.
Performed very low-level (instruction- and cycle-level) performance simulation and analysis for next-generation CPU and GPU architectures. Predicted performance for GPU kernels and tuned workloads to perform well on Intel's next-generation architectures.
Education
Georgia Institute of Technology