Led cross-functional strategy for Pinterest’s Big Data Compute and Storage Platforms: 500+ PB on S3, 100K+ tables, and 400K+ daily compute jobs (20K+ Spark, 1K+ Trino nodes).
• --
Apache Iceberg Adoption (0 to 1)
Championed and led Iceberg platform adoption, securing investment and scaling the team from 1 to 5+ engineers by demonstrating high business value through improved data quality and governance.
* Engineered integration tooling: Custom Spark Catalog, Thrift support, and Hive-to-Iceberg auto-migration service.
* Enhanced Iceberg: Added support for two-level Parquet lists/maps and long-running HMS transactions.
* Scalable Deletions: Enabled Row-Level Data Deletions, boosting capacity by 10x and cutting compute cost while ensuring compliance.
• --
Cross-Platform Initiatives & ML Acceleration
* Governance: Led table/workflow governance to reduce cost and improve data lake quality.
* Modernization: Contributed to platform modernization through the Moka Project (Spark on EKS), improving the efficiency of large-scale data processing.
* ML Enablement: Contributed Fast Feature Backfill support, accelerating ML feature iteration and shortening model time-to-market.
• --
Spark SQL Platform Leadership & Growth (0 to 70K+ Jobs)
Founded and led the Spark SQL platform (0 to 1), establishing it as the primary data processing engine. Drove adoption to over 70,000 jobs per day and scaled the team (1 to 10+ engineers).
* Engineered end-to-end infrastructure (Terraform/Puppet, monitoring) with key features: a scalable direct S3 committer, split-splitting for compressed codecs, custom Thrift schemas, auto-tuning, and Apache Livy integration.
* Security: Built custom, high-scale Fine-Grained Access Control (FGAC) on the Big Data Platform (BDP) using STS tokens to meet stringent security and scale requirements.
• --
Big Data Platform (BDP) Foundation & Scaling
Founding member of the team that created the in-house BDP, transitioning off third-party vendors. Led the build-out and scaling of the SQL platform (Hive, Parquet, Presto).