Experience
2024 — Now
• Building and deploying the Zipline AI control plane on top of Chronon OSS (https://chronon.ai/), integrating it with multiple clouds (GCP and AWS) to manage lifecycle execution of Spark and Flink jobs on Dataproc / EMR and to enable compute sharing.
• Expanded data interoperability of Spark jobs to support heterogeneous sources, including Apache Iceberg, BigQuery native/external tables, and Apache Hudi.
• Improved application stability and resource footprint by diagnosing and resolving critical JVM memory leaks in Dockerized services, using Java Flight Recorder for heap and thread-dump analysis.
• Reduced feature fetching latency by 30% by migrating payload serialization from JSON to Avro binary, significantly decreasing vector payload sizes and network overhead.
• Stabilized the deployment pipeline by implementing safe, idempotent database migrations using Liquibase and containerizing the full local development environment via Docker.
• Added a robust end-to-end integration test suite covering the CLI, compute engine, and inference service across cloud providers.
• Tuned Spark and Flink configurations to ensure job stability. For Spark, profiled jobs to identify bottlenecks such as partition skew and adjusted configurations accordingly; for Flink, added new operators to extend stream-processing logic, such as writing to additional sinks.
Scala + Spark + Flink + Iceberg + Docker + Liquibase + Hudi + Dataproc + EMR + BigTable + Postgres
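The Spark skew tuning described above can be illustrated with a configuration fragment; the keys are standard Spark 3.x adaptive query execution settings, but the values here are illustrative examples, not the actual production configuration.

```properties
# Illustrative settings for mitigating skewed joins via adaptive query
# execution; values are examples, not a production configuration.
spark.sql.adaptive.enabled                                   true
spark.sql.adaptive.skewJoin.enabled                          true
spark.sql.adaptive.skewJoin.skewedPartitionFactor            5
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes  256MB
spark.sql.adaptive.advisoryPartitionSizeInBytes              64MB
```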
2021 — 2024
• Machine learning infrastructure engineer on Shepherd, Stripe's feature engineering platform.
https://stripe.com/blog/shepherd-how-stripe-adapted-chronon-to-scale-ml-feature-development
Spark + Flink + Airflow + Scala + Java + Hive + Iceberg
• Led migration of Stripe's early merchant fraud detection offline ML model from a legacy feature engineering platform to Shepherd. Implemented the Shepherd-based features and conducted extensive offline evaluation to ensure the new features matched the old ones. Backtested features to confirm that score distributions and recall were in line with those of the pre-existing features.
• Led technical implementation of the first asynchronous Shepherd-based ML model for merchant fraud. Built the online flow: an event consumer subscribed to various Kafka topics that fetched online features from the feature store and conditionally triggered model scoring downstream. Worked with a team of ML engineers to assemble an automated backtesting and training-data pipeline using offline-computed point-in-time training data, built with Airflow + Flyte + Iceberg.
• Led technical design and development of first cut of Stripe's new core product infrastructure managing multi-entity accounts/businesses. See https://docs.stripe.com/get-started/account/orgs.
Responsible for ideation + development of the read-optimized tree-graph storage strategy representing a Stripe enterprise business with multiple accounts and/or entities. Built read + write APIs and workflows, including various locking schemes to support concurrent updates to the tree while maintaining correctness. Java + RPC + Protobuf + Bazel + MySQL + Mongo
• Developed real time stream processing jobs using Flink + Scala + Bazel + Kafka to monitor the merchant experience of Stripe's customer base across the world.
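The concurrent tree updates mentioned above can be sketched with a version-based optimistic locking scheme; all class and method names below are hypothetical, since the actual implementation is internal to Stripe.

```python
# Hypothetical sketch of optimistic locking for concurrent tree updates.
# Names and the version-check scheme are illustrative, not Stripe's design.

class ConcurrentModificationError(Exception):
    """Raised when a writer's snapshot is stale."""

class Node:
    def __init__(self, node_id, parent_id=None):
        self.node_id = node_id
        self.parent_id = parent_id
        self.version = 0  # incremented on every successful write

class TreeStore:
    def __init__(self):
        self.nodes = {}

    def put(self, node):
        self.nodes[node.node_id] = node

    def reparent(self, node_id, new_parent_id, expected_version):
        """Move a node under a new parent; fail if another writer got there first."""
        node = self.nodes[node_id]
        if node.version != expected_version:
            raise ConcurrentModificationError(node_id)
        node.parent_id = new_parent_id
        node.version += 1
        return node.version
```

A writer reads a node (and its version), attempts the update with that version, and retries from a fresh read on conflict.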
2021 — 2021
Led and designed architecture plans to migrate existing data pipeline infrastructure away from MongoDB, including:
• a pros-and-cons proposal for replacing MongoDB with AWS Redshift or Snowflake
• a proof of concept detailing the use of Airbyte to load data from multiple AgileMD sources into AWS Redshift, with dbt owning transformations within Redshift
• a cost-benefit analysis of building and hosting the open-source data frameworks listed above versus buying a managed ETL/ELT service
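The dbt-owned transformations proposed above would look something like a minimal staging model; the model, source, and column names here are hypothetical.

```sql
-- models/staging/stg_visits.sql -- hypothetical model; source and column
-- names are invented for illustration.
{{ config(materialized='view') }}

select
    visit_id,
    patient_id,
    cast(admitted_at as timestamp) as admitted_at
from {{ source('agilemd_raw', 'visits') }}
```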
2019 — 2021
Built asynchronous microservices in Java using AWS CloudFormation, Lambda, DynamoDB, SQS, and EventBridge to solve cross-border movement problems for Amazon Logistics.
Architected and developed an end-to-end solution to migrate production data from AWS DynamoDB to Amazon's internal data lake for further processing by dependent business intelligence and data engineering teams, using AWS CloudWatch Events, S3, and SNS.
Led a performance readiness assessment of the team's services (>10) in preparation for increased traffic during the 2020 Q4 holiday season. Analyzed previous traffic patterns and load tested individual services to determine whether their limits could handle the forecasted traffic. Scaled services horizontally or vertically on a case-by-case basis (larger AWS EC2 instances, more EC2 instances, enabling DynamoDB autoscaling, etc.).
Completed integration of Amazon Devices Logistics onto the team's platform and served as lead and primary point of contact for new technical issues and features.
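Because SQS delivers messages at least once, the asynchronous services above must handle duplicate deliveries idempotently. A minimal sketch, with hypothetical names and an in-memory dedup set standing in for a durable store (e.g., a DynamoDB conditional write):

```python
# Hypothetical sketch of idempotent handling of at-least-once queue
# deliveries. Names are illustrative, not the actual service design.

class IdempotentConsumer:
    def __init__(self, handler):
        self.handler = handler
        # In production this would be a durable store with a conditional
        # write, e.g. DynamoDB; a set suffices to show the idea.
        self.seen = set()

    def process(self, message_id, body):
        """Run the handler once per message id; skip duplicates."""
        if message_id in self.seen:
            return False  # duplicate delivery, no side effects
        self.handler(body)
        self.seen.add(message_id)
        return True
```

Re-delivering the same message id is then a no-op, so downstream side effects happen exactly once per logical event.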
2019 — 2019
Education
Columbia University
Bachelor of Science (B.S.)
Elon University