I am passionate about understanding large-scale datasets using machine learning and interactive visualizations. I have extensive experience tackling large-scale data modeling and the associated distributed computing challenges using Scala/Java and the Hadoop stack.

2021 – 2024, Stripe, Staff Software Engineer

2018 – 2021, Stripe, Data Science Engineer

San Francisco Bay Area

My team supports data scientists and everyone else at Stripe in practicing data science. We believe in extending data scientists' skill sets and capabilities rather than making it easier to repeat the same old work.

In addition to abstracting away complexity, we invest a lot of time and effort in smoothing out the learning curve and building the right foundation, so that data scientists feel confident using advanced languages and frameworks, such as building dashboards in React or preparing analytical data in Spark, and can make "impossible" things possible.

2016 – 2018, Airbnb, Machine Learning Engineer

San Francisco Bay Area

Machine Learning and Data Analytics

Evolved Airbnb Home teams’ understanding, communication, and prioritization over supply and demand via work on market definition, market intelligence dashboard, and guest perceived availability.

Reconciled main Booking Probability model with Theoretical Elasticity for listing revenue forecasting model.

Designed Long Lead Day Pricing model pioneering Transfer Learning and Deep Learning.

Iterated Booking Probability model using distributed GAM.

Distributed Computing and Infrastructure

Optimized various distributed data pipelines, achieving 10-100x speedups. Documented and shared optimization insights.

Developed a Spark library that significantly reduces boilerplate code and user errors and boosts iteration speed when writing distributed ETL applications at Airbnb.

Developed data normalization framework for external data harmonization.

Data Science Enrichment and Partnership

Top contributor to internal R packages.

Established best practices for R package development, R dependency isolation in Airflow, and other R infrastructure for iterations and deployment.

Served as engineering partner on Data Science Technology Council.

Advocated for internal R education. Designed and taught a Data Visualization in R course.

2013 – 2016, BlackRock, Associate

Founding member of the Advanced Data Analytics team within BlackRock's Financial Modeling Group. Primarily focused on large-scale data processing, modeling, and visualization using Apache Spark and D3.js.

 Architected a data warehousing and modeling pipeline for a mortgage borrower-level dataset (TB+ size) covering data onboarding, feature extraction, aggregation, and modeling using Scala, Protobuf, Spark, and Parquet. Contributed fixes for bugs identified in the pipeline back to the Spark project.

 Iterated on mortgage prepayment machine learning models using R and Spark MLlib. Models included k-Means, GLM, k-NN, Random Forest, etc.

 Authored novel data visualizations of mortgage data (parallel coordinates, scatter plot matrix, etc.) and model performance using R, D3.js, and Tableau.

 Collaborated on a high-dimensional big data visualizer: binned aggregation using Spark and HBase; web app interface using Angular.js and D3.js.

 Designed and developed a SparkR DSL package which dynamically bootstraps itself from Scala reflection using metaprogramming.

 Developed R packages integrating the enterprise environment and Hadoop platform with R.

 Evangelized use of R Markdown and R Shiny for reproducible and interactive data science work.

 Worked on Pig-based ETL, analytics pipelines, and Pig UDFs in Java/Scala.

 Experienced in using Scala Macros to eliminate boilerplate code while maintaining static type safety and native performance.

2012 – 2012, BlackRock, Intern Analyst

Recruited to develop data visualization and reporting tools for financial modelling analytics. Worked on unifying the data retrieval process and providing an interactive reporting application.