• Optimized Spark/Scala identity-graph pipelines processing billion-scale datasets, reducing runtime from 4 hours to 1 hour (4× faster) through partitioning, persistence tuning, and memory-safe execution
• Designed and implemented a “graph vitals” monitoring system computing 10+ statistical metrics (percentiles, distributions, time-window trends) to improve data reliability and downstream analytics
• Built production-grade observability with Datadog and Terraform, shipping 10+ dashboards and alerts that track runtime, failures, executor OOMs, shuffle spill, and data freshness
• Persisted analytics outputs to Apache Iceberg tables, providing consistent, queryable datasets for downstream reporting and decision-making
• Collaborated with platform and data teams to improve pipeline stability and reduce on-call incidents across mission-critical workloads