Build and maintain data pipeline infrastructure, including a batch job scheduling system, real-time data streaming pipelines, monitoring/alerting, tooling, backups, and data synchronization between clusters.
Real-time user event streaming pipeline:
▪ Designed, developed, and owned the data streaming pipeline that handles user events and user attributes, using Filebeat, Kafka, and Flink on Kubernetes
▪ Handles traffic averaging 600 events/s and peaking at 2,000 events/s, with 95% of events delivered at under 2 s latency
▪ Implemented and supported monitoring, alerting, and replay mechanisms to ensure zero data loss
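A minimal sketch of the latency check behind monitoring a target like the one above (95% of events under 2 s). The threshold, function names, and event data are illustrative assumptions, not the production code.

```python
# Hypothetical p95-latency alerting check; values and names are
# illustrative, not the real monitoring system.

P95_THRESHOLD_S = 2.0  # alert if 95th-percentile latency exceeds this


def p95_latency(latencies_s):
    """Return the 95th-percentile latency of a window of event latencies."""
    if not latencies_s:
        return 0.0
    ordered = sorted(latencies_s)
    # index of the sample below which 95% of the window falls
    idx = min(len(ordered) - 1, int(len(ordered) * 0.95))
    return ordered[idx]


def should_alert(latencies_s, threshold_s=P95_THRESHOLD_S):
    """True when the p95 latency of the window breaches the threshold."""
    return p95_latency(latencies_s) > threshold_s


# 100 events: 95 fast (0.5 s) and 5 slow (3.0 s) -> p95 lands in the slow tail
window = [0.5] * 95 + [3.0] * 5
```

In practice a check like this would run over a sliding window of event timestamps and feed an alerting system.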
Batch data pipeline job scheduling system:
▪ Developed and owned the core batch data pipeline on a customized Luigi framework; led the team's maintenance and support work
▪ Built job templates to support different types of ETL jobs (Sqoop, Hive, Spark, MapReduce, etc.)
▪ Designed and built automatic synchronization between the main data cluster and a read-only cluster used for ad-hoc queries
▪ Built a data validation system to ensure data quality, a monitoring system to track job delays, and CI/CD on Jenkins
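The job-template idea above can be sketched as a small class hierarchy: a common base renders a shell command per engine type. Class names, parameters, and command formats here are hypothetical illustrations, not the actual Luigi-based templates.

```python
# Hypothetical sketch of per-engine ETL job templates; names and
# command formats are illustrative, not the production framework.

class ETLJobTemplate:
    """Base template: subclasses define the engine and its command format."""
    engine = "base"
    command_fmt = ""

    def __init__(self, **params):
        self.params = params

    def render_command(self):
        # Fill the engine-specific command template with job parameters
        return self.command_fmt.format(**self.params)


class HiveJob(ETLJobTemplate):
    engine = "hive"
    command_fmt = "hive -f {script} --hivevar dt={dt}"


class SparkJob(ETLJobTemplate):
    engine = "spark"
    command_fmt = "spark-submit --master yarn {app} {dt}"


job = HiveJob(script="daily_agg.hql", dt="2020-01-01")
cmd = job.render_command()  # "hive -f daily_agg.hql --hivevar dt=2020-01-01"
```

A scheduler (such as Luigi) would then wrap each rendered command in a task with declared dependencies, so new job types only need a new template subclass.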