Developed Apache Flume-based ETL pipeline infrastructure on Google Cloud Platform (GCP).
•Conducted tests to determine optimal Flume Agent configurations in cloud environments.
•Designed, codified, and deployed infrastructure using Terraform, SaltStack, and GitHub Actions.
•Updated Dstillery-specific Apache Hive dependencies to run on Google Dataproc clusters.
•Implemented metrics to monitor throughput, filtration, and failure rates for over 100 billion daily events flowing through Apache Kafka topics and Apache Flume agents (macro view of pipeline health).
•Designed metrics to monitor success rates and failure reasons in Flume sinks and interceptors (micro view of pipeline health).
•Built Grafana dashboards to visualize these metrics for both business and engineering stakeholders.
•Wrote a system to unify access to GCP security-credential secrets across all Dstillery applications via SaltStack.
•These contributions supported a company-mandated cloud migration that saves Dstillery millions of dollars per year in data-center costs while minimizing cloud compute and storage spend.
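The Flume agent configuration and metric wiring described above could be sketched as a minimal agent properties file. All names here (agent, topic, broker, bucket) are hypothetical, and the gs:// sink path assumes the Cloud Storage connector is on the classpath so Flume's HDFS sink can write directly to GCS:

```properties
# Hypothetical agent: Kafka source -> memory channel -> HDFS sink writing to GCS
agent1.sources = kafka-src
agent1.channels = mem-ch
agent1.sinks = gcs-sink

agent1.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.kafka-src.kafka.bootstrap.servers = broker1:9092
agent1.sources.kafka-src.kafka.topics = events
agent1.sources.kafka-src.channels = mem-ch

agent1.channels.mem-ch.type = memory
agent1.channels.mem-ch.capacity = 100000

# HDFS sink pointed at a GCS bucket via the Cloud Storage connector
agent1.sinks.gcs-sink.type = hdfs
agent1.sinks.gcs-sink.hdfs.path = gs://example-bucket/events/%Y-%m-%d
agent1.sinks.gcs-sink.hdfs.fileType = DataStream
agent1.sinks.gcs-sink.channel = mem-ch

# Source/channel/sink counters (throughput, failures) can be exposed as JSON
# for dashboards by starting the agent with:
#   -Dflume.monitoring.type=http -Dflume.monitoring.port=34545
```

Capacity, batch, and channel-type choices like these are what the configuration tests above would compare across cloud instance sizes.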
Designed a system using Google Dataplex to organize data across Hive, BigQuery, MySQL, and Google Cloud Storage with both technical and business metadata.
Migrated core datasets and ETL pipelines from on-premises server environments to Google Cloud environments.
•Managed migration of 7 core Hive databases from on-prem Hadoop environment to GCP Dataproc clusters.
•Reconfigured Flume sinks (streaming Protobuf messages with Hive table data) to write to GCS buckets instead of HDFS directories.
•Wrote a Hive table partition-management program, deployed via GCP Cloud Functions.
•Estimated, planned, and tested GCP resource usage to optimize configurations and minimize storage and compute costs.
•Developed core service to enable all 17 Groovy apps to interface with Hive in GCP environments.
•Redesigned Spring Boot, Groovy, and Python pipeline components to run in GCP compute instances.
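The partition-management program above could look roughly like this minimal sketch. The table name, partition column, and retention window are hypothetical, and a real Cloud Function entry point would execute the generated DDL against the Dataproc Hive endpoint rather than just build it:

```python
from datetime import date, timedelta


def partition_maintenance_sql(table: str, today: date, retain_days: int = 90) -> list[str]:
    """Build Hive DDL to add today's date partition and drop partitions
    older than the retention window.

    Returns the statements as strings; connection handling and execution
    against Hive are omitted from this sketch.
    """
    cutoff = today - timedelta(days=retain_days)
    return [
        # Idempotent add for today's partition
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION (dt='{today:%Y-%m-%d}')",
        # Hive accepts comparison operators in DROP PARTITION specs
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION (dt<'{cutoff:%Y-%m-%d}')",
    ]


if __name__ == "__main__":
    for stmt in partition_maintenance_sql("events.daily_logs", date(2023, 5, 1)):
        print(stmt)
```

Generating idempotent `IF NOT EXISTS` / `IF EXISTS` statements keeps the function safe to re-run on a schedule, which suits the Cloud Functions deployment model described above.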