I worked as a data infrastructure engineer, rebuilding the data pipeline to reduce report delivery time from 48 hours down to 3. I touched all three parts of the ETL pipeline during my time there:
1. In extraction, I wrote jobs in Scala (with Akka and ZooKeeper/Curator for coordination) to download audit logs from 6,000 servers continuously, in near real time. This came to roughly 300 GB of compressed data every 10 minutes, or about 80 TB of uncompressed data per day.
2. In the transformation phase, I used Spark Structured Streaming to read the raw audit logs and convert them from raw text to snappy-compressed Parquet with an explicit schema, enabling read-optimized queries downstream.
3. In the quasi-load phase, I worked on report aggregation and on optimizing vanilla Spark batch jobs. Using Scala/Spark and HDFS, I implemented the plethora of joins (broadcast and sort-merge), unions, shuffles, and repartitions needed to answer the question: how much money are we making on an hourly basis? We also optimized the job's UDFs and built a custom, Spark-metadata-based fix for the ever-dreaded small-files problem.
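The Akka/Curator extraction code from step 1 isn't reproduced here, but the fan-out pattern at its core can be sketched with just the standard library. The host names and `fetchAuditLog` below are hypothetical stand-ins; in production this was ~6,000 hosts partitioned across workers coordinated via ZooKeeper/Curator:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object ExtractSketch {
  // Hypothetical stand-in for pulling one server's audit-log chunk.
  def fetchAuditLog(host: String): Future[Array[Byte]] =
    Future(s"logs-from-$host".getBytes)

  def main(args: Array[String]): Unit = {
    // Dummy hosts; the real job tracked ~6,000 of them.
    val hosts = Seq("server-01", "server-02", "server-03")

    // Fan out one fetch per host, then wait for all of them.
    val all: Future[Seq[Array[Byte]]] = Future.sequence(hosts.map(fetchAuditLog))
    val chunks = Await.result(all, 10.seconds)

    println(s"fetched ${chunks.length} chunks, ${chunks.map(_.length).sum} bytes total")
  }
}
```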
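The essence of the transform in step 2 is parsing untyped text lines into a typed schema before writing Parquet. A plain-Scala sketch of that parsing (the `AuditEvent` fields and tab-separated log format are made up for illustration; the real job did this inside Spark Structured Streaming):

```scala
// Hypothetical schema; the real audit-log fields differed.
final case class AuditEvent(ts: Long, host: String, action: String)

object TransformSketch {
  // Parse one raw tab-separated line into the typed schema, dropping
  // malformed lines (None) instead of failing the whole job.
  def parse(line: String): Option[AuditEvent] =
    line.split('\t') match {
      case Array(ts, host, action) => ts.toLongOption.map(AuditEvent(_, host, action))
      case _                       => None
    }

  def main(args: Array[String]): Unit = {
    val raw = Seq("1700000000\tserver-01\tlogin", "garbage line")
    val events = raw.flatMap(parse)
    println(events) // only the well-formed line survives
  }
}
```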
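The hourly-revenue aggregation in step 3 ran as a distributed Spark batch job; stripped of the joins and shuffles, its logic reduces to a group-by-hour sum. A minimal in-memory sketch, with a hypothetical `Sale` record standing in for the real joined tables:

```scala
// Hypothetical record type; the real job joined many more tables.
final case class Sale(tsEpochSec: Long, productId: String, amount: Double)

object HourlyRevenue {
  // Bucket each sale into its hour and sum revenue per bucket --
  // the in-memory equivalent of Spark's groupBy(hour).agg(sum(amount)).
  def revenueByHour(sales: Seq[Sale]): Map[Long, Double] =
    sales.groupBy(_.tsEpochSec / 3600).view.mapValues(_.map(_.amount).sum).toMap

  def main(args: Array[String]): Unit = {
    val sales = Seq(Sale(0L, "a", 1.5), Sale(10L, "b", 2.0), Sale(3700L, "a", 4.0))
    println(revenueByHour(sales)) // hour 0 totals 3.5, hour 1 totals 4.0
  }
}
```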
On a division-wide level, I also set up the CI/CD pipelines for this new project, using sbt.
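A minimal `build.sbt` of the kind such a CI pipeline would invoke (project name, Scala version, and compiler flags here are placeholders, not the real project's settings):

```scala
// build.sbt -- placeholder settings; CI would run e.g. `sbt clean test package`
ThisBuild / scalaVersion := "2.13.12"

lazy val root = (project in file("."))
  .settings(
    name := "audit-pipeline",
    // Surface deprecations and feature warnings during CI builds.
    scalacOptions ++= Seq("-deprecation", "-feature")
  )
```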