Managed the reliability and optimization of a 1.4-petabyte database cluster consisting of 100 sharded Postgres databases unified by Citus, a distributed query engine. Optimized all parts of the data pipeline and serving stack, including Kafka data ingestion, Linux kernel tuning, Postgres configuration, and SQL code generation.
Designed and implemented a control plane in Go with a REST API to automate database recovery, backups, and upgrades. Reduced manual work from 4 hours to 10 minutes and database recovery time from 2 days to 8 hours.
Led a 3-month technical evaluation of commercial database vendors to investigate the feasibility of migrating to a distributed column store. Implemented, benchmarked, and evaluated Heap’s workload on SingleStore, Postgres, Snowflake, dbX, and Rockset.