Software Developer at Databricks | Master's Student at CMU
I will graduate in May 2020. I am interested in building robust large-scale distributed systems and low-level systems (database/kernel/hypervisor/compiler).
Designed and implemented a static analysis pipeline for nightly and release builds that tracks breaking API changes and dependency upgrades, checks for symbol conflicts, and generates dependency lists for release notes
•
Built a new pipeline to synchronize the internal Spark fork with the company's monorepo, reducing the latency of integrating new code changes from days to less than 3 hours. Improved monitoring by adding dashboards and alerts for out-of-sync conditions
•
Developed and deployed the pipeline for updating aarch64 images, enabling the product on ARM-based instances
•
Coordinated the important dependency upgrade from Hadoop 2 to Hadoop 3 during the Databricks Runtime 9 to 10 major release
•
Contributed several user experience improvements to Spark History Server, a debugging tool for Spark jobs
•
Contributed to several optimization passes in the query compiler, such as common subexpression elimination
•
Removed all old log4j dependencies and replaced them with the reload4j project, helping mitigate log4j vulnerabilities
Built a new pipeline to annotate network traffic metrics with service and hardware information, and boosted query speed by 2x by forcing Presto to use a broadcast join in the physical plan
•
Designed and developed a Python tool that searches for adjustments to service capacity placement across data centers that reduce cross-datacenter network traffic; used it to propose several changes, each of which can reduce cross-datacenter network traffic by 12% (several Tb/s)