● Worked on Spark CBO statistics estimation fixes for Aggregate and Sort operators, yielding a speedup of almost 2x on select queries such as TPCDS Q83. Contributed back to open source (commit id b1857a).
● Fixed a deadlock in Spark’s UnsafeExternalSorter that affected one of the largest Qubole customer workloads. Contributed back to open source (commit id 6c4552c6).
● Spark - S3 Select connector for Qubole Spark that automatically pushes down projections and filters for CSV and JSON; TPCDS benchmark speedup: geometric mean 2.9x, max 5x (Blog: https://www.qubole.com/blog/amazon-s3-select-integration/)
● Serverless Spark on AWS Lambda: Spark executors run entirely as Lambda functions, with S3 as the external storage for shuffle data (Blog: https://www.qubole.com/blog/spark-on-aws-lambda/)
● Worked on Qubole Spark auto-scaling based on stage progress, with support for defining pluggable, custom auto-scaling policies.
● Implemented workload-based scaling limits using Apache YARN’s Fair Scheduler queue limits.
● Implemented HDFS auto-scaling, which scales up nodes based on DFS disk capacity and incoming data velocity.
● Mentoring new grads and interns; PR reviews; Spark version upgrades; on-call duty; troubleshooting customer issues.