Experience
2023 — Now
Santa Clara County, California, United States
2019 — 2023
Menlo Park, California, United States
At Meta, I initially worked on building the data platform for Integrity-related products. I currently work on the Spark team, where my focus has been on query planning, optimization, and platform reliability. Highlighted projects include:
• Row-level access permissions for Spark: Several data-security use cases within Meta required that restricted users get access to only a subset of rows in warehouse tables. When users submit queries against a whole warehouse table, access-restricted rows should be filtered out automatically.
As a solution, we built a system that rewrites user queries, reports to users when their queries are rewritten, and prevents users from bypassing the service when running queries.
• Community Integrity Data Platform: My team owned the data platform and products for appeals, penalties, and communication products within Community Integrity (CI). We developed the data platform behind metrics for A/B testing (Deltoid) and user-behavior analysis. Insights derived from the platform are used to understand violator behavior and the effect of bad content on Facebook users.
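The row-level filtering approach above can be illustrated with a minimal Python sketch. The policy table, predicate, and function names here are hypothetical placeholders, not the production system, which also handled user reporting and bypass prevention:

```python
# Hypothetical policy map: table -> filter predicate applied for restricted users.
POLICIES = {
    "warehouse.events": "region = 'allowed_region'",
}

def rewrite_query(sql: str, table: str) -> str:
    """Wrap a policy-protected table in a filtered subquery so
    access-restricted rows are filtered out transparently."""
    predicate = POLICIES.get(table)
    if predicate is None:
        return sql  # no policy on this table: leave the query untouched
    # Naive textual substitution, enough for a sketch; a real rewriter
    # would operate on the parsed query plan.
    filtered = f"(SELECT * FROM {table} WHERE {predicate})"
    return sql.replace(table, filtered)

print(rewrite_query("SELECT count(*) FROM warehouse.events", "warehouse.events"))
```

The key property is that the rewrite is transparent: users query the full table name, and the filter is injected before execution.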
Tooling: Java, Scala, Presto, Spark, Python, Dataswarm
2016 — 2019
Berlin Area, Germany
I was a member of the Search Analytics team, responsible for building tools that help analyze search performance. I built and tuned log-aggregation pipelines to make log analysis easy, and built data-visualization tools backed by Kibana. I also contributed to the Maps Search ranking engine, which is responsible for showing the most relevant search results on top.
Tooling: Java, Scala, Hadoop, Hive, Spark, Pig, AWS
2015 — 2016
Berlin Area, Germany
Rocket Internet is a Berlin-based startup incubator that builds and funds Internet startups in emerging markets. I was a member of the Analytics team in the Online Marketing group, where I worked in the following areas:
• Advanced reports for AdWords: I built advanced reports, based on BigQuery data and the AdWords API, that are not available through AdWords reporting. One of them was an ad-group performance report, in which we evaluated CIRs of ad groups/keywords based on custom attribution models.
• Reporting for Rocket Advertising: We were building a programmatic bidding platform to show ads on external DSPs. I aggregated performance logs and created reports on them, and built user segments for targeting.
• Facebook campaign reporting: I contributed to a Facebook campaign reporting tool that aggregates performance data about campaigns over time and reports it at a more granular level than Facebook does.
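A custom attribution model, as used in the ad-group reports above, can be sketched in a few lines. This uses a simple linear model that splits conversion credit evenly across touchpoints; the model choice and names are illustrative only, not the production attribution logic:

```python
from collections import defaultdict

def linear_attribution(touchpoints: list[str], revenue: float) -> dict[str, float]:
    """Assign an equal share of conversion revenue to every ad touchpoint
    (e.g. an ad group or keyword) on the path to conversion."""
    credit: dict[str, float] = defaultdict(float)
    share = revenue / len(touchpoints)
    for tp in touchpoints:
        credit[tp] += share  # a touchpoint seen twice earns double credit
    return dict(credit)

# A conversion worth 90 whose path touched three ads across two ad groups:
print(linear_attribution(["adgroup_a", "adgroup_b", "adgroup_a"], 90.0))
```

Swapping in a different credit rule (last-click, position-based, time-decay) changes only how `share` is computed per touchpoint.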
Tooling: Hive, BigQuery, Python, Pandas, Ansible, Docker, AWS
2013 — 2015
Bangalore
Vizury Systems is an Indian startup in online advertising. Vizury's primary business was behavioral retargeting for e-commerce websites (now a DMP). I was a developer on the Analytics team, mainly responsible for performance and usability improvements in Hive. We built a custom distribution of Hadoop hosted on AWS. I worked in the following areas:
• Hive + Spark as a hybrid querying platform: At Vizury, we used Hive to run batch analytics queries on Spark. At the time, Spark could perform join operations only in memory, which caused reliability issues for complex joins.
As a solution, we built a Hive query analyzer that detects the complexity of a query's join operations: queries with lightweight joins run in Spark, and complex ones in MapReduce. This query redirection was automatic and adaptive; end users only had to submit valid Hive queries to the platform, and query analysis and execution-engine decisions were made on the fly.
As a result, smaller queries completed very quickly in memory, while complex queries ran more slowly but eventually succeeded.
• Hive + Spark cluster deployments with autoscaling: Vizury provided Hive-as-a-service to its data-analyst user base. I set up the cluster from scratch and implemented a custom autoscaling layer to quickly scale the cluster up based on demand, from a few tens of machines to hundreds. Autoscaling reduced the dollar cost of our AWS infrastructure by about 66%.
• Optimizing Pig/Hive queries: I optimized Pig/Hive scripts that users submitted frequently, and also implemented Hive-based ETLs.
• Front-end application for Hive: I contributed several features to the front-end application as well, including Google sign-in integration, API calls, per-user query-result history, query logs, and output previews.
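The adaptive engine selection described above can be sketched as a simple routing function. The join-count threshold and all names here are illustrative assumptions, not the original analyzer, which inspected the parsed query rather than raw text:

```python
import re

# Assumed cutoff: queries with more joins than this are considered
# too complex for in-memory execution and fall back to MapReduce.
JOIN_THRESHOLD = 2

def count_joins(sql: str) -> int:
    """Crude complexity estimate: count JOIN keywords in the query text."""
    return len(re.findall(r"\bjoin\b", sql, flags=re.IGNORECASE))

def choose_engine(sql: str) -> str:
    """Route lightweight queries to Spark (in-memory) and complex
    ones to MapReduce for reliability; decided on the fly per query."""
    return "spark" if count_joins(sql) <= JOIN_THRESHOLD else "mapreduce"

print(choose_engine("SELECT * FROM a JOIN b ON a.id = b.id"))
```

Because the decision is made per query at submission time, users never choose an engine themselves; they just submit valid Hive queries.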
Tooling: Hadoop, Hive, Pig, Spark, Java, Python, AWS
Education
Indian Institute of Science (IISc)
ME
Savitribai Phule Pune University