# Saumitra Shahapure

> Staff Software Engineer at Dremio | Ex-Facebook | Ex-Adobe

Location: San Francisco Bay Area, United States
Profile: https://flows.cv/saumitrashahapure

I am a Software Engineer with extensive experience in building large-scale distributed data analytics platforms. I have worked on Big Data systems both as a developer and as a user. I am passionate about building scalable and reliable distributed systems.

## Work Experience

### Software Engineer @ Dremio
Jan 2023 – Present | Santa Clara County, California, United States

### Software Engineer @ Meta
Jan 2019 – Jan 2023 | Menlo Park, California, United States

At Meta, I initially worked on building the data platform for Integrity-related products. I currently work on the Spark team, where my focus has been on query planning, optimization, and platform reliability. Highlighted projects include:

- Row-Level Access Permissions for Spark: Several data-security use cases within Meta required that restricted users get access to only a subset of rows in warehouse tables. When users submit queries against a whole warehouse table, access-restricted rows should be filtered out automatically. As a solution, we built a system that rewrites user queries, reports to users that their queries were rewritten, and prevents users from bypassing the service when running queries.
- Community Integrity Data Platform: My team owned the data platform and products for appeals, penalties, and communication products within CI. We developed the data platform for A/B-testing metrics (Deltoid) and user-behavior analysis. Insights derived from the platform are used to understand violator behavior and the effect of bad content on Facebook users.

Tooling: Java, Scala, Presto, Spark, Python, Dataswarm

### Software Engineer @ HERE Technologies
Jan 2016 – Jan 2019 | Berlin Area, Germany

I was a member of the Search Analytics team.
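The row-level access rewriting described in the Meta role above can be sketched very roughly as follows. This is a minimal illustration only: the table name, policy table, and regex-based string rewrite are all hypothetical, and the real system operated inside the query engine rather than on SQL text.

```python
import re

# Hypothetical row-level policy table: maps a warehouse table name to the
# filter predicate that restricted users are allowed to see through.
ROW_POLICIES = {
    "warehouse.events": "region = 'EU'",
}

def rewrite_query(sql: str) -> tuple[str, bool]:
    """Rewrite a query so reads from a policed table go through a filtered
    subquery. Returns (rewritten_sql, was_rewritten) so the platform can
    report the rewrite back to the user, as described above."""
    rewritten = sql
    for table, predicate in ROW_POLICIES.items():
        pattern = re.compile(rf"\b{re.escape(table)}\b")
        replacement = f"(SELECT * FROM {table} WHERE {predicate})"
        rewritten = pattern.sub(replacement, rewritten)
    return rewritten, rewritten != sql

query = "SELECT count(*) FROM warehouse.events"
new_query, changed = rewrite_query(query)
```

The key property, preserved in this sketch, is that filtering is transparent: users submit queries against the full table name and the restriction is injected for them.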
We were responsible for building tools to help analyse search performance. I worked on building and tuning log aggregation pipelines to make log analysis easy, and on building data visualisation tools backed by Kibana. I also contributed to the Maps Search Ranking engine, which is responsible for showing the most relevant search results on top.

Tooling: Java, Scala, Hadoop, Hive, Spark, Pig, AWS

### Data Infrastructure Engineer @ Rocket Internet SE
Jan 2015 – Jan 2016 | Berlin Area, Germany

Rocket Internet is a Berlin-based startup incubator which builds and funds Internet startups in emerging markets. I was a member of the Analytics team in the Online Marketing group. I worked in the following areas:

- Advanced reports for AdWords: I built advanced reports, based on BigQuery data and the AdWords API, that are not available through AdWords reporting. One of them was an ad-group performance report, in which we evaluated CIRs of ad groups/keywords based on custom attribution models.
- Reporting for Rocket Advertising: We were building a programmatic bidding platform to show ads on external DSPs. I worked on aggregating performance logs and creating reports from them, and on creating user segments for targeting.
- Facebook campaigns reporting: I contributed to a Facebook campaigns reporting tool which aggregates performance data about various campaigns over time and reports it at a more granular level than Facebook does.

Tooling: Hive, BigQuery, Python, Pandas, Ansible, Docker, AWS

### Software Engineer @ Vizury Interactive Systems Pvt Ltd
Jan 2013 – Jan 2015 | Bangalore

Vizury is an Indian startup in online advertising. Its primary business was behavioural retargeting for e-commerce websites (now a DMP). I was a developer on the Analytics team, mainly responsible for performance and usability improvements in Hive. We built a custom distribution of Hadoop hosted on AWS.
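The custom attribution models mentioned in the AdWords reporting work above can be illustrated with a toy example. The model names, data shape, and numbers here are purely hypothetical; the real reports ran over BigQuery data, not in-memory Python.

```python
from collections import defaultdict

def attribute_conversions(paths, model="linear"):
    """Distribute conversion credit across the ad groups on each
    conversion path, under a pluggable attribution model.

    paths: list of (adgroup_sequence, conversions) tuples, where
    adgroup_sequence is the ordered list of ad groups a user touched.
    """
    credit = defaultdict(float)
    for adgroups, conversions in paths:
        if model == "last_click":
            # All credit goes to the final touchpoint.
            credit[adgroups[-1]] += conversions
        elif model == "linear":
            # Credit is split evenly across every touchpoint.
            for ag in adgroups:
                credit[ag] += conversions / len(adgroups)
    return dict(credit)

# Two illustrative conversion paths.
paths = [(["brand", "generic"], 10), (["generic"], 5)]
```

Swapping the model changes how the same raw performance data is credited, which is exactly why a custom report can disagree with the stock AdWords numbers.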
I worked in the following areas:

- Hive+Spark as a hybrid querying platform: At Vizury, we used Hive to run batch analytics queries on Spark. At that time, Spark could perform join operations only in-memory, which caused reliability issues for complex joins. As a solution, we built a Hive query analyzer that detects the complexity of the join operations: queries with lightweight joins run on Spark, and complex ones on MapReduce. This query redirection was automatic and adaptive; end users only had to submit valid Hive queries to the platform, and query analysis and execution-engine decisions were made on the fly. As a result, smaller queries completed very quickly in-memory, while complex queries ran more slowly but eventually succeeded.
- Hive+Spark cluster deployments with autoscaling: Vizury provided Hive-as-a-service to its data-analyst userbase. I worked on setting up the cluster from scratch, and implemented a custom autoscaling layer to quickly scale the cluster up based on demand. We would autoscale from a few tens of machines to hundreds, which reduced the dollar cost of our AWS infrastructure by about 66%.
- Optimising Pig/Hive queries: I optimised Pig/Hive scripts that were frequently submitted by users, and implemented Hive-based ETLs.
- Front-end application for Hive: I contributed a few features to the front-end application as well, including Google sign-in integration, API calls, per-user query-results history, query logs, and output previews.

Tooling: Hadoop, Hive, Pig, Spark, Java, Python, AWS

### Software Engineer @ Adobe Systems
Jan 2011 – Jan 2013 | Noida Area, India

Adobe is one of the biggest software companies, building products mainly in the Creative and Marketing domains. I worked as a developer on the Adobe Story product, a web-based script-writing tool for TV/film script writers.
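The engine-routing decision in the hybrid Hive+Spark platform described in the Vizury role above can be sketched as a simple classifier. The threshold and the size-based heuristic are illustrative assumptions; the actual analyzer inspected the Hive query plan rather than raw input sizes.

```python
# Hypothetical threshold: total join input size (in GB) above which an
# in-memory join was considered unreliable at the time.
IN_MEMORY_JOIN_LIMIT_GB = 10

def choose_engine(join_input_sizes_gb):
    """Route a Hive query to Spark when its joins are light enough to run
    in-memory, otherwise fall back to MapReduce, which is slower but
    eventually succeeds on complex joins."""
    if not join_input_sizes_gb:
        return "spark"  # no joins at all: safe to run in-memory
    if sum(join_input_sizes_gb) <= IN_MEMORY_JOIN_LIMIT_GB:
        return "spark"
    return "mapreduce"
```

Because the decision is made per query at submission time, the redirection stays invisible to users: they submit plain Hive queries and the platform picks the engine on the fly.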
It allows collaborative writing and document sharing among authors. I worked in the following product areas:

- Log generation and analysis framework: I built a framework on top of Amazon SQS to enable logging on EC2 servers. Each server call generated several log messages, which were posted to SQS at the end of the call. Using SQS offered reliability, scalability, and near-real-time log propagation. An offline log collector periodically collected these logs. For log analysis, we used Splunk as well as MongoDB for several use cases.
- Client- and server-side development: I implemented several client- and server-side features, such as schedule workspaces, a story-order view in schedules, an XHTML document exporter, and a report-generator UI. These features helped users interact with documents more efficiently and improved the UI.
- Isolating and removing memory leaks: The Story client application is a Flash app written in ActionScript. Some programming patterns in ActionScript leak memory at run-time. I was responsible for profiling the application and fixing all such issues.

Tooling: Flex, PHP, Javascript, C++, AWS

## Education

### ME in Computer Science and Automation
Indian Institute of Science (IISc)

### BE in Information Technology
Savitribai Phule Pune University

## Contact & Social

- LinkedIn: https://linkedin.com/in/saum
- Portfolio: http://clweb.csa.iisc.ernet.in/saumitra.shahapure
- Portfolio: https://story.adobe.com

---

Source: https://flows.cv/saumitrashahapure
JSON Resume: https://flows.cv/saumitrashahapure/resume.json
Last updated: 2026-04-11