# Sandish Kumar H N

> Staff Software Engineer (ML, Distributed Systems) at Hewlett Packard Enterprise

Location: United States
Profile: https://flows.cv/sandish

I'm a Senior Staff Software Engineer with 12+ years of experience building products across a range of business domains, focusing on distributed systems, cloud platforms, open source, machine learning, data science, and data mining. I bring expertise in design, problem-solving, debugging, and requirements analysis in pursuit of software performance and efficiency. My approach is results-oriented and hands-on, managing resources and time constraints effectively to deliver the best solutions.

- Languages: Proficient in Java, Scala, and Python (have also worked with Go and Rust)
- Technologies: Apache Spark (Core, SQL, Streaming), Apache Kafka (Core, Streams), Kubernetes, Hadoop (HDFS, YARN, MR), NiFi
- Columnar Databases: Cassandra, Kudu, HBase
- Data Warehouse/MPP: Apache Pinot, Druid, Presto, Athena, Apache Hive, Impala, Redshift
- Cloud Platforms: Amazon Web Services (EC2, S3, CloudFormation, EMR, Redshift, Kinesis, Glue, API Gateway, VPN, IAM, Athena), Google Cloud Platform (Cloud Storage, BigQuery, Compute Engine, Container Services)
- Observability and Metrics: Prometheus, Grafana, PagerDuty
- Container Services: Docker, Kubernetes
- Development Environment Tools: Jenkins, Gerrit, Git
- Workflow: Airflow, Oozie, Rundeck
- Search: Lucene, Elasticsearch, Solr
- Data Formats: Protobuf, Avro, Parquet, JSON, CSV, XML
- Others: MySQL, Bash, JavaScript, Sentry, Kerberos, SSL/SASL, LDAP/AD

NOTE: Expertise in most of the technologies listed above was built over the course of 12+ years of work.

## Work Experience

### Staff Software Engineer (ML, Distributed Systems) @ Hewlett Packard Enterprise

Jan 2019 – Present | San Francisco Bay Area

- Designing, developing, and deploying scalable machine learning models in production (Airflow, Spark, Python, Kubernetes, MLflow)
- Building IoT products on the cloud with cloud-native support (Kubernetes, Docker, Jenkins, Spinnaker)
- Building data pipelines using Airflow, Spark, and Kafka

Stack: Spark, Kafka (Kafka Streams), Airflow, Kubernetes, Cloud, Java, Scala, Python, Go, GraphQL, Protobuf, Cassandra, Elasticsearch, CI/CD

### Apache Spark Contributor @ The Apache Software Foundation

Jan 2016 – Jan 2022 | San Francisco Bay Area

Designed and developed spark-protobuf (EPIC: https://issues.apache.org/jira/browse/SPARK-40653):

- Designed and implemented spark-protobuf support from the ground up
- Added from_protobuf and to_protobuf functions in Scala, Java, and PySpark (see the sketch after this section)
- Migrated spark-protobuf exceptions to use the new error classes
- Worked on bug fixes, unit tests, and documentation

PRs:

- https://github.com/apache/spark/pull/37972
- https://github.com/apache/spark/pull/38212
- https://github.com/apache/spark/pull/38344
- https://github.com/apache/spark/pull/38515
- https://github.com/apache/spark/pull/38603
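A minimal sketch of how the from_protobuf/to_protobuf pair is used from Scala; the Kafka topic, message name, and descriptor-file path below are hypothetical stand-ins:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.{from_protobuf, to_protobuf}

object SparkProtobufSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-protobuf-sketch").getOrCreate()

    // Hypothetical Kafka topic whose binary `value` column carries serialized protobuf messages.
    val raw = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // Decode the payload into a struct column, using a descriptor file
    // compiled ahead of time with `protoc --descriptor_set_out=...`.
    val parsed = raw.select(
      from_protobuf(col("value"), "Event", "/tmp/event.desc").as("event")
    )

    // Re-encode the struct back to protobuf bytes, e.g. before writing downstream.
    val reencoded = parsed.select(
      to_protobuf(col("event"), "Event", "/tmp/event.desc").as("value")
    )
    reencoded.printSchema()
  }
}
```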
### Apache Pinot Contributor @ The Apache Software Foundation

Jan 2016 – Jan 2022 | San Francisco Bay Area

- Designed and implemented scalar functions for array and string types: array_reverse_int, array_reverse_string, array_sort_int, array_sort_string, array_index_of_int, array_index_of_string, array_contains_int, array_contains_string, strpos, strrpos, to_utf8, normalize, split, replace, hammingDistance, array_concat_int, array_union_string, array_union_int, array_remove_int, array_remove_string, array_distinct_int, array_distinct_string, array_slice_int, array_slice_string
- https://github.com/apache/pinot/pull/6446

### Apache Druid Contributor @ The Apache Software Foundation

Jan 2016 – Jan 2022

- Worked on bug fixes such as style checks and nit checks
- Made changes to use only Google's GuardedBy annotation instead of the javax one
- Raised the test timeout for SegmentManagerThreadSafetyTest
- Added a StructuralSearchInspection prohibiting use of Thread.getState()

PRs:

- https://github.com/apache/druid/pull/8394
- https://github.com/apache/druid/pull/8386
- https://github.com/apache/druid/pull/8060
- https://github.com/apache/druid/pull/7889
- https://github.com/apache/druid/pull/7890

### Apache Kudu Contributor @ The Apache Software Foundation

Jan 2017 – Jan 2020

- Designed and implemented a Helm chart for building an Apache Kudu Kubernetes cluster (cloud-native support)
- Added a Kudu Kubernetes StatefulSet manifest and worked on Docker image changes
- Implemented Kudu client tools for Hadoop and Spark import/export (CSV, Parquet, Avro); see the sketch after this section
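A minimal sketch of what such a Spark-side Kudu import/export looks like with kudu-spark's KuduContext; the master address, table name, and file paths here are hypothetical:

```scala
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession

object KuduImportExportSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kudu-import-export-sketch").getOrCreate()

    val kuduMasters = "kudu-master-0:7051" // hypothetical master address
    val kuduContext = new KuduContext(kuduMasters, spark.sparkContext)

    // Import: read a headered CSV file and upsert the rows into an existing Kudu table.
    val csv = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/in/users.csv")
    kuduContext.upsertRows(csv, "impala::default.users")

    // Export: scan the Kudu table back out and write it as Parquet.
    val table = spark.read
      .format("kudu")
      .option("kudu.master", kuduMasters)
      .option("kudu.table", "impala::default.users")
      .load()
    table.write.mode("overwrite").parquet("/data/out/users_parquet")
  }
}
```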
### Apache NiFi Contributor @ The Apache Software Foundation

Jan 2017 – Jan 2020

- Designed and implemented Kudu Put operations support for NiFi
- Made changes to improve the NiFi HBase scan processor
- Added a Cassandra connection compression option, i.e. the capability to activate compression on Cassandra connections

### Apache Sqoop Contributor @ The Apache Software Foundation

Jan 2016 – Jan 2018

- Designed and implemented incremental merging for the Parquet file format
- Worked on bug fixes, nits, and unit tests

### Senior Software Engineer @ phData

Jan 2016 – Jan 2019 | Greater Minneapolis-St. Paul Area

- Identified and resolved performance issues in long-running Spark applications by analyzing thread dumps, memory usage, and garbage-collection metrics; tuned garbage collectors and memory settings for faster performance.
- Developed a Spark application that reduced Solr index backup time from 30 days to under 2 hours.
- Designed, architected, and implemented a low-latency disaster-recovery solution, evaluating active-active and active-passive clustering architectures with Kafka and Spark; reduced donor matching time from 23 minutes to a few seconds.
- Collaborated with the Spark and Kudu open-source communities to address project-related issues; helped junior and senior engineers adopt distributed systems such as Spark, Kafka, Kudu, and Impala organization-wide.
- Worked directly with high-profile customers as a big data engineer across the big data ecosystem.
- Architected and deployed a production web-tracking application using AWS API Gateway, Kinesis, StreamSets, and Kudu to track customer website pages; developed AWS CloudFormation scripts for automated deployment on AWS.
- Used AWS Glue for Apache Spark ETL, handling data formats such as Avro and Parquet to process PB-scale data and store results in S3 for Athena analytics.
- Developed Apache Kudu client tools for data import and export with support for formats such as CSV, Parquet, and Avro, enabling seamless integration with Apache Hadoop and Apache Spark; committed to Cloudera's open-source account.
- Conducted multiple POCs with Hadoop, Spark, Spark Streaming, Kafka, Apache Kudu, Hive, and Impala; implemented Sqoop, Impala, and Kudu workflows in Apache Airflow for data ingestion from relational databases into the Hadoop ecosystem.

### Senior Software Engineer @ McAfee

Jan 2015 – Jan 2016 | Bangalore Area, India

As an R&D engineer, I collaborated with the business analyst team to gain insights from the company's internal software data. I developed numerous big data pipelines to process vast amounts of McAfee security data, storing it in analytics databases for informed business decisions. I also created multiple Spark Core, SQL, Streaming, and Kafka applications that extracted data from various darknet sites, validating, normalizing, and enriching it before storing it in columnar databases (sketched below). In addition, I built APIs on Cassandra and HBase, enabling seamless data transfer to web application dashboards and BI analytics.
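A minimal sketch of that kind of validate/normalize/enrich pipeline using Spark Structured Streaming over Kafka; the topic name, payload schema, and sink path are hypothetical stand-ins (the production pipelines wrote to columnar stores such as Cassandra and HBase):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json, lower, trim}
import org.apache.spark.sql.types.{StringType, StructType}

object StreamingPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-enrichment-sketch").getOrCreate()

    // Hypothetical shape of the incoming JSON events.
    val schema = new StructType()
      .add("source", StringType)
      .add("indicator", StringType)

    val enriched = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "raw-events") // hypothetical topic
      .load()
      .select(from_json(col("value").cast("string"), schema).as("e"))
      .filter(col("e.indicator").isNotNull) // validate: drop malformed records
      .select(
        trim(lower(col("e.source"))).as("source"), // normalize
        col("e.indicator").as("indicator")
      )

    // Stand-in sink; the real jobs targeted columnar databases instead.
    enriched.writeStream
      .format("parquet")
      .option("path", "/data/out/enriched")
      .option("checkpointLocation", "/data/chk/enriched")
      .start()
      .awaitTermination()
  }
}
```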
### Senior Software Engineer @ ThirdEye Data

Jan 2013 – Jan 2015

- Worked on numerous big data projects for clients such as Intel, Hortonworks, GridGain, and Microsoft. One notable accomplishment was developing Yardstick benchmarks for Apache Spark, a benchmark suite written on the Yardstick framework to assess Spark's performance. Also conducted a comparative analysis of big data analytical tools, evaluating Hive, Hive on Tez, Impala, Spark SQL, Apache Drill, BigQuery, and PrestoDB running on Google Cloud and AWS.
- Created a real-time truck-events analysis system using Kafka, Storm, HDFS, and Hive. For the mobile app "Cymbal," designed to enhance business sales and offer targeted deals, built Spring-based REST APIs on HBase. Leveraged Lucene, Solr, the ELK stack, and HBase + MapReduce to build a fashion-oriented search engine for the client Obsessory.
- Extracted affiliate data feeds from sources such as Rakuten, CJ, and Webgains, handling pre-processing of raw data, data indexing, color extraction from images, and post-processing for analytics, optimizing the entire data analysis pipeline.

### Senior Software Engineer @ Positive Bioscience

Jan 2013 – Jan 2014 | Mumbai Area, India

- Designed and developed software for bioinformatics and next-generation sequencing (NGS) using the Hadoop MapReduce framework and MongoDB on Amazon S3, Amazon EC2, and Amazon Elastic MapReduce (EMR).
- Created a custom Hadoop MapReduce program for quality checks on genomic data, with automatic handling of file-format and sequencing-machine errors and platform-agnostic support for various input formats (Illumina, 454 Roche, Complete Genomics, ABI SOLiD).
- Developed a Hadoop MapReduce program for sequence alignment, implementing the Burrows-Wheeler Transform (BWT), the Ferragina-Manzini index (FMI), and the Smith-Waterman dynamic programming algorithm using the Hadoop distributed cache.
- Configured and ran all MapReduce programs on 20-30 node clusters of Amazon EC2 spot instances with Apache Hadoop 1.4.0 to handle large volumes of NGS genomics data.
- Set up a 20-30 node Hadoop cluster on Amazon EC2 spot instances to move data between Amazon S3 and HDFS, and used S3 for input and output while running Hadoop MapReduce programs on the Amazon Elastic MapReduce framework.
- Developed Java RESTful web services for uploading data to Amazon S3, listing S3 objects, and file manipulation; these services enabled quality checks, sequence alignment, SNP calling, and SV/CNV detection on single-end/paired-end NGS data through MapReduce programs.
- Designed and migrated a relational (SQL) database to a NoSQL Cassandra database for better efficiency and scalability.

### Software Developer (Hadoop) @ PointCross

Jan 2011 – Jan 2013 | Bangalore

- Implemented Java HBase MapReduce jobs to load raw seismic and sensor data into the HBase database on a 4-node Hadoop cluster.
- Developed a RESTful web service application that serves HBase values to the UI in XML and JSON response formats.
- Wrote separate Hive and Pig scripts to aggregate multiple drilling-parameter values on a 4-node cluster for downstream processing.
- Programmed Java MapReduce jobs to port indexed data from HBase to the Solr search engine, and implemented cross-connection settings between Hive and HBase.
- Installed and configured Hadoop 1.x, Hive, Pig, and HBase on a 4-node local cluster; customized OpenLayers maps to display KML information from the HBase database, including geolocation data for oil wells.

## Education

### Bachelor of Engineering in Computer Science

Visvesvaraya Technological University

## Contact & Social

- LinkedIn: https://linkedin.com/in/sandishkumar
- Portfolio: http://www.hadoop-sandy.blogspot.com

---

Source: https://flows.cv/sandish
JSON Resume: https://flows.cv/sandish/resume.json
Last updated: 2026-04-12