Experience

NCR Corporation, Software Engineer / Data Engineer

2020 — Now

Redwood City, California, United States

Build a streaming API that allows third-party financial institutions to consume real-time banking activity data through the Apigee API service using Pub/Sub.

Migrate the existing MapReduce-based on-premises data pipelines and platform to streaming data pipelines on GCP using Apache Beam.

Build and deploy streaming Dataflow pipelines that process ~2k syslog messages per second. The pipelines consume data from Pub/Sub and Firestore, transform the syslog data (filtering, validation, data replay, de-duplication, and grouping) into banking activity data in Avro format, and ingest it into BigQuery and Bigtable, which allows third-party financial institutions to query the real-time data through the Apigee API service.

Build and deploy batch Dataflow pipelines that read data from BigQuery, transform it (filtering, validation, and grouping), and generate daily reports for third-party financial institutions.

Introduce and deploy Apache Airflow on Google Cloud Composer as the workflow scheduling tool, together with Cloud Functions, to run the daily and hourly batch Dataflow pipelines that generate reports.

Lumiata, Software Engineer

2017 — 2019

San Mateo, CA

Migrated and evolved a manual on-premises data processing tool into a cloud-based automated data pipeline.

Designed a generic ETL pipeline that receives raw healthcare data in different schemas and converts it to the standard format using Spark 2.4 and Scala.

Introduced Apache Airflow as the workflow scheduling tool and brought it to production on AWS EMR.

Migrated the Lumiata ETL pipeline, covering ~40 million patient healthcare records, from on-premises to AWS EMR and GCP Dataproc.

Transformed ~40 million raw patient healthcare records from CSV format to a standard healthcare data format (HAPI FHIR) and imported the standard output into BigQuery.

Built a generic PySpark application to generate summary reports for ~40 million raw patient healthcare records. Validated the raw healthcare data by data type and generated a statistics report for all values.

Built integration tests for the Lumiata ETL pipeline. Generated statistics reports for both the raw data and the standard healthcare data to test the ETL.

Verified and debugged data using Jupyter Notebook.