•Migrated and evolved a manual on-premise data processing tool into a cloud-based automated data pipeline.
Designed a generic ETL pipeline that ingests raw healthcare data with differing schemas and converts it to a standard format, using Spark 2.4 and Scala.
Introduced Apache Airflow as the workflow scheduling tool and brought it to production on AWS EMR.
Migrated the Lumiata ETL pipeline, covering ~40 million patient healthcare records, from on-premise to AWS EMR and GCP Dataproc.
Transformed ~40 million raw patient healthcare records from CSV format to a standard healthcare data format (HAPI FHIR) and imported the standardized output into BigQuery.
•Built a generic PySpark application to generate summary reports for ~40 million raw patient healthcare records. Validated the raw data against expected data types and generated statistics reports for all values.
•Built integration tests for the Lumiata ETL pipeline. Generated statistics reports for both the raw data and the standardized healthcare data to test the ETL.
•Verified and debugged data using Jupyter Notebook.