# Data Processing Framework based on Apache Spark, Kafka, ZooKeeper, YARN, AWS EMR, and AWS S3
•Developed and tested a highly configurable Apache Spark-based ETL framework for data processing
•Automated complex data processing workflows and operationalized pipelines using Kafka, ZooKeeper, Jenkins, and EMR
•Re-architected and optimized the release process for the offline data processing pipelines, cutting software and configuration deployment time by 75%
•Prototyped an Apache Spark-based streaming pipeline for real-time incremental updates
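The configurable-framework idea above can be sketched as a configuration-driven stage pipeline. This is a minimal illustration with hypothetical stage names; a real Spark framework would chain DataFrame transformations rather than operate on Python lists:

```python
from typing import Callable, Dict, List

# Registry of named transforms; in the actual Spark framework each stage
# would take and return a DataFrame instead of a list of records.
TRANSFORMS: Dict[str, Callable[[List[dict]], List[dict]]] = {
    "drop_nulls": lambda rows: [r for r in rows
                                if all(v is not None for v in r.values())],
    "uppercase_name": lambda rows: [{**r, "name": r["name"].upper()}
                                    for r in rows],
}

def run_pipeline(config: List[str], rows: List[dict]) -> List[dict]:
    """Apply the configured stages in the order given by the config."""
    for stage in config:
        rows = TRANSFORMS[stage](rows)
    return rows

records = [{"name": "alice"}, {"name": None}]
result = run_pipeline(["drop_nulls", "uppercase_name"], records)
# result == [{"name": "ALICE"}]
```

Keeping the stage order in configuration rather than code is what makes such a framework "highly configurable": new pipelines are composed without redeploying the engine.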
# Web Service for EMR Cluster Management and YARN Job Submission
•Developed a RESTful web service for centralized management of all deployed AWS EMR clusters
•Implemented the frontend and activity dashboard for this web service using Flintjs & JavaScript
•Implemented single-click YARN job submission and tracking using predefined job templates and job submission history
•Deployed the distributed software stack using Docker, Jetty, AWS RDS, AWS EC2, and AWS EMR
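Template-based submission can be sketched as follows. All names here (template, class, jar path) are hypothetical; the point is that a template captures the fixed `spark-submit` arguments so a single click only supplies per-run parameters, with each rendered command appended to a history:

```python
from datetime import datetime, timezone

# Hypothetical predefined job template: fixed arguments live here,
# so the user never retypes them.
JOB_TEMPLATES = {
    "daily-etl": {
        "class": "com.example.etl.Main",    # hypothetical entry point
        "jar": "s3://bucket/jobs/etl.jar",  # hypothetical artifact location
        "conf": {"spark.executor.memory": "4g"},
    },
}

submit_history = []  # record of every rendered submission

def render_submit_command(template_name: str, args: list) -> str:
    """Fill a template with per-run args and log it to the submit history."""
    t = JOB_TEMPLATES[template_name]
    confs = " ".join(f"--conf {k}={v}" for k, v in t["conf"].items())
    cmd = (f"spark-submit --master yarn --class {t['class']} "
           f"{confs} {t['jar']} " + " ".join(args))
    submit_history.append({"template": template_name, "cmd": cmd,
                           "at": datetime.now(timezone.utc).isoformat()})
    return cmd

cmd = render_submit_command("daily-etl", ["--date", "2024-01-01"])
```

A real service would hand the rendered command to the EMR/YARN APIs and poll the application ID for tracking; the history list is what backs the "job submit history" feature.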
# Content Discovery and Publishing Tool
•Architected and developed a scraper for AWS S3/Aliyun OSS that discovers content via topological folder traversal
•Prototyped a content injection tool that leverages AWS SNS and SQS for reliable, seamless content discovery
•Operationalized the software using Jenkins, ZooKeeper, and Kafka
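The folder-based discovery above can be sketched with an in-memory key list standing in for real S3/OSS listing calls (the bucket layout and `manifest.json` marker are assumptions). Folders are walked breadth-first from a root prefix, the way `ListObjects` with a `/` delimiter pages through common prefixes:

```python
from collections import deque

KEYS = [  # hypothetical bucket contents
    "content/movies/a/manifest.json",
    "content/movies/a/video.mp4",
    "content/shows/b/manifest.json",
]

def list_folder(prefix: str):
    """Emulate a delimiter listing: immediate sub-folders and files under prefix."""
    subfolders, files = set(), []
    for key in KEYS:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if "/" in rest:
            subfolders.add(prefix + rest.split("/", 1)[0] + "/")
        else:
            files.append(key)
    return sorted(subfolders), files

def discover(root: str, marker: str = "manifest.json"):
    """Breadth-first walk; a folder containing the marker file counts as content."""
    found, queue = [], deque([root])
    while queue:
        folder = queue.popleft()
        subfolders, files = list_folder(folder)
        if any(f.endswith(marker) for f in files):
            found.append(folder)
        queue.extend(subfolders)
    return found

content_folders = discover("content/")
# content_folders == ["content/movies/a/", "content/shows/b/"]
```

The SNS/SQS prototype replaces this polling walk with event-driven discovery: the store publishes object-created notifications, so new content surfaces without re-scanning the hierarchy.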
Technologies used: Apache Spark, Apache Kafka, Apache ZooKeeper, AWS EMR, AWS S3, AWS RDS, Flintjs, Bash, Jenkins, Scrum, Git/gitflow