Description: The project involved setting up a highly available, scalable, centralized real-time data pipeline for data-center data; the data is used for analytics and reporting.
Technology Stack – Hortonworks HDP 2.x platform, Apache/Confluent Kafka, Apache Spark, AWS (S3, EC2), Hadoop, Prometheus, Grafana, Hive, Scala, Jenkins
Accomplishments/Responsibilities:
• As part of Phase 1: Developed data pipelines in Apache Spark and Scala moving data from Kafka to HDFS/Hive on the Hortonworks HDP 2.x platform (see the first sketch after this list).
• As part of Phase 2: Developed data pipelines in Apache Spark and Scala moving data from Kafka to AWS S3 (see the second sketch after this list).
• Confluent Kafka - Evaluated platform capabilities, published best practices, and implemented/optimized key Confluent modules: Control Center, Auto Data Balancer, Schema Registry, Replicator, Kafka Connect, and data-stream monitoring (a Schema Registry producer sketch follows this list).
• Evaluated and defined Hadoop security capabilities for HDP 2.x components: HDFS, Kafka, Hive, HBase, Spark, OpenTSDB, and Grafana.
• Implemented security for HDP 2.x; components implemented include the following (see the secured-client sketch after this list):
• Kerberos for authentication
• SSL/TLS for Confluent Kafka
• Apache Ranger for role-based access control
• Apache Knox for perimeter security
• Encryption at rest using Hadoop KMS
• Defined best practices for Hadoop governance, encompassing security, lifecycle management, data quality, metadata management, and operations & reporting.
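
A minimal sketch of the Phase 1 pipeline, assuming a Spark Structured Streaming job; the topic name (dc-metrics), broker addresses, and HDFS paths are illustrative placeholders, not the project's actual values:

```scala
import org.apache.spark.sql.SparkSession

object KafkaToHdfsPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-hdfs")
      .getOrCreate()

    // Subscribe to the raw event topic; brokers and topic are placeholders.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
      .option("subscribe", "dc-metrics")
      .option("startingOffsets", "latest")
      .load()
      .selectExpr("CAST(key AS STRING) AS key",
                  "CAST(value AS STRING) AS value",
                  "timestamp")

    // Land the stream as Parquet on HDFS; an external Hive table defined
    // over this path makes the landed data queryable.
    events.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/dc_metrics")
      .option("checkpointLocation", "hdfs:///checkpoints/dc_metrics")
      .start()
      .awaitTermination()
  }
}
```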
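For Phase 2, only the sink side changes. A sketch reusing the `events` stream above, assuming the s3a filesystem with static credentials (on EC2 an instance profile could supply these instead); the bucket name and environment-variable lookups are hypothetical:

```scala
// Reusing `spark` and `events` from the previous sketch; only the sink
// changes for Phase 2. Credentials here are placeholders from the environment.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
hc.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

events.writeStream
  .format("parquet")
  .option("path", "s3a://dc-metrics-bucket/raw/")
  .option("checkpointLocation", "s3a://dc-metrics-bucket/checkpoints/")
  .start()
  .awaitTermination()
```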
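To illustrate the Schema Registry integration, a sketch of an Avro producer using Confluent's KafkaAvroSerializer, which registers and validates schemas against the registry on send; the schema, topic, and registry URL are made up for the example:

```scala
import java.util.Properties
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object AvroProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    // Confluent serializer registers/validates schemas with Schema Registry.
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
    props.put("schema.registry.url", "http://schema-registry:8081")

    // A hypothetical record schema for data-center metrics.
    val schema = new Schema.Parser().parse(
      """{"type": "record", "name": "Metric", "fields": [
        |  {"name": "host", "type": "string"},
        |  {"name": "value", "type": "double"}
        |]}""".stripMargin)

    val record = new GenericData.Record(schema)
    record.put("host", "dc1-node42")
    record.put("value", 0.73)

    val producer = new KafkaProducer[String, AnyRef](props)
    producer.send(new ProducerRecord[String, AnyRef]("dc-metrics", "dc1-node42", record))
    producer.close()
  }
}
```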
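The Kerberos and SSL work surfaces on the client side roughly as below: a sketch of secured Kafka client properties for a Kerberized, TLS-enabled cluster, with the keytab path, principal, truststore details, and topic all placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer

object SecureKafkaClient {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9093")
    props.put("group.id", "dc-metrics-readers")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    // Kerberos (SASL/GSSAPI) over TLS, matching the HDP security setup.
    props.put("security.protocol", "SASL_SSL")
    props.put("sasl.kerberos.service.name", "kafka")
    props.put("ssl.truststore.location", "/etc/security/kafka.client.truststore.jks")
    props.put("ssl.truststore.password", "changeit")
    // JAAS entry supplying the client keytab/principal; values are placeholders.
    props.put("sasl.jaas.config",
      "com.sun.security.auth.module.Krb5LoginModule required " +
        "useKeyTab=true keyTab=\"/etc/security/keytabs/app.keytab\" " +
        "principal=\"app@EXAMPLE.COM\";")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Collections.singletonList("dc-metrics"))
    // Poll loop elided; this sketch only shows the security configuration.
    consumer.close()
  }
}
```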