•Developed and deployed a project that automatically raises a daily alert for unexpected data on the cluster, reducing the manual effort of detecting and handling non-compliant data by 80%.
Technologies: Python, Hadoop, HIVE, Unix, Cron, MySQL
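A daily non-compliance check like the one above could be sketched as follows; the function names, the `status` field, and the allowed-status policy are illustrative assumptions (in the actual project the rows would come from Hive/MySQL and the script would run under cron):

```python
# Minimal sketch of a daily non-compliance check, assuming rows have
# already been pulled from a Hive table into a list of dicts; in the
# real project this would be scheduled via cron.

ALLOWED_STATUSES = {"compliant", "whitelisted"}  # hypothetical policy


def find_non_compliant(rows):
    """Return rows whose status falls outside the allowed set."""
    return [r for r in rows if r.get("status") not in ALLOWED_STATUSES]


def build_alert(rows):
    """Format an alert message for the rows that failed the check."""
    bad = find_non_compliant(rows)
    if not bad:
        return None  # nothing to report today
    ids = ", ".join(str(r["id"]) for r in bad)
    return f"ALERT: {len(bad)} non-compliant record(s): {ids}"
```

For example, `build_alert([{"id": 2, "status": "unknown"}])` produces an alert naming record 2, while a batch of compliant rows produces `None` and no alert is sent.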
•Researched data governance solutions that AT&T can leverage to ensure that data meets compliance and privacy requirements
•Researched blockchain technology for creating a data audit trail
•Working on a solution to analyze and visualize data in order to identify areas of profitability:
1. Data Ingest
•Ingest data from different file formats into HDFS
•Load data into and out of HDFS using the Hadoop File System commands
2. Transform, Stage, and Store
•Convert a set of data values stored in HDFS in a given format into new data values or a new data format and write them back to HDFS
•Load data from HDFS into RDDs for use in Spark applications
•Read and write files in a variety of file formats
•Perform standard extract, transform, load (ETL) processes on data
3. Data Analysis using Spark SQL
•Query DataFrames in Spark
•Write queries that calculate aggregate statistics
•Join disparate datasets using Spark
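The three steps above can be sketched end to end in PySpark; the project used Scala, so the paths, table names, and column names below are illustrative assumptions, not taken from the entry:

```python
# PySpark sketch of the ingest -> transform -> analyze flow:
# paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("profitability-analysis").getOrCreate()

# 1. Ingest: read raw CSV files that were loaded into HDFS
#    (e.g. via `hdfs dfs -put`).
orders = spark.read.csv("hdfs:///data/raw/orders",
                        header=True, inferSchema=True)

# 2. Transform, stage, and store: convert to a new format (Parquet)
#    and write it back into HDFS.
orders.write.mode("overwrite").parquet("hdfs:///data/staged/orders")

# 3. Analyze with Spark SQL: join disparate datasets and compute
#    aggregate statistics on the joined DataFrame.
regions = spark.read.parquet("hdfs:///data/staged/regions")
profit_by_region = (
    orders.join(regions, on="region_id")
          .groupBy("region_name")
          .agg(F.sum("revenue").alias("total_revenue"),
               F.avg("margin").alias("avg_margin"))
)
profit_by_region.show()
```

This mirrors the listed steps directly: `spark.read` covers ingestion, the Parquet write covers the format conversion back into HDFS, and the `join`/`groupBy`/`agg` chain covers the Spark SQL analysis.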
•Built a text-analysis solution to identify and categorize costs as spend or repair
•Developed an ingestion pipeline to pull data from an RDBMS, transform it, and automatically generate a matrix every month
Technologies: Spark, SparkSQL, Scala, Hadoop, HIVE, Pig, Solr, Banana, Sqoop
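The spend-vs-repair categorization above could be sketched as a keyword match over cost descriptions; the keyword list and function name are assumptions for illustration (the actual solution indexed and searched the text with Solr):

```python
# Illustrative sketch of tagging cost descriptions as "spend" or
# "repair" via keyword matching; REPAIR_TERMS is a hypothetical list.

REPAIR_TERMS = {"repair", "fix", "replace", "maintenance"}


def categorize(description: str) -> str:
    """Tag a cost line item as 'repair' if it mentions repair work,
    otherwise as 'spend'."""
    words = set(description.lower().split())
    return "repair" if words & REPAIR_TERMS else "spend"
```

For example, `categorize("replace damaged transformer")` yields `"repair"`, while `categorize("office supplies purchase")` yields `"spend"`.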