•Designed and deployed a stand-alone Machine Learning framework in Python for automated root-cause categorization of workflow failures for the ‘Jupyter Notebooks as a Service’ team in the Amazon Cloud Machine Learning Platform (AWS SageMaker)
•Developed a dataset management system on AWS S3, integrated with the classifier, with version control for continuous updates of training data, making it scalable, fault-tolerant, cost-effective, and easy to manage and use
•Led the design and development workflow (Agile Scrum) of the entire framework, the first cross-disciplinary effort to bring Machine Learning solutions to the internal operations repo in a distributed computing environment (AWS EC2)
•Built CLI tools for data retrieval and updates, classifier training, and region-based statistics report generation; added 5 classification models to the classifier, which returns the top 2 root causes with probability scores, achieving 91% accuracy
•Performed end-to-end unit and integration testing using Pytest, and conducted code reviews using the Amazon CRUX tool