Sunnyvale, California, United States
* Architected a KPI analytics pipeline using AWS Glue, Redshift, and QuickSight, publishing platform metrics and saving $100K+ in cost through data-driven insights.
* Engineered a filesystem monitoring service, providing a dashboard and automated alerts that empowered admins to optimize storage utilization and reduce costs by $120K per year.
* Developed an AI-driven agent integrated with MCP servers to analyze HPC job data, reducing SLA incidents by 20% and boosting productivity through automated recommendations.
* Extended on-demand DCV based virtual desktop service by implementing idle-detection and automated termination workflows, notifying users of unused sessions and saving $50K+ annually in compute costs.
* Architected multi-region support for the HPC platform, empowering users to provision jobs in optimal regions and achieve significant cost savings through compute-aware allocation
* Architected a comprehensive audit trail system capturing user and system-initiated changes across the platform, providing admins with a clear, time-ordered view of state transitions to accelerate debugging and incident resolution.
* Led and owned the end-to-end release management lifecycle, coordinating cross-functional teams to build, test, and roll out images while ensuring seamless upgrades across production clusters.