Served as a core contributor to Cloudera’s Site Reliability Engineering team, owning reliability, automation, and operational excellence for production Kubernetes platforms.
– Designed and automated incident response workflows on AWS using Terraform (IaC), enabling repeatable, auditableninfrastructure recovery during production incidents.
– Operated and optimized large-scale EKS Kubernetes clusters, handling cluster upgrades, node group lifecycle management, autoscaling, and CVE remediation with zero-downtime strategies.
– Led Helm-based application deployments, standardizing release processes and reducing configuration drift across environments.
– Administered and upgraded the Istio service mesh, including mTLS enforcement, traffic routing, canary rollouts, and sidecar lifecycle management to improve security and resilience.
– Improved platform reliability by automating CI/CD pipelines (GitLab CI, Helm packaging, container builds), reducing manual intervention and deployment-related failures.
– Performed production upgrades and patching for Istio, Datadog agents, and container images, proactively addressing security vulnerabilities and stability risks.
– Actively participated in on-call rotations, diagnosing distributed system failures across Kubernetes, networking, and cloud infrastructure, and driving root cause analysis (RCA) for post-incident reviews.