Senior Software Engineer @ Oracle Cloud AI Platform team | AI/ML Training Infra using ML Pipelines | Kubernetes Controllers and Multi-Homed Networking
Senior Engineer specializing in cloud-native distributed systems, Kubernetes networking, and AI platform architecture, with 9+ years of experience across Oracle OCI, Itential, and TransUnion (Data Science).
Designed and implemented K8s data plane and networking infrastructure for OCI’s multi-tenant AI training platform.
Implemented OSS mounts in Kubernetes using an Rclone sidecar and custom FUSE mounts, enabling ML workloads to checkpoint their distributed batch training jobs.
Built custom multi-homed networking for large-scale training and inference workloads using Multus CNI and contributed to the development of a custom IPAM solution to enable scalable, cloud-native pod networking.
ML Pipelines
Tech lead for ML Pipelines; shipped multiple features since its GA.
Designed and implemented a novel integration between our serverless Spark service and ML Pipelines using Kafka events and a cross-tenant rule that protects event delivery. The design lets customers keep their model artifact storage transient through a set of ephemeral tokens, bolstering Oracle’s stance on data protection.
Architected a multi-region distributed integration testing framework acting as a global canary system, proactively detecting failures and improving reliability across OCI realms.
Key contributor to the GA launch of ML Pipelines on OCI, enabling end-to-end machine learning workflow orchestration for large-scale distributed training and inference; authored the official Terraform provider (Golang) to support infrastructure-as-code automation.
Co-architected a real-time streaming model inference platform (“Stream Manager”) with a Java control plane and a Golang binary inference runtime, delivering scalable, production-grade model serving using co-located containers with internal streaming endpoints safeguarded by Linux network namespaces.