I’m an ML Infrastructure & MLOps engineer with a deep foundation in network software engineering, distributed systems, and high-performance data center networking. For 14+ years, I’ve built and optimized the systems that move data at scale across NICs, switches, kernels, and GPU clusters.
Experience
2022 — Now
Mountain View, CA
ML Infrastructure & GPU Networking
Architecting next-generation GPU cluster networking for ML training workloads, including performance testing and validation for H100/B200 class accelerators
Designed distributed observability infrastructure (Grafana, Prometheus) for ML pipeline metrics, GPU utilization, and network telemetry across training clusters
Built batch workflow orchestration for large-scale ML training jobs with fault tolerance and automatic retry logic
Optimized NCCL configurations to mitigate network bottlenecks
Time-Sensitive Networking & Synchronization
Architected a high-availability PTP/gPTP time synchronization network, achieving a 75% reduction in sync faults
Designed automated fault detection and mitigation systems including ARP protection, firewalls, and real-time monitoring
Implemented QoS tuning and traffic shaping to prioritize high-priority traffic over background data movement
Low-Latency Distributed Systems
Led networking architecture across platform, ML infrastructure, and autonomous driving stacks
Built status and fault-reporting frameworks using C++ and protobuf for sub-millisecond latency monitoring
Authored comprehensive network architecture documentation and design specifications
2022 — 2022
Palo Alto, California, United States
Onboard Network Architecture for ML Inference:
Owned end-to-end networking architecture connecting 12+ sensors and GPU/compute pods for real-time ML inference
Designed low-latency inter-process communication paths optimized for sensor fusion and perception model data flows
Implemented L2 multicast optimizations and TCAM tuning to eliminate bandwidth bottlenecks in high-throughput sensor streams
Time Synchronization for ML Workloads:
Led IEEE 802.1AS (gPTP) implementation achieving sub-microsecond synchronization across distributed sensors, enabling accurate temporal correlation for perception models
Built real-time monitoring tools for time-sync drift detection and automatic correction
2020 — 2022
Santa Clara, California, United States
GPU Cluster Networking & AI Infrastructure:
Led development of In-Service Software Upgrade (ISSU) enabling zero-downtime upgrades for GPU cluster fabrics, critical for continuous AI training operations
Spearheaded RDMA over Converged Ethernet (RoCE) proof-of-concepts and deployment strategies for GPU Direct RDMA, reducing inter-GPU communication latency for distributed training
Designed and implemented kernel-bypass networking using DPDK for high-throughput, low-latency data paths in AI training clusters
Data Center Fabric Optimization:
Optimized VXLAN, EVPN, and MLAG configurations for L2/L3 GPU cluster fabrics, improving bisection bandwidth and reducing tail latency
Led QoS feature development for traffic prioritization in mixed AI training and inference workloads
Implemented SPAN/ERSPAN for network telemetry and performance debugging in production GPU clusters
Control Plane & High Availability:
Designed the Smart Manager Daemon using multi-threaded ZMQ for control plane orchestration.
Implemented graceful restart protocols (BGP, OSPF, MLAG, BFD) ensuring network stability during upgrades and control plane restarts.
2018 — 2020
Mountain View
Data Center Networking:
Led development of a high-availability hardware VXLAN Tunnel End Point (VTEP) control plane solution integrated with VMware NSX.
Published Cumulus Linux as a solution on VMware Solution Exchange under the Technology Alliance Partner program, raising awareness of the solution to grow the company's customer base and revenue.
Co-led design and integration of the Fastboot solution, which reduced Cumulus Linux downtime and traffic loss by 65% and improved reboot and upgrade performance.
Rewrote critical daemons to improve the scalability and speed of kernel-to-hardware configuration.
Handled critical customer escalations on Broadcom and Mellanox hardware platforms; maintained the hardware vendor SDK and added patches as required to improve performance.
Co-led integration of code sanitization software and fixed critical memory leaks and memory corruption, improving code quality and system reliability.
Developed the SPAN/ERSPAN feature so that global support teams and customers could quickly triage critical-path issues.
Added feature enhancements and fixes to data center features and protocols, including VXLAN, EVPN, BGP, ACL, and QoS.
2017 — 2018
Santa Clara, California
Designed, developed, and implemented software systems while meeting quality and delivery objectives. Worked on early enablement and feature development of multi-layer switches.
Led and completed platform-independent and platform-dependent plug-ins for ACL and QoS.
Education
San Francisco State University
Master's degree
Sinhgad College of Engineering
Bachelor's degree
SVCP