# Ard W.

> Machine Learning Infrastructure @ Kumo.ai | Distributed Systems, Infrastructure, & Machine Learning

Location: United States
Profile: https://flows.cv/ard

Broadly interested in the intersection of distributed systems and machine learning. Particularly interested in fast inference, distributed training, and model compression.

Areas of focus: distributed systems, data/machine learning infrastructure, high performance machine learning, machine learning

## Work Experience

### Software Engineer, Machine Learning Infrastructure @ Kumo

Jan 2025 – Present | Mountain View, CA

- Currently leading the effort for the next generation of Kumo warehouse connectors.
- Designed and implemented pooled session management for ephemeral Databricks job clusters, which is used throughout training and inference workflows.
- Designed and implemented a stateless observability framework and component to monitor temporal workflows and detect issues with external platforms.
- Led the effort to remove error-handling tech debt, improving and standardizing error handling throughout the infrastructure codebase.
- Refactored and reimplemented data operations to use warehouse compute, decreasing latency and improving reliability and scalability.
- Improved testing coverage and made enhancements to the testing platform.
- Part of the on-call rotation providing operational support.

### Graduate Student (Part-time) @ Columbia University

Jan 2022 – Jan 2025

Some interesting school projects I worked on:

• Open Source Distributed Deep Learning - Modernized Tensor Parallelism for IBM's Foundation Model Stack (FMS): https://github.com/foundation-model-stack/foundation-model-stack/pull/371
• MapReduce Library - Wrote a master node that distributes jobs to workers and handles worker failures.
• Paxos-based Key-Value Service - Implemented a Paxos library and key-value servers that use it to replicate state across replicas while tolerating failures.
• Sharded Key-Value Service - Implemented a Shardmaster service that manages the global configuration across all replica groups and automatically balances shards across them. The Shardmaster is itself replicated via Paxos, and the replica groups of key-value servers that consult it also use Paxos for replication in the face of server failures.
• Model Checking Paxos - Built a bare-bones model checker and wrote test cases that use breadth-first search to examine various scenarios for Paxos reaching consensus. The entire distributed system (several nodes + messages + network) is viewed as one state machine.
• Model Compression - Distilled knowledge from a transformer model into CNNs via knowledge distillation, followed by quantization.

### Software Engineer, Perception Infrastructure @ Bear Robotics

Jan 2022 – Jan 2023 | Redwood City, California, United States

• Spearheaded an organization-wide ROS migration.
  - Communicated and collaborated with multiple teams to drive the project phase to completion.
  - Documented and reported progress updates and findings to stakeholders throughout the migration.
  - Triaged and investigated ROS/memory issues.
  - Conducted manual testing and QA of the robot prior to deployment.
• Independently designed and implemented a scalable data and machine learning pipeline using Airflow and Kubernetes, enabling more efficient batch processing and supporting complex workflows with the capacity to execute hundreds to thousands of jobs in parallel. Replaced a single-VM solution, resulting in a >80x speedup in data preprocessing for an image localization model job.
• Took ownership of the robot data collection stack and made improvements to existing ROS nodes.
• Redesigned, refactored, and generalized a ROS node to handle model installations of all types.
• Conducted benchmarking and testing of ROS nodes responsible for uploading data to the cloud under different Wi-Fi scenarios.
• Mentored interns/peers on various teams.

### Software Engineer, Machine Learning Infrastructure @ Fiddler

Jan 2021 – Jan 2022 | Palo Alto, California, United States

Machine Learning Platform

• Led the analysis to uncover issues, inconsistencies, and bottlenecks in complex and tangled APIs. Proposed a range of solutions to simplify and decouple these APIs, in alignment with architecture redesign.
• Developed new REST APIs (for projects, datasets, models, etc.) to support a new metadata service, which was an integral part of the new system architecture.
• Created and managed an OpenAPI specification and documentation for metadata APIs.
• Conducted performance benchmarking of event publishing in the ingestion service, comparing Postgres and Clickhouse.
• Contributed to the implementation and automation of a load testing framework.
• Designed and prototyped a barebones cluster health service.
• Developed an ingestion service cleanup API and integrated Prometheus metrics tracking.
• Implemented support for multi-model event ingestion.

### Software Engineer @ 8th Wall

Jan 2021 – Jan 2021 | Palo Alto, California, United States

Worked on frontend software development and cloud engineering.

### Student @ University of California, Berkeley

Jan 2018 – Jan 2020 | Berkeley, California, United States

### Amazon Fresh Associate @ Amazon

Jan 2018 – Jan 2018 | Brisbane, California

### MESA Mathematics Tutor @ City College of San Francisco

Jan 2017 – Jan 2017 | San Francisco Bay Area

Tutored calculus and developed quizzes and lessons with code built off Drupal.

### Engineering Lab Computer Science and Engineering Tutor @ City College of San Francisco

Jan 2017 – Jan 2017 | San Francisco Bay Area

Tutored Java, Python, MATLAB and SOLIDWORKS.
## Education

### Bachelor of Science - BS in Electrical Engineering and Computer Sciences

University of California, Berkeley

### Master of Science - MS in Computer Science - Machine Learning Track

Columbia University

### Transfer Coursework in EECS and IEOR

City College of San Francisco

## Contact & Social

- LinkedIn: https://linkedin.com/in/ardw

---

Source: https://flows.cv/ard
JSON Resume: https://flows.cv/ard/resume.json
Last updated: 2026-04-10