I worked with an impressive team of engineers to build a GPU-enabled distributed deep learning compute and experiment tracking platform. I mentored our clients’ ML and perception teams, helping them integrate their models, build pipelines on our platform, and adopt best practices as they transitioned from training on a single GPU to many. Using feedback from these interactions, I worked with our team to design and build new products and features.
Deep Learning Writings and Presentations:
•Led research showing how layer-wise optimizers (e.g., LAMB) can train object detectors (e.g., Mask R-CNN) with large batch sizes in a fraction of the time, with no degradation in accuracy. Results can be found on our company blog at https://bit.ly/35gfM0P.
•Built a cat detector using a TensorFlow implementation of RetinaNet, trained live on 64 GPUs in five minutes at VentureBeat Transform 2019. Our CEO’s demonstration can be found at https://bit.ly/2YdMbnr.
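The layer-wise scaling at the heart of optimizers like LAMB can be sketched in a few lines. This is an illustrative simplification (plain Python, a single layer, and `adam_update` standing in for the Adam direction), not the implementation from the research above:

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def lamb_layer_update(weights, adam_update, lr=0.001, weight_decay=0.01):
    """One LAMB-style layer-wise update (illustrative sketch).

    The trust ratio rescales the step per layer so that layers with
    large weight norms are not dominated by proportionally tiny (or
    huge) updates -- the property that keeps very large-batch training
    stable.
    """
    # Combine the adaptive direction with decoupled weight decay.
    update = [u + weight_decay * w for u, w in zip(adam_update, weights)]
    w_norm = l2_norm(weights)
    u_norm = l2_norm(update)
    # Trust ratio: size of this layer's weights relative to its update.
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return [w - lr * trust_ratio * u for w, u in zip(weights, update)]
```

Because the ratio is computed per layer rather than globally, a single learning rate can serve every layer even as the batch size grows.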
Software Development Examples:
•Led the design and implementation of our local offering, which allowed users to run deep learning experiments on their own hardware and compare results in the Engine Dashboard alongside their cloud jobs. The product tracked and persisted code changes, logs, outputs, model performance metrics, system utilization metrics, and dataset metadata. Technologies include: Kotlin, Python, NGINX, PostgreSQL, Hasura, GraphQL, InfluxDB, Elasticsearch.
•Designed and programmed an email alerting service that notified users when their experiments entered a terminal state. Technologies include: Kubernetes, Docker, Prometheus, PromQL, Python.
•Designed and programmed a feature to pre-fetch training data from S3 buckets into an in-memory read-through cache using Alluxio and its FUSE-based POSIX API, resulting in up to a 5x speedup when reading a remote file.
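The core logic of a terminal-state alerting service like the one above can be sketched as a transition check between consecutive polls. The state names and record shape here are assumptions for illustration, not the platform’s actual schema:

```python
# Illustrative terminal states; the real service's state machine may differ.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELED"}

def experiments_to_notify(previous, current):
    """Return experiment IDs that have just entered a terminal state.

    `previous` and `current` map experiment IDs to state strings, as a
    polling loop (e.g., evaluating a PromQL query against Prometheus on
    each scrape) might observe them. Alerting only on the transition,
    rather than on the state itself, avoids re-emailing users on every
    poll.
    """
    return sorted(
        exp_id
        for exp_id, state in current.items()
        if state in TERMINAL_STATES
        and previous.get(exp_id) not in TERMINAL_STATES
    )
```

Each ID returned would then be handed to the email-sending path exactly once.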
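The read-through caching pattern behind the S3 pre-fetch feature can be shown with a minimal in-process sketch. The real feature used Alluxio and its FUSE-based POSIX API rather than a Python dictionary; this only illustrates the access pattern:

```python
class ReadThroughCache:
    """Minimal read-through cache: serve from memory, fall back to a
    slow fetch (a stand-in for a remote S3 read) on a miss.
    """

    def __init__(self, fetch):
        self._fetch = fetch   # called only on a cache miss
        self._store = {}      # in-memory store
        self.misses = 0

    def read(self, key):
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._fetch(key)
        return self._store[key]
```

Pre-fetching amounts to calling `read` on upcoming files ahead of the training loop, so that subsequent reads are served from memory instead of crossing the network, which is where the observed speedup comes from.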