ML Infrastructure Engineer.
• ML Infrastructure Tools:
• Built and maintain a customizable HDFS data-loading tool for ML engineers that preprocesses remote Parquet files at runtime in a distributed setting, using PyTorch and PyArrow.
• Co-developed an in-house tool for tracking and visualizing ML experiment metrics.
• Data Pipelines:
• Developed the Python backend for data pipelines processing over 40 billion image/text pairs and managing more than 20 petabytes of data.
• Developed multimedia active learning pipelines that run batch inference on data from remote sources, minimize disk and memory usage while maximizing GPU utilization, and save only the keys and metadata of selected samples.
• Architected and implemented generative PyTorch pipelines using GANs and diffusion models for images and text-to-speech (TTS) models for audio.