• Built an evaluation framework for MLLMs that isolates perceptual, spatial, grounding, and fairness failures under 30+ perturbations across 62K+ samples (scalable to 1M+), supporting deployment-facing reliability analysis for foundation VLMs.
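The core loop behind this kind of robustness evaluation can be sketched as follows. This is a minimal illustration, not the framework itself: the word-swap perturbation, the toy model, and the samples are all invented stand-ins for the real perturbation suite and MLLM under test.

```python
# Illustrative sketch: accuracy on clean inputs vs. perturbed inputs,
# the basic measurement used to isolate perturbation-induced failures.
import random

random.seed(0)

def perturb_swap_words(text: str, rate: float = 0.3) -> str:
    """Toy text perturbation: randomly swap adjacent words."""
    words = text.split()
    for i in range(len(words) - 1):
        if random.random() < rate:
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def robustness_delta(model_predict, samples):
    """Clean accuracy minus perturbed accuracy (higher = more fragile)."""
    clean = sum(model_predict(x) == y for x, y in samples)
    pert = sum(model_predict(perturb_swap_words(x)) == y for x, y in samples)
    return (clean - pert) / len(samples)

# Hypothetical model: answers "yes" iff the prompt starts with "is"
toy_model = lambda x: "yes" if x.startswith("is") else "no"
samples = [("is the cat on the mat", "yes"), ("where is the cat", "no")]
delta = robustness_delta(toy_model, samples)
```

In the real framework the same delta would be computed per perturbation family (perceptual, spatial, grounding, fairness) to attribute failures to a specific axis.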
• Designed Map&Make, a schema-guided and agentic text-to-table pipeline for high-fidelity structured extraction from long-form narratives; evaluated frontier LLMs with 5+ structural and semantic metrics.
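A structural metric for text-to-table extraction can be sketched as cell-level F1 between predicted and gold tables. This is an assumed, generic formulation for illustration, not necessarily the exact metric used in Map&Make.

```python
# Illustrative sketch: cell-level F1, treating a table as a set of
# ((row_key, column), value) cells; a cell is correct only if both
# its position and value match the gold table.
def cell_f1(pred: dict, gold: dict) -> float:
    pred_cells = set(pred.items())
    gold_cells = set(gold.items())
    tp = len(pred_cells & gold_cells)  # exactly-matching cells
    if tp == 0:
        return 0.0
    precision = tp / len(pred_cells)
    recall = tp / len(gold_cells)
    return 2 * precision * recall / (precision + recall)

# Toy example: one of two cells extracted correctly
gold = {("Arsenal", "wins"): "5", ("Chelsea", "wins"): "3"}
pred = {("Arsenal", "wins"): "5", ("Chelsea", "wins"): "4"}
score = cell_f1(pred, gold)  # tp=1, precision=0.5, recall=0.5 -> 0.5
```

Semantic variants typically relax the exact-match condition on values (e.g. normalized or embedding-based comparison) while keeping the same precision/recall scaffolding.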
• Implemented distributed post-training with LoRA, DPO, and GRPO on A100, H100, and H200 clusters, improving GSM8K and MATH accuracy by 20-40% across multiple open-weight LLMs while balancing compute efficiency and stability.
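The LoRA piece of this reduces to a simple update rule, sketched below with NumPy. This illustrates the general technique (not the project's training code): the frozen weight W is augmented by a low-rank delta (alpha/r) * B @ A, and only the small A and B matrices are trained.

```python
# Minimal LoRA forward pass: base path plus scaled low-rank adapter path.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 16  # toy dimensions; r is the LoRA rank

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-init

def lora_forward(x):
    # Identical to W @ x at initialization, since B is zero
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
match_at_init = np.allclose(lora_forward(x), W @ x)
```

The zero-initialized B is the standard trick that makes fine-tuning start from the pretrained model's behavior; DPO and GRPO then supply the preference/reward signal that drives updates to A and B.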
• Co-developed SPORTSQL, an interactive NL-to-SQL and visualization system over live English Premier League data; contributed 1,793 benchmark queries, achieving up to 80% exact-match and 94% LLM-as-judge accuracy.
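The evaluation side of an NL-to-SQL system can be sketched with stdlib sqlite3. The schema, data, and queries below are invented toy examples, not SPORTSQL's actual benchmark; the sketch shows execution-based matching, where two syntactically different queries count as equivalent if they return the same rows.

```python
# Illustrative sketch: score a predicted SQL query against gold SQL by
# executing both on a small EPL-style table and comparing result sets.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE standings (team TEXT, points INTEGER)")
conn.executemany("INSERT INTO standings VALUES (?, ?)",
                 [("Arsenal", 84), ("Man City", 91), ("Liverpool", 82)])

def execution_match(pred_sql: str, gold_sql: str) -> bool:
    """True if both queries return identical result sets."""
    return conn.execute(pred_sql).fetchall() == conn.execute(gold_sql).fetchall()

# NL question: "Which team tops the table?" -- two valid SQL renderings
gold = "SELECT team FROM standings ORDER BY points DESC LIMIT 1"
pred = "SELECT team FROM standings WHERE points = (SELECT MAX(points) FROM standings)"
ok = execution_match(pred, gold)
```

String-level exact match is stricter (it would reject the prediction above), which is why execution match and LLM-as-judge scores typically run higher than exact match.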
• Built annotation and quality-control protocols for human and LLM-assisted evaluation, improving reproducibility, failure analysis, and benchmarking rigor.
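One representative quality-control check is inter-annotator agreement. The sketch below computes Cohen's kappa in pure Python; it is a single illustrative component under assumed labels, not the full protocol.

```python
# Illustrative sketch: Cohen's kappa between two annotators' labels,
# correcting observed agreement for chance agreement.
from collections import Counter

def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label distribution
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail judgments from two annotators
rater1 = ["pass", "pass", "fail", "pass", "fail", "fail"]
rater2 = ["pass", "fail", "fail", "pass", "fail", "fail"]
kappa = cohens_kappa(rater1, rater2)  # observed 5/6, expected 1/2
```

In an LLM-assisted setup the same statistic can be computed between a human rater and the LLM judge to decide when model labels are trustworthy enough to scale annotation.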