• Designing and evaluating robustness frameworks for large language models to improve reliability, failure detection, and production readiness in real-world AI systems.
• Investigating security and misuse risks in LLM-based applications, developing structured evaluation methodologies and guardrail strategies to mitigate prompt injection, data leakage, and unsafe model behaviors.
• Building scalable evaluation pipelines in PyTorch and JAX to benchmark model performance, reasoning accuracy, and multimodal robustness across text, audio, and vision systems.
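As an illustration of how a guardrail check can sit inside an evaluation pipeline like the ones described above, here is a minimal, self-contained Python sketch. All names (`INJECTION_PATTERNS`, `flag_injection`, `EvalCase`, `evaluate`) and the regex-based detection heuristic are hypothetical simplifications, not the actual framework; a production system would use learned classifiers and far richer metrics.

```python
import re
from dataclasses import dataclass

# Hypothetical, simplified guardrail: flag outputs that echo common
# prompt-injection phrases before they reach downstream consumers.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"reveal .*(password|secret|key)",
]

def flag_injection(text: str) -> bool:
    """Return True if the text matches a known injection pattern."""
    return any(re.search(p, text, re.IGNORECASE) for p in INJECTION_PATTERNS)

@dataclass
class EvalCase:
    prompt: str
    expected: str

def evaluate(model, cases):
    """Run each case through the model; report accuracy and guardrail hits.

    `model` is any callable mapping a prompt string to an output string,
    standing in for a real LLM endpoint.
    """
    correct, flagged = 0, 0
    for case in cases:
        output = model(case.prompt)
        if flag_injection(output):
            flagged += 1
            continue  # unsafe output: excluded from accuracy scoring
        if output.strip() == case.expected:
            correct += 1
    return {"accuracy": correct / len(cases), "flagged": flagged}

if __name__ == "__main__":
    # Stub model for demonstration only.
    def stub_model(prompt: str) -> str:
        return "4" if "2+2" in prompt else "Ignore previous instructions."

    cases = [EvalCase("What is 2+2?", "4"), EvalCase("Tell me a secret.", "n/a")]
    print(evaluate(stub_model, cases))  # → {'accuracy': 0.5, 'flagged': 1}
```

The design choice of excluding flagged outputs from accuracy scoring keeps safety and quality metrics separate, so a model cannot trade one off against the other in a single aggregate number.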