I am a Master’s student in Computer Science at Georgia Tech working on large-scale multimodal AI systems across the full stack, from model behavior to inference infrastructure.
Experience
2025 — Now
Atlanta, GA
• ICML 2026 submission (first author)
Built a causally controlled audit framework for LLM decision revision, distinguishing genuine belief updating from reputation-driven compliance. Introduced token-level log-odds probing (sketch below) and preference–report divergence analysis across 31K+ trials, revealing systematic expertise-sensitivity failures.
• NeurIPS 2026 submission (first author)
Identified a post-retrieval evidence-ignoring failure mode in multimodal RAG, and introduced a retrieval-conditioned auditing framework revealing that matched retrieval success can still hide sharply different evidence-use behavior across VLMs.
• NeurIPS 2026 submission (first author)
Developed a distillation framework that teaches compressed vision–language models when to doubt by transferring uncertainty trajectories from large teachers, significantly improving calibration, robustness, and selective prediction under visual corruption.
• EMNLP 2026 (in preparation; first author)
Developed a localization-based evaluation framework for event-boundary understanding in LLMs, using temporal negative controls and human calibration to reveal fragile alignment with human temporal segmentation.
• Software Engineering (Multi-LLM engine)
Architected a scalable evaluation platform integrating 9+ chat services and 100+ API models, enabling reproducible large-scale reliability audits. Reduced browser automation memory footprint by ~40% via a custom BrowserView layer (vs. Playwright/Selenium) and built a structured LLM-as-a-Judge backend with persistent storage and automated deployment.
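A minimal sketch of the token-level log-odds probe mentioned in the ICML bullet above. It assumes access to the model's log-probabilities over the first answer token (e.g., via an OpenAI-compatible API with logprobs enabled); the token names and numbers below are illustrative, not values from the study.

```python
# Minimal sketch of token-level log-odds probing (illustrative, not the paper's code).
# Assumes we can read the model's log-probabilities over the first answer token;
# the probabilities below are made up for demonstration.
import math

def log_odds(token_logprobs: dict, pos: str = "Yes", neg: str = "No") -> float:
    """Log-odds of the positive vs. negative answer token: log p(pos) - log p(neg)."""
    return token_logprobs[pos] - token_logprobs[neg]

# Hypothetical first-token logprobs before and after an "expert disagrees" challenge.
before = {"Yes": math.log(0.80), "No": math.log(0.15)}
after = {"Yes": math.log(0.35), "No": math.log(0.60)}

shift = log_odds(after) - log_odds(before)  # negative => revision toward "No"
print(f"log-odds shift after challenge: {shift:+.2f}")
```

Comparing this continuous shift with the model's stated (reported) answer is what lets the preference–report divergence analysis separate real belief change from surface compliance.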
2025 — Now
Atlanta, GA
• ME 4710: Foundations in Machine Learning for Engineers (Fall 2025).
• MGT 6655: Business Data Preparation & Visualization (Spring 2026).
• Campus Academic Integrity TA Team supporting OMS Analytics and OMS Cybersecurity (Spring 2026).
• Built a privacy-preserving TA Q&A system for MGT 6655 (graduate level, 100+ students) from the ground up: an end-to-end Python pipeline that transforms Ed Discussion data into RAG and SFT datasets (JSONL) with schema-tolerant parsing and metadata traceability (sketch below), plus a grounded LLM-based assistant with configurable retrieval and embedding backends and GPU-ready evaluation workflows for scalable, reproducible inference.
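A minimal sketch of the schema-tolerant parsing step in the pipeline above. Field names (`title`, `answers`, etc.) and file paths are hypothetical, since Ed Discussion export schemas vary; the point is that every lookup degrades gracefully rather than assuming one fixed schema.

```python
# Sketch of the schema-tolerant Ed Discussion -> JSONL step (field names and
# paths are hypothetical; real exports vary, hence the fallbacks on every lookup).
import json

def to_record(thread: dict):
    """Map one exported thread to a RAG/SFT record, tolerating missing keys."""
    question = thread.get("title") or thread.get("subject") or ""
    answers = thread.get("answers") or thread.get("comments") or []
    answer = answers[0].get("text", "") if answers else ""
    if not (question and answer):
        return None  # skip threads without a usable Q/A pair
    return {
        "question": question.strip(),
        "answer": answer.strip(),
        # metadata kept for traceability back to the source thread
        "meta": {"thread_id": thread.get("id"), "category": thread.get("category")},
    }

with open("ed_export.json") as src, open("qa_pairs.jsonl", "w") as dst:
    for thread in json.load(src):
        rec = to_record(thread)
        if rec is not None:
            dst.write(json.dumps(rec, ensure_ascii=False) + "\n")
```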
2025 — 2025
Mountain View, CA
• Optimized Flux-Schnell (12B DiT) multimodal inference on H100 by implementing GPU memory persistence, offload strategies, and kernel-level tuning, reaching ~30 images/min at 1–2 s latency per request on a single GPU, a 10–15× speedup over the baseline.
• Designed a multi-GPU–ready inference architecture (NCCL-compatible, ONNX → TensorRT conversion pipeline) and validated linear-scaling behavior on single-GPU prototypes to support future distributed deployment.
• Built production-grade serving infrastructure including queueing, heartbeat monitoring, structured logging, GCS integration, and safety filtering, enabling stable long-running operations under high request volume.
• Implemented a video super-resolution pipeline (Real-ESRGAN + FastAPI) with PSNR/SSIM evaluation (sketch below), reducing runtime for a 5 s, 24 fps clip by ~65% (284 s → 100 s) when integrated with Wan2.2 text-to-video.
• Developed an AI-powered e-commerce try-on service (ComfyUI, Flux-Kontext + Segformer) delivering outfit changing, background removal, and style transfer in <5 s per image via secure RESTful APIs.
• Synthesized research papers and open-source model documentation to produce a technical review of multimodal generation systems, covering text-to-image, text-to-video, and super-resolution/upscaling model families and summarizing key benchmark findings for internal evaluation.
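A minimal sketch of the PSNR/SSIM evaluation from the super-resolution bullet above, using scikit-image's standard metrics; the random frames here stand in for real reference/restored clips.

```python
# Sketch of the PSNR/SSIM evaluation used to compare super-resolved frames
# against references (frame loading is stubbed with random data for illustration).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def score_frames(reference, restored):
    """Mean PSNR/SSIM over a clip of shape (T, H, W, 3) with values in [0, 255]."""
    psnr = np.mean([
        peak_signal_noise_ratio(r, x, data_range=255)
        for r, x in zip(reference, restored)
    ])
    ssim = np.mean([
        structural_similarity(r, x, channel_axis=-1, data_range=255)
        for r, x in zip(reference, restored)
    ])
    return float(psnr), float(ssim)

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(8, 64, 64, 3), dtype=np.uint8)
out = np.clip(ref.astype(np.int16) + rng.integers(-10, 10, ref.shape), 0, 255).astype(np.uint8)
print("PSNR %.2f dB, SSIM %.3f" % score_frames(ref, out))
```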
2025 — 2025
Atlanta, GA
• Developed a spatiotemporal modeling framework for high-frequency sensor data (947K samples) with large-scale training on HPC infrastructure.
• Proposed a physics-informed sequence model with structured inductive biases (toy sketch below), achieving strong out-of-distribution generalization across unseen locations (Temp RMSE 0.43 °C; RH RMSE 1.3%).
• Built a scalable sparse-to-dense inference pipeline for high-resolution prediction, resulting in a first-author Q1 journal submission.
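A toy sketch of the physics-informed loss idea from the bullets above. The submission's actual inductive biases are not specified here, so a finite-difference smoothness penalty stands in as an illustrative physics term.

```python
# Toy sketch of a physics-informed training loss: a data term plus a physics
# penalty (here, a finite-difference smoothness term standing in for the actual
# constraint used in the work, which is not detailed in this resume).
import torch

def physics_informed_loss(pred, target, lam: float = 0.1):
    """pred/target: (batch, time) temperature sequences."""
    data_term = torch.mean((pred - target) ** 2)
    # Penalize abrupt step-to-step jumps, encoding that the field evolves smoothly.
    physics_term = torch.mean((pred[:, 1:] - pred[:, :-1]) ** 2)
    return data_term + lam * physics_term

pred = torch.randn(4, 32, requires_grad=True)
target = torch.randn(4, 32)
loss = physics_informed_loss(pred, target)
loss.backward()
print(f"loss = {loss.item():.3f}")
```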
2024 — 2024
Atlanta, GA
• Reworked the llama.cpp decode path (C++) for multi-request inference by introducing request batching and a concurrency-aware scheduler, improving throughput by 1.5–2.0× while reducing tail latency under load.
• Performed system-level profiling over long-horizon generations (10K+ tokens) and 1–16 concurrent requests, identifying KV-cache reads and memory bandwidth pressure as the primary bottlenecks in autoregressive decoding.
• Optimized KV-cache access and memory behavior across CPU and GPU paths, including CUDA kernel-level improvements that cut redundant memory movement during decoding.
• Refined KV-cache reuse and allocation strategy to mitigate fragmentation and stabilize latency, achieving a 30%+ reduction in latency variance under sustained workloads.
• Built a modular benchmarking framework for throughput (tokens/sec), latency, and scaling curves (sketch below), enabling reproducible evaluation of batching, scheduling, and memory optimization strategies.
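A minimal sketch of that benchmarking harness, assuming a locally running llama.cpp HTTP server; the endpoint, payload, and response field are assumptions and should be matched to the actual deployment.

```python
# Sketch of the benchmarking harness: per-request latency and aggregate tokens/sec
# against a locally running llama.cpp server (endpoint/payload/response field are
# assumptions based on its HTTP server; adjust to the real deployment).
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/completion"  # hypothetical local server
N_PREDICT = 128

def one_request(prompt: str):
    t0 = time.perf_counter()
    resp = requests.post(URL, json={"prompt": prompt, "n_predict": N_PREDICT})
    resp.raise_for_status()
    n_tokens = resp.json().get("tokens_predicted", N_PREDICT)
    return time.perf_counter() - t0, n_tokens

def run(concurrency: int, n_requests: int = 16) -> None:
    t_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, ["Tell me a story."] * n_requests))
    wall = time.perf_counter() - t_start
    lats = sorted(r[0] for r in results)
    total_tokens = sum(r[1] for r in results)
    print(f"c={concurrency:2d}  {total_tokens / wall:7.1f} tok/s  "
          f"p50={statistics.median(lats):.2f}s  p95={lats[int(0.95 * (len(lats) - 1))]:.2f}s")

for c in (1, 2, 4, 8, 16):
    run(c)
```

Sweeping concurrency this way produces the throughput/latency scaling curves used to compare batching, scheduling, and memory optimizations.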
Education
Georgia Institute of Technology
Master of Science (MS), Computer Science
Shandong University
Bachelor of Engineering (BE)
Bazhong Tanghu Foreign Language Experimental School