San Jose, California, United States
RAS (Reliability, Availability, Serviceability) Management | AI Infrastructure Team
Designed and implemented a DRAM Fault Analyzer within the OpenBMC firmware environment, enabling parsing and persistent storage of CPER (Common Platform Error Record) logs.
Designed and optimized the memory logging mechanism in BMC firmware for efficient storage and retrieval of diagnostic data, improving system performance and reliability in memory- and resource-constrained environments.
Conducted in-depth fault analysis and field studies across multiple memory technologies including HBM, DDR4, and DDR5, contributing to system-level RAS improvements in hyperscale data center environments.
Developed a new adaptive scrubbing method for DRAM based on data poisoning and hardware/firmware-level principles.
Developed a DRAM Fault Emulator to model and simulate various memory failure scenarios, significantly improving validation coverage.