- Adept at Linux kernel, OS, computer architecture, SoC architecture, performance modeling (cycle-accurate CPU/SoC architecture simulation), algorithms, and C/C++/x86/Python/Java/Go/SQL/Bash programming and scripting languages
- Experience in machine learning, cache side-channel attacks, computer security, computer networks, distributed...
Experience
2023 — Now
San Jose, CA
• Hypervisor, virtualization, and kernel-related design work
• PMU and SPE sharing support
• Task affinity, cpu_mask, and scheduler balancing work
• VM-exit profiling design and performance measurement
• Suspend-to-RAM related support
• Various passthrough device support and configuration (CoreSight, CPU and cache topology, etc.)
• Various shmem driver unification work
• LLM infra-related backend-op/engine/serving layer implementations and optimizations for inference
• Emulation
• S32G multicore IPC emulation
• Sysbus UFS device emulation support, plus seL4 UFS driver support for the QEMU UFS device
2020 — 2023
Sunnyvale, CA
• Worked on the Google SoC project. Responsible for Google SoC performance simulation and for maintaining critical collaboration between the US teams and the Israel CI2 team to deliver the Cedar (2nd gen SoC) POR study results and the Cedar tape-out.
• Initiated the first version of the Google Mesh Simulator in gem5. Implemented the key Cedar features in the Mesh XPs (input VCs, route computation, dual-channel support, RNI support, E2E tests, output-unit connections, code-base refactoring, etc.). Mentored and ramped up multiple colleagues and worked with them on other necessary features (arbiter connections, credit links, the bypass feature, etc.). Generated study results (performance, mesh utilization) for reviews by the US and CI2 performance/architecture/design teams.
• Took responsibility for most of the initial mesh studies and verifications in gem5 (Cypress (1st gen SoC) mesh utilization correlation under various SLC hit rates, mesh back-pressure scheme study, gem5 arbitration scheme, buffer-size and link-latency verifications, gem5 request transaction ladder chart correlation, mesh req-latency avalanche point study, 3-cycle vs 2-cycle per-hop latency study, etc.).
• Contributed to most of the initial simulator infrastructure development in gem5, including but not limited to the stats collection protobuf framework and mesh performance/utilization visualization automation.
• Platform performance projects: engaged in many aspects of system architecture design for Google's internal services and cloud platforms (e.g. computing units, servers, storage, networking, accelerators) using computer architecture, OS, performance modeling, and data analysis skills. 1. CCX-aware scheduling (study of the IPC, QPS, kernel scheduling latency, query latency, and throughput differences for high-tier Google workloads). 2. Silent Data Corruption project. 3. Hyperthreading performance and efficiency studies for various workloads on AMD/Intel processors.
2016 — 2020
Santa Clara, CA
• Worked on a multi-purpose companion core project for trending workloads in the Advanced Architecture Group
• Took responsibility for CPU OOO/EXE performance modeling in the C++ simulator for the evolutionary next-generation core (NGC) development; worked with other CPU architects and design/validation engineers to guide the NGC design
• Implemented path-finding features (mostly OOO/EXE units) in the CPU simulator (dual-dest uops study, port-ganging vs dual-dispatching study, double-banked PRF+freelists, EXE unit hibernation study, etc.); collected and analyzed performance data to help the micro-architecture team understand the benefits and trade-offs of the newly proposed micro-architectures
2015 — 2016
Santa Clara, CA
• Maintained, developed, and improved infrastructure and scripts: bug filing automation, dashboard label/ticket linking automation, data collection automation, benchmark performance drift tracking tools
• Performed Shasta CPU RTL debugging for OS boot; debugged and fixed psim simulator bugs (codec, decoder, front-end, scheduler, MMU, etc.); monitored test regressions for functional and performance issues
• Added new features to the psim simulator code (scheduler with different cancellation policies, etc.), measured benchmark data with different psim/JIT configurations (different BBR scheduler sizes, instruction blocks with/without pairing and packing, etc.), added new features to the disassembler and to trace-driven mode, and collected performance data for presentation to the micro-architecture team.
2011 — 2015
Princeton, New Jersey, United States
• Performed a thorough performance measurement (e.g. IPC, cache miss rate) of a new secure cache design (Newcache) as data cache, L2 cache, and instruction cache for carefully selected cloud server benchmarks under gem5
• Reconstructed representative RSA instruction cache side-channel attacks (targeting libgcrypt 1.5.3 under Linux with a normal 8-way SA L1 I-cache), and ran experiments to evaluate Newcache's security mechanism as an instruction cache
• Reconstructed the hooking functions for each operation of the Square-and-Multiply implementation of RSA (targeting libgcrypt 1.5.3 under Linux), trained an SVM classifier, and used it to classify operations; the classification accuracy serves as a metric of the vulnerability of different cache configurations
• Visited the Institute of Parallel and Distributed Systems (IPADS) in Shanghai for 3 months, and worked on extending the I-cache side-channel attacks to ARM TrustZone with the TrustKernel OS developed by IPADS
• Extended representative side-channel techniques to GUI-related shared libraries (libgtk, libX11) under Linux
• Studied secure processor designs such as Bastion, Intel SGX, and ARM TrustZone, and worked on implementing Bastion on gem5
• Built a regression framework and a regression report generator, which help the group store and compare the historical running times of benchmarks under different LLVM backend optimization techniques
• Implemented a theoretical offline x86 pass (similar to a compiler pass) that uses induction variable expansion to perform dynamic instruction renaming optimization
• Worked on extending the implementation of the renaming architecture under gem5; the main idea is to move work usually done by compiler optimizations, such as induction variable expansion, into a simple hardware component and evaluate the benefits
Education
Princeton University
Master of Science - MS
Shanghai Jiao Tong University
Bachelor of Science - BS
University of Michigan