- Adept at the Linux kernel, OS, computer architecture, SoC architecture, performance modeling (cycle-accurate CPU/SoC architecture simulation), algorithms, and C/C++/x86/Python/Java/Go/SQL/Bash scripting languages
- Experience in machine learning, cache side channel attacks, computer security, computer networks, distributed...

Experience

NIO

Staff Software Engineer (Kernel + LLM)

2023 — Now

San Jose, CA

hypervisor, virtualization and kernel related design work

PMU and SPE sharing support

task affinity & cpu_mask and sched balancing work

vm-exit profiling design and performance measurement

suspend-to-ram related support

various passthrough device support and config (CoreSight, CPU & cache topology, etc.)

various shmem driver unification work

LLM infra related backend-op/engine/serving layers implementations and optimizations for inference

emulation

S32G multicore IPC emulation

sysbus UFS device emulation support + seL4 UFS driver support for the QEMU UFS device

Google

SW Engineer/SoC + Platform Performance Architect (Google Cloud)

2020 — 2023

Sunnyvale, CA

Worked on the Google SoC project. Responsible for the Google SoC performance simulation and for maintaining critical collaboration between the US teams and the Israel CI2 team to deliver the Cedar (2nd gen SoC) POR study results and the Cedar tape-out.

Initiated the first version of the Google Mesh Simulator under gem5. Implemented the important Cedar features in the Mesh XPs (input VCs, route computation, dual channel support, RNI support, E2E tests, output-unit connections, code-base refactoring, etc.). Mentored and ramped up multiple colleagues; collaborated with them on other necessary features (arbiter connections, credit-links, the bypass feature, etc.). Generated study results (performance, mesh utilization) for reviews by the US and CI2 performance/architecture/design teams.

Took responsibility for most of the initial mesh studies and verifications under gem5 (Cypress (1st gen SoC) mesh utilization correlation under various SLC hit rates, mesh back-pressure scheme study, gem5 arbitration scheme, buffer-size and link-latency verifications, gem5 request transaction ladder chart correlation, mesh req-latency avalanche point study, 3-cycle vs 2-cycle per-hop latency study, etc.).

Contributed to most of the initial simulator infrastructure development in gem5, including but not limited to the stats collection protobuf framework and mesh performance/utilization visualization automation.

Platform performance related projects: engaged in many aspects of the system architecture design for Google's internal services and cloud platforms (e.g. computing units, servers, storage, networking, accelerators, etc.) by applying computer architecture, OS, perf modeling, and data analysis skills. 1. CCX-aware scheduling (study of the IPC, QPS, kernel scheduling latency, query latency, and throughput differences for high-tier Google workloads). 2. Silent Data Corruption project. 3. Hyperthreading performance and efficiency studies for various workloads on AMD/Intel processors.

Intel Corporation

CPU Architect

2016 — 2020

Santa Clara, CA

Worked on a multi-purpose companion core project for trending workloads under the Advanced Architecture Group

Took responsibility for CPU OOO/EXE performance modeling in the C++ simulator for the evolutional next generation core (NGC) development; worked with other CPU architects and design/validation engineers to guide the NGC design

Developed the path-finding features (mostly OOO/EXE units) into the CPU simulator (dual-dest uops study, port-ganging vs dual-dispatching study, double-banked PRF+freelists, EXE units hibernation study etc.); collected and analyzed performance data to give insights to the micro-architecture team to understand the benefits and trade-offs of the newly proposed micro-architectures

Intel Corporation

CPU Performance Validation Architect/SW Engineer (SMI acquired by Intel)

2015 — 2016

Santa Clara, CA

Maintained infrastructure and scripts; developed and improved infrastructure: bug filing automation, dashboard label/ticket linking automation, data collection automation, benchmark performance drift tracking tools

Performed Shasta CPU RTL debugging for OS boot; debugged and fixed psim simulator bugs (codec, decoder, front-end, scheduler, MMU, etc.); monitored test regressions for functional and performance issues

Added new features to the psim simulator code (scheduler with different cancellation policies, etc.), measured benchmark data with different psim/JIT configurations (different-size BBR scheduler, ins blocks with/without pairing and packing, etc.), added new features to the disassembler and to the trace-driven mode, and collected performance data for presentation to the micro-architecture team.

Princeton University

Graduate Research Assistant

2011 — 2015

Princeton, New Jersey, United States

Performed thorough performance measurements (e.g. IPC, cache miss rate, etc.) of a new secure cache design (Newcache) as a data cache, L2 cache, and instruction cache for carefully selected cloud server benchmarks under gem5

Reconstructed representative RSA instruction cache side-channel attacks (against libgcrypt 1.5.3 under Linux, using a normal 8-way SA L1 I-cache), and ran experiments to evaluate Newcache's security mechanism as an instruction cache

Reconstructed the hooking functions for each operation of the Square-and-Multiply implementation of RSA (against libgcrypt 1.5.3 under Linux), trained an SVM classifier, and used it to classify operations; the classification accuracy serves as a metric of the vulnerability of different cache configurations

Visited the Institute of Parallel and Distributed Systems (IPADS) in Shanghai for 3 months, and tried to extend the I-cache side-channel attacks to ARM TrustZone with the TrustKernel OS developed by IPADS

Extended representative side-channel techniques to GUI-related shared-libraries (libgtk, libX11) under Linux

Studied secure processor designs such as Bastion, Intel SGX, and ARM TrustZone, and tried to implement Bastion on gem5

Built a regression framework and a regression report generator, which help the group store and compare the historical running times of benchmarks under different LLVM backend optimization techniques

Implemented a theoretical offline x86 pass (similar to a compiler pass) that uses induction variable expansion to perform dynamic instruction renaming optimization

Tried to extend the implementation of the renaming architecture under gem5, the main idea of which is to move work usually done by compiler optimizations, such as induction variable expansion, into a simple hardware component and evaluate the benefits

Education

Princeton University

Master of Science - MS

Shanghai Jiao Tong University

Bachelor of Science - BS

University of Michigan
