# Wei Zhang

> Machine Learning Software Engineer

Location: San Francisco Bay Area, United States
Profile: https://flows.cv/weizhang2

I am a software engineer specializing in machine learning and computer architecture, with strong expertise in ML systems, training and inference optimization, recommendation and AI personalization, computer vision, language processing, and GPU and CPU architecture. I am also a tech lead with project management experience who has led multiple ML projects from research to production.

## Work Experience
### Staff Machine Learning Software Engineer @ Coupang
Jan 2023 – Present | Mountain View, California, United States
Search Ads retrieval.

### Machine Learning Software Engineer @ Meta
Jan 2020 – Jan 2023 | San Francisco Bay Area
Worked on applied research in machine learning. Researched, prototyped, and productized new ideas to solve ML problems in production, e.g., training optimization for recommendation and AI personalization models. Led multiple projects to production as a tech lead.

### Senior Engineer @ Alibaba Group
Jan 2018 – Jan 2020 | San Francisco Bay Area
AI & Deep Learning benchmarks for high-performance computing.
Training and inference performance optimization of deep learning applications.
Worked on recommendation, computer vision, and language processing models.

### Senior Engineer @ Samsung
Jan 2016 – Jan 2018 | San Francisco Bay Area
GPU systems and architecture design. Developed a bit-accurate C++ functional model of the Samsung GPU. Strong expertise in GPU architecture.

### Research Assistant @ University of Virginia
Jan 2012 – Jan 2016
Proposed novel power-aware CPU architectures, then implemented and evaluated them using circuit and software infrastructures.

Low-Power Set-Associative L1 Instruction Cache
• Proposed an early tag lookup technique to reduce the dynamic read energy of set-associative L1 instruction caches.
• Redesigned the instruction cache, BTB, branch predictor, and the instruction fetch stage of the experimental superscalar processor to support early tag lookup.
• Evaluated the new processor’s performance, the overhead of the proposed technique, and the area, access time, and read/write energy of the new instruction cache.

Dynamic Core Scaling for Performance and Energy Trade-Off
• Proposed dynamic core scaling, which scales the pipeline resources of superscalar processors (front-end width, issue width, and the sizes of the issue queue, load/store queue, and ROB) to trade off performance against energy.
• Implemented dynamic core scaling on a FabScalar-generated RTL superscalar core by modifying several pipeline stages, including fetch, issue, memory, and retire.
• Implemented a store-set memory dependence predictor, various two-level branch predictors, and an LSU that can process multiple loads/stores per cycle on the RTL processor.
• Performed clock gating, synthesized the new reconfigurable processor, ran timing and power analysis on the circuit implementation, and evaluated performance and energy using SPEC benchmarks.

Adaptive Front-End Throttling for Superscalar Processors
• Proposed an adaptive front-end throttling technique that dynamically adjusts the instruction delivery bandwidth of wide-issue superscalar processors to improve energy efficiency.
• Implemented the proposed technique on a FabScalar-generated RTL superscalar core by modifying the core’s fetch, decode, rename, dispatch, issue, memory, and retire pipeline stages.
• Designed a two-level non-blocking cache, implemented it in RTL, and integrated it with the FabScalar core.

### GPU Power Intern @ NVIDIA
Jan 2015 – Jan 2015
GPU Power Analysis
• Performed pre-silicon full-chip power analysis of NVIDIA’s next-generation GPUs, identified power bugs, and helped design teams to improve power efficiency.
• Gained in-depth knowledge of power analysis methodology, low-power design, GPU power management, and GPU architecture.

### FPGA Intern @ Information Sciences Institute
Jan 2012 – Jan 2012 | Washington D.C. Metro Area
Investigating Voltage Transients on FPGA
• Designed a digital voltage sensor that can detect nanosecond-scale voltage transients on a 28 nm Kintex-7 FPGA.
• Built an EDK embedded system, including a MicroBlaze and peripherals, on the FPGA to study the voltage transients.
• Wrote C driver programs running on the MicroBlaze to control the peripherals.

### Visiting Student Research Collaborator @ Princeton University
Jan 2011 – Jan 2012
Prototype the Secret-Protection Processor Architecture
• Prototyped the Secret-Protection architecture, a secure architecture used to protect critical secrets in general-purpose processors, on the OpenSPARC FPGA platform.
• Modified the RTL code of the existing OpenSPARC processor to integrate the new security features.

### Research Assistant @ City University of Hong Kong
Jan 2010 – Jan 2012
Single-Chip Security-Aware Processor
• Proposed a single-chip secure processor architecture that provides memory encryption/decryption protection and memory integrity verification functionality.
• Designed security modules, including AES and TRNG (RTL), and memory integrity verification (firmware).
• Integrated the security modules with the OpenSPARC T1 processor and prototyped the system on FPGA.


## Education
### Bachelor of Engineering (BE) in Electronic Science and Technology
Huazhong University of Science and Technology

### Doctor of Philosophy (PhD) in Computer Engineering
University of Virginia


## Contact & Social
- LinkedIn: https://linkedin.com/in/rabbitwayne

---
Source: https://flows.cv/weizhang2
JSON Resume: https://flows.cv/weizhang2/resume.json
Last updated: 2026-04-12