# Wei Zhang

> Machine Learning Software Engineer

Location: San Francisco Bay Area, United States

Profile: https://flows.cv/weizhang2

I am a software engineer specializing in machine learning and computer architecture. I have strong expertise in ML systems, training and inference optimization, recommendation and AI personalization, computer vision, language processing, and GPU and CPU architecture. I am also a tech lead with good project management experience who has led multiple ML projects from research to production.

## Work Experience

### Staff Machine Learning Software Engineer @ Coupang

Jan 2023 – Present | Mountain View, California, United States

Search Ads retrieval.

### Machine Learning Software Engineer @ Meta

Jan 2020 – Jan 2023 | San Francisco Bay Area

Worked on applied research in machine learning. Researched, prototyped, and productized new ideas to solve ML problems in production, e.g., training optimization for recommendation and AI personalization models. Led multiple projects to production as a tech lead.

### Senior Engineer @ Alibaba Group

Jan 2018 – Jan 2020 | San Francisco Bay Area

AI and deep learning benchmarks for high-performance computing. Training and inference performance optimization of deep learning applications. Worked on recommendation, computer vision, and language processing models.

### Senior Engineer @ Samsung

Jan 2016 – Jan 2018 | San Francisco Bay Area

GPU systems/architecture design. Developed a bit-accurate C++ functional model of the Samsung GPU. Good expertise in GPU architecture.

### Research Assistant @ University of Virginia

Jan 2012 – Jan 2016

Proposed novel power-aware CPU architectures, then implemented and evaluated them using circuit/software infrastructures.

Low-Power Set-Associative L1 Instruction Cache

- Proposed an early tag lookup technique to reduce the dynamic read energy of set-associative L1 instruction caches.
- Redesigned the instruction cache, BTB, branch predictor, and instruction fetch stage of the experimental superscalar processor to support early tag lookup.
- Evaluated the new processor’s performance, the overhead of the proposed technique, and the area, access time, and read/write energy of the new instruction cache.

Dynamic Core Scaling for Performance and Energy Trade-Off

- Proposed dynamic core scaling, which scales the pipeline resources of superscalar processors, including front-end width, issue width, and the sizes of the issue queue, load/store queue, and ROB, to trade off performance and energy.
- Implemented dynamic core scaling on a FabScalar-generated RTL superscalar core by modifying various pipeline stages, including fetch, issue, memory, and retire.
- Implemented a store-set memory dependence predictor, various two-level branch predictors, and an LSU able to process multiple loads/stores per cycle on the RTL processor.
- Performed clock gating, synthesized the new reconfigurable processor, ran timing and power analysis based on the circuit implementation, and evaluated performance and energy using SPEC benchmarks.

Adaptive Front-End Throttling for Superscalar Processors

- Proposed an adaptive front-end throttling technique that dynamically adjusts the instruction delivery bandwidth of wide-issue superscalar processors to improve energy efficiency.
- Implemented the proposed technique on a FabScalar-generated RTL superscalar core by modifying the core’s fetch, decode, rename, dispatch, issue, memory, and retire pipeline stages.
- Designed a two-level non-blocking cache, implemented it in RTL, and integrated it with the FabScalar core.

### GPU Power Intern @ NVIDIA

Jan 2015 – Jan 2015

GPU Power Analysis

- Performed pre-silicon full-chip power analysis of NVIDIA’s next-generation GPUs, identified power bugs, and helped design teams improve power efficiency.
- Gained in-depth knowledge of power analysis methodology, low-power design, GPU power management, and GPU architecture.

### FPGA Intern @ Information Sciences Institute

Jan 2012 – Jan 2012 | Washington D.C. Metro Area

Investigating Voltage Transients on FPGA

- Designed a digital voltage sensor that can detect nanosecond-scale voltage transients on a 28 nm Kintex-7 FPGA.
- Built an EDK embedded system, including a MicroBlaze and peripherals, on the FPGA to study the voltage transients.
- Wrote C driver programs running on the MicroBlaze to control the peripherals.

### Visiting Student Research Collaborator @ Princeton University

Jan 2011 – Jan 2012

Prototyping the Secret-Protection Processor Architecture

- Prototyped the Secret-Protection architecture, a secure architecture used to protect critical secrets in general-purpose processors, on the OpenSPARC FPGA platform.
- Modified the RTL code of the existing OpenSPARC processor to integrate the new security features.

### Research Assistant @ City University of Hong Kong

Jan 2010 – Jan 2012

Single-Chip Security-Aware Processor

- Proposed a single-chip secure processor architecture that provides memory encryption/decryption and memory integrity verification.
- Designed security modules, including AES and TRNG (RTL) and memory integrity verification (firmware).
- Integrated the security modules with the OpenSPARC T1 processor and prototyped the system on FPGA.

## Education

### Bachelor of Engineering (BE) in Electronic Science and Technology

Huazhong University of Science and Technology

### Doctor of Philosophy (PhD) in Computer Engineering

University of Virginia

## Contact & Social

- LinkedIn: https://linkedin.com/in/rabbitwayne

---

Source: https://flows.cv/weizhang2

JSON Resume: https://flows.cv/weizhang2/resume.json

Last updated: 2026-04-12