I am a compiler writer and designer. My work has focused on SSA, flow, and loop optimizer design and implementation; on code generator design and implementation; on profile-based optimizations; and on highly optimized compiler runtime support, across a large number of compilers, tools, and targets.

Experience

SiFive, Principal Compiler Engineer

2021 — Now

San Mateo, California, United States

Apple, GPU Compiler Engineer

2017 — 2021

Cupertino, CA

I was a member of the backend code generation and performance optimization team. My work focused on enabling current and future devices, analyzing generated code, and implementing optimizations to improve performance. This included enabling numerics support, creating new math compilation models, and designing and implementing math algorithms and features in the LLVM compiler, with some aspects in open source.

Intel Corporation, Sr Staff Compiler Engineer

2012 — 2017

Santa Clara, CA

While at Intel I was a member of the JVM compiler team, leading Intel's Graal compiler effort while also doing major compiler implementation work in the C2 optimizing compiler. I implemented loop transformations and extended vectorization; added new machine descriptions, code generators, assemblers, and related support; implemented scalar reductions, unrolling guidance for vectorizable loops, vector drain loops, loop splitting, range check elimination, and unrolling; and designed and implemented a pre-register-allocation register pressure scheduler. All of this targeted current and future Intel architectures across multiple implementations of the Java optimizing compiler.

In a prior role, I developed optimizations in the OpenCL compiler toolchain for Intel GPU architectures. I authored and implemented Region Prescheduling for GPUs, including efficient register pressure management, innermost-loop and basic-block code motion, global code motion, and other region-based optimizations. I co-authored SSA-based treescan register allocation, contributing complex flow functionality such as nested and innermost loop support, among other areas. I also implemented outlining via function calls, loop optimizations, context-sensitive liveness, hierarchical flow-based inlining and its heuristics, and OpenCL language features in LLVM and some proprietary compiler back ends.

AMD, SMTS Compiler Engineer

2006 — 2012

Portland, Oregon Area

While at AMD I was a member of the Open64 compiler team. I authored optimizations such as control-flow-based full unrolling, best-fit unrolling guided by register pressure, and register pressure optimizations during code generation. I also authored memory library routines using multiversioning; implemented profile-based indirect inlining, conditional region merging, context-sensitive interprocedural optimizations, loop distribution and peeling, a post-register-allocation dispatch scheduler, peephole optimizations, and multiversioning for interior pointers/alignment (patented); and wrote the Bulldozer machine model while serving as the lead engineer for code generator development. I re-targeted Linux compilers to native Win32, which included a full range of tools and runtime support. The emphasis of this role was improving performance on AMD architectures for SPEC benchmarks, and I collaborated on hand-optimizing prototype optimizations for many SPEC CPU2006 INT and FP benchmarks. Finally, I was the lead engineer for the Family 15h Software Optimization Guides for over 3 years.

Cray Inc., Compiler Engineer IV

2004 — 2006

Seattle, WA

While working in the Programming Environments group targeting the DARPA-funded Cascade architecture, I re-targeted our back end compiler and optimized its global hierarchical graph coloring register allocators and local register allocators. I implemented various machine descriptions and code generators for an MTA-like parallel processor, the Opteron, and the Black Widow processor, and authored pre/post local schedulers in the compiler back end for the C/C++/F95 compilers. I also wrote an Opteron encoder for specific target work, and designed and implemented a code generation scheme that blended co-code generation for two processors with simultaneous target emission, to facilitate code generation of parallel workloads on two different architectures as part of the Cascade architecture.

Education

California State Polytechnic University-Pomona

Bachelor of Science (BS)