Passionate about building cloud infrastructure and developing reliable distributed web services.
Experience
2023 — Now
New York, New York, United States
2022 — 2023
Austin, Texas, United States
Cloud Infrastructure and Services (AWS / Terraform / Kubernetes)
• Seamlessly transitioned 150+ on-prem services to AWS with zero downtime. Worked with 40 product and infrastructure teams to coordinate delivery.
• Built cloud infrastructure and projects using core AWS services (e.g., EC2, VPC, S3, SQS, ElastiCache), configuration/deployment tools (Terraform, Puppet, and Kubernetes), and cloud governance applications to manage Indeed’s AWS Organizations footprint.
• Built iterations of production Kubernetes clusters, configured load balancers both on-prem and in AWS, and contributed to automation tools to deploy applications.
Multi-region Deployment for Indeed Interview Platform (MongoDB / Atlas)
• Designed and delivered a better user experience (with higher reliability and multi-region read/write availability) and reduced latency by 50% for APAC clients using a business-critical interview platform.
• Wrote a guide, established best practices, and reached a consensus between cross-functional teams for MongoDB geo-sharding in Atlas, for distributed services requiring data migration.
AWS Migration Toolkit (Spring Boot / React / Vault)
• Designed and led the development of a self-service web app to analyze and migrate deployment and configuration data between data centers in Indeed's distributed configuration system.
• Automated the process of deploying and migrating systems to AWS, cutting it from days to under one hour, and promoted the toolkit as a critical part of the engineering workflow across the entire organization.
Engineering Leadership
• Mentored other software engineers on technical details, increasing impact and advancing their career goals.
• Led planning for enhancing SRE on-call support, security, and other reliability goals.
• Drafted, led, and participated in 50+ design reviews for both product applications and infrastructure solutions.
2021 — 2022
Austin, Texas, United States
Incident Impact Analysis Tool (Python Flask / React / Jira)
• Designed and led the development of an interactive web app to help analyze the predicted impact of system outages on various critical business KPIs.
• Visualized the analysis and accelerated it from days to minutes by implementing APIs backed by real-time data from Datadog.
CI/CD Improvement and Support
• Wrote Dockerfiles and used build automation tools (Ant and Gradle) and CI/CD pipelines (Jenkins and GitLab) to deploy applications to Kubernetes clusters.
• Integrated Datadog synthetic testing into GitLab CI/CD pipelines to catch UI errors earlier, at the deployment stage, for frontend web applications.
Operational Support for Indeed’s System Infrastructure and Applications (Terraform / Puppet / Kubernetes)
• Managed, monitored, and supported networks (Load Balancers, DNS, and routing rules) for employers.indeed.com and its subdomains.
• Provisioned, configured, and iterated on on-prem and cloud infrastructure (server nodes/EC2 instances, in-memory caches – Memcached/Redis, Kubernetes clusters, etc.) with Terraform and Puppet.
Observability Improvement and Support (Datadog)
• Configured a large number of Datadog dashboards and developed reusable Terraform modules to manage SLO monitoring for system infrastructure, applications, databases, and message queues.
• Implemented and iterated on Datadog synthetic tests and developed an internal status page to inform customer service teams of system health.
2018 — 2021
Austin, Texas Area
On-call Support and Reliability Best Practices (SLOs / PagerDuty)
• Contributed to on-call mitigation, investigation, and remediation of major company-wide events.
• Guided product teams to create proper SLOs and establish on-call processes by reviewing and improving their reliability checklist and documentation.
• Developed a self-service process with Terraform to configure team-specific on-call schedules and escalation policies in PagerDuty.
• Held weekly operational reviews with production teams to identify and address observability and reliability gaps.
Chaos Testing
• Prepared, executed, and monitored different types of chaos testing.
• Identified and resolved generic issues, and verified 20+ critical services’ ability to fail over between data centers.
Dependency API Development (Spring Boot / MySQL)
• Designed and developed APIs providing insights into transitive service dependencies.
• Developed a cron job that gathers data and saves it to a MySQL database over time.
2016 — 2016
Ann Arbor
• Processed historical newspaper images using OpenCV to apply automatic decomposition programs.
• Implemented and refined a vision algorithm to perform segmentation and classification of newspaper images.
• Implemented an evaluation system to score segmentation and classification results against ground-truth data.
Education
University of Michigan
Master's degree
University of Michigan
Master's degree
Tongji University