A seasoned cloud and site reliability engineer with 15 years of experience building and operating production-grade distributed systems at scale across multi-cloud environments. Over the past year, I’ve focused on building and operating GPU-based AI inference infra with a few thousand GPUs.
Experience
2025 — Now
Palo Alto, California, United States
Ping me if Dream Machine inference slows down.
Try dreaming here: https://lumalabs.ai/dream-machine
Role:
• Building a sophisticated infra platform that efficiently leverages modern GPUs while meeting internal SLOs.
2025 — 2025
Stockholm, Stockholm County, Sweden
Role:
• Focusing on software reliability and scalability.
2024 — 2025
Stockholm, Stockholm County, Sweden
King, a part of Microsoft Xbox, is a leading mobile game developer known for hits like Candy Crush Saga. King reaches over 250 million MAU worldwide and generated approx. $3.5 billion in revenue.
[SRE & ML Platform Engineer]
Architected, deployed, and scaled ML workloads on Cloud Platform. Developed an internal ML platform that became the core component of King’s ML ecosystem.
• Built production-ready AI/ML infrastructure: Designed and implemented a robust ML infrastructure on Kubernetes, including scheduling, scaling, networking, and storage. Delivered a distributed computing platform for reinforcement learning alongside a data analytics cluster running on Cloud.
• Developed ML platform: Developed an internal ML infra platform leveraging Kubernetes and GCP, integrating cutting-edge ML tools, and featuring a user-friendly API to streamline end-to-end pipelines and enhance the AI/ML engineer experience.
• Cloud foundation initiative for AI/ML: Designed multi-tenant Kubernetes clusters tailored specifically for ML workloads. Developed Terraform modules for golden path resources and automated deployment processes. Set a declarative GitOps strategy using Atlantis/ArgoCD to enable self-service infra/resource management.
• Cross-functional communications: Actively collaborated across teams and departments to plan, design, and execute complex strategic projects. Worked with MLEs and data scientists on a daily basis to support ML model experimentation on the right infrastructure and optimize system performance. Led onboarding of new team members to help them understand the internal infrastructure architecture.
2018 — 2024
Stockholm, Stockholm County, Sweden
Responsible for designing, deploying, and enhancing game systems, transitioning live workloads to the Cloud, building production-grade K8s clusters on Cloud, boosting platform stability, modernizing on-prem infrastructure, and participating in on-call rotations to keep live games live.
2016 — 2018
Pangyo, Gyeonggi-Do, Korea
A PC and mobile gaming company serving Asian markets such as South Korea, Japan, and China, with 4,000 employees and $2B in revenue; the top-grossing gaming company in South Korea and Taiwan. Its globally popular titles include Lineage and Guild Wars.
[Cloud & System Engineer]
Responsible for designing, optimizing, and supporting infra platforms on AWS, and for developing continuous delivery.
• Leadership on cloud strategy: Led the development of a hybrid cloud project and designed AWS cloud network and platform architecture for global services with scalability and high availability. Oversaw a departmental cloud budget and optimized resources regularly to reduce OPEX.
• Delivered hybrid infrastructure: Launched 50M MAU mobile services in both South Korea and Taiwan. Developed Terraform modules to deliver infra platforms across multiple regions. Engineered a CI/CD pipeline for the Kubernetes cluster to deploy service containers. Designed a GPU Kubernetes cluster for AI groups. Developed a SaltStack-based application deployment system that works on both Windows and Linux.
• Improved service reliability: Developed an on-call monitoring system using InfluxDB with Telegraf, brought service alerts to near real time, and modernized incident management. Clarified the log collection configuration with Fluentd and integrated the monitoring pipeline with Elasticsearch and Graylog.
• Communication / Presentations: Trained branch office engineers on the monitoring system and its operating procedures, and led multiple Cloud and Kubernetes hands-on training sessions. Held regular stand-ups with a Kanban board to keep tasks and project timelines transparent.
Education
Kyung Hee University
BS