A seasoned cloud and site reliability engineer with 15 years of experience building and operating production-grade distributed systems at scale across multi-cloud environments. Over the past year, I’ve focused on building and operating GPU-based AI inference infra running on a few thousand GPUs.

Experience

Luma AI, Software Engineer

2025 — Now

Palo Alto, California, United States

Ping me if Dream Machine inference slows down.

Try dreaming here: https://lumalabs.ai/dream-machine

Role:

Building a sophisticated infra platform that efficiently leverages modern GPUs while meeting internal SLOs.

Luma AI, Consultant

2025 — 2025

Stockholm, Stockholm County, Sweden

Role:

Focused on software reliability and scalability.

King, Principal Site Reliability Engineer

2024 — 2025

Stockholm, Stockholm County, Sweden

King, a part of Microsoft Xbox, is a leading mobile game developer known for hits like Candy Crush Saga. King reaches over 250 million MAU worldwide and generated approx. $3.5 billion in revenue.

[SRE & ML Platform Engineer]

Architected, deployed, and scaled ML workloads on Cloud Platform. Developed an internal ML platform that became the core component of King’s ML ecosystem.

Built production-ready AI/ML infrastructure: Designed and implemented a robust ML infrastructure on Kubernetes, including scheduling, scaling, networking, and storage. Delivered a distributed computing platform for reinforcement learning alongside a data analytics cluster running on Cloud.

Developed ML platform: Developed an internal ML infra platform leveraging Kubernetes and GCP, integrating cutting-edge ML tools, and featuring a user-friendly API to streamline end-to-end pipelines and enhance the AI/ML engineer experience.

Cloud foundation initiative for AI/ML: Designed multi-tenant Kubernetes clusters tailored specifically for ML workloads. Developed Terraform modules for golden path resources and automated deployment processes. Set a declarative GitOps strategy using Atlantis/ArgoCD to enable self-service infra/resource management.

Cross-functional communications: Collaborated across teams and departments to plan, design, and execute complex strategic projects. Worked with MLEs and data scientists daily to support ML model experimentation on the right infrastructure and to optimize system performance. Led onboarding of new team members to help them understand the internal infrastructure architecture.

King, Senior Site Reliability Engineer

2018 — 2024

Stockholm, Stockholm County, Sweden

Responsible for designing, deploying, and enhancing game systems, transitioning live workloads to the Cloud, building production-grade K8s clusters on Cloud, boosting platform stability, modernizing on-prem infrastructure, and participating in on-call rotations to keep live games running.

NCSOFT, Cloud & System Engineer

2016 — 2018

Pangyo, Gyeonggi-Do, Korea

A PC and mobile gaming company serving Asian markets such as South Korea, Japan, and China, with 4,000 employees and $2B in revenue. It is the top-grossing gaming company in South Korea and Taiwan, and its globally popular titles include Lineage and Guild Wars.

[Cloud & System Engineer]

Responsible for designing, optimizing, and supporting infra platforms on AWS, and for developing their continuous delivery.

Leadership on cloud strategy: Led the development of a hybrid cloud project and designed AWS cloud network and platform architecture for global services with scalability and high availability. Oversaw a departmental cloud budget and optimized resources regularly to reduce OPEX.

Delivered hybrid infrastructure: Launched 50M MAU mobile services in both South Korea and Taiwan. Developed Terraform modules to deliver infra platforms across multiple regions. Engineered a CI/CD pipeline that deploys service containers to the Kubernetes cluster. Designed a GPU Kubernetes cluster for AI groups. Developed an application deployment system with SaltStack that works on both Windows and Linux.

Improved service reliability: Developed an on-call monitoring system using InfluxDB with Telegraf, brought service alerts to near real-time, and modernized incident management. Cleaned up the log collection configuration with Fluentd and integrated the monitoring pipeline with Elasticsearch and Graylog.

Communication / Presentations: Trained branch office engineers on the monitoring system and its operating procedures, and led multiple Cloud and Kubernetes hands-on training sessions. Ran regular stand-ups with a Kanban board to keep tasks and project timelines transparent.

Education

Kyung Hee University

BS
