# Kyungcheol Chang

> Software Engineer - AI/ML Infrastructure

Location: San Francisco Bay Area, United States
Profile: https://flows.cv/kyungcheol

A seasoned cloud and site reliability engineer with 15 years of experience building and operating production-grade distributed systems at scale across multi-cloud environments. Over the past year, I’ve focused on building and operating GPU-based AI inference infra with a few thousand GPUs.

Extensive experience designing and operating large-scale K8s platforms powered by CNCF technologies, architecting sophisticated AI/ML infrastructure, and advancing platform engineering maturity. Focused on reliability at scale, performance efficiency, security, and developer experience across cross-functional teams.

Areas of professional expertise:
• Deliver production-grade Kubernetes for CPU and AI/ML GPU workloads
• Improve platform reliability and resilience on top of Kubernetes
• Modernize legacy environments and observability
• Problem Solving / Strategic Thinking
• Cross-functional communications

## Work Experience

### Software Engineer @ Luma AI
Jan 2025 – Present | Palo Alto, California, United States

Ping me if Dream Machine inference slows down. Try dreaming here: https://lumalabs.ai/dream-machine

Role:
- Building a sophisticated infra platform that efficiently leverages modern GPUs while meeting internal SLOs.

### Consultant @ Luma AI
Jan 2025 – Jan 2025 | Stockholm, Stockholm County, Sweden

Role:
- Focusing on software reliability and scalability

### Principal Site Reliability Engineer @ King
Jan 2024 – Jan 2025 | Stockholm, Stockholm County, Sweden

King, a part of Microsoft Xbox, is a leading mobile game developer known for hits like Candy Crush Saga. King reaches over 250 million MAU worldwide and generated approx. $3.5 billion in revenue.

[SRE & ML Platform Engineer] Architected, deployed, and scaled ML workloads on Cloud Platform.
Developed an internal ML platform that became the core component of King’s ML ecosystem.

• Built production-ready AI/ML infrastructure: Designed and implemented a robust ML infrastructure on Kubernetes, including scheduling, scaling, networking, and storage. Delivered a distributed computing platform for reinforcement learning alongside a data analytics cluster running on Cloud.
• Developed ML platform: Developed an internal ML infra platform leveraging Kubernetes and GCP, integrating cutting-edge ML tools, and featuring a user-friendly API to streamline end-to-end pipelines and enhance the AI/ML engineer experience.
• Cloud foundation initiative for AI/ML: Designed multi-tenant Kubernetes clusters tailored specifically for ML workloads. Developed Terraform modules for golden path resources and automated deployment processes. Set a declarative GitOps strategy using Atlantis/ArgoCD to enable self-service infra/resource management.
• Cross-functional communications: Actively collaborated across teams and departments to plan, design, and execute complex strategic projects. Worked with MLEs and data scientists on a daily basis to support ML model experimentation on the right infrastructure and optimize system performance. Led onboarding of new team members to help them understand the internal infrastructure architecture.

### Senior Site Reliability Engineer @ King
Jan 2018 – Jan 2024 | Stockholm, Stockholm County, Sweden

Responsible for designing, deploying, and enhancing game systems, transitioning live workloads to the Cloud, building production-grade K8s clusters on Cloud, boosting platform stability, modernizing on-prem infrastructure, and participating in on-call rotations to keep live services live.

### Cloud & System Engineer @ NCSOFT
Jan 2016 – Jan 2018 | Pangyo, Gyeonggi-Do, Korea

A PC and mobile gaming company serving Asian markets such as South Korea, Japan, and China, with 4,000 employees and $2B in revenue, and the top-grossing gaming company in South Korea and Taiwan.
Its globally popular titles include Lineage and Guild Wars.

[Cloud & System Engineer] Responsible for designing, optimizing, and supporting infra platforms on AWS, and for developing continuous delivery.

• Leadership on cloud strategy: Led the development of a hybrid cloud project and designed AWS cloud network and platform architecture for global services with scalability and high availability. Oversaw a departmental cloud budget and optimized resources regularly to reduce OPEX.
• Delivered infrastructure in hybrid: Launched 50M MAU mobile services in both South Korea and Taiwan. Developed Terraform modules to deliver infra platforms across multiple regions. Engineered a CI/CD pipeline for the Kubernetes cluster to deploy the service containers. Designed a GPU Kubernetes cluster for AI groups. Developed an application deployment system with SaltStack that works on both Windows and Linux.
• Improved service reliability: Developed an on-call monitoring system using InfluxDB with Telegraf, improved service alerts to near real-time, and modernized incident management. Clarified the log collection configuration with Fluentd and integrated the monitoring pipeline with Elasticsearch and Graylog.
• Communication / Presentations: Trained branch office engineers on the monitoring system and its operating procedures, and led multiple Cloud and Kubernetes hands-on training sessions. Organized regular stand-ups with a Kanban board to keep tasks and project timelines transparent.

### Cloud & System Engineer @ SK planet
Jan 2013 – Jan 2016 | Seoul, Korea

The top e-commerce and O2O service company in South Korea (11st.co.kr) and Turkey (n11.com) with 3,000 employees and $1B in revenue.

[Cloud system engineer] Responsible for developing and engineering OpenStack as a private cloud and integrating public cloud with AWS.

• Developed private cloud: Designed in-house OpenStack architecture and cloud governance.
Integrated OpenStack into dev, stage, and production environments, replacing the virtualization software solution and physical machines. Achieved a 30% TCO reduction in infra operations with the in-house cloud service.
• Delivered services in hybrid cloud: Designed AWS VPC with on-prem network connectivity and established public cloud infra management policies covering security, resource management, and monitoring. Delivered eCommerce platform infrastructure to the Thailand and Indonesia markets.
• Problem solving: Investigated service performance and was responsible for daily troubleshooting of private cloud issues in areas such as database, storage, and KVM virtualization.

### Cloud Engineer @ KT
Jan 2010 – Jan 2013

The largest telecom company in South Korea and the first cloud provider in the domestic market, with a $65M cloud joint venture with Softbank. Major clients included Samsung and Softbank.

[Public cloud platform engineer] Responsible for developing, providing, engineering, and consulting on cloud computing for stakeholders.

• Developed IaaS: Developed a cloud system using CloudStack and launched a public cloud service. Engineered Linux and Windows server images, the pre-configured operating systems offered on the public cloud.
• Engineered cloud infrastructure: Integrated an automation system with monitoring tools such as Nagios, Cacti, and collectd. Developed an infra deployment management system with Chef cookbooks.
• Consulted cloud customers: Integrated SAP HANA into KT cloud with SAP staff at the Korea R&D center.
### IT Market Development Assistant (Contract) @ KOTRA
Jan 2009 – Jan 2009 | Greater New York City Area / Seoul

[IT market researcher] Conducted IT market research and assisted in facilitating technology development between US and Korean companies in the telecommunications industry

- Extensive market research focused on telecommunications and green energy
- Responsible for organizing a cross-border partnership conference in Palisades, NY: "Korea-U.S Telecommunication Global Partnering week"

### Web Programming Intern @ Symbio Technologies
Jan 2008 – Jan 2009 | New Rochelle, New York Area

PHP Programming and QA Testing VDI Thin-Client

- Developed a website with PHP and MySQL
- Performed QA on VDI thin-client hardware

### Mandatory Military Service @ Republic of Korea Army Reconnaissance
Jan 2004 – Jan 2006 | Goyang, Gyeonggi, South Korea

Lead Tactical Radio Operator in an Army reconnaissance unit. Qualified in helicopter rappel operations with 10+ insertion training missions.

## Education

### BS in Computer Engineering
Kyung Hee University

### BS in Electronic Engineering
Kyung Hee University

## Contact & Social

- LinkedIn: https://linkedin.com/in/kyungcheol

---

Source: https://flows.cv/kyungcheol
JSON Resume: https://flows.cv/kyungcheol/resume.json
Last updated: 2026-04-11