# Xinhan Tong

> Senior Staff Engineer, ML Infrastructure & AI Systems at Scale | Tech Lead | Distributed Systems, Real-Time Inference, LLM Evals | ex-TikTok, Amazon

Location: San Francisco Bay Area, United States

Profile: https://flows.cv/xinhan

Infrastructure engineer with 7+ years building large-scale distributed systems for ML/AI workloads, including real-time inference pipelines, feature platforms, and LLM evaluation infrastructure. I design the systems that make AI models safe, fast, and reliable in production, from training data pipelines to model serving at 5,000+ QPS to behavioral eval frameworks.

Currently leading ML infrastructure at OKX, where I architect streaming/batch feature platforms and real-time inference systems serving transformer, tree-based, and graph models. Previously built ML serving infrastructure at TikTok and distributed systems at Amazon/AWS.

Core expertise spans distributed systems design, ML model serving and inference optimization, LLM evaluation infrastructure, and production AI safety monitoring.

* Scaled real-time ML inference from 10 to 5,000+ QPS with online drift monitoring
* Built feature platform processing streaming, batch, and CDC data for ML model training
* Designed LLM eval harnesses and agentic workflow infrastructure (RAG, tool use, multi-step reasoning)
* Led teams across ML serving, data infrastructure, and platform engineering at TikTok and Amazon

## Work Experience

### Senior Staff Software Engineer @ OKX

Jan 2025 – Present | San Jose, California, United States

- Antifraud Architecture
- Feature Store
- Machine Learning
- AI Agents for Infra Automation
- AI Agents for Antifraud Risk Analysis

### Staff Software Engineer @ OKX

Jan 2024 – Jan 2025 | San Jose, California, United States

- Feature Store
- Risk Management
- Machine Learning Infra

### Lead Software Engineer - Global E-Commerce @ TikTok

Jan 2023 – Jan 2024 | Sunnyvale, California, United States

- Led the design, implementation, and launch of a VOC/VOB (Voice of Customer/Voice of Business) analytics platform for TikTok Shop from zero to one, transforming raw data into actionable insights. The platform integrates data sources with different latencies (Hive, Kafka, etc.), normalizes and stores the data (ClickHouse, Redis), and generates OLAP dashboards. It also notifies POCs (points of contact) in the event of business incidents;
- Stakeholders from 10+ teams use the platform to analyze the TikTok Shop customer experience, initiate proactive measures, and track the progress of resolutions;
- Improved operational efficiency by integrating a Large Language Model to automate customer support issue summarization for data-driven business decisions;
- Grew the engineering team from 4 to more than 10 engineers and facilitated the onboarding of 5+ new hires; set up best practices for the Software Development Life Cycle (SDLC); devised the roadmap and successfully launched critical feature loops.

### Software Engineer @ DoorDash

Jan 2022 – Jan 2023 | Remote

- Hired on Nov 28 as a tech lead to build a new platform to improve third-party data ingestion;
- Impacted by the DoorDash layoff event on Nov 30.

### Software Development Engineer II @ Amazon

Jan 2020 – Jan 2022 | Greater Seattle Area

- Led the design, implementation, and launch of the Automated Collection of Evidence (ACE) service as a single source of truth for the transportation risk compliance org. The service is architected as four layers: ingestion (SQS, Lambda), transformation (Airflow), storage (S3, DynamoDB, Aurora), and distribution (ECS, GraphQL). It reduced transportation case auditing time from 15 minutes to about 1 minute per audit;
- Designed, led, and implemented a service to collect and aggregate PII (Personally Identifiable Information) from multiple data sources; set up the CI/CD pipeline, dashboards, alarms, an operations runbook with SOPs (Standard Operating Procedures), and compliance docs so that the service keeps the data continuously compliant with the FTC SLA;
- Reduced the service cost by about 40%, resolved tech debt, and attended weekly operations meetings as Ops Lead;
- Migrated GMRA (Gather, Model, Rules, Actions) workflows of merchant customers from dedicated services to a cloud-native machine learning platform (AWS SageMaker).

### Software Engineer @ Amazon Web Services (AWS)

Jan 2018 – Jan 2020 | Seattle, Washington, United States

- Designed, investigated, implemented, and launched AWS KMS China with senior engineers by building a Java service on AWS EC2 with a JNI layer attached to a third-party Hardware Security Module (HSM);
- Versioned the creation of the KMS China infrastructure with AWS CloudFormation templates (Ruby, TypeScript), and set up metrics, alarms, dashboards, runbooks, change management plans, and incident response plans;
- Set up a Network Load Balancer to distribute thousands of TPS of KMS China traffic, and created NTP services to sync the HSMs with UTC, which recovered 14% of hosts in one Availability Zone;
- Improved the security of random number generation by reseeding from the HSM with a multithreaded implementation.

### Software Engineer Internship @ Amazon

Jan 2017 – Jan 2017 | Seattle, Washington, United States

- Designed, implemented, and deployed an email notification system, consisting of two AWS Lambda services and two Coral (Apache Tomcat) services, to improve the post-purchase experience by pushing email and text notifications;
- Set up the CI/CD pipeline of the email notification system with AWS CloudFormation (Ruby).

### Research Assistant @ University of Rochester

Jan 2015 – Jan 2015 | Rochester, New York Area

- Designed biochips individually to conduct research on erythrocyte culture and mastered the process of maturing erythrocytes by artificial methods;
- Developed an image processing program to programmatically monitor cell growth.

### Campus Ambassador (Chairman) @ National Instruments

Jan 2014 – Jan 2015

- Strengthened the relationship between the campus tech club and the company by organizing lectures that promoted specific programs and products to students;
- Conducted student surveys to assess satisfaction with the company's products and services.

## Education

### Graduate Student in Biomedical/Medical Engineering

UC Irvine

### Bachelor's degree in Biomedical Engineering

Zhejiang University

### Computer Science

Coursera

## Contact & Social

- LinkedIn: https://linkedin.com/in/xinhan-tong

---

Source: https://flows.cv/xinhan

JSON Resume: https://flows.cv/xinhan/resume.json

Last updated: 2026-04-01