Experienced software engineer with a passion for backend development and distributed systems. Currently working at Confluent on transforming Apache Kafka into a cloud-native service.
Experience
2022 — Now
San Francisco Bay Area
Cloud-Native Kafka Transformation and Engineering Leadership
Spearheading the transformation of Kafka into a cloud-native service, focusing on Kafka cluster load balancing, multi-tenancy support, quota enforcement, and comprehensive observability.
Team Lead – Workload Rebalancing: Led a team of three engineers in developing a dynamic workload rebalancing algorithm across Kafka broker cells. This initiative reduced mean time to recovery (MTTR) for imbalance issues from one week to one day. Enhanced cluster observability by surfacing critical metrics such as CPU usage, topic replica distribution, and network traffic to provide clear insights into cluster balance.
Lead Engineer – Broker Visibility Metrics: Delivered a suite of observability enhancements including:
Client metadata (Kafka client version and software name)
Hot partition detection (identifying partitions consuming >80% of broker resources)
Metrics for deprecated Kafka client requests
These metrics significantly improved issue diagnostics and cluster health monitoring.
Lead Engineer – Compute Offload: Designed and implemented the Compute Offload framework, enabling execution of stateless functions within Kafka Produce/Fetch request workflows. This unlocked use cases such as sensitive data masking and high-throughput schema validation. Led architecture design and seamless integration of custom function execution into core Kafka request handling.
Core Contributor – Incremental Rebalancing: Played a key role in designing and implementing an incremental workload rebalancing algorithm, significantly improving balance in large-scale, multi-tenant Kafka clusters (100+ brokers). Results included a reduction in p99 end-to-end request latency from 400ms to 20ms and a >90% drop in customer escalations related to imbalance.
Mentorship & Code Quality: Actively mentored junior engineers, providing guidance on project design and code reviews to ensure quality, maintainability, and alignment with system goals.
2019 — 2022
San Francisco Bay Area
• Designed and developed Physical Device Benchmarking, allowing ASR scientists to benchmark language models on real, managed Alexa devices — a key step in the fully automated ASR model release workflow.
• Developed the Alexa Device Provisioning workflow, enabling ASR scientists to reserve and configure Alexa devices with various firmware and model revisions in one click.
• Added audio-streaming capability so that benchmarking services can stream audio directly to the underlying devices.
• Set up the in-office device lab and developed a device agent running on the lab hosts to encapsulate the complexity of managing devices and performing health checks.
• Mentored three engineers on career development, solving ambiguous problems, and AWS technologies.
• Contributed to decisions in the team's feature-request intake process and ticket-resolution categorization.
2019 — 2019
San Francisco Bay Area
• Owned the Athena Utterance Paraphrasing Workflow (AUPW) backend and frontend, a system that gives engineers a reliable way to generate paraphrases for given utterances and compare the similarity of the paraphrases and their corresponding answers. Built the entire system, backend to frontend, with Java, Python, HTML, CSS, and JavaScript.
• Delivered the Athena Portal Dashboard, which shows stakeholders trends in their scheduled test runs with configurable metrics; users can create, modify, and delete graphs with simple operations. The dashboard follows the single-page application (SPA) pattern, with a backend built on Java, DynamoDB, and Amazon S3 and a frontend in HTML, CSS, and JavaScript.
• Trusted as a troubleshooter and mentor for new teammates, helping them ramp up on the team's development and testing procedures.
2016 — 2019
San Francisco Bay Area
• Built a scalable test solution for French spellings and definitions sourced from the Synapse dictionary, and extended it so the same framework can be used for other locales. After dictionary ingestion and quality testing, definitions QSR for the fr-FR locale increased by 38%.
• Built the database access layer for querying the FUD database for frequently asked questions, used by multiple parts of the Athena ecosystem.
• Built the Utterance Replay Service (URS), allowing engineers to replay questions in test environments to increase test-utterance coverage.
• Built the Athena Device Testing Service to handle automation requests for testing the screens of multimedia devices.
• Added a metrics page to the Athena Service, which previously lacked critical metrics and was therefore hard to monitor; the page has proven very useful for tracking the service's load and status.
• Worked with the Arts & Entertainment team to build an intent-analysis tool that gives the team a clean, highly configurable way to fetch, store, and compare DCQS intents and NLU intents.
2015 — 2016
Greater New York City Area
Developed and tested VMTurbo Operations Manager, a complex, market-based data center monitoring and control product.
Built an automated testing framework for the Operations Manager based on Robot Framework. The framework takes over regression testing so that System Test Engineers can focus on testing new features instead of spending large amounts of time on regression runs.
Also took part in the development and manual QA of the Operations Manager to become familiar with the product's features and to identify candidates for test automation.
Education
Cornell University
Master's Degree
South China University of Technology