Developed a critical load-testing tool to scale the Ordering engine and its hundreds of dependencies for peak events such as Prime Day and Black Friday. Previously, major outages prevented customers from placing orders; the tool ensures the ordering ecosystem is tested accurately and at scale so that customers are not impacted.
Key contributions include:
•Designed and implemented the traffic network infrastructure on multiple AWS services, scaling to hundreds of thousands of transactions per second with evenly distributed traffic.
•Developed dynamic traffic safety mechanisms that automate real-time load adjustments and reduce manual intervention.
•Regularly identified and addressed scaling bottlenecks, optimizing system performance and improving the tool's ability to accurately replay high-load scenarios.
•Ran the tool hands-on during critical events such as Prime Day, and contributed significantly to the monitoring dashboards, alarms, and performance metrics that track tool health and provide real-time visibility into system performance.