Working on resiliency, fault tolerance, cluster management, and distributed systems for Amazon Elasticsearch Service.
Contributions (as technical lead):
* "Autotune for Amazon ES" - a self-adapting feedback loop mechanism for intelligently optimizing Elasticsearch clusters.
* Hyper-scale shard allocation and fault tolerance in UltraWarm-enabled Amazon ES clusters.
* A strongly consistent framework for in-place configuration updates in a distributed system.
* A self-healing framework that automatically repairs clusters.
Other projects I've helped build:
* Split-brain avoidance mechanisms.
* Internal monitoring systems.
* Internal architecture of the service across control and data plane.
* Parts of the domain lifecycle that support scaling updates and configuration changes.
I'm routinely involved in operational deep dives and mentoring other engineers.