Field: Software
Domain: Platforms & Infrastructure
Specialization: Cloud Native Distributed Systems, Observability, Security, Networking, Identity & Access Management
Location: SF Bay Area
Languages: Golang, Java/Kotlin, Python
Technologies: gRPC, Kubernetes, Istio Service Mesh, Helm, AWS (EC2, SNS, SQS, RDS Aurora...
Experience
2023 — Now
Member of the Platform Infrastructure Engineering & Reliability team focusing on Compute Platform & Service Infrastructure.
The compute platform team owns Kubernetes and everything that runs on top of it. I'm primarily responsible for Flyte & Armada. Flyte is a workflow orchestration system that schedules workloads to Armada, a multi-cluster batch queuing system that runs on Kubernetes. Together these systems run several hundred thousand jobs per day and power all batch workloads, such as data ingestion, simulation, and distributed ML training on GPUs with Ray. I maintain these systems, scale them, and heavily modify them to improve observability, security, scalability, and general user experience.
I have upstreamed quite a bit of open source work to support our compute engineering efforts.
• Armada: https://github.com/armadaproject/armada/commits?author=Sovietaced
• Armada Operator: https://github.com/armadaproject/armada-operator/commits?author=Sovietaced
• Flyte (Nominated Committer): https://github.com/flyteorg/flyte/commits?author=Sovietaced
The service infrastructure team builds rock-solid core services, libraries, and tooling so that application teams can land production-grade code as efficiently, securely, and safely as possible. We act as the glue between the cloud application teams, cloud infrastructure teams, and the site reliability team.
I have upstreamed some work to support our service infrastructure efforts.
• I wrote HTTP client metrics for OTEL: https://github.com/open-telemetry/opentelemetry-go-contrib
As an early hire, I helped shape the foundation for backend engineering by driving our choice of backend language, API protocol, and data format, and by making it all work with Bazel. I built core services and middleware for authenticating, authorizing, and auditing gRPC requests. I built libraries to easily integrate OpenTelemetry metrics and traces throughout our services and their respective dependencies.
2024 — Now
I review pull requests, provide feedback, and contribute changes to make Flyte more performant, secure, and robust. Some of these changes come from my work at Stack AV, and some I make on my own time outside of work.
My most notable contributions are the following:
1. Adding support for RBAC and tenant isolation
2. Driving development of the Ray plugin
3. Reducing Kubernetes API and etcd load by more than 80% to improve scalability
You can find all of my commits here:
1. https://github.com/flyteorg/flyte/commits?author=Sovietaced
2. https://github.com/flyteorg/flytekit/commits?author=Sovietaced
2023 — 2023
Cupertino, California, United States
Special Projects Group
Production Intent Cloud Services for Autonomous Systems
2022 — 2022
Palo Alto, California, United States
Staff software engineer on the transportation platform team with a focus on service infrastructure. Highlights below:
• Technical lead & architect responsible for service infrastructure software engineering efforts. Cross-functional leader for interactions with Product Security, Cloud Platform, and SRE teams.
• Designed a multi-region AWS cloud architecture to achieve high availability in the event of a regional AWS service failure.
• Designed and drove adoption of a custom IAM solution to support authentication, authorization, and accounting for API requests made by internal employees, internal service workloads, and external partners. Serves ~50M production requests per day.
• Led efforts to improve cloud connectivity with vehicles. Added support for multi-carrier concurrent connections as well as support for connection priority to help prevent TCP head-of-line blocking issues. Reduced vehicle query failures across the entire fleet from 3% to near 0%.
• Built instrumental common Java libraries for Redis, Postgres, Elasticsearch, and MQ usage in order to enforce best practices and add support for distributed tracing and custom Prometheus metrics.
• Drove adoption of Helm charts to dramatically reduce Kubernetes manifest code and make deployments less error-prone. Reduced a typical Kubernetes manifest from 700 LOC to 80 LOC.
• Drove efforts to build a protobuf monorepo and gRPC client artifact publishing pipeline used across the organization.
• Leveraged Prometheus and Grafana to introduce rich service health dashboards. Integrated with SRE infrastructure to send PagerDuty alerts for elevated API error rates, elevated API latency, etc.
• Helped the team grow from 5 to 20+ engineers by conducting 150+ tech screens and onsite interviews.
2019 — 2022
Palo Alto, California, United States
Education
Marist University