San Jose, California, United States
Shipped 🚢 critical platform infrastructure processing 5M+ messages/day across 2,000+ edge servers with p99 latency < 500ms and 99.99% uptime.
💻 Fleet Orchestration & Message Routing (Go, REST, gRPC, NATS, Webhooks, WebSockets)
• Architected fleet orchestrator and message routing service deployed across 2,000+ servers as persistent daemons with staged canary-to-stable rollouts, self-recovery, and zero-downtime updates
• Built fleet management service from scratch: 25 data models spanning data centers, server racks, 2,000+ servers, and API clients, serving as the central data source for 10+ internal services
• Developed 50+ REST APIs and gRPC methods powering remote fleet upgrades: a single config change propagates updates across the entire fleet, with instant rollback
🔍 AI Fleet Investigation Agent Spark (Python, Claude, Slack)
• Built Claude-based fleet investigation agent adopted by 50+ engineers: automatically triggered by health check alerts (every 15s across 2,000+ servers), it SSHs into unhealthy machines, inspects processes, system load, memory, CPU, and service health, and posts a full diagnostic report in Slack before engineers react
• Also serves as on-demand investigation tool: engineers ask questions in natural language, and it queries production databases, searches distributed logs, and correlates events across services to diagnose issues in minutes vs. 30+ minutes manually
⚙️ AI Bug-Fix Automation (Python, Claude, Linear, GitHub)
• Built AI CLI tool enabling one-command bug resolution: remedy <ticket-id> → automatic PR, reducing fix cycle from hours to minutes
• Engineered 7-stage AI pipeline: Slack/Linear context extraction → codebase analysis → implementation plan → AI review → code generation → GitHub PR creation
🤖 AI Slack Delegate Twin (Python, Claude, MCP Servers)
• Built AI assistant that monitors Slack mentions and responds using Claude with real-time access to GitHub, Notion, and Linear via MCP, handling questions, status lookups, and code explanations automatically