🌟 AI & Data Engineer | LLM/Agentic Systems Builder | Data-Driven Problem Solver

I'm Shabari Vignesh, an AI Engineer with 4+ years of experience building scalable data platforms and AI-driven systems that support real business decisions.
Experience
2025 — Now
Santa Clara, California, United States
Project 1: Enterprise Data Lake & Merchant Data Platform (AWS), providing a single, reliable source of truth for merchant, payment, inventory, and settlement data
• Built a centralized AWS S3 data lake consolidating data from 3+ external systems (Shopify, Fiserv, Clover), processing 10M+ transaction and inventory records per day.
• Designed event-driven ingestion pipelines using EventBridge -> Step Functions -> multiple Lambda functions, enabling modular, fault-tolerant data collection workflows (see the ingestion sketch below).
• Reduced data fragmentation by ~70%, replacing siloed, system-specific reports with curated, analytics-ready datasets consumed across BI and AI use cases.
• Initially stored ingestion outputs in JSONL format for flexibility, then migrated curated datasets to Parquet, reducing data scanned by ~70% and improving Athena query performance by ~4-6× (see the conversion sketch below).
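A minimal sketch of the EventBridge -> Step Functions hand-off, assuming a Lambda function subscribed to an EventBridge rule; the state machine environment variable and event fields are hypothetical, not the production names.

```python
import json
import os

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    """Route an incoming EventBridge event into a Step Functions workflow."""
    # EventBridge delivers the payload under "detail"; "source" identifies the
    # upstream system (e.g. a Shopify, Fiserv, or Clover webhook bridge).
    execution_input = {
        "source": event.get("source", "unknown"),
        "detail": event.get("detail", {}),
    }
    response = sfn.start_execution(
        stateMachineArn=os.environ["INGESTION_STATE_MACHINE_ARN"],  # hypothetical env var
        input=json.dumps(execution_input),
    )
    return {"executionArn": response["executionArn"]}
```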
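And a sketch of the JSONL-to-Parquet conversion, assuming pyarrow and date-partitioned output; paths and the partition column are illustrative. Parquet's columnar layout is what cuts the bytes Athena scans per query.

```python
import pyarrow.json as paj
import pyarrow.parquet as pq

def jsonl_to_parquet(jsonl_path: str, parquet_root: str) -> None:
    """Convert one newline-delimited JSON file into a partitioned Parquet dataset."""
    # pyarrow reads JSONL directly into a columnar in-memory Table.
    table = paj.read_json(jsonl_path)
    # Partitioning by ingest date lets Athena prune whole partitions at query time.
    pq.write_to_dataset(
        table,
        root_path=parquet_root,
        partition_cols=["ingest_date"],  # hypothetical partition column
    )
```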
Project 2: AI-Powered Merchant Q&A System (ChatGPT-style agent), enabling merchants to ask natural-language questions over financial data and receive accurate, grounded answers
• Delivered an agent-based conversational analytics system using AWS Bedrock and LangChain, supporting 250+ enterprise and mid-market clients.
• Enabled self-service insights for daily sales, inventory planning, and settlement analysis, significantly reducing reliance on dashboards and ad-hoc reporting workflows.
• Designed the system to enforce deterministic SQL and internal API execution, ensuring responses were always grounded in curated data lake tables rather than model inference (see the tool sketch below).
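A hedged sketch of that grounding pattern: the agent can only invoke whitelisted, parameterized queries over curated tables, never free-generated SQL. The LangChain @tool decorator is real; the query text, table name, and run_query helper are hypothetical.

```python
from langchain_core.tools import tool

# Whitelisted, parameterized SQL over a curated data-lake table (hypothetical).
DAILY_SALES_SQL = (
    "SELECT sale_date, SUM(amount) AS total "
    "FROM curated.daily_sales "
    "WHERE merchant_id = %(merchant_id)s "
    "GROUP BY sale_date ORDER BY sale_date"
)

def run_query(sql: str, params: dict) -> list[dict]:
    """Hypothetical executor: run fixed SQL against the query engine (e.g. Athena)."""
    raise NotImplementedError("wire to your query engine")

@tool
def daily_sales(merchant_id: str) -> list[dict]:
    """Return daily sales totals for one merchant from the curated data lake."""
    # The model chooses *which* tool to call; it never decides what SQL runs.
    return run_query(DAILY_SALES_SQL, {"merchant_id": merchant_id})
```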
Project 3: AI & Data Observability, Guardrails, and Adversarial Testing to ensure AI-generated financial insights were safe and auditable
• Built an AI observability layer tracking prompts, tool calls, data sources, and outputs, enabling full traceability for merchant-facing financial answers (see the tracing sketch below).
• Designed and executed adversarial testing frameworks using real merchant queries and edge cases, significantly reducing hallucinated or speculative responses before production rollout.
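A minimal sketch of such an observability layer, assuming each agent step (prompt, tool call, answer) is wrapped and emitted as one structured JSON log event; every field name here is illustrative.

```python
import functools
import json
import logging
import time
import uuid

log = logging.getLogger("ai_observability")

def traced(step: str):
    """Decorator recording inputs, output, and latency for one agent step."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            started = time.time()
            event = {
                "trace_id": str(uuid.uuid4()),
                "step": step,
                "inputs": {
                    "args": [repr(a) for a in args],
                    "kwargs": {k: repr(v) for k, v in kwargs.items()},
                },
            }
            result = fn(*args, **kwargs)
            event["output"] = repr(result)
            event["duration_s"] = round(time.time() - started, 3)
            log.info(json.dumps(event))  # one structured event per step
            return result
        return inner
    return wrap
```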
2024 — 2024
Cupertino, California, United States
• Built curated demo and evaluation datasets derived from Salesforce, Google Calendar, and Highspot schemas, used to train and validate internal models powering the demo experience, improving product iteration speed by ~30%.
• Automated ingestion and transformation workflows, reducing manual data preparation and ad-hoc analysis by ~50% for sales, product, and demo teams.
• Generated highly realistic synthetic sales and calendar data in Python, aligned to real Salesforce, Calendar, and Highspot structures, enabling safe, repeatable demos without exposing sensitive customer information (see the sketch after this list).
• Removed demo dependencies on live or limited production data, ensuring consistent demo reliability and allowing sales reps to confidently run customer demos at any time.
• Built Looker dashboards visualizing sales engagement, platform usage, and rep activity, helping leadership quickly assess performance and demo effectiveness.
• Directly supported customer-facing demos and sales conversations, contributing to two new qualified customer opportunities during the internship.
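A sketch of the synthetic-data approach, assuming the Faker library and a simplified Salesforce-opportunity-like schema; every field name and value range is illustrative, not the real demo schema. Seeding the generator keeps demo runs repeatable.

```python
import random

from faker import Faker

fake = Faker()
Faker.seed(42)      # deterministic output -> repeatable demos
random.seed(42)

STAGES = ["Prospecting", "Qualification", "Proposal", "Closed Won", "Closed Lost"]

def synthetic_opportunity() -> dict:
    """Generate one Salesforce-like opportunity row with no real customer data."""
    return {
        "opportunity_id": fake.uuid4(),
        "account_name": fake.company(),
        "owner": fake.name(),
        "stage": random.choice(STAGES),
        "amount": round(random.uniform(5_000, 250_000), 2),
        "close_date": fake.date_between(start_date="-90d", end_date="+90d").isoformat(),
    }

rows = [synthetic_opportunity() for _ in range(1_000)]
```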
2023 — 2024
San Jose, California, United States
2021 — 2023
Bangalore Urban, Karnataka, India
Client: Ellevio (Electricity Distribution)
Ellevio manages large-scale grid and smart-meter data in Sweden to support grid monitoring, anomaly detection, and operational analytics. Operational and IoT data arrived at high volume from APIs and streaming sources, with quality issues and strict performance requirements for both analytics and near real-time monitoring.
• Built scalable ETL pipelines in Azure Data Factory with incremental loads, retries, and monitoring to ingest operational and meter data reliably.
• Implemented raw -> cleaned -> curated data layers, stabilizing downstream analytics and reducing data quality issues.
• Developed SQL transformations to standardize time and units, normalize device identifiers, deduplicate events, and aggregate data into reporting-friendly grains.
• Modeled and optimized Azure Synapse / SQL Server warehouses using fact-dimension design, partitioning, and indexing, cutting dashboard query times from timeouts to single-digit seconds.
• Implemented Kafka and Spark Streaming pipelines for near real-time grid signals, handling late and duplicate events to support faster operational awareness (see the streaming sketch below).
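A sketch of the late/duplicate-event handling, assuming Spark Structured Streaming reading meter signals from Kafka; the broker, topic, field names, and the 10-minute watermark are illustrative choices, not the production values.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("grid-signals").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "meter-signals")              # hypothetical topic
    .load()
)

signals = raw.select(
    F.get_json_object(F.col("value").cast("string"), "$.event_id").alias("event_id"),  # hypothetical field
    F.col("timestamp").alias("event_time"),
)

deduped = (
    signals
    # Accept events up to 10 minutes late; anything older is dropped as too late.
    .withWatermark("event_time", "10 minutes")
    # Including the watermark column in the key keeps dedup state bounded.
    .dropDuplicates(["event_id", "event_time"])
)

query = deduped.writeStream.format("console").outputMode("append").start()
```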
2019 — 2021
Bangalore Urban, Karnataka, India
Client: Qatar Airways
Qatar Airways operates large-scale airline booking and flight operations systems that generate high-volume, globally distributed operational data. Booking, flight, and partner data arrived from multiple systems with inconsistent timestamps, duplicate updates, schema drift, and partial daily loads, making reporting unreliable.
• Built production ETL pipelines using SQL, Python, and Airflow to ingest booking, ticketing, and flight operations data from multiple upstream systems.
• Standardized timestamps, time zones, and identifiers; cleaned status fields; and deduplicated records using composite keys with “latest update wins” logic (see the dedup sketch after this list).
• Joined bookings, flight schedules, and operational status into single reporting-ready tables, eliminating the need for analysts to query multiple systems.
• Orchestrated workflows with Airflow DAGs, implementing retries, dependencies, and alerting to prevent partial or silent data failures.
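A sketch of the composite-key, “latest update wins” deduplication, written as the SQL a pipeline task would run; the staging table and key columns are illustrative. ROW_NUMBER() ranks updates per booking key so only the newest row survives.

```python
# Hypothetical staging table and key columns; the window-function pattern is the point.
DEDUP_SQL = """
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY booking_id, segment_id   -- composite key
            ORDER BY updated_at DESC              -- latest update wins
        ) AS rn
    FROM staging_bookings
)
SELECT * FROM ranked WHERE rn = 1
"""
```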
Client: British American Tobacco (BAT)
BAT's Global Manufacturing Execution System (MES) supports production and finance workflows across plants and work centers worldwide. Manufacturing and finance data arrived late, incomplete, or corrected after initial load, and finance reporting could not tolerate partial or inconsistent data.
• Built and maintained ETL pipelines for global MES and finance data, integrating plant, work center, shift, and production events into curated finance-ready datasets.
• Implemented incremental and backfill logic to handle late-arriving, corrected production records across time zones.
• Applied business rules in SQL to distinguish valid production, rework, and scrap, and aggregated data at day, shift, and work-center levels for finance reporting.
• Used Airflow validation gates (expected plant coverage, control totals, row-count thresholds) to fail pipelines early when data was incomplete (see the gate sketch below).
• Automated reconciliation summaries with Python, making it easy to identify missing plants, dates, or mismatched totals.
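A sketch of one such validation gate, assuming Airflow 2.x TaskFlow; the expected-plant set, row threshold, and task inputs are illustrative. Failing this task halts downstream finance loads before partial data lands.

```python
from airflow.decorators import task
from airflow.exceptions import AirflowFailException

EXPECTED_PLANTS = {"PLANT_A", "PLANT_B", "PLANT_C"}  # hypothetical coverage set
MIN_ROWS = 100_000                                   # hypothetical row-count threshold

@task
def validate_load(loaded_plants: set, row_count: int) -> None:
    """Fail the DAG run early if plant coverage or volume falls short."""
    missing = EXPECTED_PLANTS - set(loaded_plants)
    if missing:
        raise AirflowFailException(f"missing plants in load: {sorted(missing)}")
    if row_count < MIN_ROWS:
        raise AirflowFailException(f"row count {row_count} below threshold {MIN_ROWS}")
```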
Education
San José State University
Master of Science - MS
CMR Institute Of Technology