•Architected a high-throughput RAG pipeline to ingest and process SEC filings (10-K), implementing a custom parser that normalizes heterogeneous HTML/PDF content into structured formats.
•Eliminated $10K/month in AWS Textract spend by engineering a custom OCR solution to replace it, maintaining accuracy while improving processing speed and reliability.
•Designed a custom chunking library for financial documents that optimizes token usage and preserves tabular data integrity better than off-the-shelf text splitters such as LangChain's.
•Implemented a hybrid search algorithm combining semantic similarity (SentenceTransformers) with fuzzy string matching (RapidFuzz) to align section headers across disparate documents, improving retrieval accuracy.
•Built a document classification model achieving 98.5% accuracy by optimizing embedding strategies and scoring functions for financial data categorization.
•Established a comprehensive CI test suite with pytest, covering the full data ingestion lifecycle and accelerating the release cycle by 20%.