Within a ring-fenced division of 16+ people, my role is to design and implement solutions that facilitate the analysis of large volumes of structured and unstructured data.
The bulk of my activities are carried out using Python 3.
I have mainly worked on:
Back-End of an NLP web-application
• ------------------------------------
In charge of designing and implementing the class structure, which is split into three modules:
• > Data-ingestion
Extracts raw text in bulk from PDF, DOCX, and PPTX documents; the text is then sliced and stored in a custom data structure.
• > Preprocessing
Toolkit of functions performing various transformations on text data (stopword removal, lemmatization with memoization, bigram generation, ...).
• > Text Analysis
Exact Search: implements SQL LIKE/wildcard matching behavior.
Semantic Search: based on Gensim's Word2Vec implementation; words are matched via cosine similarity.
Named Entity Recognition: spaCy implementation to retrieve mentions of people, organizations, and geopolitical entities.
Insights via Top Words: generates the top N words per document using the Term Frequency-Inverse Document Frequency (TF-IDF) score.
FuzzyMatching utility
• ---------------------
Before: used the SQL LIKE function, which shows its limits when matching addresses and customer names across various data sources.
Now: implemented two different versions of fuzzy matching:
1: fuzzywuzzy package (based on string manipulation), which can be slow when the number of comparisons exceeds 10e6.
2: scikit-learn NearestNeighbors (K = 1) + TfidfVectorizer (with character trigrams); this solution significantly reduces processing time and is based on cosine similarity (linear_kernel).
Processing time for solution 2: 10e6 comparisons in 15 minutes.
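The second approach can be sketched as below: TF-IDF over character trigrams turns each string into a sparse vector, and a 1-nearest-neighbour lookup with cosine distance finds the closest candidate. The function name, parameters, and sample data are illustrative assumptions, not the original code.

```python
# Illustrative sketch: fuzzy matching via char-trigram TF-IDF + 1-NN.
# Assumes scikit-learn; `fuzzy_match` is a hypothetical helper name.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def fuzzy_match(queries, choices):
    """Return (query, best match, cosine similarity) for each query string."""
    # Character trigrams make the vectors robust to typos and abbreviations.
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
    choice_vecs = vectorizer.fit_transform(choices)
    # metric="cosine" forces a brute-force search, which accepts sparse input.
    nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(choice_vecs)
    distances, indices = nn.kneighbors(vectorizer.transform(queries))
    return [(q, choices[i[0]], 1 - d[0])
            for q, d, i in zip(queries, distances, indices)]

choices = ["10 Downing Street, London", "1600 Pennsylvania Avenue, Washington"]
matches = fuzzy_match(["10 downing st london"], choices)
```

Because TF-IDF vectors are L2-normalised, cosine similarity reduces to a dot product, which is why `linear_kernel` can be used interchangeably for the pairwise-similarity variant.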