Sunnyvale, California, United States
2019 — 2021
San Francisco Bay Area
Developed a data source scanner to scan, classify, and normalize content containing Personally Identifiable Information (PII) from SQL-like databases and network filesystems. Optimized the scanning workflow with an internal pipeline built on Go channels, which increased scan processing speed fourfold.
Designed and implemented a master-slave model for the scanning system to communicate with the backend HTTP server for request delegation and load balancing. Deployed scanner clusters as microservices using Docker containers and gRPC, and scaled the system with distributed storage and message-queue systems such as Consul, Cassandra, and Kafka. Each scanner node can scan 100k+ tables per data source and up to 3 million rows per SQL table.
Developed and maintained backend web servers using the Gin-Gonic and Express/Node.js HTTP web frameworks. Designed and implemented RESTful backend APIs and internal database models using GORM/SQLAlchemy and PostgreSQL. Made queries 10x faster by denormalizing database schemas.
Collaborated with the data team to build a secure data-sharing platform on Spark. Instrumented and intercepted Spark processes with a Java agent and Byte Buddy to monitor suspicious file operations on both local disk and HDFS, which prevented leakage of sensitive data such as PII.
Designed and optimized product features based on real-time customer feedback during product POCs, while working closely with the product team to respond to feasibility verification requests. Rapidly worked out and delivered solutions for on-site support engineers.
Raleigh-Durham, North Carolina Area
Currently working on a project to classify multiple types of network failure. The datasets and scoring matrix will be used for a Cisco-owned data analysis competition on Kaggle.
Developed a time-series predictive model in Python using open-source data from incomplete data sets, ranking first in the class for prediction accuracy and responsiveness
Cleansed 500M of raw network stream transport data and clustered it according to node failure characteristics
Greater Detroit Area
Coordinated work globally with the GM lighting design department and manufacturers across the U.S., China, and Germany
Presented research results at the GM Heritage Center in Warren, MI, where they were well received by the exterior lights management team
Improved a hydrophobic surface with a microstructure created by injection molding, and applied it to the headlight inner surface to prevent condensation
2016 — 2016
Shanghai City, China
Analyzed inventory storage information contained in raw Excel spreadsheets provided by the client. Cleansed the data and transferred it to SQL Server with Python for further analysis.
Simplified the data visualization process for analyzing factory storage information from the existing data source using Tableau, and established a real-time inventory analysis platform by extracting data from a remote SQL database
Addressed oversights in the inventory management system of the client in Vietnam, and drafted requirements for a future, periodically updated digital inventory stock database
Education
Duke University
Master's degree
UM-SJTU Joint Institute, Shanghai Jiao Tong University