I did a second undergraduate research assistantship during my 3B term under Prof. Mei Nagappan, doing work on defect prediction and analysis in software codebases. Under Prof. Nagappan's tutelage, I explored using Bayesian networks (BNs) to analyze and predict where bugs are likely to appear in a codebase based on various other metrics collected on the codebase (e.g. code churn, lines of code, number of changes made, number of people involved with a specific file or class).
Working with the Eclipse IDE codebase changelog dataset, I used Python, pandas, Graphviz, and GOBNILP (Globally Optimal Bayesian Network learning using Integer Linear Programming) to transform and format the dataset, run experiments that generated candidate BNs, and then aggregate, visualize, and analyze the results.
I first used pandas to make the dataset compatible with GOBNILP: aggregating multiple files into a single dataset based on certain metrics, removing unneeded columns, and discretizing columns into buckets based on how their values were distributed in the dataset.
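As a rough illustration of that preprocessing step (not the code from the repository), a minimal pandas sketch might look like the following. The function name `prepare_for_bn`, the column names, and the choice of quantile-based buckets are all assumptions made for the example; the actual bucketing rules depended on each column's distribution.

```python
import pandas as pd


def prepare_for_bn(frames, key="file", drop_cols=("comment",), n_buckets=3):
    """Hypothetical helper: merge, trim, and discretize metric tables.

    `key`, `drop_cols`, and `n_buckets` are illustrative assumptions,
    not names or values taken from the original project.
    """
    # Stack the per-source tables into one dataset.
    combined = pd.concat(frames, ignore_index=True)

    # Remove columns that are not useful for structure learning.
    unwanted = [c for c in drop_cols if c in combined.columns]
    combined = combined.drop(columns=unwanted)

    # Aggregate numeric metrics per file (summing is an assumption).
    aggregated = combined.groupby(key, as_index=False).sum(numeric_only=True)

    # Discretize every metric into quantile-based buckets (0, 1, 2, ...).
    # duplicates="drop" merges buckets when a skewed column repeats edges.
    for col in aggregated.columns:
        if col != key:
            aggregated[col] = pd.qcut(
                aggregated[col], q=n_buckets, labels=False, duplicates="drop"
            )

    # The identifier is not a variable in the network, so drop it last.
    return aggregated.drop(columns=[key])
```

Quantile buckets are used here only because they adapt to skewed metrics such as code churn; fixed-width or hand-picked thresholds would be equally valid choices.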
I then used GOBNILP to "learn" (generate) the optimal BNs from these datasets, running various experiments that differed in how the data was aggregated. Then, using Python and Graphviz, I aggregated these results, visualized them, and compared them across experiments and against results from other studies with Prof. Nagappan to draw conclusions.
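The aggregation and visualization step can be sketched roughly as follows. This is an illustrative example under stated assumptions, not the project's actual code: it assumes each experiment's learned network has already been parsed into a list of (parent, child) edges, and the names `edge_counts` and `to_dot` are hypothetical. It counts how often each edge appears across experiments and emits Graphviz DOT text that can be rendered with the `dot` command-line tool.

```python
from collections import Counter


def edge_counts(experiments):
    """Count in how many experiments each (parent, child) edge appears."""
    counts = Counter()
    for edges in experiments:
        # set() guards against an edge being listed twice in one network.
        counts.update(set(edges))
    return counts


def to_dot(counts, name="bn_summary"):
    """Build DOT source for a directed graph, labelling edges by count."""
    lines = [f"digraph {name} {{"]
    for (parent, child), n in sorted(counts.items()):
        lines.append(f'  "{parent}" -> "{child}" [label="{n}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical edges from two experiments, for illustration only.
    runs = [
        [("churn", "bugs"), ("loc", "bugs")],
        [("churn", "bugs")],
    ]
    print(to_dot(edge_counts(runs)))
```

Edges that recur across many experiments are the ones most worth comparing against findings from other studies, which is why the count is surfaced as the edge label.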
GitHub: https://github.com/Broshen/defect_prediction_analysis