Machine Learning
High-Dimensional Data Imputation via Group Lasso
Dimension reduction and imputation for noisy high-dimensional data
Overview
A machine learning project comparing Group Lasso and KNN-style imputation in missing-not-at-random (MNAR) settings with high-dimensional feature spaces.
Problem
High-dimensional missing data can amplify noise and degrade downstream model utility when imputation ignores feature structure.
Dataset
Communities & Crime data and synthetic missing-not-at-random datasets designed to stress-test imputation robustness.
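One way to build such a stress-test set is to make the missingness probability depend on the (unobserved) value itself, which is the defining property of MNAR. A minimal sketch with NumPy; the sizes, masking rate, and sigmoid mechanism are illustrative assumptions, not the project's exact generator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))

# MNAR mechanism (assumed for illustration): larger values are more
# likely to be masked, so missingness depends on the value itself.
mask_prob = 1.0 / (1.0 + np.exp(-X))            # sigmoid of the value
mask = rng.random(size=(n, d)) < 0.4 * mask_prob
X_missing = np.where(mask, np.nan, X)

print(np.isnan(X_missing).mean())               # overall missing rate, ~0.2
```

Because the mask correlates with the data, a naive mean imputer is systematically biased here, which is exactly what makes MNAR a useful robustness test.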
Approach
Used Group Lasso for feature selection, compared imputation quality via RMSE, and evaluated downstream utility with scikit-learn decision trees.
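Group Lasso is not built into scikit-learn, so the selection step can be sketched as proximal gradient descent (ISTA) on least squares with a group penalty: block soft-thresholding zeroes out whole feature groups at once. This is a minimal NumPy sketch, not the project's implementation; all names and hyperparameters are illustrative:

```python
import numpy as np

def group_lasso(X, y, groups, lam=0.5, n_iter=500):
    """Proximal gradient (ISTA) for (1/2n)||Xw - y||^2 + lam * sum_g ||w_g||_2."""
    n, d = X.shape
    lr = n / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the loss
    groups = np.asarray(groups)
    w = np.zeros(d)
    for _ in range(n_iter):
        w = w - lr * (X.T @ (X @ w - y) / n)    # gradient step on the smooth part
        for g in np.unique(groups):             # block soft-thresholding per group
            idx = np.flatnonzero(groups == g)
            norm = np.linalg.norm(w[idx])
            w[idx] = 0.0 if norm <= lr * lam else w[idx] * (1 - lr * lam / norm)
    return w

# Toy demo: the second group of features is truly irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
w_true = np.array([1.0, 2.0, -1.0, 0.0, 0.0, 0.0])
y = X @ w_true + rng.normal(scale=0.1, size=100)
w_hat = group_lasso(X, y, groups=[0, 0, 0, 1, 1, 1])
print(w_hat)  # the irrelevant group is driven exactly to zero
```

Zeroing entire groups (rather than individual coefficients, as plain Lasso does) is what makes the method attractive when features have natural block structure, e.g. one-hot encodings or related measurements.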
Results
Reduced imputation RMSE roughly 9x on the Communities & Crime data and shrank the synthetic feature space by 85%.
Lessons Learned
For missing data problems, feature selection and downstream utility can be more informative than a single imputation score.
Model / Pipeline
The implementation combines Python, pandas, scikit-learn, NumPy, and Matplotlib in a repeatable workflow for data preparation, evaluation, and communication.
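The two evaluation axes from the Approach section, imputation RMSE on the masked entries and downstream decision-tree accuracy, fit into a short scikit-learn loop. A hedged sketch on synthetic data (the dataset, masking rate, and model settings are illustrative assumptions):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n, d = 300, 8
X_full = rng.normal(size=(n, d))
y = 2 * X_full[:, 0] + X_full[:, 1] + rng.normal(scale=0.1, size=n)

# Mask ~20% of entries at random (a stand-in for the MNAR generator).
mask = rng.random(size=(n, d)) < 0.2
X_miss = np.where(mask, np.nan, X_full)

for name, imputer in [("mean", SimpleImputer()), ("knn", KNNImputer(n_neighbors=5))]:
    X_imp = imputer.fit_transform(X_miss)
    # Axis 1 -- imputation quality: RMSE on the masked entries only.
    rmse = np.sqrt(np.mean((X_imp[mask] - X_full[mask]) ** 2))
    # Axis 2 -- downstream utility: decision tree trained on imputed features.
    Xtr, Xte, ytr, yte = train_test_split(X_imp, y, random_state=0)
    r2 = DecisionTreeRegressor(max_depth=4, random_state=0).fit(Xtr, ytr).score(Xte, yte)
    print(f"{name}: imputation RMSE={rmse:.3f}, downstream R^2={r2:.3f}")
```

Scoring both axes side by side reflects the lesson above: a method can win on raw RMSE yet add little downstream utility, or vice versa.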