Machine Learning
High-Dimensional Data Imputation via Group Lasso
Dimension reduction and imputation for noisy high-dimensional data
Overview
A machine learning project comparing Group Lasso and KNN-style imputation in missing-not-at-random (MNAR) settings with high-dimensional feature spaces.
Problem
High-dimensional missing data can amplify noise and degrade downstream model utility when imputation ignores feature structure.
Dataset
Communities & Crime data and synthetic missing-not-at-random datasets designed to stress-test imputation robustness.
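One way to build such a stress-test set is to make the missingness probability depend on the (unobserved) value itself, which is the defining property of MNAR. A minimal sketch with NumPy; the sizes, masking rate, and sigmoid mechanism are illustrative assumptions, not the project's exact generator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))

# MNAR mechanism (assumed for illustration): larger values are more
# likely to be masked, so missingness depends on the value itself.
mask_prob = 1.0 / (1.0 + np.exp(-X))            # sigmoid of the value
mask = rng.random(size=(n, d)) < 0.4 * mask_prob
X_missing = np.where(mask, np.nan, X)

print(np.isnan(X_missing).mean())               # overall missing rate, ~0.2
```

Because the mask correlates with the data, a naive mean imputer is systematically biased here, which is exactly what makes MNAR a useful robustness test.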
Approach
Used Group Lasso for feature selection, compared imputation quality via RMSE, and evaluated downstream utility with scikit-learn decision trees.
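Group Lasso is not built into scikit-learn, so the selection step can be sketched as proximal gradient descent (ISTA) on least squares with a group penalty: block soft-thresholding zeroes out whole feature groups at once. This is a minimal NumPy sketch, not the project's implementation; all names and hyperparameters are illustrative:

```python
import numpy as np

def group_lasso(X, y, groups, lam=0.5, n_iter=500):
    """Proximal gradient (ISTA) for (1/2n)||Xw - y||^2 + lam * sum_g ||w_g||_2."""
    n, d = X.shape
    lr = n / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the loss
    groups = np.asarray(groups)
    w = np.zeros(d)
    for _ in range(n_iter):
        w = w - lr * (X.T @ (X @ w - y) / n)    # gradient step on the smooth part
        for g in np.unique(groups):             # block soft-thresholding per group
            idx = np.flatnonzero(groups == g)
            norm = np.linalg.norm(w[idx])
            w[idx] = 0.0 if norm <= lr * lam else w[idx] * (1 - lr * lam / norm)
    return w

# Toy demo: the second group of features is truly irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
w_true = np.array([1.0, 2.0, -1.0, 0.0, 0.0, 0.0])
y = X @ w_true + rng.normal(scale=0.1, size=100)
w_hat = group_lasso(X, y, groups=[0, 0, 0, 1, 1, 1])
print(w_hat)  # the irrelevant group is driven exactly to zero
```

Zeroing entire groups (rather than individual coefficients, as plain Lasso does) is what makes the method attractive when features have natural block structure, e.g. one-hot encodings or related measurements.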
Results
Reduced imputation RMSE roughly 9x on the Communities & Crime data and shrank the synthetic feature space by 85%.
Lessons Learned
For missing data problems, feature selection and downstream utility can be more informative than a single imputation score.
Model / Pipeline
The implementation combines Python, pandas, scikit-learn, NumPy, and Matplotlib in a repeatable workflow for data preparation, evaluation, and communication.
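The two evaluation axes from the Approach section, imputation RMSE on the masked entries and downstream decision-tree accuracy, fit into a short scikit-learn loop. A hedged sketch on synthetic data (the dataset, masking rate, and model settings are illustrative assumptions):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n, d = 300, 8
X_full = rng.normal(size=(n, d))
y = 2 * X_full[:, 0] + X_full[:, 1] + rng.normal(scale=0.1, size=n)

# Mask ~20% of entries at random (a stand-in for the MNAR generator).
mask = rng.random(size=(n, d)) < 0.2
X_miss = np.where(mask, np.nan, X_full)

for name, imputer in [("mean", SimpleImputer()), ("knn", KNNImputer(n_neighbors=5))]:
    X_imp = imputer.fit_transform(X_miss)
    # Axis 1 -- imputation quality: RMSE on the masked entries only.
    rmse = np.sqrt(np.mean((X_imp[mask] - X_full[mask]) ** 2))
    # Axis 2 -- downstream utility: decision tree trained on imputed features.
    Xtr, Xte, ytr, yte = train_test_split(X_imp, y, random_state=0)
    r2 = DecisionTreeRegressor(max_depth=4, random_state=0).fit(Xtr, ytr).score(Xte, yte)
    print(f"{name}: imputation RMSE={rmse:.3f}, downstream R^2={r2:.3f}")
```

Scoring both axes side by side reflects the lesson above: a method can win on raw RMSE yet add little downstream utility, or vice versa.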