Data Science
BC PM2.5 Short-Term Forecasting Report
ETL and statistical learning workflow for air-quality prediction
Overview
A forecasting project that automates public data retrieval, compresses large environmental datasets, and models short-term PM2.5 air pollution outcomes.
Problem
Air-quality forecasting needs repeatable ingestion from many public sources and efficient storage before statistical learning can be trusted.
Dataset
Public environmental data covering roughly 200 million records across 300 sources, processed into analysis-ready files.
Approach
Automated downloads with Selenium, built Pandas and GeoPandas ETL pipelines, reduced storage footprint, and trained statistical models with SciPy and Statsmodels.
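The write-up does not say how the storage footprint was reduced. One common Pandas technique is dtype downcasting, sketched below under that assumption. The function name `shrink_frame`, the `max_unique_ratio` threshold, and the sample columns (`station_id`, `pm25`, `hour`) are hypothetical and not taken from the project.

```python
import numpy as np
import pandas as pd
from pandas.api.types import (
    is_float_dtype,
    is_integer_dtype,
    is_object_dtype,
    is_string_dtype,
)


def shrink_frame(df, max_unique_ratio=0.5):
    """Return a copy of df with smaller dtypes where the values allow it.

    Hypothetical helper: illustrates dtype downcasting, not the project's code.
    """
    out = df.copy()
    n_rows = max(len(out), 1)
    for col in out.columns:
        series = out[col]
        if is_integer_dtype(series):
            # Picks the smallest integer type that holds every value exactly.
            out[col] = pd.to_numeric(series, downcast="integer")
        elif is_float_dtype(series):
            # float32 halves the size but keeps only about 7 significant digits.
            out[col] = series.astype(np.float32)
        elif is_object_dtype(series) or is_string_dtype(series):
            # Repeated strings (station IDs, pollutant names) store well as categories.
            if series.nunique() / n_rows <= max_unique_ratio:
                out[col] = series.astype("category")
    return out


# Synthetic sample data, invented for this example.
raw = pd.DataFrame({
    "station_id": ["S001", "S002"] * 5000,
    "pm25": np.linspace(0.0, 50.0, 10_000),
    "hour": np.arange(10_000) % 24,
})
small = shrink_frame(raw)
before = int(raw.memory_usage(deep=True).sum())
after = int(small.memory_usage(deep=True).sum())
print(f"{before:,} bytes -> {after:,} bytes")
```

Casting to float32 trades precision for size, so it is worth confirming that the rounding is acceptable for the measurements involved before applying it to a whole dataset.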
Results
Produced a short-term PM2.5 model with 73.5% accuracy and reduced data size from about 600 MB to 80 MB through preprocessing.
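To put the storage figures in relative terms, the short sketch below computes the percentage saved and the compression ratio from the two sizes quoted above. The helper name `size_reduction` is hypothetical.

```python
def size_reduction(before_mb, after_mb):
    """Return (percent saved, compression ratio) for a before/after size pair."""
    if before_mb <= 0 or after_mb <= 0:
        raise ValueError("sizes must be positive")
    percent_saved = 100.0 * (before_mb - after_mb) / before_mb
    ratio = before_mb / after_mb
    return percent_saved, ratio


# Sizes quoted in the Results section (both approximate).
percent_saved, ratio = size_reduction(600, 80)
print(f"{percent_saved:.1f}% smaller, {ratio:.1f}x compression")
# prints "86.7% smaller, 7.5x compression"
```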
Lessons Learned
Environmental modeling is an engineering problem before it is a modeling problem; ingestion, geospatial joins, and storage choices shaped the final accuracy.
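As an illustration of what a geospatial join involves, the sketch below matches a location to its nearest station by great-circle distance, using only the standard library. It is a stand-in for the idea, not the project's code: GeoPandas offers spatial joins such as `sjoin_nearest` for this kind of task. The station names and coordinates are approximate city-centre values chosen for the example, not monitoring stations from the dataset.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station(point, stations):
    """Return (station_id, distance_km) for the station closest to point=(lat, lon)."""
    lat, lon = point
    station_id, coords = min(
        stations.items(),
        key=lambda item: haversine_km(lat, lon, *item[1]),
    )
    return station_id, haversine_km(lat, lon, *coords)


# Illustrative coordinates only (approximate city centres in British Columbia).
STATIONS = {
    "vancouver": (49.2827, -123.1207),
    "kelowna": (49.8880, -119.4960),
    "prince_george": (53.9171, -122.7497),
}

station_id, distance_km = nearest_station((49.2488, -122.9805), STATIONS)
print(station_id, round(distance_km, 1))
```

A linear scan like this is fine for a few hundred stations. For large point sets a spatial index is the usual choice.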
Model / Pipeline
The implementation combines Python, Pandas, GeoPandas, Scikit-learn, SciPy, Statsmodels, and Selenium in a repeatable workflow for data preparation, evaluation, and communication.
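The evaluation step of such a workflow can be sketched as follows, assuming a one-step-ahead forecast scored on a chronological holdout. The source does not describe the model form or the metric behind its accuracy figure, so the lag-1 linear fit, the persistence baseline, and the mean absolute error used here are illustrative assumptions. They are written in plain Python rather than with the libraries listed above, and the readings are synthetic.

```python
from math import pi, sin


def make_lagged(series, lag=1):
    """Pair each value with the one `lag` steps earlier: [(x[t-lag], x[t]), ...]."""
    return [(series[i - lag], series[i]) for i in range(lag, len(series))]


def chronological_split(rows, train_fraction=0.8):
    """Split without shuffling so the test set lies strictly after the training set."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_abs_error(pairs, predict):
    """Average absolute difference between predict(x) and the observed y."""
    return sum(abs(predict(x) - y) for x, y in pairs) / len(pairs)


# Synthetic hourly readings with a daily cycle, invented for this example.
readings = [10.0 + 5.0 * sin(2 * pi * hour / 24) for hour in range(101)]

pairs = make_lagged(readings, lag=1)
train, test = chronological_split(pairs)

intercept, slope = fit_line(train)
model_mae = mean_abs_error(test, lambda x: intercept + slope * x)
baseline_mae = mean_abs_error(test, lambda x: x)  # persistence: next value = current value

print(f"lag-1 model MAE: {model_mae:.3f}, persistence MAE: {baseline_mae:.3f}")
```

Splitting by time rather than at random keeps future observations out of the training set, which matters for any forecasting claim. Comparing against a persistence baseline shows whether a fitted model adds anything over repeating the last reading.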