Data Science

BC PM2.5 Short-Term Forecasting Report

ETL and statistical learning workflow for air-quality prediction

Overview

A forecasting project that automates public data retrieval, compresses large environmental datasets, and models short-term PM2.5 air pollution outcomes.

Problem

Air-quality forecasting needs repeatable ingestion from many public sources and efficient storage before statistical learning can be trusted.

Dataset

Public environmental data covering roughly 200 million records across 300 sources, processed into analysis-ready files.

Approach

Automated downloads with Selenium, built Pandas and GeoPandas ETL pipelines, reduced storage footprint, and trained statistical models with SciPy and Statsmodels.
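The write-up does not say how the storage footprint was reduced, so the following is only a minimal sketch of one common Pandas technique: downcasting numeric columns and storing repetitive strings as categories. The helper name `shrink_dtypes`, the `category_threshold` parameter, the column names, and the synthetic data are all hypothetical and are not taken from the project.

```python
import numpy as np
import pandas as pd


def shrink_dtypes(df: pd.DataFrame, category_threshold: float = 0.5) -> pd.DataFrame:
    """Return a copy with numeric columns downcast and repetitive strings as categories.

    Hypothetical helper for illustration; not the project's actual preprocessing.
    """
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            # Pick the smallest integer type that can hold the observed values.
            out[col] = pd.to_numeric(series, downcast="integer")
        elif pd.api.types.is_float_dtype(series):
            # float64 -> float32 halves the size but loses precision.
            out[col] = pd.to_numeric(series, downcast="float")
        elif pd.api.types.is_string_dtype(series):
            # Only convert when values repeat often enough to pay off.
            if series.nunique() / max(len(series), 1) < category_threshold:
                out[col] = series.astype("category")
    return out


# Synthetic stand-in data (assumed column names, illustrative values only).
rng = np.random.default_rng(0)
n = 10_000
raw = pd.DataFrame({
    "station_id": rng.integers(0, 300, size=n),   # int64 by default
    "pm25": rng.gamma(2.0, 5.0, size=n),          # float64
    "region": rng.choice(["coast", "interior", "north"], size=n),
})

small = shrink_dtypes(raw)
before = raw.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"in-memory size: {before} bytes -> {after} bytes")
```

Whether float32 precision is acceptable depends on the measurement; for sensor readings reported to one or two decimals it usually is, but that is a judgment call to make per column.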

Results

Produced a short-term PM2.5 model with 73.5% accuracy and reduced data size from about 600 MB to 80 MB (roughly 87% smaller) through preprocessing.

Lessons Learned

Environmental modeling is an engineering problem before it is a modeling problem; ingestion, geospatial joins, and storage choices shaped the final accuracy.

Model / Pipeline

The implementation combines Python, Pandas, GeoPandas, Scikit-learn, SciPy, Statsmodels, and Selenium in a repeatable workflow for data preparation, evaluation, and communication.