Flight Delay Analysis and Prediction
Machine Learning
In this project, we explored a large dataset of flight cancellations and delays in the United States
from 2019 to 2023, provided by the Department of Transportation, along with NOAA weather data.
The original dataset contained 3 million rows; after cleaning and preprocessing, the final dataset
contained 1.3 million rows.
We used machine learning models to identify trends and patterns in flight delays and cancellations and
created an application for consumers to predict the likelihood that their next flight would be delayed
or canceled.
After testing several machine learning models, we selected the XGBoost model, which achieved an accuracy
of 0.68 and a weighted-average F1 score of 0.70; its macro-averaged scores appear in the Macro Average
Results table.
Given the class imbalance in our dataset (77% for one class), we opted for macro averaging over weighted
averaging.
Macro averaging gives every class equal weight regardless of how often it occurs, so strong performance
on the majority class cannot mask weak performance on the minority class.
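
The difference between the two averaging schemes can be illustrated with a small, self-contained sketch. The labels `majority` and `minority`, the 77/23 split, and the always-predict-majority baseline are illustrative assumptions, not the project's data or model; the functions are hand-written for clarity rather than taken from the project's code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} for every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        f1 = 2 * tp / denom if denom else 0.0
        scores[label] = (f1, support[label])
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's number of true samples."""
    scores = per_class_f1(y_true, y_pred)
    total = sum(n for _, n in scores.values())
    return sum(f1 * n for f1, n in scores.values()) / total


# Hypothetical toy data: 77% of samples belong to one class.
y_true = ["majority"] * 77 + ["minority"] * 23
# A trivial baseline that always predicts the majority class.
y_pred = ["majority"] * 100

print(f"macro F1:    {macro_f1(y_true, y_pred):.3f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.3f}")
```

On this toy input the baseline never identifies the minority class, yet its weighted F1 stays well above its macro F1, because the weighted score is dominated by the majority class.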
In addition, during feature engineering, we integrated detailed weather data (precipitation, snowfall,
maximum temperature, minimum temperature, and average wind speed) to place flight operations in the
context of weather conditions and to look for correlations between these variables and delays or
cancellations.
To assign geographic markers to each airport in our flight dataset, we used an airport coordinate dataset
that includes IATA and ICAO codes, airport names, latitude, and longitude.
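
A minimal sketch of such a lookup is shown below, assuming flights are keyed by the IATA code of their origin airport. The field names (`origin`, `origin_lat`, `origin_lon`), the `Airport` record, and the sample rows are hypothetical, and the ATL coordinates are approximate; the project's actual column names and join logic are not given in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Airport:
    iata: str
    icao: str
    name: str
    latitude: float
    longitude: float


def build_index(airports):
    """Index airports by IATA code for constant-time lookup."""
    return {airport.iata: airport for airport in airports}


def add_coordinates(flights, index, code_field="origin"):
    """Return copies of the flight rows with latitude/longitude attached.

    Rows whose airport code is missing from the index get None, so the
    unmatched airports can be inspected instead of silently dropped.
    """
    enriched = []
    for flight in flights:
        airport = index.get(flight[code_field])
        row = dict(flight)
        row[f"{code_field}_lat"] = airport.latitude if airport else None
        row[f"{code_field}_lon"] = airport.longitude if airport else None
        enriched.append(row)
    return enriched


# Illustrative data only; coordinates are approximate.
airports = [
    Airport("ATL", "KATL",
            "Hartsfield-Jackson Atlanta International Airport",
            33.6407, -84.4277),
]
flights = [
    {"flight_date": "2023-01-15", "origin": "ATL"},
    {"flight_date": "2023-01-15", "origin": "XXX"},  # code not in the index
]

enriched = add_coordinates(flights, build_index(airports))
for row in enriched:
    print(row)
```

Keeping unmatched rows with `None` coordinates, rather than discarding them, makes it easy to count how many airports in the flight data lack an entry in the coordinate dataset.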
This project aims to address flight delays and cancellations by providing a data-driven tool that helps
identify and manage them.
Macro Average Results

Model | Precision | Recall | F1-score |
---|---|---|---|
Logistic Class. (without Weather) | 0.57 | 0.60 | 0.55 |
Logistic Class. | 0.66 | 0.52 | 0.48 |
Decision Tree | 0.58 | 0.58 | 0.58 |
Decision Tree with RandomizedSearchCV | 0.68 | 0.55 | 0.54 |
Random Forest | 0.71 | 0.55 | 0.55 |
Gradient Boosting (XGBoost) | 0.62 | 0.67 | 0.62 |
Neural Network | 0.68 | 0.57 | 0.58 |
K-Nearest Neighbors | 0.61 | 0.58 | 0.59 |
Random Forest (ATL Airport) | 0.71 | 0.56 | 0.56 |
Neural Network (ATL Airport) | 0.68 | 0.54 | 0.52 |