heal.abstract
This dissertation was conducted as part of the MSc in Data Science at the International Hellenic University.
Progress in the global fight against cardiovascular diseases (CVD) has plateaued. One major cause is that predicting heart disease is an intricate task that demands a great deal of knowledge and experience, which makes it extremely difficult even for health practitioners. As a result, there is a growing demand to integrate machine learning (ML) and data mining into the healthcare system, since the insights gained by harnessing the wealth of available data can be very beneficial to society.
This research addresses a significant gap in the existing literature by thoroughly examining both machine learning models and neural networks for CVD risk prediction based on personal lifestyle factors in a highly imbalanced real-life dataset. We trained multiple classifiers, namely Logistic Regression (LR), Decision Trees (DT), Random Forest (RF), Gradient Boosting (GB), XGBoost (XGB), CatBoost, and Artificial Neural Networks (ANN). We used the Behavioral Risk Factor Surveillance System (BRFSS) 2021 Heart Disease Health Indicators dataset. To tackle the class imbalance, we applied the Synthetic Minority Over-sampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), SMOTE-Tomek, and SMOTE-ENN.
Based on the findings, we conclude that the hybrid resampling methods, SMOTE-ENN and SMOTE-Tomek, outperformed the alternative sampling techniques in terms of sensitivity (recall). Our proposed implementation combines SMOTE-ENN with a CatBoost classifier tuned through Optuna, achieving 88% recall and 82% AUC. The proposed ANN also exhibited promising results, offering an additional layer of robustness in detecting positive cases of cardiovascular disease.
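As an illustration of the oversampling idea and the sensitivity metric mentioned above, the following is a minimal sketch in plain Python, not the dissertation's implementation. It shows only the interpolation step at the core of SMOTE (the ENN and Tomek cleaning steps, CatBoost, and Optuna are not reproduced), and the function names `smote` and `recall`, the parameters, and the toy data are all assumptions made for this example.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and one of their k nearest minority-class neighbours (core SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def recall(y_true, y_pred):
    """Sensitivity: share of actual positives predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy minority-class points, invented for illustration (not BRFSS data).
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote(minority, n_synthetic=6, k=2, seed=42)
```

Because each synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already occupied by the minority class. Recall is the metric of interest here because a missed positive case (a false negative) is the costly error when screening for CVD.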
en