RNTI

Mapping epidemiological risk: the challenge of highly imbalanced data (original title: Cartographie du risque épidémiologique : le défi des données fortement déséquilibrées)
In EGC 2025, vol. RNTI-E-41, pp. 159-170
Abstract
Large-scale data collection has driven the development of knowledge extraction methods, but it has also introduced new challenges. One of the main ones is handling highly imbalanced datasets, in particular imbalanced class labels in classification tasks. This article presents a complete strategy for dealing with imbalanced data in a spatio-temporal epidemiological study of leptospirosis. The approach was evaluated on real data for a binary classification task: predicting whether there is a risk of contamination by the bacteria that cause leptospirosis, with the majority class accounting for 95% of the labels. By combining under-sampling, the training of 200 machine learning models, and weighted predictions, the strategy achieved a balanced accuracy of 76.19%, an AUC-ROC of 83.29%, and a recall of 83.93%.
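
The abstract names three ingredients (under-sampling, 200 models, weighted predictions) without giving details, so the following is only a minimal sketch of how such a pipeline might look, not the authors' implementation. Everything specific here is an assumption: the synthetic data from scikit-learn's `make_classification` (with a 95% majority class), the shallow decision trees used as base learners, the hypothetical helper names `fit_undersampled_ensemble` and `predict_proba_weighted`, and the choice to weight each model by its balanced accuracy on a held-out validation split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_undersampled_ensemble(X_fit, y_fit, X_val, y_val, n_models=200, seed=0):
    """Train n_models classifiers, each on a balanced random under-sample."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y_fit == 1)
    majority_idx = np.flatnonzero(y_fit == 0)
    models, weights = [], []
    for _ in range(n_models):
        # Keep every minority sample; draw an equally sized majority subset.
        sampled_major = rng.choice(majority_idx, size=minority_idx.size, replace=False)
        train_idx = np.concatenate([minority_idx, sampled_major])
        # Assumption: a shallow tree as base learner (the abstract names none).
        model = DecisionTreeClassifier(max_depth=3, random_state=int(rng.integers(1 << 31)))
        model.fit(X_fit[train_idx], y_fit[train_idx])
        models.append(model)
        # Assumption: weight each model by its validation balanced accuracy.
        weights.append(balanced_accuracy_score(y_val, model.predict(X_val)))
    return models, np.asarray(weights)


def predict_proba_weighted(models, weights, X):
    """Weighted average of the positive-class probabilities of all models."""
    probs = np.stack([m.predict_proba(X)[:, 1] for m in models])
    return np.average(probs, axis=0, weights=weights)


# Synthetic stand-in for the real data: roughly 95% of labels are class 0.
X, y = make_classification(n_samples=4000, n_features=10, n_informative=5,
                           weights=[0.95, 0.05], random_state=0)
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)

models, weights = fit_undersampled_ensemble(X_fit, y_fit, X_val, y_val, n_models=200)
scores = predict_proba_weighted(models, weights, X_test)
y_pred = (scores >= 0.5).astype(int)

print("balanced accuracy:", balanced_accuracy_score(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, scores))
print("recall:", recall_score(y_test, y_pred))
```

The three printed metrics mirror those reported in the abstract, but on synthetic data their values say nothing about the leptospirosis study. Because each model sees a different majority subset, the ensemble as a whole still draws on most of the majority class even though every individual training set is balanced.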