Epidemiological Risk Mapping: The Challenge of Highly Imbalanced Data
Abstract
The advent of big data collection has spurred the development of knowledge extraction
methods, but it has also introduced new challenges. One of the main challenges is handling
highly imbalanced datasets, in particular skewed class labels in classification tasks. This
article presents a comprehensive strategy developed to address the issue of imbalanced data in
a spatio-temporal epidemiological study of leptospirosis. The approach was evaluated using
real data for a binary classification task: predicting the presence of a risk of contamination by the
bacteria associated with leptospirosis, where the majority class accounts for 95% of the labels.
By applying under-sampling, training 200 machine learning models, and using weighted predictions,
our strategy achieved a balanced accuracy of 76.19%, an AUC-ROC of 83.29%, and
a recall of 83.93%.
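
The three ingredients named above (under-sampling, many models, weighted predictions) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration, not the method of this study: the data are synthetic with a single feature and a 95/5 class split, the base learner is a toy nearest-centroid rule, each model is weighted by its balanced accuracy on a held-out validation set, and all function names are hypothetical.

```python
import random

def make_data(n_neg, n_pos, rng):
    """Synthetic one-feature data (assumption): class 0 near 0.0, class 1 near 3.0."""
    X = [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    X += [rng.gauss(3.0, 1.0) for _ in range(n_pos)]
    return X, [0] * n_neg + [1] * n_pos

def undersample(X, y, rng):
    """Keep every minority sample and an equal-sized random draw of the majority."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    idx = pos + rng.sample(neg, len(pos))
    return [X[i] for i in idx], [y[i] for i in idx]

def fit_centroids(X, y):
    """Toy base learner (assumption): store the mean feature value of each class."""
    means = {}
    for c in (0, 1):
        vals = [x for x, label in zip(X, y) if label == c]
        means[c] = sum(vals) / len(vals)
    return means

def predict_one(model, x):
    """Predict the class whose centroid is closest to x."""
    return 1 if abs(x - model[1]) < abs(x - model[0]) else 0

def balanced_accuracy(y_true, y_pred):
    """Mean of recall on the positive class and recall on the negative class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

def train_ensemble(X, y, X_val, y_val, n_models, rng):
    """Train one model per balanced subsample; weight it by validation balanced accuracy."""
    ensemble = []
    for _ in range(n_models):
        Xs, ys = undersample(X, y, rng)
        model = fit_centroids(Xs, ys)
        preds = [predict_one(model, x) for x in X_val]
        ensemble.append((model, balanced_accuracy(y_val, preds)))
    return ensemble

def predict_weighted(ensemble, x):
    """Weighted vote: predict class 1 when the weighted share of positive votes is at least 0.5."""
    total = sum(w for _, w in ensemble)
    score = sum(w * predict_one(m, x) for m, w in ensemble) / total
    return 1 if score >= 0.5 else 0

rng = random.Random(42)
X_train, y_train = make_data(950, 50, rng)  # majority class is 95% of labels
X_val, y_val = make_data(190, 10, rng)
ensemble = train_ensemble(X_train, y_train, X_val, y_val, n_models=200, rng=rng)
print(predict_weighted(ensemble, 5.0), predict_weighted(ensemble, -2.0))
```

Because each model sees a different balanced subsample, the ensemble as a whole still uses most of the majority-class data, while no single model is trained on the 95/5 skew. Balanced accuracy is used for the weights here because plain accuracy would reward always predicting the majority class.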