Data Lake Architecture Optimization Based on Supply Chains
Abstract
Data lakes have recently emerged as a new generation of data repository. Data lake architecture design, which
has a significant impact on data lake performance and data quality, is an active research topic. In this paper, we
study a joint “location-allocation” problem, which is used in supply chain network design, as a means of improving
data lake architecture and performance. We propose a mathematical model applied to a MapReduce environment, based on an analogy between
data lakes and supply chains. We solve this model with a greedy algorithm and determine the optimal number of
MapReduce jobs that should be run in such a data lake to optimize performance.