Physical design of a distributed data warehouse based on balanced K-means
Abstract
Horizontal partitioning has been widely used to optimize query processing in distributed
systems such as Hadoop and Spark. In distributed data warehouses, the most expensive operation
for OLAP queries is the star join, which requires many MapReduce cycles to perform. In this
paper, we propose a new data placement scheme in Hadoop based on a balanced K-means algorithm.
This scheme allows the star join operation to be performed in a single Spark stage. Our technique
takes into account the physical characteristics of the cluster and the volume of data. To evaluate
our approach, we conducted experiments on a cluster of 5 nodes, where our approach
improved the execution time of some OLAP queries by 60% over some existing approaches.
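
The abstract does not detail the balanced K-means algorithm, so the following is only a generic sketch of size-constrained clustering and not the authors' method. It assumes that each item to be placed (for example a row or a fragment) is described by a hypothetical numeric feature vector, and that "balanced" means no cluster holds more than ceil(n/k) items. The function name `balanced_kmeans` and the greedy assignment strategy are illustrative assumptions.

```python
import math


def balanced_kmeans(points, k, iters=10):
    """Greedy capacity-constrained K-means (illustrative sketch).

    Each cluster receives at most ceil(n / k) points, so cluster sizes
    differ by as little as the capacity bound allows.
    """
    n = len(points)
    cap = math.ceil(n / k)
    # Deterministic initialisation: k evenly spaced input points.
    centroids = [tuple(points[(j * n) // k]) for j in range(k)]
    labels = None
    for _ in range(iters):
        # Every (distance, point index, cluster index) pair, closest first.
        pairs = sorted(
            (math.dist(p, c), i, j)
            for i, p in enumerate(points)
            for j, c in enumerate(centroids)
        )
        sizes = [0] * k
        new_labels = [-1] * n
        for _, i, j in pairs:
            # Assign a point to its nearest cluster that still has room.
            if new_labels[i] == -1 and sizes[j] < cap:
                new_labels[i] = j
                sizes[j] += 1
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [points[i] for i in range(n) if new_labels[i] == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
        if new_labels == labels:
            break  # assignment is stable
        labels = new_labels
    return labels, centroids
```

With balanced clusters mapped one-to-one onto nodes, each node would hold a similar data volume. Whether the proposed scheme assigns data this way, and how it incorporates the cluster's physical characteristics, is not stated in the abstract.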