RNTI

Une Plateforme ETL parallèle et distribuée pour l'intégration de données massives (A Parallel and Distributed ETL Platform for Massive Data Integration)
In EGC 2015, vol. RNTI-E-28, pp.455-460
Abstract
This paper examines the impact of Big Data on decision-support environments, particularly on the data integration phase. In this context, we developed, on top of the Apache Hadoop framework, a platform called P-ETL (Parallel-ETL) intended for large-scale data warehousing (DW) according to the MapReduce paradigm. P-ETL lets users define ETL processes (workflow templates) and provides advanced settings for configuring parallel/distributed processing in Apache Hadoop. This paper demonstrates the P-ETL platform. On data sets ranging from 244 × 10^6 to 7.317 × 10^9 tuples, the experiments show that increasing the cluster size and the number of parallel tasks speeds up the P-ETL process.
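To make the MapReduce-style ETL idea concrete, the following is a minimal sketch only, not P-ETL's actual code or API. It assumes a toy source of "key,value" text lines; the function names (partition, map_transform, reduce_load, run_etl) and the cleaning rules are hypothetical, and a local thread pool stands in for Hadoop map tasks purely for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Extract step: split the source rows into n_parts round-robin partitions."""
    return [rows[i::n_parts] for i in range(n_parts)]


def map_transform(part):
    """Map task (hypothetical rules): clean each tuple and emit (key, value) pairs."""
    pairs = []
    for line in part:
        fields = line.split(",")
        if len(fields) != 2:
            continue  # drop malformed tuples
        key, value = fields[0].strip().upper(), fields[1].strip()
        if value.isdigit():
            pairs.append((key, int(value)))
    return pairs


def reduce_load(mapped_parts):
    """Reduce task: merge the emitted pairs by key (here, a sum) before loading."""
    totals = defaultdict(int)
    for pairs in mapped_parts:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)


def run_etl(rows, n_tasks):
    """Run one map task per partition, then reduce their outputs.

    Threads are an assumption made for brevity; on a real cluster each
    partition would be processed by a separate Hadoop task.
    """
    parts = partition(rows, n_tasks)
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        mapped = list(pool.map(map_transform, parts))
    return reduce_load(mapped)


SAMPLE_ROWS = ["fr, 10", "dz,5", "FR,7", "bad line", "dz, x"]
RESULT = run_etl(SAMPLE_ROWS, n_tasks=2)
```

In this sketch the number of partitions plays the role of the "parallel tasks" setting mentioned in the abstract: the result is the same whatever the value of n_tasks, and only the degree of parallelism changes.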