Original Research

APACHE SPARK AND HADOOP: A DETAILED COMPARISON OF THE TWO PROCESSING PARADIGMS

ENAS TAWFIQ NAFFAR 1, LAMA SADI AWAD 2, and Dr. FAWAZ AHMAD ALZAGHOUL 3

Vol 19, No 03 (2024)   |   DOI: 10.5281/zenodo.10847726   |   Author Affiliation: Faculty of Information Technology, Philadelphia University, Jordan 1; King Abdullah II School of Information Technology, The University of Jordan, Amman 2,3   |   Licensing: CC 4.0   |   Pg no: 102-114   |   Published on: 05-03-2024

Abstract

The exponential growth in data volume highlights the crucial need for efficient processing, storage, and analysis of massive datasets. This paper examines the fundamental details of Apache Hadoop and Apache Spark, two popular big data processing frameworks. Their core designs, data storage structures, processing techniques, fault tolerance mechanisms, and overall architectures are all thoroughly examined. Apache Hadoop, a pioneering framework, utilizes the Hadoop Distributed File System (HDFS) along with the MapReduce programming model to achieve distributed data storage and processing. Apache Spark, a more recent alternative, uses in-memory processing and Resilient Distributed Datasets (RDDs) to improve performance. The comparative examination reveals the subtleties of each framework's data storage, processing model, and fault tolerance techniques. Spark handles both batch and real-time processing, in contrast with Hadoop's conventional batch-oriented processing using MapReduce. The study investigates how these structural differences affect overall efficiency, scalability, and ease of use. The results add to our understanding of Hadoop's and Spark's fundamental underpinnings; by providing insight into the frameworks' internal structures and processing techniques, the study helps organizations choose the framework that best suits their specific data processing needs.
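
To make the contrast in the abstract concrete, the following minimal Java sketch (an illustration, not code from the paper; class names such as WordCountComparison, TokenMapper, and SumReducer, and the input path, are hypothetical) expresses the same word count twice: first as a Hadoop MapReduce Mapper/Reducer pair, where intermediate key/value pairs are written to disk and shuffled between phases, and then as a Spark RDD pipeline, where intermediate data stays in memory.

```java
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCountComparison {

    // --- Hadoop MapReduce: map and reduce run as separate batch phases;
    //     intermediate (word, 1) pairs are materialized to disk and shuffled. ---
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws java.io.IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                ctx.write(word, ONE);                 // emit (word, 1)
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));     // emit (word, total)
        }
    }

    // --- Spark RDDs: the same map/reduce shape as one in-memory pipeline.
    //     The lineage (textFile -> flatMap -> mapToPair -> reduceByKey) is what
    //     Spark replays to recompute lost partitions for fault tolerance. ---
    public static void sparkWordCount(String inputPath) {
        SparkConf conf = new SparkConf().setAppName("wordcount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<String, Integer> counts = sc
                .textFile(inputPath)
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);
            counts.saveAsTextFile(inputPath + ".counts"); // hypothetical output path
        }
    }
}
```

The two halves compute the same result; the difference the abstract highlights is where intermediate data lives (disk for MapReduce, memory for Spark RDDs) and how recovery works (task re-execution over persisted intermediates versus lineage-based recomputation).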


Keywords

Big Data, Hadoop, Spark, MapReduce, RDDs.