What is Apache Spark
- Apache Spark is an open-source distributed cluster-computing framework.
- Spark is a data processing engine designed to deliver faster, easier-to-use analytics than Hadoop MapReduce.
- Spark originated at the AMPLab at the University of California, Berkeley, and was later donated to the Apache Software Foundation.
What is Apache Hadoop
- Apache Hadoop is an open-source framework written in Java that lets us store and process Big Data in a distributed environment, across clusters of computers, using simple programming constructs.
- To do this, Hadoop uses a programming model called MapReduce, which divides a job into small tasks and assigns them to a set of computers.
- Hadoop also has its own file system, Hadoop Distributed File System (HDFS), which is based on the Google File System (GFS).
- HDFS is designed to run on low-cost hardware.
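The divide-and-assign idea behind MapReduce can be sketched in plain Python. This is only a single-process illustration, not Hadoop's actual API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are assumptions made for this example. In a real cluster, each chunk would be handled by a different machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_chunks):
    """Shuffle: group all emitted values by key across every chunk."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the grouped values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk stands in for a piece of the job that a cluster
    # would hand to a separate computer; here they run sequentially.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    # Hypothetical input, split into three chunks.
    chunks = ["spark and hadoop", "hadoop stores data in hdfs", "spark caches data"]
    print(word_count(chunks))
```

The map and reduce steps only ever see one chunk or one key at a time, which is what makes it possible to spread them over many machines.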
CRITERIA | SPARK | HADOOP MAPREDUCE |
---|---|---|
Memory | Lets you keep data in memory using RDDs. | Does not make full use of the Hadoop cluster's memory. |
Disk usage | Caches data in memory, which keeps latency low. | Is disk-oriented. |
Processing | Supports real-time processing through Spark Streaming. | Supports batch processing only. |
Installation | Is not bound to Hadoop. | Is bound to Hadoop. |
Storage | Leverages existing storage systems. | HDFS |
Speed | Up to 10–100x faster than MapReduce. | Fast. |
Resource management | Standalone | YARN |