Why do we need Data Locality in Hadoop?
- Datasets in HDFS are stored as blocks on the DataNodes of the Hadoop cluster.
- During the execution of a MapReduce job, each Mapper processes one Input Split, which typically corresponds to one block.
- If the data does not reside on the node where the Mapper is running, it must be copied over the network from the DataNode that holds it to the node running the Mapper.
- If a MapReduce job has more than 100 Mappers and each of them tries to copy data from other DataNodes in the cluster at the same time, the result is serious network congestion, which degrades the performance of the whole system.
- Hence, keeping the computation close to the data is an effective and inexpensive solution, known in Hadoop as data locality. It helps increase the overall throughput of the system.
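
To make the congestion argument concrete, here is a rough back-of-the-envelope sketch. The function name `network_traffic_mb` and the numbers are assumptions chosen for illustration (100 Mappers, a 128 MB block per Mapper); they are not measurements from a real cluster.

```python
# Illustrative numbers only; not taken from a real cluster.
BLOCK_SIZE_MB = 128   # assumed block size per Mapper (a common HDFS default)
NUM_MAPPERS = 100     # assumed: one Mapper per block


def network_traffic_mb(num_mappers, local_mappers, block_size_mb=BLOCK_SIZE_MB):
    """Return the data (in MB) that has to cross the network when only
    `local_mappers` of the Mappers run on a node that already holds their block."""
    if not 0 <= local_mappers <= num_mappers:
        raise ValueError("local_mappers must be between 0 and num_mappers")
    remote_mappers = num_mappers - local_mappers
    return remote_mappers * block_size_mb


# No Mapper is data local: every block is copied over the network.
print(network_traffic_mb(NUM_MAPPERS, local_mappers=0))    # 12800
# 90 of 100 Mappers are data local: only 10 blocks are copied.
print(network_traffic_mb(NUM_MAPPERS, local_mappers=90))   # 1280
```

Under these assumptions, scheduling 90 of the 100 Mappers next to their data cuts the network transfer from 12800 MB to 1280 MB.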
Types of data locality
- Data local
  - In this type, the data and the mapper reside on the same node. This is the closest proximity to the data and the most preferred scenario.
- Rack local
  - In this scenario, the mapper and the data reside on the same rack but on different DataNodes.
- Different rack
  - In this scenario, the mapper and the data reside on different racks.
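
The three types above can be sketched as a small classification function. This is a simplified model, not Hadoop's actual scheduler code: the names `Locality`, `classify`, `best_node`, and the node and rack names are all hypothetical and exist only for this example.

```python
from enum import IntEnum


class Locality(IntEnum):
    # Lower value = closer to the data = more preferred.
    DATA_LOCAL = 0
    RACK_LOCAL = 1
    DIFFERENT_RACK = 2


def classify(mapper_node, replica_nodes, rack_of):
    """Return the locality type for a mapper running on `mapper_node`
    when the block's replicas are stored on `replica_nodes`."""
    if mapper_node in replica_nodes:
        return Locality.DATA_LOCAL
    replica_racks = {rack_of[node] for node in replica_nodes}
    if rack_of[mapper_node] in replica_racks:
        return Locality.RACK_LOCAL
    return Locality.DIFFERENT_RACK


def best_node(candidates, replica_nodes, rack_of):
    """Pick the candidate node that is closest to the data."""
    return min(candidates, key=lambda node: classify(node, replica_nodes, rack_of))


# Hypothetical topology: which rack each node sits on.
RACK_OF = {"node1": "rackA", "node2": "rackA", "node3": "rackB", "node4": "rackC"}
# Hypothetical replica placement for one block.
REPLICAS = ["node1", "node3"]

print(classify("node1", REPLICAS, RACK_OF).name)  # DATA_LOCAL
print(classify("node2", REPLICAS, RACK_OF).name)  # RACK_LOCAL
print(classify("node4", REPLICAS, RACK_OF).name)  # DIFFERENT_RACK
print(best_node(["node4", "node2"], REPLICAS, RACK_OF))  # node2
```

Ordering the types as integers lets `min` express the preference directly: a data-local node wins over a rack-local one, which in turn wins over a node on a different rack.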