Hadoop · Data · ~10 mins

Input splits and data locality in Hadoop - Step-by-Step Execution

Concept Flow - Input splits and data locality
Start: Large Input File → Split Input into Chunks → Assign Splits to Nodes → Check Data Locality → Process Locally → Map Task → Reduce Task → Output Result
The large input file is split into chunks called input splits. Each split is assigned to a node, preferring nodes that already have the data (data locality). If data is local, processing is faster; otherwise, data is fetched over the network.
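The split-into-chunks step can be sketched in a few lines of Python. This is a minimal illustration, assuming a fixed 64 MB split size; real Hadoop split sizing also depends on the HDFS block size and the configured minimum/maximum split sizes.

```python
# Sketch: compute input-split boundaries for a file of a given size.
# SPLIT_SIZE of 64 MB is an assumption taken from the example above.
SPLIT_SIZE = 64 * 1024 * 1024  # 64 MB

def compute_splits(file_size, split_size=SPLIT_SIZE):
    """Return (start_offset, length) pairs covering the whole file."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)  # last split may be shorter
        splits.append((offset, length))
        offset += length
    return splits

# A 256 MB file yields four 64 MB splits, matching the execution table.
splits = compute_splits(256 * 1024 * 1024)
print(len(splits))  # 4
```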
Execution Sample
Hadoop
# Illustrative pseudocode: split_input, assign_node, process_locally, and
# fetch_and_process are placeholder helpers, not real Hadoop APIs.
input_file = 'bigdata.txt'
splits = split_input(input_file, size='64MB')   # divide the file into 64 MB input splits
for split in splits:
    node = assign_node(split)                   # prefer a node that already holds the data
    if node.has_data(split):
        process_locally(split, node)            # data-local map task
    else:
        fetch_and_process(split, node)          # read the split over the network, then process
This pseudocode splits a large file into 64 MB chunks, assigns each chunk to a node, and processes the chunk locally when the node already holds the data; otherwise the split is fetched over the network before processing.
Execution Table
Step | Input Split | Assigned Node | Data Locality? | Action | Output
1 | Split 1 (0-64 MB) | Node A | Yes | Process locally | Map task on Node A
2 | Split 2 (64-128 MB) | Node B | No | Fetch data over network | Map task on Node B
3 | Split 3 (128-192 MB) | Node C | Yes | Process locally | Map task on Node C
4 | Split 4 (192-256 MB) | Node A | No | Fetch data over network | Map task on Node A
5 | All splits processed | - | - | Start reduce phase | Reduce task aggregates results
6 | Job complete | - | - | - | Final output ready
💡 Once all input splits are processed and the reduce phase completes, the job is finished.
Variable Tracker
Each variable's state from Start through steps 1-4 (the step-4 state is also the final state):
splits: [] → [Split1] → [Split1, Split2] → [Split1, Split2, Split3] → [Split1, Split2, Split3, Split4]
assigned_nodes: {} → {Split1: NodeA} → {Split1: NodeA, Split2: NodeB} → {Split1: NodeA, Split2: NodeB, Split3: NodeC} → {Split1: NodeA, Split2: NodeB, Split3: NodeC, Split4: NodeA}
data_locality: {} → {Split1: Yes} → {Split1: Yes, Split2: No} → {Split1: Yes, Split2: No, Split3: Yes} → {Split1: Yes, Split2: No, Split3: Yes, Split4: No}
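The tracker's step-by-step state can be reproduced with a short Python loop. The schedule below is hard-coded to mirror the execution table's rows 1-4; it is not computed by a real scheduler.

```python
# Sketch: build up assigned_nodes and data_locality per step,
# mirroring the variable tracker above.
assigned_nodes = {}
data_locality = {}

# (split, node, has_local_copy) for each step, taken from the table
schedule = [("Split1", "NodeA", True), ("Split2", "NodeB", False),
            ("Split3", "NodeC", True), ("Split4", "NodeA", False)]

for split, node, is_local in schedule:
    assigned_nodes[split] = node
    data_locality[split] = is_local
    action = "Process locally" if is_local else "Fetch data over network"
    print(f"{split} -> {node}: {action}")
```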
Key Moments - 3 Insights
Why does processing a split locally matter?
Processing locally avoids network data transfer, making the map task faster and reducing network load, as shown in rows 1 and 3 of the execution table where data locality is 'Yes'.
What happens if the node does not have the data locally?
The node fetches the data over the network before processing, which takes more time, as seen in rows 2 and 4 where data locality is 'No' and action is 'Fetch data over network'.
How are input splits assigned to nodes?
The scheduler prefers nodes that already hold a local copy of the split's data; if no such node is available, any node can take the split and fetch the data remotely, as shown in the 'Assigned Node' and 'Data Locality?' columns.
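The locality-preferring assignment described above can be sketched as a small Python function. The replica locations are invented for illustration; a real Hadoop scheduler consults the NameNode's block-location metadata.

```python
# Sketch: pick a node holding a local replica if one is available,
# otherwise fall back to any available node (remote read required).
def assign(split, replica_nodes, available_nodes):
    for node in available_nodes:
        if node in replica_nodes:
            return node, True          # data-local assignment
    return available_nodes[0], False   # no local copy: fetch over network

# Hypothetical replica placement and cluster membership
replicas = {"Split1": {"NodeA", "NodeC"}, "Split2": {"NodeD"}}
nodes = ["NodeA", "NodeB", "NodeC"]

print(assign("Split1", replicas["Split1"], nodes))  # ('NodeA', True)
print(assign("Split2", replicas["Split2"], nodes))  # ('NodeA', False)
```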
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, which split is processed locally on Node C?
A. Split 1
B. Split 2
C. Split 3
D. Split 4
💡 Hint
Check the 'Assigned Node' and 'Data Locality?' columns in rows 1-4.
At which step does the job start the reduce phase?
A. Step 5
B. Step 6
C. Step 4
D. Step 3
💡 Hint
Look at the 'Action' column for the step mentioning 'Start reduce phase'.
If all splits had data locality 'Yes', what would change in the execution table?
A. More splits would be assigned to Node B
B. All actions would be 'Process locally' with no network fetch
C. Reduce phase would start earlier
D. Job would not complete
💡 Hint
Compare rows where 'Data Locality?' is 'No' and see their actions.
Concept Snapshot
Input splits break large data into chunks.
Each split is assigned to a node.
Data locality means processing where data lives.
Local processing is faster, avoids network delay.
If no local data, node fetches data over network.
Map tasks run on splits, then reduce aggregates results.
Full Transcript
In Hadoop, large input files are split into smaller chunks called input splits. Each split is assigned to a node in the cluster. The system tries to assign splits to nodes that already have the data locally, which is called data locality. When data locality is achieved, the map task processes the split directly on that node, making processing faster and reducing network traffic. If the node does not have the data locally, it fetches the data over the network before processing, which is slower. After all splits are processed by map tasks, the reduce phase starts to aggregate the results. This process ensures efficient distributed data processing by minimizing data movement across the network.
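The map-then-reduce flow in the transcript can be illustrated with a minimal word-count sketch in plain Python. This is not Hadoop's actual API: each "split" is just a string, each map task counts words within one split, and the reduce step merges the per-split counts.

```python
# Minimal map/reduce sketch: independent map tasks per split,
# then one reduce step aggregating the partial results.
from collections import Counter

def map_task(split_text):
    """Map phase: count words within a single input split."""
    return Counter(split_text.split())

def reduce_task(partial_counts):
    """Reduce phase: merge the per-split word counts."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

splits = ["big data big", "data locality"]
result = reduce_task(map_task(s) for s in splits)
print(result["big"], result["data"])  # 2 2
```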