Hadoop · Data · ~10 mins

Input splits and data locality in Hadoop - Step-by-Step Execution

Concept Flow - Input splits and data locality
Start: Large Input File → Split Input into Chunks → Assign Splits to Nodes → Check Data Locality → Process Locally → Map Task → Reduce Task → Output Result
The large input file is split into chunks called input splits. Each split is assigned to a node, preferring nodes that already have the data (data locality). If data is local, processing is faster; otherwise, data is fetched over the network.
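The split-into-chunks step can be sketched in a few lines of Python. This is a minimal illustration, assuming a fixed 64 MB split size; real Hadoop split sizing also depends on the HDFS block size and the configured minimum/maximum split sizes.

```python
# Sketch: compute input-split boundaries for a file of a given size.
# SPLIT_SIZE of 64 MB is an assumption taken from the example above.
SPLIT_SIZE = 64 * 1024 * 1024  # 64 MB

def compute_splits(file_size, split_size=SPLIT_SIZE):
    """Return (start_offset, length) pairs covering the whole file."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)  # last split may be shorter
        splits.append((offset, length))
        offset += length
    return splits

# A 256 MB file yields four 64 MB splits, matching the execution table.
splits = compute_splits(256 * 1024 * 1024)
print(len(splits))  # 4
```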
Execution Sample
Hadoop
# Illustrative pseudocode: split_input, assign_node, process_locally, and
# fetch_and_process are placeholder helpers, not real Hadoop APIs.
input_file = 'bigdata.txt'
splits = split_input(input_file, size='64MB')   # divide the file into 64 MB input splits
for split in splits:
    node = assign_node(split)                   # prefer a node that already holds the data
    if node.has_data(split):
        process_locally(split, node)            # data-local map task
    else:
        fetch_and_process(split, node)          # read the split over the network, then process
This pseudocode splits a large file into 64 MB chunks, assigns each chunk to a node, and processes the chunk locally when the node already holds the data; otherwise the split is fetched over the network before processing.
Execution Table
Step | Input Split | Assigned Node | Data Locality? | Action | Output
1 | Split 1 (0-64 MB) | Node A | Yes | Process locally | Map task on Node A
2 | Split 2 (64-128 MB) | Node B | No | Fetch data over network | Map task on Node B
3 | Split 3 (128-192 MB) | Node C | Yes | Process locally | Map task on Node C
4 | Split 4 (192-256 MB) | Node A | No | Fetch data over network | Map task on Node A
5 | All splits processed | - | - | Start reduce phase | Reduce task aggregates results
6 | Job complete | - | - | - | Final output ready
💡 Once all input splits are processed and the reduce phase completes, the job is finished.
Variable Tracker
Each variable's state from Start through steps 1-4 (the step-4 state is also the final state):
splits: [] → [Split1] → [Split1, Split2] → [Split1, Split2, Split3] → [Split1, Split2, Split3, Split4]
assigned_nodes: {} → {Split1: NodeA} → {Split1: NodeA, Split2: NodeB} → {Split1: NodeA, Split2: NodeB, Split3: NodeC} → {Split1: NodeA, Split2: NodeB, Split3: NodeC, Split4: NodeA}
data_locality: {} → {Split1: Yes} → {Split1: Yes, Split2: No} → {Split1: Yes, Split2: No, Split3: Yes} → {Split1: Yes, Split2: No, Split3: Yes, Split4: No}
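The tracker's step-by-step state can be reproduced with a short Python loop. The schedule below is hard-coded to mirror the execution table's rows 1-4; it is not computed by a real scheduler.

```python
# Sketch: build up assigned_nodes and data_locality per step,
# mirroring the variable tracker above.
assigned_nodes = {}
data_locality = {}

# (split, node, has_local_copy) for each step, taken from the table
schedule = [("Split1", "NodeA", True), ("Split2", "NodeB", False),
            ("Split3", "NodeC", True), ("Split4", "NodeA", False)]

for split, node, is_local in schedule:
    assigned_nodes[split] = node
    data_locality[split] = is_local
    action = "Process locally" if is_local else "Fetch data over network"
    print(f"{split} -> {node}: {action}")
```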
Key Moments - 3 Insights
Why does processing a split locally matter?
Processing locally avoids network data transfer, making the map task faster and reducing network load, as shown in rows 1 and 3 of the execution table where data locality is 'Yes'.
What happens if the node does not have the data locally?
The node fetches the data over the network before processing, which takes more time, as seen in rows 2 and 4 where data locality is 'No' and action is 'Fetch data over network'.
How are input splits assigned to nodes?
The scheduler prefers nodes that already hold a local copy of the split's data; if no such node is available, any node can take the split and fetch the data remotely, as shown in the 'Assigned Node' and 'Data Locality?' columns.
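The locality-preferring assignment described above can be sketched as a small Python function. The replica locations are invented for illustration; a real Hadoop scheduler consults the NameNode's block-location metadata.

```python
# Sketch: pick a node holding a local replica if one is available,
# otherwise fall back to any available node (remote read required).
def assign(split, replica_nodes, available_nodes):
    for node in available_nodes:
        if node in replica_nodes:
            return node, True          # data-local assignment
    return available_nodes[0], False   # no local copy: fetch over network

# Hypothetical replica placement and cluster membership
replicas = {"Split1": {"NodeA", "NodeC"}, "Split2": {"NodeD"}}
nodes = ["NodeA", "NodeB", "NodeC"]

print(assign("Split1", replicas["Split1"], nodes))  # ('NodeA', True)
print(assign("Split2", replicas["Split2"], nodes))  # ('NodeA', False)
```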
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, which split is processed locally on Node C?
A. Split 1
B. Split 2
C. Split 3
D. Split 4
💡 Hint
Check the 'Assigned Node' and 'Data Locality?' columns in rows 1-4.
At which step does the job start the reduce phase?
A. Step 5
B. Step 6
C. Step 4
D. Step 3
💡 Hint
Look at the 'Action' column for the step mentioning 'Start reduce phase'.
If all splits had data locality 'Yes', what would change in the execution table?
A. More splits would be assigned to Node B
B. All actions would be 'Process locally' with no network fetch
C. Reduce phase would start earlier
D. Job would not complete
💡 Hint
Compare rows where 'Data Locality?' is 'No' and see their actions.
Concept Snapshot
Input splits break large data into chunks.
Each split is assigned to a node.
Data locality means processing where data lives.
Local processing is faster, avoids network delay.
If no local data, node fetches data over network.
Map tasks run on splits, then reduce aggregates results.
Full Transcript
In Hadoop, large input files are split into smaller chunks called input splits. Each split is assigned to a node in the cluster. The system tries to assign splits to nodes that already have the data locally, which is called data locality. When data locality is achieved, the map task processes the split directly on that node, making processing faster and reducing network traffic. If the node does not have the data locally, it fetches the data over the network before processing, which is slower. After all splits are processed by map tasks, the reduce phase starts to aggregate the results. This process ensures efficient distributed data processing by minimizing data movement across the network.
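The map-then-reduce flow in the transcript can be illustrated with a minimal word-count sketch in plain Python. This is not Hadoop's actual API: each "split" is just a string, each map task counts words within one split, and the reduce step merges the per-split counts.

```python
# Minimal map/reduce sketch: independent map tasks per split,
# then one reduce step aggregating the partial results.
from collections import Counter

def map_task(split_text):
    """Map phase: count words within a single input split."""
    return Counter(split_text.split())

def reduce_task(partial_counts):
    """Reduce phase: merge the per-split word counts."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

splits = ["big data big", "data locality"]
result = reduce_task(map_task(s) for s in splits)
print(result["big"], result["data"])  # 2 2
```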