
Pig Latin basics in Hadoop - Step-by-Step Execution

Concept Flow - Pig Latin basics
Load Data
Apply Transformations
Filter or Group Data
Generate Output
Store or Dump Results
Pig Latin scripts start by loading data, then apply transformations like filtering or grouping, and finally output the results.
Execution Sample
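A = LOAD 'data.txt' USING PigStorage(',');   -- read comma-separated rows into relation A
B = FILTER A BY $0 > 10;                     -- keep rows whose first field is greater than 10
C = GROUP B BY $1;                           -- group the remaining rows by the second field
D = FOREACH C GENERATE group, COUNT(B);      -- one row per group: the key and its row count
DUMP D;                                      -- print relation D to the console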
This script loads comma-separated data, keeps rows whose first column is greater than 10, groups them by the second column, counts the rows in each group, and prints the results.
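The script above addresses columns by position ($0, $1). A minimal sketch of the same pipeline with a declared schema follows; the field names qty and fruit, the alias row_count, and the chosen types are assumptions made for illustration and do not appear in the original script.

```pig
-- Same pipeline with an explicit schema.
-- qty, fruit and row_count are hypothetical names chosen for this sketch.
A = LOAD 'data.txt' USING PigStorage(',') AS (qty:int, fruit:chararray);
B = FILTER A BY qty > 10;                               -- same condition as $0 > 10
C = GROUP B BY fruit;                                   -- same key as $1
D = FOREACH C GENERATE group AS fruit, COUNT(B) AS row_count;
DUMP D;
```

Declaring the types up front makes the comparison qty > 10 an explicit integer comparison rather than relying on Pig to cast an untyped field, and named fields tend to be easier to read than positional references in longer scripts.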
Execution Table
Step | Action | Input Data Snapshot | Result | Notes
1 | LOAD data.txt | Raw file with rows like (12,apple), (6,banana) | Relation A with all rows | Data loaded into relation A
2 | FILTER A BY $0 > 10 | Relation A with rows (12,apple), (6,banana) | Relation B with row (12,apple) | Only rows with first column > 10 kept
3 | GROUP B BY $1 | Relation B with (12,apple) | Relation C with one group: (apple,{(12,apple)}) | Rows grouped by second column
4 | FOREACH C GENERATE group, COUNT(B) | Grouped relation C | Relation D with (apple,1) | Count of rows per group calculated
5 | DUMP D | Relation D | Output: (apple,1) | Results displayed on screen
💡 All steps completed, final output dumped
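One way to check the intermediate relations from the table yourself is to DUMP each of them, for example in the Grunt shell. The sketch below assumes data.txt contains exactly the two sample rows 12,apple and 6,banana; the outputs shown in the comments follow from that assumption only.

```pig
A = LOAD 'data.txt' USING PigStorage(',');
B = FILTER A BY $0 > 10;
C = GROUP B BY $1;
D = FOREACH C GENERATE group, COUNT(B);

-- Assumes data.txt holds only: 12,apple and 6,banana
DUMP B;   -- (12,apple)
DUMP C;   -- (apple,{(12,apple)})
DUMP D;   -- (apple,1)
```

Each DUMP runs the pipeline up to that relation, so this is best kept to small test files.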
Variable Tracker
Relation | Start | After Step 1 | After Step 2 | After Step 3 | After Step 4 | Final
A | empty | All rows from data.txt | All rows from data.txt | All rows from data.txt | All rows from data.txt | All rows from data.txt
B | empty | empty | Rows with $0 > 10 | Rows with $0 > 10 | Rows with $0 > 10 | Rows with $0 > 10
C | empty | empty | empty | Grouped by $1 | Grouped by $1 | Grouped by $1
D | empty | empty | empty | empty | Group counts | Group counts
Key Moments - 3 Insights
Why does FILTER keep only some rows and not all?
FILTER keeps only the rows for which the condition is true. In step 2 of the Execution Table, only rows whose first column is greater than 10 remain.
What does GROUP BY do to the data?
GROUP BY collects rows that share the same key into one group. See step 3 of the Execution Table, where rows are grouped by the second column.
Why do we use FOREACH after GROUP BY?
FOREACH lets us process each group separately, for example counting the rows in each group as shown in step 4 of the Execution Table.
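As a sketch of what else FOREACH can compute per group, the variant below adds SUM and MAX alongside COUNT. The schema (qty:int, fruit:chararray) and the output aliases are hypothetical and not part of the original script, which uses positional fields.

```pig
-- qty, fruit, row_count, total_qty and max_qty are hypothetical names for this sketch.
A = LOAD 'data.txt' USING PigStorage(',') AS (qty:int, fruit:chararray);
B = FILTER A BY qty > 10;
C = GROUP B BY fruit;
-- Inside FOREACH, B refers to the bag of rows belonging to the current group,
-- and B.qty is the bag of qty values in that group.
D = FOREACH C GENERATE
        group      AS fruit,
        COUNT(B)   AS row_count,
        SUM(B.qty) AS total_qty,
        MAX(B.qty) AS max_qty;
DUMP D;
```

With only the two sample rows (12,apple) and (6,banana) in data.txt, this should print (apple,1,12,12), since apple is the only group that survives the filter.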
Visual Quiz - 3 Questions
Test your understanding
Look at step 2 of the Execution Table. What rows does relation B contain?
A. Rows where second column is 'apple'
B. Rows where first column is greater than 10
C. All rows from the file
D. Empty relation
💡 Hint
Check the 'Action' and 'Result' columns at step 2 in the Execution Table
At which step does the data get grouped by the second column?
A. Step 1
B. Step 2
C. Step 3
D. Step 4
💡 Hint
Look for the 'GROUP BY' action in the Execution Table
If the FILTER condition changed to $0 > 5, how would relation B change after step 2?
A. It would contain more rows
B. It would contain fewer rows
C. It would be empty
D. It would be unchanged
💡 Hint
Compare the FILTER condition with the sample rows and its effect on relation B in the Variable Tracker
Concept Snapshot
Pig Latin basics:
- LOAD reads data into a relation
- FILTER keeps rows matching a condition
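- GROUP BY collects rows that share a key into one group
- FOREACH processes each row of a relation (after GROUP BY, each group)
- DUMP prints results to the screen; STORE writes them to a file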
Simple steps to transform big data
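The concept flow lists "Store or Dump Results", but the sample script only uses DUMP. A minimal sketch of the STORE alternative is shown below; the output path output_dir is a hypothetical name chosen for illustration.

```pig
A = LOAD 'data.txt' USING PigStorage(',');
B = FILTER A BY $0 > 10;
C = GROUP B BY $1;
D = FOREACH C GENERATE group, COUNT(B);

-- 'output_dir' is a hypothetical path. Pig writes part files into this
-- directory, and the job fails if the directory already exists.
STORE D INTO 'output_dir' USING PigStorage(',');
```

DUMP is convenient while developing a script on small inputs, whereas STORE is the usual choice when the result needs to be kept or read by another job.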
Full Transcript
This visual trace shows how a Pig Latin script runs step by step. First, data is loaded from a file into a relation named A. Then FILTER keeps only the rows where the first column is greater than 10, creating relation B. Next, the rows in B are grouped by the second column, forming relation C. After grouping, FOREACH generates a new relation D containing each group key and the count of rows in that group. Finally, DUMP prints the results. Each step defines a new relation and leaves the earlier ones unchanged, as the Variable Tracker shows. Beginners often wonder why FILTER removes rows or how GROUP BY works; this trace answers both by showing the exact data at each step. The quiz tests understanding of these steps and their effects on the data.