Pig Latin is the data-flow language of Apache Pig. You describe a job as a short sequence of steps, and Pig turns those steps into jobs that run on Hadoop, which keeps work on large data sets readable and concise.
Pig Latin basics in Hadoop
Introduction
Pig Latin is a good fit in situations like these:
You want to clean or filter large data files quickly.
You need to join two big data tables to find connections.
You want to group data to find totals or averages.
You want to sort data to see top or bottom results.
You want to load data from files and save results back.
Syntax
Hadoop
alias = LOAD 'datafile' USING loader AS (field1:type, field2:type, ...);
filtered = FILTER alias BY condition;
grouped = GROUP filtered BY field;
result = FOREACH grouped GENERATE group, COUNT(filtered);
STORE result INTO 'output';
Each step creates a new alias (name) for the data.
Use LOAD to read data, FILTER to select rows, GROUP to collect by key, FOREACH to process groups, and STORE to save results.
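The introduction also mentions joining and sorting, which follow the same alias-per-step pattern. Here is a minimal sketch using JOIN, ORDER and LIMIT. The file orders.csv, its fields, and the alias names are assumptions made up for illustration.

```pig
-- Hypothetical inputs: users.csv (name, age, city) and orders.csv (name, amount)
users  = LOAD 'users.csv'  USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
orders = LOAD 'orders.csv' USING PigStorage(',') AS (name:chararray, amount:double);

-- Inner join on the shared name field
joined = JOIN users BY name, orders BY name;

-- After a join, fields from each input are referenced as alias::field
pairs = FOREACH joined GENERATE users::name AS name, users::city AS city, orders::amount AS amount;

-- Sort by amount, largest first, then keep the first 10 rows
sorted = ORDER pairs BY amount DESC;
top10  = LIMIT sorted 10;

DUMP top10;
```

Projecting the joined relation with FOREACH right after the JOIN gives the fields short names again, so later steps do not need the alias::field form.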
Examples
Load a CSV file with user info, defining each column's name and type.
Hadoop
data = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
Select only users who are 18 or older.
Hadoop
adults = FILTER data BY age >= 18;
Group users by their city.
Hadoop
grouped_by_city = GROUP adults BY city;
Count how many adults are in each city.
Hadoop
count_by_city = FOREACH grouped_by_city GENERATE group, COUNT(adults);
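The same grouped relation can feed other aggregates, such as averages. The sketch below repeats the earlier steps so that it runs on its own. The alias stats_by_city and the output field names are assumptions chosen for illustration.

```pig
data = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
adults = FILTER data BY age >= 18;
grouped_by_city = GROUP adults BY city;

-- Each grouped row has two fields: group (the city) and a bag named adults
-- holding every row for that city. DESCRIBE prints that structure.
DESCRIBE grouped_by_city;

-- adults.age projects the age column out of the bag so AVG and MAX can use it
stats_by_city = FOREACH grouped_by_city GENERATE
    group AS city,
    COUNT(adults) AS adult_count,
    AVG(adults.age) AS avg_age,
    MAX(adults.age) AS max_age;

DUMP stats_by_city;
```

Note that the bag inside each group is named after the alias that was grouped (adults here), which is why COUNT(adults) works and COUNT(data) would not.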
Sample Program
This program loads user data, filters adults, groups them by city, counts adults per city, and shows the result.
Hadoop
users = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
adults = FILTER users BY age >= 18;
grouped = GROUP adults BY city;
count_by_city = FOREACH grouped GENERATE group AS city, COUNT(adults) AS adult_count;
DUMP count_by_city;
Important Notes
Always define the schema (field names and types) when loading data for clarity.
Use DUMP to print results to the console while testing, and STORE to write them to files.
Pig Latin statements are written in order, and each step builds on an alias defined earlier. Pig does not run anything until it reaches a DUMP or STORE.
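The notes above can be sketched in one short script. The output directory name adults_out and the aliases raw and names_only are assumptions used only for illustration.

```pig
-- Without AS (...), fields have no names and are referenced by position ($0, $1, ...)
raw = LOAD 'users.csv' USING PigStorage(',');
names_only = FOREACH raw GENERATE $0;

-- With a schema, fields get names and types that later steps can rely on
users = LOAD 'users.csv' USING PigStorage(',') AS (name:chararray, age:int, city:chararray);
DESCRIBE users;

adults = FILTER users BY age >= 18;

-- DUMP prints the rows to the console, which is handy while testing
DUMP adults;

-- STORE writes the rows to an output directory as comma-separated text.
-- The job typically fails if that directory already exists.
STORE adults INTO 'adults_out' USING PigStorage(',');
```

DESCRIBE only prints the schema and does not start a job, so it is a cheap way to check field names and types before running DUMP or STORE.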
Summary
Pig Latin is a simple language to process big data step-by-step.
Use LOAD, FILTER, GROUP, FOREACH, and STORE to handle data easily.
It helps you clean, group, count, and save big data results quickly.