Hadoop · ~30 mins

LOAD, FILTER, and STORE operations in Hadoop - Mini Project: Build & Apply

📖 Scenario: You work with a large dataset of customer orders stored in Hadoop. You want to load this data, filter orders with amounts greater than 100, and save the filtered results for further analysis.
🎯 Goal: Build a Hadoop Pig script that loads the orders data, filters orders with amount greater than 100, and stores the filtered data into a new location.
📋 What You'll Learn
Load data from '/data/orders' with fields order_id, customer_id, and amount
Create a filter condition to keep only orders where amount > 100
Store the filtered results into '/data/filtered_orders'
💡 Why This Matters
🌍 Real World
Filtering large datasets in Hadoop is common for preparing data for analysis or reporting.
💼 Career
Data engineers and analysts use LOAD, FILTER, and STORE operations daily to manage big data pipelines.
Step 1: Load the orders data
Write a Pig Latin statement to load the data from '/data/orders' into a relation called orders. The data has three fields: order_id, customer_id, and amount.
Hint: Use LOAD with PigStorage and define the schema with AS.
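One way this step might look, as a minimal sketch. The comma delimiter and the field types (int, int, double) are assumptions; match them to how the data in '/data/orders' is actually stored:

```pig
-- Load delimited order records and name the fields with a schema
orders = LOAD '/data/orders' USING PigStorage(',')
         AS (order_id:int, customer_id:int, amount:double);
```

Declaring types in the AS clause lets the amount comparison in the next step behave numerically rather than as a byte-array comparison.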

Step 2: Filter orders with amount greater than 100
Create a new relation called filtered_orders by filtering orders to keep only rows where amount > 100.
Hint: Use FILTER with the condition amount > 100.
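A sketch of the filter, assuming the orders relation from the previous step with a numeric amount field:

```pig
-- Keep only rows whose amount exceeds 100
filtered_orders = FILTER orders BY amount > 100;
```

Like all Pig Latin statements, this is evaluated lazily: nothing runs until a STORE or DUMP triggers the pipeline.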

Step 3: Store the filtered orders
Write a statement to store the filtered_orders relation into the directory '/data/filtered_orders' using PigStorage with a comma separator.
Hint: Use STORE with PigStorage to save the filtered data.
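A sketch of the store step. Note that Pig expects '/data/filtered_orders' to be a directory that does not yet exist; the job fails if it already does:

```pig
-- Write the filtered rows out as comma-separated text files
STORE filtered_orders INTO '/data/filtered_orders' USING PigStorage(',');
```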

Step 4: Display the filtered orders
Write a statement to dump the filtered_orders relation to display the filtered data on the console.
Hint: Use DUMP to print the filtered data.
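The final statement is a one-liner; DUMP forces execution of the whole pipeline and prints each tuple to the console, which is handy for spot-checking but best avoided on very large relations:

```pig
-- Run the pipeline and print the filtered tuples to stdout
DUMP filtered_orders;
```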