What are LOAD, FILTER, and STORE operations in Hadoop?


LOAD, FILTER, and STORE operations in Hadoop


Introduction

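LOAD, FILTER, and STORE are Pig Latin operators from Apache Pig, which runs on Hadoop. LOAD reads data from Hadoop storage into a relation, FILTER keeps only the rows that match a condition, and STORE writes the result back to storage. Together they let you process big data one step at a time.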

When you want to read data from a file in Hadoop to analyze it.

When you need to keep only certain rows from a large dataset based on a condition.

When you want to save the filtered or processed data back to Hadoop storage.

When cleaning data by removing unwanted records before further analysis.

When preparing data for another program or step by saving the results.

Syntax

Hadoop

data = LOAD 'input_path' USING PigStorage(',');
filtered_data = FILTER data BY condition;
STORE filtered_data INTO 'output_path' USING PigStorage(',');

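LOAD reads data from a file or folder into a relation; PigStorage(',') splits each line into fields on commas.

FILTER keeps rows where the condition is true.

STORE writes the relation to an output folder.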
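The syntax above does not name the fields. As a minimal sketch, assuming each line of the input holds two comma-separated values (the column names name and age are hypothetical, chosen only for illustration), an AS clause on LOAD declares a schema so that later steps can refer to fields by name:

```pig
-- Assumed layout: each line is "name,age" (hypothetical column names).
users = LOAD 'input_path' USING PigStorage(',') AS (name:chararray, age:int);
-- With a schema declared, FILTER can use field names instead of positions.
over_30 = FILTER users BY age > 30;
STORE over_30 INTO 'output_path' USING PigStorage(',');
```

Without the AS clause, fields have no names and are addressed by position ($0, $1, and so on).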

Examples

Load a CSV file named users.csv into a relation called data.

Hadoop

data = LOAD 'users.csv' USING PigStorage(',');

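Keep only rows where the age column is greater than 30. The LOAD above declares no schema, so age is referenced by position ($1, the second field) and cast to int.

Hadoop

filtered_data = FILTER data BY (int)$1 > 30;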
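A FILTER condition can also combine several tests. The sketch below is illustrative only: it assumes the same users.csv, with the age in the second field, and the relation names in_range and a_rows are made up for this example.

```pig
data = LOAD 'users.csv' USING PigStorage(',');
-- Keep rows whose second field is present and lies between 31 and 64.
in_range = FILTER data BY ($1 IS NOT NULL) AND ((int)$1 > 30) AND ((int)$1 < 65);
-- Keep rows whose first field starts with 'A'.
-- matches takes a Java regular expression that must match the whole value.
a_rows = FILTER data BY ((chararray)$0 matches 'A.*');
```

Conditions can be joined with AND, OR, and NOT; parentheses keep the grouping explicit.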

Save the filtered data into a folder named output/users_over_30.

Hadoop

STORE filtered_data INTO 'output/users_over_30' USING PigStorage(',');

Sample Program

This program loads a CSV file with user data, filters users older than 30, and stores the result.

Hadoop

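-- Load comma-separated user records; no schema is declared, so fields are $0, $1, ...
data = LOAD 'input/users.csv' USING PigStorage(',');
-- Keep users older than 30; $1 (the second field) is cast to int before comparing.
filtered_data = FILTER data BY (int)$1 > 30;
-- Write the filtered rows back out as comma-separated text.
STORE filtered_data INTO 'output/users_over_30' USING PigStorage(',');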
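Before storing, it can help to look at what the filter produced. This is a hedged sketch that reuses the relations from the program above; preview is a hypothetical relation name introduced here.

```pig
data = LOAD 'input/users.csv' USING PigStorage(',');
filtered_data = FILTER data BY (int)$1 > 30;
-- Show the schema Pig knows for the relation (none was declared in LOAD here).
DESCRIBE filtered_data;
-- Take a handful of rows rather than printing the whole relation.
preview = LIMIT filtered_data 5;
DUMP preview;
```

DUMP prints rows to the console and is meant for small checks; STORE remains the way to persist results.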

On success, the filtered rows are written as part files inside output/users_over_30.

Important Notes

Make sure the input path exists and is accessible in Hadoop.

FILTER conditions must match the data types. Fields loaded without a schema default to bytearray, so cast them (for example, (int)$1) before comparing.

STORE creates the output folder. If the folder already exists, the job fails with an error, so remove it or choose a new path before running.
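One way to avoid the existing-folder error when rerunning a script is to clear the old output first. The sketch below assumes the paths from the sample program and that the previous results are safe to discard, since rmf deletes them.

```pig
-- Remove any earlier output; rmf does not complain if the path is missing.
rmf output/users_over_30;
data = LOAD 'input/users.csv' USING PigStorage(',');
filtered_data = FILTER data BY (int)$1 > 30;
STORE filtered_data INTO 'output/users_over_30' USING PigStorage(',');
```

If earlier results must be kept, writing each run to a new output path is the safer choice.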

Summary

LOAD reads data from Hadoop storage into a relation for processing.

FILTER selects only the rows you want based on a condition.

STORE saves your processed data back to Hadoop storage.