Pig vs Hive comparison in Hadoop

Introduction

Pig and Hive are high-level tools for working with big data on Hadoop. They let you write short scripts or queries instead of hand-coding MapReduce programs.

This comparison is useful in the following situations:
When you want to process large data sets with short scripts.
When you prefer SQL-like queries for analyzing data.
When you need to transform data before analysis.
When you want to run batch processing jobs on Hadoop.
When you need to decide between a scripting language (Pig Latin) and a query language (HiveQL) for big data.
Syntax
Pig Latin example:
A = LOAD 'data' AS (name:chararray, age:int);
B = FILTER A BY age > 30;
STORE B INTO 'output';

HiveQL example:
SELECT name, age FROM data WHERE age > 30;

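Pig uses a scripting language called Pig Latin, in which each statement builds a new relation (A, B, ...) from an earlier one.

Hive uses HiveQL, which closely resembles SQL and queries tables (here, a table named data).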

Examples
This Pig script loads user data, keeps the users with a salary over 50000, and prints the result to the console with DUMP.
Pig Latin:
A = LOAD 'users' AS (id:int, name:chararray, salary:int);
B = FILTER A BY salary > 50000;
DUMP B;
This Hive query selects the id and name of users with a salary over 50000 from the users table.
HiveQL:
SELECT id, name FROM users WHERE salary > 50000;
Sample Program

This example filters employees older than 25 and counts them by department, first in Pig Latin and then in HiveQL.

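/* Pig Latin script example */
A = LOAD 'employee_data' AS (name:chararray, age:int, department:chararray);
-- Keep only employees older than 25.
B = FILTER A BY age > 25;
-- Collect the remaining rows into one group per department.
C = GROUP B BY department;
-- Emit each department with the number of rows in its group.
D = FOREACH C GENERATE group, COUNT(B);
DUMP D;

-- HiveQL equivalent
-- The table must be populated (for example with LOAD DATA) before the query returns rows.
CREATE TABLE employee_data(name STRING, age INT, department STRING);
SELECT department, COUNT(*) FROM employee_data WHERE age > 25 GROUP BY department;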
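To make the procedural versus declarative contrast concrete, the sketch below runs the same filter-and-count logic two ways in plain Python. It is only a stand-in: neither Pig nor Hive is involved, the sample rows are invented for illustration, and SQLite (from Python's standard library) plays the role of a SQL engine in place of Hive.

```python
import sqlite3
from collections import Counter

# Hypothetical sample rows (name, age, department); not taken from any real data set.
employees = [
    ("Asha", 31, "Sales"),
    ("Ben", 24, "Sales"),
    ("Chen", 45, "Engineering"),
    ("Dana", 29, "Engineering"),
    ("Eli", 22, "Support"),
]

# Pig-style data flow: every step produces a named intermediate result.
a = employees                                        # A = LOAD ...
b = [row for row in a if row[1] > 25]                # B = FILTER A BY age > 25
pig_style = dict(Counter(row[2] for row in b))       # C = GROUP ..., D = FOREACH ... COUNT

# Hive-style query: one declarative statement; the engine decides how to execute it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee_data (name TEXT, age INTEGER, department TEXT)")
conn.executemany("INSERT INTO employee_data VALUES (?, ?, ?)", employees)
rows = conn.execute(
    "SELECT department, COUNT(*) FROM employee_data "
    "WHERE age > 25 GROUP BY department"
).fetchall()
conn.close()
hive_style = dict(rows)

print(pig_style)
print(hive_style)
```

With these hypothetical rows, both approaches report one employee in Sales and two in Engineering. Support does not appear because its only employee is 22.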
Important Notes

Pig is better for data transformation and procedural tasks.

Hive is better for data summarization and ad-hoc queries.

Both run on Hadoop and compile scripts and queries into distributed jobs, classically MapReduce; recent versions can also target other engines such as Apache Tez.
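As a rough illustration of what "converting into MapReduce jobs" means, the sketch below expresses the filter-and-count-by-department task as a map step, a shuffle step, and a reduce step in plain Python. This is not the code that Pig or Hive generates; the function names and sample rows are hypothetical, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict

# Hypothetical sample rows (name, age, department).
records = [
    ("Asha", 31, "Sales"),
    ("Ben", 24, "Sales"),
    ("Chen", 45, "Engineering"),
    ("Dana", 29, "Engineering"),
    ("Eli", 22, "Support"),
]

def map_phase(record):
    """Emit (department, 1) for an employee older than 25; emit nothing otherwise."""
    _name, age, department = record
    if age > 25:
        yield department, 1

def shuffle(pairs):
    """Group mapped values by key, the job the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the ones emitted for a department to get its employee count."""
    return key, sum(values)

mapped = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

The filter lives in the map step and the count lives in the reduce step, which is the general shape a FILTER followed by GROUP and COUNT (or a WHERE with GROUP BY) takes when expressed as a single MapReduce job.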

Summary

Pig uses a scripting language; Hive uses SQL-like queries.

Pig is procedural; Hive is declarative.

Choose Pig for complex, multi-step data flows and Hive for straightforward SQL-style querying.