
Data lake design patterns in Hadoop

Introduction

A data lake stores large amounts of raw data in one place. Design patterns help organize this data so it is easy to find, trust, and use.

Common situations where these patterns help:

When collecting data from many sources, such as apps, sensors, and databases.
When you want to keep data in its original form for future use.
When you need to support different types of data, like text, images, and logs.
When you want to separate raw data from cleaned and processed data.
When you want to control who can access different parts of the data.
Syntax
Hadoop
Data Lake Design Patterns:

1. Raw Zone (Landing Zone): Store original data as-is.
2. Processed Zone (Cleansed Zone): Store cleaned and transformed data.
3. Curated Zone: Store data ready for analysis.
4. Metadata Layer: Store information about data (like source, format).
5. Access Layer: Manage how users and tools access data.

Use folders or tables to separate zones in Hadoop HDFS or Hive.

Zones help keep data organized and easy to manage.
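As a rough illustration (plain Python, not a Hadoop API), a small helper can tell which zone a path belongs to, so tools know how processed the data is. The folder names follow the layout used in this article; the function itself is hypothetical.

```python
# Zone names matching the folder layout in this article.
ZONES = ("raw", "processed", "curated")

def zone_of(path):
    """Return the zone a /data_lake/... path belongs to, or None."""
    parts = path.strip("/").split("/")
    if len(parts) >= 2 and parts[0] == "data_lake" and parts[1] in ZONES:
        return parts[1]
    return None

print(zone_of("/data_lake/raw/events/part-0001.json"))        # raw
print(zone_of("/data_lake/curated/sales/part-0001.parquet"))  # curated
```

A lookup like this keeps zone logic in one place instead of scattering path checks across jobs.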

Metadata helps users understand what data is available and how to use it.
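For example, a minimal metadata record might look like the sketch below. The field names are illustrative, not a Hadoop or Hive metastore schema.

```python
from datetime import datetime, timezone

# Hypothetical metadata record for one dataset in the lake.
def describe_dataset(name, source, fmt, location):
    return {
        "name": name,
        "source": source,          # where the data came from
        "format": fmt,             # file format, e.g. json or parquet
        "location": location,      # zone path in the data lake
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }

meta = describe_dataset("raw_events", "mobile_app", "json",
                        "/data_lake/raw/events/")
print(meta["format"])  # json
```

In practice this information lives in a catalog such as the Hive metastore, but the idea is the same: record source, format, location, and update time for every dataset.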

Examples
Use separate folders in HDFS to organize data by its processing stage.
Hadoop
/data_lake/raw/  # Raw Zone folder for original files
/data_lake/processed/  # Processed Zone for cleaned data
/data_lake/curated/  # Curated Zone for analysis-ready data
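On a real cluster you would create these folders with `hdfs dfs -mkdir -p`. As a local sketch, the same layout can be mirrored on an ordinary filesystem with plain Python:

```python
from pathlib import Path
import tempfile

# Sketch: mirror the zone layout in a temporary local directory.
# On HDFS, the equivalent is `hdfs dfs -mkdir -p /data_lake/raw` etc.
base = Path(tempfile.mkdtemp()) / "data_lake"
for zone in ("raw", "processed", "curated"):
    (base / zone).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in base.iterdir()))  # ['curated', 'processed', 'raw']
```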
Create Hive tables pointing to raw data files for easy querying.
Hadoop
CREATE EXTERNAL TABLE raw_events (
  event_id STRING,
  event_time STRING,
  event_data STRING
)
STORED AS PARQUET
LOCATION '/data_lake/raw/events/';
Curated tables hold cleaned and structured data ready for reports.
Hadoop
CREATE EXTERNAL TABLE curated_sales (
  sale_id STRING,
  sale_date DATE,
  amount DOUBLE
)
STORED AS PARQUET
LOCATION '/data_lake/curated/sales/';
Sample Program

This code reads raw JSON event data from the raw zone, cleans it by selecting columns and filtering, then saves it to the processed zone. Finally, it reads and shows the processed data.

Hadoop
from pyspark.sql import SparkSession

# Start Spark session
spark = SparkSession.builder.appName('DataLakeDesign').getOrCreate()

# Load raw data from HDFS
raw_df = spark.read.json('/data_lake/raw/events/')

# Show raw data
print('Raw Data:')
raw_df.show(3)

# Clean data: select needed columns and filter
processed_df = raw_df.select('event_id', 'event_time').filter(raw_df.event_id.isNotNull())

# Save processed data back to HDFS
processed_df.write.mode('overwrite').parquet('/data_lake/processed/events/')

# Reload processed data to verify the write
processed_check_df = spark.read.parquet('/data_lake/processed/events/')

# Show processed data
print('Processed Data:')
processed_check_df.show(3)

spark.stop()
Important Notes

Keep raw data unchanged to allow reprocessing if needed.
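One simple way to enforce this is a guard that rejects writes into the raw zone. This is a hypothetical sketch, not a Hadoop permission mechanism; on a real cluster you would use HDFS permissions or Ranger policies instead.

```python
# Hypothetical guard: refuse writes into the raw zone so the
# original data stays untouched and can be reprocessed later.
def check_write_allowed(path):
    if path.strip("/").startswith("data_lake/raw"):
        raise PermissionError(f"raw zone is read-only: {path}")
    return True

print(check_write_allowed("/data_lake/processed/events/"))  # True
try:
    check_write_allowed("/data_lake/raw/events/")
except PermissionError as e:
    print("blocked:", e)
```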

Use metadata to track data source, format, and update times.

Design zones to match your team's workflow and tools.

Summary

Data lake design patterns organize data into zones: raw, processed, and curated.

This helps keep data clean, easy to find, and ready for analysis.

Use folders and tables in Hadoop to separate and manage these zones.