
Data lake design patterns in Hadoop

Introduction

A data lake stores large amounts of raw data in one place. Design patterns help organize this data so it is easy to find, trust, and use.

Common situations where these patterns help:

When collecting data from many sources, such as apps, sensors, and databases.
When you want to keep data in its original form for future use.
When you need to support different types of data, like text, images, and logs.
When you want to separate raw data from cleaned and processed data.
When you want to control who can access different parts of the data.
Syntax
Hadoop
Data Lake Design Patterns:

1. Raw Zone (Landing Zone): Store original data as-is.
2. Processed Zone (Cleansed Zone): Store cleaned and transformed data.
3. Curated Zone: Store data ready for analysis.
4. Metadata Layer: Store information about data (like source, format).
5. Access Layer: Manage how users and tools access data.

Use folders or tables to separate zones in Hadoop HDFS or Hive.

Zones help keep data organized and easy to manage.
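As a rough illustration (plain Python, not a Hadoop API), a small helper can tell which zone a path belongs to, so tools know how processed the data is. The folder names follow the layout used in this article; the function itself is hypothetical.

```python
# Zone names matching the folder layout in this article.
ZONES = ("raw", "processed", "curated")

def zone_of(path):
    """Return the zone a /data_lake/... path belongs to, or None."""
    parts = path.strip("/").split("/")
    if len(parts) >= 2 and parts[0] == "data_lake" and parts[1] in ZONES:
        return parts[1]
    return None

print(zone_of("/data_lake/raw/events/part-0001.json"))        # raw
print(zone_of("/data_lake/curated/sales/part-0001.parquet"))  # curated
```

A lookup like this keeps zone logic in one place instead of scattering path checks across jobs.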

Metadata helps users understand what data is available and how to use it.
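For example, a minimal metadata record might look like the sketch below. The field names are illustrative, not a Hadoop or Hive metastore schema.

```python
from datetime import datetime, timezone

# Hypothetical metadata record for one dataset in the lake.
def describe_dataset(name, source, fmt, location):
    return {
        "name": name,
        "source": source,          # where the data came from
        "format": fmt,             # file format, e.g. json or parquet
        "location": location,      # zone path in the data lake
        "updated_at": datetime.now(timezone.utc).isoformat(),
    }

meta = describe_dataset("raw_events", "mobile_app", "json",
                        "/data_lake/raw/events/")
print(meta["format"])  # json
```

In practice this information lives in a catalog such as the Hive metastore, but the idea is the same: record source, format, location, and update time for every dataset.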

Examples
Use separate folders in HDFS to organize data by its processing stage.
Hadoop
/data_lake/raw/  # Raw Zone folder for original files
/data_lake/processed/  # Processed Zone for cleaned data
/data_lake/curated/  # Curated Zone for analysis-ready data
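On a real cluster you would create these folders with `hdfs dfs -mkdir -p`. As a local sketch, the same layout can be mirrored on an ordinary filesystem with plain Python:

```python
from pathlib import Path
import tempfile

# Sketch: mirror the zone layout in a temporary local directory.
# On HDFS, the equivalent is `hdfs dfs -mkdir -p /data_lake/raw` etc.
base = Path(tempfile.mkdtemp()) / "data_lake"
for zone in ("raw", "processed", "curated"):
    (base / zone).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in base.iterdir()))  # ['curated', 'processed', 'raw']
```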
Create Hive tables pointing to raw data files for easy querying.
Hadoop
CREATE EXTERNAL TABLE raw_events (
  event_id STRING,
  event_time STRING,
  event_data STRING
)
STORED AS PARQUET
LOCATION '/data_lake/raw/events/';
Curated tables hold cleaned and structured data ready for reports.
Hadoop
CREATE EXTERNAL TABLE curated_sales (
  sale_id STRING,
  sale_date DATE,
  amount DOUBLE
)
STORED AS PARQUET
LOCATION '/data_lake/curated/sales/';
Sample Program

This code reads raw JSON event data from the raw zone, cleans it by selecting columns and filtering, then saves it to the processed zone. Finally, it reads and shows the processed data.

Hadoop
from pyspark.sql import SparkSession

# Start Spark session
spark = SparkSession.builder.appName('DataLakeDesign').getOrCreate()

# Load raw data from HDFS
raw_df = spark.read.json('/data_lake/raw/events/')

# Show raw data
print('Raw Data:')
raw_df.show(3)

# Clean data: select needed columns and filter
processed_df = raw_df.select('event_id', 'event_time').filter(raw_df.event_id.isNotNull())

# Save processed data back to HDFS
processed_df.write.mode('overwrite').parquet('/data_lake/processed/events/')

# Reload processed data to verify the write
processed_check_df = spark.read.parquet('/data_lake/processed/events/')

# Show processed data
print('Processed Data:')
processed_check_df.show(3)

spark.stop()
Important Notes

Keep raw data unchanged to allow reprocessing if needed.
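One simple way to enforce this is a guard that rejects writes into the raw zone. This is a hypothetical sketch, not a Hadoop permission mechanism; on a real cluster you would use HDFS permissions or Ranger policies instead.

```python
# Hypothetical guard: refuse writes into the raw zone so the
# original data stays untouched and can be reprocessed later.
def check_write_allowed(path):
    if path.strip("/").startswith("data_lake/raw"):
        raise PermissionError(f"raw zone is read-only: {path}")
    return True

print(check_write_allowed("/data_lake/processed/events/"))  # True
try:
    check_write_allowed("/data_lake/raw/events/")
except PermissionError as e:
    print("blocked:", e)
```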

Use metadata to track data source, format, and update times.

Design zones to match your team's workflow and tools.

Summary

Data lake design patterns organize data into zones: raw, processed, and curated.

This helps keep data clean, easy to find, and ready for analysis.

Use folders and tables in Hadoop to separate and manage these zones.