Hadoop · Data · ~30 mins

Data lake design patterns in Hadoop - Mini Project: Build & Apply

Data Lake Design Patterns with Hadoop

📖 Scenario: You work at a company that collects lots of data from different sources like sales, customer feedback, and website logs. You want to organize this data in a Hadoop data lake so it is easy to find and use later.

🎯 Goal: Model a simple data lake layout for Hadoop that follows a common design pattern: separate folders (zones) for raw, cleaned, and aggregated data, each holding files in a single configured format.

📋 What You'll Learn

Create a dictionary called data_lake with keys for raw, cleaned, and aggregated data folders

Add a configuration variable called file_format set to parquet

Use a dictionary comprehension to create file paths for each data type using the file_format

Print the final data_lake_paths dictionary showing the full paths

💡 Why This Matters

🌍 Real World

Data lakes store large amounts of raw and processed data in Hadoop systems. Organizing data with clear folder and file naming helps teams find and use data efficiently.

💼 Career

Understanding data lake design patterns is important for data engineers and analysts working with big data platforms like Hadoop.

Create the initial data lake structure

Create a dictionary called data_lake with these exact keys and values: 'raw': '/data/raw', 'cleaned': '/data/cleaned', and 'aggregated': '/data/aggregated'.

Hadoop

# Create the data_lake dictionary with raw, cleaned, and aggregated paths
# Your code here

Need a hint?

Use curly braces {} to create a dictionary with the specified keys and values.
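As a minimal sketch of this step, the dictionary below maps each zone name to its base folder, using the keys and values given in the instructions. The variable raw_dir is only an illustration of a lookup and is not part of the exercise.

```python
# Map each data lake zone to its base directory (values from the step above).
data_lake = {
    'raw': '/data/raw',
    'cleaned': '/data/cleaned',
    'aggregated': '/data/aggregated',
}

# Illustrative lookup (raw_dir is a hypothetical name, not required by the exercise).
raw_dir = data_lake['raw']
print(raw_dir)  # prints /data/raw
```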

Add a file format configuration

Create a variable called file_format and set it to the string 'parquet'.

Hadoop

data_lake = {'raw': '/data/raw', 'cleaned': '/data/cleaned', 'aggregated': '/data/aggregated'}
# Create the file_format variable and set it to 'parquet'
# Your code here

Need a hint?

Use a simple assignment statement to create the variable file_format.

Create file paths using dictionary comprehension

Use a dictionary comprehension to create a new dictionary called data_lake_paths. For each key and path in data_lake.items(), build a new path by appending a slash, the file name data, a dot, and the file_format extension. For example, the raw path should be /data/raw/data.parquet.

Hadoop

data_lake = {'raw': '/data/raw', 'cleaned': '/data/cleaned', 'aggregated': '/data/aggregated'}
file_format = 'parquet'
# Create data_lake_paths dictionary using dictionary comprehension
# Your code here

Need a hint?

Use {key: value for key, value in dictionary.items()} and f-strings to build the new paths.
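The pattern in the hint can be sketched on unrelated data first. The names sources, extension, and source_files below are hypothetical and chosen only for illustration; the exercise itself uses data_lake and file_format.

```python
# Hypothetical example data, not part of the exercise.
sources = {'logs': '/tmp/logs', 'events': '/tmp/events'}
extension = 'csv'

# Keep every key and extend each folder path with a file name via an f-string.
source_files = {
    name: f"{folder}/part.{extension}"
    for name, folder in sources.items()
}

print(source_files)
# prints {'logs': '/tmp/logs/part.csv', 'events': '/tmp/events/part.csv'}
```

The comprehension leaves the original dictionary unchanged and returns a new one, which is why the exercise stores the result under a separate name.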

Print the final data lake paths

Write a print statement to display the data_lake_paths dictionary.

Hadoop

data_lake = {'raw': '/data/raw', 'cleaned': '/data/cleaned', 'aggregated': '/data/aggregated'}
file_format = 'parquet'
data_lake_paths = {key: f"{path}/data.{file_format}" for key, path in data_lake.items()}
# Print the data_lake_paths dictionary
# Your code here

Need a hint?

Use print(data_lake_paths) to show the dictionary.
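Putting the four steps together, a complete version of the script might look like the sketch below. The last part goes beyond the exercise: it assumes the hdfs command-line client is available and only builds the hdfs dfs -mkdir -p command strings for each zone folder without running them. The name mkdir_commands is hypothetical.

```python
# Steps 1 and 2: zone folders and the configured file format.
data_lake = {'raw': '/data/raw', 'cleaned': '/data/cleaned', 'aggregated': '/data/aggregated'}
file_format = 'parquet'

# Step 3: full file path for each zone.
data_lake_paths = {key: f"{path}/data.{file_format}" for key, path in data_lake.items()}

# Step 4: show the result.
print(data_lake_paths)
# prints {'raw': '/data/raw/data.parquet', 'cleaned': '/data/cleaned/data.parquet', 'aggregated': '/data/aggregated/data.parquet'}

# Beyond the exercise (assumption: an hdfs CLI is installed on the cluster).
# The commands are only printed here, not executed.
mkdir_commands = [f"hdfs dfs -mkdir -p {path}" for path in data_lake.values()]
for command in mkdir_commands:
    print(command)
```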