Hadoop · Data · ~15 mins

Why data lake architecture centralizes data in Hadoop - See It in Action

Why Data Lake Architecture Centralizes Data
📖 Scenario: Imagine a large company that collects data from many sources like sales, customer feedback, and website logs. They want to keep all this data in one place so everyone can use it easily.
🎯 Goal: You will create a simple example to show how data from different sources can be stored together in a data lake architecture using Python dictionaries. This will help you understand why data lakes centralize data.
📋 What You'll Learn
Create a dictionary called data_sources with three keys: 'sales', 'feedback', and 'logs'.
Each key should have a list of sample data strings as its value.
Create a variable called central_data_lake and set it to an empty list.
Use a for loop with variables source and records to iterate over data_sources.items().
Inside the loop, extend central_data_lake with the records.
Print the central_data_lake to show all data combined.
💡 Why This Matters
🌍 Real World
Companies collect data from many places like sales, customer feedback, and logs. A data lake stores all this data in one place so teams can analyze it easily.
💼 Career
Understanding data lake architecture helps you work with big data platforms like Hadoop and prepare data for analysis or machine learning.
1
Create the data sources dictionary
Create a dictionary called data_sources with these exact keys and values: 'sales' with ["sale1", "sale2"], 'feedback' with ["good", "bad"], and 'logs' with ["log1", "log2"].
Need a hint?

Use curly braces to create a dictionary. Each key should have a list of strings as its value.
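A sketch of what this step produces, using exactly the keys and sample values the instructions specify:

```python
# Step 1: three data sources, each keyed by name with a list of sample records
data_sources = {
    'sales': ["sale1", "sale2"],      # sample sales records
    'feedback': ["good", "bad"],      # sample customer feedback
    'logs': ["log1", "log2"],         # sample website log entries
}
```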

2
Create the central data lake list
Create a variable called central_data_lake and set it to an empty list [].
Need a hint?

Use square brackets to create an empty list.
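This step is a single line; the list starts empty and is filled in the next step:

```python
# Step 2: the central data lake, empty for now
central_data_lake = []
```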

3
Combine all data into the central data lake
Use a for loop with variables source and records to iterate over data_sources.items(). Inside the loop, extend central_data_lake with the records.
Need a hint?

Use for source, records in data_sources.items(): and inside the loop use central_data_lake.extend(records).
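Combining this step with the two before it, the loop could look like this. `data_sources.items()` yields `(key, value)` pairs, and `extend()` appends each record from the list individually rather than nesting the list itself:

```python
data_sources = {
    'sales': ["sale1", "sale2"],
    'feedback': ["good", "bad"],
    'logs': ["log1", "log2"],
}
central_data_lake = []

# Step 3: pour every source's records into the central data lake
for source, records in data_sources.items():
    central_data_lake.extend(records)
```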

4
Print the combined data
Write print(central_data_lake) to display all data combined in the central data lake.
Need a hint?

Use the print function to show the combined list.
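Putting all four steps together, the finished script could look like this (the names and sample values are the ones the steps prescribe):

```python
# Step 1: data from three separate sources
data_sources = {
    'sales': ["sale1", "sale2"],
    'feedback': ["good", "bad"],
    'logs': ["log1", "log2"],
}

# Step 2: the (initially empty) central data lake
central_data_lake = []

# Step 3: combine every source's records into one list
for source, records in data_sources.items():
    central_data_lake.extend(records)

# Step 4: show the combined data
print(central_data_lake)
# → ['sale1', 'sale2', 'good', 'bad', 'log1', 'log2']
```

Because Python dictionaries preserve insertion order, the records appear in the order the sources were defined: sales first, then feedback, then logs.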