Hadoop data · ~10 mins

Why data lake architecture centralizes data in Hadoop - Visual Breakdown

Concept Flow - Why data lake architecture centralizes data
Collect data from many sources
Store all data in one place: Data Lake
Data is raw, unstructured or structured
Users access centralized data for analysis
Data governance and security applied centrally
Data lake supports many use cases and teams
Data lake architecture collects all data from different sources and stores it centrally in raw form, allowing many users and teams to access and analyze the same data securely.
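A minimal sketch of the idea above: several teams query the same centralized store rather than keeping separate copies. The record fields and team names here are illustrative, not part of any real Hadoop API.

```python
# Sketch: multiple teams reading from one centralized store (illustrative records).
data_lake = [
    {'source': 'app',    'payload': 'login events'},
    {'source': 'web',    'payload': 'clickstream'},
    {'source': 'sensor', 'payload': 'temperature readings'},
]

# The analytics team pulls web data; the IoT team pulls sensor data.
# Same store, no duplicated datasets.
web_data = [r for r in data_lake if r['source'] == 'web']
sensor_data = [r for r in data_lake if r['source'] == 'sensor']

print(len(web_data), len(sensor_data))  # 1 1
```

Because both teams read from the same list, any new source added to `data_lake` is immediately visible to everyone.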
Execution Sample
Python
# Placeholder: in a real pipeline this would read from the source system.
def collect_data(source):
    return f"{source} data chunk"

sources = ['app', 'web', 'sensor']
data_lake = []  # central store representing the data lake
for source in sources:
    data = collect_data(source)
    data_lake.append(data)
print(len(data_lake))  # 3
This code collects data from multiple sources and stores it all in one central list representing a data lake.
Execution Table
| Step | Source | Data Collected     | Data Lake Size | Action                                        |
|------|--------|--------------------|----------------|-----------------------------------------------|
| 1    | app    | app data chunk     | 1              | Collected app data and added to data lake     |
| 2    | web    | web data chunk     | 2              | Collected web data and added to data lake     |
| 3    | sensor | sensor data chunk  | 3              | Collected sensor data and added to data lake  |
| 4    | -      | -                  | 3              | All sources collected, data lake centralized  |
💡 All data sources processed and stored centrally in the data lake
Variable Tracker
| Variable  | Start | After 1             | After 2                                | After 3                                                    | Final                                                      |
|-----------|-------|---------------------|----------------------------------------|------------------------------------------------------------|------------------------------------------------------------|
| data_lake | []    | ['app data chunk']  | ['app data chunk', 'web data chunk']   | ['app data chunk', 'web data chunk', 'sensor data chunk']  | ['app data chunk', 'web data chunk', 'sensor data chunk']  |
Key Moments - 2 Insights
Why do we add data from all sources into one data lake instead of separate places?
Because centralizing data in one place makes it easier for different teams to access and analyze all data together, as shown in steps 1 to 4 of the execution table.
Is the data in the data lake processed or raw when stored?
The data is stored in raw form, meaning it is not processed yet, which allows flexibility for different analysis needs, as implied by the 'Data Collected' column of the execution table.
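The "raw until read" idea can be sketched as schema-on-read: records are stored as unparsed strings and interpreted only when a consumer reads them. The JSON fields below are illustrative assumptions, not from the original example.

```python
import json

# Sketch of schema-on-read: records are stored raw (JSON strings here)
# and parsed only at read time.
raw_records = [
    json.dumps({'user': 'a1', 'event': 'click', 'ts': 1}),
    json.dumps({'user': 'b2', 'event': 'view', 'ts': 2}),
]

# One team extracts events; another team could pull timestamps
# from the very same raw records, with no reprocessing of the store.
events = [json.loads(r)['event'] for r in raw_records]
print(events)  # ['click', 'view']
```

Keeping the stored form raw means no single team's schema choices constrain what other teams can later extract.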
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, how many data chunks are in the data lake after step 2?
A. 2
B. 1
C. 3
D. 0
💡 Hint
Check the 'Data Lake Size' column at step 2 in the execution table.
At which step does the data lake first contain data from the 'sensor' source?
A. Step 1
B. Step 2
C. Step 3
D. Step 4
💡 Hint
Look at the 'Source' column in the execution table to find when 'sensor' data is added.
If we add a new source 'mobile', how would the data lake size change after adding it?
A. It stays the same
B. It increases by 1
C. It doubles
D. It decreases
💡 Hint
Refer to how the data_lake size increases by 1 for each new source in the variable tracker.
Concept Snapshot
Data lake architecture collects all data from multiple sources and stores it centrally in raw form.
This centralization allows easy access for many users and teams.
Data governance and security are managed in one place.
It supports diverse analysis and use cases efficiently.
Full Transcript
Data lake architecture centralizes data by collecting it from many sources like apps, web, and sensors. All this data is stored in one place called a data lake. The data is kept raw, meaning it is not processed yet, so different teams can use it for their own analysis. Centralizing data makes it easier to manage security and governance. The example code shows collecting data from three sources and adding each to a list representing the data lake. The execution table tracks each step, showing how the data lake grows as data is added. Key moments clarify why centralization helps and that data is stored raw. The quiz tests understanding of data lake size changes and source additions. This approach helps organizations use their data efficiently and securely.
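The transcript's point about managing security and governance centrally can be sketched as a single access gate in front of the store. The team names and permission table below are hypothetical, used only to illustrate "rules live in one place".

```python
# Sketch: central governance via one access gate (hypothetical teams and rules).
PERMISSIONS = {'analytics': {'app', 'web'}, 'iot': {'sensor'}}

data_lake = [
    ('app', 'app data chunk'),
    ('web', 'web data chunk'),
    ('sensor', 'sensor data chunk'),
]

def read_lake(team):
    """All reads go through one function, so access rules live in one place too."""
    allowed = PERMISSIONS.get(team, set())
    return [payload for source, payload in data_lake if source in allowed]

print(read_lake('iot'))  # ['sensor data chunk']
```

Because every consumer reads through `read_lake`, changing a governance rule means editing one permission table rather than hunting through each team's pipeline.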