
Data lake design patterns in Hadoop - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Understanding the Zone Architecture in Data Lakes

In a typical data lake design, data is organized into different zones such as raw, cleansed, and curated. What is the primary purpose of the raw zone?

A. Store metadata and data catalog information.
B. Store only cleaned and validated data ready for business use.
C. Store aggregated and summarized data for reporting.
D. Store data exactly as ingested without any transformation or filtering.
💡 Hint

Think about where data first lands in the data lake before any processing.
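
To make the zone idea concrete, here is a minimal sketch of data landing in one zone and being derived into the next. It uses a temporary local directory in place of HDFS, and the function names `land_in_raw` and `cleanse`, the file name `orders.csv`, and the cleanup rule are all hypothetical choices for illustration, not part of any Hadoop API.

```python
import tempfile
from pathlib import Path


def land_in_raw(lake_root: Path, source_name: str, payload: bytes) -> Path:
    """Write the payload byte-for-byte into the raw zone (hypothetical helper)."""
    target = lake_root / "raw" / source_name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, filtering, or cleanup here
    return target


def cleanse(lake_root: Path, source_name: str) -> Path:
    """Derive a cleansed copy: trim whitespace and drop blank lines."""
    raw_text = (lake_root / "raw" / source_name).read_text()
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    target = lake_root / "cleansed" / source_name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(lines) + "\n")
    return target


# A temporary directory stands in for the lake root (assumption for the demo).
lake = Path(tempfile.mkdtemp())
payload = b"  id,amount \n\n1,10\n 2,20  \n"

raw_path = land_in_raw(lake, "orders.csv", payload)
clean_path = cleanse(lake, "orders.csv")

print(raw_path.read_bytes() == payload)  # the landed copy is unchanged
print(clean_path.read_text())
```

Keeping the landed copy untouched means any downstream zone can be rebuilt from it if a cleanup rule later turns out to be wrong.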

Data Output
intermediate
Data Partitioning Effect on Query Performance

Given a Hadoop data lake storing logs partitioned by date, what is the expected effect on query performance when filtering by a specific date?

A. Query performance is unaffected by partitioning.
B. Query performance worsens because all partitions must be scanned.
C. Query performance improves because only the relevant partition is scanned.
D. Query performance depends only on file size, not partitioning.
💡 Hint

Consider how partition pruning works in Hadoop query engines.
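
The idea of partition pruning can be sketched with a toy model in plain Python. This is not how a Hadoop query engine is implemented; the dictionary of partitions and the two scan functions are hypothetical stand-ins that only count how many rows each strategy has to read.

```python
from datetime import date

# Toy "table": partition key (log date) -> rows stored under that partition.
partitions = {
    date(2024, 1, 1): ["a", "b"],
    date(2024, 1, 2): ["c"],
    date(2024, 1, 3): ["d", "e", "f"],
}


def scan_with_pruning(parts, wanted):
    """Read only the partition whose key matches the filter."""
    rows = list(parts.get(wanted, []))
    return rows, len(rows)  # (matching rows, rows read)


def scan_without_pruning(parts, wanted):
    """Read every partition and filter row by row."""
    rows_read = 0
    result = []
    for key, rows in parts.items():
        for row in rows:
            rows_read += 1
            if key == wanted:
                result.append(row)
    return result, rows_read


pruned_rows, pruned_read = scan_with_pruning(partitions, date(2024, 1, 2))
full_rows, full_read = scan_without_pruning(partitions, date(2024, 1, 2))

print(pruned_rows, pruned_read)
print(full_rows, full_read)
```

Both strategies return the same rows; the difference is only in how much data is touched to get there.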

🔧 Debug
advanced
Identifying the Cause of Data Duplication in a Data Lake

A data lake ingestion pipeline writes data daily to the curated zone. After some time, duplicate records appear in the curated data. Which of the following is the most likely cause?

A. The ingestion job appends data without checking for existing records.
B. The raw zone data is corrupted, causing duplicates.
C. Partitioning by date was not applied in the raw zone.
D. The data lake storage is running out of space.
💡 Hint

Think about how data is merged or appended during ingestion.
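
The difference between appending and merging can be sketched as follows. The functions `append_load` and `upsert_load` and the `id` key are hypothetical; the sketch assumes each record carries a unique identifier and simulates the same daily batch being loaded twice, for example after a job rerun.

```python
def append_load(target: list, batch: list) -> None:
    """Blind append: every record in the batch is added again."""
    target.extend(batch)


def upsert_load(target: dict, batch: list, key: str = "id") -> None:
    """Keyed merge: a record with an existing key replaces the old one."""
    for record in batch:
        target[record[key]] = record


batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]

appended = []
append_load(appended, batch)
append_load(appended, batch)  # simulated rerun of the same batch

upserted = {}
upsert_load(upserted, batch)
upsert_load(upserted, batch)  # simulated rerun of the same batch

print(len(appended), len(upserted))
```

A load that produces the same result no matter how many times it runs is usually described as idempotent, which is the property the keyed merge has here and the blind append lacks.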

🚀 Application
advanced
Choosing the Best Data Lake Design Pattern for Streaming Data

You need to design a data lake solution for real-time streaming data ingestion and analytics. Which design pattern is most suitable?

A. Using a traditional data warehouse instead of a data lake.
B. Lambda architecture combining batch and speed layers.
C. Storing all data in the raw zone without transformation.
D. Only batch processing with nightly ingestion jobs.
💡 Hint

Consider architectures that handle both real-time and batch data.
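
For readers unfamiliar with how batch and real-time results can be combined, here is a minimal in-memory sketch of the three roles usually described in a Lambda architecture: a batch view, a speed view, and a query that merges them. The page-view events and all function names are hypothetical, and a real deployment would use distributed storage and stream processing rather than Python counters.

```python
from collections import Counter


def build_batch_view(master_events):
    """Batch layer: recompute counts from the full historical dataset."""
    return Counter(event["page"] for event in master_events)


def update_speed_view(speed_view, event):
    """Speed layer: incrementally count events not yet covered by a batch run."""
    speed_view[event["page"]] += 1


def query(batch_view, speed_view, page):
    """Serving layer: combine the batch and speed views at query time."""
    return batch_view[page] + speed_view[page]


# Events already processed by the last batch run (toy data).
historical = [{"page": "home"}, {"page": "home"}, {"page": "cart"}]
batch_view = build_batch_view(historical)

# Events that arrived after the last batch run (toy data).
speed_view = Counter()
for event in [{"page": "home"}, {"page": "checkout"}]:
    update_speed_view(speed_view, event)

print(query(batch_view, speed_view, "home"))
print(query(batch_view, speed_view, "checkout"))
```

When the next batch run absorbs the recent events, the speed view for that period can be discarded, so its results only need to be approximately right for a short window.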

Predict Output
expert
Output of Hive Query on Partitioned Data

Consider a Hive table partitioned by year and month. What is the output of the following query?

SELECT year, month, COUNT(*) AS cnt FROM sales WHERE year = 2023 AND month = 5 GROUP BY year, month;

Assume the table has 1000 records for May 2023 and 5000 records for other months.

A. [{'year': 2023, 'month': 5, 'cnt': 1000}]
B. [{'year': 2023, 'month': 5, 'cnt': 6000}]
C. [{'year': 2023, 'month': 5, 'cnt': 5000}]
D. Empty result set
💡 Hint

Focus on the filter conditions and grouping in the query.
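
The order of operations in a filtered `GROUP BY` can be traced with a small Python sketch. It deliberately uses toy row counts rather than the figures from the problem, and the `sales` list and `count_by_year_month` function are hypothetical stand-ins for the Hive table and query.

```python
from collections import Counter

# Toy data: 3 rows for May 2023, 2 for April 2023, 4 for May 2022.
sales = (
    [{"year": 2023, "month": 5}] * 3
    + [{"year": 2023, "month": 4}] * 2
    + [{"year": 2022, "month": 5}] * 4
)


def count_by_year_month(rows, year, month):
    """Mimic WHERE year = ... AND month = ... followed by GROUP BY year, month."""
    # Step 1: the WHERE clause discards non-matching rows before grouping.
    filtered = [r for r in rows if r["year"] == year and r["month"] == month]
    # Step 2: GROUP BY counts what is left, one output row per distinct key.
    counts = Counter((r["year"], r["month"]) for r in filtered)
    return [{"year": y, "month": m, "cnt": c} for (y, m), c in counts.items()]


result = count_by_year_month(sales, 2023, 5)
no_match = count_by_year_month(sales, 2021, 1)

print(result)
print(no_match)
```

Because the filter fixes both grouping columns to a single value, at most one group can survive, and rows outside the filter never contribute to its count.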