In a typical data lake design, data is organized into different zones such as raw, cleansed, and curated. What is the primary purpose of the raw zone?
Think about where data first lands in the data lake before any processing.
The raw zone holds data exactly as it arrives from source systems. It is unprocessed and serves as the foundation for further cleansing and transformation; keeping this untouched copy also makes it possible to audit the original data and reprocess it if downstream logic changes.
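As a rough sketch of what "landing data unprocessed" means in practice (the directory layout and function name below are illustrative, not a standard API): raw-zone ingestion copies source files byte-for-byte, typically organized by source system and ingest date, with no parsing or cleansing applied.

```python
import shutil
from pathlib import Path

def land_in_raw_zone(source_file: str, lake_root: str,
                     source_system: str, ingest_date: str) -> Path:
    """Copy a source file into the raw zone byte-for-byte.

    Illustrative layout: <lake_root>/raw/<source_system>/<ingest_date>/<filename>
    No transformation happens here; cleansing belongs to later zones.
    """
    dest_dir = Path(lake_root) / "raw" / source_system / ingest_date
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(source_file).name
    shutil.copy2(source_file, dest)  # verbatim copy preserves the original record
    return dest
```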
Given a Hadoop data lake storing logs partitioned by date, what is the expected effect on query performance when filtering by a specific date?
Consider how partition pruning works in Hadoop query engines.
Because the table is partitioned by date, the query engine can apply partition pruning: it scans only the partition matching the filter and skips the rest, reducing the data read and improving query performance.
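A toy in-memory model makes the effect of pruning concrete (the data and helpers below are illustrative, not Hive internals): when the filter is on the partition key, only that partition's rows are touched, while a naive scan reads every partition to produce the same answer.

```python
# Illustrative model of a date-partitioned log table:
# each partition key maps to the rows stored under that partition.
partitions = {
    "2023-05-01": [{"event": "login"}, {"event": "click"}],
    "2023-05-02": [{"event": "click"}],
    "2023-05-03": [{"event": "logout"}] * 3,
}

def query_with_pruning(parts: dict, date_filter: str):
    """Filter on the partition key: only the matching partition is read."""
    rows = parts.get(date_filter, [])
    return rows, len(rows)  # (result rows, rows scanned)

def query_full_scan(parts: dict, date_filter: str):
    """Naive scan of every partition; same result, far more rows touched."""
    scanned, rows = 0, []
    for date, part_rows in parts.items():
        scanned += len(part_rows)
        if date == date_filter:
            rows.extend(part_rows)
    return rows, scanned
```

Both functions return the same rows for a given date, but the pruned query scans only that partition's rows while the full scan touches all six.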
A data lake ingestion pipeline writes data daily to the curated zone. After some time, duplicate records appear in the curated data. Which of the following is the most likely cause?
Think about how data is merged or appended during ingestion.
If the ingestion job appends data daily without deduplication or overwrite logic, any re-delivered or overlapping records accumulate as duplicates in the curated zone.
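A minimal sketch of the fix, assuming records carry a stable business key (the key name `record_id` is hypothetical): deduplicate the incoming batch against the curated data before writing, rather than blindly appending.

```python
def merge_without_duplicates(curated: list, incoming: list,
                             key: str = "record_id") -> list:
    """Append only records whose key is not already in the curated zone.

    With a plain append, re-delivered records would pile up as duplicates;
    checking the key set keeps the curated data unique.
    """
    seen = {row[key] for row in curated}
    merged = list(curated)
    for row in incoming:
        if row[key] not in seen:
            merged.append(row)
            seen.add(row[key])
    return merged
```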
You need to design a data lake solution for real-time streaming data ingestion and analytics. Which design pattern is most suitable?
Consider architectures that handle both real-time and batch data.
Lambda architecture supports both real-time streaming (speed layer) and batch processing (batch layer), making it ideal for streaming data analytics.
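A highly simplified sketch of the Lambda idea (the class and method names are illustrative): every event lands in an immutable master dataset, the batch layer periodically recomputes a view over all history, the speed layer counts only events that arrived since the last batch run, and serving merges the two views.

```python
class LambdaCounter:
    """Toy Lambda-style event counter."""

    def __init__(self):
        self.history = []      # master dataset (immutable, append-only)
        self.batch_view = 0    # precomputed by the batch layer
        self.speed_view = 0    # incremental count from the speed layer

    def ingest(self, event):
        self.history.append(event)  # everything lands in the master dataset
        self.speed_view += 1        # speed layer updates in real time

    def run_batch(self):
        self.batch_view = len(self.history)  # full recompute over all history
        self.speed_view = 0                  # speed layer resets once batch catches up

    def serve_count(self):
        return self.batch_view + self.speed_view  # merged, up-to-date answer
```

The design choice Lambda makes is to trade some duplication of logic (two layers computing the same metric) for a view that is both complete and fresh.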
Consider a Hive table partitioned by year and month. What is the output of the following query?
SELECT year, month, COUNT(*) AS cnt FROM sales WHERE year = 2023 AND month = 5 GROUP BY year, month;
Assume the table has 1000 records for May 2023 and 5000 records for other months.
Focus on the filter conditions and grouping in the query.
The query filters data for year 2023 and month 5, then counts records. Since there are 1000 records for May 2023, the count is 1000.
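The same filter-then-aggregate logic can be checked with a quick sketch in plain Python (the row counts mirror the assumption stated in the question):

```python
from collections import Counter

# Illustrative dataset: 1000 rows for May 2023, 5000 rows for other months.
sales = [{"year": 2023, "month": 5}] * 1000 + [{"year": 2023, "month": 6}] * 5000

# Equivalent of: SELECT year, month, COUNT(*) AS cnt FROM sales
#                WHERE year = 2023 AND month = 5 GROUP BY year, month
counts = Counter((r["year"], r["month"])
                 for r in sales
                 if r["year"] == 2023 and r["month"] == 5)
# Only the May 2023 group survives the filter, with a count of 1000.
```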