In a typical data lake design, data is organized into different zones such as raw, cleansed, and curated. What is the primary purpose of the raw zone?
Think about where data first lands in the data lake before any processing.
The raw zone holds data exactly as it arrives from source systems. It is unprocessed and serves as the foundation for further cleansing and transformation; keeping this untouched copy also makes it possible to audit the original data and reprocess it if downstream logic changes.
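As a rough sketch of what "landing data unprocessed" means in practice (the directory layout and function name below are illustrative, not a standard API): raw-zone ingestion copies source files byte-for-byte, typically organized by source system and ingest date, with no parsing or cleansing applied.

```python
import shutil
from pathlib import Path

def land_in_raw_zone(source_file: str, lake_root: str,
                     source_system: str, ingest_date: str) -> Path:
    """Copy a source file into the raw zone byte-for-byte.

    Illustrative layout: <lake_root>/raw/<source_system>/<ingest_date>/<filename>
    No transformation happens here; cleansing belongs to later zones.
    """
    dest_dir = Path(lake_root) / "raw" / source_system / ingest_date
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / Path(source_file).name
    shutil.copy2(source_file, dest)  # verbatim copy preserves the original record
    return dest
```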
Given a Hadoop data lake storing logs partitioned by date, what is the expected effect on query performance when filtering by a specific date?
Consider how partition pruning works in Hadoop query engines.
Because the table is partitioned by date, the query engine can apply partition pruning: it scans only the partition matching the filter and skips the rest, reducing the data read and improving query performance.
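A toy in-memory model makes the effect of pruning concrete (the data and helpers below are illustrative, not Hive internals): when the filter is on the partition key, only that partition's rows are touched, while a naive scan reads every partition to produce the same answer.

```python
# Illustrative model of a date-partitioned log table:
# each partition key maps to the rows stored under that partition.
partitions = {
    "2023-05-01": [{"event": "login"}, {"event": "click"}],
    "2023-05-02": [{"event": "click"}],
    "2023-05-03": [{"event": "logout"}] * 3,
}

def query_with_pruning(parts: dict, date_filter: str):
    """Filter on the partition key: only the matching partition is read."""
    rows = parts.get(date_filter, [])
    return rows, len(rows)  # (result rows, rows scanned)

def query_full_scan(parts: dict, date_filter: str):
    """Naive scan of every partition; same result, far more rows touched."""
    scanned, rows = 0, []
    for date, part_rows in parts.items():
        scanned += len(part_rows)
        if date == date_filter:
            rows.extend(part_rows)
    return rows, scanned
```

Both functions return the same rows for a given date, but the pruned query scans only that partition's rows while the full scan touches all six.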
A data lake ingestion pipeline writes data daily to the curated zone. After some time, duplicate records appear in the curated data. Which of the following is the most likely cause?
Think about how data is merged or appended during ingestion.
If the ingestion job appends data daily without deduplication or overwrite logic, any re-delivered or overlapping records accumulate as duplicates in the curated zone.
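A minimal sketch of the fix, assuming records carry a stable business key (the key name `record_id` is hypothetical): deduplicate the incoming batch against the curated data before writing, rather than blindly appending.

```python
def merge_without_duplicates(curated: list, incoming: list,
                             key: str = "record_id") -> list:
    """Append only records whose key is not already in the curated zone.

    With a plain append, re-delivered records would pile up as duplicates;
    checking the key set keeps the curated data unique.
    """
    seen = {row[key] for row in curated}
    merged = list(curated)
    for row in incoming:
        if row[key] not in seen:
            merged.append(row)
            seen.add(row[key])
    return merged
```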
You need to design a data lake solution for real-time streaming data ingestion and analytics. Which design pattern is most suitable?
Consider architectures that handle both real-time and batch data.
Lambda architecture supports both real-time streaming (speed layer) and batch processing (batch layer), making it ideal for streaming data analytics.
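A highly simplified sketch of the Lambda idea (the class and method names are illustrative): every event lands in an immutable master dataset, the batch layer periodically recomputes a view over all history, the speed layer counts only events that arrived since the last batch run, and serving merges the two views.

```python
class LambdaCounter:
    """Toy Lambda-style event counter."""

    def __init__(self):
        self.history = []      # master dataset (immutable, append-only)
        self.batch_view = 0    # precomputed by the batch layer
        self.speed_view = 0    # incremental count from the speed layer

    def ingest(self, event):
        self.history.append(event)  # everything lands in the master dataset
        self.speed_view += 1        # speed layer updates in real time

    def run_batch(self):
        self.batch_view = len(self.history)  # full recompute over all history
        self.speed_view = 0                  # speed layer resets once batch catches up

    def serve_count(self):
        return self.batch_view + self.speed_view  # merged, up-to-date answer
```

The design choice Lambda makes is to trade some duplication of logic (two layers computing the same metric) for a view that is both complete and fresh.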
Consider a Hive table partitioned by year and month. What is the output of the following query?
SELECT year, month, COUNT(*) AS cnt FROM sales WHERE year = 2023 AND month = 5 GROUP BY year, month;
Assume the table has 1000 records for May 2023 and 5000 records for other months.
Focus on the filter conditions and grouping in the query.
The query filters data for year 2023 and month 5, then counts records. Since there are 1000 records for May 2023, the count is 1000.
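The same filter-then-aggregate logic can be checked with a quick sketch in plain Python (the row counts mirror the assumption stated in the question):

```python
from collections import Counter

# Illustrative dataset: 1000 rows for May 2023, 5000 rows for other months.
sales = [{"year": 2023, "month": 5}] * 1000 + [{"year": 2023, "month": 6}] * 5000

# Equivalent of: SELECT year, month, COUNT(*) AS cnt FROM sales
#                WHERE year = 2023 AND month = 5 GROUP BY year, month
counts = Counter((r["year"], r["month"])
                 for r in sales
                 if r["year"] == 2023 and r["month"] == 5)
# Only the May 2023 group survives the filter, with a count of 1000.
```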