Challenge - 5 Problems

Partitioning Mastery

🧠 Conceptual

intermediate

Why use partitioning in Hadoop queries?

Which of the following best explains why partitioning improves query performance in Hadoop?

A. Partitioning encrypts data to secure it during query execution.

B. Partitioning compresses data to reduce storage space and speed up queries.

C. Partitioning reduces the amount of data scanned by filtering data at the storage level.

D. Partitioning duplicates data across nodes to increase fault tolerance.
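As a rough illustration of what partitioning does at the storage level, here is a minimal Python sketch. It is a toy model, not Hive itself: the `partitions` dictionary and the `scan` function are hypothetical names that stand in for one directory per partition value.

```python
# Toy model of a table partitioned by year: each key stands in for one
# partition directory, each value for the rows stored in it.
# All names and data here are hypothetical.
partitions = {
    2021: [("a", 10), ("b", 20)],
    2022: [("c", 30)],
    2023: [("d", 40), ("e", 50), ("f", 60)],
}


def scan(partitions, year=None):
    """Return (rows, rows_scanned) for an optional filter on the partition key."""
    if year is None:
        # No filter on the partition key: every partition is read.
        selected = list(partitions.values())
    else:
        # Filter on the partition key: only the matching partition is read.
        selected = [partitions.get(year, [])]
    rows = [row for part in selected for row in part]
    return rows, len(rows)


rows, scanned = scan(partitions, year=2023)
print(len(rows), scanned)  # 3 3
```

Calling `scan(partitions)` with no filter reads all 6 rows, while the filtered call touches only the 3 rows in the 2023 partition.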

❓ Predict Output

intermediate

Output of query with partition pruning

Given a Hive table named sales that is partitioned by year, what count does the following query return?

SELECT COUNT(*) FROM sales WHERE year = 2023;

Assume the table has 100,000 rows in total, 20,000 of which belong to the year 2023 partition.

A. 100,000

B. 20,000

D. 50,000

❓ Data Output

advanced

Result of partitioned data filtering

Consider a Hadoop table partitioned by country with the following row counts per partition:

USA: 50,000 rows
Canada: 30,000 rows
Mexico: 20,000 rows

What is the number of rows returned by this query?

SELECT * FROM table WHERE country IN ('USA', 'Mexico');

A. 70,000

B. 50,000

C. 30,000

D. 100,000
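The arithmetic behind an IN filter on a partition column can be sketched in a few lines of Python. The row counts below are deliberately toy numbers rather than the figures in the question, and `partition_rows` and `rows_for_in_filter` are hypothetical names.

```python
# Hypothetical per-partition row counts (toy numbers, not the quiz data).
partition_rows = {"DE": 4, "FR": 3, "IT": 5}


def rows_for_in_filter(partition_rows, wanted):
    """Sum the row counts of the partitions named in an IN (...) list."""
    # Partitions whose key is not in the list contribute nothing.
    return sum(count for key, count in partition_rows.items() if key in wanted)


print(rows_for_in_filter(partition_rows, {"DE", "IT"}))  # 9
```

Because every row in a partition shares the same partition value, the result is the sum of the sizes of the selected partitions.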

🔧 Debug

advanced

Identify the cause of slow query despite partitioning

A Hive query on a partitioned table is running very slowly. The query filters only on a column that is not a partition column. What is the most likely reason?

A. The query optimizer is disabled.

B. The table is not actually partitioned, causing a full scan.

C. The partition column has too many distinct values, causing overhead.

D. The query cannot use partition pruning because the filter is on a non-partition column.
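To see how the filter column affects the amount of data read, consider this minimal Python sketch of a table partitioned by year. It is a simplified model under stated assumptions: the `partitions` data and the `query` function are hypothetical, and real engines have further optimizations that this toy ignores.

```python
# Toy table partitioned by year; each row is (year, customer, amount).
# All names and data here are hypothetical.
partitions = {
    2022: [(2022, "ann", 10), (2022, "bob", 20)],
    2023: [(2023, "ann", 30), (2023, "cy", 40)],
}


def query(partitions, year=None, customer=None):
    """Return (matching_rows, rows_scanned) for the given filters."""
    # Only a filter on the partition key can narrow the set of partitions.
    keys = [year] if year is not None else list(partitions)
    scanned = 0
    matches = []
    for key in keys:
        for row in partitions.get(key, []):
            scanned += 1
            # The customer filter is applied row by row, after reading.
            if customer is None or row[1] == customer:
                matches.append(row)
    return matches, scanned


print(query(partitions, customer="ann")[1])             # 4
print(query(partitions, year=2023, customer="ann")[1])  # 2
```

Filtering on `customer` alone still reads all 4 rows, whereas adding a filter on `year` halves the rows read in this toy data.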

🚀 Application

expert

Choosing partition columns for best query performance

You have a large Hadoop dataset with columns: date, region, product, and sales. Most queries filter by date and region. Which partitioning strategy will likely give the best query performance?

A. Partition by both date and region

B. Partition by region only

C. Partition by date only

D. Do not partition the table
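One way to compare these layouts is to count how many rows a query with filters on both date and region would have to read under each. The Python sketch below does this for a toy dataset. The dates, regions, `ROWS_PER_CELL`, and `rows_scanned` are all hypothetical, and the model assumes rows are spread evenly.

```python
from itertools import product

# Hypothetical toy dataset: 3 dates x 2 regions, evenly filled.
dates = ["2024-01-01", "2024-01-02", "2024-01-03"]
regions = ["east", "west"]
ROWS_PER_CELL = 100  # assumed rows for each (date, region) pair


def rows_scanned(partition_cols, date, region):
    """Rows read for WHERE date = ... AND region = ... under a given layout."""
    scanned = 0
    for d, r in product(dates, regions):
        # A cell can be skipped only if a partition column rules it out.
        if "date" in partition_cols and d != date:
            continue
        if "region" in partition_cols and r != region:
            continue
        scanned += ROWS_PER_CELL
    return scanned


for layout in [(), ("region",), ("date",), ("date", "region")]:
    print(layout, rows_scanned(layout, "2024-01-02", "east"))
```

In this toy model the rows read drop from 600 with no partitioning to 100 with both columns. The sketch ignores the per-partition overhead that very fine-grained layouts can introduce, so it shows only one side of the trade-off.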