Partitioning for query performance in Hadoop - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Why use partitioning in Hadoop queries?

Which of the following best explains why partitioning improves query performance in Hadoop?

A. Partitioning encrypts data to secure it during query execution.
B. Partitioning compresses data to reduce storage space and speed up queries.
C. Partitioning reduces the amount of data scanned by filtering data at the storage level.
D. Partitioning duplicates data across nodes to increase fault tolerance.
💡 Hint

Think about how partitioning helps avoid reading unnecessary data.
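
To make the hint concrete, here is a minimal sketch of how a Hive-style directory layout lets a reader skip data. It is plain Python on a temporary local directory, not Hadoop itself; the helper names `write_partitioned` and `count_rows`, the `sales` path, and the toy rows are all assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_partitioned(root: Path, rows: list[dict], partition_col: str) -> None:
    """Write rows into Hive-style directories such as root/year=2023/part-0.csv."""
    groups: dict[str, list[dict]] = {}
    for row in rows:
        groups.setdefault(str(row[partition_col]), []).append(row)
    for value, group in groups.items():
        part_dir = root / f"{partition_col}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # The partition value lives in the directory name, not inside the file.
        fields = [c for c in group[0] if c != partition_col]
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)


def count_rows(root: Path, partition_col: str, value=None) -> tuple[int, int]:
    """Return (row_count, files_opened); a partition filter limits the directories read."""
    pattern = f"{partition_col}={value}" if value is not None else f"{partition_col}=*"
    rows = files = 0
    for part_dir in root.glob(pattern):
        for data_file in part_dir.glob("*.csv"):
            files += 1
            with open(data_file, newline="") as f:
                rows += sum(1 for _ in csv.DictReader(f))
    return rows, files


# Toy data: three years with four rows each (made-up values).
rows = [{"year": y, "amount": a} for y in (2021, 2022, 2023) for a in range(4)]
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "sales"
    write_partitioned(root, rows, "year")
    full_scan = count_rows(root, "year")          # no partition filter
    pruned_scan = count_rows(root, "year", 2023)  # filter on the partition column

print("full scan   (rows, files):", full_scan)
print("pruned scan (rows, files):", pruned_scan)
```

The filter on `year` is resolved from directory names alone, so the files for the other years are never opened.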

Predict Output
intermediate
Output of query with partition pruning

Given a Hive table partitioned by year, what count does this query return?

SELECT COUNT(*) FROM sales WHERE year = 2023;

Assume the table has 100,000 rows total, with 20,000 rows for year 2023.

A. 100000
B. 20000
C. 0
D. 50000
💡 Hint

The query filters on the partition column year.

Data Output
advanced
Result of partitioned data filtering

Consider a Hadoop table partitioned by country with the following data counts:

  • USA: 50,000 rows
  • Canada: 30,000 rows
  • Mexico: 20,000 rows

What is the number of rows returned by this query?

SELECT * FROM table WHERE country IN ('USA', 'Mexico');
A. 70000
B. 50000
C. 30000
D. 100000
💡 Hint

Sum the rows from the selected partitions.
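
The hint's arithmetic can be sketched in a few lines of Python. The per-country counts come from the problem statement; the helper name `rows_for` is a hypothetical one chosen for this example.

```python
# Row counts per partition, taken from the problem statement.
partition_rows = {"USA": 50_000, "Canada": 30_000, "Mexico": 20_000}


def rows_for(countries: list[str], counts: dict[str, int]) -> int:
    """Sum row counts over the partitions named in an IN (...) filter."""
    return sum(counts.get(c, 0) for c in countries)


selected = ["USA", "Mexico"]
returned = rows_for(selected, partition_rows)
skipped = sum(partition_rows.values()) - returned
print(f"rows returned: {returned}, rows never read: {skipped}")
```

Because `country` is the partition column, the rows in the unselected partition are not just filtered out, they are never read.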

🔧 Debug
advanced
Identify the cause of slow query despite partitioning

A Hive query on a partitioned table is running very slowly. The query filters only on a column that is not a partition column. What is the most likely reason?

A. The query optimizer is disabled.
B. The table is not actually partitioned, causing a full scan.
C. The partition column has too many distinct values, causing overhead.
D. The query cannot use partition pruning because the filter is on a non-partition column.
💡 Hint

Partition pruning only works when filtering on partition columns.
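
A small in-memory sketch shows the difference in work done. It assumes a table partitioned by `year` with a regular `product` column; the function `run_query` and the toy rows are hypothetical and only mimic how a reader would choose which partitions to touch.

```python
def run_query(partitions: dict, column: str, value, partition_col: str):
    """Return (matching_rows, rows_scanned) for an equality filter on one column."""
    if column == partition_col:
        # Filter on the partition column: only one partition is a candidate.
        candidates = {value: partitions.get(value, [])}
    else:
        # Filter on a regular column: every partition must be read.
        candidates = partitions
    matches, scanned = [], 0
    for rows in candidates.values():
        for row in rows:
            scanned += 1
            if column == partition_col or row[column] == value:
                matches.append(row)
    return matches, scanned


# Toy table partitioned by year; each partition holds three rows.
partitions = {
    year: [{"product": "pen"}, {"product": "ink"}, {"product": "pad"}]
    for year in (2021, 2022, 2023)
}

by_year = run_query(partitions, "year", 2023, "year")
by_product = run_query(partitions, "product", "pen", "year")
print("filter on year:    rows scanned =", by_year[1])
print("filter on product: rows scanned =", by_product[1])
```

Both queries return three rows here, but the filter on `product` has to read every partition to find them.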

🚀 Application
expert
Choosing partition columns for best query performance

You have a large Hadoop dataset with columns: date, region, product, and sales. Most queries filter by date and region. Which partitioning strategy will likely give the best query performance?

A. Partition by both date and region
B. Partition by region only
C. Partition by date only
D. Do not partition the table
💡 Hint

Consider which columns are most commonly used in filters.
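
As a rough sketch of how nested partition columns narrow a read, the Python below enumerates Hive-style `date=.../region=...` directories and counts how many survive different filters. The date and region values and the helper `dirs_to_read` are assumptions made for illustration.

```python
from itertools import product

# Hypothetical partition values.
dates = ["2024-01-01", "2024-01-02", "2024-01-03"]
regions = ["east", "west"]

# One directory per (date, region) pair, nested Hive-style.
partition_dirs = [f"date={d}/region={r}" for d, r in product(dates, regions)]


def dirs_to_read(dirs: list[str], **filters: str) -> list[str]:
    """Keep only directories whose key=value segments satisfy every filter."""
    kept = []
    for path in dirs:
        spec = dict(segment.split("=", 1) for segment in path.split("/"))
        if all(spec.get(col) == val for col, val in filters.items()):
            kept.append(path)
    return kept


both = dirs_to_read(partition_dirs, date="2024-01-02", region="west")
date_only = dirs_to_read(partition_dirs, date="2024-01-02")
no_filter = dirs_to_read(partition_dirs)
print("total partitions:     ", len(partition_dirs))
print("date + region filter: ", len(both))
print("date filter only:     ", len(date_only))
print("no partition filter:  ", len(no_filter))
```

The number of directories grows with the product of the distinct values in each partition column, so columns with very many distinct values can produce large numbers of small partitions; that cost has to be weighed against the pruning benefit.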