Which of the following best explains why partitioning improves query performance in Hadoop?
Think about how partitioning helps avoid reading unnecessary data.
Partitioning organizes data into parts based on column values, so queries can skip irrelevant partitions and scan less data, improving speed.
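To make the skipping concrete, here is a minimal Python sketch, not Hive's actual implementation. It assumes the usual Hive layout in which each partition lives in a `column=value` subdirectory. The file names and the `files_to_read` helper are hypothetical.

```python
# Hypothetical file listing for a table partitioned by `year`.
# Hive stores each partition in a `column=value` subdirectory.
files = [
    "sales/year=2021/part-0000",
    "sales/year=2022/part-0000",
    "sales/year=2023/part-0000",
    "sales/year=2023/part-0001",
]

def files_to_read(files, column, value):
    """Keep only files under the matching `column=value` directory."""
    marker = f"/{column}={value}/"
    return [path for path in files if marker in path]

# A filter such as WHERE year = 2023 only needs the 2023 directory.
selected = files_to_read(files, "year", 2023)
print(selected)
```

The files for 2021 and 2022 are never opened, which is where the saving comes from.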
Given a Hive table partitioned by year, what will be the output count of this query?
SELECT COUNT(*) FROM sales WHERE year = 2023;
Assume the table has 100,000 rows total, with 20,000 rows for year 2023.
The query filters on the partition column year.
Because year is the partition column, the query scans only the partition for year 2023. That partition contains 20,000 rows, so the count is 20,000.
Consider a Hadoop table partitioned by country with the following data counts:
- USA: 50,000 rows
- Canada: 30,000 rows
- Mexico: 20,000 rows
What is the number of rows returned by this query?
SELECT * FROM table WHERE country IN ('USA', 'Mexico');
Sum the rows from the selected partitions.
The query scans partitions for USA and Mexico only, totaling 50,000 + 20,000 = 70,000 rows.
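The arithmetic can be checked with a short Python sketch. The row counts come from the question; the variable names are illustrative only.

```python
# Row counts per partition, taken from the question.
partition_rows = {"USA": 50_000, "Canada": 30_000, "Mexico": 20_000}

# Partitions selected by: WHERE country IN ('USA', 'Mexico')
selected = ["USA", "Mexico"]

rows_returned = sum(partition_rows[c] for c in selected)
rows_skipped = sum(n for c, n in partition_rows.items() if c not in selected)
print(rows_returned, rows_skipped)  # 70000 30000
```

The 30,000 Canada rows are skipped rather than read and discarded.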
A Hive query on a partitioned table is running very slowly. The query filters on a column that is not a partition column. What is the most likely reason?
Partition pruning only works when filtering on partition columns.
Partition pruning skips irrelevant partitions only when the filter is on a partition column. A filter on a non-partition column forces a full table scan, because any partition may contain matching rows.
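The difference in work can be sketched in Python as a toy model, not Hive's planner. The table contents and the `run_query` helper are hypothetical: `year` stands in for the partition column and `region` for an ordinary column.

```python
# Hypothetical table partitioned by `year`; `region` is an ordinary column.
table = {
    2022: [{"region": "east"}, {"region": "west"}, {"region": "east"}],
    2023: [{"region": "west"}, {"region": "east"}],
}

def run_query(table, column, value):
    """Return (matching_rows, rows_read) for the filter `column = value`."""
    if column == "year":
        # Filter is on the partition column: open only the matching partition.
        candidates = table.get(value, [])
        matches = list(candidates)
    else:
        # Filter is on a regular column: every partition must be read.
        candidates = [row for rows in table.values() for row in rows]
        matches = [row for row in candidates if row[column] == value]
    return len(matches), len(candidates)

print(run_query(table, "year", 2023))      # (2, 2)
print(run_query(table, "region", "east"))  # (3, 5)
```

The second query returns 3 rows but has to read all 5, which is the full-scan behavior described above.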
You have a large Hadoop dataset with columns: date, region, product, and sales. Most queries filter by date and region. Which partitioning strategy will likely give the best query performance?
Consider which columns are most commonly used in filters.
Partitioning by both date and region allows queries filtering on either or both columns to prune partitions effectively, improving performance.
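As a rough illustration, the Python sketch below counts how many partitions survive each kind of filter when the table is partitioned by both columns. The date and region values and the `prune` helper are hypothetical, and a real table would have far more dates.

```python
from itertools import product

# Hypothetical partition values; a real table would have many more dates.
dates = ["2024-01-01", "2024-01-02", "2024-01-03"]
regions = ["east", "west"]

# Partitioning by (date, region) yields one partition per combination.
all_partitions = list(product(dates, regions))

def prune(partitions, date=None, region=None):
    """Keep only partitions compatible with the given filters."""
    return [
        (d, r) for d, r in partitions
        if (date is None or d == date) and (region is None or r == region)
    ]

print(len(all_partitions))                               # 6
print(len(prune(all_partitions, date="2024-01-02")))     # 2
print(len(prune(all_partitions, region="east")))         # 3
print(len(prune(all_partitions, "2024-01-02", "east")))  # 1
```

A filter on either column alone already narrows the scan, and a filter on both narrows it to a single partition.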