beginner

What is partitioning in the context of Hadoop data storage?

Partitioning means dividing a large dataset into smaller, manageable parts based on a column's values. This helps Hadoop process queries faster by scanning only relevant parts.
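As a rough illustration of the idea, the sketch below groups a few in-memory rows into Hive-style `column=value` directory names. The `partition_rows` helper, the sample rows, and the `region` column are all hypothetical; real tables are partitioned by the storage engine, not by application code like this.

```python
from collections import defaultdict


def partition_rows(rows, partition_col):
    """Group rows under Hive-style 'column=value' directory names (illustrative only)."""
    partitions = defaultdict(list)
    for row in rows:
        # The partition value is carried by the directory name,
        # so it is dropped from the stored row itself.
        stored = {k: v for k, v in row.items() if k != partition_col}
        partitions[f"{partition_col}={row[partition_col]}"].append(stored)
    return dict(partitions)


# Hypothetical sample data.
rows = [
    {"order_id": 1, "region": "EU", "amount": 20},
    {"order_id": 2, "region": "US", "amount": 35},
    {"order_id": 3, "region": "EU", "amount": 12},
]

layout = partition_rows(rows, "region")
for directory, part in sorted(layout.items()):
    print(directory, part)
```

Each distinct value of the partition column ends up in its own group, which mirrors how a partitioned table keeps one directory per value.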

beginner

How does partitioning improve query performance in Hadoop?

Partitioning reduces the amount of data scanned during a query by filtering partitions based on query conditions. This means less data to read and faster results.
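The effect can be sketched with a toy scan over data keyed by partition value. The `scan` function and the date-keyed sample data are assumptions made for illustration; a real query engine does this pruning during query planning.

```python
def scan(partitions, wanted_value=None):
    """Return (rows, rows_read); skip partitions that cannot match the filter."""
    rows_read = 0
    result = []
    for value, rows in partitions.items():
        if wanted_value is not None and value != wanted_value:
            continue  # partition pruned: its rows are never read
        rows_read += len(rows)
        result.extend(rows)
    return result, rows_read


# Hypothetical table partitioned by date.
partitions = {
    "2024-01-01": [{"user": "a"}, {"user": "b"}],
    "2024-01-02": [{"user": "c"}],
    "2024-01-03": [{"user": "d"}, {"user": "e"}, {"user": "f"}],
}

_, full_scan_rows = scan(partitions)
hits, pruned_scan_rows = scan(partitions, wanted_value="2024-01-02")
print(full_scan_rows, pruned_scan_rows)  # prints: 6 1
```

With a filter on the partition column, only one of the three partitions is read, so the scan touches 1 row instead of 6.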

beginner

What is a common column type used for partitioning in Hadoop tables?

Columns with a small, bounded set of values, such as dates, regions, or categories, are commonly used for partitioning because they split data into meaningful groups that queries often filter on.

intermediate

What happens if you over-partition your data in Hadoop?

Over-partitioning creates many small files, which can slow down queries: every partition and file adds metadata and task-scheduling overhead, and on HDFS each file also consumes NameNode memory.
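A quick back-of-the-envelope calculation shows how fast file sizes shrink as partitions multiply. The 100 GB table, the partition counts, and the 128 MB block size are assumed figures for illustration; check `dfs.blocksize` on your own cluster.

```python
BLOCK_SIZE_MB = 128  # assumed HDFS block size, not read from any cluster


def average_file_mb(total_gb, num_partitions, files_per_partition=1):
    """Average file size in MB if the data is spread evenly across all files."""
    total_files = num_partitions * files_per_partition
    return total_gb * 1024 / total_files


# Hypothetical 100 GB table under two partitioning schemes.
by_day = average_file_mb(100, num_partitions=365)         # one partition per day
by_user = average_file_mb(100, num_partitions=1_000_000)  # one partition per user ID

for name, size in [("by day", by_day), ("by user ID", by_user)]:
    status = "ok" if size >= BLOCK_SIZE_MB else "small files"
    print(f"{name}: {size:.2f} MB per file ({status})")
```

Partitioning by day keeps files larger than a block, while partitioning by a near-unique column leaves files of roughly 0.1 MB each, which is the small-files situation described above.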

intermediate

Explain the difference between partitioning and bucketing in Hadoop.

Partitioning divides data into folders based on column values, while bucketing hashes a column's values to split data into a fixed number of files within each partition, which helps with sampling and joins.
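The bucketing half of that answer can be sketched as a hash of the column value taken modulo a fixed bucket count. The `bucket_for` helper, the bucket count, and the user IDs are hypothetical, and `zlib.crc32` is only a stand-in: Hive uses its own hash function, so real bucket numbers will differ.

```python
import zlib


def bucket_for(value, num_buckets):
    """Map a column value to one of a fixed number of buckets (stand-in hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


NUM_BUCKETS = 4  # fixed up front, unlike the number of partitions
user_ids = [101, 102, 103, 104, 105, 101]  # hypothetical values; 101 repeats

assignments = [(uid, bucket_for(uid, NUM_BUCKETS)) for uid in user_ids]
for uid, bucket in assignments:
    print(f"user_id={uid} -> bucket {bucket}")
```

The same key always lands in the same bucket, which is what lets two tables bucketed on the same column be joined bucket by bucket. The number of buckets stays fixed however many distinct values appear, whereas partitioning creates one folder per value.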

What is the main benefit of partitioning data in Hadoop?

A. Encrypting data for security

B. Compressing data to save space

C. Backing up data automatically

D. Faster queries by scanning only relevant partitions

Which type of column is best suited for partitioning?

A. Continuous numeric values like temperature

B. Unique IDs like user IDs

C. Categorical values like dates or regions

D. Text descriptions

What is a downside of having too many partitions?

A. Queries become slower due to overhead

B. Data gets deleted automatically

C. Partitions merge into one big file

D. Data becomes unsearchable

Partitioning in Hadoop physically stores data in:

A. Random locations

B. Separate folders based on partition column values

C. Encrypted blocks

D. A single large file

Which technique complements partitioning for better query performance?

A. Bucketing

B. Indexing

C. Compression

D. Replication

Describe how partitioning helps improve query performance in Hadoop.

What are the risks of over-partitioning data, and how can it affect Hadoop queries?