beginner

What does the repartition() function do in Apache Spark?

The repartition() function reshuffles the data across the cluster to create the specified number of partitions. It can increase or decrease partitions and involves a full shuffle, which can be expensive.

beginner

How does coalesce() differ from repartition()?

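coalesce() reduces the number of partitions by merging existing partitions, which avoids a full shuffle. It is more efficient than repartition() when decreasing partitions, but by default it cannot increase the partition count. The exception is the RDD API, where coalesce() accepts a shuffle flag; setting it to true performs a shuffle and allows the count to grow. The DataFrame/Dataset coalesce() takes no shuffle argument.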
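The difference between merging and redistributing can be sketched with a toy model in plain Python. This is not Spark code and not how Spark implements either operation: partitions are modeled as lists, and the helper names toy_coalesce and toy_repartition are hypothetical. The contiguous grouping and the simple round-robin are simplifying assumptions chosen only to show the effect on partition sizes.

```python
from typing import List

Partition = List[int]


def toy_coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Merge whole partitions into n groups; no partition is ever split."""
    if n >= len(partitions):
        # Without a shuffle the partition count cannot grow, so return a copy.
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Assumption: neighbouring input partitions are grouped together.
        merged[i * n // len(partitions)].extend(part)
    return merged


def toy_repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Redistribute every record round-robin across n new partitions."""
    out: List[Partition] = [[] for _ in range(n)]
    position = 0
    for part in partitions:
        for record in part:
            out[position % n].append(record)
            position += 1
    return out


# One large partition and three tiny ones (skewed input).
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]

coalesced = toy_coalesce(skewed, 2)
repartitioned = toy_repartition(skewed, 2)

print([len(p) for p in coalesced])      # [7, 2]  skew survives the merge
print([len(p) for p in repartitioned])  # [5, 4]  records are spread evenly
```

In this model the merge keeps the existing skew because whole partitions stay together, while the redistribution touches every record to even out the sizes.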

intermediate

When should you use repartition() instead of coalesce()?

Use repartition() when you want to increase the number of partitions or when you need a full shuffle to evenly distribute data for better parallelism.

intermediate

What is a shuffle in Apache Spark and why is it important in repartitioning?

A shuffle is the process of redistributing data across partitions and nodes. It matters in repartitioning because it is the step that moves records to balance partitions, but it is costly: records must be serialized, written to disk, and transferred over the network.
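To make the idea of redistribution concrete, here is a minimal plain-Python sketch, assuming a simple key-modulo partitioner. It is a toy model rather than Spark's shuffle implementation; the function name toy_shuffle and the sample records are hypothetical. It routes each record to a target partition by key and counts how many records end up in a different partition index than they started in, which stands in for the data movement a real shuffle pays for.

```python
from typing import List, Tuple

Record = Tuple[int, str]  # (key, value)


def toy_shuffle(
    partitions: List[List[Record]], n: int
) -> Tuple[List[List[Record]], int]:
    """Route each record to partition key % n and count relocated records."""
    out: List[List[Record]] = [[] for _ in range(n)]
    moved = 0
    for source_index, part in enumerate(partitions):
        for key, value in part:
            target_index = key % n  # assumed partitioner: key modulo n
            out[target_index].append((key, value))
            if target_index != source_index:
                moved += 1  # this record leaves its original partition
    return out, moved


before = [
    [(0, "a"), (1, "b"), (2, "c")],
    [(3, "d"), (4, "e")],
    [(5, "f")],
]

after, moved = toy_shuffle(before, 3)

print([len(p) for p in after])  # [2, 2, 2]
print(moved)                    # 3
```

Half of the six records change partition here. On a real cluster, partitions live on different executors, so each such move can mean a network transfer.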

intermediate

Explain the performance impact of using coalesce() versus repartition().

coalesce() is faster and uses fewer resources because it merges existing partitions instead of performing a full shuffle. repartition() is slower because of the shuffle, but it distributes data more evenly and is the option to use when increasing partitions.

Which function should you use to increase the number of partitions in Spark?

A. coalesce()

B. filter()

C. map()

D. repartition()

What is the main cost associated with using repartition()?

A. Full shuffle of data

B. No cost, it is free

C. Only merges partitions

D. Deletes data

Which method is more efficient when reducing partitions without needing a shuffle?

A. groupBy()

B. repartition()

C. coalesce()

D. join()

Can coalesce() increase the number of partitions?

A. No

B. Yes

C. Only if shuffle is true

D. Only with repartition()

Why might repartition() improve parallelism?

A. Because it merges partitions

B. Because it evenly redistributes data across partitions

C. Because it deletes partitions

D. Because it caches data

Describe the differences between repartition() and coalesce() in Apache Spark and when to use each.

Explain what a shuffle is in Spark and why it matters for partition tuning.