Apache Spark · data · ~5 mins

Why Join Strategy Affects Performance in Apache Spark - Quick Recap

Recall & Review
beginner
What is a join strategy in Apache Spark?
A join strategy is the method Spark uses to combine two datasets based on a common key. Different strategies affect how data is shuffled and processed, impacting performance.
beginner
How does a broadcast join improve Spark performance?
Broadcast join sends a small dataset to all worker nodes, avoiding large data shuffles. This reduces network overhead and speeds up the join operation.
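The idea can be sketched in plain Python, without Spark. This is a minimal, library-free illustration of the broadcast-join pattern (not Spark's actual implementation): the small table is turned into a hash map and copied to every "worker", so each partition of the large table joins locally with no cross-node data movement.

```python
# Toy broadcast join: replicate a small lookup table to every partition.
# In real Spark, this map would be serialized and shipped to each executor.

def broadcast_join(large_partitions, small_table):
    """large_partitions: list of partitions, each a list of (key, value) rows.
    small_table: list of (key, value) rows, small enough to replicate."""
    # Build the lookup map once; this is the "broadcast" piece.
    lookup = {}
    for key, val in small_table:
        lookup.setdefault(key, []).append(val)
    results = []
    for partition in large_partitions:   # each partition joins independently
        for key, left_val in partition:
            for right_val in lookup.get(key, []):
                results.append((key, left_val, right_val))
    return results

orders = [[(1, "order-a"), (2, "order-b")], [(1, "order-c"), (3, "order-d")]]
customers = [(1, "alice"), (2, "bob")]
print(broadcast_join(orders, customers))
```

Note that no row of `orders` ever leaves its partition; only the small `customers` map is replicated, which is exactly why the strategy avoids a shuffle.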
beginner
Why can a shuffle join be slower in Spark?
Shuffle join requires redistributing data across the cluster based on join keys. This data movement is expensive and can slow down the job, especially with large datasets.
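For contrast, here is a toy sketch of a shuffle (hash) join: both inputs are repartitioned by `hash(key)` so matching keys land in the same partition, then each partition pair joins locally. In a cluster, the repartitioning step is the expensive network shuffle the flashcard describes; here it is simulated with in-memory lists.

```python
# Toy shuffle join: hash-partition both sides by key, then join per partition.

def hash_partition(rows, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for key, val in rows:
        # In a cluster, this append is a row crossing the network.
        parts[hash(key) % num_partitions].append((key, val))
    return parts

def shuffle_join(left, right, num_partitions=4):
    left_parts = hash_partition(left, num_partitions)
    right_parts = hash_partition(right, num_partitions)
    results = []
    # Matching keys are now co-located, so each partition pair joins locally.
    for lp, rp in zip(left_parts, right_parts):
        lookup = {}
        for key, val in rp:
            lookup.setdefault(key, []).append(val)
        for key, lval in lp:
            for rval in lookup.get(key, []):
                results.append((key, lval, rval))
    return results

print(shuffle_join([(1, "a"), (2, "b")], [(1, "x"), (3, "y")]))
```

Unlike the broadcast version, every row of both inputs moves during partitioning, which is why shuffle joins scale in cost with the total data size.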
intermediate
What factors influence Spark's choice of join strategy?
Spark considers dataset size, available memory, and data distribution. For example, it prefers broadcast join if one dataset is small enough to fit in memory.
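The size-based part of that decision can be sketched as a simple threshold check. In real Spark the cutoff is the `spark.sql.autoBroadcastJoinThreshold` configuration (10 MB by default); the function below is a hypothetical illustration of that one rule, not Spark's planner, which also weighs join hints, join type, and statistics.

```python
# Illustrative size-based strategy choice, mirroring the default value of
# spark.sql.autoBroadcastJoinThreshold (10 MB). Everything else is a sketch.

AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # bytes

def choose_join_strategy(left_bytes, right_bytes,
                         threshold=AUTO_BROADCAST_THRESHOLD):
    smaller = min(left_bytes, right_bytes)
    if smaller <= threshold:
        return "broadcast"   # replicate the small side, skip the shuffle
    return "shuffle"         # repartition both sides by join key

print(choose_join_strategy(5 * 1024**3, 2 * 1024**2))  # -> broadcast
print(choose_join_strategy(5 * 1024**3, 1 * 1024**3))  # -> shuffle
```

In PySpark you can also force the choice explicitly by wrapping the small DataFrame with `pyspark.sql.functions.broadcast` before joining.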
intermediate
How does join strategy affect resource usage in Spark?
Different join strategies use CPU, memory, and network differently. Broadcast joins use more memory but less network, while shuffle joins use more network and CPU for data movement.
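A back-of-envelope comparison makes the network trade-off concrete: broadcasting ships one full copy of the small table to every executor, while a shuffle moves (roughly) both tables across the wire once. The sizes and executor count below are made up for illustration.

```python
# Rough network-cost models for the two strategies (illustrative only).

def broadcast_network_cost(small_bytes, num_executors):
    return small_bytes * num_executors       # one full copy per executor

def shuffle_network_cost(left_bytes, right_bytes):
    return left_bytes + right_bytes          # both sides get repartitioned

small = 50 * 1024**2       # 50 MB dimension table
large = 500 * 1024**3      # 500 GB fact table
executors = 100

print(broadcast_network_cost(small, executors) / 1024**3)  # ~4.9 GB on the wire
print(shuffle_network_cost(large, small) / 1024**3)        # ~500 GB on the wire
```

The memory side of the trade-off is the mirror image: the broadcast copy must fit in each executor's memory, which is why the strategy only pays off when one side is small.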
Which join strategy avoids shuffling large datasets in Spark?
A. Cartesian join
B. Shuffle join
C. Broadcast join
D. Sort-merge join
What is a main downside of shuffle joins in Spark?
A. Requires broadcasting data
B. Causes expensive data movement across nodes
C. Only works with small datasets
D. Does not support join keys
When does Spark prefer to use a broadcast join?
A. When one dataset is small enough to fit in memory
B. When both datasets are large
C. When datasets are sorted
D. When join keys are missing
Which resource is used more in broadcast joins compared to shuffle joins?
A. Network bandwidth
B. CPU cores
C. Disk space
D. Memory
What happens if Spark uses shuffle join on very large datasets?
A. Performance may degrade due to heavy data movement
B. Spark crashes immediately
C. Performance improves due to parallelism
D. Join keys are ignored
Explain why the choice of join strategy affects Spark performance.
Think about how data moves and is stored during joins.
Describe when to use broadcast join versus shuffle join in Spark.
Consider dataset sizes and resource costs.