Recall & Review
beginner
What is a join strategy in Apache Spark?
A join strategy is the method Spark uses to combine two datasets based on a common key. Different strategies affect how data is shuffled and processed, impacting performance.
beginner
How does a broadcast join improve Spark performance?
A broadcast join sends a full copy of the small dataset to every executor, so the large dataset can be joined where it already sits instead of being shuffled. This reduces network overhead and speeds up the join operation.
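To make the idea concrete, here is a minimal sketch in plain Python, not Spark code. It assumes each dataset is a list of `(key, value)` tuples and that the large side is already split into partitions; `broadcast_hash_join`, `orders`, and `customers` are hypothetical names used only for illustration.

```python
from collections import defaultdict

def broadcast_hash_join(large_partitions, small_rows):
    """Toy model of an inner broadcast hash join over (key, value) rows."""
    # Build one hash table from the small side. In a real cluster, a copy
    # of this table would be shipped to every executor.
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)

    joined = []
    # Probe the table partition by partition. Rows of the large side are
    # never moved to another partition.
    for partition in large_partitions:
        for key, value in partition:
            for match in lookup.get(key, []):
                joined.append((key, value, match))
    return joined

# Hypothetical sample data: two partitions of orders, one small customer list.
orders = [[(1, "book"), (2, "pen")], [(1, "lamp"), (3, "desk")]]
customers = [(1, "Ana"), (2, "Bo")]
result = broadcast_hash_join(orders, customers)
```

In PySpark the same intent is usually expressed with the `broadcast()` hint from `pyspark.sql.functions`, which asks the planner to broadcast the wrapped DataFrame.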
beginner
Why can a shuffle join be slower in Spark?
A shuffle join redistributes both datasets across the cluster so that rows with the same join key land in the same partition. This data movement is expensive and can slow down the job, especially with large datasets.
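The cost of that redistribution can be illustrated with a small plain-Python sketch. It is a toy model, not Spark's implementation: rows are `(key, value)` tuples, partitions are lists, and the hypothetical helpers `shuffle` and `shuffle_hash_join` count how many rows would have to leave their original partition.

```python
from collections import defaultdict

def shuffle(partitions, num_partitions):
    """Redistribute (key, value) rows so equal keys share a partition."""
    shuffled = [[] for _ in range(num_partitions)]
    moved = 0
    for source_index, partition in enumerate(partitions):
        for key, value in partition:
            target_index = hash(key) % num_partitions
            shuffled[target_index].append((key, value))
            if target_index != source_index:
                moved += 1  # this row would cross the network
    return shuffled, moved

def shuffle_hash_join(left_partitions, right_partitions, num_partitions):
    """Toy inner join: shuffle both sides, then join matching partitions."""
    left, left_moved = shuffle(left_partitions, num_partitions)
    right, right_moved = shuffle(right_partitions, num_partitions)
    joined = []
    for left_part, right_part in zip(left, right):
        # After the shuffle, matching keys are guaranteed to be co-located.
        lookup = defaultdict(list)
        for key, value in right_part:
            lookup[key].append(value)
        for key, value in left_part:
            for match in lookup.get(key, []):
                joined.append((key, value, match))
    return joined, left_moved + right_moved

# Hypothetical sample data with integer keys.
orders = [[(1, "book"), (2, "pen")], [(1, "lamp"), (3, "desk")]]
customers = [[(1, "Ana")], [(2, "Bo")]]
joined, rows_moved = shuffle_hash_join(orders, customers, num_partitions=2)
```

Note that both sides pay the movement cost here, whereas a broadcast join moves only the small side.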
intermediate
What factors influence Spark's choice of join strategy?
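Spark mainly considers the estimated size of each dataset, the join type and condition, and any join hints; data distribution and available memory then determine how well the chosen strategy actually runs. For example, it prefers a broadcast join when one dataset's estimated size is at or below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), which is small enough to fit in memory on every executor.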
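The size-based part of that decision can be sketched as follows. This is a deliberately simplified, hypothetical helper (`pick_join_strategy` is not a Spark function) that looks only at estimated sizes; Spark's real planner also weighs join type, hints, and statistics. The 10 MB figure is the documented default of `spark.sql.autoBroadcastJoinThreshold`.

```python
# Documented default of spark.sql.autoBroadcastJoinThreshold: 10 MB.
DEFAULT_BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024

def pick_join_strategy(left_bytes, right_bytes,
                       threshold_bytes=DEFAULT_BROADCAST_THRESHOLD_BYTES):
    """Hypothetical size-only rule; not Spark's actual planner logic."""
    # A negative threshold mirrors setting the Spark option to -1,
    # which disables automatic broadcasting.
    if threshold_bytes >= 0 and min(left_bytes, right_bytes) <= threshold_bytes:
        return "broadcast"
    return "shuffle"

MB = 1024 * 1024
GB = 1024 * MB
small_vs_large = pick_join_strategy(5 * MB, 50 * GB)
large_vs_large = pick_join_strategy(2 * GB, 50 * GB)
broadcast_disabled = pick_join_strategy(5 * MB, 50 * GB, threshold_bytes=-1)
```

Under these assumptions, a 5 MB lookup table joined to a 50 GB fact table is broadcast, while two multi-gigabyte tables fall back to a shuffle-based join.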
intermediate
How does join strategy affect resource usage in Spark?
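Different join strategies use CPU, memory, and network differently. Broadcast joins use more memory, because every executor holds a copy of the small dataset, but less network. Shuffle joins use more network and CPU, because rows are serialized, written to shuffle files, and transferred between executors.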
Which join strategy avoids shuffling large datasets in Spark?
Broadcast join sends a small dataset to all nodes, avoiding shuffling large datasets.
What is a main downside of shuffle joins in Spark?
Shuffle joins require redistributing data, which is expensive and slows performance.
When does Spark prefer to use a broadcast join?
Broadcast join is chosen if one dataset is small enough to be sent to all nodes.
Which resource is used more in broadcast joins compared to shuffle joins?
Broadcast joins use more memory to hold the small dataset on each node.
What happens if Spark uses shuffle join on very large datasets?
Shuffle join on large datasets causes heavy data movement, slowing performance.
Explain why the choice of join strategy affects Spark performance.
Think about how data moves and is stored during joins.
Describe when to use broadcast join versus shuffle join in Spark.
Consider dataset sizes and resource costs.