beginner

What is a broadcast join in Apache Spark?

A broadcast join is a join strategy in which Spark sends a full copy of a small table to every executor. Each executor then joins its partitions of the large table against that local copy, so the large table does not have to be shuffled across the network.
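To make the idea concrete, here is a minimal sketch in plain Python rather than Spark. It only imitates the mechanism: the small table becomes a lookup that every partition of the large table can read locally. The function name `broadcast_hash_join` and the sample data are hypothetical.

```python
# Conceptual sketch only: plain Python lists stand in for Spark partitions.
# All names and data here are hypothetical.

def broadcast_hash_join(large_partitions, small_table):
    """Inner-join each partition of the large table against the small table."""
    # "Broadcast" step: build one lookup from the small side. In Spark,
    # every executor would receive its own copy of this structure.
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)

    joined_partitions = []
    for partition in large_partitions:
        # Local join: rows of the large table never leave their partition.
        joined = [
            (key, left_value, right_value)
            for key, left_value in partition
            for right_value in lookup.get(key, [])
        ]
        joined_partitions.append(joined)
    return joined_partitions


orders = [  # large side, already split into two partitions
    [(1, "order-a"), (2, "order-b")],
    [(1, "order-c"), (3, "order-d")],
]
countries = [(1, "FR"), (2, "DE")]  # small side

result = broadcast_hash_join(orders, countries)
print(result)
```

Key 3 has no match in the small table, so that row is dropped, as in an inner join.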

beginner

Why use broadcast joins for small tables?

A broadcast join moves only the small table across the network, while the large table stays where it is. When one table is much smaller than the other, this is usually far cheaper than a shuffle-based join, which repartitions both tables by the join key.

intermediate

How do you enable a broadcast join in Spark using the DataFrame API?

Wrap the small DataFrame in broadcast() from pyspark.sql.functions before joining it with the large DataFrame. This acts as a hint that tells Spark to broadcast that side of the join.

intermediate

What happens if you broadcast a large table by mistake?

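Broadcasting a large table can cause out-of-memory errors, because the table is first collected on the driver and then a full copy is held by every executor. It can also slow the job down, since the whole table is sent to every node, which defeats the purpose of the optimization.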
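A rough back-of-envelope calculation shows why size matters. This is a deliberately simplified model, not how Spark accounts for memory: it assumes each executor holds exactly one full copy of the table, and the helper name and the sizes are hypothetical.

```python
# Simplified model: every executor holds one full copy of the broadcast table.
# The function name and the numbers below are hypothetical.

MB = 1024 * 1024
GB = 1024 * MB


def total_broadcast_bytes(table_bytes, num_executors):
    """Total bytes held across the cluster once the table is broadcast."""
    return table_bytes * num_executors


small = total_broadcast_bytes(10 * MB, 100)  # a 10 MB lookup table
large = total_broadcast_bytes(5 * GB, 100)   # a 5 GB table broadcast by mistake

print(small / GB)  # just under 1 GB in total
print(large / GB)  # 500.0 GB in total, and 5 GB on each executor
```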

intermediate

How does Spark decide to use broadcast join automatically?

Spark uses the spark.sql.autoBroadcastJoinThreshold setting as the maximum estimated size of a table that it will broadcast automatically during a join. The default is 10 MB, and setting it to -1 disables automatic broadcasting.

What is the main benefit of a broadcast join in Spark?

A. Reduces data shuffling by sending the small table to all nodes

B. Increases data shuffling for better parallelism

C. Broadcasts large tables to reduce memory usage

D. Avoids joins altogether

Which Spark function is used to mark a DataFrame for broadcast join?

A. persist()

B. cache()

C. broadcast()

D. collect()

What configuration controls the max size of a table to broadcast automatically?

A. spark.executor.memory

B. spark.sql.shuffle.partitions

C. spark.sql.broadcastTimeout

D. spark.sql.autoBroadcastJoinThreshold

What is a risk of broadcasting a large table?

A. Faster join execution

B. Memory overflow on worker nodes

C. Reduced network traffic

D. Automatic caching

When should you prefer broadcast join?

A. When one table is much smaller than the other

B. When both tables are very large

C. When tables have no common keys

D. When you want to avoid joins

Explain how broadcast joins improve join performance in Spark.

Describe how to use a broadcast join in Spark, with code.