Challenge - 5 Problems
Spark Join Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of Inner Join in Spark
What is the output of the following Spark code performing an inner join?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
data2 = [(2, "Sales"), (3, "HR"), (4, "IT")]

df1 = spark.createDataFrame(data1, ["id", "name"])
df2 = spark.createDataFrame(data2, ["id", "dept"])

joined_df = df1.join(df2, on="id", how="inner")
result = joined_df.collect()
Attempts: 2 left
💡 Hint
Inner join returns only rows with matching keys in both tables.
✗ Incorrect
An inner join keeps only rows whose 'id' exists in both DataFrames. Here, ids 2 and 3 are common, so the result contains Row(id=2, name='Bob', dept='Sales') and Row(id=3, name='Charlie', dept='HR').
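The inner-join semantics can be sketched in plain Python (no Spark needed), using hypothetical dicts that mirror data1 and data2 above; note that Spark does not guarantee row order, so the sketch sorts by id for readability:

```python
# Plain-Python sketch of the inner join above: only ids present in BOTH sides survive.
data1 = {1: "Alice", 2: "Bob", 3: "Charlie"}
data2 = {2: "Sales", 3: "HR", 4: "IT"}

# Intersect the key sets, then pull matching values from each side.
inner = [(i, data1[i], data2[i]) for i in sorted(data1.keys() & data2.keys())]
print(inner)  # [(2, 'Bob', 'Sales'), (3, 'Charlie', 'HR')]
```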
❓ Predict Output
Intermediate · 2:00 remaining
Output of Left Outer Join in Spark
What is the output of this Spark code performing a left outer join?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
data2 = [(2, "Sales"), (3, "HR"), (4, "IT")]

df1 = spark.createDataFrame(data1, ["id", "name"])
df2 = spark.createDataFrame(data2, ["id", "dept"])

joined_df = df1.join(df2, on="id", how="left")
result = joined_df.collect()
Attempts: 2 left
💡 Hint
Left join keeps all rows from the left table, adding matching data from the right.
✗ Incorrect
A left outer join keeps all rows from df1. For id=1 no matching dept exists, so dept is None: the result contains Row(id=1, name='Alice', dept=None) alongside the matched rows for ids 2 and 3.
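The left-join semantics can likewise be sketched in plain Python with hypothetical dicts standing in for the two DataFrames (row order sorted by id for readability; Spark itself makes no ordering guarantee):

```python
# Plain-Python sketch of the left outer join above: every left id is kept,
# and the right-side value falls back to None when there is no match.
data1 = {1: "Alice", 2: "Bob", 3: "Charlie"}
data2 = {2: "Sales", 3: "HR", 4: "IT"}

left = [(i, data1[i], data2.get(i)) for i in sorted(data1)]
print(left)  # [(1, 'Alice', None), (2, 'Bob', 'Sales'), (3, 'Charlie', 'HR')]
```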
❓ Predict Output
Advanced · 2:00 remaining
Output of Right Outer Join in Spark
What is the output of this Spark code performing a right outer join?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
data2 = [(2, "Sales"), (3, "HR"), (4, "IT")]

df1 = spark.createDataFrame(data1, ["id", "name"])
df2 = spark.createDataFrame(data2, ["id", "dept"])

joined_df = df1.join(df2, on="id", how="right")
result = joined_df.collect()
Attempts: 2 left
💡 Hint
Right join keeps all rows from the right table, adding matching data from the left.
✗ Incorrect
A right outer join keeps all rows from df2. For id=4 no matching name exists, so name is None: the result contains Row(id=4, name=None, dept='IT') alongside the matched rows for ids 2 and 3.
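The right join is the mirror image of the left join; a plain-Python sketch with the same hypothetical dicts (sorted by id for readability):

```python
# Plain-Python sketch of the right outer join above: every right id is kept,
# and the left-side value falls back to None when there is no match.
data1 = {1: "Alice", 2: "Bob", 3: "Charlie"}
data2 = {2: "Sales", 3: "HR", 4: "IT"}

right = [(i, data1.get(i), data2[i]) for i in sorted(data2)]
print(right)  # [(2, 'Bob', 'Sales'), (3, 'Charlie', 'HR'), (4, None, 'IT')]
```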
❓ Predict Output
Advanced · 2:00 remaining
Output of Full Outer Join in Spark
What is the output of this Spark code performing a full outer join?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data1 = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
data2 = [(2, "Sales"), (3, "HR"), (4, "IT")]

df1 = spark.createDataFrame(data1, ["id", "name"])
df2 = spark.createDataFrame(data2, ["id", "dept"])

joined_df = df1.join(df2, on="id", how="outer")
result = joined_df.collect()
Attempts: 2 left
💡 Hint
Full outer join keeps all rows from both tables, filling missing values with None.
✗ Incorrect
A full outer join returns every id from either DataFrame (1, 2, 3, and 4). Where one side has no match, its columns are None: name is None for id=4 and dept is None for id=1.
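The full-outer-join semantics combine both fallbacks; a plain-Python sketch with the same hypothetical dicts (sorted by id; Spark makes no ordering guarantee):

```python
# Plain-Python sketch of the full outer join above: the union of both key sets
# is kept, with None wherever one side has no match.
data1 = {1: "Alice", 2: "Bob", 3: "Charlie"}
data2 = {2: "Sales", 3: "HR", 4: "IT"}

outer = [(i, data1.get(i), data2.get(i)) for i in sorted(data1.keys() | data2.keys())]
print(outer)
# [(1, 'Alice', None), (2, 'Bob', 'Sales'), (3, 'Charlie', 'HR'), (4, None, 'IT')]
```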
🧠 Conceptual
Expert · 1:30 remaining
Understanding Join Types in Spark
Which join type would you use in Spark to keep all rows from the left dataframe and only matching rows from the right dataframe, filling unmatched right columns with nulls?
Attempts: 2 left
💡 Hint
Think about which join keeps all left rows regardless of matches.
✗ Incorrect
A left outer join (how="left") keeps all rows from the left DataFrame and fills unmatched right-side columns with nulls.