Challenge - 5 Problems
Spark Column Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
intermediate · 2:00 remaining
Output of adding a new column with a constant value
What is the output DataFrame after running this Spark code that adds a new column with a constant value?
Apache Spark
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
data = [(1, "Alice"), (2, "Bob")]
df = spark.createDataFrame(data, ["id", "name"])
df2 = df.withColumn("country", lit("USA"))
df2.show()
```
💡 Hint
Adding a column with lit() sets the same value for all rows.
💬 Explanation
The withColumn method adds a new column 'country' with the constant value 'USA' for every row.
❓ Data Output
intermediate · 2:00 remaining
Result of renaming a column in a Spark DataFrame
After renaming the column 'name' to 'first_name' in this DataFrame, what are the column names?
Apache Spark
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, "Alice"), (2, "Bob")]
df = spark.createDataFrame(data, ["id", "name"])
df_renamed = df.withColumnRenamed("name", "first_name")
print(df_renamed.columns)
```
💡 Hint
withColumnRenamed changes the name of one column only.
💬 Explanation
The column 'name' is renamed to 'first_name', so the DataFrame columns are ['id', 'first_name'].
🔧 Debug
advanced · 2:00 remaining
Identify the error when adding a column with an expression
What error does this code produce when trying to add a new column 'age_plus_ten' by adding 10 to the 'age' column?
Apache Spark
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 20), (2, 30)]
df = spark.createDataFrame(data, ["id", "age"])
df2 = df.withColumn("age_plus_ten", df["age"] + 10)
df2.show()
```
💡 Hint
Spark supports arithmetic operations on Column objects.
💬 Explanation
Trick question: this code raises no error. Adding an integer to a Spark Column object is valid and creates a new 'age_plus_ten' column containing the sum.
❓ Visualization
advanced · 2:00 remaining
Visualize the effect of renaming multiple columns
Given this DataFrame, what is the output of df_renamed.columns after renaming 'name' to 'first_name' and 'age' to 'years'?
Apache Spark
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, "Alice", 20), (2, "Bob", 30)]
df = spark.createDataFrame(data, ["id", "name", "age"])
df_renamed = df.withColumnRenamed("name", "first_name").withColumnRenamed("age", "years")
print(df_renamed.columns)
```
💡 Hint
Each withColumnRenamed changes one column name.
💬 Explanation
Both 'name' and 'age' columns are renamed, so the columns are ['id', 'first_name', 'years'].
🚀 Application
expert · 3:00 remaining
Add a new column based on condition and rename existing column
You have a DataFrame with columns 'id' and 'score'. You want to add a new column 'passed' that is True if score >= 50, else False. Then rename 'score' to 'exam_score'. Which code produces the correct final DataFrame?
Apache Spark
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import when

spark = SparkSession.builder.getOrCreate()
data = [(1, 45), (2, 75), (3, 50)]
df = spark.createDataFrame(data, ["id", "score"])
# Choose the correct option below
```
💡 Hint
Order matters: use original column name in condition before renaming.
💬 Explanation
The correct option adds 'passed' based on the condition 'score' >= 50 first, then renames 'score' to 'exam_score'. The incorrect options each break one requirement: one references 'score' after renaming it (causing an error), one tests > 50 instead of >= 50 (misclassifying a score of exactly 50), and one produces string values instead of booleans.