Challenge - 5 Problems
Spark Column Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
intermediate · 2:00 remaining
Output of adding a new column with a constant value
What is the output DataFrame after running this Spark code that adds a new column with a constant value?
Apache Spark
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
data = [(1, "Alice"), (2, "Bob")]
df = spark.createDataFrame(data, ["id", "name"])
df2 = df.withColumn("country", lit("USA"))
df2.show()
```
💡 Hint
Adding a column with lit() sets the same value for all rows.
💬 Explanation
The withColumn method adds a new column 'country' with the constant value 'USA' for every row.
❓ Data Output
intermediate · 2:00 remaining
Result of renaming a column in a Spark DataFrame
After renaming the column 'name' to 'first_name' in this DataFrame, what are the column names?
Apache Spark
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, "Alice"), (2, "Bob")]
df = spark.createDataFrame(data, ["id", "name"])
df_renamed = df.withColumnRenamed("name", "first_name")
print(df_renamed.columns)
```
💡 Hint
withColumnRenamed changes the name of one column only.
💬 Explanation
The column 'name' is renamed to 'first_name', so the DataFrame columns are ['id', 'first_name'].
🔧 Debug
advanced · 2:00 remaining
Identify the error when adding a column with an expression
What error does this code produce when trying to add a new column 'age_plus_ten' by adding 10 to the 'age' column?
Apache Spark
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 20), (2, 30)]
df = spark.createDataFrame(data, ["id", "age"])
df2 = df.withColumn("age_plus_ten", df["age"] + 10)
df2.show()
```
💡 Hint
Spark supports arithmetic operations on Column objects.
💬 Explanation
Trick question: this code raises no error. Adding an integer to a Spark Column object is valid and creates a new 'age_plus_ten' column containing the sum.
❓ Visualization
advanced · 2:00 remaining
Visualize the effect of renaming multiple columns
Given this DataFrame, what is the output of df_renamed.columns after renaming 'name' to 'first_name' and 'age' to 'years'?
Apache Spark
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, "Alice", 20), (2, "Bob", 30)]
df = spark.createDataFrame(data, ["id", "name", "age"])
df_renamed = df.withColumnRenamed("name", "first_name").withColumnRenamed("age", "years")
print(df_renamed.columns)
```
💡 Hint
Each withColumnRenamed changes one column name.
💬 Explanation
Both 'name' and 'age' columns are renamed, so the columns are ['id', 'first_name', 'years'].
🚀 Application
expert · 3:00 remaining
Add a new column based on condition and rename existing column
You have a DataFrame with columns 'id' and 'score'. You want to add a new column 'passed' that is True if score >= 50, else False. Then rename 'score' to 'exam_score'. Which code produces the correct final DataFrame?
Apache Spark
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import when

spark = SparkSession.builder.getOrCreate()
data = [(1, 45), (2, 75), (3, 50)]
df = spark.createDataFrame(data, ["id", "score"])
# Choose the correct option below
```
💡 Hint
Order matters: use original column name in condition before renaming.
💬 Explanation
The correct option adds 'passed' based on the condition 'score' >= 50 first, then renames 'score' to 'exam_score'. The incorrect options each break one requirement: one references 'score' after renaming it (causing an error), one tests > 50 instead of >= 50 (misclassifying a score of exactly 50), and one produces string values instead of booleans.