Apache Spark | data | ~20 mins

Adding and renaming columns in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of adding a new column with a constant value
What rows does df2 contain after this Spark code adds a new column with a constant value?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
data = [(1, "Alice"), (2, "Bob")]
df = spark.createDataFrame(data, ["id", "name"])
df2 = df.withColumn("country", lit("USA"))
df2.show()
A. [Row(id=1, name='Alice', country='USA'), Row(id=2, name='Bob', country='USA')]
B. [Row(id=1, name='Alice'), Row(id=2, name='Bob')]
C. [Row(id=1, name='Alice', country=None), Row(id=2, name='Bob', country=None)]
D. SyntaxError
💡 Hint
Adding a column with lit() sets the same value for all rows.
Data Output
intermediate
Result of renaming a column in a Spark DataFrame
After renaming the column 'name' to 'first_name' in this DataFrame, what are the column names?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, "Alice"), (2, "Bob")]
df = spark.createDataFrame(data, ["id", "name"])
df_renamed = df.withColumnRenamed("name", "first_name")
print(df_renamed.columns)
A. ['id', 'name', 'first_name']
B. ['id', 'first_name']
C. ['first_name', 'id']
D. ['id', 'name']
💡 Hint
withColumnRenamed changes the name of one column only.
🔧 Debug
advanced
Identify the error when adding a column with an expression
What error does this code produce when trying to add a new column 'age_plus_ten' by adding 10 to the 'age' column?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 20), (2, 30)]
df = spark.createDataFrame(data, ["id", "age"])
df2 = df.withColumn("age_plus_ten", df["age"] + 10)
df2.show()
A. TypeError: unsupported operand type(s) for +: 'Column' and 'int'
B. SyntaxError
C. NameError: name 'df' is not defined
D. No error, outputs DataFrame with 'age_plus_ten' column
💡 Hint
Spark supports arithmetic operations on Column objects.
Visualization
advanced
Visualize the effect of renaming multiple columns
Given this DataFrame, what is the output of df_renamed.columns after renaming 'name' to 'first_name' and 'age' to 'years'?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, "Alice", 20), (2, "Bob", 30)]
df = spark.createDataFrame(data, ["id", "name", "age"])
df_renamed = df.withColumnRenamed("name", "first_name").withColumnRenamed("age", "years")
print(df_renamed.columns)
A. ['id', 'first_name', 'years']
B. ['id', 'name', 'age']
C. ['first_name', 'years', 'id']
D. ['id', 'first_name', 'age']
💡 Hint
Each withColumnRenamed changes one column name.
🚀 Application
expert
Add a new column based on condition and rename existing column
You have a DataFrame with columns 'id' and 'score'. You want to add a new column 'passed' that is True if score >= 50, else False. Then rename 'score' to 'exam_score'. Which code produces the correct final DataFrame?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import when

spark = SparkSession.builder.getOrCreate()
data = [(1, 45), (2, 75), (3, 50)]
df = spark.createDataFrame(data, ["id", "score"])

# Choose the correct option below
A
df2 = df.withColumn('passed', when(df['score'] >= 50, True).otherwise(False)).withColumnRenamed('score', 'exam_score')
df2.show()
B
df2 = df.withColumnRenamed('score', 'exam_score').withColumn('passed', when(df['score'] >= 50, True).otherwise(False))
df2.show()
C
df2 = df.withColumn('passed', when(df['score'] >= 50, True).otherwise(False)).withColumnRenamed('score', 'exam_score')
df2.show()
D
df2 = df.withColumn('passed', when(df['score'] >= 50, 'Yes').otherwise('No')).withColumnRenamed('score', 'exam_score')
df2.show()
💡 Hint
Order matters: build the condition on the original column name 'score' before renaming that column.