Apache Spark · ~20 mins

Column expressions and functions in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️ Column Expressions Master
Get all challenges correct to earn this badge. Test your skills under time pressure!
Predict Output · intermediate · Time limit: 2:00
Output of chained column expressions
What is the output of the following Apache Spark code snippet?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
data = [(1, 10), (2, 20), (3, 30)]
df = spark.createDataFrame(data, ["id", "value"])

result = df.select((col("value") + 5).alias("value_plus_5"))
result.show()
A
+------------+
|value_plus_5|
+------------+
|          15|
|          25|
|          35|
+------------+
B
+-----+-----+
|   id|value|
+-----+-----+
|    1|   10|
|    2|   20|
|    3|   30|
+-----+-----+
C
+------------+
|value_plus_5|
+------------+
|           5|
|          15|
|          25|
+------------+
D
SyntaxError: invalid syntax
💡 Hint
Look at how the column 'value' is transformed by adding 5 and aliased.
Data Output · intermediate · Time limit: 2:00
Result of filtering with column functions
Given the DataFrame below, what is the output after applying the filter?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["name", "age"])

filtered_df = df.filter(col("age") > 28)
filtered_df.show()
A
+-------+---+
|   name|age|
+-------+---+
|    Bob| 30|
|Charlie| 35|
+-------+---+
B
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
+-------+---+
C
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
+-------+---+
D
TypeError: '>' not supported between instances of 'Column' and 'int'
💡 Hint
Filter keeps rows where age is greater than 28.
Visualization · advanced · Time limit: 2:30
Visualizing aggregated data with groupBy and functions
You have a DataFrame with sales data. Which option correctly shows the output of this aggregation?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum

spark = SparkSession.builder.getOrCreate()
data = [("A", 100), ("B", 200), ("A", 300), ("B", 400)]
df = spark.createDataFrame(data, ["category", "sales"])

agg_df = df.groupBy("category").agg(sum("sales").alias("total_sales"))
agg_df.orderBy("category").show()
A
+--------+-----------+
|category|total_sales|
+--------+-----------+
|       A|        300|
|       B|        400|
+--------+-----------+
B
+--------+-----------+
|category|total_sales|
+--------+-----------+
|       A|        400|
|       B|        600|
+--------+-----------+
C
+--------+-----------+
|category|total_sales|
+--------+-----------+
|       A|        100|
|       B|        200|
+--------+-----------+
D
AttributeError: 'DataFrame' object has no attribute 'groupby'
💡 Hint
Sum sales grouped by category and order by category.
🔧 Debug · advanced · Time limit: 2:00
Identify the error in column expression
What error does the following code produce?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
data = [(1, 2), (3, 4)]
df = spark.createDataFrame(data, ["a", "b"])

result = df.select(col("a") + col("c"))
result.show()
A
TypeError: unsupported operand type(s) for +: 'Column' and 'Column'
B
NameError: name 'col' is not defined
C
AnalysisException: cannot resolve '`c`' given input columns: [a, b]
D
No error, outputs sum of columns 'a' and 'c'
💡 Hint
Check if column 'c' exists in the DataFrame.
🚀 Application · expert · Time limit: 3:00
Calculate new column with conditional logic
Which option correctly creates a new column 'status' with value 'adult' if age >= 18, else 'minor'?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()
data = [("John", 17), ("Jane", 20)]
df = spark.createDataFrame(data, ["name", "age"])

result = df.withColumn("status", ???)
result.select("name", "age", "status").show()
A
case when col("age") >= 18 then "adult" else "minor" end
B
col("age") >= 18 ? "adult" : "minor"
C
if(col("age") >= 18, "adult", "minor")
D
when(col("age") >= 18, "adult").otherwise("minor")
💡 Hint
Use Spark SQL functions for conditional column creation.