Challenge - 5 Problems
Column Expressions Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00
Output of chained column expressions
What is the output of the following Apache Spark code snippet?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
data = [(1, 10), (2, 20), (3, 30)]
df = spark.createDataFrame(data, ["id", "value"])
result = df.select((col("value") + 5).alias("value_plus_5"))
result.show()
💡 Hint
Look at how the column 'value' is transformed by adding 5 and aliased.
✅ Explanation
The code adds 5 to each value in the 'value' column and selects only this new column with alias 'value_plus_5'. The output shows the incremented values.
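The same arithmetic can be sketched in plain Python (no Spark runtime assumed) to check the expected result, mirroring the (id, value) rows from the snippet:

```python
# Plain-Python sketch of the Spark select: add 5 to each value,
# keeping only the derived column (here, a list of shifted values).
data = [(1, 10), (2, 20), (3, 30)]
value_plus_5 = [value + 5 for _, value in data]
print(value_plus_5)  # → [15, 25, 35]
```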
❓ Data Output
Intermediate · 2:00
Result of filtering with column functions
Given the DataFrame below, what is the output after applying the filter?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["name", "age"])
filtered_df = df.filter(col("age") > 28)
filtered_df.show()
💡 Hint
Filter keeps rows where age is greater than 28.
✅ Explanation
The filter condition selects rows with age > 28, so Bob (30) and Charlie (35) remain.
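A plain-Python sketch of the same filter predicate (no Spark runtime assumed) confirms which rows survive:

```python
# Plain-Python sketch of df.filter(col("age") > 28):
# keep only the rows whose age exceeds 28.
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
filtered = [(name, age) for name, age in data if age > 28]
print(filtered)  # → [('Bob', 30), ('Charlie', 35)]
```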
❓ Visualization
Advanced · 2:30
Visualizing aggregated data with groupBy and functions
You have a DataFrame with sales data. Which option correctly shows the output of this aggregation?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum

spark = SparkSession.builder.getOrCreate()
data = [("A", 100), ("B", 200), ("A", 300), ("B", 400)]
df = spark.createDataFrame(data, ["category", "sales"])
agg_df = df.groupBy("category").agg(sum("sales").alias("total_sales"))
agg_df.orderBy("category").show()
💡 Hint
Sum sales grouped by category and order by category.
✅ Explanation
The sum of sales for category A is 100 + 300 = 400, for B is 200 + 400 = 600.
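The groupBy-and-sum can be sketched in plain Python (no Spark runtime assumed) to verify those totals:

```python
# Plain-Python sketch of groupBy("category").agg(sum("sales")):
# accumulate a running total per category, then sort by category
# to match the orderBy("category") in the Spark snippet.
data = [("A", 100), ("B", 200), ("A", 300), ("B", 400)]
totals = {}
for category, sales in data:
    totals[category] = totals.get(category, 0) + sales
print(sorted(totals.items()))  # → [('A', 400), ('B', 600)]
```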
🔧 Debug
Advanced · 2:00
Identify the error in column expression
What error does the following code produce?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
data = [(1, 2), (3, 4)]
df = spark.createDataFrame(data, ["a", "b"])
result = df.select(col("a") + col("c"))
result.show()
💡 Hint
Check if column 'c' exists in the DataFrame.
✅ Explanation
Column 'c' does not exist in the DataFrame, so Spark raises an AnalysisException.
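The failure mode can be mimicked in plain Python (no Spark runtime assumed; the `resolve` helper and `KeyError` are stand-ins for Spark's column resolution and `AnalysisException`):

```python
# Sketch of the unresolved-column check: the DataFrame only has
# columns "a" and "b", so asking for "c" fails at analysis time.
columns = {"a": [1, 3], "b": [2, 4]}

def resolve(name):
    # Hypothetical helper: Spark raises AnalysisException here instead.
    if name not in columns:
        raise KeyError(f"Column '{name}' cannot be resolved")
    return columns[name]
```

In real PySpark code, checking `"c" in df.columns` before building the expression avoids the exception.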
🚀 Application
Expert · 3:00
Calculate new column with conditional logic
Which option correctly creates a new column 'status' with value 'adult' if age >= 18, else 'minor'?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, col

spark = SparkSession.builder.getOrCreate()
data = [("John", 17), ("Jane", 20)]
df = spark.createDataFrame(data, ["name", "age"])
result = df.withColumn("status", ???)
result.select("name", "age", "status").show()
💡 Hint
Use Spark SQL functions for conditional column creation.
✅ Explanation
Option D uses the correct Spark function 'when' with 'otherwise' to create the conditional column.
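The standard Spark expression for this pattern is `when(col("age") >= 18, "adult").otherwise("minor")`. A plain-Python sketch of the same conditional (no Spark runtime assumed) shows the expected statuses:

```python
# Plain-Python equivalent of when(col("age") >= 18, "adult").otherwise("minor"):
# derive a status per row from the age column.
data = [("John", 17), ("Jane", 20)]
result = [(name, age, "adult" if age >= 18 else "minor") for name, age in data]
print(result)  # → [('John', 17, 'minor'), ('Jane', 20, 'adult')]
```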