Challenge - 5 Problems
Integration Testing Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
intermediate · 2:00 remaining
Output of Spark DataFrame after pipeline transformation
Given the following Spark pipeline code, what is the output DataFrame content after applying the transformations?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

spark = SparkSession.builder.master('local').appName('Test').getOrCreate()

data = [(0, 'cat'), (1, 'dog'), (2, 'cat'), (3, 'bird')]
schema = ['id', 'animal']
df = spark.createDataFrame(data, schema)

indexer = StringIndexer(inputCol='animal', outputCol='animalIndex')
assembler = VectorAssembler(inputCols=['animalIndex'], outputCol='features')
pipeline = Pipeline(stages=[indexer, assembler])

model = pipeline.fit(df)
result = model.transform(df).select('id', 'animal', 'animalIndex', 'features')
result.show()
Attempts: 2 left
💡 Hint
Remember that StringIndexer assigns indices in descending order of label frequency.
✗ Incorrect
StringIndexer assigns index 0.0 to the most frequent label: 'cat' appears twice, so it gets 0.0. 'dog' and 'bird' each appear once; under the default frequencyDesc ordering, equal-frequency labels are further sorted alphabetically, so 'bird' gets 1.0 and 'dog' gets 2.0. VectorAssembler then wraps the animalIndex value into a single-element vector in the features column.
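The ordering rule can be checked without Spark. Below is a plain-Python sketch (not Spark itself) of the documented frequencyDesc ordering with alphabetical tie-breaking; `string_indexer_order` is a hypothetical helper name, not a Spark API:

```python
from collections import Counter

def string_indexer_order(labels):
    # frequencyDesc: most frequent label gets the lowest index;
    # equal frequencies are broken alphabetically, as in Spark 3.0+.
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lbl: (-counts[lbl], lbl))
    return {lbl: float(i) for i, lbl in enumerate(ordered)}

mapping = string_indexer_order(['cat', 'dog', 'cat', 'bird'])
# 'cat' is most frequent -> 0.0; 'bird' and 'dog' tie, broken alphabetically
```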
❓ data_output
intermediate · 1:30 remaining
Number of rows after filtering in Spark pipeline
Consider a Spark DataFrame with 1000 rows. A pipeline stage filters rows where the column 'age' is greater than 30. If 400 rows have age > 30, how many rows remain after the pipeline transformation?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, Transformer
from pyspark.sql.functions import col

spark = SparkSession.builder.master('local').appName('Test').getOrCreate()

data = [(i, 20 + (i % 50)) for i in range(1000)]
schema = ['id', 'age']
df = spark.createDataFrame(data, schema)

class AgeFilter(Transformer):
    # A Pipeline only accepts Estimator/Transformer stages, so the custom
    # filter must subclass Transformer and implement _transform.
    def __init__(self, age_limit):
        super().__init__()
        self.age_limit = age_limit

    def _transform(self, df):
        return df.filter(col('age') > self.age_limit)

filter_stage = AgeFilter(30)
pipeline = Pipeline(stages=[filter_stage])
model = pipeline.fit(df)
result = model.transform(df)
count = result.count()
Attempts: 2 left
💡 Hint
Count how many rows satisfy the condition age > 30.
✗ Incorrect
A filter keeps exactly the rows that satisfy its predicate, so if 400 rows have age > 30, 400 rows remain. Note that the sample code does not actually produce 400: ages cycle from 20 to 69, each of the 50 values occurring 20 times, and 39 of those values (31 through 69) exceed 30, so the code counts 39 × 20 = 780 matching rows.
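The count from the sample data can be reproduced without a Spark session. A minimal plain-Python sketch of the same generator and predicate:

```python
# Same data generator as the Spark snippet: ages cycle through 20..69,
# each value occurring 1000 / 50 = 20 times.
ages = [20 + (i % 50) for i in range(1000)]

# The filter predicate keeps ages strictly greater than 30,
# i.e. the 39 values 31..69, each occurring 20 times.
matching = sum(1 for a in ages if a > 30)
# -> 39 * 20 = 780
```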
🔧 Debug
advanced · 2:00 remaining
Identify the error in Spark pipeline code
What error does the following Spark pipeline code raise when executed?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.master('local').appName('Test').getOrCreate()

data = [(0, 'apple'), (1, 'banana')]
schema = ['id', 'fruit']
df = spark.createDataFrame(data, schema)

indexer = StringIndexer(inputCol='fruit', outputCol='fruitIndex')
pipeline = Pipeline(stages=[indexer])

model = pipeline.fit(df)
result = model.transform(df).select('id', 'fruitIndex')
result.show()
Attempts: 2 left
💡 Hint
Check if the pipeline stages are correctly defined and used.
✗ Incorrect
The code raises no error. The pipeline fits and transforms the DataFrame correctly, producing an output with 'id' and 'fruitIndex' columns; since 'apple' and 'banana' each appear once, the alphabetical tie-break assigns 'apple' 0.0 and 'banana' 1.0.
🚀 Application
advanced · 1:30 remaining
Best approach to test Spark pipeline integration
Which approach is best to test the integration of multiple Spark pipeline stages in a data science project?
Attempts: 2 left
💡 Hint
Integration testing means testing combined parts together.
✗ Incorrect
Integration testing verifies that all pipeline stages work together correctly by running the full pipeline on sample data and checking outputs.
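The end-to-end style described above can be sketched with plain Python, leaving Spark out so the example stays self-contained. The stage functions (`index_labels`, `assemble_features`) are hypothetical stand-ins for pipeline stages, not Spark APIs:

```python
def index_labels(rows):
    # Stage 1 (hypothetical): map each label to a stable integer index.
    mapping = {lbl: i for i, lbl in enumerate(sorted({r['animal'] for r in rows}))}
    return [{**r, 'animalIndex': mapping[r['animal']]} for r in rows]

def assemble_features(rows):
    # Stage 2 (hypothetical): collect numeric columns into a feature list.
    return [{**r, 'features': [r['animalIndex']]} for r in rows]

def run_pipeline(rows):
    # The full pipeline is the composition of all stages.
    return assemble_features(index_labels(rows))

def test_pipeline_end_to_end():
    # Integration test: run the whole pipeline on sample data and
    # check the combined output, not each stage in isolation.
    sample = [{'id': 0, 'animal': 'cat'}, {'id': 1, 'animal': 'dog'}]
    result = run_pipeline(sample)
    assert all('features' in r for r in result)
    assert result[0]['features'] == [result[0]['animalIndex']]

test_pipeline_end_to_end()
```

The same shape carries over to Spark: build the real `Pipeline`, fit and transform a small known DataFrame, and assert on the resulting columns and values.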
🧠 Conceptual
expert · 2:00 remaining
Effect of caching in Spark pipeline integration tests
In Spark pipeline integration tests, what is the main benefit of caching intermediate DataFrames during the pipeline execution?
Attempts: 2 left
💡 Hint
Think about how caching affects performance in Spark.
✗ Incorrect
Caching persists intermediate results in memory, so repeated actions during tests reuse them instead of recomputing the DataFrame's lineage each time, reducing overall test time.
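The compute-once, reuse-many-times behaviour can be illustrated with a plain-Python analogue of `df.cache()` (a sketch, not Spark itself; the names below are invented for the example):

```python
calls = {'count': 0}

def expensive_transform(x):
    # Stands in for a Spark transformation that would otherwise be
    # recomputed from lineage on every action.
    calls['count'] += 1
    return x * 2

cache = {}

def cached_transform(x):
    # Analogue of caching: compute once, then serve later accesses
    # from the stored result.
    if x not in cache:
        cache[x] = expensive_transform(x)
    return cache[x]

for _ in range(3):
    cached_transform(10)

# Three accesses, but the expensive computation ran only once.
assert calls['count'] == 1
```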