Apache Spark · ~10 mins

Reading from Kafka with Apache Spark - Step-by-Step Execution

Concept Flow - Reading from Kafka with Spark
1. Start Spark Session
2. Set Kafka Configurations
3. Create Spark DataFrame from Kafka
4. Select and Cast Kafka Data
5. Process or Show Data
6. Stop Spark Session
This flow shows how Spark connects to Kafka, reads data, processes it, and then stops.
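The six steps above can be sketched without a running cluster. The snippet below simulates a Kafka topic with hypothetical in-memory records (the data and names are made up for illustration) and mirrors the load-cast-show sequence in plain Python.

```python
# Hypothetical records standing in for what Spark's load() step surfaces:
# Kafka hands over binary key/value plus topic/partition/offset metadata.
fake_topic = [
    {"key": b"k1", "value": b"hello", "topic": "test-topic", "partition": 0, "offset": 0},
    {"key": b"k2", "value": b"world", "topic": "test-topic", "partition": 0, "offset": 1},
]

# Steps 4-5: cast (decode) the binary key/value to strings, then show them.
rows = [(r["key"].decode("utf-8"), r["value"].decode("utf-8")) for r in fake_topic]
for key, value in rows:
    print(key, value)
```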
Execution Sample
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaExample").getOrCreate()

# Requires the spark-sql-kafka-0-10 connector package on the classpath.
kafka_df = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "test-topic")
    .load()
)

selected_df = kafka_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
selected_df.show()

spark.stop()
This code connects Spark to Kafka, reads messages from 'test-topic', and shows key and value as strings.
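Per Spark's Kafka integration, the loaded DataFrame always exposes a fixed set of columns. The mapping below lists them as plain data, so the shape can be checked without a cluster:

```python
# Columns exposed by Spark's "kafka" source (Structured Streaming + Kafka
# integration): key and value arrive as binary, which is why the example
# casts them to strings before showing them.
kafka_source_columns = {
    "key": "binary",
    "value": "binary",
    "topic": "string",
    "partition": "int",
    "offset": "long",
    "timestamp": "timestamp",
    "timestampType": "int",
}

# Only the binary columns need casting to become human-readable.
needs_cast = [name for name, dtype in kafka_source_columns.items() if dtype == "binary"]
```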
Execution Table
Step | Action | Spark Variable | Result/Output
1 | Start Spark Session | spark | SparkSession object created
2 | Configure Kafka source | kafka_df | DataFrame configured to read from Kafka topic 'test-topic'
3 | Load data from Kafka | kafka_df | DataFrame loaded with Kafka messages (key, value, topic, partition, offset, timestamp)
4 | Select and cast key and value | selected_df | DataFrame with key and value as strings
5 | Show data | Output | Rows with key and value columns as strings
6 | Stop Spark Session | spark | SparkSession stopped
Exit | End of process | - | No more data to read or process
💡 Spark session stopped, no further Kafka data read
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | After Step 4 | After Step 5 | Final
spark | None | SparkSession created | SparkSession active | SparkSession active | SparkSession active | Stopped
kafka_df | None | Configured | Loaded with Kafka data | Loaded with Kafka data | Loaded with Kafka data | Released
selected_df | None | None | None | DataFrame with casted key/value | Same as previous | Released
Key Moments - 3 Insights
Why do we need to cast the key and value from Kafka data?
Kafka data comes as binary by default. Casting to string makes it readable and usable, as shown in step 4 of the execution table.
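The point can be seen with plain Python bytes: a Kafka payload is only readable after decoding, and a payload that is not valid UTF-8 shows why the cast/decode step deserves deliberate handling (the sample bytes below are made up).

```python
good = b'{"id": 1}'   # a UTF-8 JSON payload, as Kafka would deliver it
bad = b"\xff\xfe"     # bytes that are not valid UTF-8

assert good.decode("utf-8") == '{"id": 1}'

# A strict decode of malformed bytes raises UnicodeDecodeError;
# errors="replace" degrades gracefully by substituting the Unicode
# replacement character instead.
decoded = bad.decode("utf-8", errors="replace")
```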
What happens if the Kafka topic does not exist or is unreachable?
Spark will throw an error or hang at the load step (step 3), because it cannot connect to the broker or find the topic to read data from.
Why do we stop the Spark session at the end?
Stopping the Spark session (step 6) frees resources and ends the streaming or batch job cleanly.
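One common way to guarantee that cleanup happens even when a step fails is a try/finally block. The sketch below uses a stand-in class (not a real SparkSession) so the pattern can run anywhere:

```python
class FakeSession:
    """Stand-in for SparkSession, only to demonstrate the cleanup pattern."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

session = FakeSession()
try:
    # ... read from Kafka, transform, show ...
    raise RuntimeError("simulated mid-job failure")
except RuntimeError:
    pass  # the job failed, but the cleanup below still runs
finally:
    session.stop()  # runs on success and on failure alike
```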
Visual Quiz - 3 Questions
Test your understanding
Looking at the execution table, what is the state of 'kafka_df' after step 3?
A. Configured but not loaded
B. Loaded with Kafka data
C. Stopped
D. None
💡 Hint
Check the 'kafka_df' column in the execution table row for step 3.
At which step do we convert Kafka binary data to readable strings?
A. Step 2
B. Step 3
C. Step 4
D. Step 5
💡 Hint
Look for 'Select and cast key and value' in the execution table.
If we do not stop the Spark session, what is the likely outcome?
A. Spark session continues running and holds resources
B. Spark session automatically stops
C. Kafka stops sending data
D. Data is lost
💡 Hint
Refer to the Key Moments explanation about stopping the Spark session.
Concept Snapshot
Reading from Kafka with Spark:
- Start SparkSession
- Configure Kafka source with bootstrap servers and topic
- Load Kafka data as DataFrame
- Cast key and value from binary to string
- Process or show data
- Stop SparkSession to release resources
Full Transcript
This visual execution shows how to read data from Kafka using Apache Spark. First, we start a Spark session. Then, we configure the Kafka source by specifying the Kafka servers and the topic to subscribe to. Next, we load the data from Kafka into a Spark DataFrame. Since Kafka data is in binary format, we cast the key and value columns to strings to make them readable. After that, we can process or display the data. Finally, we stop the Spark session to free resources. The execution table traces each step and variable state, helping beginners understand the flow and transformations.