
Kafka integration with Hadoop - Mini Project: Build & Apply

📖 Scenario: You work at a company that collects real-time data streams from various sources. The data is sent through Kafka topics. Your task is to integrate Kafka with Hadoop to store and analyze this streaming data efficiently.
🎯 Goal: Build a simple pipeline that reads messages from a Kafka topic and writes them into Hadoop's HDFS for further analysis.
📋 What You'll Learn
Create a Kafka consumer configuration
Set up Hadoop file system path for data storage
Write code to consume messages from Kafka topic
Save consumed messages into HDFS files
💡 Why This Matters
🌍 Real World
Companies use Kafka to handle real-time data streams and Hadoop to store large volumes of data for analysis. Integrating the two enables efficient end-to-end data processing pipelines.
💼 Career
Data engineers and data scientists often build pipelines that connect streaming platforms like Kafka with big data storage systems like Hadoop.
Step 1: Create Kafka consumer configuration
Create a dictionary called kafka_config with these exact entries: 'bootstrap.servers': 'localhost:9092', 'group.id': 'hadoop_integration_group', and 'auto.offset.reset': 'earliest'.
Hint: Use a Python dictionary with the exact keys and values as specified.
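For reference, this step can be sketched as a plain dictionary (the dotted key names follow the confluent-kafka configuration style used in the instructions):

```python
# Consumer settings for connecting to a local Kafka broker
kafka_config = {
    'bootstrap.servers': 'localhost:9092',   # broker to bootstrap from
    'group.id': 'hadoop_integration_group',  # consumer group for offset tracking
    'auto.offset.reset': 'earliest',         # read from the beginning if no committed offset exists
}
```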

Step 2: Set Hadoop HDFS output path
Create a string variable called hdfs_output_path and set it to '/user/hadoop/kafka_data/'.
Hint: Assign the exact string path to the variable hdfs_output_path.
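This step is a single assignment; the trailing slash matters because a filename is appended to it later:

```python
# HDFS directory where consumed Kafka messages will be written
hdfs_output_path = '/user/hadoop/kafka_data/'
```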

Step 3: Consume messages from Kafka topic
Write code to create a Kafka consumer using kafka_config and subscribe to the topic 'sensor_readings'. Use a for loop with variable message to consume messages from the consumer. Append each message's value decoded as UTF-8 to a list called messages. Initialize messages as an empty list before the loop.
Hint: Use KafkaConsumer with the config values and subscribe to 'sensor_readings'. Decode each message's value as UTF-8 and append it to the messages list.
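One possible shape for this step, sketched with the kafka-python client (the timeout value and helper names are illustrative, not part of the exercise). Note that kafka-python's KafkaConsumer takes underscore-style parameters, so the dotted keys from kafka_config are mapped explicitly:

```python
def decode_value(raw: bytes) -> str:
    """Decode a raw Kafka message value as UTF-8 text."""
    return raw.decode('utf-8')

def consume_sensor_readings(config: dict) -> list:
    """Drain available messages from the 'sensor_readings' topic into a list."""
    # Imported inside the function so the helper above works even where
    # kafka-python is not installed (assumed client library)
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        'sensor_readings',
        bootstrap_servers=config['bootstrap.servers'],
        group_id=config['group.id'],
        auto_offset_reset=config['auto.offset.reset'],
        consumer_timeout_ms=10000,  # stop iterating after 10 s with no new messages
    )
    messages = []
    for message in consumer:
        messages.append(decode_value(message.value))
    consumer.close()
    return messages
```

In a production pipeline the loop would typically run until shutdown rather than timing out; the timeout keeps this exercise version finite.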

Step 4: Save consumed messages to HDFS
Import pyarrow.fs and create a Hadoop filesystem object called hdfs. Open a file at path hdfs_output_path + 'data.txt' in write mode. Write each message from the messages list to the file, each followed by a newline. Close the file after writing. Finally, print the number of messages saved using print(f"Saved {len(messages)} messages to HDFS").
Hint: Use pyarrow.fs.HadoopFileSystem to open an output stream and write each message followed by a newline, then print the confirmation message.
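A sketch of the write-out step (the NameNode host and port are assumptions, and pyarrow's HadoopFileSystem requires a working libhdfs / Hadoop client installation on the machine running the code):

```python
def format_lines(messages: list) -> bytes:
    """Newline-terminate each message and encode the result as UTF-8."""
    return ''.join(msg + '\n' for msg in messages).encode('utf-8')

def save_to_hdfs(messages: list, hdfs_output_path: str) -> None:
    """Write all messages to <hdfs_output_path>data.txt on HDFS."""
    # Imported inside the function so format_lines stays usable without pyarrow
    import pyarrow.fs

    # 'localhost'/8020 are assumed NameNode coordinates; adjust for your cluster
    hdfs = pyarrow.fs.HadoopFileSystem('localhost', port=8020)
    with hdfs.open_output_stream(hdfs_output_path + 'data.txt') as stream:
        stream.write(format_lines(messages))
    print(f"Saved {len(messages)} messages to HDFS")
```

Using a `with` block closes the stream automatically, which satisfies the "close the file after writing" requirement.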