
Kafka integration with Hadoop - Mini Project: Build & Apply

📖 Scenario: You work at a company that collects real-time data streams from various sources. The data is sent through Kafka topics. Your task is to integrate Kafka with Hadoop to store and analyze this streaming data efficiently.
🎯 Goal: Build a simple pipeline that reads messages from a Kafka topic and writes them into Hadoop's HDFS for further analysis.
📋 What You'll Learn
Create a Kafka consumer configuration
Set up Hadoop file system path for data storage
Write code to consume messages from Kafka topic
Save consumed messages into HDFS files
💡 Why This Matters
🌍 Real World
Companies use Kafka to handle real-time data streams and Hadoop to store large volumes of data for analysis. Integrating the two enables efficient end-to-end data processing pipelines.
💼 Career
Data engineers and data scientists often build pipelines that connect streaming platforms like Kafka with big data storage systems like Hadoop.
Step 1: Create Kafka consumer configuration
Create a dictionary called kafka_config with these exact entries: 'bootstrap.servers': 'localhost:9092', 'group.id': 'hadoop_integration_group', and 'auto.offset.reset': 'earliest'.
Hint: Use a Python dictionary with the exact keys and values as specified.
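For reference, this step can be sketched as a plain dictionary (the dotted key names follow the confluent-kafka configuration style used in the instructions):

```python
# Consumer settings for connecting to a local Kafka broker
kafka_config = {
    'bootstrap.servers': 'localhost:9092',   # broker to bootstrap from
    'group.id': 'hadoop_integration_group',  # consumer group for offset tracking
    'auto.offset.reset': 'earliest',         # read from the beginning if no committed offset exists
}
```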

Step 2: Set Hadoop HDFS output path
Create a string variable called hdfs_output_path and set it to '/user/hadoop/kafka_data/'.
Hint: Assign the exact string path to the variable hdfs_output_path.
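This step is a single assignment; the trailing slash matters because a filename is appended to it later:

```python
# HDFS directory where consumed Kafka messages will be written
hdfs_output_path = '/user/hadoop/kafka_data/'
```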

Step 3: Consume messages from Kafka topic
Write code to create a Kafka consumer using kafka_config and subscribe to the topic 'sensor_readings'. Use a for loop with variable message to consume messages from the consumer. Append each message's value decoded as UTF-8 to a list called messages. Initialize messages as an empty list before the loop.
Hint: Use KafkaConsumer with the config values and subscribe to 'sensor_readings'. Decode each message's value as UTF-8 and append it to the messages list.
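One possible shape for this step, sketched with the kafka-python client (the timeout value and helper names are illustrative, not part of the exercise). Note that kafka-python's KafkaConsumer takes underscore-style parameters, so the dotted keys from kafka_config are mapped explicitly:

```python
def decode_value(raw: bytes) -> str:
    """Decode a raw Kafka message value as UTF-8 text."""
    return raw.decode('utf-8')

def consume_sensor_readings(config: dict) -> list:
    """Drain available messages from the 'sensor_readings' topic into a list."""
    # Imported inside the function so the helper above works even where
    # kafka-python is not installed (assumed client library)
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        'sensor_readings',
        bootstrap_servers=config['bootstrap.servers'],
        group_id=config['group.id'],
        auto_offset_reset=config['auto.offset.reset'],
        consumer_timeout_ms=10000,  # stop iterating after 10 s with no new messages
    )
    messages = []
    for message in consumer:
        messages.append(decode_value(message.value))
    consumer.close()
    return messages
```

In a production pipeline the loop would typically run until shutdown rather than timing out; the timeout keeps this exercise version finite.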

Step 4: Save consumed messages to HDFS
Import pyarrow.fs and create a Hadoop filesystem object called hdfs. Open a file at path hdfs_output_path + 'data.txt' in write mode. Write each message from the messages list to the file, each followed by a newline. Close the file after writing. Finally, print the number of messages saved using print(f"Saved {len(messages)} messages to HDFS").
Hint: Use pyarrow.fs.HadoopFileSystem to open an output stream and write each message followed by a newline, then print the confirmation message.
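A sketch of the write-out step (the NameNode host and port are assumptions, and pyarrow's HadoopFileSystem requires a working libhdfs / Hadoop client installation on the machine running the code):

```python
def format_lines(messages: list) -> bytes:
    """Newline-terminate each message and encode the result as UTF-8."""
    return ''.join(msg + '\n' for msg in messages).encode('utf-8')

def save_to_hdfs(messages: list, hdfs_output_path: str) -> None:
    """Write all messages to <hdfs_output_path>data.txt on HDFS."""
    # Imported inside the function so format_lines stays usable without pyarrow
    import pyarrow.fs

    # 'localhost'/8020 are assumed NameNode coordinates; adjust for your cluster
    hdfs = pyarrow.fs.HadoopFileSystem('localhost', port=8020)
    with hdfs.open_output_stream(hdfs_output_path + 'data.txt') as stream:
        stream.write(format_lines(messages))
    print(f"Saved {len(messages)} messages to HDFS")
```

Using a `with` block closes the stream automatically, which satisfies the "close the file after writing" requirement.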