Why HDFS Handles Petabyte-Scale Storage
📖 Scenario: Imagine you work for a company that collects huge amounts of data every day, like videos, logs, and sensor readings. You need a way to store all this data safely and access it quickly, even when it grows to petabytes (millions of gigabytes).
🎯 Goal: You will create a simple simulation to understand how HDFS (Hadoop Distributed File System) manages very large data by splitting it into blocks and storing copies across many machines.
📋 What You'll Learn
Create a dictionary to represent files and their sizes in gigabytes
Set a block size variable to split files into blocks
Calculate how many blocks each file needs
Print the number of blocks per file to see how HDFS handles large data
💡 Why This Matters
🌍 Real World
Big companies like Netflix and Facebook store massive data using HDFS to keep it safe and accessible.
💼 Career
Understanding HDFS block management is key for roles in big data engineering and data infrastructure.
Step 1: Create a dictionary of files with their sizes
Create a dictionary called files with these exact entries: 'video1.mp4': 1500, 'log_data.txt': 3000, 'sensor_readings.csv': 500. The sizes are in gigabytes.
Hint: Use curly braces {} to create a dictionary with file names as keys and sizes as values.
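A minimal sketch of this step, using exactly the entries listed above:

```python
# Dictionary mapping each file name to its size in gigabytes
files = {
    'video1.mp4': 1500,
    'log_data.txt': 3000,
    'sensor_readings.csv': 500,
}

print(files)
```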

Step 2: Set the block size for splitting files
Create a variable called block_size_gb and set it to 128 to represent the block size in gigabytes.
Hint: Just assign the number 128 to the variable block_size_gb.
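This step is a single assignment. Note that real HDFS defaults to 128 *megabyte* blocks; this simulation uses gigabytes to keep the numbers small:

```python
# Block size for the simulation, in gigabytes.
# (Real HDFS uses a default block size of 128 MB; we scale up to GB
# here so the example files need only a handful of blocks.)
block_size_gb = 128
```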

Step 3: Calculate the number of blocks for each file
Create a new dictionary called file_blocks. Use a for loop with variables file_name and size to iterate over files.items(). For each file, calculate the number of blocks by dividing size by block_size_gb and rounding up using math.ceil(). Store the result in file_blocks[file_name]. Import the math module first.
Hint: Use math.ceil() to round up the division result so partial blocks count as full blocks.
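Putting the loop together (repeating the setup from steps 1 and 2 so the snippet runs on its own):

```python
import math

files = {
    'video1.mp4': 1500,
    'log_data.txt': 3000,
    'sensor_readings.csv': 500,
}
block_size_gb = 128

# math.ceil() rounds up, so a file that only partially fills
# its last block still counts that block as used.
file_blocks = {}
for file_name, size in files.items():
    file_blocks[file_name] = math.ceil(size / block_size_gb)

print(file_blocks)
# {'video1.mp4': 12, 'log_data.txt': 24, 'sensor_readings.csv': 4}
```

For example, video1.mp4 is 1500 GB, and 1500 / 128 ≈ 11.7, which rounds up to 12 blocks.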

Step 4: Print the number of blocks per file
Use a for loop with variables file_name and blocks to iterate over file_blocks.items(). Print each file name and its number of blocks in the format: File: video1.mp4, Blocks: 12.
Hint: Use an f-string inside the print statement to format the output exactly as shown.
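The final step loops over the results and formats each line with an f-string. A complete, self-contained version of all four steps:

```python
import math

files = {
    'video1.mp4': 1500,
    'log_data.txt': 3000,
    'sensor_readings.csv': 500,
}
block_size_gb = 128

# Blocks per file, rounding partial blocks up to full blocks
file_blocks = {}
for file_name, size in files.items():
    file_blocks[file_name] = math.ceil(size / block_size_gb)

# Print each file and its block count in the required format
for file_name, blocks in file_blocks.items():
    print(f"File: {file_name}, Blocks: {blocks}")
# File: video1.mp4, Blocks: 12
# File: log_data.txt, Blocks: 24
# File: sensor_readings.csv, Blocks: 4
```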