
HBase vs HDFS comparison in Hadoop

Introduction

HBase and HDFS are both used to store big data, but they serve different purposes. Knowing their differences helps you choose the right tool for your data needs.

Use HBase when you need fast read and write access to large amounts of data with random access.
Use HDFS when you want to store huge files in a distributed way for batch processing.
Use HBase when you need a database-like system on top of Hadoop for real-time queries.
Use HDFS when you want to store data in files and process them later with tools like MapReduce.
Use HBase when you want to handle structured data with flexible schemas.
Syntax
HDFS:
- Stores data as large files split into blocks.
- Blocks are distributed across many machines.
- Good for batch processing.
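To make block splitting concrete, here is a minimal Python sketch (not part of Hadoop itself) that computes how a file would be divided under HDFS's default 128 MB block size:

```python
# Conceptual sketch: how HDFS splits a file into fixed-size blocks.
# 128 MB is the default block size in modern Hadoop versions.
BLOCK_SIZE = 128 * 1024 * 1024  # bytes

def split_into_blocks(file_size):
    """Return the byte sizes of the HDFS blocks for a file of file_size bytes."""
    full_blocks = file_size // BLOCK_SIZE
    blocks = [BLOCK_SIZE] * full_blocks
    remainder = file_size % BLOCK_SIZE
    if remainder:
        blocks.append(remainder)  # the last block can be smaller
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block
sizes = split_into_blocks(300 * 1024 * 1024)
print(len(sizes))  # 3
```

Each of these blocks would then be stored (and replicated) on different machines in the cluster.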

HBase:
- Stores data in tables with rows and columns.
- Built on top of HDFS.
- Supports fast random reads and writes.
- Uses column families to group data.
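The table layout above can be sketched with a plain Python dict. This is only a conceptual model of HBase's row key and column family structure, not real HBase code:

```python
# Conceptual model of an HBase table: rows keyed by a row key,
# with cells addressed as 'columnfamily:qualifier'.
users = {
    'user1': {'info:name': 'Alice', 'info:age': '30'},
    'user2': {'info:name': 'Bob',   'info:age': '25'},
}

# Random read: jump straight to a row by its key
print(users['user1']['info:name'])  # Alice

# Random write: update a single cell without touching other rows
users['user2']['info:age'] = '26'

# Column families group related columns under a common prefix
info_columns = [c for c in users['user1'] if c.startswith('info:')]
print(sorted(info_columns))  # ['info:age', 'info:name']
```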

HDFS is like a big file cabinet storing files across many computers.

HBase is like a spreadsheet on top of that cabinet, allowing quick lookups and updates.

Examples
HDFS stores a large video file split into blocks across machines. This allows processing the video in parts using batch jobs.

HBase stores user profiles in tables with columns like name, age, and location. You can quickly find or update a user's profile without scanning all data.

HDFS with no files (empty storage). HDFS can be empty initially and files are added as needed.

HBase with one row and one column family. Even a single row can be stored and accessed efficiently.
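The user-profile example hinges on the difference between a keyed lookup and a full scan. A small Python sketch (purely illustrative, not the HBase API) shows why random access wins when you know the row key:

```python
# Illustrative comparison: keyed lookup vs. scanning every record.
profiles = {f'user{i}': {'info:name': f'name{i}'} for i in range(1000)}

# HBase-style random read: one lookup by row key
fast = profiles['user500']['info:name']

# Batch-style access: no index, so finding one record
# means reading through the records until it appears
slow = None
for row_key, data in profiles.items():
    if row_key == 'user500':
        slow = data['info:name']
        break

print(fast == slow)  # True
```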
Sample Program

This code connects to HBase, creates a table, inserts a row, and prints data before and after insertion.

# Conceptual Python example using the happybase library to show HBase usage
import happybase

# Connect to HBase (the connection opens automatically by default)
connection = happybase.Connection('localhost')

# Create a table with one column family, unless it already exists
if b'users' not in connection.tables():
    connection.create_table('users', {'info': dict()})

# Get a handle to the table
table = connection.table('users')

# Scan the table before inserting
print('Before insert:')
for key, data in table.scan():
    print(key, data)

# Insert one row with two columns in the 'info' column family
table.put(b'user1', {b'info:name': b'Alice', b'info:age': b'30'})

# Scan again to see the new row
print('After insert:')
for key, data in table.scan():
    print(key, data)

connection.close()
Important Notes

HDFS is optimized for high throughput and large files, not for quick random access.

HBase provides low latency access but requires HDFS underneath.

Common mistake: Using HDFS when you need fast updates or random reads; use HBase instead.

Use HDFS for storing raw data files and HBase for real-time querying on top of that data.

Summary

HDFS stores big files across many machines for batch processing.

HBase stores data in tables for fast random access and updates.

HBase runs on top of HDFS and adds database-like features.