
CRUD operations in HBase in Hadoop - Deep Dive

Overview - CRUD operations in HBase
What is it?
CRUD operations in HBase refer to the basic actions of Creating, Reading, Updating, and Deleting data in HBase tables. HBase is a distributed, scalable database built on top of Hadoop that stores data in a column-oriented way. These operations allow users to manage data efficiently in large datasets spread across many machines. Understanding CRUD in HBase helps you interact with big data stored in a fast and flexible manner.
Why it matters
Without CRUD operations, you cannot manage or manipulate data in HBase, making it impossible to build applications or analyze data stored there. CRUD operations solve the problem of handling huge amounts of data in a distributed system, allowing real-time access and updates. Without them, big data systems would be static and unusable for dynamic tasks like tracking user activity or updating records.
Where it fits
Before learning CRUD in HBase, you should understand basic Hadoop concepts and how HBase stores data in tables with rows and column families. After mastering CRUD, you can explore advanced topics like HBase filters, scans, and integration with MapReduce or Spark for big data processing.
Mental Model
Core Idea
CRUD operations in HBase are the fundamental ways to add, get, change, and remove data from a large, distributed table organized by rows and columns.
Think of it like...
Think of HBase like a giant library where each book is a row, and each chapter inside the book is a column family. CRUD operations are like adding new books, reading chapters, updating pages, or removing books from the library.
┌────────────┐
│   HBase    │
│   Table    │
│────────────│
│ Row Key 1  │
│ ┌────────┐ │
│ │ColFam1 │ │
│ │ Col1   │ │
│ │ Col2   │ │
│ └────────┘ │
│ Row Key 2  │
│ ...        │
└────────────┘

Operations:
Create -> Add new row or column
Read   -> Get data by row key
Update -> Change existing data
Delete -> Remove row or column
Build-Up - 7 Steps
1
Foundation: Understanding HBase Table Structure
Concept: Learn how data is organized in HBase tables using row keys and column families.
HBase stores data in tables made of rows and column families. Each row has a unique key. Column families group related columns. Data is stored as cells identified by row key, column family, and column qualifier. This structure allows fast lookups by row key.
Result
You can visualize HBase data as a table with rows identified by keys and columns grouped in families.
Understanding the table structure is essential because all CRUD operations depend on knowing how to locate and organize data in HBase.
2
Foundation: Basic Create and Read Operations
Concept: Learn how to add new data and retrieve existing data using HBase commands or API.
To create data, you insert a new row with a unique key and specify a column family and qualifier with a value. To read data, you query by row key to get all or specific columns. For example, using the HBase shell:

Create: put 'table', 'row1', 'cf1:col1', 'value1'
Read:   get 'table', 'row1'

The first command stores a value; the second retrieves it.
Result
Data is stored and can be retrieved by row key and column details.
Knowing how to create and read data is the foundation for interacting with HBase and enables you to verify data storage.
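The shell commands above map onto a simple addressing scheme: every value is a cell located by row key plus family:qualifier. Below is a minimal in-memory sketch of that scheme using plain Python dictionaries; the `table`, `put`, and `get` names are illustrative, not HBase's real client API.

```python
# A minimal in-memory sketch of HBase's data model (not the real API):
# each value lives in a cell addressed by (row key, "family:qualifier").
table = {}

def put(row, col, value):
    """Store a value, like `put 'table', row, col, value` in the HBase shell."""
    table.setdefault(row, {})[col] = value

def get(row, col=None):
    """Fetch one cell, or a whole row, like `get 'table', row` in the shell."""
    if col is None:
        return table.get(row, {})
    return table.get(row, {}).get(col)

put("row1", "cf1:col1", "value1")
print(get("row1", "cf1:col1"))  # value1
print(get("row1"))              # {'cf1:col1': 'value1'}
```

Note how a read never scans the table: the row key leads straight to the row, which is why HBase lookups by row key are fast.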
3
Intermediate: Updating Data in HBase
🤔 Before reading on: Do you think updating data in HBase replaces the entire row or just specific columns? Commit to your answer.
Concept: Updates in HBase are done by putting new values for existing columns or adding new columns in a row.
To update data, you use the same 'put' command with the row key and column details. HBase overwrites the old value with the new one for that cell. For example:

put 'table', 'row1', 'cf1:col1', 'new_value'

This changes the value of 'col1' in 'row1'. HBase keeps older versions by timestamp, but the latest is returned by default.
Result
The specified cell value is updated without affecting other columns or rows.
Understanding that updates are cell-level and overwrite existing values helps avoid accidental data loss and enables precise data management.
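The update behavior above can be sketched in memory: a put on an existing cell appends a new timestamped version, and a plain read returns only the newest one, leaving other cells untouched. All names here are illustrative Python, not the HBase API, and the counter merely stands in for HBase's real millisecond timestamps.

```python
import itertools

# Illustrative sketch of cell-level updates: a put on an existing cell adds a
# new version; a plain read returns only the newest. Not the real HBase API.
ticker = itertools.count(1)   # stand-in for HBase's millisecond timestamps
cells = {}                    # (row, "cf:qual") -> list of (ts, value)

def put(row, col, value):
    cells.setdefault((row, col), []).append((next(ticker), value))

def get(row, col):
    versions = cells.get((row, col), [])
    return versions[-1][1] if versions else None

put("row1", "cf1:col1", "value1")
put("row1", "cf1:col2", "other")
put("row1", "cf1:col1", "new_value")   # update: changes what a read sees
print(get("row1", "cf1:col1"))         # new_value (latest version wins)
print(get("row1", "cf1:col2"))         # other (untouched by the update)
```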
4
Intermediate: Deleting Data from HBase
🤔 Before reading on: Does deleting a row in HBase immediately remove all data or mark it for later removal? Commit to your answer.
Concept: Deletion in HBase marks data as deleted (tombstones) which are cleaned up later during compaction.
You can delete specific cells, entire columns, or whole rows. For example:

delete 'table', 'row1', 'cf1:col1'   # deletes one cell
deleteall 'table', 'row1'            # deletes the entire row

HBase does not immediately erase the data but marks it with tombstones. Actual removal happens during background compaction.
Result
Data is marked deleted and becomes invisible to reads, but physical removal is deferred.
Knowing that deletion is logical first prevents confusion about why deleted data might still occupy storage temporarily.
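The tombstone mechanism can be sketched the same way: a delete only records a marker, reads honor the marker immediately, and a simulated compaction does the physical cleanup later. This is illustrative Python, not HBase internals.

```python
# Sketch of logical deletes: a delete writes a tombstone marker, reads hide
# the cell, and a later "compaction" physically drops the data.
cells = {("row1", "cf1:col1"): "value1", ("row1", "cf1:col2"): "value2"}
tombstones = set()

def delete(row, col):
    tombstones.add((row, col))       # mark, don't erase

def get(row, col):
    if (row, col) in tombstones:
        return None                  # invisible to reads right away
    return cells.get((row, col))

def compact():
    for key in tombstones:           # deferred physical cleanup
        cells.pop(key, None)
    tombstones.clear()

delete("row1", "cf1:col1")
print(get("row1", "cf1:col1"))   # None, but the value still occupies storage
print(len(cells))                # 2  (space not reclaimed yet)
compact()
print(len(cells))                # 1  (reclaimed during compaction)
```

This is exactly why deleted data can still occupy disk space in HBase until compaction runs.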
5
Intermediate: Using the HBase API for CRUD Operations
🤔 Before reading on: Do you think HBase API operations are synchronous or asynchronous by default? Commit to your answer.
Concept: HBase provides Java API to perform CRUD operations programmatically with control over sync or async behavior.
Using the HBase Java API, you create Put, Get, and Delete objects and send them to the table. For example:

Put put = new Put(Bytes.toBytes("row1"));
put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
table.put(put);

Get get = new Get(Bytes.toBytes("row1"));
Result result = table.get(get);

This allows CRUD logic to be embedded directly in applications.
Result
You can perform CRUD operations inside Java programs with fine control.
Using the API unlocks automation and integration, essential for real-world big data applications.
6
Advanced: Handling Versions and Timestamps in CRUD
🤔 Before reading on: Does HBase keep multiple versions of a cell by default or only the latest? Commit to your answer.
Concept: HBase stores multiple versions of data cells identified by timestamps, allowing access to historical data.
Each cell in HBase can have multiple versions. When you put data, you can specify a timestamp or let HBase assign one. Reads can request specific versions. For example:

Get get = new Get(Bytes.toBytes("row1"));
get.readVersions(3);   // retrieve the last 3 versions

This feature supports auditing and rollback scenarios.
Result
You can access past values of data, not just the latest.
Understanding versions helps in designing systems that need history tracking or undo capabilities.
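The versioned-read behavior can be sketched as follows, with `read_versions` standing in for the Java API's Get.readVersions; the timestamps are hand-picked for illustration rather than assigned by a server.

```python
# Sketch of versioned reads: each put records (timestamp, value); a read can
# ask for the newest N versions, mirroring Get.readVersions(N). Illustrative only.
versions = []   # one cell's history as (timestamp, value) pairs

def put(value, ts):
    versions.append((ts, value))

def read_versions(n=1):
    return sorted(versions, reverse=True)[:n]   # newest first

put("v1", ts=100)
put("v2", ts=200)
put("v3", ts=300)
print(read_versions(1))   # [(300, 'v3')]  default: latest only
print(read_versions(3))   # [(300, 'v3'), (200, 'v2'), (100, 'v1')]
```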
7
Expert: Optimizing CRUD with Batch and Atomic Operations
🤔 Before reading on: Do you think batch operations in HBase improve performance by reducing network calls, or do they add overhead? Commit to your answer.
Concept: Batching multiple CRUD operations reduces network overhead and improves throughput; atomic operations ensure consistency.
HBase supports batch operations where multiple puts, gets, or deletes are sent together:

List<Row> batch = new ArrayList<>();
batch.add(put1);
batch.add(delete1);
Object[] results = new Object[batch.size()];
table.batch(batch, results);

Atomic operations like checkAndPut ensure an update happens only if a condition matches, preventing race conditions. This is critical in high-load production environments.
Result
CRUD operations become faster and safer under concurrent access.
Knowing how to batch and use atomic operations is key to building scalable, reliable big data applications.
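The check-and-put idea can be sketched as a compare-then-write on a single cell. In a real region server the check and the write happen atomically under a row lock; this single-threaded sketch only imitates that guarantee.

```python
# Sketch of a conditional update in the spirit of checkAndPut: the write
# succeeds only if the current value matches the expected one. Illustrative only.
cells = {("row1", "cf1:col1"): "v1"}

def check_and_put(row, col, expected, new_value):
    if cells.get((row, col)) != expected:
        return False                 # condition failed, nothing written
    cells[(row, col)] = new_value    # condition held, apply the write
    return True

print(check_and_put("row1", "cf1:col1", "v1", "v2"))   # True: value is now v2
print(check_and_put("row1", "cf1:col1", "v1", "v3"))   # False: v1 is gone
print(cells[("row1", "cf1:col1")])                     # v2
```

The second call failing is the point: two clients racing to update the same cell cannot both succeed against the same expected value.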
Under the Hood
HBase stores data in HDFS as files called HFiles. When you perform a put, data is first written to a write-ahead log (WAL) for durability, then to an in-memory store called MemStore. Once MemStore fills, data is flushed to disk as an HFile. Reads merge data from MemStore and HFiles. Deletes create tombstones marking data as deleted, which are cleaned during compaction. This design ensures fast writes and consistent reads in a distributed environment.
Why designed this way?
HBase was designed to handle massive data with low latency on commodity hardware. Using WAL and MemStore balances durability and speed. The column-family model optimizes sparse data storage. Tombstones allow efficient deletes without immediate costly disk operations. Alternatives like relational databases couldn't scale horizontally as easily.
┌───────────────┐
│ Client CRUD   │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Write-Ahead   │
│ Log (WAL)     │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ MemStore      │
│ (In-memory)   │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ HFiles on HDFS│
└───────────────┘

Reads combine MemStore + HFiles
Deletes add tombstones
Compaction cleans tombstones and merges files
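The write path in the diagram can be sketched end to end. The tiny flush threshold is artificial, and the structures are simplifications: a real MemStore is a sorted map per column family, and HFiles are sorted immutable files on HDFS.

```python
# Sketch of the write path: a put is logged to the WAL, buffered in the
# MemStore, and flushed to an immutable "HFile" when the buffer fills.
# Reads merge the MemStore with flushed files. Illustrative only.
wal = []          # durability log
memstore = {}     # in-memory buffer
hfiles = []       # flushed, immutable snapshots
FLUSH_AT = 2      # tiny threshold for demonstration

def put(key, value):
    wal.append((key, value))           # 1. durable log entry first
    memstore[key] = value              # 2. then the in-memory store
    if len(memstore) >= FLUSH_AT:
        hfiles.append(dict(memstore))  # 3. flush to an immutable file
        memstore.clear()

def get(key):
    if key in memstore:                # newest data wins
        return memstore[key]
    for hfile in reversed(hfiles):     # then newer files before older ones
        if key in hfile:
            return hfile[key]
    return None

put("row1", "a")
put("row2", "b")     # triggers a flush; the memstore is now empty
put("row1", "c")     # newer value, still in the memstore
print(get("row1"))   # c   (memstore shadows the flushed HFile)
print(get("row2"))   # b   (served from the flushed HFile)
```

The read merging the MemStore with older files is why writes can stay fast while reads remain consistent.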
Myth Busters - 4 Common Misconceptions
Quick: Does deleting a row in HBase immediately free all disk space? Commit to yes or no.
Common Belief: Deleting a row instantly removes all its data from disk.
Reality: Deletion only marks data with tombstones; actual disk space is freed later during compaction.
Why it matters: Assuming immediate deletion can lead to confusion about storage usage and delays in reclaiming space.
Quick: When updating a cell, does HBase create a new version or overwrite the old one? Commit to your answer.
Common Belief: Updating a cell overwrites the old value and deletes previous versions.
Reality: HBase stores multiple versions of a cell; updates add new versions with timestamps.
Why it matters: Ignoring versions can cause unexpected data retrieval results and storage growth.
Quick: Are HBase CRUD operations synchronous by default? Commit to yes or no.
Common Belief: All CRUD operations block until completion before returning.
Reality: Some API calls can be asynchronous or buffered, affecting when data is actually written.
Why it matters: Misunderstanding sync behavior can cause bugs in data consistency and timing.
Quick: Does HBase support transactions across multiple rows? Commit to yes or no.
Common Belief: HBase supports multi-row transactions like relational databases.
Reality: HBase only guarantees atomicity at the single-row level, not multi-row transactions.
Why it matters: Expecting multi-row transactions can lead to data consistency issues in distributed applications.
Expert Zone
1
HBase's use of timestamps for versions means that clock synchronization across clients can affect data visibility and ordering.
2
Batch operations reduce network overhead but require careful error handling because partial failures can occur.
3
Tombstones improve delete performance but can cause read amplification if compaction is delayed, impacting latency.
When NOT to use
HBase CRUD is not suitable when strong multi-row transactional guarantees are needed; in such cases, consider distributed SQL systems like Google Spanner, or Apache Phoenix layered on top of HBase. For small datasets or simple key-value needs, an embedded store like RocksDB is lighter, and Cassandra offers a different set of distributed trade-offs that may fit better.
Production Patterns
In production, CRUD operations are often combined with filters and scans to efficiently query subsets of data. Batch puts and deletes are used to optimize throughput. Atomic CheckAndPut operations enforce conditional updates. Monitoring compaction and tuning MemStore size are common to maintain performance.
Connections
Relational Database CRUD
Similar pattern of Create, Read, Update, Delete but with different storage and consistency models.
Understanding relational CRUD helps grasp the purpose of HBase CRUD, but HBase differs by being distributed and column-oriented.
Distributed Systems Consistency Models
HBase CRUD operations reflect eventual consistency and atomicity at row level in distributed storage.
Knowing distributed consistency concepts clarifies why HBase handles deletes with tombstones and limits transactions.
Version Control Systems
HBase cell versions are like commits in version control, storing history with timestamps.
Recognizing this connection helps understand how HBase manages multiple data versions and supports historical queries.
Common Pitfalls
#1 Trying to delete data and expecting immediate disk space recovery.
Wrong approach:
deleteall 'table', 'row1'
# Then checking disk space immediately, expecting it to be freed
Correct approach:
deleteall 'table', 'row1'
# Understand that compaction will free the space later
Root cause: Misunderstanding that HBase uses tombstones and deferred cleanup.
#2 Updating a cell without specifying the correct column family and qualifier.
Wrong approach:
put 'table', 'row1', 'col1', 'value'
# Missing column family causes an error or wrong data placement
Correct approach:
put 'table', 'row1', 'cf1:col1', 'value'
# Correctly specifies the column family and qualifier
Root cause: Confusing HBase's family:qualifier syntax.
#3 Assuming batch operations are atomic across all rows.
Wrong approach:
table.batch(Arrays.asList(put1, put2, delete1), results);
// Expecting all to succeed or fail together
Correct approach:
Use checkAndPut for atomicity on a single row; handle partial batch failures manually.
Root cause: Misunderstanding batch operation guarantees.
Key Takeaways
CRUD operations in HBase allow you to create, read, update, and delete data in a distributed, column-oriented database.
Data is organized by row keys and column families, and operations target specific cells identified by these keys.
Updates overwrite cell values but keep older versions accessible by timestamp.
Deletes mark data with tombstones and actual removal happens later during compaction.
Batch and atomic operations optimize performance and consistency in real-world big data applications.