HBase: Architecture

Nov 6, 2023

Apache HBase is an open-source, distributed, and scalable NoSQL database built on top of the Hadoop Distributed File System (HDFS). It is designed for storing and managing large volumes of data in a fault-tolerant and distributed manner. HBase follows a column-family-oriented data model and provides low-latency, real-time read and write access to data.

Here's an overview of HBase's architecture:

HBase Data Model

HBase stores data in tables, similar to traditional relational databases. Each table consists of rows and columns.
Tables are divided into column families, which group related columns together.
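Each column family can contain any number of columns, each identified by a column qualifier. A cell, addressed by row key, column family, and qualifier, can hold multiple timestamped versions of a value.
Rows are uniquely identified by a row key and are stored in sorted order by that key.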
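To make the data model concrete, here is a minimal sketch in Python. It is a toy model, not HBase's storage format: the `ToyTable` class, its method names, and the sample row are all made up for illustration, and real HBase stores raw bytes rather than Python strings.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self, families):
        # Column families are fixed up front; qualifiers are created on write.
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value, ts):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][f"{family}:{qualifier}"][ts] = value

    def get(self, row_key, family, qualifier):
        """Return the value with the highest timestamp, or None."""
        versions = self.rows.get(row_key, {}).get(f"{family}:{qualifier}", {})
        if not versions:
            return None
        return versions[max(versions)]


table = ToyTable(families=["info", "metrics"])
table.put("user#42", "info", "name", "Ada", ts=1)
table.put("user#42", "info", "name", "Ada L.", ts=2)
print(table.get("user#42", "info", "name"))  # the newest version wins
```

Note that the column family has to exist before a write, while qualifiers inside a family can be added freely. That mirrors how families are part of the table schema and individual columns are not.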

HMaster

The HMaster is the coordinating component of an HBase cluster. It manages metadata and schema operations (such as creating or altering tables), keeps track of regions, assigns regions to Region Servers, and coordinates region splits, merges, and load balancing.

Region Servers

Region Servers are responsible for serving data in HBase. They host a set of regions.
A region is a subset of a table and represents a range of row keys.
Each Region Server can serve multiple regions and is responsible for reading, writing, and managing data within those regions.
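Because every region covers a contiguous range of row keys, finding the region for a given row is a search over sorted start keys. The sketch below illustrates the idea with a binary search. The region boundaries and server names are invented, and HBase itself keeps this mapping in a catalog table rather than a Python list.

```python
import bisect

# Hypothetical region table: (start_key, server). Each region covers
# [start_key, next_start_key); the first region starts at the empty key.
REGIONS = [
    ("", "rs1.example.com"),
    ("g", "rs2.example.com"),
    ("p", "rs3.example.com"),
]
START_KEYS = [start for start, _ in REGIONS]


def locate_region(row_key):
    """Return (start_key, server) of the region whose range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before it is the one that holds the row.
    idx = bisect.bisect_right(START_KEYS, row_key) - 1
    return REGIONS[idx]


for key in ["apple", "grape", "zebra"]:
    print(key, "is served by", locate_region(key)[1])
```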

ZooKeeper

HBase relies on Apache ZooKeeper for coordination and distributed synchronization. ZooKeeper tracks which Region Servers are alive, supports election of the active HMaster, and helps the cluster fail over when a server dies.

HDFS Integration

HBase uses HDFS as the underlying storage system. Data is stored in HDFS files, and HBase provides a layer on top for efficient and real-time access.
Each Region Server persists the data for its regions as files in HDFS, which replicates them across DataNodes.

Write Path

When data is written to HBase, the Region Server first appends the edit to the write-ahead log (WAL), which ensures durability and allows recovery in case of a crash.
The edit is then applied to the MemStore (an in-memory write buffer). Once both steps succeed, the write is acknowledged to the client.
When the MemStore reaches a configured size, it is flushed to a new HFile on HDFS.
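The interplay between the WAL, the MemStore, and flushes is easier to see in code. The following is a simplified sketch, not HBase's implementation: `ToyRegionStore` and its fields are hypothetical, the flush threshold counts cells instead of bytes, and the "WAL" is just a list. In HBase the WAL append happens before the MemStore update, and the sketch follows that order.

```python
class ToyRegionStore:
    """Toy write path: append to WAL, update MemStore, flush when full."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # cell count here; HBase uses bytes
        self.wal = []        # stand-in for the write-ahead log
        self.memstore = {}   # stand-in for the in-memory write buffer
        self.hfiles = []     # each flush produces one immutable, sorted "file"

    def put(self, key, value):
        self.wal.append((key, value))  # 1. log the edit for durability
        self.memstore[key] = value     # 2. then apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if not self.memstore:
            return
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.wal.clear()  # flushed edits no longer need to be replayed

    def recover(self):
        """After a crash the MemStore is gone; rebuild it by replaying the WAL."""
        self.memstore = {}
        for key, value in self.wal:
            self.memstore[key] = value


store = ToyRegionStore(flush_threshold=3)
for i in range(4):
    store.put(f"row{i}", f"v{i}")

store.memstore = {}  # simulate losing in-memory state in a crash
store.recover()
print(store.memstore)     # the unflushed edit comes back from the WAL
print(len(store.hfiles))  # the first three edits were already flushed
```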

Read Path

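When a client requests data, it does not go through the HMaster. It looks up the location of the hbase:meta table (traditionally via ZooKeeper), reads from hbase:meta which Region Server hosts the region for the requested row key, and caches that location for later requests.
The Region Server then serves the read by combining data from the BlockCache, the MemStore, and the HFiles on HDFS, and returns the result to the client.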
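On the Region Server, a read generally has to consider both recent writes that are still in memory and older data already on disk. Here is a minimal sketch of that lookup order, assuming plain dictionaries as stand-ins for the MemStore and HFiles. The function name and the data are hypothetical, and real HBase merges cells by timestamp rather than stopping at the first match.

```python
def read_cell(key, memstore, hfiles):
    """Return the newest value for key, or None if it is not stored.

    memstore: dict of recent, unflushed writes.
    hfiles: list of dicts ordered oldest to newest (stand-ins for HFiles).
    """
    if key in memstore:              # unflushed writes are the most recent
        return memstore[key]
    for hfile in reversed(hfiles):   # then check files from newest to oldest
        if key in hfile:
            return hfile[key]
    return None


hfiles = [{"row1": "old", "row2": "b"}, {"row1": "newer"}]
memstore = {"row3": "fresh"}

print(read_cell("row1", memstore, hfiles))  # found in the newest file
print(read_cell("row3", memstore, hfiles))  # found in the MemStore
print(read_cell("row9", memstore, hfiles))  # not stored anywhere
```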

Load Balancing

The HMaster periodically runs a load balancer that moves regions between Region Servers so that load is spread evenly and resources are used efficiently.
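As a rough illustration of what balancing means, the sketch below evens out region counts by repeatedly moving one region from the busiest server to the least busy one. This is a deliberately naive strategy with hypothetical names. HBase's own balancer weighs several cost factors, not just the number of regions.

```python
def balance(assignments):
    """Even out region counts in place.

    assignments: dict mapping server name -> list of region names.
    Returns the list of moves as (region, source, destination) tuples.
    """
    moves = []
    while True:
        busiest = max(assignments, key=lambda s: len(assignments[s]))
        idlest = min(assignments, key=lambda s: len(assignments[s]))
        # Stop once no server holds more than one region above any other.
        if len(assignments[busiest]) - len(assignments[idlest]) <= 1:
            return moves
        region = assignments[busiest].pop()
        assignments[idlest].append(region)
        moves.append((region, busiest, idlest))


cluster = {
    "rs1": ["r1", "r2", "r3", "r4", "r5"],
    "rs2": ["r6"],
    "rs3": [],
}
for region, src, dst in balance(cluster):
    print(f"move {region}: {src} -> {dst}")
print({server: len(regions) for server, regions in cluster.items()})
```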

Automatic Region Splitting and Merging

Regions can be split into two regions to handle data growth, and small adjacent regions can be merged to keep the number of regions manageable.
HBase splits regions automatically once they grow past a configured size. Merges are typically triggered by an administrator or by the optional region normalizer.
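A split turns one key range into two adjacent ranges that together cover the same rows. The sketch below shows this on a toy region represented as a dictionary. The structure and the choice of the median row key as split point are assumptions made for illustration, since HBase derives its split point from the region's store files rather than by counting rows.

```python
def split_region(region):
    """Split a toy region at its median row key.

    region: dict with "start", "end" (key range) and "rows" (key -> value).
    Returns (left, right), covering [start, split_key) and [split_key, end).
    """
    keys = sorted(region["rows"])
    if len(keys) < 2:
        raise ValueError("need at least two rows to split")
    split_key = keys[len(keys) // 2]
    left = {
        "start": region["start"],
        "end": split_key,
        "rows": {k: v for k, v in region["rows"].items() if k < split_key},
    }
    right = {
        "start": split_key,
        "end": region["end"],
        "rows": {k: v for k, v in region["rows"].items() if k >= split_key},
    }
    return left, right


parent = {"start": "", "end": "m", "rows": {"a": 1, "b": 2, "c": 3, "d": 4}}
left, right = split_region(parent)
print(left["start"], left["end"], sorted(left["rows"]))
print(right["start"], right["end"], sorted(right["rows"]))
```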

Compaction

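HBase performs compaction to optimize storage and improve read performance. Minor compactions merge several smaller HFiles into a larger one. Major compactions rewrite all HFiles of a store into a single file and remove deleted and expired data along the way.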
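The core of compaction is a merge in which newer data overrides older data and, in a major compaction, delete markers and the data they cover are dropped. Here is a minimal sketch under simplifying assumptions: files are plain dictionaries, a `TOMBSTONE` sentinel stands in for a delete marker, and the names are hypothetical rather than taken from HBase.

```python
TOMBSTONE = object()  # stand-in for a delete marker


def compact(hfiles, major=False):
    """Merge toy HFiles into one sorted list of (key, value) pairs.

    hfiles: list of dicts ordered oldest to newest.
    major: if True, drop delete markers and the data they hide. A minor
           compaction keeps the markers, because files that were not part
           of the merge may still hold older versions of those keys.
    """
    merged = {}
    for hfile in hfiles:
        merged.update(hfile)  # newer files override older ones
    if major:
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return sorted(merged.items(), key=lambda kv: kv[0])


files = [
    {"row1": "a", "row2": "b"},
    {"row2": TOMBSTONE, "row3": "c"},
]
print(len(compact(files)))        # the marker for row2 is still present
print(compact(files, major=True))  # row2 is gone for good
```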

Bloom Filters

HBase can use Bloom filters to skip HFiles that definitely do not contain a requested row (or row and column), which reduces disk I/O and improves read performance.
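A Bloom filter answers "is this key possibly in the file?" with no false negatives and a small rate of false positives, which is what makes it safe for skipping files. The sketch below is a generic, textbook-style Bloom filter, not HBase's implementation. The class name, sizes, and the use of SHA-256 are choices made purely for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash input.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


# One filter per file: a lookup only needs to read the file from disk
# when the filter says the row key might be there.
file_filter = BloomFilter()
for row_key in ["row1", "row2", "row3"]:
    file_filter.add(row_key)

if file_filter.might_contain("row2"):
    print("row2 may be in this file, read it")
```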

Caching

HBase supports caching to reduce the number of disk reads by keeping frequently accessed data in memory. The BlockCache holds recently read HFile blocks. The MemStore is primarily a write buffer, but it also serves reads for data that has not yet been flushed.
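To show how a block cache saves disk reads, here is a small least-recently-used (LRU) cache sketch. It is a simplified stand-in, not HBase's cache: `ToyBlockCache`, the `load_block` callback, and the block names are hypothetical, and capacity is counted in blocks rather than bytes.

```python
from collections import OrderedDict


class ToyBlockCache:
    """Keeps the most recently used blocks in memory, evicting the oldest."""

    def __init__(self, capacity, load_block):
        self.capacity = capacity
        self.load_block = load_block  # called on a miss; stands in for a disk read
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            self.hits += 1
            return self.blocks[block_id]
        self.misses += 1
        block = self.load_block(block_id)
        self.blocks[block_id] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict the least recently used
        return block


disk_reads = []


def read_from_disk(block_id):
    disk_reads.append(block_id)
    return f"data for {block_id}"


cache = ToyBlockCache(capacity=2, load_block=read_from_disk)
for block_id in ["a", "b", "a", "c", "b"]:
    cache.get(block_id)

print("hits:", cache.hits, "misses:", cache.misses)
print("disk reads:", disk_reads)
```

In this run the second request for block "a" is served from memory, while "b" has to be read again because it was evicted to make room for "c".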

Coprocessors

HBase allows you to define custom code that runs on the server side (coprocessors). This enables functionality such as filtering and aggregation close to the data, so that far less data has to cross the network.
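The benefit of running code next to the data shows up when you count what has to be sent to the client. The sketch below compares a client-side sum with an aggregation computed per region. It only models the idea: the functions are hypothetical, regions are plain lists, and no actual coprocessor API is involved.

```python
def client_side_sum(regions):
    """Ship every value to the client, then add them up there."""
    transferred = [value for region in regions for value in region]
    return sum(transferred), len(transferred)


def server_side_sum(regions):
    """Each region computes a partial sum; only the partials are shipped."""
    partials = [sum(region) for region in regions]
    return sum(partials), len(partials)


regions = [[1, 2, 3], [4, 5], [6]]

total, shipped = client_side_sum(regions)
print(f"client-side: total={total}, values shipped={shipped}")

total, shipped = server_side_sum(regions)
print(f"server-side: total={total}, values shipped={shipped}")
```

Both approaches produce the same total, but the second one ships one number per region instead of one per cell.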

Security

HBase provides authentication mechanisms (such as Kerberos) to verify who is connecting, and access control lists (ACLs) for fine-grained control over who can read and modify data.

Backup and Restore

HBase supports backup and restore operations to safeguard data and recover from failures.

HBase’s architecture is designed to provide scalability, fault tolerance, and low-latency data access, making it suitable for use cases where real-time, large-scale data storage and retrieval are required, such as in time-series data, IoT applications, and other big data scenarios.