HBase Data

Currently, HBase data can be backed up in the following modes: snapshots, replication, Export/Import, CopyTable, the HTable API, and offline backup of HDFS data.

Table 1 compares these backup modes from six perspectives.

Table 1 Data backup mode comparison on HBase

| Backup Mode | Performance Impact | Data Footprint | Downtime | Incremental Backup | Ease of Implementation | Mean Time to Repair (MTTR) |
| --- | --- | --- | --- | --- | --- | --- |
| Snapshots | Minimal | Tiny | Brief (only for restore) | No | Easy | Seconds |
| Replication | Minimal | Large | None | Intrinsic | Medium | Seconds |
| Export | High | Large | None | Yes | Easy | High |
| CopyTable | High | Large | None | Yes | Easy | High |
| HTable API | Medium | Large | None | Yes | Difficult | Up to you |
| Offline backup of HDFS data | - | Large | Long | No | Medium | High |

Snapshots

You can take a snapshot of a table to preserve its current state. The snapshot can be used to back up the original table, to roll the table back when it becomes faulty, and to back up data across clusters. After a snapshot is taken, the .hbase-snapshot directory is generated in the HBase root directory (/hbase by default). This directory contains the details of each snapshot. When the ExportSnapshot command is executed, an MR task is submitted locally to copy the snapshot metadata and the table's HFiles to /hbase/.hbase-snapshot and /hbase/archive of the standby cluster, respectively. For details, see http://hbase.apache.org/2.2/book.html#ops.snapshots.

Perform the following operations on the active cluster:

  1. Create a snapshot for a table. For example, create snapshot member_snapshot for the member table.

    snapshot 'member','member_snapshot'

  2. Copy the snapshot to the standby cluster.

    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot member_snapshot -copy-to hdfs://IP address of the active NameNode of the HDFS service in the standby cluster:Port number/hbase -mappers 3

    • The destination directory on the standby cluster must be the HBase root directory (/hbase).
    • -mappers indicates the number of map tasks submitted for the MR job.
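
    For example, with a hypothetical standby NameNode address of 192.168.1.10 and RPC port 8020 (both placeholders), the command would be:

    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot member_snapshot -copy-to hdfs://192.168.1.10:8020/hbase -mappers 3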

Perform the following operations on the standby cluster:

Run the restore command to automatically create a table in the standby cluster and create links between the HFiles in the archive directory and the new table.

restore_snapshot 'member_snapshot'
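
As a quick check, run list_snapshots in the HBase shell on the active cluster to confirm that the snapshot exists, and scan the restored table on the standby cluster to confirm that the data is readable:

scan 'member'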

If only table data needs to be backed up, snapshots are highly recommended. ExportSnapshot submits an MR task locally that copies the snapshot metadata and HFiles to the standby cluster, where the data can be loaded directly. This makes it more efficient than the other methods.

Replication

In Replication backup mode, a disaster recovery relationship is established between the active and standby HBase clusters. When data is written to the active cluster, the active cluster ships the WAL entries to the standby cluster, keeping the two clusters synchronized in near real time. For details, see http://hbase.apache.org/2.2/book.html#_cluster_replication.

For details about how to use and configure HBase backup, see Configuring HBase Replication and Using the ReplicationSyncUp Tool.
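
The following HBase shell commands outline how such a replication relationship is typically established on the active cluster; the peer ID, ZooKeeper quorum, and table name are placeholders for your environment.

add_peer '1', CLUSTER_KEY => "zk1,zk2,zk3:2181:/hbase"

enable_table_replication 'member'

add_peer registers the standby cluster as a replication peer, and enable_table_replication sets REPLICATION_SCOPE to 1 on the table's column families so that its WAL entries are shipped to the peer.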

Export/Import

Export starts a MapReduce task to scan the data table and writes SequenceFiles to HDFS on the remote cluster; Import then reads the SequenceFiles and puts the data into HBase.

Perform the following operations on the active cluster:

Run the Export command to export the table.

hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir>

Example: hbase org.apache.hadoop.hbase.mapreduce.Export member hdfs://IP address of the active NameNode of the HDFS service in the standby cluster:Port number/user/table/member

In the command, member indicates the name of the table to be exported.

Perform the following operations on the standby cluster:

  1. After operations are executed on the active cluster, you can view the generated directory data on the standby cluster, as shown in Figure 1.

    Figure 1 Directory data

  2. Run the create command to create a table in the standby cluster with the same structure as that of the active cluster, for example, member_import.
  3. Run the Import command to generate the HFile data on HDFS.

    hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>

    Example: hbase org.apache.hadoop.hbase.mapreduce.Import -Dimport.bulk.output=/tmp/member member_import /user/table/member

    • member_import indicates a table in the standby cluster with the same structure as that of the active cluster.
    • -Dimport.bulk.output indicates the output directory of the HFile data; the -D option must precede the positional arguments.
    • /user/table/member indicates the directory storing the data exported from the active cluster.

  4. Perform the Load operation to write the HFile data to HBase.

    hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/member member_import

    • /tmp/member indicates the output directory of the HFile data in step 3.
    • member_import indicates the name of the table to which data is to be imported in the standby cluster.
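
As a quick sanity check after the load, you can compare row counts between the member table on the active cluster and the imported table on the standby cluster using the HBase shell:

count 'member_import'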

CopyTable

The function of CopyTable is similar to that of Export. Like Export, CopyTable uses the HBase API to create a MapReduce task that reads data from the source table. The difference is that the output of CopyTable is another HBase table, which can reside in the local or a remote cluster. For details, see http://hbase.apache.org/2.2/book.html#copy.table.

Perform the following operations on the standby cluster:

Run the create command to create a table in the standby cluster with the same structure as that of the active cluster, for example, member_copy.

Perform the following operations on the active cluster:

Run the following CopyTable command to copy the table:

hbase org.apache.hadoop.hbase.mapreduce.CopyTable [--starttime=xxxxxx] [--endtime=xxxxxx] --new.name=member_copy --peer.adr=server1,server2,server3:2181:/hbase [--families=myOldCf:myNewCf,cf2,cf3] member

If data is copied to a remote cluster, a MapReduce task is submitted on the local cluster to read the full or partial data from the original table and then write it to the remote cluster in put mode. Therefore, if the table contains a large amount of data, the efficiency is low, because remote copy does not support bulkload.
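
To copy only the data written within a given time window, bound the copy with --starttime and --endtime; the millisecond timestamps below are illustrative placeholders:

hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=1609459200000 --endtime=1609545600000 --new.name=member_copy --peer.adr=server1,server2,server3:2181:/hbase member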

HTable API

The HTable API imports and exports data of the original HBase table in code. You can use the public client API to write a customized application that queries tables directly, or design other methods that take advantage of the batch-processing strengths of MapReduce tasks. This mode requires an in-depth understanding of Hadoop development and of the impact on the production cluster.
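
As a minimal sketch of this approach (not the product's own tooling), the following Java program copies every row of one table into a table on another cluster through the public client API; in HBase 2.x, Table replaces the older HTable. The ZooKeeper quorums and table names are placeholders, and a real application would add error handling, parallelism, and filtering.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;

    public class TableCopier {
        public static void main(String[] args) throws Exception {
            // Source (active) and destination (standby) cluster configurations;
            // the ZooKeeper quorums below are placeholders.
            Configuration srcConf = HBaseConfiguration.create();
            srcConf.set("hbase.zookeeper.quorum", "active-zk1,active-zk2,active-zk3");
            Configuration dstConf = HBaseConfiguration.create();
            dstConf.set("hbase.zookeeper.quorum", "standby-zk1,standby-zk2,standby-zk3");

            try (Connection srcConn = ConnectionFactory.createConnection(srcConf);
                 Connection dstConn = ConnectionFactory.createConnection(dstConf);
                 Table src = srcConn.getTable(TableName.valueOf("member"));
                 // BufferedMutator batches Puts and flushes them in the background.
                 BufferedMutator dst = dstConn.getBufferedMutator(TableName.valueOf("member"))) {

                Scan scan = new Scan();
                scan.setCaching(500);        // rows fetched per RPC
                scan.setCacheBlocks(false);  // avoid polluting the block cache on a full scan

                try (ResultScanner scanner = src.getScanner(scan)) {
                    for (Result result : scanner) {
                        // Rebuild each row as a Put carrying all of its cells.
                        Put put = new Put(result.getRow());
                        for (Cell cell : result.listCells()) {
                            put.add(cell);
                        }
                        dst.mutate(put);
                    }
                } // BufferedMutator flushes any remaining mutations on close().
            }
        }
    }

Because every row travels through the client as a Put, this approach shares the performance characteristics of CopyTable's put mode; its advantage is full control over what is read, transformed, and written.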

Offline backup of HDFS data

Offline backup of HDFS data means stopping the HBase service and manually copying the underlying data on HDFS.

Perform the following operations on the active cluster:

  1. Run the following command to flush the table's in-memory data (MemStore) to HDFS so that it is persisted before the copy:

    flush 'tableName'

  2. Stop the HBase service.
  3. Run the following commands to copy the HDFS data of the current cluster to the standby cluster:

    hadoop distcp -i /hbase/data hdfs://IP address of the active NameNode of the HDFS service in the standby cluster:Port number/hbase

    hadoop distcp -update -append -delete /hbase/ hdfs://IP address of the active NameNode of the HDFS service in the standby cluster:Port number/hbase/

    The second command incrementally copies the files outside the data directory; for example, files in the archive directory may be referenced by the data directory.
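
    Substituting a hypothetical standby NameNode address of 192.168.1.10 and RPC port 8020, the two commands become:

    hadoop distcp -i /hbase/data hdfs://192.168.1.10:8020/hbase

    hadoop distcp -update -append -delete /hbase/ hdfs://192.168.1.10:8020/hbase/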

Perform the following operations on the standby cluster:

  1. Restart the HBase service for the data migration to take effect. During the restart, HBase loads the data in the current HDFS and regenerates metadata.
  2. After the restart is complete, run the following command on the Master node client to load the HBase table data:

    $HBASE_HOME/bin/hbase hbck -fixMeta -fixAssignments

  3. After the command is executed, run the following command repeatedly to check the health status of the HBase cluster until it is normal:

    hbase hbck

    If HBase coprocessors are used and custom JAR files are stored on the RegionServer/HMaster nodes of the active cluster, copy these JAR files to the standby cluster before restarting its HBase service.