Backing Up and Restoring Greenplum Databases

posted Sep 12, 2012, 1:38 PM by Sachchida Ojha
Greenplum recommends that you take regular backups of your databases. These backups can be used to restore your data or to rebuild a Greenplum Database system in the event of a system failure or data corruption. You can also use the backups to migrate data from one Greenplum Database system to another.

Parallel Backup

Greenplum provides a parallel dump utility called gp_dump. This utility backs up the Greenplum master instance and each active segment instance at the same time.

Because the segments are dumped in parallel, the time it takes to do a backup should scale regardless of the number of segments in your system. The dump files on the master host consist of DDL statements and the Greenplum-specific system catalog tables (such as gp_segment_configuration). On the segment hosts, there is one dump file created for each segment instance. The segment dump files contain the data for an individual segment instance. All of the dump files that comprise a total backup set are identified by a unique 14-digit timestamp key.

In order to automate routine backups, Greenplum also provides the gpcrondump utility, which is a wrapper for gp_dump that can be called directly or from a scheduled CRON job. gpcrondump also allows you to backup additional objects besides your databases and data, such as database roles and server configuration files.

Non-Parallel Backups

Greenplum also supports the regular PostgreSQL dump utilities: pg_dump and pg_dumpall. The PostgreSQL dump utilities (when used on Greenplum Database) will create one big dump file on the master host containing the data from all active segments. In most cases, this is probably not practical, as there is most likely not enough disk space on the master host to create a single backup file of an entire distributed database. These utilities are mostly supported for users who are migrating from regular PostgreSQL to Greenplum Database.

Another useful command for getting data out of a database is the COPY TO SQL command. This allows you to copy all or a portion of a table out of the database to a text-delimited file on the master host.

If you are migrating your data to another Greenplum Database system with a different segment configuration (for example, if the system you are migrating to has greater or fewer segment instances), Greenplum recommends using your parallel dump files created by gp_dump or gpcrondump and following the restore process.

Parallel Restores

To do a parallel restore, you must have a complete backup set created by gp_dump or gpcrondump. Greenplum provides a parallel restore utility called gp_restore. This utility takes the timestamp key generated by gp_dump, validates the backup set, and restores the database objects and data into a distributed database. As with a parallel dump, each segment’s data is restored in parallel.

Greenplum also provides the gpdbrestore utility, which is a wrapper for gp_restore. gpdbrestore provides additional flexibility and verification options, which are useful if you are using automated backup files produced by gpcrondump, or have moved your backup files off of the Greenplum array to an alternate location.

Non-Parallel Restores

Greenplum also supports the regular PostgreSQL restore utility: pg_restore. This utility is mostly supported for users who are migrating to Greenplum Database from regular PostgreSQL, and have compressed dump files created by pg_dump or pg_dumpall. Before restoring PostgreSQL dump files into Greenplum Database, make sure to modify the CREATE TABLE statements in the dump files to include the Greenplum DISTRIBUTED clause.

It may also sometimes be necessary to do a non-parallel restore from a parallel backup set. For example, suppose you are migrating from a Greenplum system that has four segments to one that has five segments. You cannot do a parallel restore in this case, because your backup set only has four backup files and would not be evenly distributed across the new expanded system. A non-parallel restore using parallel backup files involves collecting each backup file from the segment hosts, copying them to the master host, and loading them through the master.