Understanding How Data Is Stored in Greenplum Database

posted Sep 12, 2012, 11:12 AM by Sachchida Ojha
In Greenplum Database all tables are distributed, which means a table is divided into non-overlapping sets of rows, or parts. Each part resides on a single database instance, known as a segment, within the Greenplum Database system. By default, rows are assigned to segments by applying a hashing algorithm to the values of a hash key, so the parts are spread across all of the available segments. Database administrators choose the hash key (one or more table columns) when defining the table.

Physically, Greenplum Database implements the logical database on an array of individual database instances: one master instance and two or more segment instances. The master instance does not contain any user data, only the global catalog tables. The segment instances contain disjoint parts (collections of rows) of each distributed table.
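The idea of splitting a table into disjoint parts by hashing a key can be sketched in a few lines of Python. This is only an illustration: the segment count, the toy table, and the `segment_for` helper are assumptions made up for the example, and `zlib.crc32` is a stand-in, not the hash function Greenplum actually uses.

```python
import zlib

NUM_SEGMENTS = 4  # assumed segment count, for illustration only


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a hash-key value to a segment index.

    crc32 is a stand-in hash; it is NOT Greenplum's real algorithm.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


# A toy table of (customer_id, name) rows; customer_id plays the hash key.
rows = [(cid, f"customer_{cid}") for cid in range(1, 21)]

# Each segment holds its own part of the table; the master holds no rows.
segments = {s: [] for s in range(NUM_SEGMENTS)}
for row in rows:
    segments[segment_for(row[0])].append(row)

for seg, part in segments.items():
    print(f"segment {seg}: {len(part)} rows")
```

Every row lands on exactly one segment, so the parts never overlap and together they make up the whole table. Because the mapping depends only on the key value, looking up the same key again always points to the same segment.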

When you create or alter a table in Greenplum Database, there is an additional DISTRIBUTED clause to define the distribution policy of the table. The distribution policy determines how to divide the rows of a table across the Greenplum segments. Greenplum Database provides two types of distribution policy:

Hash Distribution - With hash distribution, one or more table columns are used as the distribution key for the table. The distribution key is used by a hashing algorithm to assign each row to a particular segment. Keys of the same value will always hash to the same segment. Choosing a unique distribution key, such as a primary key, will ensure the most even data distribution. Hash distribution is the default distribution policy for a table. If a DISTRIBUTED clause is not supplied, then either the PRIMARY KEY (if the table has one) or the first column of the table will be used as the table distribution key.

Random Distribution - With random distribution, rows are sent to the segments as they come in, cycling across the segments in a round-robin fashion. Rows with the same column values will not necessarily be located on the same segment. Although a random distribution ensures even data distribution, there are performance advantages to choosing a hash distribution policy whenever possible: because rows that share a distribution key value are stored on the same segment, joins and aggregations on that key can often be processed locally, without moving rows between segments.
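In DDL, the two policies are written as DISTRIBUTED BY (column, ...) for hash distribution and DISTRIBUTED RANDOMLY for random distribution. The following Python sketch compares how evenly each policy spreads rows. It is a simplified model under stated assumptions: four segments, `zlib.crc32` standing in for Greenplum's real hash, a strict round-robin for the random policy as described above, and hypothetical helper names (`hash_placement`, `round_robin_placement`).

```python
import zlib

NUM_SEGMENTS = 4  # assumed segment count, for illustration only


def hash_placement(keys, num_segments=NUM_SEGMENTS):
    """Rows per segment when each row is placed by hashing its key.

    Models DISTRIBUTED BY (key); crc32 is a stand-in hash function.
    """
    counts = [0] * num_segments
    for key in keys:
        counts[zlib.crc32(str(key).encode("utf-8")) % num_segments] += 1
    return counts


def round_robin_placement(n_rows, num_segments=NUM_SEGMENTS):
    """Rows per segment when rows cycle across the segments in turn.

    Models DISTRIBUTED RANDOMLY as the round-robin described in the text.
    """
    counts = [0] * num_segments
    for i in range(n_rows):
        counts[i % num_segments] += 1
    return counts


n = 10_000
unique_ids = range(n)  # behaves like a primary key
# A low-cardinality column: only two distinct values.
status = ["closed" if i % 10 == 0 else "open" for i in range(n)]

print("hash on unique key: ", hash_placement(unique_ids))
print("hash on status flag:", hash_placement(status))
print("round-robin:        ", round_robin_placement(n))
```

In this model, hashing the two-valued status column can populate at most two segments and leaves the rest empty, which is the kind of skew a unique key avoids. Round-robin gives exactly equal counts, but it gives up the guarantee that equal key values share a segment.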