Choosing the Table Storage Model in Greenplum

Greenplum Database provides an agile and flexible processing engine capable of supporting several storage models (or a hybrid of storage models). When you create a new table, you have several options as to how its data is stored on disk. This section explains the various options for table storage and how to decide on the best storage model for your workload.

1. Choosing Heap or Append-Only Storage
2. Choosing Row or Column-Oriented Storage
3. Using Compression (Append-Only Tables Only)
4. Checking the Compression and Distribution of an Append-Only Table

Choosing Heap or Append-Only Storage

By default, Greenplum Database uses the same heap storage model as PostgreSQL. Heap table storage favors OLTP-type workloads where the data is often modified after it is initially loaded. UPDATE and DELETE operations require row-level versioning information to be stored in order to ensure that database transactions are processed reliably. Heap tables are best suited for smaller tables, such as dimension tables, that are often updated after they are initially loaded.
Greenplum Database also offers an append-only table storage model. Append-only table storage favors denormalized fact tables in a data warehouse environment, which are typically the largest tables in the system. Fact tables are typically loaded in batches, and then accessed by read-only queries.

Data is not updated after it is loaded. Moving large fact tables to an append-only storage model eliminates the storage overhead of the per-row update visibility information (about 20 bytes per row is saved). This allows for a leaner and easier-to-optimize page structure. Append-only tables do not allow UPDATE and DELETE operations. The storage model of append-only tables is optimized for bulk data loading. Single row INSERT statements are not recommended.

To create a heap table
Row-oriented heap tables are the default storage type, so no extra CREATE TABLE command syntax is required to create a heap table. For example:
=> CREATE TABLE foo (a int, b text) DISTRIBUTED BY (a);

To create an append-only table
The WITH clause of the CREATE TABLE command is used to declare the storage options of the table. If not declared, the table will be created as a regular row-oriented heap-storage table. For example, to create an append-only table with no compression:
=> CREATE TABLE bar (a int, b text)
WITH (appendonly=true)
DISTRIBUTED BY (a);

Choosing Row or Column-Oriented Storage
Greenplum provides a choice of storage orientation models: row or column (or a hybrid of both). This section provides some general guidelines for choosing the correct storage orientation for a table; however you are encouraged to evaluate performance using your own data and query workloads.
For most general purpose or mixed workloads, row-oriented storage offers the best combination of flexibility and performance. However, there are certain specific use cases where a column-oriented storage model provides more efficient I/O and storage. Consider the following requirements when deciding on the storage orientation model of a table:

1. Updates of table data. If table data must be updated after it is loaded, choose a row-oriented heap table. Column-oriented table storage is only available on append-only tables. 

2. Frequent INSERTs. If new rows are frequently inserted into the table, consider a row-oriented model. Column-oriented tables are not optimized for write operations, as column values for a row must be written to different places on disk.

3. Number of columns requested in queries. If you typically request all or the majority of columns in the SELECT list or WHERE clause of your queries, consider a row-oriented model. Column-oriented tables are best suited to queries that aggregate many values of a single column where the WHERE or HAVING predicate is also on the aggregate column,
SELECT SUM(salary)...
SELECT AVG(salary)... WHERE salary > 10000
or where the WHERE predicate is on a single column and is highly selective (returns a relatively small number of rows).
SELECT salary, dept ... WHERE state='CA'

4. Number of columns in the table. Row-oriented storage is more efficient when many columns are required at the same time, or when the row-size of a table is relatively small. Column-oriented tables can offer better query performance on wide tables (lots of columns) where you typically only access a small subset of columns in your queries.

5. Compression. Since column data is of the same data type, there are some storage size optimizations available in column-oriented data that are not available in row-oriented data. For example, many compression schemes make use of the similarity of adjacent data to compress. However, the greater adjacent compression achieved, the more difficult random access may become, as data might need to be uncompressed to be read.

To create a column-oriented table

The WITH clause of the CREATE TABLE command is used to declare the storage options of the table. If not declared, the table will be created as a row-oriented heap table. Tables that use column-oriented storage must also be append-only tables. For example, to create a column-oriented table:
=> CREATE TABLE bar (a int, b text)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (a);

Using Compression (Append-Only Tables Only)
Tables that utilize the append-only storage model also have the option of using in-database compression (with zlib or QuickLZ) to save disk space. Using in-database compression requires that your segment systems have the available CPU power to compress and uncompress the data. Compressed append-only tables should not be used on file systems that also are using compression. If the file system where your segment data directory resides is a compressed file system, your append-only table should not use compression.

When choosing a compression type and level for append-only tables, consider these factors:

1. CPU usage
2. Compression ratio/disk size
3. Speed of compression
4. Speed of decompression/scan rate

Though minimizing disk size may be the main goal in compressing tables, the time and CPU capacity required to compress and scan data is also important to consider. Every system has an optimal range of settings where compression most efficiently reduces data size without causing excessively long compression times or slow scan rates.

QuickLZ compression generally uses less CPU capacity and compresses data faster at a lower compression ratio than zlib. Conversely, zlib provides higher compression ratios at lower speeds. At compression level 1 (compresslevel=1), QuickLZ and zlib may yield comparable compression ratios (though at different speeds). However,  using zlib compression at a higher level of 6 might dramatically increase its advantage over QuickLZ in compression ratio (though consequently lowering the speed of compression).

Performance with compressed append-only tables depends on hardware, query tuning settings, and other factors. Greenplum recommends performing comparison testing to determine the actual performance in your environment.

Note: QuickLZ compression level can only be set to level 1; no other options are available. Compression level with zlib can be set at any value from 1 - 9.

To create a compressed table
The WITH clause of the CREATE TABLE command is used to declare the storage options of the table. Tables that use compression must also be append-only tables. For example, to create an append-only table with zlib compression at a compression level of 5:
=> CREATE TABLE foo (a int, b text)
WITH (appendonly=true, compresstype=zlib,
compresslevel=5);


Checking the Compression and Distribution of an Append-Only Table

Greenplum provides built-in functions to check the compression ratio and the distribution of an append-only table. Both functions take either the object ID or name of a table. The table name may be qualified with a schema name.
======================================================================================================================
Functions for compressed append-only table metadata
======================================================================================================================
Function                                             Return Type                           Description
=====================                 =====================    ==================================
get_ao_distribution(oid,name)                 Set of (dbid, tuplecount) rows   Shows the distribution of rows of an append-only table across the array. Returns a set of 
                                                                                                         rows, each of which includes a segment dbid and the number of tuples stored on the 
                                                                                                         segment.
get_ao_compression_ratio(oid,name)      float8                                      Calculates the compression ratio for a compressed append-only table. If information is not 
                                                                                                         available, this function returns a value of -1.
======================================================================================================================

The compression ratio is returned as a common ratio. For example, a returned value of 3.19, or 3.19:1, means that the uncompressed table is slightly larger than three times the size of the compressed table.

The distribution of the table is returned as a set of rows indicating how many tuples are stored on each segment. For example, in a system with four primary segments with dbid values ranging from 0 - 3, the function returns four rows similar to the following:

=# SELECT get_ao_distribution('lineitem_comp');
get_ao_distribution
---------------------
(0,7500721)
(1,7501365)
(2,7499978)
(3,7497731)
(4 rows)

Altering the Table Storage Model

It is not possible to alter the storage model of a table that has already been created. Storage, compression, and orientation can only be declared at CREATE TABLE time. If you have an existing table for which you want to change the storage model, you must created table, drop the original table and rename the new table to the original name. You must also re-grant any table permissions. For example:

CREATE TABLE sales2 (LIKE sales)
WITH (appendonly=true, compresstype=quicklz, compresslevel=1, orientation=column);
INSERT INTO sales2 SELECT * FROM sales;
DROP TABLE sales;
ALTER TABLE sales2 RENAME TO sales;
GRANT ALL PRIVILEGES ON sales TO admin;
GRANT SELECT ON sales TO guest;
Comments