Greenplum Database Best Practices - Part 1

  • Data Distribution: For the best possible performance, all of the segments should contain equal portions of data. If the data is unbalanced or skewed, the segments with more data will have to work harder to perform their portion of the query processing. To ensure an even distribution of data, choose a distribution key that is unique for each record, such as the primary key; if that is not possible, use DISTRIBUTED RANDOMLY.
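As a sketch (table and column names are illustrative), both policies are declared in the DISTRIBUTED clause at table creation:

```sql
-- Distribute on a unique column for even data placement across segments.
CREATE TABLE sales (
    sale_id   bigint,
    store_id  int,
    amount    numeric(10,2)
) DISTRIBUTED BY (sale_id);

-- No suitable unique column: fall back to round-robin placement.
CREATE TABLE event_log (
    event_time timestamp,
    payload    text
) DISTRIBUTED RANDOMLY;
```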

  • Distribution Key: Make sure tables share a common distribution key where possible. When tables share a common distribution key, Greenplum Database can join or sort on those columns locally at the segment level, which results in the most efficient query processing: during query processing, the work associated with join, sort, or aggregation operations is faster when it can be done locally at the segment level rather than at the system level. Local operations are approximately 5 times faster than distributed operations. With a random distribution policy, local operations are not an option.
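A minimal sketch of a co-located join (names are illustrative): because both tables are distributed on customer_id, matching rows live on the same segment and the join needs no data redistribution.

```sql
CREATE TABLE customers (customer_id bigint, name text)
    DISTRIBUTED BY (customer_id);
CREATE TABLE orders (order_id bigint, customer_id bigint, total numeric)
    DISTRIBUTED BY (customer_id);

-- The join on the shared distribution key executes locally on each segment.
SELECT c.name, sum(o.total)
FROM customers c JOIN orders o USING (customer_id)
GROUP BY c.name;
```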

  • Primary Key: If you need to create a primary key on a table, using a SERIAL column or another sequence generator as the table’s primary key is preferred. However, if your application does make use of OIDs to identify specific rows of a table, it is recommended to create a unique constraint on the OID column of that table, to ensure that OIDs in the table will indeed uniquely identify rows even after counter wrap-around.
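A minimal example of the preferred approach (table and column names are illustrative); SERIAL creates an implicit sequence whose values populate the key:

```sql
-- The sequence-backed column serves as a compact, unique primary key.
-- Note: in Greenplum the primary key also becomes the default distribution key.
CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    name       text
);
```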

  • Unique Constraints: For a unique constraint to be enforced in Greenplum Database, the table must be hash-distributed (not DISTRIBUTED RANDOMLY), and the constraint columns must be the same as (or a superset of) the table’s distribution key columns.
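To illustrate the rule (names are illustrative), the constraint columns must contain the distribution key:

```sql
-- Works: the unique column set (a, b) is a superset of the
-- distribution key (a).
CREATE TABLE t_ok (a int, b int, UNIQUE (a, b))
    DISTRIBUTED BY (a);

-- Rejected by Greenplum: the distribution key (b) is not part of
-- the unique constraint.
-- CREATE TABLE t_bad (a int, b int, UNIQUE (a))
--     DISTRIBUTED BY (b);
```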

  • Index/No Index: The general recommendation with Greenplum is to start with no indexes at all. Put another way, data distribution is itself a form of indexing, since it places particular data on a particular node/segment. The same principle applies to data partitioning. Data distribution and horizontal partitioning allow us to physically store big logical tables as smaller ones, depending on the number of nodes/segments.

  • Partitioning: Partitioning improves query performance on very large tables. Potential candidates are large fact tables. Table partitioning is a tool to logically divide big tables to improve query performance and facilitate data warehouse maintenance tasks. Partitioning does not change the physical distribution of the table data across the segments.
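As a sketch of a typical fact-table layout (table, column, and date ranges are illustrative), monthly range partitions on the date column combine with hash distribution on the key:

```sql
CREATE TABLE fact_sales (
    sale_id   bigint,
    sale_date date,
    amount    numeric
)
DISTRIBUTED BY (sale_id)            -- physical placement across segments
PARTITION BY RANGE (sale_date)      -- logical division within each segment
(
    START (date '2023-01-01') INCLUSIVE
    END   (date '2024-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);
```

Queries that filter on sale_date can then scan only the relevant partitions.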

  • VACUUM: VACUUM reclaims storage occupied by deleted tuples. In normal GPDB operation, tuples that are deleted or obsoleted by an update are not physically removed from their table. They remain present on disk until a VACUUM is done. Therefore, it is necessary to run VACUUM periodically, especially on frequently updated tables.

  • ANALYZE: ANALYZE collects statistics about the contents of tables in the database, and stores the results in the system table pg_statistic. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries.

    It is a good idea to run ANALYZE periodically, or just after making major changes in the contents of a table. Accurate statistics will help the query planner to choose the most appropriate query plan, and thereby improve the speed of query processing. A common strategy is to run VACUUM and ANALYZE once a day during a low-usage time of day.
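The nightly maintenance strategy above can be sketched as follows (the table name is illustrative):

```sql
-- Reclaim dead tuples and refresh planner statistics in one pass:
VACUUM ANALYZE sales;

-- Or run the steps separately; ANALYZE can be limited to specific columns:
VACUUM sales;
ANALYZE sales (sale_date, store_id);
```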

  • Search Path setting: Search path parameter specifies the order in which schemas are searched when an object is referenced by a simple name with no schema component.  When there are objects of identical names in different schemas, the one found first in the search path is used. The system catalog schema, pg_catalog, is always searched, whether it is mentioned in the path or not. When objects are created without specifying a particular target schema, they will be placed in the first schema listed in the search path.
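For example (schema and role names are illustrative):

```sql
-- Schemas are searched left to right; unqualified objects are created
-- in the first schema listed.
SET search_path TO myschema, public;
SHOW search_path;

-- Persist the setting for a particular role:
ALTER ROLE analyst SET search_path TO myschema, public;
```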

  • Row-oriented tables: If table data must be updated after it is loaded, choose a row-oriented heap table. If new rows are frequently inserted into the table, consider a row-oriented model. If you typically request all or the majority of columns in the SELECT list or WHERE clause of your queries, consider a row-oriented model. Row-oriented storage is more efficient when many columns are required at the same time, or when the row-size of a table is relatively small.

  • Column-oriented tables: Column-oriented tables are best suited to queries that aggregate many values of a single column where the WHERE or HAVING predicate is also on the aggregate column. Column-oriented tables are not optimized for write operations, as column values for a row must be written to different places on disk. Column-oriented tables can offer better query performance on wide tables (lots of columns) where you typically only access a small subset of columns in your queries.
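A minimal sketch of each storage model (names are illustrative); note that column orientation requires an append-only table:

```sql
-- Row-oriented heap table: suits frequent updates/inserts and
-- queries that touch most columns.
CREATE TABLE metrics_row (
    metric_time timestamp,
    metric_name text,
    value       double precision
) DISTRIBUTED BY (metric_name);

-- Column-oriented append-only table: suits scans/aggregates over
-- a few columns of a wide table.
CREATE TABLE metrics_col (
    metric_time timestamp,
    metric_name text,
    value       double precision
)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (metric_name);
```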

  • Compressing append-only tables: When choosing a compression type and level for append-only tables, consider these factors: A) CPU usage, B) compression ratio/disk size, C) speed of compression, D) speed of decompression/scan rate. Column-oriented table storage is only available on append-only tables.
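As an illustrative sketch, compression options are set in the WITH clause; zlib levels run 1-9, trading compression ratio (and CPU cost) against speed, and level 5 here is just a middle-of-the-road assumption:

```sql
CREATE TABLE archive_sales (
    sale_id   bigint,
    sale_date date,
    amount    numeric
)
WITH (appendonly=true, orientation=column,
      compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (sale_id);
```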

  • OID’s: Avoid assuming that OIDs are unique across tables; if you need a database-wide unique identifier, use the combination of table OID and row OID for the purpose. Using OIDs in new applications is not recommended.

  • User defined data types: Columns of geometric or user-defined data types are not eligible as Greenplum distribution key columns. If a table does not have a column of an eligible data type, the rows are distributed based on a round-robin or random distribution.

  • Primary key index: No need to create an index explicitly for the primary key columns. Greenplum Database automatically creates an index for each unique constraint or primary key constraint to enforce uniqueness.

  • Foreign Keys: Foreign key constraints are not supported in Greenplum Database.

  • Avoid unnecessary partitioning:  Do not create partitions if your table is small in size. Partitioned tables are also distributed across Greenplum Database segments as is any non-partitioned table. Table distribution in Greenplum Database physically divides a table across the Greenplum segments to enable parallel query processing.

  • Avoid CTAS for large tables: If you need to create a duplicate copy of a large fact table in another user schema, split the work into smaller transactions rather than using a single CREATE TABLE AS SELECT (CTAS). A single CTAS over a large table can consume all your available resources and may freeze the Greenplum database.
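One way to sketch the batched approach (schema, table, and batch boundaries are all illustrative): create the empty target first, then load it range by range in separate transactions.

```sql
-- Create the target with the same column definitions as the source.
CREATE TABLE other_schema.big_fact (LIKE public.big_fact)
    DISTRIBUTED BY (sale_id);

-- Copy one date range per transaction instead of one giant CTAS:
BEGIN;
INSERT INTO other_schema.big_fact
SELECT * FROM public.big_fact
WHERE sale_date >= date '2023-01-01' AND sale_date < date '2023-04-01';
COMMIT;
-- ...repeat for the remaining date ranges.
```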

  • Avoid VACUUM FULL: Unless you need to return space to the OS so that other tables or other parts of the system can use that space, you should use VACUUM instead of VACUUM FULL. VACUUM FULL is only needed when you have a table that is mostly dead rows, that is, the vast majority of its contents have been deleted. Even then, there is no point using VACUUM FULL unless you urgently need that disk space back for other things or you expect that the table will never again grow to its past size. Do not use it for table optimization or periodic maintenance as it is counterproductive.

  • Avoid default first column distribution with many NULL values: Do not let the table distribution default to the first column, which might have lots of NULL or common values. (If a DISTRIBUTED clause is not supplied, then either the PRIMARY KEY (if the table has one) or the first column of the table is used as the table distribution key.)
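To illustrate (names are illustrative), distribute explicitly on a high-cardinality column rather than letting the low-cardinality first column become the default key:

```sql
-- Without the DISTRIBUTED clause, Greenplum would distribute on the first
-- column (status), whose few distinct values would skew data onto a
-- handful of segments.
CREATE TABLE tickets (
    status    text,       -- mostly 'open'/'closed': a poor distribution key
    ticket_id bigint,
    body      text
) DISTRIBUTED BY (ticket_id);
```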