Using Indexes in Greenplum Database

In most traditional databases, indexes can greatly improve data access times. However, in a distributed database such as Greenplum, indexes should be used more sparingly. Greenplum Database is very fast at sequential scanning (indexes use a random seek pattern to locate records on disk). Also, unlike a traditional database, the data is distributed across the segments. This means each segment scans a smaller portion of the overall data in order to get the result. If using table partitioning, the total data to scan may be even a fraction of that. Typically, a business intelligence (BI) query workload returns very large data sets, and thus does not make efficient use of indexes.

Greenplum recommends that you first try your query workload without adding any additional indexes. Indexes are more likely to improve performance for OLTP type workloads, where the query is returning a single record or a small subset of data. Indexes can also improve performance on compressed append-only tables for queries that return a targeted set of rows, as the optimizer has the option to use an index access method rather than a full table scan when appropriate. For compressed data, an index access method means only the necessary rows are uncompressed.

Note that Greenplum Database will automatically create PRIMARY KEY indexes for tables with primary keys. If you want to create an index on a partitioned table, you must index each partitioned child table directly. Indexes on the parent table are not passed down to child table partitions.

Indexes do add some database overhead — they take up storage space and have to be maintained whenever the table is updated. Make sure that the indexes you create are actually being used by your query workload. Also, check to see that the indexes you add are indeed improving query performance (as compared to a sequential scan of the table). You can look at the EXPLAIN plans for a query to determine if indexes are being used.

Some other general considerations when creating indexes are:

1. Your Query Workload. Indexes are more likely to improve performance for OLTP type workloads, where the query is returning a single record or a very small data set. Typically, a business intelligence (BI) query workload returns very large data sets, and thus does not make efficient use of indexes. For this type of workload, it is better to use sequential scans to locate large chunks of data on disk rather than to randomly seek the disk using index scans.

2. Compressed Tables. Indexes can improve performance on compressed append-only tables for queries that return a targeted set of rows, as the optimizer has the option to use an index access method rather than a full table scan when appropriate. For compressed data, an index access method means only the necessary rows are uncompressed.

3. Avoid indexes on frequently updated columns. Creating an index on a column that is frequently updated increases the amount of writes required when the column is updated.

4. Create selective B-tree indexes. Index selectivity is a ratio of the number of distinct values a column has divided by the number of rows in a table. For example, if a table has 1000 rows and a column has 800 distinct values, the selectivity of the index is 0.8, which is considered good. Unique indexes always have a selectivity ratio of 1.0, which is the best possible. Note that unique indexes are only allowed on distribution key columns in Greenplum Database.

5. Use Bitmap indexes for low selectivity columns. Greenplum Database has an additional index type called a Bitmap index, which is not available in regular PostgreSQL.

7. Index columns used in joins. An index on a column used for frequent joins (such as a foreign key column) may improve join performance, as it can enable additional join methods for the query planner to use.

8. Index columns frequently used in predicates. For queries on large tables, examine the WHERE predicates for the columns that are referenced most often. These may be good candidates for indexes.

9. Avoid overlapping indexes. Overlapping indexes (those that have the same leading column) are redundant and unnecessary.

10. Drop indexes for bulk loads. For mass loads of data into a table, consider dropping the indexes and re-creating them after the load is complete. This is often faster than updating the indexes.

11. Consider a clustered index. Clustering an index means that the records are physically ordered on disk according to the index. If the records you need are distributed randomly on disk, then the database has to seek across the disk to get the records requested. If those records are stored more closely together, then the fetching from disk is more sequential. A good example for a clustered index is on a date column where the data is ordered sequentially by date. A query against a specific date range will result in an ordered fetch from the disk, which leverages fast sequential access.

Comments