Data Distribution, Data Skew, and Local Joins in Greenplum

Every table in a Greenplum database has a distribution method. The DISTRIBUTED clause specifies the distribution method for a table. There are two distribution methods:

Hash distribution => designated by specifying one or more column names in the DISTRIBUTED BY clause, or
Random distribution => designated by specifying the DISTRIBUTED RANDOMLY clause.

The DISTRIBUTED BY (column name) clause distributes the data across all segment instances using a hash algorithm. Use a distribution column with unique values and high cardinality to distribute the data evenly across all segments. Boolean columns (for example, True/False) and other low-cardinality columns (for example, MALE/FEMALE) are not candidates for a distribution column, because all rows will hash to only two segments.

The DISTRIBUTED RANDOMLY clause distributes the data across all segment instances using a random algorithm, and is not guaranteed to provide a perfectly even distribution.
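The two methods can be sketched as follows; the table and column names here are hypothetical, used only for illustration:

```sql
-- Hash distribution: rows are assigned to segments by hashing the
-- distribution key, so equal key values always land on the same segment.
CREATE TABLE sales (
    txn_id   bigint,
    cust_id  integer,
    amount   numeric(10,2)
) DISTRIBUTED BY (txn_id);

-- Random distribution: rows are spread across segments with no
-- relationship between a row's values and the segment it lands on.
CREATE TABLE staging_sales (
    txn_id   bigint,
    cust_id  integer,
    amount   numeric(10,2)
) DISTRIBUTED RANDOMLY;
```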

In an MPP shared-nothing environment, the overall response time for a query is determined by the completion time of all segments. If the data is skewed, segments with more data will take longer to complete. The goal is for each segment to hold a comparable number of rows and to perform approximately the same amount of processing.

For any table that uses random distribution, either a redistribute motion operation or a broadcast motion operation will be required to perform a table join. There are performance implications when redistributing or broadcasting very large tables.

To perform a join, matching rows must be located together on the same segment.

In the case where the data was not distributed on the join column, a dynamic redistribution of the needed rows from one of the tables across the segments is performed at query time. There is a performance cost to redistributing data, and it must be paid every time the join is required by a query. Though the impact is minimal for small tables, avoid redistribute motion operations for very large tables whenever possible.
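One way to see whether a join requires a motion is to inspect the query plan with EXPLAIN and look for Redistribute Motion or Broadcast Motion nodes. The tables and columns below are hypothetical:

```sql
-- If sales is distributed by txn_id but joined on cust_id, the planner
-- must move rows so that matching cust_id values meet on the same
-- segment. The resulting plan will contain a "Redistribute Motion"
-- node (or a "Broadcast Motion" node if one table is small enough).
EXPLAIN
SELECT s.txn_id, c.cust_name
FROM sales s
JOIN customers c ON s.cust_id = c.cust_id;
```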

In some cases a broadcast motion is performed rather than a redistribute motion. In a broadcast motion, every segment sends its own rows to all other segments, so every segment ends up with a complete local copy of the entire table. Because a broadcast motion can be more expensive than a redistribute motion, the Greenplum database optimizer typically selects a broadcast motion only for small tables; a broadcast motion is not acceptable for very large tables. Random distribution should be used for small tables, and when a hash distribution method is not feasible on large tables due to significant data skew.

To achieve substantial performance gains when joining large tables, use a hash distribution that evenly distributes table rows across all segments and results in local joins (co-located joins). A local join is performed entirely within a segment, operating independently of other segments, with no network traffic or communication between segments; this eliminates or minimizes broadcast motion and redistribute motion operations.

To achieve local joins, distribute large tables that are commonly joined together on the same column (the column used in the join condition). The distribution columns must also be of the same data type. While the values might appear identical, different data types are stored differently at the disk level and hash to different values, so like rows end up on different segments.
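A co-located join can be arranged by declaring the same distribution key, with the same data type, on both tables. The schema below is a hypothetical sketch:

```sql
-- Both tables hash cust_id (same column, same integer type) across the
-- segments, so rows with equal cust_id values are stored on the same
-- segment and the join below can run entirely segment-locally.
CREATE TABLE customers (
    cust_id   integer,
    cust_name text
) DISTRIBUTED BY (cust_id);

CREATE TABLE orders (
    order_id  bigint,
    cust_id   integer,   -- integer here too; a bigint would hash differently
    total     numeric(12,2)
) DISTRIBUTED BY (cust_id);

-- Joining on the shared distribution key requires no motion operation.
SELECT c.cust_id, c.cust_name, sum(o.total)
FROM customers c
JOIN orders o ON c.cust_id = o.cust_id
GROUP BY c.cust_id, c.cust_name;
```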

After loading data, always check for data skew using either of the following SQL statements. The first query returns one row per segment. The second query returns a single row for the table.

SELECT COUNT(*), gp_segment_id FROM <table-name> GROUP BY gp_segment_id;

SELECT 'facts' AS "Table Name",
       max(c) AS "Max Seg Rows",
       min(c) AS "Min Seg Rows",
       (max(c) - min(c)) * 100.0 / max(c) AS "Percentage Difference Between Max & Min"
FROM (SELECT count(*) c, gp_segment_id FROM facts GROUP BY 2) AS a;