How to debug bad performing queries due to uneven data distribution

posted Apr 28, 2017, 3:41 PM by Sachchida Ojha
If the data is not distributed evenly, then query processing performance may suffer. 

All tables in Greenplum Database are distributed, meaning their data is divided across all of the segments in the system.  The following views can help diagnose if a table has uneven data distribution:

1. gp_skew_coefficients
2. gp_skew_idle_fractions


1. gp_skew_coefficients

gpadmin=# \d gp_toolkit.gp_skew_coefficients
View "gp_toolkit.gp_skew_coefficients"
    Column    |  Type   | Modifiers 
--------------+---------+-----------
 skcoid       | oid     | 
 skcnamespace | name    | 
 skcrelname   | name    | 
 skccoeff     | numeric | 
View definition:
 SELECT skew.skewoid AS skcoid, pgn.nspname AS skcnamespace, pgc.relname AS skcrelname, skew.skewval AS skccoeff
   FROM gp_toolkit.__gp_skew_coefficients() skew(skewoid, skewval)
   JOIN pg_class pgc ON skew.skewoid = pgc.oid
   JOIN pg_namespace pgn ON pgc.relnamespace = pgn.oid;

This view shows data distribution skew by calculating the coefficient of variation (CV) for the data stored on each segment. This view is accessible to all users, however non-superusers will only be able to see tables that they have permission to access.

skcoid=>The object id of the table.
skcnamespace=>The namespace where the table is defined.
skcrelname=>The table name.
skccoeff=>The coefficient of variation (CV) is calculated as the standard deviation divided by the average. It takes into account both the average and variability around the average of a data series. The lower the value, the better. Higher values indicate greater data skew.

2. gp_skew_idle_fractions

gpadmin=# \d gp_toolkit.gp_skew_idle_fractions
View "gp_toolkit.gp_skew_idle_fractions"
Column | Type | Modifiers 
--------------+---------+-----------
sifoid | oid | 
sifnamespace | name | 
sifrelname | name | 
siffraction | numeric | 
View definition:
SELECT skew.skewoid AS sifoid, pgn.nspname AS sifnamespace, pgc.relname AS sifrelname, skew.skewval AS siffraction
FROM gp_toolkit.__gp_skew_idle_fractions() skew(skewoid, skewval)
JOIN pg_class pgc ON skew.skewoid = pgc.oid
JOIN pg_namespace pgn ON pgc.relnamespace = pgn.oid;

This view shows data distribution skew by calculating the percentage of the system that is idle during a table scan, which is an indicator of processing data skew. This view is accessible to all users, however non-superusers will only be able to see tables that they have permission to access.

sifoid
=>The object id of the table.
sifnamespace=>The namespace where the table is defined.
sifrelname=>The table name.
siffraction=>The percentage of the system that is idle during a table scan, which is an indicator of uneven data distribution or query processing skew. For example, a value of 0.1 indicates 10% skew, a value of 0.5 indicates 50% skew, and so on. Tables that have more than 10% skew should have their distribution policies evaluated.
Comments