Monitoring distribution keys in Greenplum

posted Apr 28, 2017, 12:20 PM by Sachchida Ojha
1. Run the following query to find the distribution keys of every table in the Greenplum database. (A filter can be applied at the schema level.)
SELECT table_owner,
       table_name,
       COALESCE(distribution_keys, 'DISTRIBUTED RANDOMLY') AS distribution_keys
FROM (
    -- One row per distribution key column; NULL for randomly distributed tables.
    SELECT pgn.nspname AS table_owner,
           pgc.relname AS table_name,
           pga.attname AS distribution_keys
    FROM (SELECT gdp.localoid,
                 CASE
                     WHEN (array_upper(gdp.attrnums, 1) > 0) THEN unnest(gdp.attrnums)
                     ELSE NULL
                 END AS attnum
          FROM gp_distribution_policy gdp) AS distrokey
    INNER JOIN pg_class AS pgc
            ON distrokey.localoid = pgc.oid
    INNER JOIN pg_namespace pgn
            ON pgc.relnamespace = pgn.oid
    LEFT OUTER JOIN pg_attribute pga
            ON distrokey.attnum = pga.attnum
           AND distrokey.localoid = pga.attrelid
) AS a
ORDER BY table_owner, table_name;
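
The schema-level filter mentioned in step 1 could look something like the sketch below. It assumes a schema named public (a placeholder) and the same gp_distribution_policy.attrnums column that the query above relies on. It lists only hash-distributed tables, one row per key column, ordered by column position rather than by position within the key.

```sql
-- Distribution key columns for the tables of a single schema.
-- 'public' is a placeholder schema name; change it as needed.
SELECT pgn.nspname AS table_owner,
       pgc.relname AS table_name,
       pga.attname AS distribution_key
FROM gp_distribution_policy gdp
INNER JOIN pg_class pgc
        ON gdp.localoid = pgc.oid
INNER JOIN pg_namespace pgn
        ON pgc.relnamespace = pgn.oid
INNER JOIN pg_attribute pga
        ON pga.attrelid = gdp.localoid
       AND pga.attnum = ANY (gdp.attrnums)  -- NULL attrnums (random) never match
WHERE pgn.nspname = 'public'
ORDER BY pgn.nspname, pgc.relname, pga.attnum;
```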

2. Review the distribution keys of each table.

3. Some tables may have no distribution key (they are distributed randomly). Recreate such tables with a suitable distribution key.

4. Run the following queries to see how table data is distributed at the segment level.

-- Row count per segment for one table.
SELECT COUNT(*), gp_segment_id FROM <table-name> GROUP BY gp_segment_id;

-- Spread between the fullest and emptiest segment; facts is an example table name.
SELECT 'facts' AS "Table Name",
       max(c) AS "Max Seg Rows",
       min(c) AS "Min Seg Rows",
       (max(c) - min(c)) * 100.0 / max(c) AS "Percentage Difference Between Max & Min"
FROM (SELECT count(*) c, gp_segment_id FROM facts GROUP BY 2) AS a;
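
To make the percentage formula concrete, here is a small worked example that runs without any table. The four per-segment row counts are made-up illustration values, not measurements from a real system.

```sql
-- Hypothetical row counts for four segments (seg_id, c).
SELECT max(c) AS max_seg_rows,
       min(c) AS min_seg_rows,
       (max(c) - min(c)) * 100.0 / max(c) AS pct_difference
FROM (VALUES (0, 1200), (1, 1000), (2, 900), (3, 1100)) AS seg(seg_id, c);
```

With these values the difference is (1200 - 900) * 100 / 1200 = 25 percent, meaning the emptiest segment holds a quarter fewer rows than the fullest one.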

5. Analyze the output. If the data is not distributed evenly across the segments, examine the actual data to determine whether a different column or combination of columns would make a better distribution key.

6. Alter the table to change its distribution keys.
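
One way to approach the analysis in step 5 is to check how many distinct values a candidate column has and whether a few values dominate. The sketch below assumes a table named sales and a candidate column named customer_id, both purely as examples; substitute your own names.

```sql
-- Cardinality of the candidate key compared with the total row count.
SELECT count(*) AS total_rows,
       count(DISTINCT customer_id) AS distinct_keys
FROM sales;

-- The ten most frequent values. Rows with the same key value land on the
-- same segment, so a value holding a large share of the rows causes skew.
SELECT customer_id, count(*) AS row_count
FROM sales
GROUP BY customer_id
ORDER BY row_count DESC
LIMIT 10;
```

A candidate with many distinct values and no dominant value is more likely to spread rows evenly.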

Altering Table Distribution
ALTER TABLE provides options to change a table’s distribution policy. When the table distribution options change, the table data is redistributed on disk, which can be resource intensive. You can also redistribute table data using the existing distribution policy.

Changing the Distribution Policy
For partitioned tables, changes to the distribution policy apply recursively to the child partitions. This operation preserves the ownership and all other attributes of the table. For example, the following command redistributes the table sales across all segments using the customer_id column as the distribution key:

ALTER TABLE sales SET DISTRIBUTED BY (customer_id);

When you change the hash distribution of a table, table data is automatically redistributed. 

Changing the distribution policy to a random distribution does not cause the data to be redistributed. For example:
ALTER TABLE sales SET DISTRIBUTED RANDOMLY;

Redistributing Table Data
To redistribute table data for tables with a random distribution policy (or when the hash distribution policy has not changed), use REORGANIZE=TRUE. Reorganizing data may be necessary to correct a data skew problem, or when segment resources are added to the system. For example, the following command redistributes table data across all segments using the current distribution policy, including random distribution:
ALTER TABLE sales SET WITH (REORGANIZE=TRUE);
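
A minimal sketch of checking the effect of a reorganization, combining the per-segment count from step 4 with the command above. It uses the same example table, sales.

```sql
-- Per-segment row counts before redistribution.
SELECT gp_segment_id, count(*) AS row_count
FROM sales
GROUP BY gp_segment_id
ORDER BY gp_segment_id;

-- Redistribute using the current distribution policy.
ALTER TABLE sales SET WITH (REORGANIZE=TRUE);

-- Per-segment row counts after redistribution, for comparison.
SELECT gp_segment_id, count(*) AS row_count
FROM sales
GROUP BY gp_segment_id
ORDER BY gp_segment_id;
```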

Altering the Table Storage Model
Table storage, compression, and orientation can be declared only at creation. To change the storage model, you must create a table with the correct storage options, load the original table data into the new table, drop the original table, and rename the new table to the original table’s name. You must also re-grant any table permissions. For example:

CREATE TABLE sales2 (LIKE sales) WITH (appendonly=true, compresstype=quicklz, compresslevel=1, orientation=column);
INSERT INTO sales2 SELECT * FROM sales;
DROP TABLE sales;
ALTER TABLE sales2 RENAME TO sales;
GRANT ALL PRIVILEGES ON sales TO admin;
GRANT SELECT ON sales TO guest;
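
Because DROP TABLE discards the original data, it may be worth confirming the copy first. A minimal sketch, assuming the sales and sales2 names from the example above, to be run after the INSERT and before the DROP:

```sql
-- The two counts should match before the original table is dropped.
SELECT (SELECT count(*) FROM sales)  AS original_rows,
       (SELECT count(*) FROM sales2) AS copied_rows;
```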

Use the ALTER TABLE command to add a compressed column to a table:

ALTER TABLE sachi ADD COLUMN newcompresscol1 int DEFAULT 0 ENCODING (COMPRESSTYPE=zlib);
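
To confirm the column-level settings afterwards, a query along these lines may help. It is a sketch that assumes the pg_attribute_encoding catalog table is present in your Greenplum version and that the table is column-oriented; sachi is the example table from the command above.

```sql
-- Column-level storage options recorded for the table's columns.
SELECT a.attname, e.attoptions
FROM pg_attribute_encoding e
INNER JOIN pg_attribute a
        ON a.attrelid = e.attrelid
       AND a.attnum = e.attnum
WHERE e.attrelid = 'sachi'::regclass
ORDER BY a.attnum;
```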