Unloading data over gpfdists in Greenplum

Test unloading data over gpfdist

-- Table Structure

mydb=> \d tab1
                            Table "schema.tab1"
             Column             |            Type             | Modifiers
--------------------------------+-----------------------------+-----------
 column_01                      | bytea                       | not null
 column_02                      | bytea                       | not null
 column_03                      | smallint                    | not null
 column_04                      | date                        | not null
 column_05                      | bytea                       | not null
 column_06                      | date                        | not null
 column_07                      | date                        | not null
 column_08                      | bytea                       |
 column_09                      | smallint                    | not null
 column_10                      | smallint                    | not null
 column_11                      | numeric(15,2)               | not null
 column_12                      | date                        | not null
 column_13                      | smallint                    | not null
 column_14                      | smallint                    | not null
 column_15                      | smallint                    | not null
 column_16                      | character varying(14)       | not null
 column_17                      | character varying(8)        | not null
 column_18                      | timestamp without time zone | not null
 column_19                      | smallint                    | not null
 column_20                      | bytea                       |
Distributed by: (column_02)

-- Record Count

mydb=> select count(*) from tab1;
  count  
---------
 1000000
(1 row)

Time: 182.207 ms

-- Table Size

mydb=> select pg_size_pretty(pg_total_relation_size('tab1'));
 pg_size_pretty 
----------------
 218 MB
(1 row)

Time: 52.392 ms

--- Check Skew

mydb=> select max(c), min(c), avg(c), count(*) from (select gp_segment_id, count(*) c from tab1 group by gp_segment_id) a;
  max  |  min  |       avg        | count 
-------+-------+------------------+-------
 21086 | 20559 | 20833.3333333333 |    48
(1 row)

Time: 400.475 ms

-- No Skew in Data

--- Start two gpfdist Instances (ssl and no ssl) on server etlserver (connected to the 10GB VLAN on the DCA)

gpfdist -p 8080 -d $PWD -l $PWD/gpfdist_ssl.log -m 4000000 --ssl /certs_dir &
gpfdist -p 8081 -d $PWD -l $PWD/gpfdist_nossl.log -m 4000000 &


-- Dump Data over gpfdist SSL

mydb=> create writable external table ext_w_tab1_ssl (like tab1) location('gpfdists://etlserver:8080/tab1_data_ssl.txt') format 'text';
NOTICE:  Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE EXTERNAL TABLE

mydb=> insert into ext_w_tab1_ssl select * from tab1;
INSERT 0 1000000
Time: 811513.791 ms

-- Data Transfer Rate

mydb=> select 218.0 / (811513.791/1000) "Data Transfer MB/Sec";
  Data Transfer MB/Sec  
------------------------
 0.26863375880694059579
(1 row)

-- Dump data over gpfdist (no SSL)

mydb=> create writable external table ext_w_tab1_nossl (like tab1) location('gpfdist://etlserver:8081/tab1_data_nossl.txt') format 'text';
NOTICE:  Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE EXTERNAL TABLE

mydb=> insert into ext_w_tab1_nossl select * from tab1;
INSERT 0 1000000
Time: 8147.634 ms

-- Data Transfer Rate
mydb=> select 218.0 / (8147.634/1000) "Data Transfer MB/Sec";
 Data Transfer MB/Sec 
----------------------
  26.7562337728965243
(1 row)


====================================================

Test loading the same data over gpfdist

====================================================

-- Create table

mydb=> create table tab2 (like tab1);
NOTICE:  Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE TABLE

-- Load over gpfdist SSL

mydb=> create external table ext_r_tab1_ssl (like tab2) location('gpfdists://etlserver:8080/tab1_data_ssl.txt') format 'text';
CREATE EXTERNAL TABLE

mydb=> insert into tab2 select * from ext_r_tab1_ssl;
INSERT 0 1000000
Time: 12490.419 ms

-- Data Transfer Rate

mydb=> select 218.0 / (12490.419/1000) "Data Transfer MB/Sec";
 Data Transfer MB/Sec 
----------------------
  17.4533776649126022
(1 row)

-- Load over gpfdist No SSL

mydb=> create external table ext_r_tab1_nossl (like tab2) location('gpfdist://etlserver:8081/tab1_data_ssl.txt') format 'text';
CREATE EXTERNAL TABLE

mydb=> insert into tab2 select * from ext_r_tab1_nossl;
INSERT 0 1000000
Time: 6578.678 ms

-- Data Transfer Rate

mydb=> select 218.0 / (6578.678/1000) "Data Transfer MB/Sec";
 Data Transfer MB/Sec 
----------------------
  33.1373567759358339
(1 row)

My observation shows that data download over gpfdists is very slow . This is what I have found.

 

1. Data download using writable external table using gpfdists protocol (gpfdist with ssl) is about a 100 times slower than downloading over gpfdist (no ssl).

2. Data download speed does not increase while using gpfdists by using multiple gpfdist instances / multiple download files. The speed increases almost linearly when using plain gpfdist.

3. Data upload over gpfdists is slower than gpfdist, but by a much smaller margin (about 2 times).

4. Some observations:

 

Data download speed using gpfdists: 0.27 MB/Sec

Data download speed using gpfdist: 27 MB/Sec

Data upload speed using gpfdists: 17.45 MB/Sec

Data upload speed using gpfdist: 33.14 MB/Sec

Comments