gpfdist

posted Sep 13, 2012, 7:06 AM by Sachchida Ojha
gpfdist is Greenplum’s parallel file distribution program. It is used by readable external tables and gpload to serve external table files to all Greenplum Database segments in parallel. It is used by writable external tables to accept output streams from Greenplum Database segments in parallel and write them out to a file.

For the best performance and ease of administrative setup, the gpfdist protocol is preferred. One advantage of using the gpfdist file server program (as opposed to the file protocol) is that it ensures that all of the segments in your Greenplum Database system are fully utilized when reading from external table data files.

Installing gpfdist

The gpfdist program is installed in $GPHOME/bin of your Greenplum Database master host installation. Most likely you will want to run gpfdist from a machine other than your Greenplum Database master, such as on a machine devoted to ETL processing. To install gpfdist on another machine:
1.Copy gpfdist from $GPHOME/bin in your Greenplum installation to the remote machine.
2.Add gpfdist to your $PATH.

Starting and Stopping gpfdist

To start gpfdist, you must tell it from which directory it will be serving files and optionally the port to run on (defaults to HTTP port 8080).

To start gpfdist in the background (and log output messages and errors to a log file):
$ gpfdist -d /var/load_files -p 8081 -l /home/gpadmin/log &

If starting multiple gpfdist instances on the same ETL host (see Figure 12.3), then each should use a different port and base directory. For example:

$ gpfdist -d /var/load_files1 -p 8081 -l /home/gpadmin/log1 &
$ gpfdist -d /var/load_files2 -p 8082 -l /home/gpadmin/log2 &

To stop gpfdist when its running in the background:

--First find its process id:
$ ps ax | grep gpfdist (Linux)
$ ps –ef | grep gpfdist (Solaris)

--Then kill the process, for example:
$ kill 3456

The gpfdist program can serve data to the segment instances at an average rate of about 350 MB/s for delimited text formatted files and 200 MB/s for CSV formatted files. Therefore, you should consider the following options when running gpfdist in order to maximize the network bandwidth of your ETL systems:

1. If your ETL machine is configured with multiple network interface cards (NICs), run one instance of gpfdist on your ETL host and then define your external table definition so that the host name of each NIC is declared in the LOCATION clause. This allows network traffic between your Greenplum segment hosts and your ETL host to use all NICs simultaneously.

2. Run multiple gpfdist instances on your ETL host and divide your external data files equally between each instance. For example, if you have an ETL system with two network interface cards (NICs), then you could run two gpfdist instances on that machine to maximize your load performance. You would then divide the external table data files evenly between the two gpfdist programs.

Controlling Segment Parallelism
You can use the following server configuration parameter to control how many segment instances access a single gpfdist program at a time. 64 is the default. This allows you to control the number of segments processing external table files, while reserving some segments for other database processing. This parameter can be set in the postgresql.conf file of your master instance:
gp_external_max_segs


Troubleshooting gpfdist

Keep in mind that gpfdist is accessed at runtime by the segment instances. Therefore, you must ensure that the Greenplum segment hosts have network access to gpfdist. The gpfdist program is a simple web server, so to test connectivity you can run the following command from each host in your Greenplum array (segments and master):

$ wget http://gpfdist_hostname:port/filename

Also, make sure that your CREATE EXTERNAL TABLE definition has the correct host name, port, and file names for gpfdist. The file names and paths specified should be relative to the directory from which gpfdist is serving files (the directory path used when you started the gpfdist program).

Example

First, start gpfdist as a background process.

gpfdist -d /home/gpadmin -p 8080 &

Create an emp table in Greenplum.

CREATE TABLE emp
(emp_id int not null,
 emp_name text,
 job_description text)
DISTRIBUTED BY (emp_id);

Now create my External Table.

CREATE EXTERNAL TABLE ext_emp
(LIKE emp) LOCATION ('gpfdist://localhost:8080/demo/emp.txt')
FORMAT 'TEXT' (DELIMITER AS '|' NULL AS 'null');
==================================================================================================================
Assumptions : 1. The location indicates it is using gpfdist and the host is local with port 8080 2.My file is located /home/gpadmin/demo/emp.txt but gpfdist is serving /home/gpadmin so I need to only specify the demo subdirectory and then the file name 3.You can pick TEXT of CSV format but TEXT will execute faster 4.I used pipes as my delimiter and spelled out null for my null values
==================================================================================================================
Insert the data.
INSERT INTO emp SELECT * FROM ext_emp;

Yahooooo! you are done.

Comments