How is gpfdist used on the DIA servers for data loading?

posted Sep 14, 2012, 6:06 AM by Sachchida Ojha
gpfdist

gpfdist is Greenplum’s parallel file distribution server utility software. It is used with read-only external tables for fast, parallel data loading of text files into a Greenplum data warehouse.

gpfdist can be thought of as a network protocol, much like HTTP, and running gpfdist is similar to running an HTTP server. It exposes the files in a local directory over TCP/IP. The target files are usually delimited files or CSV files, although it can also read tar and gzipped files. In the case of tar and gzip files, the PATH must contain the location of the tar and gzip utilities.

To load data into a Greenplum database or data warehouse, you first generate flat files from an operational or transactional database, using export, COPY, dump, or user-written software, depending on your business requirements. This process can be automated to run periodically.
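Before a generated flat file is dropped into the gpfdist directory, it can be worth checking that every row has the expected number of fields. The following is a minimal sketch of such a check, assuming a pipe-delimited file; the helper name check_fields and the sample rows are made up for illustration and are not part of Greenplum.

```shell
# check_fields is a hypothetical helper: succeed only if every row of a
# pipe-delimited file has exactly the expected number of fields.
check_fields() {
  file=$1
  expected=$2
  # awk stops at the first row whose field count differs and exits non-zero.
  awk -F'|' -v n="$expected" 'NF != n { bad = 1; exit } END { exit bad + 0 }' "$file"
}

# Sample data, made up for illustration.
sample=$(mktemp)
printf '1|web01|0.93\n2|web02|0.87\n' > "$sample"

if check_fields "$sample" 3; then
  echo "ok"
else
  echo "bad row found"
fi
```

A check like this could run as the last step of the periodic export job, so that malformed files are caught before a load is attempted.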

How to install gpfdist
gpfdist is installed on each DIA server by default. If it is not already installed, there are two ways you can install it:

1. Copy gpfdist from another source to the DIA server, and add its location to the PATH environment variable.
2. Download and install the Greenplum loaders package. Select the loaders package that matches your DCA database version.

For example, if your DCA Greenplum database is version 4.0.5.0, you will want to download and install the Greenplum loaders package Greenplum-loaders-4.0.5.0-build-8-RHEL5-x86_64.bin.

For EMC customers, the Greenplum loaders packages are available for download from the EMC Powerlink web site (go to https://emc.subscribnet.com and click the Greenplum link).
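To see whether gpfdist is already installed and reachable through PATH, a quick check along the following lines may help. This is only a sketch; find_tool is a hypothetical helper name, not something shipped with Greenplum.

```shell
# find_tool is a hypothetical helper: report whether a program is on PATH.
find_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not found in PATH"
    return 1
  fi
}

find_tool gpfdist || echo "install gpfdist or add its directory to PATH"
```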

Running gpfdist on a DIA server
gpfdist runs in a client-server model. The DIA server acts as the server, while the master server of the DCA is the client. You start the gpfdist process on a DIA server by specifying the directory where you drop your source files. Optionally, you may also specify the TCP port number to be used.

A simple startup of the gpfdist server is the following command syntax:

gpfdist -d <files_directory> -p <port_number> -l <log_file> &

For example,

# gpfdist -d /etl-data -p 8887 -l gpfdist_8887.log &
[1] 28519
# Serving HTTP on port 8887, directory /etl-data

In the above example, the gpfdist server is started to serve flat files stored in the directory /etl-data. It listens on port 8887 for data requests and writes its log to the file gpfdist_8887.log.

On each DIA server, you can run multiple instances of gpfdist, each with its own TCP port and directory.
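As a sketch of how several instances might be laid out, the helper below prints one gpfdist command per directory and port pair, using only the -d, -p, and -l options shown above. It is a dry run that prints the commands rather than starting anything; the helper name print_gpfdist_cmds and the directories and ports are made-up examples.

```shell
# print_gpfdist_cmds is a hypothetical helper: print (not run) one gpfdist
# command per <directory> <port> pair given as arguments.
print_gpfdist_cmds() {
  while [ "$#" -ge 2 ]; do
    dir=$1
    port=$2
    shift 2
    # One log file per instance, named after its port.
    echo "gpfdist -d $dir -p $port -l gpfdist_$port.log &"
  done
}

# Directories and ports below are assumptions for illustration.
print_gpfdist_cmds /etl-data/sales 8887 /etl-data/web 8888
```

Giving each instance its own port and log file keeps the loads independent and makes it easier to tell which instance a problem came from.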

Running gpfdist clients

To initiate the data extraction process, we use the DCA Master Server as the client. We connect to gpfdist through external tables, which can be created using the psql command.

For example:

# psql -d gpdb
gpdb=# create external table ext_load_performance (like performance_table)
gpdb-# location ('gpfdist://etl3:8887/performance_test.dat')
gpdb-# format 'text' (delimiter '|')
gpdb-# segment reject limit 20 rows;

In the above example, we create an external table. This table has all the columns of a table called ‘performance_table’ (using the like performance_table clause), and makes use of flat files stored on the host ‘etl3’, served on port 8887. The flat file name is expected to be ‘performance_test.dat’.

Combining this statement with the command used to start the gpfdist server, we expect the file to be in the directory ‘/etl-data’. Both client and server will communicate over TCP port 8887. Of course, this statement by itself does not start any data loading. It simply defines a connection between the client and the server.
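The relationship between the location URL on the client side and the file path on the server side can be sketched as follows. Both helper names, gpfdist_url and server_path, are hypothetical and only illustrate how the pieces fit together; the host, port, directory, and file name are the ones used in the examples above.

```shell
# gpfdist_url is a hypothetical helper: build the URL used in the
# external table's location clause.
gpfdist_url() {
  host=$1; port=$2; file=$3
  echo "gpfdist://$host:$port/$file"
}

# server_path is a hypothetical helper: the path where the gpfdist server
# is expected to find the file, given the directory passed to -d.
server_path() {
  dir=$1; file=$2
  # Strip one trailing slash from the directory before joining.
  echo "${dir%/}/$file"
}

gpfdist_url etl3 8887 performance_test.dat
server_path /etl-data performance_test.dat
```

The file name in the URL is relative to the directory given with -d, so the same external table definition keeps working if the server is restarted with a different directory that contains the same file.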

To start the data loading process, you issue a command that requests data from the external table – ext_load_performance. This will trigger the input data to be read into the external table. For example:

gpdb=# insert into performance_table select * from ext_load_performance;
INSERT 0 1093680

When this command is run, each segment server connects to gpfdist simultaneously. The gpfdist utility divides the flat file into chunks and distributes the work among the segment servers. Because of its shared-nothing architecture, the Greenplum database can run the load in parallel across the segments, achieving data loading performance of 2 TB/hour.
