Loading Data into Greenplum Database

posted Sep 12, 2012, 6:35 PM by Sachchida Ojha
how to load data into Greenplum Database using both parallel load operations (file-based external tables, web external tables, and gpload) and non-parallel load operations (COPY).

1. file-based external tables

When you create an external table definition, you must specify the format of your input files and the location of your external data sources.

There are three protocols that you can use to access external table data sources, but you cannot mix the protocols in a CREATE EXTERNAL TABLE statement. The protocols are as follows:

1. gpfdist
If using the gpfdist:// protocol, you must have the Greenplum file distribution program (gpfdist) running on the host where the external data files reside. This program points to a given directory on the file host and serves external data files to all Greenplum Database segments in parallel. If files are compressed using gzip or bzip2 (have a .gz or .bz2 file extension), gpfdist will uncompress the files automatically (provided that gunzip or bunzip2 is in your $PATH).

All primary segments access the external file(s) in parallel regardless of how many URIs you specify when defining the external table. You can use multiple gpfdist data sources in a CREATE EXTERNAL TABLE statement to scale the scan performance of the external table.

When specifying which files to get using gpfdist, you can use the wildcard character (*) or other C-style pattern matching to denote multiple files. The files specified are assumed to be relative to the directory from which gpfdist is serving files (the directory specified when you started the gpfdist program).

gpfdist is located in $GPHOME/bin on your Greenplum Database master host. See the gpfdist reference documentation for more information on using this file distribution program with external tables.

2. file
If using the file:// protocol the external data file(s) must reside on a segment host in a location accessible by the Greenplum superuser (gpadmin). The number of URIs specified corresponds to the number of segment instances that will work in parallel to access the external table. So for example, if you have a Greenplum Database system with 8 primary segments and you specify 2 external files, only 2 of the 8 segments will access the external table in parallel at query time. The number of external files per segment host cannot exceed the number of primary segment instances on that host. For example, if your array has 4 primary segment instances per segment host, you may place 4 external files on each segment host. Also, the host name used in the URI must match the segment host name as registered in the gp_configuration system catalog table.

3. gphdfs
This protocol specifies a file on a Hadoop Distributed File Sytem. There are several restrictions for HDFS files, as follows:

a) Wildcard characters are not permitted with gphdfs. Be sure to specify the file name.

b) Only one file can be specified for the external table when using the gphdfs protocol.

c) TEXT is the only format allowed for HDFS files.

When Greenplum links with HDFS files, all the data is read in parallel from the HDFS Data Nodes into the Greenplum Segments for rapid processing. Greenplum determines the connections between the segments and nodes

The HDFS file is split into equally-sized chunks. Each Greemplum segment reads one chunk. For writing, each Greenplum segment writes only the data contained on it.

The FORMAT clause is used to describe how the external table files are formatted. Valid file formats are delimited text (TEXT) for all protocols and comma separated values (CSV) format for gpdist and html protocols, similar to the formatting options available with the PostgreSQL COPY command. If the data in the file does not use the default column delimiter, escape character, null string and so on, you must specify the additional formatting options so that the data in the external file is read correctly by Greenplum Database.The gpfdist and gphdfs protocols require one-time setup.

Errors in External Table Data

By default, if external table data contains an error, the entire command fails and no data is loaded into the target database table. To isolate data errors in external table data while still loading correctly formatted rows, define the external table with single row error handling.

There is a system view named pg_max_external_files that you can use to determine how many external table files are permitted per external table. This view lists the available file slots per segment host (if using the file:// protocol). For example:

SELECT * FROM pg_max_external_files;

If using the gpfdist:// protocol, the following server configuration parameter can be used to set the maximum number of segment instances that will go to one gpfdist file distribution program to get external table data in parallel. The default is 64 segment instances.

gp_external_max_segs = <integer>

The gpfdist program serves files via the HTTP protocol. Queries of external tables that use a LIMIT clause will break off the HTTP connection after retrieving the rows causing an HTTP socket error. If using LIMIT in queries of external tables that use the gpfdist:// or http:// protocols, it is safe to ignore these HTTP socket errors from the gpfdist or web server – data is still returned to the database client as expected.

During dump/restore operations, only external and web external table definitions will be backed up and restored. The data sources are not included.