Unloading Data from Greenplum Database

posted Sep 13, 2012, 1:02 PM by Sachchida Ojha
A writable external table allows you to select rows from other database tables and output the rows to files, named pipes, or to other executable programs. Writable external tables can also be used as output targets for Greenplum parallel MapReduce calculations. There are two ways to define writable external tables, file-based and web-based tables.

We will learn here how unload data from Greenplum Database using both parallel unload (writable external tables) and non-parallel unload (COPY). This section contains the following topics:

1. Defining a File-Based Writable External Table
2. Defining a Command-Based Writable External Web Table
3. Unloading Data Using a Writable External Table
4. Unloading Data Using COPY

1. Defining a File-Based Writable External Table

Writable external tables that output data to files use the Greenplum parallel file server program (gpfdist) or the Hadoop file system interface (gphdfs), just as file-based readable external tables do.

Use the CREATE WRITABLE EXTERNAL TABLE command to define the external table and specify the location and format of the output files.

With a writable external table that uses the gpfdist protocol, the Greenplum segments send their output rows to the gpfdist process, which then writes out the data to the named file. You must have the Greenplum file distribution program (gpfdist) running on a host that is accessible over the network by the Greenplum segments. This program points to a given file location on the output host and writes output data received from the Greenplum segments to this file. You can list multiple gpfdist URIs in your writable external table definition if you want your output data to be divided across multiple files.

A writable external web table outputs data to an executable program. For example, you could unload data from Greenplum Database and send it to an executable that connects to another database or ETL tool and loads the data elsewhere. Web tables use the EXECUTE clause to specify a shell command, script, or executable program to run on the segment hosts. For writable external web tables, this command must be prepared to accept an input stream of data.

Unlike readable external tables, writable external tables also have the option to declare a distribution policy. By default, writable external tables use a random distribution policy. If the source table you are exporting data from has a hash distribution policy, defining the same distribution key column(s) for the writable external table will improve unload performance by eliminating the need to move rows over the interconnect. If you are unloading data from a particular table, you can use the LIKE clause to copy the column definitions and distribution policy from the source table.

Example 1—Greenplum file server (gpfdist)
=# CREATE WRITABLE EXTERNAL TABLE unload_expenses
( LIKE expenses )
LOCATION ('gpfdist://etlhost-1:8081/expenses1.out',
'gpfdist://etlhost-2:8081/expenses2.out')
FORMAT 'TEXT' (DELIMITER ',')
DISTRIBUTED BY (exp_id);

Example 2—Hadoop file server (gphdfs)
=# CREATE WRITABLE EXTERNAL TABLE unload_expenses
( LIKE expenses )
LOCATION ('gphdfs://hdfslhost-1:8081/path')
FORMAT 'TEXT' (DELIMITER ',')
DISTRIBUTED BY (exp_id);

There are two additional restrictions for writing file-based external tables with the gphdfs protocol, as follows:

1. You can specify only a directory for a writable external table with the gphdfs protocol. (You can specify only one file for a readable external table with the gphdfs protocol.)

2. TEXT is the only format allowed.

Note: The default port number is 9000.

2. Defining a Command-Based Writable External Web Table

Similar to readable web external tables that execute a command or program, writable external web tables can also be defined to send output rows to an executable program or script. The executable program used in the writable external table definition must be prepared to accept an input stream, reside in the same location on all of the Greenplum segment hosts, and be executable by the gpadmin user. The command specified in your writable external table definition is executed by all segments in the Greenplum system, whether or not a segment has output rows to process or not.

Use the CREATE WRITABLE EXTERNAL WEB TABLE command to define the external table and specify the executable command or program to run on the segment hosts. If you use environment variables in the executable command (such as $PATH), keep in mind that the command is executed from within the database and not from a login shell. Therefore the .bashrc or .profile of the current user will not be sourced. However, you can set desired environment variables from within the EXECUTE clause of your writable external table definition, for example:

=# CREATE WRITABLE EXTERNAL WEB TABLE output (output text)
EXECUTE 'export PATH=$PATH:/home/gpadmin/programs;
myprogram.sh'
FORMAT 'TEXT'
DISTRIBUTED RANDOMLY;

Disabling EXECUTE for Web or Writable External Tables

External tables that execute OS commands or scripts have a certain security risk associated with them. Some database administrators may decide that they do not want their Greenplum Database systems exposed to this functionality. If this is the case, you can disable the use of EXECUTE in web and writable external table definitions by setting the following server configuration parameters in your master postgresql.conf file:

gp_external_enable_exec = off

Using Environment Variables in EXECUTE commands

If you use environment variables in external table EXECUTE commands (such as $PATH), keep in mind that the command is executed from within the database and not from a login shell. Therefore the .bashrc or .profile of the current user is not sourced. However, you can set desired environment variables from within the EXECUTE clause of your external table definition, for example:

CREATE EXTERNAL WEB TABLE test (column1 text) EXECUTE 'MAKETEXT=text-text-text; export MAKETEXT; echo $MAKETEXT' FORMAT 'TEXT';
SELECT * FROM test;
column1
----------------
text-text-text
text-text-text
(2 rows)

The following additional Greenplum Database variables are also available for use in OS commands executed by a web or writable external table. These variables are set as environment variables in the shell that executes the command(s). They can be used to identify a set of requests made by an external table statement across the Greenplum Database array of hosts and segment instances

Variable                                   Description
$GP_CID                                 Command count of the session executing the external table statement.
$GP_DATABASE                     The database that the external table definition resides in.
$GP_DATE                             The date the external table command was executed.
$GP_MASTER_HOST             The host name of the Greenplum master host from which the external table statement was dispatched.
$GP_MASTER_PORT             The port number of the Greenplum master instance from which the external table statement was dispatched.
$GP_SEG_DATADIR                 The location of the data directory of the segment instance executing the external table command.
$GP_SEG_PG_CONF             The location of the postgresql.conf file of the segment instance executing the external table command.
$GP_SEG_PORT                     The port number of the segment instance executing the external table command.
$GP_SEGMENT_COUNT         The total number of primary segment instances in the Greenplum Database system.
$GP_SEGMENT_ID                 The ID number of the segment instance executing the external table command (same as dbid in gp_segment_configuration).
$GP_SESSION_ID                    The database session identifier number associated with the external table statement.
$GP_SN                                 Serial number of the external table scan node in the query plan of the external table statement.
$GP_TIME                                The time the external table command was executed.
$GP_USER                             The database user executing the external table statement.
$GP_XID                                 The transaction ID of the external table statement.

3. Unloading Data Using a Writable External Table

Because writable external tables only allow INSERT operations, users who are not the table owner or a superuser must be granted INSERT permissions on the table in order to use it. For example:

GRANT INSERT ON writable_ext_table TO admin;

To unload data using a writable external table, select the data from the source table(s) and insert it into the writable external table. The rows produced by the SELECT statement will be output to the writable external table. For example:

INSERT INTO writable_ext_table SELECT * FROM regular_table;

4. Unloading Data Using COPY

COPY TO copies data from a table to a file (or standard input) on the Greenplum master host. COPY is non-parallel, meaning that data is unloaded in a single process via the Greenplum master instance. You can use COPY to output a table’s entire contents, or filter the output using a SELECT statement. For example:

COPY (SELECT * FROM country WHERE country_name LIKE 'A%') TO '/home/gpadmin/a_list_countries.out';



Readable External Tables and Query Planner Statistics

Because the data sources for readable external tables are outside of the database, statistics needed by the query planner are not collected when you run the ANALYZE command. You can set some rough statistics for a readable external table by manually editing the system catalog table pg_class and specifying the number of rows and database pages (calculated as data_size / 32K). By default, pg_class.reltuples (number of rows) is set to 1000000 and pg_class.relpages (number of pages) is set to 1000 for all external tables when they are first defined. To change these defaults, you can update these values for your external table in pg_class (as the database superuser). For example:

UPDATE pg_class SET reltuples=500000, relpages=150 WHERE relname='my_ext_table';

Comments