Loading Data Using an External Table and Handling Load Errors

posted Sep 13, 2012, 9:15 AM by Sachchida Ojha
Loading Data Using an External Table

Once you have defined the external table, and have placed the necessary data files in the correct location (and have started the Greenplum files server(s) if using the gpfdist protocol), you can then select from the external table as you would an ordinary database table. For example, if you wanted to load a portion of the external table data into a database table, you could do something like:

=# INSERT INTO expenses_travel
SELECT * from ext_expenses where category='travel';

Or if you wanted to quickly load all data into a new database table:

=# CREATE TABLE expenses AS SELECT * from ext_expenses;

Handling Load Errors

The most common use of readable external tables is selecting data from them to load into regular database tables. This is typically done by issuing a CREATE TABLE AS SELECT or INSERT INTO SELECT command, where the SELECT statement queries external table data. By default, if the external table data contains an error, the entire command fails and no data is loaded into the target database table. To isolate data errors in external table data while still loading correctly formatted rows, you can define a readable external table with a SEGMENT REJECT LIMIT clause in the CREATE EXTERNAL TABLE command.

1. The reject limit count can be specified as number of ROWS (the default) or a PERCENT of total rows (1-100).

2. To keep error rows for further examination, declare an error table using the LOG ERRORS INTO clause.

Defining an External Table with Single Row Error Isolation

You can specify the number of error rows acceptable (on a per-segment basis), after which the entire external table operation will be aborted and no rows will be processed or loaded. Note that the count of error rows is per-segment, not per entire operation. If the per-segment reject limit is not reached, then all rows not containing an error will be processed. If the limit is not reached, all good rows will be processed and any error rows discarded. The following example uses a LOG ERRORS INTO clause, so any rows containing a format error are logged to the error table file err_expenses.

For example:

=# CREATE EXTERNAL TABLE ext_expenses ( name text,
date date, amount float4, category text, desc text )
LOCATION ('gpfdist://etlhost-1:8081/*',
'gpfdist://etlhost-2:8082/*')
FORMAT 'TEXT' (DELIMITER '|')
LOG ERRORS INTO err_expenses SEGMENT REJECT LIMIT 10 ROWS;

When SEGMENT REJECT LIMIT is used, then the external data will be scanned in single row error isolation mode. This can be helpful in isolating errors when loading data from and external table using CREATE TABLE AS SELECT or INSERT INTO SELECT. Single row error isolation mode only applies to external data rows with format errors – for example, extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences. Constraint errors are not checked, however it is easy to filter out constraint errors when using external tables by limiting the SELECT from an external table at runtime. For example, to eliminate duplicate key errors:

=# INSERT INTO table_with_pkeys
SELECT DISTINCT * FROM external_table;

Viewing Bad Rows in the Error Table

If you are using the single row error isolation feature any rows with formatting errors are logged into an error table.

You can use SQL commands to query the error table and view the rows that were not loaded. For example:

=# SELECT * from err_expenses;

Identifying Invalid CSV Files in Error Table Data

If a comma separated values (CSV) file has invalid formatting, the rawdata field in the error table may contain several combined rows. For example, if a closing quote for a specific field is missing, all the following newlines are treated as embedded newlines. When this happens, Greenplum stops parsing a row when it reaches 64K, puts that 64K of data into the error table as a single row, resets the quote flag, and continues. If this happens three times during load processing, the load file is determined to be invalid and the entire load fails with the message “rejected N or more rows”.

Moving Data between Tables

You can use CREATE TABLE AS or INSERT...SELECT to load external and web external table data into another (non-external) database table, and the data will be loaded in parallel according to the external or web external table definition.

If an external table file or web external table data source has an error, any operation that reads from that table fails. Similar to COPY, loading from external and web external tables is an all or nothing operation.


Comments