Loading Data with COPY

posted Sep 13, 2012, 10:42 AM by Sachchida Ojha
COPY FROM copies data from a file (or standard input) into a table, appending the data to whatever is in the table already. If copying data from a file, the file must be accessible to the master host, and the file name must be specified from the viewpoint of the master host. When STDIN or STDOUT is specified, data is transmitted via the connection between the client and the master server. COPY is non-parallel, meaning that data is loaded in a single process via the Greenplum master instance.

To maximize the performance and throughput of COPY, consider running multiple COPY commands concurrently in separate sessions, dividing the data to be loaded evenly across them. As a rule of thumb, run one concurrent COPY operation per CPU.
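For example, assuming the source data has already been split into two files of roughly equal size (the file names here are illustrative), each session loads its own part:

-- session 1
=> COPY country FROM '/data/gpdb/country_data_part1';

-- session 2
=> COPY country FROM '/data/gpdb/country_data_part2';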

Running COPY in Single Row Error Isolation Mode

By default, COPY stops an operation at the first error, meaning that if the data being loaded contains an error, the entire operation fails and no data is loaded. Optionally, a COPY FROM command can be run in single row error isolation mode. In this mode, rows containing format errors – for example, extra or missing attributes, attributes of a wrong data type, or invalid client encoding sequences – can be skipped over while still loading all properly formatted rows. In this release, single row error isolation mode only applies to rows in the input file with format errors. Constraint errors, such as violation of a NOT NULL, CHECK, or UNIQUE constraint, are still handled in ‘all-or-nothing’ input mode.

Adding the SEGMENT REJECT LIMIT clause to a COPY FROM operation runs the command in single row error isolation mode. You can specify the number of error rows that are acceptable (on a per-segment basis), after which the entire COPY FROM operation is aborted and no rows are loaded. Note that the count of error rows is per Greenplum segment, not per load operation. If the per-segment reject limit is not reached, all rows that do not contain an error are loaded, and any error rows are discarded. If you would like to keep error rows for further examination, you can optionally declare an error table using the LOG ERRORS INTO clause. Any rows containing a format error are then logged to the specified error table. For example:

=> COPY country FROM '/data/gpdb/country_data'
   WITH DELIMITER '|' LOG ERRORS INTO err_country
   SEGMENT REJECT LIMIT 10 ROWS;
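Assuming an error table named err_country was declared with LOG ERRORS INTO (the table name here is illustrative), the rejected rows and the reason each was rejected can then be inspected with an ordinary query, for example:

=> SELECT linenum, errmsg, rawdata FROM err_country;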

Data Loading Performance Tips

Drop Indexes Before Loading – If you are loading a freshly created table, the fastest way is to create the table, load the data, then create any indexes needed for the table. Creating an index on pre-existing data is quicker than updating it incrementally as each row is loaded. If you are adding large amounts of data to an existing table, it may be faster to drop the index, load the table, and then recreate the index. Temporarily increasing the maintenance_work_mem server configuration parameter will help to speed up CREATE INDEX commands, although it will not help performance of the load itself. Dropping and recreating indexes should be done when there are no users on the system.
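A sketch of this drop-load-recreate pattern, using illustrative index and column names and an assumed memory setting:

=> DROP INDEX country_name_idx;
=> COPY country FROM '/data/gpdb/country_data';
=> SET maintenance_work_mem TO '512MB';  -- speeds up the index build only
=> CREATE INDEX country_name_idx ON country (name);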

Run ANALYZE After Loads – Whenever you have significantly altered the data in a table, running ANALYZE is strongly recommended. Running ANALYZE (or VACUUM ANALYZE) ensures that the query planner has up-to-date statistics about the table. With missing or obsolete statistics, the planner may make poor decisions during query planning, leading to poor performance on queries against that table.
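For example, to refresh statistics on the table loaded above:

=> ANALYZE country;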

Run VACUUM After Load Errors – If not running in single row error isolation mode, a load operation stops at the first encountered error. The target table will already have received earlier rows before the error occurred. These rows are not visible or accessible, but they still occupy disk space. If the failure happened well into a large load operation, this wasted space can be considerable. Invoking the VACUUM command recovers it.
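For example, to reclaim the space left behind by a failed load into the country table:

=> VACUUM country;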