Loading Data from Greenplum Database

posted Sep 13, 2012, 10:38 AM by Sachchida Ojha
There are three ways to load data from readable external tables.

1. Use the gpload utility
2. Use the gphdfs protocol
3. Load with copy

Loading Data with gpload

gpload is a Greenplum utility that can be used to load data using readable external tables and the Greenplum parallel file server (gpfdist). It facilitates some of the maintenance tasks of setting up parallel file-based external tables by allowing users to configure their data format, external table definition and gpfdist setup into one easy-to-manage configuration file.

To use gpload

1. First, make sure your environment is set up to run gpload. You’ll need some dependent files from your Greenplum Database installation such as gpfdist and Python, as well as network access to the Greenplum segment hosts.

2. Create your load control file. This is a YAML-formatted file that specifies the Greenplum Database connection information, gpfdist configuration information, external table options, and data format.

For example:
---
VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
GPLOAD:
INPUT:
- SOURCE:
LOCAL_HOSTNAME:
- etl1-1
- etl1-2
- etl1-3
- etl1-4
PORT: 8081
FILE:
- /var/load/data/*
- COLUMNS:
- name: text
- amount: float4
- category: text
- desc: text
- date: date
- FORMAT: text
- DELIMITER: '|'
- ERROR_LIMIT: 25
- ERROR_TABLE: payables.err_expenses
OUTPUT:
- TABLE: payables.expenses
- MODE: INSERT
SQL:
- BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
- AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"

3.Run gpload passing in the load control file.

For example:


gpload -f my_load.yml

Comments