Creating and Using Web External Tables

posted Sep 13, 2012, 9:09 AM by Sachchida Ojha
CREATE EXTERNAL WEB TABLE creates a web table definition in Greenplum Database. A web table is a type of external table that allows you to access dynamic data sources as though they were a regular database table. Web table data is considered dynamic (meaning the data could possibly change midstream during the execution of a query). Therefore, the query planner must choose plans that do not allow for rescanning of the web table data.

There are two forms of defining a web external table. You can use one of the following forms per CREATE EXTERNAL WEB TABLE statement, but you cannot mix the two forms.

1. Web URLs. Specify the LOCATION of files on a web server using the http:// protocol. The web data file(s) must reside on a web server that is accessible by the Greenplum segment hosts. The number of URLs specified corresponds to the number of segment instances that will work in parallel to access the web table. So for example, if you have a Greenplum Database system with 8 primary segments and you specify 2 external files, only 2 of the 8 segments will access the web table in parallel at query runtime.

2. OS Command. Specify a shell command or script to EXECUTE on one or more segments, the output of which will comprise the web table’s data at the time of access. A web table defined with an EXECUTE clause will execute the given OS shell command or script on the specified segment host or hosts. By default, the command is executed by all active segment instances on all segment hosts. For example, if each segment host has four primary segment instances running, the command will be executed four times per segment host. You can optionally limit the number of segment instances that execute the web table command.

Web table data is comprised of the output of the command at the time the web table statement is executed. All segment instances included in the web table definition (as specified by the ON clause) execute the command in parallel.

Defining Command-Based Web External Tables

The command or program specified in your external table definition must reside on all of the Greenplum segment hosts. If you use environment variables in external web table commands (such as $PATH), keep in mind that the command is executed from within the database and not from a login shell. Therefore the .bashrc or .profile of the current user will not be sourced. However, you can set desired environment variables from within the EXECUTE clause of your external web table definition, for example:

=# CREATE EXTERNAL WEB TABLE output (output text)
EXECUTE 'PATH=/home/gpadmin/programs; export PATH; myprogram.sh'
FORMAT 'TEXT';

Also, any scripts that you want to execute must be present in the same location on the segment hosts and be executable by the gpadmin user.
For example, here is a command that defines a web table that executes a script once per segment host:

=# CREATE EXTERNAL WEB TABLE log_output (linenum int, message text)
EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST
FORMAT 'TEXT' (DELIMITER '|');

Defining URL-Based Web External Tables

A URL-based web table is similar to a file-based external table except that the data is accessed from a web server via the HTTP protocol. Also, web table data is assumed to be dynamic, so unlike file-based external tables, the data is not rescannable.

For example, here is a command that defines a web table that gets data from several different URLs:

=# CREATE EXTERNAL WEB TABLE ext_expenses (name text,
date date, amount float4, category text, description text)
LOCATION (
'http://intranet.company.com/expenses/sales/file.csv',
'http://intranet.company.com/expenses/exec/file.csv',
'http://intranet.company.com/expenses/finance/file.csv',
'http://intranet.company.com/expenses/ops/file.csv',
'http://intranet.company.com/expenses/marketing/file.csv',
'http://intranet.company.com/expenses/eng/file.csv',
)
FORMAT 'CSV' ( HEADER );
Comments