Greenplum Query Processing

Users issue queries to Greenplum Database just as they would to any database management system (DBMS). They connect to the database instance on the Greenplum master host using a client application (such as psql) and submit an SQL statement.

Understanding Query Planning and Dispatch

The query is received through the master, which parses the query, optimizes the query, and creates either a parallel or targeted query plan (depending on the query). The master then dispatches the plan to the segments for execution. Each segment is then responsible for executing local database operations on its own particular set of data.

Most database operations—such as table scans, joins, aggregations, and sorts—execute in parallel across the segments simultaneously. Each operation is performed on a segment database independent of the data associated with the other segment databases.

Certain queries may only access data on a single segment, such as single-row INSERT, UPDATE, DELETE or SELECT operations, or queries that return a small set of rows and filter on the table distribution key column(s). In queries such as these, the query plan is not dispatched to all segments, but is targeted to the segment that contains the affected row(s).

Understanding Greenplum Query Plans

A query plan is the set of operations that Greenplum Database will perform to produce the answer to a given query. Each node or step in the plan represents a database operation such as a table scan, join, aggregation or sort. Plans are read and executed from bottom to top.

In addition to the typical database operations (tables scans, joins, etc.), Greenplum Database has an additional operation type called a motion. A motion operation involves moving tuples between the segments during query processing. Note that not every query requires a motion. For example, a query of the system catalog tables on the master does not require data to move across the interconnect.

In order to achieve maximum parallelism during query execution, Greenplum divides the work of the query plan into slices. A slice is a portion of the plan that can be worked on independently at the segment-level. A query plan is sliced wherever a motion operation occurs in the plan, one slice on each side of the motion.
For example, consider the following simple query involving a join between two tables:

SELECT customer, amount FROM sales JOIN customer USING (cust_id)
WHERE dateCol = '04-30-2008';

Figure A

Figure A shows the query plan. Note that each segment gets a copy of the query plan and works on it in parallel. For this particular plan, there is a redistribute motion that moves tuples between the segments in order to complete the join. The plan is sliced on either side of the redistribute motion (slice 1 and slice 2). This query plan also has another type of motion operation called a gather motion. A gather motion is when the segments send results back up to the master for presentation to the client. Since a query plan is always sliced wherever a motion occurs, this plan also has an implicit slice at the very top of the plan (slice 3). Not all query plans involve a gather motion. For example, a CREATE TABLE x AS SELECT... statement would not have a gather motion (tuples are sent to the newly created table, not to the master).

Understanding Parallel Query Execution

Greenplum creates a number of database processes to handle the work of a query. On the master, the query worker process is called the query dispatcher (QD). The QD is responsible for creating and dispatching the query plan, and for accumulating and presenting the final results. On the segments, a query worker process is called a query executor (QE). A QE is responsible for completing its portion of work and communicating its intermediate results to the other worker processes.

For each slice of the query plan there is at least one worker process assigned. A worker process works on its assigned portion of the query plan independently. During query execution, each segment will have a number of processes working on the query in parallel.
Related processes that are working on the same portion of the query plan are referred to as gangs. As a portion of work is completed, tuples flow up the query plan from one gang of processes to the next. This inter-process communication between the segments is what is referred to as the interconnect component of Greenplum Database.

Figure B shows the query worker processes on the master and two segment instances for the query plan illustrated in Figure A.

Figure B