What is Disk Spill and how to view information about all the queries that are currently using disk spill space

posted Apr 28, 2017, 12:18 PM by Sachchida Ojha
What is Disk Spill
Running a SQL statement usually requires the database to allocate some working memory to the process(es) that execute it. This memory is especially important for steps that sort data or build transient in-memory hash tables (for joins or aggregations). When there are many active queries (or a few very large ones) that require working memory, each gets a smaller share (or else the system will start swapping). So a query that could typically join or sort in a single memory pass may start spilling temporary results to disk under high concurrency, dramatically affecting its run time and resource usage (extra I/O to write and later read the temporary data). In other words, another effect of “too much” concurrency is that it can make each query work a lot harder, reducing overall throughput.
Greenplum Database creates work files on disk if it does not have sufficient memory to execute the query in memory. This information can be used for troubleshooting and tuning queries.

If you have very large queries that need more memory, you can change the memory policy to use more memory rather than spilling to disk. You can see that a query wanted more memory by looking at its EXPLAIN ANALYZE output.

For example:

EXPLAIN ANALYZE SELECT * FROM .....;
Work_mem used: 23430K bytes avg, 23430K bytes max (seg0).
Work_mem wanted: 33649K bytes avg, 33649K bytes max (seg0) to lessen workfile I/O affecting 2 workers.
==================================================================================================================
EXPLAIN displays the query plan that the Greenplum planner generates for the supplied statement. Query plans are a tree of plan nodes. Each node in the plan represents a single operation, such as a table scan, join, aggregation, or sort. Plans should be read from the bottom up, as each node feeds rows into the node directly above it. The bottom nodes of a plan are usually table scan operations (sequential, index, or bitmap index scans). If the query requires joins, aggregations, sorts, or other operations on the raw rows, there will be additional nodes above the scan nodes to perform these operations. The topmost plan nodes are usually the Greenplum Database motion nodes (redistribute, explicit redistribute, broadcast, or gather motions). These are the operations responsible for moving rows between the segment instances during query processing.

The output of EXPLAIN has one line for each node in the plan tree, showing the basic node type plus the following cost estimates that the planner made for the execution of that plan node:

1. cost - measured in units of disk page fetches; that is, 1.0 equals one sequential disk page read. The first estimate is the start-up cost (cost of getting to the first row) and the second is the total cost (cost of getting all rows). Note that the total cost assumes that all rows will be retrieved, which may not always be the case (if using LIMIT for example).

2. rows - the total number of rows output by this plan node. This is usually less than the actual number of rows processed or scanned by the plan node, reflecting the estimated selectivity of any WHERE clause conditions. Ideally the top-level node's estimate will approximate the number of rows actually returned, updated, or deleted by the query.

3. width - total bytes of all the rows output by this plan node. 
It is important to note that the cost of an upper-level node includes the cost of all its child nodes. The topmost node of the plan has the estimated total execution cost for the plan; this is the number that the planner seeks to minimize. It is also important to realize that the cost only reflects things the query planner cares about. In particular, it does not consider the time spent transmitting result rows to the client.
EXPLAIN ANALYZE causes the statement to be actually executed, not only planned. The EXPLAIN ANALYZE plan shows the actual results along with the planner’s estimates. This is useful for seeing whether the planner’s estimates are close to reality. In addition to the information shown in the EXPLAIN plan, EXPLAIN ANALYZE will show the following additional information:

• The total elapsed time (in milliseconds) that it took to run the query.
• The number of workers (segments) involved in a plan node operation. Only segments that return rows are counted.
• The maximum number of rows returned by the segment that produced the most rows for an operation. If multiple segments produce an equal number of rows, the one with the longest time to end is chosen.
• The segment id number of the segment that produced the most rows for an operation.
• For relevant operations, the work_mem used by the operation. If work_mem was not sufficient to perform the operation in memory, the plan shows how much data was spilled to disk and how many passes over the data were required for the lowest performing segment. For example:
Work_mem used: 64K bytes avg, 64K bytes max (seg0).
Work_mem wanted: 90K bytes avg, 90K bytes max (seg0) to abate workfile I/O affecting 2 workers.
[seg0] pass 0: 488 groups made from 488 rows; 263 rows written to workfile
[seg0] pass 1: 263 groups made from 263 rows
• The time (in milliseconds) it took to retrieve the first row from the segment that produced the most rows, and the total time taken to retrieve all rows from that segment. The <time> to first row may be omitted if it is the same as the <time> to end.
Very Important: Keep in mind that the statement is actually executed when EXPLAIN ANALYZE is used. Although EXPLAIN ANALYZE will discard any output that a SELECT would return, other side effects of the statement will happen as usual. If you wish to use EXPLAIN ANALYZE on a DML statement without letting the command affect your data, use this approach:
BEGIN;
EXPLAIN ANALYZE ...;
ROLLBACK;
====================================================================================================================
Note that the bytes wanted message from EXPLAIN ANALYZE is only a hint, based on the amount of data written to work files and is not exact. The minimum work_mem needed could be more or less than the suggested value.

The output will show the plan used but a key item to look for is “Work_mem wanted”. When you see this, it means that Greenplum had to spill to disk because there wasn’t enough memory available. The best approach is likely to rewrite the query. Alternatively, you can increase the amount of memory available.

Setting gp_resqueue_memory_policy to “auto” allows you to increase or decrease the amount of memory a query will use by changing the “statement_mem” value. The maximum value you can set for statement_mem is determined by “max_statement_mem”. The default max_statement_mem is 2000MB.

On the Master, execute the following to increase max_statement_mem:

gpconfig -c max_statement_mem -v 8GB
gpstop -u

Now, you can change the memory settings in your session. You can also do this with gpconfig to make the settings apply to all sessions.

set gp_resqueue_memory_policy = auto;
set statement_mem = '4GB';

Re-run your query and see if it executes faster and if it still has “bytes wanted” in the query plan.
Compressed Work Files
If you know you are spilling to disk when executing queries because EXPLAIN ANALYZE showed that more bytes were wanted than available, you can trade CPU for I/O by compressing the work files. This is done with “gp_workfile_compress_algorithm”. The default value is “none” but you can change it to “zlib”. This can be done at the session level or with gpconfig to make it system wide.
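As a sketch, enabling spill-file compression for just the current session looks like this (assuming zlib support is available on your segments):

```sql
-- Check the current setting (the default is 'none')
SHOW gp_workfile_compress_algorithm;

-- Compress spill files with zlib for this session only
SET gp_workfile_compress_algorithm = zlib;
```

To make the change cluster-wide instead, you would run gpconfig -c gp_workfile_compress_algorithm -v zlib on the Master followed by gpstop -u to reload the configuration.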

Temporary Tables
Another way to deal with very large queries that spill to disk is to use temporary tables that are compressed. This is ideal when you use a subquery that is then joined to other tables. If you know it is spilling to disk (again, from EXPLAIN ANALYZE showing more bytes wanted than available), you can populate a compressed temporary table instead. For example:

CREATE TEMPORARY TABLE foo (myid int, bar text) WITH (APPENDONLY=true, COMPRESSTYPE=quicklz) ON COMMIT DROP DISTRIBUTED BY (myid);

The gp_workfile_* views show information about all the queries that are currently using disk spill space. The information in the views can also be used to specify the values for the Greenplum Database configuration parameters gp_workfile_limit_per_query and gp_workfile_limit_per_segment. 

Greenplum Database configuration parameters 

Parameter: gp_workfile_limit_per_query
Value Range: kilobytes
Default: 0
Description: Sets the maximum disk size an individual query is allowed to use for creating temporary spill files at each segment. The default value is 0, which means a limit is not enforced.
Classification: master, session, reload

Parameter: gp_workfile_limit_per_segment
Value Range: kilobytes
Default: 0
Description: Sets the maximum total disk size that all running queries are allowed to use for creating temporary spill files at each segment. The default value is 0, which means a limit is not enforced.
Classification: local, system, restart

Parameter: gp_workfile_checksumming
Value Range: Boolean
Default: on
Description: Adds a checksum value to each block of a work file (or spill file) used by HashAgg and HashJoin query operators. This adds an additional safeguard against faulty OS disk drivers writing corrupted blocks to disk. When a checksum operation fails, the query cancels and rolls back rather than potentially writing bad data to disk.
Classification: master, session, reload

Parameter: gp_workfile_compress_algorithm
Value Range: none, zlib
Default: none
Description: Specifies the compression algorithm to use on spill files when a hash aggregation or hash join operation spills to disk during query processing. If using zlib, it must be in your $PATH on all segments.
Classification: master, session, reload
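As a sketch, the two workfile limits can be set with gpconfig (values are in kilobytes; the sizes used here are purely illustrative, not recommendations):

```shell
# Cap each query at roughly 10 GB of spill space per segment
# (master/session/reload classification: a reload is enough)
gpconfig -c gp_workfile_limit_per_query -v 10485760
gpstop -u

# Cap total spill space per segment at roughly 100 GB
# (local/system/restart classification: a restart is required)
gpconfig -c gp_workfile_limit_per_segment -v 104857600
gpstop -r
```

Queries that exceed gp_workfile_limit_per_query are cancelled rather than being allowed to fill the segment's disk.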

Workfile Disk Spill Space Information

In Greenplum Database 4.3, the gp_workfile_* views in the gp_toolkit administrative schema show information about all the queries that are currently using disk spill space. Previously, in 4.2.x.x releases, you created the views by running SQL scripts.
Let's look at each of these views in detail.

1. gp_workfile_entries
2. gp_workfile_usage_per_query
3. gp_workfile_usage_per_segment
1. gp_workfile_entries
This view contains one row for each operator using disk space for work files on a segment at the current time. The view is accessible to all users; however, non-superusers can only see information for the databases that they have permission to access.

Column description of this view

command_cnt -> Command ID of the query.
content -> The content identifier for a segment instance.
current_query -> Current query that the process is running.
datname -> Greenplum database name.
directory -> Path to the work file.
optype -> The query operator type that created the work file.
procpid -> Process ID of the server process.
sess_id -> Session ID.
size -> The size of the work file in bytes.
numfiles -> The number of files created.
slice -> The query plan slice. The portion of the query plan that is being executed.
state -> The state of the query that created the work file.
usename -> Role name.
workmem -> The amount of memory allocated to the operator in KB.
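A simple way to use this view is to list the operators currently spilling, largest first (a sketch using the columns described above; run as a superuser to see all sessions):

```sql
SELECT datname, sess_id, procpid, usename, optype,
       size, numfiles, state
FROM   gp_toolkit.gp_workfile_entries
ORDER  BY size DESC;
```

Operators at the top of this list are the first candidates for query rewriting or a larger statement_mem.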


2. gp_workfile_usage_per_query
This view contains one row for each query using disk space for work files on a segment at the current time. The view is accessible to all users; however, non-superusers can only see information for the databases that they have permission to access.

Column description of this view

command_cnt -> Command ID of the query.
content -> The content identifier for a segment instance.
current_query -> Current query that the process is running.
datname -> Greenplum database name.
procpid -> Process ID of the server process.
sess_id -> Session ID.
size -> The size of the work file in bytes.
numfiles -> The number of files created.
state -> The state of the query that created the work file.
usename -> Role name.


3. gp_workfile_usage_per_segment
This view contains one row for each segment. Each row displays the total amount of disk space used for work files on the segment at the current time. The view is accessible to all users; however, non-superusers can only see information for the databases that they have permission to access.

Column description of this view
content -> The content identifier for a segment instance.
size -> The total size of the work files on a segment.
numfiles -> The number of files created.
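To spot segments running low on disk, you might total up current spill usage per segment, for example (a sketch; pg_size_pretty assumes the size column is in bytes):

```sql
SELECT content,
       pg_size_pretty(size) AS spill_size,
       numfiles
FROM   gp_toolkit.gp_workfile_usage_per_segment
ORDER  BY size DESC;
```

A heavy skew toward one content id can also indicate a poorly chosen distribution key concentrating work on a single segment.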

Oracle vs Greenplum 

Oracle introduced statement queuing in version 11g Release 2 (and later enhanced it in 11.2.0.2). However, in their case, it is bundled with a bunch of other new parallelism features (automatic DOP and in-memory parallel execution), so it is unfortunately more complex than necessary. In Oracle, the system-wide number of parallel process slaves is fixed and the engine tries to automatically find the optimal per-SQL parallelism based on the current system load before each execution. The DBA controls various parameters (globally and per resource group) to try to tame the beast.
Greenplum Database uses a different model. The degree of per-SQL parallelism is fixed. The administrator simply chooses how many active SQLs are allowed per resource queue (group); if more SQLs are submitted to a queue, they wait until a slot is available. The administrator can also specify a minimal cost threshold (per resource queue) to allow quick queries to bypass the queuing mechanism (and, of course, prioritize between queues). So, to sum it up, “too much” concurrency does hurt database performance.

Luckily, it can be handled by proper setup in many modern databases – using statement prioritization and statement queuing.

Memory is fairly easy to manage in Greenplum because the database has been designed to leverage OS caching. The default Eager Free memory policy works very well for most queries. However, if you see queries that still need more memory than is available, you can set the memory policy to auto and increase statement_mem. If you are still spilling to disk because your statement needs more memory, you can have Greenplum compress the work files or use compressed temporary tables.