When approaching a performance tuning initiative, it is important to know your system's expected level of performance and to define measurable performance requirements. Without setting an acceptable threshold for database performance, you will end up chasing a carrot that is always just out of reach. Consider the following when setting performance goals:

1. Baseline Hardware Performance
2. Performance Benchmarks

1. Baseline Hardware Performance

The majority of database performance problems are caused not by the database itself, but by the underlying systems on which the database runs. I/O bottlenecks, memory problems, and network problems can all significantly degrade database performance. It is therefore important to know the baseline capabilities of your hardware and operating system (OS). This helps you identify and troubleshoot hardware-related problems before undertaking database-level or query-level tuning initiatives.

2. Performance Benchmarks

To maintain good performance, or to improve on existing performance issues, you need to know the capabilities of your DBMS on a defined workload. A benchmark is a predefined workload that produces a known result set, which can then be used for comparison. Periodically running the same benchmark tests can help identify system-related performance degradation over time. Benchmarks can also be compared against other workloads to identify queries or applications in need of optimization. Several third-party organizations provide benchmark tools for the database industry; one of them is the Transaction Processing Performance Council (TPC). TPC-H is the ad hoc decision support benchmark. It models decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and provide answers to critical business questions.

Understanding Greenplum Database Performance Factors

Several key factors influence database performance:

1. System Resources
2. Workload
3. Throughput
4. Contention
5. Optimization

Understanding these factors helps you identify performance opportunities and avoid problems.

System Resources: Database performance relies heavily on disk I/O and memory usage. To set accurate performance expectations, you need to know the baseline performance of the hardware on which your DBMS is deployed. The performance of hardware components such as CPUs, hard disks, disk controllers, RAM, and network interfaces significantly affects how fast your database performs.

Workload: The total workload is the combination of user queries, applications, batch jobs, transactions, and system commands directed through the DBMS at any given time. The workload equals the total demand on the DBMS, and it varies over time; for example, it can increase when month-end reports are run, or decrease on weekends when most users are out of the office. Workload strongly influences database performance. Knowing your workload and peak demand times helps you plan for the most efficient use of your system resources and enables processing the largest possible workload.

Throughput: A system's throughput defines its overall capability to process data. DBMS throughput is measured in queries per second, transactions per second, or average response times. DBMS throughput is closely related to the processing capacity of the underlying systems (disk I/O, CPU speed, memory bandwidth, and so on), so it is important to know the throughput capacity of your hardware when setting DBMS throughput goals.

Contention: Contention is the condition in which two or more components of the workload attempt to use the system in a conflicting way, for example multiple queries that try to update the same piece of data at the same time, or multiple large workloads that compete for system resources. As contention increases, throughput decreases.

Optimization: DBMS optimizations can affect overall system performance. SQL formulation, database configuration parameters, table design, data distribution, and so on enable the database query planner and optimizer to create the most efficient access plans.
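For example, one quick way to see whether the planner and optimizer are producing an efficient access plan is to inspect the plan with EXPLAIN, and then run EXPLAIN ANALYZE to compare the planner's row estimates with actual execution. This is only a sketch; the sales_data table and region column are illustrative names, not objects that necessarily exist in your database:

    -- Show the estimated access plan without executing the query
    EXPLAIN
    SELECT region, count(*)
    FROM sales_data
    GROUP BY region;

    -- Execute the query and compare estimated row counts with actual results and timing
    EXPLAIN ANALYZE
    SELECT region, count(*)
    FROM sales_data
    GROUP BY region;

Large gaps between estimated and actual row counts usually point to missing or stale statistics on the tables involved.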
The following issues are among the most common causes of performance problems in Greenplum Database.

Two configuration parameters control the amount of data sampled for statistics collection:

1. default_statistics_target
2. gp_analyze_relative_error

These parameters control statistics sampling at the system level. It is usually better to increase the amount of statistics sampled only for the columns used most frequently in query predicates. You can adjust statistics for a particular column using the ALTER TABLE ... SET STATISTICS command. For example:

    ALTER TABLE sales_data ALTER COLUMN region SET STATISTICS 60;

This is equivalent to increasing default_statistics_target for that particular column. Subsequent ANALYZE operations then gather more statistics data for the column and should, as a result, produce better query plans.
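As a minimal follow-up sketch, assuming the sales_data table and region column from the example above, you can re-collect statistics for just that column and then check the pg_stats view to confirm that more detailed statistics were gathered:

    -- Re-collect statistics for only the adjusted column
    ANALYZE sales_data (region);

    -- Inspect the collected statistics; larger most_common_vals and histogram_bounds
    -- arrays suggest the higher per-column statistics target has taken effect
    SELECT attname, n_distinct, most_common_vals, histogram_bounds
    FROM pg_stats
    WHERE tablename = 'sales_data' AND attname = 'region';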
Monitoring Greenplum Database System Resource Utilization

You can use system monitoring utilities such as ps, top, iostat, vmstat, and netstat to monitor database activity on the hosts in your Greenplum Database array. These tools can help identify the Greenplum Database processes (postgres processes) currently running on the system and the most resource-intensive tasks with regard to CPU, memory, disk I/O, or network activity. Look at these system statistics to identify queries that degrade database performance by overloading the system and consuming excessive resources. The Greenplum Database management tool gpssh allows you to run these system monitoring commands on several hosts simultaneously.

You can also use Greenplum Command Center to monitor system resource utilization. Greenplum Command Center collects query and system utilization metrics; see the Greenplum Command Center Administrator Guide for the procedures to enable it.

Greenplum Database creates work files on disk if it does not have sufficient memory to execute a query entirely in memory. This information can be used for troubleshooting and tuning queries. The gp_workfile_* views show information about all the queries that are currently using disk spill space. The information in these views can also be used to choose values for the Greenplum Database configuration parameters gp_workfile_limit_per_query and gp_workfile_limit_per_segment, as sketched after the view descriptions below. Let's look at each of these views:

1. gp_workfile_entries
2. gp_workfile_usage_per_query
3. gp_workfile_usage_per_segment

1. gp_workfile_entries

This view contains one row for each operator using disk space for workfiles on a segment at the current time. The view is accessible to all users; however, non-superusers can see information only for the databases that they have permission to access.

2. gp_workfile_usage_per_query

This view contains one row for each query using disk space for workfiles on a segment at the current time. The view is accessible to all users; however, non-superusers can see information only for the databases that they have permission to access.

3. gp_workfile_usage_per_segment

This view contains one row for each segment. Each row displays the total amount of disk space used for workfiles on the segment at the current time. The view is accessible to all users; however, non-superusers can see information only for the databases that they have permission to access.
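As a sketch of how these views can be queried (they live in the gp_toolkit schema; exact column names vary somewhat between Greenplum versions, so treat the column list as an assumption to verify):

    -- Total spill space currently used on each segment; useful when choosing
    -- a value for gp_workfile_limit_per_segment
    SELECT segid, size, numfiles
    FROM gp_toolkit.gp_workfile_usage_per_segment
    ORDER BY size DESC;

    -- Queries currently spilling to disk; useful when choosing a value
    -- for gp_workfile_limit_per_query
    SELECT *
    FROM gp_toolkit.gp_workfile_usage_per_query
    ORDER BY size DESC;

If one query consistently accounts for most of the spill space, tuning that query, or lowering gp_workfile_limit_per_query, can prevent it from exhausting segment disk space.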