MPP Data Model

MPP DATA MODEL - CONCEPTS
Before we design the MPP data model for MPP data warehouse system , we must understand what is MPP, how it works and for what it is built for. MPP stands for massively parallel processing. 

MPP systems are better suited for a workload where you need a very high throughput of data. in MPP data warehouse environment, we load large amount of data and we need to efficiently query large portions of this data.

Measuring Performance(Response Time, Throughput)  : When measuring how fast something is, there are two kinds of measures. One is to measure how long it takes and other is to measure how much gets done per unit time. The former is referred to as response time, access time, transmission time, or execution time depending on the context. The latter is referred to as throughput.

Throughput
refers to how much data can be transferred from one location to another in a given amount of time. The throughput is usually measured in bits per second (bit/s or bps), and sometimes in data packets per second or data packets per time slot. A throughput measure is an amount of something per unit time. For operating systems throughput is often measured as tasks or transactions per unit time. For storage systems or networks throughput is measured as bytes or bits per unit time. For processors, the number of instructions executed per unit time is an important component of performance.

A system’s throughput defines its overall capability to process data. DBMS throughput can be measured in queries per second, transactions per second, or average response times. DBMS throughput is closely related to the processing capacity of the underlying systems (disk I/O, CPU speed, memory bandwidth, and so on), so it is important to know the throughput capacity of your hardware when setting DBMS throughput goals.

Response Time: This is specified by an elapsed time from the initiation of some activity until its completion.The phrase response time is the most generic phrase, and is often used in operating systems contexts. The phrase access time is used for evaluating data storage systems. The phrase nodal delay is used for evaluating links in computer networks. The phrase execution time is used for evaluating processors.

Below are the differences between Response Time and Throughput 

 Throughput Response Time
 Throughput measures quantity of queries completed during a time interval Response Time measures the average duration of queries
 Throughput is a measure of the amount of work processed Response Time is a measure of process completion
 Throughput measures - how many queries were processed Response Time measures- how long the particular processing takes
 Throughput measures - the number of queries executed in an hour Response Time is the elapsed time per query

In Massively Parallel Processing (MPP) databases data is partitioned across multiple servers or nodes with each server/node having memory/processors to process data locally. All communication is via a network interconnect — there is no disk-level sharing or contention to be concerned with as it is a ‘shared-nothing’ architecture.

Analytic System

Analytic system supports the evaluation of the process. For example: How the sales orders are trending this month versus last month? Who are our best customers? Sales goal versus actual sales? Is particular promotion working or not?

The interaction with an analytic system takes place exclusively through queries that retrieve data about business processes. Here information is neither created nor modified.

                                                                          Operational Systems                                    Analytical Systems
PurposeExecution of a business processMeasurement of a business process
Primary Interaction StyleInsert, Update, Delete, QueryQuery
Scope of InteractionIndividual transactionAggregated transactions
Query PatternsPredictable and stableUnpredictable and changing
Temporal FocusCurrentHistoric and Current
Design OptimizationUpdated concurrencyHigh-performance query
Design PrincipleER design in 3NFDimensional Design (Star Schema, Cube)
Also Known AsOLTP
Source System
Transaction System
Data Warehouse
Data Mart
Operational System vs. Analytic System

Star Schema

Star schema consists of fact and dimension tables. The fact table holds the main data. The dimension tables describe the facts presented in the fact table. Most queries against a star schema follow a consistent pattern. One or more facts are requested, along with the dimensional attributes that provide the desired context. The focus is on summarized data.

The star schema gets its name from its appearance; when drawn with the fact table in the center, it looks like a star or asterisk.

Fact Table

At the core of the star schema is the fact table. In addition to presenting the facts, the fact table includes surrogate keys that refer to each of the associated dimension tables. Each row in the fact table stores facts at a specific level of details. This level of detail is known as the fact table’s grain.

Dimension Tables

In a star schema, a dimension table contains columns representing dimensions. These columns provide context for facts. A well-developed set of dimension tables provides powerful and diverse analytic capabilities. Dimension tables with a large number of attributes maximize analytic value.

Snowflake Schema
In the snowflake schema, multiple centralized fact tables connect to multiple dimension tables. A single dimension table may have multiple parent tables.

ETL
ETL means extract, transform and load. You extract the data, perform the transformation on the ETL machine, and then load the data. ETL is tied to the hardware. There are many ETL tools available in the market.

ELT
ELT means extract, load, and then transform. This strategy is strongly suggested for large data. While this is powerful, it is difficult to integrate with current tools.

MPP data warehouse solutions, like Teradata, Oracle Exadata, IBM Netezza and Microsoft Parallel Data Warehouse or Greenplum are examples of these approaches. Usually, the approach is a shared nothing MPP architecture of nodes, which have their own segment of data on their own disks (not a shared memory or disk). Most consequently all components (including the hardware) are perfectly tuned and aligned for this purpose. To achieve this, pre-installed and configured appliances are commonly used, so instead of buying hardware and software individually and trying to make it run well and fast, you get a “black box” (i.e. one or more racks) of components and software that are selected and configured in the best possible way.