MPP DATA MODEL - CONCEPTS

Before we design the data model for an MPP data warehouse system, we must understand what MPP is, how it works, and what it is built for. MPP stands for massively parallel processing. MPP systems are well suited to workloads that require very high data throughput. In an MPP data warehouse environment, we load large amounts of data and need to efficiently query large portions of it.

Measuring Performance (Response Time, Throughput)

When measuring how fast something is, there are two kinds of measures: how long a task takes, and how much work gets done per unit of time. The former is referred to as response time, access time, transmission time, or execution time, depending on the context. The latter is referred to as throughput.

Response Time

Response time is the elapsed time from the initiation of some activity until its completion. "Response time" is the most generic phrase and is often used in operating systems contexts. "Access time" is used for evaluating data storage systems, "nodal delay" for evaluating links in computer networks, and "execution time" for evaluating processors. The key difference between the two measures: response time tells you how long a single task takes, while throughput tells you how much work completes per unit of time; improving one does not necessarily improve the other.
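As an illustration, here is a minimal Python sketch that measures both quantities for the same job. The workload (`process_batch`) is a hypothetical stand-in, not a real warehouse operation; the point is only that response time is timed per task, while throughput is total work divided by total elapsed time.

```python
import time

def process_batch(rows):
    # Hypothetical stand-in workload: double every value in the batch.
    return [r * 2 for r in rows]

batches = [list(range(1000)) for _ in range(50)]

response_times = []
total_rows = 0
start = time.perf_counter()
for batch in batches:
    t0 = time.perf_counter()
    process_batch(batch)
    # Response time: elapsed time for this one batch, start to finish.
    response_times.append(time.perf_counter() - t0)
    total_rows += len(batch)
elapsed = time.perf_counter() - start

avg_response = sum(response_times) / len(response_times)
# Throughput: total work completed per unit of time.
throughput = total_rows / elapsed

print(f"average response time per batch: {avg_response:.6f} s")
print(f"throughput: {throughput:.0f} rows/s")
```

Note that a system can improve throughput without improving response time, for example by processing batches in parallel: each batch still takes just as long, but more of them finish per second.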
In massively parallel processing (MPP) databases, data is partitioned across multiple servers or nodes, and each server/node has its own memory and processors to process its data locally. All communication is via a network interconnect; because this is a "shared-nothing" architecture, there is no disk-level sharing or contention to be concerned with.

Analytic Systems

An analytic system supports the evaluation of business processes. For example: How are sales orders trending this month versus last month? Who are our best customers? How do actual sales compare to the sales goal? Is a particular promotion working or not? Interaction with an analytic system takes place exclusively through queries that retrieve data about business processes; here, information is neither created nor modified. Operational systems, by contrast, support the execution of business processes, and it is there that information is created and modified.
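The shared-nothing partitioning idea can be sketched in a few lines of Python. The node dictionaries, the `node_for` routing function, and the final merge step below are hypothetical stand-ins for what an MPP engine does across real servers: rows are routed to a node by hashing a distribution key, each node aggregates its local slice independently, and a final step merges the per-node partial results (a scatter/gather plan).

```python
NUM_NODES = 4
# Each "node" is modeled as a local list; in a real MPP system each node
# would be a separate server with its own memory, CPU, and disk.
nodes = {i: [] for i in range(NUM_NODES)}

def node_for(key):
    # Route a row to a node by hashing its distribution key.
    return hash(key) % NUM_NODES

rows = [("order-%d" % i, i * 10) for i in range(12)]
for key, amount in rows:
    nodes[node_for(key)].append((key, amount))

# Each node aggregates only its local data -- no shared disk is touched.
partials = [sum(amount for _, amount in nodes[i]) for i in range(NUM_NODES)]
# A single gather step combines the small partial results over the interconnect.
total = sum(partials)
print(total)  # → 660, regardless of how the rows were distributed
```

The choice of distribution key matters in practice: a key with skewed values sends a disproportionate share of rows to one node, and that node then limits the throughput of the whole cluster.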
Star Schema

A star schema consists of fact and dimension tables. The fact table holds the main data, and the dimension tables describe the facts presented in the fact table. Most queries against a star schema follow a consistent pattern: one or more facts are requested, along with the dimensional attributes that provide the desired context. The focus is on summarized data. The star schema gets its name from its appearance; when drawn with the fact table in the center, it looks like a star or asterisk.

Fact Table

At the core of the star schema is the fact table. In addition to presenting the facts, the fact table includes surrogate keys that refer to each of the associated dimension tables. Each row in the fact table stores facts at a specific level of detail. This level of detail is known as the fact table's grain.

Dimension Tables

In a star schema, a dimension table contains columns representing dimensions. These columns provide context for the facts. A well-developed set of dimension tables provides powerful and diverse analytic capabilities; dimension tables with a large number of attributes maximize analytic value.

Snowflake Schema

In the snowflake schema, multiple centralized fact tables connect to multiple dimension tables, and the dimensions are normalized into additional related tables; as a result, a single dimension table may have multiple parent tables.

ETL

ETL stands for extract, transform, and load: you extract the data, perform the transformation on the ETL machine, and then load the transformed data. Transformation capacity is therefore tied to the ETL hardware. There are many ETL tools available in the market.

ELT

ELT stands for extract, load, and then transform: the data is loaded into the target system first and transformed there. This strategy is strongly suggested for large data volumes. While powerful, it can be difficult to integrate with current tools.
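To make the fact/dimension pattern concrete, here is a small self-contained sketch using Python's built-in sqlite3 module. All table and column names (`fact_sales`, `dim_product`, `dim_date`) are made up for illustration; the query at the end shows the typical star-schema pattern of summarized facts joined to dimensional attributes for context.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension tables: descriptive context (the "who/what/when" of each fact).
cur.execute("CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,"
            " name TEXT, category TEXT)")
cur.execute("CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY,"
            " day TEXT, month TEXT)")

# Fact table: one row per sale (the grain), with surrogate keys
# referring to each associated dimension table.
cur.execute("""CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    quantity    INTEGER,
    amount      REAL)""")

cur.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
cur.executemany("INSERT INTO dim_date VALUES (?,?,?)",
                [(10, "2024-01-01", "2024-01"), (11, "2024-01-02", "2024-01")])
cur.executemany("INSERT INTO fact_sales VALUES (?,?,?,?)",
                [(1, 10, 3, 30.0), (2, 10, 1, 25.0), (1, 11, 2, 20.0)])

# Typical star-schema query: summarized facts, with dimension
# attributes providing the desired context.
cur.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_key = p.product_key
    JOIN dim_date d    ON f.date_key = d.date_key
    GROUP BY p.category, d.month""")
print(cur.fetchall())  # → [('Hardware', '2024-01', 75.0)]
```

In a snowflake schema, `dim_product` would itself be normalized, for example by moving `category` into a separate `dim_category` table referenced by a key, which adds one more join to the same query.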