The EMC Greenplum Data Integration Accelerator (DIA) is used for fast data loading, both with the included gpfdist utility and with Informatica, a popular data integration (DI) tool. For customers interested in using Informatica as a DI tool, we explain the reference architecture and how the tool is integrated into the DIA. There is a brief section on using the Greenplum PWX Adapter with the DIA and the Informatica DI tool, and finally a description of how the DIA servers can serve as nodes in the Informatica Enterprise Grid.

To meet the challenges of fast data loading, the DIA is purpose-built for batch and micro-batch loading, and it supports a growing number of data integration applications such as Informatica, Talend, and Pentaho. More software titles are being qualified for use in the DIA at this time.

The DIA is built specifically to facilitate fast data loading to the DCA. It integrates the Greenplum data loading utility, gpfdist, with server, storage, and networking gear in a single system. It uses the high-speed internal 10 Gb/sec communication network to deliver data quickly to the DCA. The DIA meets the reliability requirements of mission-critical enterprises, with data availability provided by RAID protection at the disk level. It is offered as a module for the DCA, allowing flexible configurations that meet the customer’s particular solution requirements.

The DIA comes in blocks of 4 servers. Each block is referred to as a module. You can order up to 4 DIA modules for a DCA installation. Currently each DIA server is a commodity server with 12 CPU cores, 48GB of memory, and 12 2TB SATA disks; a module of 4 servers has a total usable capacity of about 70TB. The exact server model may change over time, but the architecture should remain the same. The Greenplum gpfdist utility is pre-loaded on each server by default.
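To illustrate how a pre-loaded gpfdist instance is typically put to use, the following Python sketch assembles the command line that starts gpfdist on a DIA server and a matching external table definition on the database side. The `-d`, `-p`, and `-l` options and the `gpfdist://` location syntax are standard Greenplum features. Everything else is an assumption made for the example: the host names `dia1` and `dia2`, the port, the directory, the table name and columns, and the helper function names.

```python
# Sketch only: host names, paths, port, and table layout are hypothetical.

def gpfdist_command(directory, port, log_file):
    """Build the argument list that starts gpfdist serving files from `directory`."""
    return ["gpfdist", "-d", directory, "-p", str(port), "-l", log_file]


def external_table_ddl(table, columns, hosts, port, pattern, delimiter="|"):
    """Build CREATE EXTERNAL TABLE DDL that reads from one gpfdist instance per host."""
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in columns)
    locations = ", ".join(f"'gpfdist://{host}:{port}/{pattern}'" for host in hosts)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols}) "
        f"LOCATION ({locations}) "
        f"FORMAT 'TEXT' (DELIMITER '{delimiter}');"
    )


# Command to run on each DIA server (hypothetical staging directory).
cmd = gpfdist_command("/data/load", 8081, "/data/load/gpfdist.log")

# DDL to run in the database; two DIA servers feed the same external table.
ddl = external_table_ddl(
    table="ext_sales",
    columns=[("id", "int"), ("amount", "numeric")],
    hosts=["dia1", "dia2"],
    port=8081,
    pattern="*.txt",
)

print(" ".join(cmd))
print(ddl)
```

Listing several `gpfdist://` locations in one external table lets the database pull from multiple DIA servers in parallel, which is the reason the module ships with gpfdist on every server.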
The DIA Configurations

The following table shows some of the characteristics of the DIA modules. The physical dimensions, weights, power and cooling details can be found in the DIA specification sheet on EMC Powerlink.

=======================================
                      DIA module
=======================================
# servers             4
# CPU cores           48
Total memory          192 GB
Total SATA disks      48
Usable capacity       70 TB
=======================================
Table 1: DIA specifications

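The module figures in Table 1 follow directly from the per-server figures multiplied by the 4 servers in a module. The short Python sketch below reproduces that arithmetic; the class and function names are illustrative only, and the raw capacity it computes is not a figure stated in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiaServer:
    """Per-server figures as quoted in the text (model may change over time)."""
    cpu_cores: int = 12
    memory_gb: int = 48
    sata_disks: int = 12
    disk_size_tb: int = 2


SERVERS_PER_MODULE = 4


def module_totals(server=DiaServer(), servers=SERVERS_PER_MODULE):
    """Scale the per-server figures up to one DIA module."""
    return {
        "servers": servers,
        "cpu_cores": server.cpu_cores * servers,
        "memory_gb": server.memory_gb * servers,
        "sata_disks": server.sata_disks * servers,
        # Raw capacity before RAID and formatting overhead (derived, not quoted).
        "raw_capacity_tb": server.sata_disks * server.disk_size_tb * servers,
    }


totals = module_totals()
print(totals)
```

The derived raw capacity is 96 TB per module, compared with the 70 TB usable capacity listed in Table 1. The gap is consistent with RAID protection and file system overhead, although the source does not break it down.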
For customers looking for an appliance that has all the hardware and software ready to scale loading performance to their needs, adding DIA modules to the DCA may be the right solution. As with most DCA modules, the DIA is orderable in building blocks (modules) of 4 servers each. The customer must already have at least one GPDB module, which is considered the base module. The DIA can reside in the same rack, above the GPDB module(s), if there is room in the rack.

Building Blocks of the DIA

DIAs are purchased as optional modules of the DCA. Currently, customers can order their DCA/DIA in functional modules of up to 6 racks. In the near future, this number will be increased to up to 12 racks. Until then, such a configuration can be requested via RPQ. The physical DCA racks and modules are assigned numbers (see table below):

===================================================================
Rack 1     Rack 2     Rack 3     Rack 4     Rack 5     Rack 6
===================================================================
Module 4   Module 4   Module 4   Module 4   Module 4   Module 4
Module 3   Module 3   Module 3   Module 3   Module 3   Module 3
Module 2   Module 2   Module 2   Module 2   Module 2   Module 2
Module 1   Module 1   Module 1   Module 1   Module 1   Module 1
===================================================================
Table 2: Physical numbering of modules in DCA/DIA racks

The racks are numbered starting from 1 and incrementing by 1, up to the maximum number of supported racks. Each rack can house up to 4 modules, each module containing 4 DCA/DIA servers.

The modules are made up of the following functional modules. A Greenplum database (GPDB) module is mandatory as module 1 in Rack 1. If more GPDB modules are desired, they are built upward until the rack is filled, and an additional rack can then be ordered. The DIA modules are then added on top of the GPDB modules. The order in which the modules are added follows a manufacturing rule.

During the manufacturing process, certain rules and logic are used in the building sequence:

1. A GPDB module is always the first module in Rack 1.
2. From there, if more GPDB modules are needed, they are built upwards.
3. DIA modules are added next.
4. Then Hadoop and other modules are added.
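The building sequence above can be sketched as a small Python function that assigns each ordered module a rack number and a module position. This is an illustration under stated assumptions, not the actual manufacturing logic: it assumes that once a rack is full the sequence continues at module 1 of the next rack, it uses the current limit of 6 racks, and the function name and the "OTHER" label (standing in for Hadoop and other modules) are hypothetical.

```python
MODULES_PER_RACK = 4   # each rack houses up to 4 modules
MAX_RACKS = 6          # current orderable limit quoted in the text
MAX_DIA_MODULES = 4    # up to 4 DIA modules per DCA installation


def build_layout(gpdb, dia=0, other=0):
    """Return (rack, module, kind) tuples in build order: GPDB, then DIA, then others."""
    if gpdb < 1:
        raise ValueError("at least one GPDB module is required as the base module")
    if dia > MAX_DIA_MODULES:
        raise ValueError(f"at most {MAX_DIA_MODULES} DIA modules can be ordered")

    sequence = ["GPDB"] * gpdb + ["DIA"] * dia + ["OTHER"] * other
    if len(sequence) > MODULES_PER_RACK * MAX_RACKS:
        raise ValueError("configuration exceeds the supported number of racks")

    layout = []
    for index, kind in enumerate(sequence):
        rack = index // MODULES_PER_RACK + 1    # racks are numbered from 1
        module = index % MODULES_PER_RACK + 1   # module 1 is the bottom position
        layout.append((rack, module, kind))
    return layout


layout = build_layout(gpdb=2, dia=1, other=2)
for rack, module, kind in layout:
    print(f"Rack {rack}, Module {module}: {kind}")
```

For an order of 2 GPDB modules, 1 DIA module, and 2 other modules, the sketch fills Rack 1 from the bottom (GPDB, GPDB, DIA, OTHER) and places the remaining module at position 1 of Rack 2.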