DIA Configurations

posted Sep 14, 2012, 5:50 AM by Sachchida Ojha
The EMC Greenplum Data Integration Accelerator (DIA) is used for fast data loading, either with the included gpfdist utility or with Informatica, a popular data integration (DI) tool.

For customers interested in using Informatica as a DI tool, we will explain the reference architecture and how the tool is integrated into the DIA. There is also a brief section on using the Greenplum PWX Adapter with the DIA and the Informatica DI tool and, finally, on how the DIA servers can serve as nodes in the Informatica Enterprise Grid.

To meet the challenges of fast data loading, the EMC Data Integration Accelerator (DIA) is purpose-built for batch and micro-batch loading, and leverages a growing number of data integration applications such as Informatica, Talend, and Pentaho. More software titles are being qualified for use in the DIA at this time.

The Data Integration Accelerator (DIA) is specially built to facilitate fast data 
loading to the DCA. It integrates the Greenplum data loading utility called gpfdist 
with the server, storage and networking gear into a single system. It leverages the 
high-speed internal 10 Gb/sec communication network to deliver the data quickly 
to the DCA.
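
As a rough illustration of how gpfdist-based loading is usually wired up, the sketch below builds a CREATE EXTERNAL TABLE statement that reads from one gpfdist instance per DIA server. The host names (etl1 to etl4), port 8081, file pattern and table names are all assumptions made for this example, not values taken from the DIA documentation. It also assumes a gpfdist process is already running on each server (started with its -d directory and -p port options).

```python
def gpfdist_locations(hosts, port, pattern):
    """Return one gpfdist:// URL per DIA server."""
    return [f"gpfdist://{host}:{port}/{pattern}" for host in hosts]


def external_table_ddl(table, like_table, hosts, port=8081,
                       pattern="*.txt", delimiter="|"):
    """Build a readable external table definition over several gpfdist hosts."""
    urls = gpfdist_locations(hosts, port, pattern)
    locations = ",\n    ".join(f"'{url}'" for url in urls)
    return (
        f"CREATE EXTERNAL TABLE {table} (LIKE {like_table})\n"
        f"LOCATION (\n    {locations}\n)\n"
        f"FORMAT 'TEXT' (DELIMITER '{delimiter}');"
    )


# Hypothetical host names for the four servers of one DIA module.
DIA_HOSTS = ["etl1", "etl2", "etl3", "etl4"]

# Hypothetical table names: ext_sales is the external table, sales the target.
ddl = external_table_ddl("ext_sales", "sales", DIA_HOSTS)
print(ddl)
```

Listing every DIA server in the LOCATION clause lets the database segments pull data from all of the gpfdist instances in parallel. A load would then typically be an INSERT ... SELECT from the external table into the target table.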

The DIA meets the reliability requirements of mission-critical enterprises, with 
data availability provided by RAID protection at the disk level. It is offered as a 
module for the DCA, allowing flexible configurations that meet each customer’s 
particular solution requirements.

The DIA comes in blocks of 4 servers. Each block is referred to as a module. You 
can order up to 4 DIA modules for a DCA installation. Currently each DIA server 
is a commodity server with 12 CPU cores, 48GB of memory and 12 2TB SATA disks. 
A module of 4 servers has a total usable capacity of about 70TB. The exact server 
model may change over time, but the architecture should 
remain the same. With each server, the Greenplum gpfdist utility is pre-loaded by 

The DIA Configurations
The following table shows some of the characteristics of the DIA modules. The physical dimensions, weights, power and cooling details can be found in the DIA
specification sheet on EMC Powerlink.

                        DIA module
# servers               4
# CPU cores             48
Total memory            192 GB
Total SATA disks        48
Usable capacity         70 TB
Table 1: DIA specifications
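
The module figures in Table 1 follow from the per-server specification (12 CPU cores, 48GB of memory, 12 2TB SATA disks) multiplied by 4 servers. The short sketch below reproduces that arithmetic. The class and function names are made up for illustration. Note that it computes raw disk capacity only. The 70TB usable figure is quoted from the table, and the source does not break down the gap between raw and usable capacity (presumably RAID protection and other overhead).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiaServer:
    """Per-server figures as stated in the text."""
    cpu_cores: int = 12
    memory_gb: int = 48
    sata_disks: int = 12
    disk_tb: int = 2


SERVERS_PER_MODULE = 4


def module_totals(server=DiaServer(), servers=SERVERS_PER_MODULE):
    """Scale the per-server figures up to one DIA module."""
    return {
        "servers": servers,
        "cpu_cores": servers * server.cpu_cores,
        "memory_gb": servers * server.memory_gb,
        "sata_disks": servers * server.sata_disks,
        # Raw capacity only; usable capacity is lower (about 70TB per Table 1).
        "raw_capacity_tb": servers * server.sata_disks * server.disk_tb,
    }


totals = module_totals()
print(totals)
```

The raw capacity works out to 96TB per module, against the roughly 70TB usable listed in Table 1.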

For customers looking for an appliance that has all the hardware and software ready to scale the loading performance for their needs, adding DIA modules to the
DCA may be the right solution.

As with most DCA modules, the DIA is ordered in building blocks (modules) of 4 servers each. To add a DIA, the customer must already have at least one GPDB module, which is considered the base module. The DIA can reside in the same rack above the GPDB module(s) if there is room in the rack.

Building Blocks of the DIA
DIAs are purchased as optional modules of the DCA. Currently, customers can order their DCA/DIA in functional modules of up to 6 racks. In the near future, this
number will be increased to up to 12 racks. For now, the larger configuration can be requested via RPQ.

The physical DCA racks and modules are assigned numbers (see table below): 

Rack 1       Rack 2       Rack 3       Rack 4       Rack 5       Rack 6
Module 4     Module 4     Module 4     Module 4     Module 4     Module 4
Module 3     Module 3     Module 3     Module 3     Module 3     Module 3
Module 2     Module 2     Module 2     Module 2     Module 2     Module 2
Module 1     Module 1     Module 1     Module 1     Module 1     Module 1

Table 2: Physical numbering of modules in DCA/DIA racks

The racks are numbered starting from 1 and incrementing by 1, up to the 
maximum number of supported racks. Each rack can house up to 4 
modules, and each module contains 4 DCA/DIA servers. The physical module 
positions are populated with functional modules as follows. A Greenplum Database 
(GPDB) module is mandatory as module 1 in Rack 1. If more GPDB modules are 
desired, they are built upward until the rack is filled, and an additional rack can 
then be ordered. The DIA modules are then added on top of the GPDB modules.

The order in which the modules are added follows manufacturing rules. During the 
manufacturing process, the following logic is used in the building sequence:
1. The GPDB module is always the first module in Rack 1
2. From there, if more GPDB modules are needed, they are built upwards
3. DIA modules are added next
4. Then Hadoop and other modules are added
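
The building sequence above can be sketched as a small placement function. This is only an illustration of the stated rules, not EMC's actual manufacturing logic. It assumes that each rack is filled bottom-up (module 1 to module 4) before the next rack is started, and it uses the limits mentioned in the text: 4 modules per rack, up to 6 racks, and up to 4 DIA modules. The function and constant names are invented for the example.

```python
MODULES_PER_RACK = 4   # each rack houses up to 4 modules
MAX_RACKS = 6          # current limit stated in the text
MAX_DIA_MODULES = 4    # up to 4 DIA modules per DCA installation


def build_layout(gpdb, dia=0, other=0):
    """Map (rack, module) positions to module types in build order."""
    if gpdb < 1:
        raise ValueError("at least one GPDB module is required as the base")
    if dia > MAX_DIA_MODULES:
        raise ValueError("no more than 4 DIA modules can be ordered")

    # Rules 1-4: GPDB first, then DIA, then Hadoop and other modules.
    sequence = ["GPDB"] * gpdb + ["DIA"] * dia + ["Hadoop/other"] * other
    if len(sequence) > MODULES_PER_RACK * MAX_RACKS:
        raise ValueError("configuration exceeds 6 racks")

    layout = {}
    for index, kind in enumerate(sequence):
        rack = index // MODULES_PER_RACK + 1     # racks are numbered from 1
        module = index % MODULES_PER_RACK + 1    # module 1 is the bottom slot
        layout[(rack, module)] = kind
    return layout


layout = build_layout(gpdb=2, dia=1, other=2)
for (rack, module), kind in sorted(layout.items()):
    print(f"Rack {rack}, Module {module}: {kind}")
```

With 2 GPDB modules, 1 DIA module and 2 other modules, Rack 1 holds GPDB in modules 1 and 2, the DIA in module 3 and the first of the other modules in module 4, and the remaining module starts Rack 2.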