The EMC Greenplum Data Integration Accelerator (DIA) is used for fast data loading, both with the included gpfdist utility and with Informatica, a popular data integration (DI) tool.
For customers interested in using Informatica as a DI tool, we explain the reference architecture and how the tool is integrated into the DIA. A brief section covers using the Greenplum PWX Adapter with the DIA and the Informatica DI tool, and a final section describes how the DIA servers can serve as nodes in the Informatica Enterprise Grid.
To meet the challenges of fast data loading, the EMC Data Integration Accelerator (DIA) is purpose-built for batch and micro-batch loading, and leverages a growing number of data integration applications such as Informatica, Talend, and Pentaho. Additional software titles are currently being qualified for use with the DIA.
The Data Integration Accelerator (DIA) is specially built to facilitate fast data
loading to the DCA. It integrates the Greenplum data loading utility called gpfdist
with the server, storage and networking gear into a single system. It leverages the
high-speed internal 10 Gb/sec communication network to deliver the data quickly
to the DCA.
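As a rough illustration of how gpfdist is commonly used for parallel loading, the sketch below builds a Greenplum CREATE EXTERNAL TABLE statement whose LOCATION clause lists one gpfdist:// URL per DIA server. The host names, port, file pattern, table name, and columns are all assumptions made for this example; they are not values taken from the DIA documentation.

```python
def gpfdist_external_table_sql(table, columns, hosts, port, file_pattern, delimiter="|"):
    """Build a CREATE EXTERNAL TABLE statement that reads from several gpfdist servers."""
    column_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    # One gpfdist:// URL per server, so the load is spread across all of them.
    locations = ",\n    ".join(
        f"'gpfdist://{host}:{port}/{file_pattern}'" for host in hosts
    )
    return (
        f"CREATE EXTERNAL TABLE {table} ({column_sql})\n"
        f"LOCATION (\n    {locations}\n)\n"
        f"FORMAT 'TEXT' (DELIMITER '{delimiter}');"
    )


# Hypothetical host names for the 4 servers of one DIA module.
dia_hosts = ["dia1", "dia2", "dia3", "dia4"]

sql = gpfdist_external_table_sql(
    table="ext_sales",                                  # hypothetical table name
    columns=[("id", "int"), ("amount", "numeric"), ("sold_on", "date")],
    hosts=dia_hosts,
    port=8081,                                          # assumed gpfdist port
    file_pattern="sales_*.txt",                         # assumed file naming
)
print(sql)
```

The generated statement would then be run against the database, and the data loaded with an ordinary INSERT ... SELECT from the external table.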
The DIA meets the reliability requirements of the most mission-critical enterprises,
with data availability provided by RAID protection at the disk level. It is offered as a
module for the DCA, allowing flexible configurations that meet each customer’s
particular solution requirements.
The DIA comes in blocks of 4 servers. Each block is referred to as a module. You
can order up to 4 modules of DIA for a DCA installation. Currently, each DIA server
is a commodity server with 12 CPU cores, 48GB of memory, and 12 2TB SATA disks.
A module of 4 servers provides a total usable capacity of about 70TB (see Table 1).
The exact server model may change over time, but the architecture should
remain the same. With each server, the Greenplum gpfdist utility is pre-loaded by
The DIA Configurations
The following table shows some of the characteristics of the DIA modules. The physical dimensions, weights, power and cooling details can be found in the DIA
specification sheet on EMC Powerlink.
Characteristic        Per DIA module
# servers             4
# CPU cores           48
Total memory          192 GB
Total SATA disks      48
Usable capacity       70 TB
Table 1: DIA specifications
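The per-module figures in Table 1 follow from the per-server specification (12 CPU cores, 48 GB of memory, and 12 2 TB SATA disks per server, 4 servers per module). The short sketch below reproduces that arithmetic. The 70 TB usable figure is taken directly from Table 1 rather than derived, and scaling it linearly with the number of modules is an assumption of this example.

```python
SERVERS_PER_MODULE = 4
CORES_PER_SERVER = 12
MEMORY_GB_PER_SERVER = 48
DISKS_PER_SERVER = 12
DISK_TB = 2
USABLE_TB_PER_MODULE = 70  # quoted from Table 1, not computed


def module_totals(modules=1):
    """Return aggregate resources for the given number of DIA modules."""
    servers = modules * SERVERS_PER_MODULE
    return {
        "servers": servers,
        "cpu_cores": servers * CORES_PER_SERVER,
        "memory_gb": servers * MEMORY_GB_PER_SERVER,
        "sata_disks": servers * DISKS_PER_SERVER,
        "raw_tb": servers * DISKS_PER_SERVER * DISK_TB,
        # Assumption: usable capacity scales linearly with module count.
        "usable_tb": modules * USABLE_TB_PER_MODULE,
    }


one = module_totals(1)   # a single DIA module, as in Table 1
four = module_totals(4)  # the maximum of 4 DIA modules per DCA installation
print(one)
print(four)
```

For one module this gives 96 TB of raw disk against the 70 TB usable figure in the table; the difference is presumably RAID and other overhead.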
For customers looking for an appliance that has all the hardware and software ready to scale loading performance to their needs, adding DIA modules to the
DCA may be the right solution.
As with most DCA modules, the DIA is orderable in building blocks (modules) of 4 servers each. To add a DIA, the customer must already have at least one GPDB module, which is considered the base module. The DIA can reside in the same rack, above the GPDB module(s), if there is room in the rack.
Building Blocks of the DIA
DIAs are purchased as optional modules of the DCA. Currently, customers can order their DCA/DIA in functional modules of up to 6 racks. In the near future, this
number will be increased to up to 12 racks. For now, the larger configuration can be requested via RPQ.
The physical DCA racks and modules are assigned numbers (see table below):
Rack 1     Rack 2     Rack 3     Rack 4     Rack 5     Rack 6
Module 4   Module 4   Module 4   Module 4   Module 4   Module 4
Module 3   Module 3   Module 3   Module 3   Module 3   Module 3
Module 2   Module 2   Module 2   Module 2   Module 2   Module 2
Module 1   Module 1   Module 1   Module 1   Module 1   Module 1
Table 2: Physical numbering of modules in DCA/DIA racks
The racks are numbered starting from 1 and incrementing by 1, up to the
maximum number of supported racks. Each rack can house up to 4
modules, each containing 4 DCA/DIA servers. Each physical module is built
as a functional module of a particular type. A Greenplum Database (GPDB) module is
mandatory as Module 1 in Rack 1. If more GPDB modules are desired, they are built upward until
the rack is filled, and an additional rack can then be ordered. The DIA modules are
then added on top of the GPDB modules.
The order in which the modules are added follows manufacturing rules. During the
manufacturing process, the following rules are applied in the building sequence:
1. A GPDB module is always the first module in Rack 1.
2. From there, if more GPDB modules are needed, they are built upward.
3. DIA modules are added next.
4. Then Hadoop and other modules are added.
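The build sequence above can be sketched as a small placement routine. This is an illustrative model only: it assumes modules spill into the next rack once the current rack's 4 positions are full, it treats "Hadoop" as a stand-in for "Hadoop and other modules", and the function and constant names are hypothetical.

```python
MODULES_PER_RACK = 4
MAX_RACKS = 6             # current limit; larger configurations require an RPQ
MAX_DIA_MODULES = 4
BUILD_ORDER = ["GPDB", "DIA", "Hadoop"]  # GPDB first, then DIA, then the rest


def layout(counts):
    """Return (rack, module_position, kind) tuples in build order."""
    if counts.get("GPDB", 0) < 1:
        raise ValueError("at least one GPDB module is required as the base module")
    if counts.get("DIA", 0) > MAX_DIA_MODULES:
        raise ValueError("at most 4 DIA modules can be ordered")
    # Flatten the order: all GPDB modules, then all DIA, then all Hadoop.
    sequence = [kind for kind in BUILD_ORDER for _ in range(counts.get(kind, 0))]
    if len(sequence) > MODULES_PER_RACK * MAX_RACKS:
        raise ValueError("configuration exceeds 6 racks of 4 modules each")
    placements = []
    for index, kind in enumerate(sequence):
        rack = index // MODULES_PER_RACK + 1      # racks are numbered from 1
        position = index % MODULES_PER_RACK + 1   # Module 1 is the bottom position
        placements.append((rack, position, kind))
    return placements


plan = layout({"GPDB": 3, "DIA": 2, "Hadoop": 1})
for rack, position, kind in plan:
    print(f"Rack {rack}, Module {position}: {kind}")
```

With 3 GPDB, 2 DIA, and 1 Hadoop module, the first DIA module takes the last free position in Rack 1 and the remaining modules continue from the bottom of Rack 2.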