DIA - EMC DATA INTEGRATION ACCELERATOR

The Data Integration Accelerator (DIA) is the first addition to the EMC Greenplum DCA Accelerator family of products. Focused on solving the challenges of data loading in a parallel and scalable model, the DIA is purpose-built for DCA customers looking to shorten batch loads or wanting to implement micro-batch loading. The EMC® Greenplum® Data Integration Accelerator (DIA) is a purpose-built, highly scalable, parallel data integration module engineered for parallel execution of data loading with existing EMC Greenplum Data Computing Appliance installations.

The Greenplum Data Integration Accelerator (DIA) is well suited for customers with fast data loading requirements. It uses the preloaded gpfdist Greenplum utility to
achieve unparalleled fast data loading. For customers entrenched with ETL solutions such as Informatica, the DIA servers can be configured as Informatica
Integration Services servers, as well as nodes in the Enterprise Grid option. The DIA servers will work seamlessly with the data servers in the data center environment to provide a powerful data integration system for the Greenplum database.


DATA INTEGRATION ACCELERATOR FEATURES

posted Sep 17, 2012, 9:41 AM by Sachchida Ojha


1. RAPID DEPLOYMENT AND PREDICTABLE PERFORMANCE : The Greenplum Data Integration Accelerator is a purpose-built, open systems data accelerator
that architecturally integrates Greenplum data loading software (gpfdist), server, storage and networking into a single, easy-to-implement system. The packaging and pre-tuning ensures predictable performance, while dramatically simplifying your data loading activates, resulting in reduced administration overhead.


2. TIGHTLY INTEGRATED WITH THE DATA COMPUTING APPLIANCE FAMILY: The DIA was designed for tight integration with the DCA family of data warehousing and analytic appliances. Removing the need for custom solutions and non-supported hardware, the DIA enables an end-to-end solution with a single support and management infrastructure. By leveraging a common 10 GB/s Ethernet network, the DIA enables the fastest data loading directly into the DCA segment servers.

3. ENGINEERED FOR PARALLEL EXECUTION OF DATA LOADING: The DIA, combined with the Greenplum DCA manages the flow of data into all nodes of the
appliance using the EMC Greenplum’s MPP Scatter/Gather Streaming™ (SG Streaming) technology. The system uses a “parallel-everywhere” approach to loading, in which data flows from all the nodes on the DIA to every segment server of the database without any sequential choke points. The combined solution achieves loading speeds of more than 10 terabytes per hour, two- to five-times faster than other appliance solutions.

4. ENTERPRISE HIGH AVAILABILITY: The Greenplum DIA is a system that meets the reliability requirements of the most mission- critical enterprises with data availability consisting of RAID protection at the disk level. This provides no data loss when losing a disk within any server.

5. GREENPLUM PERFORMANCE MONITOR: The DIA is managed via the Greenplum Performance Monitor application that provides a single view of the Data Computing Appliance and the Data Integration Accelerator from a single management console. The system includes Secure Remote Support (call home) and provides email and SNMP notification in the case of any event needing attention.

6. PROACTIVE EMC ONE SUPPORT STRUCTURE: EMC Customer Support Services provides resources and services to quickly and proactively resolve solution-related issues and questions to ensure business continuity and a highly- available data environment. EMC’s global maintenance and support is available
around-the-clock via comprehensive online support tools including Live Chat and online service request management, Secure Remote Support (call home), live telephone support, and onsite support through the industry’s leading global field service organization.

DATA INTEGRATION ACCELERATOR CONFIGURATIONS

posted Sep 17, 2012, 9:35 AM by Sachchida Ojha

Available in three configurations:
   DIA10 Quarter Rack   DIA100 Half Rack    DIA1000 Full Rack
  DIA Servers   4 Servers   8 Servers  16 Servers
  Total CPU core 48 96 192
  Total Memory  192GB 384GB 768GB
  Total HDD’s (SATA)  48 96 192
  Usable Capacity  70TB 140TB 280TB
  Physical Dimensions Height 75 in-190 cm
Width 24 in–61 cm
Depth: 39.3in – 100cm
 Height 75 in-190 cm
Width 24 in–61 cm
Depth: 39.3in – 100cm
 Height 75 in-190 cm
Width 24 in–61 cm
Depth: 39.3in – 100cm
  Kilos  Weight: 940 lbs – 427 KgsWeight: 1,200 lbs – 545 KgsWeight: 1,700 lbs – 773 Kgs
 Power  2,478  3,980  6,980
  Cooling (BTU/HR)   8,450   13,600   23,800


 

How the DIA servers can be configured as a grid for the Informatica Enterprise Grid option

posted Sep 14, 2012, 6:22 AM by Sachchida Ojha

DIA servers as nodes in Informatica PowerCenter Grids

You can configure the DIA servers as nodes in an Informatica PowerCenter Grid, to make use of the many beneficial features of the Informatica Enterprise Grid Option, such as:

1.  adaptive load balancing
2.  high availability
3.  dynamic partitioning

The Informatica Enterprise Grid Option is an extra-cost option which you can separately purchase from Informatica.

When you configure the DIA servers into a grid, you can configure workflows and sessions to run on the grid. You can run a workflow, or a session on the grid.
When you run a workflow on a grid, the PowerCenter Integration Service runs a service process on each available DIA server in the grid, to load balance, and increase the performance and scalability of the computing processes.

When you run a session on a grid, the Integration Service distributes session threads to multiple Data Transformation Manager (DTM) processes on DIA
servers in the grid.

To set up the DIA servers as a PowerCenter Grid, you use the Administrator’s Console, and create a new Grid option for the domain you are using (see figure below.)

Note: To run workflows on a PowerCenter grid, you must have the Server grid option. To run sessions on a grid, you must have the Session on Grid option.

The Enterprise Grid option is useful when handling mission-critical,enterprise-wide data integration from a single platform. An example of where a grid may be useful is when a workflow has many  sessions that must be completed at the same time. Running this workflow on a single node will force it to run serially, sacrificing performance and usability. When this workflow is run on a grid, the sessions can be distributed to all the available nodes in the grid, according to the computing and memory resources required, ensuring maximum performance and maximum return of investment.

The DIA servers, combined with the massively-parallel processing databases in the DCA, are perfectly configured to be uses as nodes in a PowerCenter grid. The scalability nature of the DIA allows you to add power and performance to your Informatica grid when more performance is needed for your data integration projects.

Using PWX Adapter for Informatica on the DIA servers

posted Sep 14, 2012, 6:18 AM by Sachchida Ojha

Once the Integration servers are set up, you can read and write Greenplum database data using an ODBC (Open DataBase Connectivity) driver. For example, you may use the Progress Software Data Direct ODBC drivers to fetch meta data or write data to a Greenplum database, or you may use an open source psqlODBC PostgreSQL ODBC driver. The Greenplum connectivity package also provides the psqlODBC drivers for use with the PWX adapter. There are some commercial PostgreSQL ODBC drivers that may be suitable as well. All will perform adequately for read and write operations to Greenplum databases.

For the best data loading performance, EMC suggest you consider using the Greenplum PWX Adapter. The PWX Adapter will call the gpload utility, which is another EMC Greenplum utility. gpload will read the format of the source file, and create a control file using YAML syntax, and start a gpfdist session. It will then create an external table, and insert the data into the Greenplum target using gpfdist.

In order to use the PWX Adapter, you will need to install at the DIA server:

1. PWX Adapter
2. Loaders package and optionally,
3. Connectivity package

All three packages must be installed with the appropriate version level, depending on the Informatica version number, the operating systems platform on which it is running, and the Greenplum version number.

Once the PWX Adapter is installed, and registered at the Repository server, you can then start loading data, without using the ODBC drivers.


How can we install the Informatica Integration Services on the DIA servers.

posted Sep 14, 2012, 6:14 AM by Sachchida Ojha   [ updated Sep 14, 2012, 6:24 AM ]

The DIA servers are well-suited to host a number of industry data loading tools. These tools perform the tasks of Extract, Transform, and Load (ETL) of source data into Greenplum databases and data warehouses.

Informatica™ is a popular ETL tool. A large percentage of EMC’s customers are Informatica users. The DIA servers serves as the Integration service servers (RedHat Linux hosts.)

Installing Informatica on a DIA server

Before installing Informatica on a DIA server, install the Informatica Repository server, or take note of an existing one. Install Informatica on the DIA by logging into the DIA host as root or gpadmin. Inflate the zip file by using gunzip, and run the install.sh script. Follow the on-screen prompt and  complete the installation process. 

When asked if you want to create a new Domain, or join an existing Domain, join the existing domain created on the Repository server.

When the install script is done, go to the Repository server, and start the Administrator’s web console. You should be able to add the new Integration server you have just installed (on the DIA server) to the Domain.


Installing Informatica Integration Services for Windows

In the previous section, we took advantage of the DIA servers being RedHat Linux Servers, and installed the Linux version of Informatica Integration service. If a
customer would like to use the Windows version of Informatica Integration service, he or she can do so, but will first have to install Windows operating systems over the RedHat Linux operating system.

After the Windows operating systems is installed on the DIA server, the installing of the Informatica Integration carries on as usual, and should be added to the
Informatica domain as described above.

For Windows users, you will have to install the Greenplum loaders package to include the gpfdist utility, and also install the Python language. Python is an open
source software that can be downloaded freely from the Internet. Currently, version 2.6 and above should work well for GPDB 4.0.5.0 and above. For older
versions of the GPDB, you may wish to check the Greenplum Database Administrator’s Guide or try version 2.5.4.

How gpfdist is used in the DIA servers for data loading?

posted Sep 14, 2012, 6:06 AM by Sachchida Ojha

gpfdist

gpfdist is Greenplum’s parallel file distribution server utility software. It is used with read-only external tables for fast, parallel data loading of text files into a Greenplum data warehouse.

gpfdist can be considered as a networking protocol, much like the http protocol. Running gpfdist is similar to running a HTTP server. It exposes the target file via TCP/IP to a local file directory containing the files. The target files are usually delimited files or CSV files, although it can also read tar and gziped files. In the case of tar and gzip files, the PATH must contain the location of the tar and gzip utilities.

For data uploading into a Greenplum database or data warehouse, you generate the flat files from an operational database or transactional database, using export, COPY, dump, or user-written software, depending on your business requirements. This process can be automated to run periodically.

How to install gpfdist
gpfdist is installed on each DIA server by default. If it is not already installed, there are two ways you can install it:

1. copy gpfdist from another source to the DIA server, and put the location of the file into the PATH environment variable.
2. download and install the Greenplum loaders package. Select the loaders package that is appropriate to your DCA database version.

For example, if your DCA Greenplum database is of version 4.0.5.0, you will want to download and install the Greenplum loaders package: Greenplum-loaders-4.0.50.0-build-8-RHEL5-x86_64.bin.

For EMC customers, the Greenplum loaders packages are available for download at the EMC Powerlink web site. (go to https://emc.subscribnet.com and click the Greenplum link.) 

Running gpfdist on a DIA server
gpfdist runs in a client-server model. The DIA server acts as the server, while the master server of the DCA is the client. You start the gpfdist process on a DIA server, by indicating the directory where you drop your source files. Optionally, you may also designate the TCP port number to be used.

A simple startup of the gpfdist server is the following command syntax:

gpfdist –d <file_files_directory> –p <port_number> –l <log_file> &

For example,

# gpfdist -d /etl-data -p 8887 -l gpfdist_8887.log &
[1] 28519
# Serving HTTP on port 8887, directory /home/gpadmin/etl-log

In the above example, gpfdist server is set up to run, anticipating data loading from flat files stored in a file directory /etl-data. Port 8887 is opened and listening for data requests, and a log file is created in /home/gpadmin called etl-log.

For each DIA server, you can run multiple instances of gpfdist, under different TCP ports and directories.

Running gpfdist clients

To initiate the data extraction process, we use the DCA Master Server as the client. We connect to gpfdist through the external tables. These tables can be created
using the psql command.

For example:

# psql -d gpdb
gpdb=#
gpdb=# create external table ext_load_performance (like
performance_table)
gpdb=# location (‘gpfdist://etl3:8887/performance_test.dat’)
gpdb=# format ‘text’ (delimiter ‘|’)
gpdb=# segment reject limit 20 rows;

In the above example, we create an external table. This table has all the attributes of a table called ‘performance_table’ (using the like performance_table clause), and makes uses of flat files stored in the host ‘etl3’, using port 8887. The flat file name is expected to be ‘performance_test.dat’.

Combining this statement with the statement started in the gpfdist server, we expect the file to be in the directory ‘/etl-data’. Both client and server will be communicating using TCP port 8887. Of course, the process itself will not initiate any process, or do any data loading. It has simply defined a connection between the client and the server.

To start the data loading process, you issue a command that requests data from the external table – ext_load_performance. This will trigger the input data to be read into the external table. For example:

gpdb=# insert into performance_table select * from
ext_load_performance;
INSERT 0 1093680

When this command is run, each segment server will connect to gpfdist simultaneously. The gpfdist utility will divide the flat file in chunks, and distribute the work among the segment servers. Taking advantage of the ‘share nothing’ architecture, Greenplum database is able to make use of the parallelism given to the data loading operations, achieving 2TB/hour data loading performance.

Fast data loading with DIA

posted Sep 14, 2012, 5:54 AM by Sachchida Ojha   [ updated Sep 14, 2012, 6:09 AM ]

The DIA servers are preloaded with RedHat Enterprise Linux operating systems, currently at version 5.5. It is also preloaded with a Greenplum utility called gpfdist. gpfdist is the Greenplum parallel file server utility used for facilitating fast data loading, making use of the DCA database’s MPP architecture.

Since the DIA servers are RedHat Linux hosts, they can also be configured as hosts for data integration software, such as Informatica, Talend, and Pentaho.

Now let's discuss about common questions asked by the customers such as,

1. How gpfdist is used in the DIA servers for data loading
2. How we can install the Informatica Integration Services on the DIA servers. 
3. How the DIA servers can be configured as a grid for the Informatica Enterprise Grid option.

DIA Configurations

posted Sep 14, 2012, 5:50 AM by Sachchida Ojha

EMC Greenplum Data Integration Accelerator (DIA), is used for fast data loading, using the included gpfdist utility, as well as using a popular data integration (DI) tool called Informatica. 

For customers interested in using Informatica as a DI tool, we will explain the reference architecture and how the tool is integrated into the DIA. There is a brief section on using the Greenplum PWX Adapter with the DIA and Informatica DI tool, and finally, how the DIA servers can serve as nodes in the Informatica Enterprise Grid.

To meet the challenges of fast data loading, the EMC Data integration Accelerator (DIA) is purpose-built for batch loading, and micro-batch loading, and leverages a growing number of data integration applications such as Informatica, Talend, and Pentaho. More software titles are being qualified for use in the DIA at this time.

The Data Integration Accelerator (DIA) is specially built to facilitate fast data 
loading to the DCA. It integrates the Greenplum data loading utility called gpfdist 
with the server, storage and networking gear into a single system. It leverages the 
high-speed internal 10 Gb/sec communication network to deliver the data quickly 
to the DCA.

The DIA meets the reliability requirements of the most mission-critical enterprises with 
the data availability consisting of RAID protection at the disk level. It is offered as a 
module for the DCA to allow flexible configurations to meet the customer’s 
particular solution requirements.

The DIA comes in blocks of 4 servers. Each block is referred to as a module. You 
can order up to 4 modules of DIA for a DCA installation. Currently each DIA server 
is a commodity server with 2TB SATA drives. Each DIA server comes with 12 CPU 
cores, 48GB of memory and12 2TB SATA disks, with a total usable capacity of about 
70TB. The exact server model may change over time, but the architecture should 
remain the same. With each server, the Greenplum gpfdist utility is pre-loaded by 
default.


The DIA Configurations
The following table shows some of the characteristics of the DIA modules. The physical dimensions, weights, power and cooling details can be found in the DIA
specification sheet on EMC Powerlink.


=======================================
                            DIA module
# servers                     4
# CPU cores               48
Total memory             192 GB
Total SATA disks        48
Usable capacity         70TB
=======================================
Table 1: DIA specifications

For customers looking for an appliance that has all the hardware and software ready to scale the loading performance for their needs, adding DIA modules to the
DCA may be the right solution .

As with most DCA modules, the DIA are orderable in building blocks (modules) of 4 servers each. For the DIA, the customer must already have at least one module of GPDB, which is considered the base module, and the DIA can reside in the same rack above the GPDB module(s), if there is room in the rack.

Building Blocks of the DIA
DIAs are purchased as optional modules of the DCA. Currently, customers can order their DCA/DIA in functional modules of up to 6 racks. In the near future, this
number will be increased to up to12 racks. For now, this configuration can be requested via RPQ.

The physical DCA racks and modules are assigned numbers (see table below): 

===================================================================
Rack 1        Rack 2        Rack 3        Rack 4        Rack 5        Rack 6
===================================================================
Module 4     Module 4     Module 4     Module 4      Module 4     Module 4
Module 3     Module 3     Module 3     Module 3      Module 3     Module 3
Module 2     Module 2     Module 2     Module 2      Module 2     Module 2
Module 1     Module 1     Module 1     Module 1      Module 1     Module 1
===================================================================

Table 2: Physical numbering of modules in DCA/DIA racks

The racks are numbered, starting from 1 and incrementing by 1, up to the 
maximum number of supported racks. Within each rack, we can house up to 4 
modules, each module containing 4 DCA/DIA servers. The modules are made up 
of the following functional modules. A Greenplum database (GPDB) module is 
mandatory as module 1 in Rack 1. If more GPDB is desired, they are built up until 
the rack is filled, and an additional rack can then be ordered. The DIA modules are 
then added on top of the GPDB modules.

The order the modules are added follows a manufacturing rule. During the 
manufacturing process, certain rules and logic are used in the building sequence:
1. GPDB Module is always the first module in Rack 1
2. From there, if more GPBD modules are needed, they are built upwards
3. DIA modules are added next
4. Then Hadoop and other modules are added

1-8 of 8