ConnectEMC Dial Home Capability

posted Sep 11, 2012, 8:14 AM by Sachchida Ojha   [ updated Oct 9, 2012, 6:07 PM ]

The EMC Greenplum Data Computing Appliance and Data Integration Accelerator support dial home functionality through the ConnectEMC software.

ConnectEMC is a support utility that collects event data (files indicating system errors and other information) from EMC products and sends it to EMC Global Services customer support. By default, ConnectEMC sends DCA event files using the secure file transfer protocol (FTPS). If an EMC Secure Remote Support Gateway (ESRS) is used for connectivity, alerts can be sent over HTTPS or FTP instead.

The ConnectEMC software is configured on the DCA master and standby master servers. Alerts are sent out through the external connection (eth1), either to an ESRS Gateway server or directly to EMC.

Dial Home Severity Levels

Alerts that arrive at EMC Global Services can have one of the following severity levels:

1. WARNING: This indicates a condition that might require immediate attention. This severity will create a service request.

2. ERROR: This indicates that an error occurred on the Greenplum DCA. System operation and/or performance is likely affected. This severity will create a service request.

3. UNKNOWN: This severity level is associated with hosts and devices on the Greenplum DCA that are either disabled (due to hardware failure) or unreachable for some other reason. This severity will create a service request.

4. INFO: An event with this severity level indicates that a previously reported error condition is now resolved. An event with this severity level is also used to provide information about the system that does not require any action. This severity will not create a service request. For example, Greenplum Database startup triggers an INFO alert.

The severity of an event determines whether a service request is created for EMC support to act on. A single event listed in the “DCA Error Codes” table can be reported at different severity levels, depending on the error condition.
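The rule described in the list above can be written down as a small lookup. This is only an illustrative sketch: the names Severity, CREATES_SERVICE_REQUEST, and creates_service_request are made up for this example and are not part of ConnectEMC.

```python
from enum import Enum


class Severity(Enum):
    """Dial home severity levels as listed in this article."""
    WARNING = "WARNING"
    ERROR = "ERROR"
    UNKNOWN = "UNKNOWN"
    INFO = "INFO"


# Per the list above, every level except INFO creates a service request.
# (Hypothetical table for illustration, not a ConnectEMC data structure.)
CREATES_SERVICE_REQUEST = {
    Severity.WARNING: True,
    Severity.ERROR: True,
    Severity.UNKNOWN: True,
    Severity.INFO: False,
}


def creates_service_request(severity: str) -> bool:
    """Return True if an alert with this severity opens a service request."""
    return CREATES_SERVICE_REQUEST[Severity(severity.strip().upper())]


if __name__ == "__main__":
    for level in Severity:
        print(level.value, creates_service_request(level.value))
```

An unrecognized severity string raises ValueError here rather than being silently treated as INFO, which seemed the safer choice for a sketch.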

For example, the failure of a segment server disk drive generates Symptom Code 13.11001 with a severity of ERROR. The ConnectEMC software dials home to Global Services customer support, and a service request is created. After the disk drive is successfully replaced, Symptom Code 13.11001 is generated again, this time with a severity of INFO to indicate that the disk drive was replaced.
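The disk drive example can be replayed in a few lines of Python to show how an INFO event with the same symptom code clears an earlier ERROR. This is a sketch under assumptions: the Event class, the host field, the host name "sdw1", and the open_conditions function are all hypothetical and only model the behavior described in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    host: str      # hypothetical field: which server raised the event
    code: str      # symptom code from the DCA Error Codes table
    severity: str  # WARNING, ERROR, UNKNOWN, or INFO


def open_conditions(events):
    """Replay events in order and return the (host, code) pairs still unresolved."""
    still_open = {}
    for event in events:
        key = (event.host, event.code)
        if event.severity == "INFO":
            # INFO reports that an earlier condition is resolved, or is purely
            # informational, in which case there is nothing to clear.
            still_open.pop(key, None)
        else:
            still_open[key] = event.severity
    return still_open


# "sdw1" is a made-up segment server name used only for this example.
disk_failed = Event("sdw1", "13.11001", "ERROR")
disk_replaced = Event("sdw1", "13.11001", "INFO")

print(open_conditions([disk_failed]))
print(open_conditions([disk_failed, disk_replaced]))
```

After the first event the drive condition is still open. Once the INFO event with the same code arrives, nothing remains open for that host.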

ConnectEMC Event Alerts

The table below lists all the conditions that cause ConnectEMC to send event data alerts to EMC Global Services.

DCA Error Codes
========================================================================================================================================
Code    Description
1.1     Host not responding to SNMP calls, host may be down.
1.4     Interface status: could not open session to host 1.
2.15    Greenplum Database is ready to accept connections.
2.15005 Greenplum Database panic, insufficient resource queues available.
3.2000  Status of power supply. A power supply failure is reported with this code.
4.3000  Status of battery on system. Will report error on failure.
5.4001  Status of cooling device, e.g. fan failure.
5.4002  Temperature of system.
6.5001  Status check of a CPU. CPU failure will register here.
7.6001  Status of a CPU Cache device. Cache device failure will register here.
8.1002  Operating System Memory Status.
9.7000  Memory device status. Failed memory devices will get this code.
10.8003 Status of the network device.
10.8005 A configured network bond is unavailable.
10.8006 Network bonding on master servers: The bond interface has no active link/slave.
10.8007 Network bonding on master servers: The bond interface link/slave has changed.
10.8008 Network bonding on master servers: The bond interface links are all down.
10.8009 Network bonding on master servers: One of the bond interface links is down.
11.9001 Status of IO Controller.
11.9002 Status of battery on the IO Controller.
12.10002 Virtual Disk 1 Status: /dev/sda: nonCritical.
12.10004 Virtual disk size (MB).
12.10005 Write cache policy on virtual disk. For example, expected to be write back mode.
12.10006 Read cache policy of virtual disk. For example, expected to be adaptive read ahead.
12.10007 Detects offline, rebuilding raid and other unexpected virtual disk states.
12.10011 Percentage of disk space on virtual disk used.
12.10012 Virtual disk space used (KB).
13.11001 Status of drive. Drive failures use this ID.
14.12002 Interconnect Switch Operational Status.
14.12005 Operational status of Interconnect switch flash memory.
14.12006 State of Interconnect switch flash memory.
14.13001 Status errors from switch sensors - Fans, Power Supplies, and Temperature.
14.14    Interface 0 Description: unexpected snmp value: val_len<=0.
14.14001 Interface 0 Status: unexpected status from device.
15.2     An error detected in the SNMP configuration of the host.
15.3     Other SNMP related errors.
15.4     Connection aborted by SNMP.
15.5     Unexpected SNMP errors from the SNMP system libraries.
15.6     Cannot find expected OID during SNMP walk.
16.0     Test Dial Home.
18.15000 Sent from inside GPDB when starting up.
18.15001 Sent from inside GPDB when GPDB could not access the status of a transaction.
18.15002 Sent from inside GPDB when interrupted in recovery.
18.15003 Sent from inside GPDB when a two-phase file is corrupted.
18.15004 A test message sent from inside GPDB.
18.15005 Sent from inside GPDB when hitting a panic.
18.17000 Sent by healthmond when GPDB status is normal.
18.17001 Sent by healthmond when GPDB cannot be reached and was not shut down cleanly; possible GPDB failure.
18.17002 Sent by healthmond when detecting a failed segment.
18.17003 Sent by healthmond when detecting a segment in change tracking.
18.17004 Sent by healthmond when detecting a segment in resync mode.
18.17005 Sent by healthmond when detecting a segment not in its preferred role, unbalanced cluster.
18.17006 Sent by healthmond when detecting a move of the master segment from mdw to smdw.
18.17007 Sent by healthmond when detecting a move of the master segment from smdw to mdw.
18.17008 Sent by healthmond when a query fails during health checking.
18.17009 Healthmond error querying GPDB State.
19.18000 ID for informational dial homes with general system usage information.
21.20000 Core files were found on the system.
21.20001 Linux kernel core dump files were found on the system - indicates a crash and reboot.
22.21000 Master Node Failover was successful.
22.21001 The gpactivatestandby command failed during master node failover.
22.21002 Greenplum Database is not reachable after the failover.
22.21003 Error bringing the remote (other) master server down during master node failover.
22.21004 Error taking over the remote (other) master server IP.
22.21005 Unknown error in failover.
23.22002 Host did not complete upgrade within the specified timeout period. The timeout period is 12 hours by default unless set in /opt/dca/etc/healthmond/healthmond.cnf.
========================================================================================================================================
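If you need to process these codes in a script, it helps to treat them as text rather than as decimal numbers, because a float would drop trailing zeros (4.3000 becomes 4.3). The sketch below splits a code at the dot into a leading number and a trailing part. Calling those two parts "category" and "detail" is an assumption made for this example; the table itself does not name them, and SymptomCode, parse_symptom_code, and parse_table_row are hypothetical helpers.

```python
from typing import NamedTuple, Tuple


class SymptomCode(NamedTuple):
    category: int  # part before the dot, e.g. 13 (name is an assumption)
    detail: str    # part after the dot, kept as text so "3000" stays "3000"


def parse_symptom_code(text: str) -> SymptomCode:
    """Split a code such as "13.11001" into its two parts."""
    category, sep, detail = text.strip().partition(".")
    if not sep or not category.isdigit() or not detail.isdigit():
        raise ValueError(f"not a symptom code: {text!r}")
    return SymptomCode(int(category), detail)


def parse_table_row(line: str) -> Tuple[SymptomCode, str]:
    """Split one row of the table above into (code, description)."""
    code, _, description = line.strip().partition(" ")
    return parse_symptom_code(code), description.strip()


if __name__ == "__main__":
    rows = [
        "1.1     Host not responding to SNMP calls, host may be down.",
        "4.3000  Status of battery on system. Will report error on failure.",
        "13.11001 Status of drive. Drive failures use this ID.",
    ]
    for row in rows:
        code, description = parse_table_row(row)
        print(code.category, code.detail, description)
```

Keeping the second part as a string also keeps codes such as 2.15 and 2.15005 clearly distinct when they are compared or used as dictionary keys.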
