What are the best practices for a health check after a crashed segment node is brought back to a Greenplum cluster?

posted Apr 28, 2017, 4:57 PM by Sachchida Ojha
Hardware issues caused a Greenplum segment node to go down. What are the best practices for a health check after the crashed segment node is brought back to the Greenplum cluster?

After the crashed segment node is brought back, check the following:
  1. Ping every NIC on the segment node to confirm that it is reachable.
  2. SSH to the segment node and:
     
    1. Run vmstat to confirm that there is enough free memory and that the CPU is mostly idle.
    2. Run iostat -xpnC 5 40 on Solaris or iostat -x 1 10 on Linux to confirm that disk I/O is healthy.
    3. If the OS is Solaris, run iostat -En | grep Hard and zpool status to identify hard disk errors.
    4. Run dmesg to confirm that no hardware errors are reported.
       
  3. Check the segment instance's log for clues to the root cause.
     
  4. Run gpcheckperf with the appropriate options to confirm that the disk, network, and stream tests show no problems.
     
  5. If the OS is Solaris, collect explorer output by running /opt/SUNWexplo/bin/explorer -w \!network; the output file is written to /opt/SUNWexplo/output.
     
  6. After the health check steps above are completed and everything is clean, restart the Greenplum database, then create and drop a test table to confirm that the database is not in read-only mode.
     
  7. Check the gp_configuration_history system table and note the exact time of the issue.
     
  8. Check gp_configuration and gp_pgdatabase on GP version 3.3.x, or gp_segment_configuration on GP version 4.x, to confirm that the primary and mirror segments are in the proper status. Run gprecoverseg if needed.
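
The final check on GP 4.x can be partly automated. The sketch below is only an illustration, not an official Greenplum tool: it assumes the rows of gp_segment_configuration have already been fetched (for example with psql) into dictionaries keyed by column name, and it flags any instance that is down, not synchronized, or not running in its preferred role. The function name, the sample host names (mdw, sdw1, sdw2), and the sample rows are hypothetical.

```python
from typing import Dict, List

# Hypothetical sample of gp_segment_configuration rows on GP 4.x.
# status: 'u' = up, 'd' = down; mode: 's' = synchronized,
# 'c' = change tracking, 'r' = resynchronizing; role: 'p' / 'm'.
SAMPLE_ROWS: List[Dict[str, object]] = [
    {"dbid": 1, "content": -1, "role": "p", "preferred_role": "p",
     "mode": "s", "status": "u", "hostname": "mdw"},
    {"dbid": 2, "content": 0, "role": "p", "preferred_role": "p",
     "mode": "c", "status": "u", "hostname": "sdw1"},
    {"dbid": 3, "content": 0, "role": "m", "preferred_role": "m",
     "mode": "s", "status": "d", "hostname": "sdw2"},
    {"dbid": 4, "content": 1, "role": "p", "preferred_role": "p",
     "mode": "s", "status": "u", "hostname": "sdw2"},
    {"dbid": 5, "content": 1, "role": "m", "preferred_role": "m",
     "mode": "s", "status": "u", "hostname": "sdw1"},
]


def find_unhealthy_segments(rows: List[Dict[str, object]]) -> List[str]:
    """Return one message per segment instance that needs attention."""
    problems = []
    for row in rows:
        label = (f"dbid={row['dbid']} content={row['content']} "
                 f"host={row['hostname']}")
        if row["status"] != "u":
            # Instance is marked down in the catalog.
            problems.append(f"{label}: instance is down")
        elif row["mode"] != "s":
            # Up, but in change tracking or resynchronizing.
            problems.append(f"{label}: not synchronized (mode={row['mode']})")
        elif row["role"] != row["preferred_role"]:
            # Up and in sync, but primary and mirror roles are swapped.
            problems.append(
                f"{label}: running as '{row['role']}', "
                f"preferred role is '{row['preferred_role']}'")
    return problems


if __name__ == "__main__":
    for message in find_unhealthy_segments(SAMPLE_ROWS):
        print(message)
```

If the function returns an empty list, the primary and mirror segments look healthy from the catalog's point of view; any reported instance is a candidate for gprecoverseg.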