posted Apr 28, 2017, 4:57 PM by Sachchida Ojha
Hardware issues caused a Greenplum segment node to go down. What are the best practices for a health check after the crashed segment node is brought back into the Greenplum cluster?
After the crashed segment node is brought back, check the following:
- Ping every NIC on the segment host to confirm it is reachable (see the example commands after this list).
- SSH to the segment host to confirm remote login works.
- Run vmstat to confirm there is enough free memory and CPU usage is mostly idle.
- Run iostat -xpnC 5 40 on Solaris or iostat -x 1 10 on Linux to confirm disk I/O looks healthy.
- If the OS is Solaris, run iostat -En | grep Hard and zpool status to identify hard disk errors.
- Run dmesg to confirm there are no hardware errors in the kernel log.
- Check the segment instance's log for any clues about the root cause.
- Run gpcheckperf with the appropriate options to confirm that the disk, network, and stream tests all pass (see the example after this list).
- If the OS is Solaris, collect explorer output by running /opt/SUNWexplo/bin/explorer -w \!network; the output file is written to /opt/SUNWexplo/output.
- After the above health check steps are completed and everything is clean, restart the Greenplum database and create/drop a test table to confirm the database is not in read-only mode (see the example after this list).
- Check the gp_configuration_history system table to note the exact time of the failure.
- Check gp_configuration and gp_pgdatabase on GP version 3.3.x, or gp_segment_configuration on GP version 4.x, to confirm the primary and mirror segments are in the proper status. Run gprecoverseg if needed (see the example after this list).
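
The OS-level checks above can be run roughly as follows. This is a minimal sketch for a Linux segment host; the hostname sdw1 and the per-NIC hostnames sdw1-1 and sdw1-2 are placeholders and must be replaced with the actual names for your cluster.

# Ping every NIC on the recovered segment (sdw1-1 and sdw1-2 are placeholder per-NIC hostnames)
for nic in sdw1-1 sdw1-2; do ping -c 3 $nic; done

# Confirm SSH access from the master
ssh sdw1 hostname

# On the segment host: memory/CPU, disk I/O, and kernel log
vmstat 5 5
iostat -x 1 10
dmesg | tail -n 100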
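
For the gpcheckperf step, a hedged example: the host names and data directories below (sdw1, sdw2, /data1/primary, /data2/primary) are placeholders, and the test directories must be writable by the gpadmin user.

# Disk I/O and memory bandwidth (stream) tests on the recovered host
gpcheckperf -h sdw1 -r ds -d /data1/primary -d /data2/primary

# Network test between the recovered host and a known-good peer
gpcheckperf -h sdw1 -h sdw2 -r N -d /tmp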
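
For the database-level checks on GP 4.x (restart, write test, catalog checks), a sketch run as gpadmin on the master host; the database name postgres and the table name health_chk are placeholders.

# Restart the database without prompting
gpstop -ar

# Verify the database accepts writes (not in read-only mode)
psql -d postgres -c "CREATE TABLE health_chk (id int) DISTRIBUTED BY (id);"
psql -d postgres -c "DROP TABLE health_chk;"

# Note the exact time the failure was recorded
psql -d postgres -c "SELECT * FROM gp_configuration_history ORDER BY time DESC LIMIT 20;"

# List any segments that are down, not synchronized, or not in their preferred role
psql -d postgres -c "SELECT * FROM gp_segment_configuration WHERE status <> 'u' OR mode <> 's' OR role <> preferred_role;"

# If anything is reported, recover the failed segments and then rebalance
gprecoverseg -a
gprecoverseg -r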