How to recover a failed segment

posted Apr 28, 2017, 3:57 PM by Sachchida Ojha
When the master cannot connect to a segment instance, it marks that instance as down in the Greenplum Database system catalog.
The segment instance remains offline until an administrator takes steps to bring it back online. A segment can fail for several reasons:
1. Hardware Failure
2. Network Failure
3. Segment instance is not running (there is no postgres database listener process)
4. Data is not accessible. This can be due to:
a) The data directory of the segment instance is corrupt or missing 
b) File system is corrupt 
c) Disk failure

Segment host failures usually cause multiple segment failures: all primary or mirror segments on the host are marked as down and non-operational. If mirroring is not
enabled and a segment goes down, the system automatically becomes non-operational.
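
To see which instances the master has marked down, you can query the gp_segment_configuration catalog table, where status is 'u' for up and 'd' for down. The following is only a sketch: the function names are hypothetical, and it assumes psql is on the PATH and that you can connect to the postgres database on the master.

```shell
# Print the SQL that lists segment instances marked down in the catalog.
down_segments_sql() {
  cat <<'SQL'
SELECT dbid, content, role, preferred_role, mode, hostname, port
FROM gp_segment_configuration
WHERE status = 'd'
ORDER BY content, dbid;
SQL
}

# Hypothetical wrapper: run the query on the master.
# Assumes the "postgres" database; override with DB=your_database.
list_down_segments() {
  psql -d "${DB:-postgres}" -At -F '|' -c "$(down_segments_sql)"
}
```

Calling list_down_segments on the master host prints one pipe-separated row per down instance, and prints nothing when every instance is up.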

The steps to recover failed segments differ depending on whether mirroring is enabled.

a) Recovering the segment when mirroring is enabled

1. Ensure you can connect to the segment host from the master host. 
For example: $ ping failed_seg_host_address
2. Troubleshoot the problem that prevents the master host from connecting to the segment host. 
For example, the host machine may need to be restarted or replaced.
3. After the host is online and you can connect to it, run the gprecoverseg utility from the master host to reactivate the failed segment instances. 
For example: $ gprecoverseg
4. The recovery process brings up the failed segments and identifies the changed files that need to be synchronized. 
5. This process can take some time; wait for the process to complete. During this process, database write activity is suspended.
6. After gprecoverseg completes, the system goes into Resynchronizing mode and begins copying the changed files. This process runs in the background while the
system is online and accepting database requests.
7. When the resynchronization process completes, the system state is Synchronized.
8. Run the gpstate utility to verify the status of the resynchronization process: 
For example: $ gpstate -m
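
The steps above can be strung together in a small script. This is a minimal sketch, not a tested recovery procedure: the run helper, the DRY_RUN variable, and the function name are hypothetical, and the host name is whatever you pass in. With DRY_RUN=1 the commands are only printed, which lets you review the sequence before running it for real.

```shell
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# recover_with_mirrors HOST
recover_with_mirrors() {
  local host="$1"
  # Step 1: confirm the segment host answers before attempting recovery.
  run ping -c 3 "$host" || { echo "cannot reach $host" >&2; return 1; }
  # Step 3: reactivate the failed segment instances.
  run gprecoverseg || return 1
  # Step 8: report mirror status; rerun until all mirrors are Synchronized.
  run gpstate -m
}
```

For example, DRY_RUN=1 followed by recover_with_mirrors failed_seg_host_address prints the three commands in order without touching the cluster.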

b) Recovering the segment when mirroring is not enabled
1. Ensure you can connect to the segment host from the master host. 
For example: $ ping failed_seg_host_address
2. Troubleshoot the problem that is preventing the master host from connecting to the segment host. For example, the host machine may need to be restarted.
3. After the host is online, verify that you can connect to it and restart Greenplum Database. 
For example: $ gpstop -r
4. Run the gpstate utility to verify that all segment instances are online:
For example: $ gpstate

If a segment host is not recoverable and you lost one or more segments, recreate your Greenplum Database system from backup files.
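
While a host is being restarted, the connectivity check in step 1 usually has to be repeated until it succeeds. The sketch below shows one way to do that. The function name and the PROBE override are hypothetical (PROBE exists only so the loop can be exercised without a network), and the default probe assumes a Linux-style ping that accepts -c.

```shell
# wait_for_host HOST [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as the probe succeeds, 1 if every attempt fails.
wait_for_host() {
  local host="$1" attempts="${2:-5}" delay="${3:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # PROBE is a hypothetical override hook; the default is a single ping.
    if ${PROBE:-ping -c 1} "$host" >/dev/null 2>&1; then
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

Once wait_for_host failed_seg_host_address returns successfully, continue with the restart and verification steps.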

c) Recovering when both a primary segment and its mirror have failed (also called a double fault)
In a double fault, both a primary segment and its mirror are down. This can occur if hardware failures on different segment hosts happen simultaneously. Greenplum
Database is unavailable if a double fault occurs. To recover from a double fault:
1. Restart Greenplum Database:
$ gpstop -r
2. After the system restarts, run gprecoverseg:
$ gprecoverseg
3. After gprecoverseg completes, use gpstate to check the status of your mirrors:
$ gpstate -m
4. If you still have segments in Change Tracking mode, run a full copy recovery:
$ gprecoverseg -F
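
The decision in step 4 can also be made from the catalog rather than by reading gpstate output. The sketch below reads one gp_segment_configuration.mode value per line and reports whether a full copy recovery is still needed. It assumes that mode 'c' marks a segment in change tracking, which is how Greenplum releases with file-based mirroring (4.3.x and 5.x) record it; other releases use different mode values, so check the catalog reference for your version. The function name is hypothetical.

```shell
# Read mode values (one per line) on stdin and print the follow-up command.
# ASSUMPTION: 'c' means change tracking (Greenplum 4.3.x / 5.x catalogs).
next_recovery_step() {
  if grep -qx 'c'; then
    echo "gprecoverseg -F"   # at least one segment is still in change tracking
  else
    echo "none"              # no segment reports change tracking
  fi
}
```

On the master it could be fed with psql -d postgres -At -c "SELECT mode FROM gp_segment_configuration WHERE content >= 0", piped into next_recovery_step. The content >= 0 filter leaves out the master entry.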

d) Check for unbalanced segments and rebalance the system
When a primary segment goes down, the mirror activates and becomes the primary segment. After running gprecoverseg, the currently active segment remains the primary and the failed segment becomes the mirror. The segment instances are not returned to the preferred role that they were given at system initialization time. This means that the system could be in a potentially unbalanced state if segment hosts have more active segments than is optimal for top system performance. 
To check for unbalanced segments, run:
$ gpstate -e
Note: All segments must be online and fully synchronized to rebalance the system. Database sessions remain connected during rebalancing, but queries in progress are canceled and rolled back.
1. Run gpstate -m to ensure all mirrors are Synchronized.
$ gpstate -m
2. If any mirrors are in Resynchronizing mode, wait for them to complete.
3. Run gprecoverseg with the -r option to return the segments to their preferred roles.
$ gprecoverseg -r
4. After rebalancing, run gpstate -e to confirm all segments are in their preferred roles.
$ gpstate -e
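
The same check can be made against the catalog: a segment is out of its preferred role when the role and preferred_role columns of gp_segment_configuration differ. The sketch below filters rows in the form content|role|preferred_role and prints the content IDs that need rebalancing. The function name is hypothetical, and the input format assumes psql's unaligned output with the '|' separator.

```shell
# Read "content|role|preferred_role" rows on stdin and print each content ID
# whose current role differs from its preferred role (once per content ID).
unbalanced_contents() {
  awk -F'|' '$2 != $3 { print $1 }' | sort -un
}
```

For example, psql -d postgres -At -F '|' -c "SELECT content, role, preferred_role FROM gp_segment_configuration WHERE content >= 0" piped into unbalanced_contents prints nothing when every segment is in its preferred role.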