Notifications v4
Failover Manager sends email notifications and invokes a notification script when a notable event occurs that affects the cluster. If you configured Failover Manager to send an email notification, you must have an SMTP server running on port 25 on each node of the cluster. Use the notification properties in the cluster properties file to configure notification behavior for Failover Manager.
For more information about editing the configuration properties, see Specifying cluster properties.
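The property names and defaults are documented in the cluster properties reference. As a minimal sketch, assuming the commonly used notification properties (user.email, from.email, notification.level, and script.notification) and placeholder values, the relevant part of the cluster properties file might look like this:

```ini
# Illustrative notification settings for a Failover Manager cluster
# properties file. Property names and values are examples; confirm
# them against your version's cluster properties reference.

# Comma-separated list of addresses that receive notification email
user.email=dba-team@example.com

# From address used when sending notification email
from.email=efm@cluster1.example.com

# Minimum severity that triggers a notification: INFO, WARNING, or SEVERE
notification.level=INFO

# Optional script invoked for each notification, in addition to email
script.notification=/usr/local/bin/efm_notify.sh
```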
The body of the notification contains details about the event that triggered the notification and about the current state of the cluster. For example, the VIP field displays the IP address and state of the virtual IP if one is implemented for the node.
Failover Manager assigns a severity level to each notification. The following levels indicate increasing levels of attention required:
- INFO: Indicates an informational message about the agent that doesn't require any manual intervention (for example, Failover Manager has started or stopped). See List of INFO level notifications.
- WARNING: Indicates that an event has happened that requires the administrator to check on the system (for example, failover has occurred). See List of WARNING level notifications.
- SEVERE: Indicates that a serious event has happened and requires the immediate attention of the administrator (for example, failover was attempted but can't complete). See List of SEVERE level notifications.
The severity level designates the urgency of the notification. A notification with a severity level of SEVERE requires immediate user attention, while a notification with a severity level of INFO calls your attention to operational information about your cluster that doesn't require user action. Notification severity levels aren't related to logging levels. All notifications are sent regardless of the log level detail specified in the configuration file.
You can use the notification.level property to specify the minimum severity level to trigger a notification.
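For example, assuming the cluster properties file shown earlier, raising the minimum level to WARNING suppresses INFO notifications while WARNING and SEVERE notifications are still sent (value shown for illustration):

```ini
# Send notifications only for WARNING and SEVERE events
notification.level=WARNING
```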
Note

In addition to sending notices to the administrative email address, all notifications are recorded in the agent log file (/var/log/efm-4.<x>/<cluster_name>.log).
The conditions listed in this table trigger an INFO level notification:
Subject | Description |
---|---|
Executed fencing script | Executed fencing script script_name Results: script_results |
Executed post-promotion script | Executed post-promotion script script_name Results: script_results |
Executed remote pre-promotion script | Executed remote pre-promotion script script_name Results: script_results |
Executed remote post-promotion script | Executed remote post-promotion script script_name Results: script_results |
Executed post-database failure script | Executed post-database failure script script_name Results: script_results |
Executed primary isolation script | Executed primary isolation script script_name Results: script_results |
Witness agent running on node_address for cluster cluster_name | Witness agent is running. |
Primary agent running on node_address for cluster cluster_name | Primary agent is running and database health is being monitored. |
Standby agent running on node_address for cluster cluster_name | Standby agent is running and database health is being monitored. |
Idle agent running on node node_address for cluster cluster_name | Idle agent is running. After starting the local database, the agent can be resumed. |
Assigning VIP to node node_address | Assigning VIP VIP_address to node node_address Results: script_results |
Releasing VIP from node node_address | Releasing VIP VIP_address from node node_address Results: script_results |
Starting auto resume check for cluster cluster_name | The agent on this node checks every auto.resume.period seconds to see if it can resume monitoring the failed database. Check the cluster during this time and stop the agent if the database won't be started again. See the agent log for more details. |
Executed agent resumed script | Executed agent resumed script script_name Results: script_results |
WAL logs backed up during promotion | When reconfiguring this standby to follow the new primary, the pg_wal contents were backed up in the pgdata directory. Remove this backup when convenient to free up disk space. |
Reset members completed | The agent has rejoined the cluster after a call to reset the cluster members. |
The conditions listed in this table trigger a WARNING level notification:
Subject | Description | Notes |
---|---|---|
Witness agent exited on node_address for cluster cluster_name | Witness agent has exited. | |
Primary agent exited on node_address for cluster cluster_name | Database health is not being monitored. | |
Cluster cluster_name notified that primary agent has left | Failover is disabled for the cluster until the primary agent is restarted. | |
Standby agent exited on node_address for cluster cluster_name | Database health is not being monitored. | |
Agent exited during promotion on node_address for cluster cluster_name | Database health is not being monitored. | |
Agent exited on node_address for cluster cluster_name | The agent has exited. This is generated by an agent in the Idle state. | |
Agent exited for cluster cluster_name | The agent has exited. This notification is usually generated during startup when an agent exits before startup has completed. | |
Virtual IP address assigned to non-primary node | The virtual IP address appears to be assigned to a nonprimary node. To avoid any conflicts, Failover Manager will release the VIP. Confirm that the VIP is assigned to your primary node and manually reassign the address if it isn't. | |
Virtual IP address not assigned to primary node. | The virtual IP address appears not to be assigned to a primary node. Failover Manager will attempt to reacquire the VIP. | |
Standby failed in cluster cluster_name | The standby on address has left the cluster. | |
Standby agent failed for cluster cluster_name | A standby agent on cluster_name has left the cluster, but the coordinator has detected that the standby database is still running. | |
Standby database failed for cluster cluster_name | A standby agent has signaled that its database has failed. The other nodes also can't reach the standby database. | |
Standby agent cannot reach database for cluster cluster_name | A standby agent has signaled database failure, but the other nodes have detected that the standby database is still running. | |
Cluster cluster_name has dropped below three nodes | At least three nodes are required for full failover protection. Add a witness or agent node to the cluster. | |
Subset of cluster cluster_name disconnected from primary | This node is no longer connected to the majority of the cluster cluster_name. Because this node is part of a subset of the cluster, failover will not be attempted. Current nodes that are visible are: node_address. | |
Promotion has started on cluster cluster_name. | The promotion of a standby has started on cluster cluster_name. | |
Witness failure for cluster cluster_name | Witness running at node_address has left the cluster. | |
Idle agent failure for cluster cluster_name. | Idle agent running at node_address has left the cluster. | |
One or more nodes isolated from network for cluster cluster_name | This node appears to be isolated from the network. Other members seen in the cluster are: node_name | |
Node no longer isolated from network for cluster cluster_name. | This node is no longer isolated from the network. | |
Failover Manager tried to promote, but primary DB is still running | Failover Manager has started promotion steps, but detected that the primary DB is still running on address. This usually indicates that the primary Failover Manager agent has exited. Failover has not occurred. There is no failover protection until the primary agent is restarted. | |
Primary agent missing for cluster cluster_name | The primary agent has previously left the cluster. Until a primary agent joins the cluster, there is no failover protection. | Available in Failover Manager 4.2 and later. |
Standby agent started to promote, but primary has rejoined. | The standby Failover Manager agent started to promote itself but found that a primary agent has rejoined the cluster. Failover has not occurred. | |
Standby agent tried to promote, but could not verify primary DB | The standby Failover Manager agent tried to promote itself but could not detect whether the primary DB is still running on node_address. Failover has not occurred. | Description applicable for Failover Manager 4.2 and later. |
Standby agent tried to promote, but could not verify primary DB | The standby Failover Manager agent tried to promote itself but detected that the primary DB is still running on node_address. This usually indicates that the primary Failover Manager agent has exited. Failover has not occurred. | Description applicable for Failover Manager 4.1 and earlier. |
Standby agent tried to promote, but VIP appears to still be assigned | The standby Failover Manager agent tried to promote itself but couldn't because the virtual IP address (VIP_address) appears to still be assigned to another node. Promoting under these circumstances can cause data corruption. Failover has not occurred. | |
Standby agent tried to promote, but appears to be orphaned | The standby Failover Manager agent tried to promote itself but couldn't because the well-known server (server_address) couldn't be reached. This usually indicates a network issue that has separated the standby agent from the other agents. Failover has not occurred. | |
Potential manual failover required on cluster cluster_name. | A potential failover situation was detected for cluster cluster_name. Automatic failover was disabled for this cluster, so manual intervention is required. | |
Failover has completed on cluster cluster_name | Failover has completed on cluster cluster_name. | |
Lock file for cluster cluster_name has been removed | The lock file for cluster cluster_name has been removed from: path_name on node node_address. This lock prevents multiple agents from monitoring the same cluster on the same node. Restore this file to prevent accidentally starting another agent for the cluster. | |
A recovery file for cluster cluster_name has been found on primary node | A recovery file for cluster cluster_name was found at: path_name on primary node node_address. This can cause a problem if you attempt to restart the DB on this node. | |
recovery_target_timeline is not set to latest in recovery settings | The recovery_target_timeline parameter isn't set to latest in the recovery settings. The standby server can't follow a timeline change that occurs when a new primary is promoted. | |
Promotion has not occurred for cluster cluster_name | A promotion was attempted but there is already a node being promoted: ip_address. | |
Standby will not be reconfigured after failover in cluster cluster_name | The auto.reconfigure property has been set to false for this node, so it won't be reconfigured to follow the new primary node after a promotion. | Subject and Description applicable to Failover Manager 4.2 and later. |
Standby not reconfigured after failover in cluster cluster_name | The auto.reconfigure property was set to false for this node. The node was not reconfigured to follow the new primary node after a failover. | Subject and Description applicable to Failover Manager 4.1 and earlier. |
Could not resume replay for standby standby_id. | Couldn't resume replay for standby. Manual intervention might be required. Error: error_message. | |
Possible problem with database timeout values | Your remote.timeout value (value) is higher than your local.timeout value (value). If the local database takes too long to respond, the local agent might assume that the database has failed though other agents can connect. While this doesn't cause a failover, it might force the local agent to stop monitoring, leaving you without failover protection. | |
No standbys available for promotion in cluster cluster_name | The current number of standby nodes in the cluster has dropped to the minimum number: number. There can't be a failover unless another standby node is added or made promotable. | |
No promotable standby for cluster cluster_name | The current failover priority list in the cluster is empty. You have removed the only promotable standby for the cluster cluster_name. There can't be a failover unless another standby node is added or made promotable by adding it to the failover priority list. | |
Synchronous replication has been reconfigured for cluster cluster_name | The number of synchronous standby nodes in the cluster has dropped below number. The synchronous standby names on the primary were reconfigured to: new synchronous_standby_names value. | |
Synchronous replication has been disabled for cluster cluster_name. | The number of synchronous standby nodes in the cluster has dropped below count. The primary was taken out of synchronous replication mode. | |
Could not reload database configuration. | Couldn't reload database configuration. Manual intervention is required. Error: error_message. | |
Custom monitor timeout for cluster cluster_name | The following custom monitoring script has timed out: script_name | |
Custom monitor 'safe mode' failure for cluster cluster_name | The following custom monitor script has failed, but is being run in "safe mode": script_name. Output: script_results | |
primary.shutdown.as.failure set to true for primary node | The primary.shutdown.as.failure property has been set to true for this cluster. Stopping the primary agent without stopping the entire cluster is treated by the rest of the cluster as an immediate primary agent failure. If maintenance is required on the primary database, shut down the primary agent and wait for a notification from the remaining nodes that failover will not happen. | |
Primary_or_Standby cannot ping local database for cluster cluster_name | The Primary_or_Standby agent can no longer reach the local database running at node_address. Other nodes are able to access the database remotely, so the agent becomes IDLE and attempts to resume monitoring the database. | Subject and Description applicable to Failover Manager 4.1 and later. |
Primary cannot ping local database for cluster cluster_name | The Primary agent can no longer reach the local database running at node_address. Other nodes can access the database remotely, so the primary becomes IDLE and attempts to resume monitoring the database. | Subject and Description applicable to Failover Manager 4.0. |
Standby cannot resume monitoring local database for cluster cluster_name | The standby agent can no longer reach the local database running at node_address. Other nodes can access the database remotely. The standby agent remains IDLE until the resume command is run to resume monitoring the database. | Available in Failover Manager 4.1 and later. |
Primary agent left the cluster and node detached from load balancer | The standby agent can no longer reach the local database running at node_address. Other nodes can access the database remotely. The standby agent remains IDLE until the resume command is run to resume monitoring the database. | Available in Failover Manager 4.2 and later. |
Standby database in cluster cluster_name not stopped before primary is promoted | The standby.restart.delay property is set for this agent, so it is not reconfigured to follow the new primary until num_seconds seconds after promotion has completed. In some cases it might not be able to follow the new primary without manual intervention. | Available in Failover Manager 4.2 and later. |
The conditions listed in this table trigger a SEVERE level notification:
Subject | Description | Notes |
---|---|---|
Standby database restarted but Failover Manager cannot connect | The start or restart command for the database ran successfully but the database is not accepting connections. Failover Manager will keep trying to connect for up to restart.connection.timeout seconds. | |
Unable to connect to DB on node_address | The maximum connections limit was reached. | |
Unable to connect to DB on node_address | Invalid password for db.user=user_name. | |
Unable to connect to DB on node_address | Invalid authorization specification. | |
Primary cannot resume monitoring local database for cluster cluster_name | The primary agent can no longer reach the local database running at node_address. Other nodes are able to access the database remotely, so the primary does not release the VIP or create a recovery.conf file. The primary agent remains IDLE until the resume command runs to resume monitoring the database. | |
Fencing script error | Fencing script script_name failed to execute successfully. Exit Value: exit_code Results: script_results Failover has not occurred. | |
Post-promotion script failed | Post-promotion script script_name failed to execute successfully. Exit Value: exit_code Results: script_results | |
Remote post-promotion script failed | Remote post-promotion script script_name failed to execute successfully. Exit Value: exit_code Results: script_results Node: node_address | |
Remote pre-promotion script failed | Remote pre-promotion script script_name failed to execute successfully. Exit Value: exit_code Results: script_results Node: node_address | |
Post-database failure script error | Post-database failure script script_name failed to execute successfully. Exit Value: exit_code Results: script_results | |
Agent resumed script error | Agent resumed script script_name failed to execute successfully. Results: script_results | |
Primary isolation script failed | Primary isolation script script_name failed to execute successfully. Exit Value: exit_code Results: script_results | |
Could not promote standby | The promote command failed on node. Couldn't promote standby. Error details: error_details | |
Error creating recovery.conf file on node_address for cluster cluster_name | There was an error creating the recovery.conf file on primary node node_address during promotion. Promotion has continued but requires manual intervention to ensure that the old primary node can't be restarted. Error details: message_details | |
An unexpected error has occurred for cluster cluster_name | An unexpected error has occurred on this node. Check the agent log for more information. Error: error_details | |
Primary database being fenced off for cluster cluster_name | The primary database has been isolated from the majority of the cluster. The cluster is telling the primary agent at ip_address to fence off the primary database to prevent two primaries when the rest of the Failover Manager cluster promotes a standby. | |
Isolated primary database shutdown. | The isolated primary database has been shut down by Failover Manager. | |
Primary database being fenced off for cluster cluster_name | The primary database was isolated from the majority of the cluster. Before the primary could finish detecting isolation, a standby was promoted and has rejoined this node in the cluster. This node is isolating itself to avoid more than one primary database. | |
Could not assign VIP to node node_address | Failover Manager couldn't assign the VIP address for some reason. | |
primary_or_standby database failure for cluster cluster_name | The database has failed on the specified node. | |
Agent is timing out for cluster cluster_name | This agent has timed out trying to reach the local database. After the timeout, the agent successfully pinged the database and resumed monitoring. However, check the node to make sure it is performing normally to prevent a possible database or agent failure. | |
Resume timed out for cluster cluster_name | This agent couldn't resume monitoring after reconfiguring and restarting the local database. See agent log for details. | |
Internal state mismatch for cluster cluster_name | The Failover Manager cluster's internal state didn't match the actual state of the cluster members. This is rare and can be caused by a timing issue when nodes join the cluster or change their state. The problem should resolve itself, but check the cluster status as well to verify. Details of the mismatch can be found in the agent log file. | |
Failover has not occurred | An agent has detected that the primary database is no longer available in cluster cluster_name, but there are no standby nodes available for failover. | |
Failover has not occurred | An agent has detected that the primary database is no longer available in cluster cluster_name, but there are not enough standby nodes available for failover. | |
Database in wrong state on node_address | The standby agent has detected that the local database is no longer in recovery. The agent now becomes IDLE. Manual intervention is required. | |
Database in wrong state on node_address | The primary agent has detected that the local database is in recovery. The agent now becomes IDLE. Manual intervention is required. | |
Database connection failure for cluster cluster_name | This node is unable to connect to the database running on: node_address. Until this is fixed, failover might not work properly because this node can't check whether the database is running. | |
Standby custom monitor failure for cluster cluster_name | The following custom monitor script has failed on a standby node. The agent will stop monitoring the local database. Script location: script_name Script output: script_results | |
Primary custom monitor failure for cluster cluster_name | The following custom monitor script has failed on a primary node. Failover Manager will attempt to promote a standby. Script location: script_name Script output: script_results | |
Loopback address set for ping.server.ip | A loopback address is set for the ping.server.ip property. This setting can interfere with network isolation detection, so you should change it. | |
Load balancer attach script error | Load balancer attach script script_name failed to execute successfully. Exit Value: exit_code Results: script_results | |
Load balancer detach script error | Load balancer detach script script_name failed to execute successfully. Exit Value: exit_code Results: script_results | |
Pgpool attach node error | Failover Manager failed to attach the pgpool node. Exit Value: exit_code. Results: script_results | Available in Failover Manager 4.1 and later. |
Pgpool detach node error | Failover Manager failed to detach the pgpool node. Exit Value: exit_code. Results: script_results | |
Not enough synchronous standbys available in cluster cluster_name | The number of synchronous standby nodes in the cluster has dropped to count. All write queries on the primary are blocked until enough synchronous standby nodes are added. |