SMP Disaster Recovery Procedure

Overview

A redundant SMP consists of primary servers which run all services (active) and secondary servers which run no services (standby) until a switch-over is made.

If one of the primary servers becomes unavailable due to a failure or any other unforeseen event, and the fault cannot be repaired within an acceptable time, the decision can be made to switch over to the secondary servers to restore the service. The failing primary server should then be replaced as soon as possible, after which a switch is made from the secondary back to the primary (switch-back).

The standard switch-over procedure can be executed very quickly. However, not all functionality will be available on the secondary after the standard switch-over. If required, an extended switch-over can be performed later to enable all functionality (see Limitations below). The extended switch-over takes more time because it involves copying a large volume of data and waiting for a DNS cache timeout.

Limitations

The secondary SMP should only be used in case of a failure of the primary and only for a limited time (e.g. a few days), just long enough to repair the primary.

The following functions are supported on the secondary with the standard switch-over:

  • View and change configuration of SOPs and clusters using the web interface
  • Apply changes
  • Install modules
  • Restore DB snapshots (both from prior and after switch-over)

The following functions are only supported on the secondary if the extended switch-over has been performed:

  • Advanced Reporting and SyncCDRDB
  • Scheduled Tasks and Reports
  • Adding new SOPs on the SMP
  • Making new (versions of) resources, modules, actions available on the SMP
  • Saved DB snapshots will not be available after switch-back
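
The two lists above can be read as a simple lookup from switch-over mode to available functions. The following is a minimal sketch of that reading, not part of the SMP itself: the names `SwitchOver`, `STANDARD_FUNCTIONS`, `EXTENDED_ONLY_FUNCTIONS` and `is_supported` are hypothetical and exist only for illustration. The last bullet, on saved DB snapshots, is left out because it reads as a consequence of the switch-back rather than a function.

```python
from enum import Enum


class SwitchOver(Enum):
    """Hypothetical labels for the two switch-over modes described above."""
    STANDARD = "standard"
    EXTENDED = "extended"


# Functions listed as supported after the standard switch-over.
STANDARD_FUNCTIONS = frozenset({
    "View and change configuration of SOPs and clusters",
    "Apply changes",
    "Install modules",
    "Restore DB snapshots",
})

# Functions listed as supported only after the extended switch-over.
EXTENDED_ONLY_FUNCTIONS = frozenset({
    "Advanced Reporting and SyncCDRDB",
    "Scheduled Tasks and Reports",
    "Adding new SOPs on the SMP",
    "Making new resources, modules, actions available on the SMP",
})


def is_supported(function: str, mode: SwitchOver) -> bool:
    """Return True if `function` is available on the secondary in `mode`."""
    if function in STANDARD_FUNCTIONS:
        # Standard functions stay available after an extended switch-over,
        # since the extended switch-over enables all functionality.
        return True
    if function in EXTENDED_ONLY_FUNCTIONS:
        return mode is SwitchOver.EXTENDED
    raise KeyError(f"Unknown function: {function!r}")
```

For example, `is_supported("Scheduled Tasks and Reports", SwitchOver.STANDARD)` returns `False`, which matches the second list: scheduled tasks need the extended switch-over.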

Disaster Recovery Test

The purpose of this test is to validate that, if the primary SMP becomes unavailable, the secondary SMP can take over the essential SMP services, i.e. that the disaster recovery procedure actually works.

Preparation

  • Apply change on test SOP using primary SMP
  • Validation test of telephony service and applications (by Customer)
  • Disconnect active SMP server(s) from the network (by Customer)

Switch-over

  • Execute switch-over procedure (by ESCAUX)
  • Apply change on test SOP using secondary SMP
  • Validation test of telephony service and applications (by Customer)
  • Make "move/add/change" configuration changes (by Customer)
  • Apply change on test SOP again using secondary SMP
  • Test that configuration changes are applied (by Customer)

Switch-back

  • Reconnect active SMP server(s) (by Customer)
  • Execute switch-back procedure (by ESCAUX)
  • Apply cluster change (on primary SMP)
  • Test that configuration changes are still applied (by Customer)

Caveats

  • Configuration changes made between the last backup (nightly, CET) and the failure of the primary are lost.
    • This can be avoided if the primary is still functioning well enough that a backup can be taken prior to the switch-over.
  • The DNS change of smp-boot is excluded from the DRP test to prevent impact on the production environment. The DRP test is simply too short to test this: the change needs time to propagate (TTL = 1 hour). During the DRP test, production SOPs cannot re-establish their SSH connection.
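
Both caveats come down to simple time arithmetic, sketched below under stated assumptions. The function names and the example timestamps are hypothetical; only the one-hour TTL comes from the text above. The exact time of the nightly backup is not given in this document, so it is passed in as a parameter rather than assumed. All timestamps are expected to be in the same time zone.

```python
from datetime import datetime, timedelta

# TTL stated in the caveat above; everything else here is illustrative.
DNS_TTL = timedelta(hours=1)


def lost_change_window(last_backup: datetime, primary_failure: datetime) -> timedelta:
    """Length of the period whose configuration changes are lost.

    Changes made after `last_backup` and before `primary_failure` are not
    in the backup, so they are not available on the secondary.
    """
    if primary_failure < last_backup:
        raise ValueError("primary_failure must not precede last_backup")
    return primary_failure - last_backup


def dns_propagated_by(dns_change_time: datetime, ttl: timedelta = DNS_TTL) -> datetime:
    """Latest time at which caches honouring the TTL still hold the old record."""
    return dns_change_time + ttl


# Hypothetical example: backup at 02:00, primary fails at 14:30 the same day.
window = lost_change_window(datetime(2024, 1, 10, 2, 0), datetime(2024, 1, 10, 14, 30))
deadline = dns_propagated_by(datetime(2024, 1, 10, 15, 0))
```

With these hypothetical timestamps, the changes from a window of twelve and a half hours would be lost, and a DNS change made at 15:00 could take until 16:00 to reach all clients. A test window shorter than the TTL therefore cannot exercise the DNS change.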

Copyright © Escaux SA