Service Assurance: Active-Active redundancy

Revision: 11 Apr 2017 - rev 14

History

[2012/05/22]
  • Initial version

Working Principles

Failure detection

The unavailability of a remote SOP or SIP device is detected through the SIP OPTIONS mechanism (activated by the "sip qualify" setting). The SOP periodically sends a SIP OPTIONS request and, when no reply is received, marks the related SIP trunks or SIP phones as UNREACHABLE. The SOP uses this information when it tries to dial on a SIP trunk or to a SIP phone.

Intra-cluster routing

When routing a call to an extension, the SOP applies the intra-cluster routing before executing the callflow itself. If the primary SOP of the extension is the local SOP, the call is sent to the callflow associated with the current status of the extension. If the primary SOP is a remote SOP, the SOP first attempts to route the call to it. In case of congestion, the call is routed to the secondary SOP. If the secondary SOP is also congested, the call is released with the release cause 'congestion'.

Note that if the secondary SOP for the destination extension is the current SOP, the same routing rules apply: the call is still routed to the primary SOP. No load balancing can be achieved this way.
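The routing decision described above can be sketched as follows. This is a minimal illustration, not the SOP implementation: the function name, the action labels and the `is_congested` callback are assumptions made for the example. The source does not detail what happens locally when the secondary SOP is the current SOP and the primary is congested, so the sketch simply returns the secondary as the target in that case.

```python
from typing import Callable, Tuple


def route_call(local_sop: str, primary: str, secondary: str,
               is_congested: Callable[[str], bool]) -> Tuple[str, str]:
    """Return (action, detail) for a call to an extension (hypothetical helper)."""
    # Primary SOP is the local SOP: hand the call to the extension's callflow.
    if primary == local_sop:
        return ("callflow", local_sop)
    # Otherwise try the primary SOP first, then the secondary SOP.
    for target in (primary, secondary):
        if not is_congested(target):
            return ("forward", target)
    # Both SOPs are congested: release the call.
    return ("release", "congestion")


if __name__ == "__main__":
    congested = {"00000034"}  # pretend the primary SOP is congested
    print(route_call("00000035", "00000034", "00000033",
                     lambda sop: sop in congested))
    # prints ('forward', '00000033')
```

Because the primary is always tried first, a call arriving on the secondary SOP is still forwarded to the primary, which is why this scheme provides no load balancing.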

Phone registration

The SIP phone will always try to register on the primary SOP if possible.

Depending on the phone, it will also try to register with the secondary SOP in parallel (e.g. Polycom and Snom).

Other phones register with the secondary SOP only if the registration with the primary one fails (e.g. Eyebeam, Aastra). In this case the extension will be reachable only after the registration expires.

Outbound call from the phone

SIP phones always try to send the call to the primary SOP first. Some SIP phones (e.g. Polycom) then try the secondary SOP if no response is received. Others try the secondary SOP only after the registration has expired.

Dynamic profiles

If the profile of the extension is dynamic, that is, if some parameters or the status can be changed via the SOP (for example via a callflow or via net.Desktop), you need to activate the profile synchronization in the Unified Communication Model module. This ensures the same callflow behaviour on both the primary and the secondary SOP.

How-To

How to check the SIP qualify?

From the Asterisk console, run the following command:

00000037*CLI> sip show peers like SOA2
Name/username              Host            Dyn Nat ACL Port     Status
SOA20004/SOA20004          172.16.35.137               5060     OK (1 ms)
SOA20003/SOA20003          172.16.35.137               5060     OK (1 ms)
SOA20002/SOA20002          172.16.35.44                5060     OK (1 ms)
SOA20001/SOA20001          172.16.35.96                5060     OK (1 ms)
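When the peer list is long, the Status column can be checked with a small script. The sketch below assumes the command output has already been captured as text; the function name is hypothetical, and the sample input reuses two lines from the listing above with one status changed to UNREACHABLE purely for illustration.

```python
import re

# Status values as printed in the last column of "sip show peers".
STATUS = re.compile(
    r"(OK|LAGGED) \(\d+ ms\)\s*$|(UNREACHABLE|UNKNOWN|Unmonitored)\s*$"
)


def peer_statuses(output: str) -> dict:
    """Map each peer name to its qualify status (hypothetical helper)."""
    statuses = {}
    for line in output.splitlines():
        match = STATUS.search(line)
        # Skip the prompt, the header and the summary line.
        if not match or "/" not in line.split()[0]:
            continue
        name = line.split()[0].split("/")[0]
        statuses[name] = match.group(1) or match.group(2)
    return statuses


if __name__ == "__main__":
    # Illustrative input: the second status was edited to UNREACHABLE.
    sample = (
        "Name/username              Host            Dyn Nat ACL Port     Status\n"
        "SOA20004/SOA20004          172.16.35.137               5060     OK (1 ms)\n"
        "SOA20002/SOA20002          172.16.35.44                5060     UNREACHABLE\n"
    )
    not_ok = [name for name, status in peer_statuses(sample).items()
              if status != "OK"]
    print(not_ok)  # prints ['SOA20002']
```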

How to check if a phone is defined on both the primary and the secondary SOP?

On both SOP1 and SOP2 defined for the phone, run the following command:
00000037*CLI> sip show peers like SDO20003
Name/username              Host            Dyn Nat ACL Port     Status
SDO20003/SDO20003          172.16.35.26     D          2048     OK (14 ms)
1 sip peers [1 online , 0 offline]

Warning: depending on the phone (for example Aastra phones), the phone will be defined on both SOPs but registered on only one of them.

How to trace the route taken by a call?

Here is an example trace of the intra-cluster routing.

Here we can see the primary and the secondary SOP:
May 22 18:51:55    -- Executing Set("SIP/SOA20003-b6bc5d68", "_LastUserSop1=00000034|_LastUserSop2=00000033|_LastUserLogin=|_LastUserOwner=|_LastUserExt=6705|_LastUserFirstName=long conversation robot|_LastUserLastName=|_LastUserEmail=|_LastUserMobileNumber=|_LastUserFaxNumber=|_LastUserHomeNumber=|_LastUserPrimaryPhone=|_LastUserSecondaryPhone=|_LastUserSite=|_LastUserGroup=|_LastUserPincode=1234|_LastUserOffice=|_LastUserOffice=|_LastUserDepartment=|_LastUserLang=") in new stack

Here we can see the MeshSIPTrunk used to call the remote SOP:
May 22 18:51:55    -- Executing Dial("SIP/SOA20003-b6bc5d68", "SIP/6705@SOA20001||o") in new stack
May 22 18:51:55    -- Called 6705@SOA20001
May 22 18:51:55    -- SIP/SOA20001-08624eb0 is ringing
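To pull the primary and secondary SOP out of such a trace without reading the whole line, the Set() arguments can be split on the "|" separator. This is a rough sketch that assumes the log line format shown above; the function name is hypothetical and the sample line is a shortened copy of the trace.

```python
import re

# Matches: Executing Set("<channel>", "<name=value|name=value|...>")
SET_CALL = re.compile(r'Executing Set\("(?P<channel>[^"]+)", "(?P<args>[^"]*)"\)')


def parse_set_variables(line: str) -> dict:
    """Return the variables assigned by a traced Set() call (hypothetical helper)."""
    match = SET_CALL.search(line)
    if match is None:
        return {}
    variables = {}
    for pair in match.group("args").split("|"):
        name, _, value = pair.partition("=")
        # Drop the leading underscore that prefixes each variable name.
        variables[name.lstrip("_")] = value
    return variables


if __name__ == "__main__":
    line = ('May 22 18:51:55    -- Executing Set("SIP/SOA20003-b6bc5d68", '
            '"_LastUserSop1=00000034|_LastUserSop2=00000033|_LastUserExt=6705") '
            'in new stack')
    variables = parse_set_variables(line)
    print(variables["LastUserSop1"], variables["LastUserSop2"])
    # prints 00000034 00000033
```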

How to make sure that the active-active synchronization for the queue is really active?

Who does what?

Several scripts and files are involved in this mechanism:

  • /escaux/bin/queue_mngt.pl This is the main script for the synchronization. It listens to the Asterisk Manager and sends HTTP requests to the other host when it detects an event related to the queues. It is supposed to be always up and running.
  • /escaux/bin/queue_resync.pl This script handles the resynchronization if something went wrong (SOP down, network problem, ...). It's launched periodically by a cronjob. You can also start it manually to resynchronize the members. With the -v option, you will see the result of the script.
  • /var/www/queue/queue.pl This script handles the HTTP requests sent by other hosts to update the status of the queue members.
  • /var/log/queue_sync.log This is the log file for the sync mechanism.

The "live" active-active sync

The "live" active-active sync for the queue works like this:

  • The script /escaux/bin/queue_mngt.pl listens to the Asterisk Manager for the events related to the queue (like proxy.pl).
  • When an event related to a queue happens, the script sends a request to the other host (the other SOP on which the queue is defined) and inserts a record in the attribute_state table (in the cdrdb database). You can see those events by running SELECT * FROM attribute_state WHERE class = 'queue_sync'; in the cdrdb database.
  • The other host will update the state of the members accordingly.
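The first step, picking queue-related events out of the Asterisk Manager stream, could look roughly like the sketch below. It is not taken from queue_mngt.pl: the function names are hypothetical, the sample event is illustrative, and treating every event whose name starts with "QueueMember" as queue-related is an assumption. The request sent to the other host is deliberately left out, since its format is not described here.

```python
def parse_ami_event(block: str) -> dict:
    """Parse one manager event ("Key: Value" lines) into a dict."""
    event = {}
    for line in block.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore blank or malformed lines
            event[key.strip()] = value.strip()
    return event


def is_queue_member_event(event: dict) -> bool:
    """Assumption: queue member events have names starting with 'QueueMember'."""
    return event.get("Event", "").startswith("QueueMember")


if __name__ == "__main__":
    # Illustrative event block; the queue name and field values are made up.
    raw = ("Event: QueueMemberPaused\r\n"
           "Queue: support\r\n"
           "Location: SIP/SDO20003\r\n"
           "Paused: 1\r\n")
    event = parse_ami_event(raw)
    if is_queue_member_event(event):
        print(event["Queue"], event["Location"], event["Paused"])
        # prints support SIP/SDO20003 1
```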

If something goes wrong (one SOP is down, the network is down, ...), there is also a resynchronization mechanism that ensures that the members are in the same state on the two SOPs. This script is launched periodically, so a small delay can be observed before the synchronization when a SOP comes back online.

The resynchronization

The resynchronization works like this:

  • When the script /escaux/bin/queue_resync.pl is called, it makes a request to the other host. Something like http://X.X.X.X/queue/queue.pl?method=QueueStatus
  • The script queue.pl on the other host queries the attribute_state table to get the state of each member of the queue and returns the states as an XML file.
  • The script queue_resync.pl will read the result and update the state of the members accordingly, sending requests to the Asterisk Manager.
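The comparison step at the heart of this procedure can be illustrated as follows. This is only a sketch under stated assumptions: the XML layout returned by queue.pl is not documented here, so the example starts from member states that are already parsed into dictionaries. The function names, the (queue, member) keys and the state labels are all hypothetical; only the URL pattern comes from the description above.

```python
from urllib.parse import urlencode


def status_url(host: str) -> str:
    """Build the QueueStatus URL of the other host, following the pattern above."""
    return f"http://{host}/queue/queue.pl?" + urlencode({"method": "QueueStatus"})


def plan_updates(local: dict, remote: dict) -> dict:
    """Return the remote member states that differ from the local ones."""
    return {member: state for member, state in remote.items()
            if local.get(member) != state}


if __name__ == "__main__":
    print(status_url("192.0.2.10"))
    # prints http://192.0.2.10/queue/queue.pl?method=QueueStatus

    # Hypothetical states keyed by (queue, member).
    local = {("support", "SIP/SDO20003"): "paused"}
    remote = {("support", "SIP/SDO20003"): "available",
              ("support", "SIP/SDO20004"): "available"}
    for member, state in plan_updates(local, remote).items():
        print(member, "->", state)
```

Each entry returned by `plan_updates` would correspond to one request to the Asterisk Manager in the real script.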

How to force resynchronization of the queue members' state?

In the SOP Shell, navigate to: Shell > Diagnostics > Queues > Force queue synchronization

The queue synchronization log is accessible via the SOP Shell menu: Shell > Diagnostics > Queues > View queue synchronization log

Copyright © Escaux SA