[Comtec Announce] [IP Voice Services] Service Alert 26th July 2014 - Update 2

David Croft david at comtec.net.uk
Wed Jul 30 13:27:49 BST 2014


This is an incident notification regarding IP Voice Services.

Date: Saturday, 26th July 2014
Start time: 12:36 BST (UTC+0100)
End time: 13:32 BST (UTC+0100)

Services affected:

IP Voice Services.

Report:

An incident occurred affecting registration on IPVS and network
connectivity to all ancillary services.

Root Cause:

All layer 3 connections on a primary core router dropped due to a
sudden failure with this element.

Due to the nature of the failure, whilst BGP sessions on the primary
router did fail over to the secondary as expected, the primary router
did not release primary routing responsibility to its peer to complete
the failover. Traffic therefore continued to route to the primary
router but no further, which caused the service impact.
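
The failure mode described above can be illustrated with a small
model. This is only a sketch under stated assumptions: the names
Router, holds_primary_role, upstream_sessions_up and traffic_outcome
are hypothetical and do not describe the actual router configuration
or the redundancy protocol in use.

```python
from dataclasses import dataclass


@dataclass
class Router:
    """Hypothetical, simplified view of one router in a redundant pair."""
    name: str
    holds_primary_role: bool    # is platform traffic being sent to this router?
    upstream_sessions_up: bool  # can this router forward traffic onwards?


def traffic_outcome(primary: Router, secondary: Router) -> str:
    """Return where traffic ends up for a given state of the pair."""
    # Traffic follows whichever router currently holds the primary role.
    active = primary if primary.holds_primary_role else secondary
    if active.upstream_sessions_up:
        return f"forwarded via {active.name}"
    return f"dropped at {active.name}"


# During the incident: sessions had moved, but the role was not released.
print(traffic_outcome(Router("primary", True, False),
                      Router("secondary", False, True)))
# prints "dropped at primary"

# After the reboot: the role moved to the secondary as well.
print(traffic_outcome(Router("primary", False, False),
                      Router("secondary", True, True)))
# prints "forwarded via secondary"
```

The model only captures the mismatch between role ownership and
onward connectivity; it does not attempt to explain why the role was
not released.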

At this time the root cause is being attributed to the memory issue
that planned engineering works PEWA0195, scheduled for 03/08/2014,
were intended to correct. PEWA0195 was a planned reboot of one of the
core network routers.

Following close coordination with our vendor, the impacted router was
expected to remain stable until that time. As a result of this
incident, however, the work is no longer necessary.

Symptoms:

Up to 50% of active call traffic was affected and up to 90% of users
on the platform experienced a drop in registration on the SBC (Session
Border Controller).

Whilst unregistered, users would have experienced outbound calls
failing, and inbound calls may not have been presented to user devices.

Inbound calls from the PSTN to DDIs that were forwarded to off-net
destinations routed as normal.

This also affected access to the platform for some supplementary
portals and services.

Resolution:

The engineering team rebooted the affected router at 13:32. The
failover to the secondary router took place as expected. At 13:37 the
primary router was brought back into service without issues.

Engineering are continuing to monitor all services to ensure there are
no ongoing problems or a recurrence of this issue.

All of the necessary logs have been taken from the affected router and
will be analysed in conjunction with the vendor to identify and
confirm the underlying cause of the failure. Whilst this
investigation is ongoing, enhanced monitoring has been configured for
the router, based on the logs taken, to give advance warning of this
event recurring so that a maintenance window can be scheduled if
required. The extended measures now in place should ensure that prompt
and controlled actions are taken if required to prevent further
negative impact.
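
As an illustration of the kind of condition such monitoring could
look for, the sketch below flags a router that still holds the primary
role while too few of its BGP sessions are established. It is not the
monitoring that has been configured: the function failover_alerts, the
state fields "role" and "bgp_sessions_up", the router names and the
threshold are all hypothetical.

```python
from typing import Dict, List


def failover_alerts(routers: Dict[str, dict], min_sessions_up: int = 1) -> List[str]:
    """Return the names of routers that hold the primary role while
    fewer than min_sessions_up BGP sessions are established."""
    alerts = []
    for name, state in routers.items():
        stuck = state["role"] == "primary" and state["bgp_sessions_up"] < min_sessions_up
        if stuck:
            alerts.append(name)
    return sorted(alerts)


# Hypothetical polled state resembling the incident described above.
polled = {
    "core-a": {"role": "primary", "bgp_sessions_up": 0},
    "core-b": {"role": "secondary", "bgp_sessions_up": 4},
}
print(failover_alerts(polled))  # prints ['core-a']
```

A check of this shape would detect the symptom rather than predict
it; advance warning would depend on whatever indicators the log
analysis identifies.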

Timeline:

26/07/2014

12:36 - All layer 3 connections on a particular primary core router
were lost due to a failure with this element. Automated alerts were
sent to key Engineering and Support representatives to notify them of
the issue. We also picked this up with our own monitoring.

Due to the nature of the failure, most BGP sessions on the primary
router were down, but it did not release service to the secondary as
expected, causing traffic to route to the primary but not progress
beyond this point.

This affected access to the platform for most portals and services,
including registrations and call processing.

Calls in to the platform from the PSTN continued to be accepted by
platform services or redirected as configured back out to the PSTN.
However, calls would not have been presented to end user
devices/systems affected by the routing failure.

13:32 - Once the Engineering Team had fully investigated the issue and
identified the cause of the failure, it was decided to reboot the
primary router in order to restore service.

The reboot correctly took the primary router out of service, the
failover from the primary to the secondary core router took place as
expected, and services were restored.

13:37 - The affected router returned to service as expected without
any errors and resumed the primary role for traffic processing.

Engineering are continuing to monitor all services to ensure there are
no ongoing problems or a recurrence of this issue.

Apologies for the inconvenience caused.

This is the final update.

Comtec NOC was tracking this issue under ticket [#KOH-381-15457].

Best regards,

David Croft

--
David Croft
Lead Engineer

Comtec Enterprises Ltd
Comtec House
46a Albert Road North
Reigate Industrial Estate
Reigate
Surrey RH2 9EL

Tel: 0845 899 1400
Fax: 0845 899 1401
www.comtec.com

For urgent operational issues please always contact noc at comtec.com
or 0845 899 1423 and not any named individual.

