[Comtec Announce] RFO - IP Voice Outage 6th August 2010 [Update]

Mon Oct 11 08:51:38 BST 2010

UPDATE - Service Alert: 06/10/2010

As per the last update on Tuesday 5th October, please find below the post-incident report:

This report provides an overview of the results of internal investigations into, and corrective actions being taken in relation to, the reportable incidents on Tuesday 5th October regarding the loss of incoming and outgoing calls.

A further update will be provided on the 14th October, giving specific updates against the key actions detailed in this report.

Incident Summary

On Tuesday 5th October at 05:00 a Major Incident occurred on the IPVS breakout - IP Exchange platform, causing approximately 90% of outgoing calls to all parts of the UK PSTN, and up to 33% of incoming calls from Scotland and the North of England, to fail.

The fault was caused by a configuration error made during planned work. Service was restored at 09:10 by correcting the configuration error. The total duration of the incident was 4 hours and 10 minutes.

05th October 2010

04:33 - Work commenced on the SIP NICS (PSTN Interconnect Gateway) and MGX (Media Gateway), both located at Manchester, under planned work reference PW136082. The purpose of the planned work was to load a configuration on the MGX to provide T38 fax functionality to the IP Exchange platform. The downtime window for the planned work was 04:30 to 08:00 and the expected impact was deemed "Low"; the notification was sent out on 08/09/10. The work involved reconfiguring a pair of VXSM (Voice Switch Service Module) cards known as AS1 Manchester. The configuration work was completed at approximately 05:00 and subsequent test calls were made.

05:45 - The first fault report was received by the IP Exchange operational team from one Communication Provider (CP).

07:15 - The fault was handed over from the out-of-hours team to the day duty team, at which point it was believed that the fault was an isolated issue affecting only one CP.

07:30 - It was identified that the whole platform was affected, and an in-depth investigation began, including a review of any planned works that had taken place overnight.

08:15 - A technical bridge was established, during which time two further CPs reported faults.

08:35 - The issue was reported to the IMT (Incident Management Team).

08:45 - Between 08:45 and 09:15 further fault reports were received from multiple CPs.

08:52 - The incident was declared a Major Incident and communications were sent out to all CPs.

09:10 - Service was restored after correcting a configuration error that was found on a pair of VXSM cards on the MGX, known as AS2 Manchester. The pair of cards that make up AS2 are of the same type as, and are in adjacent slots to, the cards that make up AS1.
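The intervals implied by the timeline above can be checked with a short calculation. The sketch below is illustrative only: the times are copied from this report, while the dictionary keys and function names are hypothetical labels chosen for the example.

```python
from datetime import datetime

# Times copied from the timeline in this report (all on 5th October 2010).
# The key names are hypothetical labels, not terms used by the platform.
TIMELINE = {
    "incident_start": "05:00",
    "first_fault_report": "05:45",
    "platform_wide_identified": "07:30",
    "major_incident_declared": "08:52",
    "service_restored": "09:10",
}


def minutes_between(start_key, end_key):
    """Return the whole minutes elapsed between two timeline entries."""
    fmt = "%H:%M"
    start = datetime.strptime(TIMELINE[start_key], fmt)
    end = datetime.strptime(TIMELINE[end_key], fmt)
    return int((end - start).total_seconds() // 60)


def as_hours_minutes(total_minutes):
    """Format a minute count as e.g. '4h 10m'."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# Total incident duration: 4h 10m
print(as_hours_minutes(minutes_between("incident_start", "service_restored")))
# First fault report to platform-wide identification: 1h 45m
print(as_hours_minutes(minutes_between("first_fault_report", "platform_wide_identified")))
# First fault report to Major Incident declaration: 3h 07m
print(as_hours_minutes(minutes_between("first_fault_report", "major_incident_declared")))
```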

Observations

The root cause of this incident has been confirmed as human error during the planned work PW136082.

This work involved reconfiguring a pair of VXSM cards, known as AS1, on the Manchester MGX node. The person who carried out the planned work applied "part of the configuration" changes to AS2 in error and, on realising this mistake, set about reversing the changes. However, AS2 had already had the T38 configuration applied on 1st October under a previous planned work, reference PW136087, which was notified to all IPX CPs on 22/09/10. Removing this "part of the configuration" therefore left AS2 with its pre-T38 configuration rather than the T38 configuration deployed on 1st October, and any calls offered to the PSTN switches via AS2 failed.

On completion of the PEW, test calls were made only through AS1, so the issue went unidentified. It is also apparent that this should have been identified as a platform incident significantly more quickly.

We are also investigating why the failure of one pair of VXSM cards had this level of impact on call flows.

Key Actions

The following actions have been identified as part of this incident review.

AP01 Confirm Root Cause - Complete

Root cause has been confirmed as human error, and it is understood what errors were made.

AP02 Training - Due 12/10/10

Both the individual concerned and the wider team to be trained on lessons learned.

AP03 Review Post PEW checks - Due 12/10/10

In particular to widen the range of test calls made, not just to test the specific equipment that has been worked on.

AP04 Understand level of severity caused by one pair of VXSM cards - Due 08/10/10

We are currently investigating the level of impact caused by the failure of one pair of VXSM cards.

AP05 Review How Initial Fault was handled - Due 08/10/10

To investigate why it took from 05:45 to 07:30 to identify that it was a platform issue.
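As an illustration of the wider post-work checks described under AP03, the sketch below places a test call through every card pair on a node rather than only the pair that was worked on. It is a hypothetical example and not the procedure used on the platform: the function names are invented for illustration, and stub_test_call merely simulates the state described in this report (AS1 working, AS2 failing) in place of a real test call.

```python
def post_change_check(all_pairs, changed_pairs, place_test_call):
    """Place a test call through every card pair, worked-on pairs first.

    place_test_call is a hypothetical callable that takes a card pair name
    and returns True if the test call completed.
    """
    # sorted() is stable, so pairs keep their original order within each group.
    ordered = sorted(all_pairs, key=lambda pair: pair not in changed_pairs)
    return {pair: place_test_call(pair) for pair in ordered}


def failed_pairs(results):
    """Return the card pairs whose test call did not complete."""
    return [pair for pair, ok in results.items() if not ok]


def stub_test_call(pair):
    """Stand-in for a real test call; simulates AS2 failing as in this report."""
    return pair != "AS2"


# The planned work targeted AS1 only, but both pairs on the node are tested.
results = post_change_check(["AS1", "AS2"], {"AS1"}, stub_test_call)
print(failed_pairs(results))  # ['AS2']
```

Testing only the changed pair would have reported no failures in this simulated state, which is the gap AP03 is intended to close.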

A further update against the actions above will be sent out on the 14th October.

Phil Reed   Technical Director

Comtec Enterprises Ltd

Comtec House, 46a Albert Road North

Reigate, Surrey RH2 9EL

E:  phil.reed at comtec.com

T:  0845 899 1414  |  F: 0845 899 1401

W: www.comtec.com

Comtec Enterprises Ltd - Winner of 'innovation in the datacentre' Datacentre Dynamics Award 2007
