Service Unavailable - 5th September 8:03

8:03: We are aware of a major outage at one of our data centres. We have engineers on site and hope to get this resolved very swiftly. We apologise for the inconvenience and will keep you updated.

10:30: The login page is now back up. However, we have to do some recovery work on the servers, so the service is likely to remain unstable.

11:42: The service has now been restored and customers are dialling normally. If you are experiencing any further issues, please email us. A fault report will follow.


Reason for Outage – Network and VM Outage

Date of event: 2015-09-05

Event Summary:

The following high-level timeline describes the events that occurred from Saturday 2015-09-05 08:03 through to Sunday 03:45.

08:03  Nexus switch rebooted in Telecity

08:11  Second Nexus switch rebooted in Telecity

08:35  Large volumes of VMware alerts across all sites

08:35  Hundreds of machines frozen, corrupted or otherwise needing repair

09:00 - Sunday 03:45   Repairing machines & services
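As a quick check on the durations implied by the timeline above, the gaps between events can be computed as in the minimal sketch below. The event names are labels chosen for illustration, and "Sunday" is assumed to mean 2015-09-06, the day after the date of event.

```python
from datetime import datetime

# Timestamps copied from the timeline; Sunday is assumed to be 2015-09-06.
# The dictionary keys are illustrative labels, not names used in the report.
events = {
    "first_switch_reboot": datetime(2015, 9, 5, 8, 3),
    "second_switch_reboot": datetime(2015, 9, 5, 8, 11),
    "vmware_alerts": datetime(2015, 9, 5, 8, 35),
    "repairs_complete": datetime(2015, 9, 6, 3, 45),
}


def elapsed(start, end):
    """Return the gap between two named events as (hours, minutes)."""
    total_minutes = int((events[end] - events[start]).total_seconds() // 60)
    return divmod(total_minutes, 60)


print(elapsed("first_switch_reboot", "second_switch_reboot"))  # (0, 8)
print(elapsed("first_switch_reboot", "repairs_complete"))      # (19, 42)
```

On these figures the two switch reboots were 8 minutes apart, and repair work ended 19 hours 42 minutes after the first reboot.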

The first network switch reboot had no impact on systems, but the reboot of its resilient pair caused a layer 2 routing loop. While the loop persisted, virtual machines on the platform were unable to see the storage layer. Approximately 50% of all machines were affected, some so severely that their operating system images were corrupted, rendering the machines unusable.
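To illustrate in general terms why a layer 2 loop is so disruptive, the toy model below floods a single broadcast frame through two hypothetical topologies. It is only a sketch: the switch names and links are invented for illustration, and the actual topology and configuration involved in this incident are not described in this report. The model assumes each switch forwards a frame out of every link except the one it arrived on, and that nothing discards the frame, since Ethernet frames carry no hop limit.

```python
def flood(links, source, max_hops):
    """Count the frame copies forwarded at each hop of a flooded broadcast.

    Each switch forwards a received frame out of every link except the one
    it arrived on. No loop prevention is modelled, so copies are never dropped.
    """
    # Each in-flight copy is (switch holding it, switch it arrived from).
    in_flight = [(source, None)]
    copies_per_hop = []
    for _ in range(max_hops):
        next_in_flight = []
        for switch, came_from in in_flight:
            for neighbour in links[switch]:
                if neighbour != came_from:
                    next_in_flight.append((neighbour, switch))
        copies_per_hop.append(len(next_in_flight))
        in_flight = next_in_flight
    return copies_per_hop


# Hypothetical topologies: a chain (no loop) and a triangle (contains a loop).
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
triangle = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}

print(flood(chain, "A", 5))     # [1, 1, 0, 0, 0]
print(flood(triangle, "A", 5))  # [2, 2, 2, 2, 2]
```

In the chain the frame dies out once it reaches the far end. In the triangle the copies keep circulating for as long as the loop exists, consuming link capacity that other traffic, such as storage traffic, would otherwise use.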

Customers were impacted to varying degrees and at varying times, depending on which machines their service relies on and the extent of the damage to those machines.

Repair Action:

The operations team deleted and recreated the affected machines in each cluster to bring services back up. The team is continuing to rebuild machines lost on the day, to restore spare headroom to the level it was at before the issue occurred.

Root Cause:

Cisco has identified the Nexus switch reboot as a known software issue. The vendor's recommended action is to upgrade the switch to the latest stable version. DXI installed this version elsewhere in February this year, and it has not demonstrated any issues.

Preventive Measures:

All affected switches will be upgraded to the latest stable version between 8pm and 11pm on Saturday 12th September. The configuration that would create a routing loop if both switches of a resilient pair rebooted at the same time has been removed.
