We are aware of a longer-than-usual delay before call recordings become available in the calls report. No recordings have been lost, but the backlog is not expected to clear until tomorrow.
Reason for Outage – Call Recording API
Date of event:
On Tuesday 2015-07-21 from 9:48am:
Several customers reported that they were unable to download call recordings, both current and historic, and provided multiple examples. Most of the affected files were recorded on the 21st, but some files from before this date were also reported.
This was caused by the partial failure of a second storage node, which broke the “2 of 3” majority needed to provide a consistent service. One node had already failed in the previous month and is still under repair.
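As an illustration of the “2 of 3” majority rule described above, the following is a minimal sketch only. The function name `has_quorum` and its parameters are hypothetical and do not represent the storage system's actual implementation; it simply shows why losing a second node out of three stops consistent service.

```python
def has_quorum(healthy_nodes: int, total_nodes: int = 3) -> bool:
    """Return True when a strict majority of storage nodes is healthy.

    Hypothetical helper for illustration; not the production logic.
    """
    return healthy_nodes > total_nodes // 2


# After the first failure, 2 of 3 nodes remain: the majority still holds.
after_first_failure = has_quorum(2)

# After the second (partial) failure, only 1 of 3 nodes can serve the
# affected partitions: the majority is lost for those partitions.
after_second_failure = has_quorum(1)
```

With three nodes, a strict majority requires at least two healthy copies, which is why the affected partitions cannot be served automatically until at least one failed node has re-synchronized.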
No call recordings have been lost. However, files in certain partitions cannot be retrieved automatically through the API until partition synchronization is complete.
Support staff can individually retrieve and process recordings where necessary. No files after the 21st are affected.
The first RAID failure occurred on 24th June, and the re-synchronization of the data onto this array continues.
The second RAID failure occurred on 21st July and the re-synchronization of the data onto this array continues.
RAID is self-repairing, but because of the size of the arrays, re-synchronization can take a very long time while the arrays remain under write load. The best estimate for final completion is a further 50 days. We expect full API service to be restored in less than 30 days.
Priority is being given to ensuring that at least two copies of every recording exist.
Root Cause and Preventive Measures:
The root cause of this event is that a second failure occurred before the first could be repaired.
To prevent a recurrence, consideration is being given to increasing the number of copies and the number of discrete servers.