
System Disruption - 21st July 11:20

11:20 - We are currently aware of some performance issues affecting EasyContactNow. Our Infrastructure team are working to address them.

11:45 - The service has now been restored to full capability. We apologise for the interruption. 

12:45 - Unfortunately we are continuing to experience intermittent problems on the platform. We are working to resolve this.

13:15 - We are continuing to work on stabilising EasyContactNow.

13:45 - The system is now stable. We will of course be very closely monitoring the situation in order to take immediate action if the status changes.

22nd July

13:40 - After yesterday's issues, our teams worked through the night to make sure we have no recurrences today. I am very pleased to report that this morning we have had no further problems. We continue to monitor the platform closely.

 

---- Post Mortem ----

 

Event Summary:

On Tuesday 2015-07-21:

Some customers contacted customer support reporting screen lag, defined as their screens not refreshing in a timely fashion when something was clicked or a call was connected.

Impact varied from customer to customer and even agent to agent. Monitoring confirmed brief periods of system responsiveness issues ('lag') of generally less than a few seconds, but between 11:19-11:23 and 16:40-16:48 there were concentrated periods of much higher than normal system response times. This corresponded with the monitoring alerts received by the operations team.

 

Repair Action:

On Tuesday night, starting at 23:00, a new database table was created and system code was changed to utilise the new table, in order to bypass an internal limitation of the current database implementation.
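
The post mortem does not name the database, tables, or code paths involved, so the following is only a minimal sketch of this kind of repair, written in Python against SQLite purely for illustration; the table name status_transactions_v2, its columns, and the record_status function are all assumptions.

    import sqlite3

    conn = sqlite3.connect("ecn_example.db")

    # Create a replacement table outside the internal database system resource,
    # so it can be monitored and indexed like any other key table.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS status_transactions_v2 (
            id         INTEGER PRIMARY KEY,
            agent_id   INTEGER NOT NULL,
            status     TEXT    NOT NULL,
            changed_at TEXT    NOT NULL
        )
    """)

    # Force indexed access on the columns the status queries filter on.
    conn.execute("""
        CREATE INDEX IF NOT EXISTS idx_status_agent_time
        ON status_transactions_v2 (agent_id, changed_at)
    """)
    conn.commit()

    def record_status(agent_id, status, changed_at):
        # Application code now writes to the new table instead of the internal
        # resource that could not be force-optimised.
        conn.execute(
            "INSERT INTO status_transactions_v2 (agent_id, status, changed_at) "
            "VALUES (?, ?, ?)",
            (agent_id, status, changed_at),
        )
        conn.commit()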

 

Root Cause:

The root cause of this event has been identified as a mechanism within the current database implementation whereby the database automatically determines what it considers the best form of optimisation to use for queries against database tables and files.

It is current practice to force all database interactions to perform in the highest performance mode by forcing indexing on all key tables and files. In this case the table involved was an internal database system resource that could not be monitored and force-optimised in the normal way. Since this table is where all new status transactions are added, a threshold was reached that caused the internal optimiser to switch to a method of optimisation that did not provide the level of responsiveness needed for best operation. This caused the system to exhibit the performance lag experienced by customers.

This same cause has been producing progressively worse intermittent impact over the past six weeks as transaction history and volume have grown, leading to yesterday's events.

Moving this table and its functions out of the control of the internal database system optimiser and forcing full indexing resolved the issue going forward. 
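
One way the normal practice of force-optimising a key table can be verified is by inspecting the query plan for its hottest queries. The sketch below does this with SQLite's EXPLAIN QUERY PLAN, reusing the assumed schema from the sketch above; other databases expose the same information through their own EXPLAIN variants.

    import sqlite3

    conn = sqlite3.connect("ecn_example.db")

    # Reuse the illustrative schema from the sketch above (created if missing).
    conn.execute("CREATE TABLE IF NOT EXISTS status_transactions_v2 "
                 "(id INTEGER PRIMARY KEY, agent_id INTEGER NOT NULL, "
                 "status TEXT NOT NULL, changed_at TEXT NOT NULL)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_status_agent_time "
                 "ON status_transactions_v2 (agent_id, changed_at)")

    def uses_index(query, params=()):
        # True when the query plan reports indexed access rather than a full
        # table scan (SQLite-specific plan text).
        plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
        return any("USING INDEX" in str(row[-1]).upper() for row in plan)

    query = ("SELECT status FROM status_transactions_v2 "
             "WHERE agent_id = ? ORDER BY changed_at DESC LIMIT 1")
    print("indexed access:", uses_index(query, (42,)))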

 

Preventive Measures:

To prevent this type of issue from happening again, additional monitoring has been implemented to better track database performance, so that potential performance issues can be identified and acted upon before they impact customers.
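
As an illustration only, such monitoring could be as simple as timing a representative status query at a fixed interval and alerting when it crosses a threshold. The probe query, the thresholds, and the table are assumptions carried over from the earlier sketches, not details taken from the production system.

    import sqlite3
    import time

    SLOW_QUERY_SECONDS = 0.5      # assumed alerting threshold
    CHECK_INTERVAL_SECONDS = 60   # assumed polling interval

    PROBE_QUERY = ("SELECT COUNT(*) FROM status_transactions_v2 "
                   "WHERE agent_id = 42 "
                   "AND changed_at >= datetime('now', '-5 minutes')")

    def probe_once(conn):
        # Time a single run of the probe query.
        started = time.perf_counter()
        conn.execute(PROBE_QUERY).fetchone()
        return time.perf_counter() - started

    def monitor():
        conn = sqlite3.connect("ecn_example.db")
        # Assumed schema from the earlier sketches (created if missing).
        conn.execute("CREATE TABLE IF NOT EXISTS status_transactions_v2 "
                     "(id INTEGER PRIMARY KEY, agent_id INTEGER NOT NULL, "
                     "status TEXT NOT NULL, changed_at TEXT NOT NULL)")
        while True:
            elapsed = probe_once(conn)
            if elapsed > SLOW_QUERY_SECONDS:
                # In production this would raise an alert to the operations team.
                print(f"ALERT: probe query took {elapsed:.2f}s")
            time.sleep(CHECK_INTERVAL_SECONDS)

    if __name__ == "__main__":
        monitor()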

We are also investigating improvements to the current QA pre-production testing environment so that it can simulate the load conditions that trigger these events, adding an additional method of validation to prevent potential performance-impacting events.
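
Again purely as a sketch under the same assumed schema, a pre-production load generator would bulk-insert status transactions until the table crosses the kind of volume threshold that changed the optimiser's behaviour in production; the row counts below are placeholders, not measured values.

    import datetime
    import random
    import sqlite3

    TARGET_ROWS = 1_000_000   # placeholder for the volume needed to reproduce the switch
    BATCH_SIZE = 10_000

    def generate_load(db_path="ecn_qa.db"):
        conn = sqlite3.connect(db_path)
        # Assumed schema matching the earlier sketches.
        conn.execute("CREATE TABLE IF NOT EXISTS status_transactions_v2 "
                     "(id INTEGER PRIMARY KEY, agent_id INTEGER NOT NULL, "
                     "status TEXT NOT NULL, changed_at TEXT NOT NULL)")
        statuses = ["available", "on_call", "wrap_up", "away"]
        now = datetime.datetime.now(datetime.timezone.utc).isoformat()
        written = 0
        while written < TARGET_ROWS:
            batch = [(random.randint(1, 500), random.choice(statuses), now)
                     for _ in range(BATCH_SIZE)]
            conn.executemany(
                "INSERT INTO status_transactions_v2 (agent_id, status, changed_at) "
                "VALUES (?, ?, ?)", batch)
            conn.commit()
            written += BATCH_SIZE

    if __name__ == "__main__":
        generate_load()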

 
