Learning lessons from others' mistakes!


On March 6th 2013, NuGet.org’s package download was broken for one and a half hours. Unusually for a development team, they accepted that the outage was their fault and could have been avoided. The full and frank account of the events can be found on the NuGet blog. For the purposes of this post I am going to focus on the post-mortem and the lessons that can be learnt.

 

“Anyone who has never made a mistake has never tried anything new.” – Albert Einstein

 

“Never interrupt your enemy when he is making a mistake” – Napoleon Bonaparte

 

“Smart people learn from their mistakes. But the real sharp ones learn from the mistakes of others” – Brandon Mull – Fablehaven

 

The Concept

It is with the last quote in mind that this post is written. Over the past couple of years Microsoft has embraced open source more and more. One example of this is NuGet, a free, open source, developer-focused package management system for the .NET platform, intended to simplify the process of incorporating third-party libraries into a .NET application during development. NuGet is a member of the ASP.NET Gallery in the Outercurve Foundation.

 


 

The Lessons

The outage was triggered by the NuGet development team performing what should have been a regular planned upgrade to NuGet.org. Unfortunately, what transpired was a catalogue of failures, including the inability to revert to a previous known good state.

 

The NuGet gallery is an Azure-based application that plays to the strengths of cloud-based software: it is scalable, fully manageable by the development team, and ideal for hosting a continuously improving application that delivers full-on innovation.

 

The NuGet team prides itself on a 2-3 week update interval, and the application has been built to fully exploit .NET’s latest and greatest features, including Entity Framework migrations. The story is familiar to many of us in the industry: change and deployment have become routine, and our expectation of the outcome is very optimistic. ‘Hell, we’ve done this so many times already…’, so much of this QA malarkey is a bore and not needed, and with Azure we can just flip back. This brings us to our first lesson:

 

1. If it can go wrong it will, so expect it to go wrong! Just because what you are doing has become routine does not mean it can’t go wrong, and I guarantee that when it does it won’t simply be a little bit wrong. Pilots use checklists on every flight to ensure the routine of pre-flight checks does not miss anything. In our operating theatres in the UK, medical teams now follow pre-op checklists that have been shown to save lives.

 

  • Plan for failure
  • For any deployment we must provide and adhere to a deployment checklist
  • We must carry out every step, and when upgrading software we must also update the checklist
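
One way to make a checklist hard to skip is to have the deployment script refuse to run until every step is confirmed. The following is only a minimal sketch of that idea: the `Step` class, the function names and the step wording are all hypothetical, and it is not a description of how the NuGet team work.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    check: Callable[[], bool]  # returns True once the step has been completed


def run_checklist(steps: List[Step]) -> List[str]:
    """Run every step in order and return the names of the steps that failed."""
    failed = []
    for step in steps:
        if not step.check():
            failed.append(step.name)
    return failed


def safe_to_deploy(steps: List[Step]) -> bool:
    """A deployment may go ahead only when no checklist step has failed."""
    return not run_checklist(steps)


# Hypothetical checklist; the step names are illustrative only.
checklist = [
    Step("Code package built and versioned", lambda: True),
    Step("Supporting data changes applied", lambda: False),  # the step that was missed
    Step("Rollback plan rehearsed on pre-production", lambda: True),
]

print(run_checklist(checklist))   # ['Supporting data changes applied']
print(safe_to_deploy(checklist))  # False
```

In real use each `check` would query something concrete (a build server, a database version table) rather than return a fixed value, and the checklist itself would live in source control so that it is updated alongside the software.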

 

In the NuGet outage, one of the post-mortem items highlighted the fact that they had published code without the supporting data because they had failed to follow their deployment checklist. This particular failure quickly exposed a second failure, which leads to our second lesson:

 

2. With any change or upgrade, ensure you can return to the previous good state. Our fellow developers at NuGet, unfortunately, missed this one. Having moved their production system into a failed state, they were unable to recover the previous good version. A set of simple rules can be derived from this lesson:

 

  • Do not deploy any upgrade without a recovery plan being in place
  • Do not deploy any upgrade without that recovery plan having been tested on pre-production first
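
The two rules above can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions: the `Deployer` class, the version strings and the health check are all hypothetical, and a real rollback also has to cover data and schema changes, which is exactly where things get hard.

```python
class Deployer:
    """Tracks the live version and the last version that passed its health check."""

    def __init__(self, known_good: str):
        self.live = known_good
        self.known_good = known_good

    def deploy(self, version: str, health_check) -> bool:
        """Deploy `version`; return to the known good version if it is unhealthy."""
        self.live = version
        try:
            healthy = bool(health_check(version))
        except Exception:
            healthy = False  # a crashing health check counts as a failed deployment
        if healthy:
            self.known_good = version  # promote only after the check has passed
            return True
        self.live = self.known_good    # roll back to the last known good version
        return False


deployer = Deployer("v1")

ok = deployer.deploy("v2", lambda version: False)  # simulate a bad release
print(ok, deployer.live)  # False v1

ok = deployer.deploy("v3", lambda version: True)   # simulate a good release
print(ok, deployer.live)  # True v3
```

The point of the sketch is the ordering: the known good version is only replaced once the new one has proved itself, so there is always something to go back to. Rehearsing that path on pre-production is what tells you whether it really works.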

 

The consequences of not adhering to the rules laid down in the previous two lessons might not have been suffered had the next lesson already been learnt and understood. The team simply did not test enough before deployment. In fact, the QA testing was conducted by the developer, not the QA team, against a less than optimal QA environment. The same test scripts were not run against pre-production because there were issues with this environment (it was normal for it to be broken), so the approach was to rely on testing after deployment to production. The lesson here is simple:

 

3. Testing on the appropriate environment by testers prevents pain and embarrassment. There is no excuse for ‘missing’ testing steps.

 

  • Testing must be conducted before deployment and by testers
  • Your pre-production and QA environments must be ‘squeaky clean’ and totally representative of production
  • Where data forms a major part of the application, ensure that it is correctly represented in pre-prod and QA
  • Each environment must be standalone and not reliant on parts of the other
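
Whether an environment is ‘totally representative of production’ can be checked rather than assumed. As a rough sketch, suppose each environment can describe itself as a dictionary of settings; the keys and values below are invented for illustration and say nothing about the real NuGet environments.

```python
MISSING = "<missing>"  # placeholder shown when one side lacks a setting


def parity_gaps(production: dict, candidate: dict) -> dict:
    """Return every setting whose value differs between the two environments."""
    gaps = {}
    for key in sorted(set(production) | set(candidate)):
        prod_value = production.get(key, MISSING)
        cand_value = candidate.get(key, MISSING)
        if prod_value != cand_value:
            gaps[key] = (prod_value, cand_value)
    return gaps


# Hypothetical environment descriptions.
production = {"schema_version": 42, "runtime": "4.5", "shares_database_with": None}
pre_production = {"schema_version": 41, "runtime": "4.5", "shares_database_with": "qa"}

for setting, (prod_value, pre_value) in parity_gaps(production, pre_production).items():
    print(f"{setting}: production={prod_value!r}, pre-production={pre_value!r}")
```

Run before every test cycle, a report like this turns ‘it is normal for pre-production to be broken’ into a visible list of differences that someone has to fix or explicitly accept.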

 

Summary

It is too easy for us all to look at these events and say, ‘I hear what you’re saying, but it will never happen to us’. In the relaxed atmosphere in which you are reading this blog that may be the case, but add in a small dose of deadline pressure, cost stress and management demands, and all the good intentions go out of the window, as does the safety margin that prevents it happening to us all…

 

So to summarise:

  1. Plan for failure. It is our job to prevent it, but it is just as important to be able to recover from it.
  2. Checklists. Maintain them and always use them; when all is falling around you they will be your guiding light.
  3. Testing. Nothing beats testing, and testing on the right environment. It is better to test and fail in private than not to test and fail in public!

These lessons are just a few of the many we learn with experience, but failing to understand them, or ignoring them, can lead to some embarrassing failures.

Written by Andy James at 00:00
