Feeling the heat
Summer, with its long hot days, warm evenings and holidays, is all fun in the sun. But if summer is your business’s busiest time of year and its critical IT systems go down, causing chaos for thousands of customers and damaging the company’s reputation, the fun fades quicker than any holiday suntan.
There are certain events that shouldn’t happen – they can’t be blamed on the weather, unscheduled maintenance or even a “power surge” – because poor planning is almost always the better explanation. There has been much speculation about what went wrong at BA, and also surprise that anything went wrong at all, given the complexity and immense scale of an airline’s business and data centre operations, estimated at 500 cabinets. The airline industry is second only to banking in its size, scale and need for 100% uptime: safety, security and customer service all depend on it.
Outages are not isolated incidents
And yet – at a data centre industry level – this is far from an isolated incident. A survey commissioned by Eaton of IT and data centre managers across Europe found that 27% of respondents had suffered a prolonged outage leading to a disruptive level of downtime in the previous three months. The vast majority of respondents (82%) agree that most critical business processes are dependent on IT, and 74% say the health of the data centre directly affects the quality of IT services. The picture is clear: the business depends on IT, and IT depends on the data centre. The fact that more than one in four data centres had recently suffered a prolonged outage tells us that something is wrong at an industry level.
Poor power planning
Just as critical business processes depend on IT, the data centre itself must provide resilience to keep the business running. It’s a core facet of a business’s risk management strategy.
The only thing we know for certain in BA’s case is that someone or something cut the power to the data centre, and – whether through a panicked response or a lack of knowledge – incorrect processes when the power was reapplied made the problems even worse. We should be careful not to attribute this failure to any individual technology or person; it is a problem of poor understanding of power that could and should have been prevented by proper processes and power system design, especially by following the simple rule of data centre power management: actions have consequences, and consequences require action.
The BA example demonstrates again that misunderstanding power is a common problem. Two-thirds of data centre professionals in Eaton’s research were not fully confident in their power management, and until organisations get to grips with it we can expect to see more power-related outages. There is profound concern around skills availability: it is hard to acquire and retain the relevant expertise, whether for designing for energy efficiency, managing consumption on an ongoing basis, or dealing with power-related failures quickly and effectively to avoid and mitigate outages.
Have you tried switching it off and on again?
Should a full power outage occur, it is imperative to have a disaster recovery process in place that clearly defines the steps to be taken when re-energising the data centre, detailing which systems must be brought back online first. In a full outage, where people are panicking and under pressure to resume normal services, staggering the re-energisation of systems may seem counter-intuitive – the goal is to get back online as quickly as possible – but such a process helps to avoid prolonging the outage. Restoring a data centre after it has gone black must be done gently and in a clearly defined, methodical fashion. Simply trying to bring everything back up in a hasty and unplanned way will only cause in-rush current, which could trigger further outages and quickly cripple the data centre again. Power management is about understanding the dependencies between the different parts of the power system and the IT load, and having appropriate levels of resilience in the hardware, software and processes.
Recovering from an outage requires patience and a systematic process – two things that were seemingly missing, according to reports on BA’s outage. No data centre professional has ever asked ‘have you tried switching it off and on again?’ The skill is to pace oneself and follow each step in turn, controlling and monitoring a phased restart so that batches of systems are brought online only when it is safe to do so and the phase balancing and loads have been confirmed. Skipping steps in the rush to get back online can create a power surge, overloading circuits, tripping breakers and, to put it mildly, causing chaos.
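The staggered, check-before-each-phase restart described above can be sketched in a few lines of Python. Everything here is illustrative – the system names, phase groupings, load threshold and the power-meter stub are assumptions for the sake of the sketch, not taken from any real disaster recovery runbook or vendor API:

```python
import time

# Hypothetical phased-restart sketch. Phase order follows the common
# pattern of dependencies first: network, then storage, then databases,
# then the applications that rely on them.
RESTART_PHASES = [
    ["core-network", "dns"],           # phase 1: connectivity
    ["storage-array", "san-fabric"],   # phase 2: storage
    ["db-primary", "db-replica"],      # phase 3: databases
    ["app-servers", "web-frontends"],  # phase 4: applications
]

MAX_PHASE_LOAD_KW = 80.0  # illustrative in-rush budget per phase


def measure_load_kw():
    """Stand-in for a real power reading (e.g. polled from a metered PDU)."""
    return 42.0  # placeholder value


def power_on(system):
    print(f"Energising {system}")


def phased_restart(phases, settle_seconds=0):
    for i, systems in enumerate(phases, start=1):
        # Check the load BEFORE energising the next batch; if it is
        # already over budget, stop and investigate rather than pile on.
        load = measure_load_kw()
        if load > MAX_PHASE_LOAD_KW:
            raise RuntimeError(
                f"Load {load} kW exceeds budget before phase {i}; halting")
        for system in systems:
            power_on(system)
        time.sleep(settle_seconds)  # let in-rush current settle
        print(f"Phase {i} complete")


phased_restart(RESTART_PHASES)
```

The point of the sketch is the structure, not the numbers: each batch is gated on a measurement, and a failed check halts the sequence instead of rushing on – exactly the discipline reports suggest was missing.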
Resilience and infrastructure upgrades
Alongside skills and power processes, the facilities infrastructure itself often needs upgrading to meet today’s efficiency, reliability and flexibility expectations. Around half of respondents in Eaton’s survey report that their core IT infrastructure needs strengthening, and this number is closer to two-thirds when it comes to facilities such as power and cooling.
Power management is increasingly becoming a software-defined activity. Given the skills gap, software can play an important role in bridging the divide between IT and power by presenting power management options in dashboard styles familiar to an IT audience, making power infrastructure easier to understand and even automating its management. This could have prevented the outage that faced BA, as automated processes would have brought systems back online in a controlled and monitored fashion.
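One way such software bridges the gap is by encoding power-management policy as simple, testable rules rather than operator judgment under pressure. The sketch below is a minimal, assumed example – the thresholds and action names are invented for illustration, not drawn from any particular product:

```python
# Hypothetical policy sketch: map UPS state to a graded response, as a
# software-defined power management tool might. Thresholds are invented.

def choose_action(on_battery, battery_minutes_left):
    """Return the escalation step for the current UPS state."""
    if not on_battery:
        return "normal"
    if battery_minutes_left > 15:
        return "alert"              # notify operators, keep running
    if battery_minutes_left > 5:
        return "migrate-vms"        # move workloads off at-risk hosts
    return "graceful-shutdown"      # shut down cleanly before power is lost


print(choose_action(True, 12))  # prints "migrate-vms"
```

Because the escalation is explicit and automatic, the response to a failing power feed is decided calmly in advance rather than improvised mid-outage.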
Data centres have moved towards more virtualised environments, and IT and data centre professionals are familiar with using virtualisation to maintain hardware – so why not apply the same principles to power? All power distribution designs, and the associated resiliency software tools, should be compatible with the major virtualisation vendors to future-proof the infrastructure. This approach enables data centre professionals to carry out concurrent maintenance, mitigating the risks of infrastructure maintenance and upgrades.
Learning lessons
While we may never fully understand what happened within BA’s data centre, it is near-guaranteed that it won’t be an isolated incident across the wider data centre industry, even if we are unlikely to see anything on the same scale for a long time. The issue comes down to poor preparation or poor implementation of disaster recovery. Better preparation would have seen the disaster recovery process designed with resilience in mind: firstly, the DR site should have kicked in to cover demand during the outage; secondly, the hardware and applications should have been restarted in a far more controlled manner, with power reintroduced to systems slowly and in phases, allowing a smooth and steady recovery. We, as a data centre industry, need to make sure we all learn lessons from BA’s high-profile outage and take action to ensure that effective power management is a ‘must have’, not a ‘nice to have’.