
Industry Lessons: Until Power Is Better Understood, BA Won’t Be an Isolated Incident

Feeling the heat

Summer, with its long hot days, warm evenings and holidays, is all fun in the sun. But if summer is your business's busiest time of year and all its critical IT systems go down, causing chaos for thousands of your customers and damaging the company's reputation, then the fun fades quicker than any holiday suntan.

There are certain events that shouldn't happen – they can't be blamed on the weather, unscheduled maintenance or even a 'power surge' – as poor planning is always the better explanation. There has been much speculation about what went wrong at BA, and there's also surprise that anything went wrong at all given the complexity and immense scale of an airline's business and data centre operations, estimated at 500 cabinets. It is second only to the banking industry in its size, scale and need for 100% uptime. Safety, security and customer service all depend on it.

 

Outages are not isolated incidents

And yet – at a data centre industry level – this is far from an isolated incident. A survey commissioned by Eaton of IT and data centre managers across Europe found that 27% of respondents had suffered a prolonged outage leading to a disruptive level of downtime in the last three months. The vast majority of respondents (82%) agree that most critical business processes are dependent on IT, and 74% say the health of the data centre directly impacts the quality of IT services. This paints a clear picture: the business depends on IT, and IT depends on the data centre to function. So the fact that more than one in four data centres had recently suffered a prolonged outage tells us that something is wrong at an industry level.

 

Poor power planning

Just as critical business processes depend on IT, the data centre itself must provide resilience to keep the business running. It's a core facet of a business's risk management strategy.

The only thing we know for certain in BA's case is that someone or something killed the power to the data centre, and, whether through a panicked response or a lack of knowledge, when the power was reapplied, incorrect processes exacerbated the issues even further. We should be careful not to attribute this failure to any individual technology or person; it's a problem of poor understanding of power that could and should have been prevented by proper processes and power system design, especially if they'd followed the simple rule of data centre power management – actions have consequences and consequences require action.

The BA example demonstrates once again that a poor understanding of power is a common problem. Two-thirds of data centre professionals in Eaton's research weren't fully confident when it comes to power, and until organisations get to grips with power management we can expect to see more power-related outages. There is also a profound concern around skills availability: it's hard to acquire and retain the relevant expertise or talent, whether for designing for energy efficiency, managing consumption on an ongoing basis, or dealing with power-related failures quickly and effectively to avoid and mitigate outages.

 

Have you tried switching it off and on again?

Should a full power outage occur, it's absolutely imperative to have a disaster recovery process in place that clearly defines the steps to be taken when re-energising the data centre, detailing which systems must be brought back online first. In a full outage, where people are in a state of panic and under pressure to resume normal services, staggering the re-energisation of the systems in your data centre may seem counter-intuitive – the goal, after all, is to get back online as quickly as possible – but such a process helps to avoid extending the outage further. The restoration of a data centre after it has gone dark needs to be done gently and in a clearly defined, methodical fashion; simply trying to get everything back up in a hasty and unplanned way will only cause in-rush current, which could trigger further outages and quickly cripple the data centre again. Power management is all about understanding the dependencies between the different parts of the power system and the IT load, and having appropriate levels of resilience in the hardware, software and processes.

Recovering from an outage requires patience and a systematic process – two things that were seemingly missing, according to reports on BA's outage. No data centre professional has ever asked 'have you tried switching it off and on again?' The skill is to pace oneself and follow each step in turn, controlling and monitoring a phased restart so that batches of systems are only brought online when it is safe to do so and one is sure of the correct phase balancing and loads. Skipping any steps in the rush to get back online can create a power surge, overloading circuits, tripping breakers and, to put it mildly, causing chaos.
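To make that concrete, here is a minimal sketch of what a scripted, phased restart might look like. It assumes a Python environment and uses entirely hypothetical system names, thresholds and placeholder power-on and monitoring calls – nothing here reflects BA's actual tooling or any specific vendor's API.

```python
import time

# Hypothetical restart plan: batches ordered by dependency, most critical first.
RESTART_BATCHES = [
    ["core-network", "dns", "storage-array"],
    ["database-cluster"],
    ["application-servers"],
    ["batch-and-reporting"],
]

MAX_PHASE_LOAD_PCT = 80.0   # illustrative limit checked before energising the next batch
SETTLE_SECONDS = 120        # illustrative pause to let in-rush current settle


def power_on(system: str) -> None:
    # Placeholder: in practice this would be a remote management or managed-PDU call.
    print(f"Energising {system}")


def read_phase_load_pct(system: str) -> float:
    # Placeholder: in practice this would query a metered PDU or DCIM system.
    return 50.0


def staged_restart() -> None:
    """Bring systems back in batches, verifying load before each new batch."""
    for batch in RESTART_BATCHES:
        for system in batch:
            power_on(system)
        time.sleep(SETTLE_SECONDS)  # let in-rush settle and phases balance
        if any(read_phase_load_pct(s) > MAX_PHASE_LOAD_PCT for s in batch):
            # Stop rather than pile more load onto an already stressed phase.
            raise RuntimeError(f"Load check failed after batch {batch}; halt and investigate")


if __name__ == "__main__":
    staged_restart()
```

The important part is not the code but the discipline it encodes: a fixed order, a pause for in-rush current to settle, and a check that halts the process rather than pressing on when the load readings look wrong.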

 

Resilience and infrastructure upgrades

Alongside skills and power processes, the facilities infrastructure itself often needs upgrading to meet today's efficiency, reliability and flexibility expectations. Around half of respondents in Eaton's survey report that their core IT infrastructure needs strengthening, and this number is closer to two-thirds when it comes to facilities such as power and cooling.

Power management is increasingly becoming a software-defined activity. Given the skills gap, software can play an important role in bridging the divide between IT and power by presenting power management options in dashboard styles that are familiar to an IT audience, making it easier to understand and even automating the management of power infrastructure. This could have prevented the outage that BA faced, as automated processes would have brought systems back online in a controlled and monitored fashion.
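As a purely illustrative example of the kind of policy such software automates (not a description of any vendor's product, and using invented names and thresholds), the sketch below watches UPS status and gracefully sheds the least critical workload groups as battery runtime shrinks, rather than letting the whole load drop at once.

```python
import time

# Hypothetical shedding order: least critical workload groups go first.
SHUTDOWN_ORDER = ["test-and-dev", "batch-and-reporting", "application-servers"]

LOW_RUNTIME_MINUTES = 10   # assumed threshold, purely for illustration


def ups_status() -> dict:
    # Placeholder: a real tool would poll the UPS over SNMP or a management card.
    return {"on_battery": False, "runtime_minutes": 45}


def graceful_shutdown(group: str) -> None:
    # Placeholder: a real tool would ask the hypervisor or orchestration layer
    # to shut the group's virtual machines down cleanly.
    print(f"Gracefully shutting down workload group: {group}")


def power_policy_loop(poll_seconds: int = 30) -> None:
    """Shed workloads in priority order while the UPS runs on battery."""
    shed = 0
    while shed < len(SHUTDOWN_ORDER):
        status = ups_status()
        if status["on_battery"] and status["runtime_minutes"] < LOW_RUNTIME_MINUTES:
            graceful_shutdown(SHUTDOWN_ORDER[shed])
            shed += 1
        time.sleep(poll_seconds)
```

The same dashboard-and-policy approach applies on the way back up: the restart sequence sketched earlier is exactly the kind of procedure this class of software can run automatically instead of leaving it to people under pressure.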

We've moved towards more virtualised environments in data centres, and IT and data centre professionals are familiar with using virtualisation to maintain hardware, so the question is: why not use the same principles in power? It is important that all power distribution designs, and the associated resiliency software tools, are compatible with all the major virtualisation vendors to ensure the infrastructure is future-proofed. This approach enables data centre professionals to carry out concurrent maintenance, mitigating the risks of infrastructure maintenance and upgrades.

 

Learning lessons

While we may never fully understand what happened within BA's data centre, it's near enough guaranteed that it won't be an isolated incident across the wider data centre industry, even if it's unlikely we'll see anything on the same scale for a long time. The issue comes down to either poor preparation or poor implementation of disaster recovery. Better preparation of the data centre disaster recovery process would have seen it designed with resilience in mind, meaning firstly that the DR site should have kicked in to cover demand during the outage and, secondly, that the hardware and applications should have been restarted in a far more controlled manner. Reintroducing power to systems in a slow, phased manner would have allowed for a smooth and steady recovery. We, as a data centre industry, need to make sure that we all learn the lessons from BA's high-profile outage and take action to ensure that effective power management is a 'must have' and not a 'nice to have'.
