The data centre industry is booming across Asia; there is a supply and demand imbalance and an urgent need for new AI-ready data centre capacity (it seems the word “factory” is the new “hyperscale”). The average data centre size (measured by power) has shot up to the point where it now seems normal for a developer to be building a facility of at least 100MW.
This increase in size brings the cost of the supporting infrastructure to the fore: more generators, cooling, switching and UPS systems, all against a general rise in the cost of raw materials.
At the same time, most traditional operators seem to be building “AI-ready” data centres that adhere to the same resilience standards as traditional colocation facilities. I question this logic. Is it a well-thought-through plan, or simply clinging to habits formed over the last 20 years? Much of the data centre industry seems to operate on momentum: “We’ve always done it that way.” I also question whether we are now building in levels of diversity and security that many companies in the AI segment of the market may not value, and therefore would rather not pay for in their lease rates.
Legacy data centres were built around workloads that absolutely could not fail: banking, insurance, healthcare, payments and so on. For those applications and workloads, outages were unacceptable, and that is how Tier III-like architectures evolved. It went further still, with certifications (which in some cases cost considerable money) required to prove that a data centre could run with very little downtime.
AI training doesn’t collapse when a node fails. Jobs run across large, distributed GPU clusters, and if a server drops or there is a power interruption, they resume from their last checkpoint. Real-time inference applications care about latency and availability, but could their resilience come from software-level redundancy across regions rather than from the back-end hardware?
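To make that concrete, here is a minimal sketch of the checkpoint-and-resume pattern that gives a training job its fault tolerance in software. It assumes a PyTorch-style loop; the model, data and file path are placeholders for illustration, not any particular operator’s setup.

```python
# Minimal sketch (illustrative only) of checkpoint/resume fault tolerance:
# if the node loses power, the job restarts and picks up from the last save.
import os
import torch
import torch.nn as nn

CKPT_PATH = "checkpoint.pt"  # hypothetical path; real jobs write to shared storage

model = nn.Linear(512, 512)  # stand-in for a large model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
start_step = 0

# On (re)start, resume from the last checkpoint if one exists.
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    start_step = ckpt["step"] + 1

for step in range(start_step, 10_000):
    batch = torch.randn(32, 512)  # placeholder for real training data
    loss = model(batch).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodic checkpoints: an interruption costs only the work done since the
    # last save, not the whole job, which is why the workload gains little from
    # node-level power redundancy.
    if step % 500 == 0:
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "step": step}, CKPT_PATH)
```

The same principle scales up: large training runs checkpoint to shared storage and restart on surviving hardware, so resilience lives in the software layer rather than in duplicated generators and UPS strings.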
Highly resilient infrastructure can double the cost of power and cooling systems. Why can’t we think of another approach that reduces cost and accelerates ready-for-service (RFS) dates and deployment?
I wonder if the biggest obstacle isn’t technology, it’s mindset. Engineers and consultants trust designs they’ve spent careers perfecting, and our industry is programmed to be risk averse.
Instead of replicating old standards, should we be asking what level of downtime actually matters for a specific workload, particularly LLM training? Does the customer care more about their business case and the tight margins the market sustains, or about diversity and SLAs?
In my opinion, what we want to do is bring new AI-ready capacity online quickly that is entirely fit for purpose. Over-engineering it with unnecessary physical security, certifications and resilience is not the right model for every type of customer in today’s world. AI is rewriting the future of computing, and design must evolve alongside it; otherwise we may be wasting money and time on unnecessary infrastructure.
-
Darren Webb
CEO