NODE - Inside the mechanics of the AI factory

AI data centre

The data centre industry is navigating a fundamental shift in how hardware is deployed. The rise of AI factories and the demand for sovereign AI have led to a rethink of the traditional approach to critical digital infrastructure. National and industry-specific operations that keep data within jurisdictional boundaries now require a transformation of how hardware and data are managed. This movement has also been driven by a need for technical density that traditional design and construction, electrical supply, and cooling methods struggle to support.

The shift from cooling to heat orchestration

In a traditional air-cooled facility, a large volume of air acted as a thermal buffer. If a cooling unit failed, the thermal mass of the room provided a window of minutes or even hours before temperatures became critical. In a liquid-cooled AI operation, that margin for error is significantly smaller.
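The size of that window can be roughed out with a lumped energy balance: the time to warm a thermal mass by a given amount is its mass times its specific heat times the allowed temperature rise, divided by the heat load. The sketch below is illustrative only. The room volume, loop volume, loads, and allowed rise are hypothetical figures, not values from any particular facility, and the calculation counts only the fluid itself. It ignores the thermal mass of equipment and the building, which lengthens the real window in an air-cooled room.

```python
# Lumped ride-through estimate: t = m * c_p * dT / P
# All facility figures below are hypothetical, chosen only for illustration.

AIR_DENSITY_KG_M3 = 1.2        # approximate density of air at room conditions
AIR_CP_J_KG_K = 1005.0         # approximate specific heat of air
WATER_DENSITY_KG_M3 = 1000.0   # approximate density of water
WATER_CP_J_KG_K = 4186.0       # approximate specific heat of water


def ride_through_seconds(mass_kg, specific_heat_j_kg_k, allowed_rise_k, heat_load_w):
    """Seconds for a thermal mass to warm by allowed_rise_k with no cooling at all."""
    if heat_load_w <= 0:
        raise ValueError("heat_load_w must be positive")
    return mass_kg * specific_heat_j_kg_k * allowed_rise_k / heat_load_w


# Hypothetical air-cooled hall: 2,000 m^3 of air, 200 kW of IT load, 10 K allowed rise.
air_hall_s = ride_through_seconds(2000 * AIR_DENSITY_KG_M3, AIR_CP_J_KG_K, 10, 200_000)

# Hypothetical liquid loop: 0.5 m^3 (500 litres) of water, 1 MW of IT load, 10 K allowed rise.
liquid_loop_s = ride_through_seconds(0.5 * WATER_DENSITY_KG_M3, WATER_CP_J_KG_K, 10, 1_000_000)

print(f"Air volume alone: {air_hall_s:.1f} s")
print(f"Liquid loop alone: {liquid_loop_s:.1f} s")
```

Under these assumptions the liquid loop holds a similar amount of energy to the room air but absorbs five times the load, so its buffer is used up several times faster.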

This change is driven by the power densities of modern graphics processing units (GPUs). Air-cooled systems struggle to remove heat quickly enough to keep these chips within their optimal operating ranges. The move to liquid cooling introduces a new level of mechanical complexity to the data centre floor. The infrastructure needs to be designed for AI workloads and high efficiency, yet flexible enough to handle the varying loads of different use cases. Predictive performance management is critical to maintaining stability when shifting between massive training loads and smaller inference tasks.

The five levels of commissioning

Before any live IT hardware arrives, facilities must undergo a rigorous commissioning process. This is a multi-phased validation of primary and secondary cooling loops, transformers, switchgear, uninterruptible power supply (UPS) systems, busbars, and electrical distribution. The goal is to reach a state where the system is a known quantity before the actual IT equipment is installed.

This process involves using load bank racks, which simulate the heat output and water flow of a real GPU cluster. With these load banks in place, operators can check that the complete system operates as designed and verify that the power and cooling systems, including coolant distribution units (CDUs) and manifolds, are balanced. If the coolant flow rate is off by even a small margin, the GPUs can throttle, and the entire multi-million-pound investment will underperform.
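The sensitivity to flow rate follows from the basic heat balance, in which heat removed equals mass flow times specific heat times the coolant temperature rise. The sketch below works through that arithmetic for a single rack. It assumes plain water as the coolant (real loops often use a water-glycol mix with different properties), and the 100 kW rack load and 10 K design temperature rise are hypothetical values.

```python
# Heat balance for a liquid-cooled rack: P = m_dot * c_p * dT
# Assumes plain water; rack load and design temperature rise are hypothetical.

WATER_CP_J_KG_K = 4186.0       # approximate specific heat of water
WATER_DENSITY_KG_M3 = 1000.0   # approximate density of water


def required_flow_l_per_min(heat_load_w, temp_rise_k):
    """Volumetric flow needed to carry heat_load_w at a given coolant temperature rise."""
    mass_flow_kg_s = heat_load_w / (WATER_CP_J_KG_K * temp_rise_k)
    return mass_flow_kg_s / WATER_DENSITY_KG_M3 * 1000.0 * 60.0


def temp_rise_k(heat_load_w, flow_l_per_min):
    """Coolant temperature rise that results from a given flow at a given load."""
    mass_flow_kg_s = flow_l_per_min / 60.0 / 1000.0 * WATER_DENSITY_KG_M3
    return heat_load_w / (mass_flow_kg_s * WATER_CP_J_KG_K)


rack_load_w = 100_000   # hypothetical 100 kW rack
design_rise_k = 10.0    # hypothetical design temperature rise across the rack

design_flow = required_flow_l_per_min(rack_load_w, design_rise_k)
rise_at_90_percent = temp_rise_k(rack_load_w, 0.9 * design_flow)

print(f"Design flow: {design_flow:.1f} L/min")
print(f"Temperature rise at 90% of design flow: {rise_at_90_percent:.2f} K")
```

With these figures, a 10% shortfall in flow raises the coolant temperature rise from 10 K to roughly 11.1 K at the same load, which is the kind of imbalance load bank testing is meant to expose before real hardware is exposed to it.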

Software and hardware integration

One of the most significant evolutions in this ecosystem is the integration of software and hardware at the design phase. Integrated systems are being deployed that require constant monitoring and simulation. This is where digital twin technology has moved from a theoretical concept to an operational necessity.

By using digital twins to simulate the thermal and electrical dynamics of a facility before IT hardware arrives, operators can carry out comprehensive commissioning in a virtual setup. This testing makes it possible to fine-tune secondary cooling loops and power skids in advance, so that compute hardware can go live as soon as it arrives.
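A production digital twin combines detailed thermal, hydraulic, and electrical models, but the underlying idea can be shown with a toy example. The sketch below is a deliberately simplified, single-node model of a coolant loop, not a representation of any vendor's tooling. It steps the loop temperature forward in time as the load jumps from an inference-sized level to a training-sized level. The function name `simulate_loop`, the loop volume, the flow rate, the supply temperature, and the load profile are all assumptions made for illustration.

```python
# Toy single-node loop model: C * dT/dt = P(t) - h * (T - T_supply)
# Every parameter value here is hypothetical.

WATER_CP_J_KG_K = 4186.0  # approximate specific heat of water


def simulate_loop(load_w_at, duration_s, dt_s, heat_capacity_j_k,
                  removal_w_per_k, supply_temp_c, start_temp_c):
    """Integrate the loop temperature with explicit Euler steps; returns all temperatures."""
    temps = [start_temp_c]
    temp = start_temp_c
    for step in range(int(duration_s / dt_s)):
        t = step * dt_s
        net_heat_w = load_w_at(t) - removal_w_per_k * (temp - supply_temp_c)
        temp += net_heat_w * dt_s / heat_capacity_j_k
        temps.append(temp)
    return temps


HEAT_CAPACITY_J_K = 500 * WATER_CP_J_KG_K   # 500 kg of coolant in the loop
REMOVAL_W_PER_K = 24 * WATER_CP_J_KG_K      # 24 kg/s of flow through the heat exchanger
SUPPLY_TEMP_C = 30.0


def load_profile(t_s):
    """Hypothetical workload: 200 kW of inference, then a step to 1 MW of training at 60 s."""
    return 200_000 if t_s < 60 else 1_000_000


start_temp = SUPPLY_TEMP_C + 200_000 / REMOVAL_W_PER_K       # steady state at the low load
expected_final = SUPPLY_TEMP_C + 1_000_000 / REMOVAL_W_PER_K  # steady state at the high load

temps = simulate_loop(load_profile, 600, 1.0, HEAT_CAPACITY_J_K,
                      REMOVAL_W_PER_K, SUPPLY_TEMP_C, start_temp)

print(f"Loop temperature before the step: {temps[60]:.2f} C")
print(f"Loop temperature after settling:  {temps[-1]:.2f} C")
```

Even a model this simple lets an operator ask what-if questions, such as how quickly the loop responds to a load step at a given flow rate, before any hardware is on site.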

Agents that learn the system’s behaviour are also being integrated within building management systems (BMS). By taking data from chillers and coolant distribution units, the BMS can predict pressure or temperature changes before they happen. This allows the facility to adjust its motor usage or fluid pressure in real time, depending on the workload.
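The difference between reacting to a threshold and predicting a change can be illustrated with a very simple stand-in for a learned agent: a straight-line fit through recent sensor readings, extrapolated a short way ahead. This is a minimal sketch, not the method any particular BMS uses. The function names, the sample readings, the 60-second horizon, and the 34 °C limit are all hypothetical.

```python
# Linear trend extrapolation as a simple stand-in for a learned predictive agent.
# Sample readings, horizon, and limit are hypothetical.


def fit_line(times_s, values):
    """Ordinary least-squares slope and intercept for values against times_s."""
    n = len(times_s)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(times_s) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times_s)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(times_s, values)) / var_t
    return slope, mean_v - slope * mean_t


def predict_ahead(times_s, values, horizon_s):
    """Extrapolate the fitted trend horizon_s seconds past the latest sample."""
    slope, intercept = fit_line(times_s, values)
    return slope * (times_s[-1] + horizon_s) + intercept


def should_act_early(times_s, values, horizon_s, limit):
    """True when the trend is forecast to reach the limit within the horizon."""
    return predict_ahead(times_s, values, horizon_s) >= limit


sample_times_s = [0, 10, 20, 30, 40]
return_temps_c = [30.0, 30.5, 31.0, 31.5, 32.0]   # hypothetical CDU return temperatures

forecast_c = predict_ahead(sample_times_s, return_temps_c, horizon_s=60)
act_now = should_act_early(sample_times_s, return_temps_c, horizon_s=60, limit=34.0)

print(f"Forecast in 60 s: {forecast_c:.1f} C, act early: {act_now}")
```

In this example the latest reading is still below the limit, so a purely reactive threshold would do nothing, while the trend forecast crosses the limit within the minute and gives the control system time to raise pump speed or fluid pressure first.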

The role of sovereign AI operations

Sovereign AI adds a layer of multi-tenant complexity that general hyperscale deployments rarely face. A national AI operation might support sectors as diverse as pharmaceuticals, defence, and banking simultaneously. The architecture must support the dynamic sharing of valuable assets while maintaining rigorous encryption and regulatory separation between these entities.

This means that the infrastructure must be hyper-efficient to manage the heat of these dense workloads, but it also needs to be flexible. The companies that do well will be those that can handle the complexity of heat, energy, and sovereign security within a single, unified framework.

Knowledge sharing as a competitive advantage

The speed at which the AI industry operates means solutions are evolving in real time. Reference architectures are advancing as quickly as the chips themselves. This has led to a “hive mind” approach in which vendors and operators share technical lessons as they learn them.

In the past, proprietary designs were a competitive advantage. Today, the advantage lies in the ability to execute a deployment flawlessly and quickly. This requires a tight feedback loop between grid providers, energy companies, and infrastructure specialists like Vertiv. Success in this new era will be defined by the ability to orchestrate a transparent and integrated ecosystem. If organisations are to achieve deployment windows that were unimaginable as little as two years ago, the ecosystem must operate as a collective from the beginning.