Optimising server distribution across the racks / Хабр

Recently, a colleague asked me in a chat:

"Is there an article on how to pack servers into racks properly?"

I realised that I didn't know of one, so I decided to write my own.

First, this article is about bare-metal servers in data centre (DC) facilities. Second, we assume there are a lot of servers (hundreds or thousands); the article makes little sense for smaller quantities. Third, we assume three constraints on the racks: physical space, electric power per rack, and network connectivity. The cabinets stand next to each other in rows, so a single ToR switch can connect the servers in several of them.

The answer to the original question depends significantly on which parameter we are optimising and on what we can change to get a better result. For instance, we may need to use less space to leave more room for future growth. Or we may be free to choose cabinet height, power per rack, number of sockets per PDU, number of cabinets per switch group (one switch per 1, 2, or 3 racks), cable lengths and the amount of cabling work. The last item is critical at the ends of rack rows, where we either have to pull cables into another row or leave switch ports under-utilised. Server selection and data centre selection are completely different stories; we assume both have already been chosen.

It helps to understand some nuances and details, in particular the average and maximum server power consumption and how our vendor provides electricity. If the power supply is single-phase 230V, a 32A circuit breaker can carry up to ~7kW (230V × 32A ≈ 7.4kW). Let's say we formally pay for 6kW per rack. If the vendor measures our power consumption per row of 10 cabinets rather than per cabinet, and the circuit breakers limit power at 7kW, then we can use 6.9kW in one rack and 5.1kW in another. That is fine and carries no penalty.
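The arithmetic above can be checked with a short sketch. The function names and the example row of per-rack loads are illustrative assumptions; only the 230V, 32A, 6kW, 7kW and 10-cabinet figures come from the text, and power factor is ignored.

```python
def breaker_capacity_w(voltage_v, current_a):
    """Power a single-phase circuit can carry, in watts (power factor ignored)."""
    return voltage_v * current_a


def row_is_within_budget(rack_loads_w, paid_per_rack_w, breaker_limit_w):
    """True if every rack stays under its breaker limit and the whole row
    stays under the total power paid for (billing per row is an assumption)."""
    under_breakers = all(load <= breaker_limit_w for load in rack_loads_w)
    under_row_total = sum(rack_loads_w) <= paid_per_rack_w * len(rack_loads_w)
    return under_breakers and under_row_total


capacity = breaker_capacity_w(230, 32)  # 7360 W, i.e. roughly 7 kW

# Hypothetical row of 10 racks: one at 6.9 kW, one at 5.1 kW, the rest at 6 kW.
row = [6900, 5100] + [6000] * 8
print(capacity, row_is_within_budget(row, 6000, 7000))
```

The uneven pair of racks passes because the row total still equals 10 × 6kW and neither rack exceeds the breaker limit.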

Usually, our primary goal is to minimise spending. The best criterion is the total cost of ownership (TCO). It consists of the following parts:

CAPEX: buying data centre infrastructure, servers, network devices, cabling
OPEX: DC rent, electricity consumption, maintenance. OPEX depends on the lifetime; it is reasonable to assume a lifetime of 3 years.

We should optimise the most expensive parts of the pie. Everything else should use the remaining resources as effectively as possible.

Suppose we have an existing DC with a rack height of H units (for example, H=47) and power per rack P_rack (P_rack=6kW), and we have decided to use two-unit servers (h=2U). Let's reserve 2 to 4 units in the rack for switches, patch panels and cable managers. Then we can physically fit S_h = rounddown((H-2..4)/h) servers in a rack (i.e. S_h = rounddown((47-4)/2) = 21 servers per rack). Let's remember S_h.

In the simple case, all the servers are the same. So, if we fill a rack with servers, each server can use an average power of P_serv = P_rack/S_h (P_serv = 6000W/21 ≈ 286W). We ignore switch power consumption here.
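As a minimal sketch of these two steps (the function name and variable names are my own, and the 4 reserved units are the upper end of the 2 to 4 range mentioned above):

```python
def servers_by_height(rack_units, reserved_units, server_units):
    """How many servers physically fit after reserving space for network gear."""
    return (rack_units - reserved_units) // server_units


s_h = servers_by_height(rack_units=47, reserved_units=4, server_units=2)

# Average power available per server if the rack is filled to the top.
p_rack_w = 6000
p_serv_w = p_rack_w / s_h

print(s_h, round(p_serv_w, 1))  # 21 servers, about 285.7 W each
```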

Let's step aside and define the maximum server power consumption P_max. The straightforward, completely safe and highly inefficient way is to read the label on the server's power supply unit. That is P_max.

A more complicated and more efficient approach is to take the TDP of all the components and sum them up. It's not accurate, but it is workable.

Usually, we don't know the TDP of components other than the CPU. So, the most accurate and most complicated approach is to take a test server in a representative configuration, load it with, for example, Linpack (CPU and memory) and fio (disks), and measure the power consumption. We need a laboratory for this. If we take things seriously, we should also create a warm environment in the cold aisle, because higher temperature affects both fan and CPU power consumption. Thus, we get the maximum power consumption of the sample server in this particular configuration, in this environment, under this specific load. Just keep in mind that new firmware, a new software version and other conditions may affect the result.

Now, let's return to P_serv and how we should compare it with P_max. It's a question of understanding how the services work and how strong our CTO's nerves are.

If we accept no risk, we should assume that all the servers might start consuming their potential maximum simultaneously. At the same time, one of the DC power feeds may fail as well. The infrastructure should still provide the service. So, P_serv ≡ P_max. This is the approach when reliability is of the highest importance.

If the CIO takes into account not only ideal safety but also the company's money, and if he is brave enough, then he can decide that

we start to manage our vendors, in particular, we forbid any planned maintenance during our expected high-load periods to minimise the risk of power failure
and/or our architecture allows us to lose a rack/row/DC while the services continue to operate
and/or we distribute the load across the racks horizontally so well that the servers in a single cabinet will never all consume their theoretical maximum at the same time.

It pays not to guess here but to monitor power consumption and understand how the servers consume power under usual and peak loads. So, after some analysis, the CIO strains and declares:
«I command that the maximum achievable average of all the servers' maximum power consumption is this much lower than a single server's maximum consumption». Let it be P_serv = 0.8 * P_max.

And then a 6kW rack can accommodate not 16 servers of P_max = 375W but 20 servers of P_serv = 375W * 0.8 = 300W, i.e. 25% more servers. It's a real saving because we need 20% fewer racks, and we also save on rack PDUs, switches and cabling. A serious disadvantage of this solution is the need to check continuously that our assumptions are still valid. We should make sure that new firmware doesn't change fan operation and power consumption significantly, and that the development team hasn't started to use the servers much more efficiently (meaning they managed to increase utilisation and, with it, power consumption). Otherwise both the initial assumptions and the conclusions become wrong. So, this is a risk to be accepted responsibly. Or the risk can be avoided, and then the company pays for obviously underloaded racks.
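The effect of the derating factor on rack capacity can be sketched as follows. The function and its `derating_percent` parameter are illustrative names; the factor is expressed as an integer percentage only to keep the arithmetic exact.

```python
def servers_by_power(p_rack_w, p_max_w, derating_percent=100):
    """Servers a rack can feed when each is budgeted at
    p_max_w * derating_percent / 100 watts (integer arithmetic)."""
    return (p_rack_w * 100) // (p_max_w * derating_percent)


no_risk = servers_by_power(6000, 375)           # P_serv = P_max
accepted_risk = servers_by_power(6000, 375, 80)  # P_serv = 0.8 * P_max

extra_percent = (accepted_risk - no_risk) * 100 // no_risk
print(no_risk, accepted_risk, extra_percent)  # 16 servers, 20 servers, 25% more
```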

An important note: it's worth trying to distribute the servers of different services across the racks horizontally if possible. This avoids the case when a batch of servers for one service arrives and is installed into cabinets vertically to improve «density» (just because it's easier to do it that way). In practice, it leads to a situation where one rack is filled with identical low-load servers while all the highly loaded ones reside in another. When the load profile is the same and all the servers in a rack start to consume a lot of power simultaneously under high load, the probability of losing that second rack becomes much higher.
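One simple way to get such a horizontal layout is round-robin placement. The sketch below is only an illustration of the idea: the service names, rack names and the function itself are hypothetical, and it does not check space, power or port limits.

```python
from collections import defaultdict
from itertools import cycle


def spread_horizontally(service_counts, rack_names):
    """Assign servers to racks round-robin so that each service is
    spread across the racks instead of filling one rack top to bottom."""
    placement = defaultdict(list)
    racks = cycle(rack_names)
    for service, count in service_counts.items():
        for _ in range(count):
            placement[next(racks)].append(service)
    return dict(placement)


# Hypothetical batch: 4 heavily loaded "db" servers and 4 lightly loaded "web" ones.
placement = spread_horizontally({"db": 4, "web": 4}, ["rack-1", "rack-2"])
for rack, servers in placement.items():
    print(rack, servers)
```

Each rack ends up with a mix of both services, so a load spike in one service is shared between the cabinets.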

Let's come back to server distribution in the racks. We have considered the physical constraints of the cabinets and the power limitations. Now let's consider the network. One can use switches with N=24/32/48 ports (we assume 48-port ToR switches here). Fortunately, there are not many options if we ignore break-out cables. We consider a switch in every single rack, or one switch per group of two or three cabinets (R_net is the number of racks per group). I believe the group shouldn't be larger than three; otherwise, it leads to cabling issues.

So, we distribute servers across the racks for each network scenario (1, 2, or 3 racks per group):

S_rack = min(S_h, rounddown(P_rack / P_serv), rounddown(N / R_net))

Thus, for the scenario with a group of two racks:

S_rack² = min(21, rounddown(6000/300), rounddown(48/2)) = min(21, 20, 24) = 20 servers per rack

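Similarly, we calculate the other scenarios:

S_rack¹ = min(21, rounddown(6000/300), rounddown(48/1)) = min(21, 20, 48) = 20

S_rack³ = min(21, rounddown(6000/300), rounddown(48/3)) = min(21, 20, 16) = 16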
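The S_rack formula translates directly into code. This is a minimal sketch with my own function and parameter names, using the example values from the text (S_h = 21, 6kW racks, 300W per server, 48-port switches):

```python
def servers_per_rack(s_h, p_rack_w, p_serv_w, switch_ports, r_net):
    """Servers per rack limited by space, power and switch ports
    shared between r_net racks."""
    by_power = p_rack_w // p_serv_w
    by_ports = switch_ports // r_net
    return min(s_h, by_power, by_ports)


s_rack = {r_net: servers_per_rack(21, 6000, 300, 48, r_net) for r_net in (1, 2, 3)}
print(s_rack)  # power limits scenarios 1 and 2, switch ports limit scenario 3
```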

We are almost done. We should calculate the total number of racks R needed to place all S servers (let there be 1000 servers):

R = roundup(S / (S_rack * R_net)) * R_net

R₁ = roundup(1000 / (20 * 1)) * 1 = 50 * 1 = 50 racks

R₂ = roundup(1000 / (20 * 2)) * 2 = 25 * 2 = 50 racks

R₃ = roundup(1000 / (16 * 3)) * 3 = 21 * 3 = 63 racks
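The rack count per scenario can be reproduced with a few lines. The function name is illustrative; the rounding up is done with integer arithmetic so that racks are always added in whole switch groups:

```python
def racks_needed(total_servers, s_rack, r_net):
    """Total racks, rounded up to whole groups of r_net racks per switch."""
    servers_per_group = s_rack * r_net
    groups = -(-total_servers // servers_per_group)  # integer ceiling division
    return groups * r_net


racks = {
    1: racks_needed(1000, 20, 1),
    2: racks_needed(1000, 20, 2),
    3: racks_needed(1000, 16, 3),
}
print(racks)
```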

Then we should calculate the TCO for each option based on the number of racks, required switches, cabling, etc. We choose the scenario with the lowest TCO. Profit!

Please note that although the number of racks is the same for scenarios 1 and 2, their TCO differs: the second scenario needs half as many switches but longer cables.
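A rough sketch of such a comparison is shown below. All prices are made-up placeholders chosen only to make the example run, not real quotes, and the cost model is deliberately simplified: it leaves out the servers and their electricity, which are the same in every scenario, and assumes cabling cost per server grows with the group size.

```python
# All prices are hypothetical placeholders, NOT real market figures.
HYPOTHETICAL_PRICES = {
    "switch": 5000,                               # per ToR switch
    "rack_rent_per_month": 1000,                  # per rack
    "cabling_per_server": {1: 10, 2: 15, 3: 20},  # longer cables for bigger groups
}


def scenario_tco(racks, r_net, servers, prices, lifetime_months=36):
    """Simplified TCO: switches and cabling (CAPEX) plus rack rent (OPEX)."""
    switches = racks // r_net
    capex = switches * prices["switch"] + servers * prices["cabling_per_server"][r_net]
    opex = racks * prices["rack_rent_per_month"] * lifetime_months
    return capex + opex


racks_by_scenario = {1: 50, 2: 50, 3: 63}  # racks per switch group -> total racks
tco_by_scenario = {
    r_net: scenario_tco(racks, r_net, 1000, HYPOTHETICAL_PRICES)
    for r_net, racks in racks_by_scenario.items()
}
best = min(tco_by_scenario, key=tco_by_scenario.get)
print(tco_by_scenario, best)
```

With these placeholder prices the two-rack group wins, but the ranking depends entirely on the real prices you plug in.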

PS If power per rack or rack height may vary, the number of variants grows. But the selection can still be reduced to the method above by brute-forcing the options. There will be more scenarios, but their number is limited. We can increase power per rack in steps of 1kW, and there is a limited number of standard rack heights: 42U, 45U, 47U, 48U. Excel's What-If Analysis in Data Table mode might be helpful here. We then look at the resulting table and select the best option.
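The same brute force can be sketched in Python instead of a spreadsheet. The 5 to 7kW power range, the default parameter values and the function name are assumptions for illustration; the output is only the rack count, which would still have to be fed into a TCO calculation.

```python
from itertools import product


def racks_for(total_servers, rack_units, p_rack_w, r_net,
              reserved_units=4, server_units=2, p_serv_w=300, switch_ports=48):
    """Racks needed for one combination of rack height, rack power and group size."""
    s_rack = min(
        (rack_units - reserved_units) // server_units,  # space
        p_rack_w // p_serv_w,                           # power
        switch_ports // r_net,                          # switch ports
    )
    groups = -(-total_servers // (s_rack * r_net))      # integer ceiling division
    return groups * r_net


heights = (42, 45, 47, 48)        # rack heights in units
powers = range(5000, 8000, 1000)  # 5, 6 and 7 kW (assumed range)
groups = (1, 2, 3)                # racks per switch

options = {
    (h, p, r): racks_for(1000, h, p, r)
    for h, p, r in product(heights, powers, groups)
}
for (h, p, r), racks in sorted(options.items(), key=lambda item: item[1])[:5]:
    print(f"{h}U, {p}W, {r} rack(s) per switch: {racks} racks")
```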