Understanding NUMA: Non-Uniform Memory Access
While benchmarking memory on your computer systems/VMs, an important term you will come across is NUMA, or Non-Uniform Memory Access, along with the latency matrices measured around it. So what is NUMA, and how can you better understand it? For the purpose of this discussion, assume the following SKU for the system/VM under consideration:
16 vCPUs/16 cores with 256 GiB of RAM, running RHEL/Linux
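If you want to confirm these numbers on your own VM, the stock Linux tools are enough; a quick sketch (generic commands, not tied to any particular distribution):

# Number of logical CPUs (vCPUs) visible to the OS
$ nproc

# Installed memory, in GiB
$ free -g

# NUMA summary from the CPU topology (node count and per-node CPU lists)
$ lscpu | grep -i numa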
Consider the following memory map/diagrammatic representation of this system:
As represented in Diagram 1, this is a typical dual-domain NUMA architecture. The CPU cores, denoted C1-C16, are split so that C1-C8 belong to NUMA domain 0 and C9-C16 to NUMA domain 1, each connected to its respective RAM (denoted RAM 0 and RAM 1 for simplicity).
The cores are connected to their RAM through a memory controller (MC). When a core accesses memory across NUMA domains, it does so over the interconnect bus; this is termed "Remote Access". Accessing memory within the same NUMA domain is called "Local Access". For example, if C1 accesses a memory address in RAM 1, the request must traverse the interconnect bus and go through MC 1 to reach RAM 1, which is a "Remote Access"; if C1 accesses an address in RAM 0 through MC 0, that is a "Local Access".
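You can observe this distinction directly by pinning a workload's CPUs and memory with numactl. A minimal sketch, where ./your_workload is just a placeholder for whatever you want to run:

# "Local Access": run on node 0 cores and satisfy all allocations from node 0 RAM
$ numactl --cpunodebind=0 --membind=0 ./your_workload

# "Remote Access": run on node 0 cores but force all allocations onto node 1 RAM,
# so every memory reference crosses the interconnect bus
$ numactl --cpunodebind=0 --membind=1 ./your_workload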
As you might imagine, "Remote Access" incurs additional latency, since the request must traverse the interconnect bus; this latency is nearly twice that of a "Local Access" call.
Using the Intel® Memory Latency Checker (MLC), you can measure the latency of your "Local Access" vs "Remote Access" calls:
Intel(R) Memory Latency Checker - v3.9
Measuring idle latencies (in ns)...
                Numa node
Numa node            0       1
       0         118.4   242.5
       1         242.4   117.5
As seen above, the idle latency of "Remote Access" across NUMA nodes (0->1 or 1->0) is roughly twice that of "Local Access" (0->0 or 1->1).
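For reference, the idle latency matrix above comes from MLC's latency-matrix mode; a sketch of the invocation (MLC generally wants root so it can disable the hardware prefetchers for accurate numbers):

# Idle memory latency from each NUMA node to every other node
$ sudo ./mlc --latency_matrix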
Similarly, "Remote Access" achieves only about a quarter of the memory bandwidth of "Local Access":
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1
       0       65002.8 15063.9
       1       15011.8 64917.3
Likewise, the L2 HIT/HITM latency for "Local Access" is roughly one-third of that for "Remote Access":
Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency 36.5
Local Socket L2->L2 HITM latency 39.8
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
                 Reader Numa Node
Writer Numa Node     0       1
            0        -   103.3
            1    105.4       -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
                 Reader Numa Node
Writer Numa Node     0       1
            0        -   204.6
            1    202.4       -
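Running mlc with no arguments produces all of the sections shown above; if you only want to reproduce individual measurements, a sketch of the corresponding modes:

# Memory bandwidth matrix between NUMA nodes (read-only traffic)
$ sudo ./mlc --bandwidth_matrix

# Cache-to-cache (L2 HIT/HITM) transfer latencies
$ sudo ./mlc --c2c_latency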
We see from the above that "Remote Access" carries a significant latency cost. Most modern server CPU architectures use a dual-socket NUMA design, spreading the cores proportionally across the two sockets.
One might then wonder: why not move to a single-socket architecture and eliminate the "Remote Access" latency altogether? Facebook did exactly that. Facebook worked with Intel to optimize its data center workloads for a single-socket CPU architecture based on the Intel Xeon D line, eliminating the "Remote Access" latency problem caused by dual NUMA sockets.
How do you find the number of NUMA nodes your VM uses?
There is a tool called numactl; running the command below gives you the distribution of your cores across the respective NUMA nodes.
Example (run against a 64-core machine):
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 524287 MB
node 0 free: 502307 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 1 size: 524288 MB
node 1 free: 505002 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
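Once you know the layout, numastat (usually shipped alongside numactl) is a quick way to check whether your workload is actually staying on local memory; a sketch, where <pid> is a placeholder for your process ID:

# System-wide per-node counters; growing numa_miss/numa_foreign values mean
# allocations are landing on a node other than the preferred one
$ numastat

# Per-process view of how a process's memory is spread across the NUMA nodes
$ numastat -p <pid>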