What is NUMA?
Non-Uniform Memory Access or Non-Uniform Memory Architecture (NUMA) is a physical memory design used in SMP (symmetric multiprocessing) architectures, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.
In a NUMA system, processors, memory, and I/O are grouped together into cells. The latency and bandwidth characteristics of communication within a cell are “fast” while going outside a cell is “slow”. Since the memory in NUMA systems is physically distributed but logically shared, these systems offer better performance to applications that are optimized to use their features.
For non-optimized applications, NUMA systems still offer good performance, since the default behavior is designed to be benign, if not beneficial, and such applications still have access to the much larger shared pool of memory, CPUs, and disk space.
In NUMA architecture, each processor has a ‘local’ bank of memory, to which it has much closer (lower latency) access. The whole system may still operate as one unit, and all memory is accessible from every processor, but at a potentially higher latency and lower performance.
Fundamentally, some memory locations (‘local’ ones) are faster, that is, cost less to access, than other locations (‘remote’ ones attached to other processors).
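One way to make this cost difference visible, assuming the numactl and sysbench packages are installed (neither is discussed elsewhere in this text, so treat this purely as a sketch), is to pin a memory-intensive workload to the CPUs of one node and compare a run that uses local memory with one forced onto remote memory:

# CPUs on node 0, memory from node 0 (local)
$ numactl --cpunodebind=0 --membind=0 sysbench memory --memory-total-size=8G run
# Same CPUs, but all memory forced onto node 1 (remote)
$ numactl --cpunodebind=0 --membind=1 sysbench memory --memory-total-size=8G run

On a system whose node distances differ, the first run (local memory) should report noticeably higher throughput than the second (remote memory).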
How Does Linux Handle NUMA?
Linux manages memory in zones. A NUMA node can have multiple zones since it may be able to serve multiple DMA areas. How Linux has arranged memory can be determined by looking at /proc/zoneinfo. The NUMA node association of the zones allows the kernel to make decisions involving the memory latency relative to cores.
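For example, the node-to-zone layout can be listed by filtering /proc/zoneinfo for its zone header lines; the output below is only illustrative, since the exact zones vary by machine:

$ grep "zone" /proc/zoneinfo
Node 0, zone      DMA
Node 0, zone    DMA32
Node 0, zone   Normal
Node 1, zone   Normal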
On boot-up, Linux will detect the organization of memory via the ACPI (Advanced Configuration and Power Interface) tables provided by the firmware and then create zones that map to the NUMA nodes and DMA areas as needed. Memory allocation then occurs from the zones.
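On many systems this detection can be observed in the kernel boot log, which reports the ACPI SRAT (System Resource Affinity Table) entries and the NUMA node memory ranges derived from them; the exact messages vary by kernel version:

# Show the firmware affinity information and NUMA node setup detected at boot
$ dmesg | grep -iE "srat|numa"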
Viewing NUMA information
The command numactl is used to run processes on specific NUMA nodes. There are numerous ways to view information about the NUMA characteristics of the system and of various processes currently running. The hardware NUMA configuration of a system can be viewed by using “numactl --hardware”. This includes a dump of the SLIT (System Locality Information Table), which shows the cost of accesses to different nodes in a NUMA system. The example below shows a NUMA system with two nodes. The distance for a local access is 10. A remote access costs twice as much on this system (20). An example of actually binding a process to a node follows the output.
# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 131026 MB
node 0 free: 588 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 131072 MB
node 1 free: 169 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10
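As a sketch of the process-placement use mentioned above (the application name here is a placeholder), numactl can confine both the CPUs and the memory of a new process to one node, and the policy such a process inherits can be checked by running numactl --show under the same binding:

# Run the (hypothetical) application with CPUs and memory confined to node 1
$ numactl --cpunodebind=1 --membind=1 ./myapp

# Verify what policy a process started this way inherits
$ numactl --cpunodebind=1 --membind=1 numactl --show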
numastat is another tool that shows how many allocations were satisfied from the local node. Of particular interest is the numa_miss counter, which indicates that the system assigned memory from a different node in order to avoid reclaim. These allocations also contribute to the other_node counter; the remainder of that count is intentional off-node allocations. The amount of off-node memory can be used as a guide to figure out how effectively memory was assigned to processes running on the system.
$ numastat
                           node0           node1
numa_hit             13273229839      4595119371
numa_miss             2104327350      6833844068
numa_foreign          6833844068      2104327350
interleave_hit             52991           52864
local_node           13273229554      4595091108
other_node            2104327635      6833872331
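The same kind of accounting can be narrowed to a single process. With the numastat shipped in recent numactl packages, a per-process breakdown is available, and /proc exposes the raw per-mapping placement (the PID below is a placeholder):

# Per-node memory usage of one running process
$ numastat -p <pid>

# Raw view of which node each of the process's mappings is allocated on
$ cat /proc/<pid>/numa_maps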