123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495 |
- .. _numa_memory_policy:
- ==================
- NUMA Memory Policy
- ==================
- What is NUMA Memory Policy?
- ============================
- In the Linux kernel, "memory policy" determines from which node the kernel will
- allocate memory in a NUMA system or in an emulated NUMA system. Linux has
- supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
- The current memory policy support was added to Linux 2.6 around May 2004. This
- document attempts to describe the concepts and APIs of the 2.6 memory policy
- support.
- Memory policies should not be confused with cpusets
- (``Documentation/cgroup-v1/cpusets.txt``)
- which is an administrative mechanism for restricting the nodes from which
- memory may be allocated by a set of processes. Memory policies are a
- programming interface that a NUMA-aware application can take advantage of. When
- both cpusets and policies are applied to a task, the restrictions of the cpuset
- takes priority. See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
- below for more details.
- Memory Policy Concepts
- ======================
- Scope of Memory Policies
- ------------------------
- The Linux kernel supports _scopes_ of memory policy, described here from
- most general to most specific:
- System Default Policy
- this policy is "hard coded" into the kernel. It is the policy
- that governs all page allocations that aren't controlled by
- one of the more specific policy scopes discussed below. When
- the system is "up and running", the system default policy will
- use "local allocation" described below. However, during boot
- up, the system default policy will be set to interleave
- allocations across all nodes with "sufficient" memory, so as
- not to overload the initial boot node with boot-time
- allocations.
- Task/Process Policy
- this is an optional, per-task policy. When defined for a
- specific task, this policy controls all page allocations made
- by or on behalf of the task that aren't controlled by a more
- specific scope. If a task does not define a task policy, then
- all page allocations that would have been controlled by the
- task policy "fall back" to the System Default Policy.
- The task policy applies to the entire address space of a task. Thus,
- it is inheritable, and indeed is inherited, across both fork()
- [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task
- to establish the task policy for a child task exec()'d from an
- executable image that has no awareness of memory policy. See the
- :ref:`Memory Policy APIs <memory_policy_apis>` section,
- below, for an overview of the system call
- that a task may use to set/change its task/process policy.
- In a multi-threaded task, task policies apply only to the thread
- [Linux kernel task] that installs the policy and any threads
- subsequently created by that thread. Any sibling threads existing
- at the time a new task policy is installed retain their current
- policy.
- A task policy applies only to pages allocated after the policy is
- installed. Any pages already faulted in by the task when the task
- changes its task policy remain where they were allocated based on
- the policy at the time they were allocated.
- .. _vma_policy:
- VMA Policy
- A "VMA" or "Virtual Memory Area" refers to a range of a task's
- virtual address space. A task may define a specific policy for a range
- of its virtual address space. See the
- :ref:`Memory Policy APIs <memory_policy_apis>` section,
- below, for an overview of the mbind() system call used to set a VMA
- policy.
- A VMA policy will govern the allocation of pages that back
- this region of the address space. Any regions of the task's
- address space that don't have an explicit VMA policy will fall
- back to the task policy, which may itself fall back to the
- System Default Policy.
- VMA policies have a few complicating details:
- * VMA policy applies ONLY to anonymous pages. These include
- pages allocated for anonymous segments, such as the task
- stack and heap, and any regions of the address space
- mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is
- applied to a file mapping, it will be ignored if the mapping
- used the MAP_SHARED flag. If the file mapping used the
- MAP_PRIVATE flag, the VMA policy will only be applied when
- an anonymous page is allocated on an attempt to write to the
- mapping-- i.e., at Copy-On-Write.
- * VMA policies are shared between all tasks that share a
- virtual address space--a.k.a. threads--independent of when
- the policy is installed; and they are inherited across
- fork(). However, because VMA policies refer to a specific
- region of a task's address space, and because the address
- space is discarded and recreated on exec*(), VMA policies
- are NOT inheritable across exec(). Thus, only NUMA-aware
- applications may use VMA policies.
- * A task may install a new VMA policy on a sub-range of a
- previously mmap()ed region. When this happens, Linux splits
- the existing virtual memory area into 2 or 3 VMAs, each with
- it's own policy.
- * By default, VMA policy applies only to pages allocated after
- the policy is installed. Any pages already faulted into the
- VMA range remain where they were allocated based on the
- policy at the time they were allocated. However, since
- 2.6.16, Linux supports page migration via the mbind() system
- call, so that page contents can be moved to match a newly
- installed policy.
- Shared Policy
- Conceptually, shared policies apply to "memory objects" mapped
- shared into one or more tasks' distinct address spaces. An
- application installs shared policies the same way as VMA
- policies--using the mbind() system call specifying a range of
- virtual addresses that map the shared object. However, unlike
- VMA policies, which can be considered to be an attribute of a
- range of a task's address space, shared policies apply
- directly to the shared object. Thus, all tasks that attach to
- the object share the policy, and all pages allocated for the
- shared object, by any task, will obey the shared policy.
- As of 2.6.22, only shared memory segments, created by shmget() or
- mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
- policy support was added to Linux, the associated data structures were
- added to hugetlbfs shmem segments. At the time, hugetlbfs did not
- support allocation at fault time--a.k.a lazy allocation--so hugetlbfs
- shmem segments were never "hooked up" to the shared policy support.
- Although hugetlbfs segments now support lazy allocation, their support
- for shared policy has not been completed.
- As mentioned above in :ref:`VMA policies <vma_policy>` section,
- allocations of page cache pages for regular files mmap()ed
- with MAP_SHARED ignore any VMA policy installed on the virtual
- address range backed by the shared file mapping. Rather,
- shared page cache pages, including pages backing private
- mappings that have not yet been written by the task, follow
- task policy, if any, else System Default Policy.
- The shared policy infrastructure supports different policies on subset
- ranges of the shared object. However, Linux still splits the VMA of
- the task that installs the policy for each range of distinct policy.
- Thus, different tasks that attach to a shared memory segment can have
- different VMA configurations mapping that one shared object. This
- can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
- a shared memory region, when one task has installed shared policy on
- one or more ranges of the region.
- Components of Memory Policies
- -----------------------------
- A NUMA memory policy consists of a "mode", optional mode flags, and
- an optional set of nodes. The mode determines the behavior of the
- policy, the optional mode flags determine the behavior of the mode,
- and the optional set of nodes can be viewed as the arguments to the
- policy behavior.
- Internally, memory policies are implemented by a reference counted
- structure, struct mempolicy. Details of this structure will be
- discussed in context, below, as required to explain the behavior.
- NUMA memory policy supports the following 4 behavioral modes:
- Default Mode--MPOL_DEFAULT
- This mode is only used in the memory policy APIs. Internally,
- MPOL_DEFAULT is converted to the NULL memory policy in all
- policy scopes. Any existing non-default policy will simply be
- removed when MPOL_DEFAULT is specified. As a result,
- MPOL_DEFAULT means "fall back to the next most specific policy
- scope."
- For example, a NULL or default task policy will fall back to the
- system default policy. A NULL or default vma policy will fall
- back to the task policy.
- When specified in one of the memory policy APIs, the Default mode
- does not use the optional set of nodes.
- It is an error for the set of nodes specified for this policy to
- be non-empty.
- MPOL_BIND
- This mode specifies that memory must come from the set of
- nodes specified by the policy. Memory will be allocated from
- the node in the set with sufficient free memory that is
- closest to the node where the allocation takes place.
- MPOL_PREFERRED
- This mode specifies that the allocation should be attempted
- from the single node specified in the policy. If that
- allocation fails, the kernel will search other nodes, in order
- of increasing distance from the preferred node based on
- information provided by the platform firmware.
- Internally, the Preferred policy uses a single node--the
- preferred_node member of struct mempolicy. When the internal
- mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
- and the policy is interpreted as local allocation. "Local"
- allocation policy can be viewed as a Preferred policy that
- starts at the node containing the cpu where the allocation
- takes place.
- It is possible for the user to specify that local allocation
- is always preferred by passing an empty nodemask with this
- mode. If an empty nodemask is passed, the policy cannot use
- the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
- described below.
- MPOL_INTERLEAVED
- This mode specifies that page allocations be interleaved, on a
- page granularity, across the nodes specified in the policy.
- This mode also behaves slightly differently, based on the
- context where it is used:
- For allocation of anonymous pages and shared memory pages,
- Interleave mode indexes the set of nodes specified by the
- policy using the page offset of the faulting address into the
- segment [VMA] containing the address modulo the number of
- nodes specified by the policy. It then attempts to allocate a
- page, starting at the selected node, as if the node had been
- specified by a Preferred policy or had been selected by a
- local allocation. That is, allocation will follow the per
- node zonelist.
- For allocation of page cache pages, Interleave mode indexes
- the set of nodes specified by the policy using a node counter
- maintained per task. This counter wraps around to the lowest
- specified node after it reaches the highest specified node.
- This will tend to spread the pages out over the nodes
- specified by the policy based on the order in which they are
- allocated, rather than based on any page offset into an
- address range or file. During system boot up, the temporary
- interleaved system default policy works in this mode.
- NUMA memory policy supports the following optional mode flags:
- MPOL_F_STATIC_NODES
- This flag specifies that the nodemask passed by
- the user should not be remapped if the task or VMA's set of allowed
- nodes changes after the memory policy has been defined.
- Without this flag, any time a mempolicy is rebound because of a
- change in the set of allowed nodes, the node (Preferred) or
- nodemask (Bind, Interleave) is remapped to the new set of
- allowed nodes. This may result in nodes being used that were
- previously undesired.
- With this flag, if the user-specified nodes overlap with the
- nodes allowed by the task's cpuset, then the memory policy is
- applied to their intersection. If the two sets of nodes do not
- overlap, the Default policy is used.
- For example, consider a task that is attached to a cpuset with
- mems 1-3 that sets an Interleave policy over the same set. If
- the cpuset's mems change to 3-5, the Interleave will now occur
- over nodes 3, 4, and 5. With this flag, however, since only node
- 3 is allowed from the user's nodemask, the "interleave" only
- occurs over that node. If no nodes from the user's nodemask are
- now allowed, the Default behavior is used.
- MPOL_F_STATIC_NODES cannot be combined with the
- MPOL_F_RELATIVE_NODES flag. It also cannot be used for
- MPOL_PREFERRED policies that were created with an empty nodemask
- (local allocation).
- MPOL_F_RELATIVE_NODES
- This flag specifies that the nodemask passed
- by the user will be mapped relative to the set of the task or VMA's
- set of allowed nodes. The kernel stores the user-passed nodemask,
- and if the allowed nodes changes, then that original nodemask will
- be remapped relative to the new set of allowed nodes.
- Without this flag (and without MPOL_F_STATIC_NODES), anytime a
- mempolicy is rebound because of a change in the set of allowed
- nodes, the node (Preferred) or nodemask (Bind, Interleave) is
- remapped to the new set of allowed nodes. That remap may not
- preserve the relative nature of the user's passed nodemask to its
- set of allowed nodes upon successive rebinds: a nodemask of
- 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
- allowed nodes is restored to its original state.
- With this flag, the remap is done so that the node numbers from
- the user's passed nodemask are relative to the set of allowed
- nodes. In other words, if nodes 0, 2, and 4 are set in the user's
- nodemask, the policy will be effected over the first (and in the
- Bind or Interleave case, the third and fifth) nodes in the set of
- allowed nodes. The nodemask passed by the user represents nodes
- relative to task or VMA's set of allowed nodes.
- If the user's nodemask includes nodes that are outside the range
- of the new set of allowed nodes (for example, node 5 is set in
- the user's nodemask when the set of allowed nodes is only 0-3),
- then the remap wraps around to the beginning of the nodemask and,
- if not already set, sets the node in the mempolicy nodemask.
- For example, consider a task that is attached to a cpuset with
- mems 2-5 that sets an Interleave policy over the same set with
- MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
- interleave now occurs over nodes 3,5-7. If the cpuset's mems
- then change to 0,2-3,5, then the interleave occurs over nodes
- 0,2-3,5.
- Thanks to the consistent remapping, applications preparing
- nodemasks to specify memory policies using this flag should
- disregard their current, actual cpuset imposed memory placement
- and prepare the nodemask as if they were always located on
- memory nodes 0 to N-1, where N is the number of memory nodes the
- policy is intended to manage. Let the kernel then remap to the
- set of memory nodes allowed by the task's cpuset, as that may
- change over time.
- MPOL_F_RELATIVE_NODES cannot be combined with the
- MPOL_F_STATIC_NODES flag. It also cannot be used for
- MPOL_PREFERRED policies that were created with an empty nodemask
- (local allocation).
- Memory Policy Reference Counting
- ================================
- To resolve use/free races, struct mempolicy contains an atomic reference
- count field. Internal interfaces, mpol_get()/mpol_put() increment and
- decrement this reference count, respectively. mpol_put() will only free
- the structure back to the mempolicy kmem cache when the reference count
- goes to zero.
- When a new memory policy is allocated, its reference count is initialized
- to '1', representing the reference held by the task that is installing the
- new policy. When a pointer to a memory policy structure is stored in another
- structure, another reference is added, as the task's reference will be dropped
- on completion of the policy installation.
- During run-time "usage" of the policy, we attempt to minimize atomic operations
- on the reference count, as this can lead to cache lines bouncing between cpus
- and NUMA nodes. "Usage" here means one of the following:
- 1) querying of the policy, either by the task itself [using the get_mempolicy()
- API discussed below] or by another task using the /proc/<pid>/numa_maps
- interface.
- 2) examination of the policy to determine the policy mode and associated node
- or node lists, if any, for page allocation. This is considered a "hot
- path". Note that for MPOL_BIND, the "usage" extends across the entire
- allocation process, which may sleep during page reclaimation, because the
- BIND policy nodemask is used, by reference, to filter ineligible nodes.
- We can avoid taking an extra reference during the usages listed above as
- follows:
- 1) we never need to get/free the system default policy as this is never
- changed nor freed, once the system is up and running.
- 2) for querying the policy, we do not need to take an extra reference on the
- target task's task policy nor vma policies because we always acquire the
- task's mm's mmap_sem for read during the query. The set_mempolicy() and
- mbind() APIs [see below] always acquire the mmap_sem for write when
- installing or replacing task or vma policies. Thus, there is no possibility
- of a task or thread freeing a policy while another task or thread is
- querying it.
- 3) Page allocation usage of task or vma policy occurs in the fault path where
- we hold them mmap_sem for read. Again, because replacing the task or vma
- policy requires that the mmap_sem be held for write, the policy can't be
- freed out from under us while we're using it for page allocation.
- 4) Shared policies require special consideration. One task can replace a
- shared memory policy while another task, with a distinct mmap_sem, is
- querying or allocating a page based on the policy. To resolve this
- potential race, the shared policy infrastructure adds an extra reference
- to the shared policy during lookup while holding a spin lock on the shared
- policy management structure. This requires that we drop this extra
- reference when we're finished "using" the policy. We must drop the
- extra reference on shared policies in the same query/allocation paths
- used for non-shared policies. For this reason, shared policies are marked
- as such, and the extra reference is dropped "conditionally"--i.e., only
- for shared policies.
- Because of this extra reference counting, and because we must lookup
- shared policies in a tree structure under spinlock, shared policies are
- more expensive to use in the page allocation path. This is especially
- true for shared policies on shared memory regions shared by tasks running
- on different NUMA nodes. This extra overhead can be avoided by always
- falling back to task or system default policy for shared memory regions,
- or by prefaulting the entire shared memory region into memory and locking
- it down. However, this might not be appropriate for all applications.
- .. _memory_policy_apis:
- Memory Policy APIs
- ==================
- Linux supports 3 system calls for controlling memory policy. These APIS
- always affect only the calling task, the calling task's address space, or
- some shared object mapped into the calling task's address space.
- .. note::
- the headers that define these APIs and the parameter data types for
- user space applications reside in a package that is not part of the
- Linux kernel. The kernel system call interfaces, with the 'sys\_'
- prefix, are defined in <linux/syscalls.h>; the mode and flag
- definitions are defined in <linux/mempolicy.h>.
- Set [Task] Memory Policy::
- long set_mempolicy(int mode, const unsigned long *nmask,
- unsigned long maxnode);
- Set's the calling task's "task/process memory policy" to mode
- specified by the 'mode' argument and the set of nodes defined by
- 'nmask'. 'nmask' points to a bit mask of node ids containing at least
- 'maxnode' ids. Optional mode flags may be passed by combining the
- 'mode' argument with the flag (for example: MPOL_INTERLEAVE |
- MPOL_F_STATIC_NODES).
- See the set_mempolicy(2) man page for more details
- Get [Task] Memory Policy or Related Information::
- long get_mempolicy(int *mode,
- const unsigned long *nmask, unsigned long maxnode,
- void *addr, int flags);
- Queries the "task/process memory policy" of the calling task, or the
- policy or location of a specified virtual address, depending on the
- 'flags' argument.
- See the get_mempolicy(2) man page for more details
- Install VMA/Shared Policy for a Range of Task's Address Space::
- long mbind(void *start, unsigned long len, int mode,
- const unsigned long *nmask, unsigned long maxnode,
- unsigned flags);
- mbind() installs the policy specified by (mode, nmask, maxnodes) as a
- VMA policy for the range of the calling task's address space specified
- by the 'start' and 'len' arguments. Additional actions may be
- requested via the 'flags' argument.
- See the mbind(2) man page for more details.
- Memory Policy Command Line Interface
- ====================================
- Although not strictly part of the Linux implementation of memory policy,
- a command line tool, numactl(8), exists that allows one to:
- + set the task policy for a specified program via set_mempolicy(2), fork(2) and
- exec(2)
- + set the shared policy for a shared memory segment via mbind(2)
- The numactl(8) tool is packaged with the run-time version of the library
- containing the memory policy system call wrappers. Some distributions
- package the headers and compile-time libraries in a separate development
- package.
- .. _mem_pol_and_cpusets:
- Memory Policies and cpusets
- ===========================
- Memory policies work within cpusets as described above. For memory policies
- that require a node or set of nodes, the nodes are restricted to the set of
- nodes whose memories are allowed by the cpuset constraints. If the nodemask
- specified for the policy contains nodes that are not allowed by the cpuset and
- MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
- specified for the policy and the set of nodes with memory is used. If the
- result is the empty set, the policy is considered invalid and cannot be
- installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
- onto and folded into the task's set of allowed nodes as previously described.
- The interaction of memory policies and cpusets can be problematic when tasks
- in two cpusets share access to a memory region, such as shared memory segments
- created by shmget() of mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and
- any of the tasks install shared policy on the region, only nodes whose
- memories are allowed in both cpusets may be used in the policies. Obtaining
- this information requires "stepping outside" the memory policy APIs to use the
- cpuset information and requires that one know in what cpusets other task might
- be attaching to the shared region. Furthermore, if the cpusets' allowed
- memory sets are disjoint, "local" allocation is the only valid policy.
|