Topology-Aware Policy
Overview
What Problems Does the Topology-Aware Policy Solve?
On server-grade hardware the CPU cores, I/O devices and other peripherals form a rather complex network together with the memory controllers, the I/O bus hierarchy and the CPU interconnect. When a combination of these resources is allocated to a single workload, the performance of that workload can vary greatly, depending on how efficiently data is transferred between them or, in other words, on how well the resources are aligned.
There are a number of inherent architectural hardware properties that, unless properly taken into account, can cause resource misalignment and workload performance degradation. There are a multitude of CPU cores available to run workloads. There are a multitude of memory controllers these workloads can use to store and retrieve data from main memory. There are a multitude of I/O devices attached to a number of I/O buses the same workloads can access. The CPU cores can be divided into a number of groups, with each group having different access latency and bandwidth to each memory controller and I/O device.
If a workload is not assigned to run with a properly aligned set of CPU, memory and devices, it will not be able to achieve optimal performance. Given the idiosyncrasies of hardware, allocating a properly aligned set of resources for optimal workload performance requires identifying and understanding the multiple dimensions of access latency locality present in hardware or, in other words, hardware topology awareness.
The topology-aware policy addresses these challenges by:
Hardware topology awareness: Automatically builds a tree of pools based on detected CPU physical hardware topology (sockets and dies) and logical memory hardware topology (NUMA nodes)
Aligned resource allocation: Assigns CPU, memory, and devices with optimal topological alignment
Multi-tier memory support: Handles DRAM, PMEM, and HBM memory types
Flexible CPU allocation: Supports shared, exclusive, and mixed CPU core assignments
Device locality: Considers device connection topology to CPU and memory when placing workloads
How the Topology-Aware Policy Works
The topology-aware policy automatically builds a tree of pools based on the
detected hardware topology. Each pool has a set of CPUs and memory zones
assigned as its resources. Resource allocation for workloads happens by
first picking the pool that is considered to best fit the resource
requirements of the workload and then assigning CPU and memory from that pool.
The pool nodes at various depths from bottom to top represent the L3 caches, NUMA nodes, dies, sockets, and finally the whole system at the root node. L3 cache pools group CPUs sharing the same L3 cache. Leaf NUMA node pools are assigned the memory behind their controllers / zones and the CPU cores with the smallest distance / access penalty to this memory. If the machine has multiple types of memory separately visible to both the kernel and user space, for instance both DRAM and PMEM, each zone of a special memory type is assigned to the closest NUMA node pool.
Each non-leaf pool node in the tree is assigned the union of the resources of its children. So in practice, die nodes end up containing all the CPU cores and memory zones in the corresponding die, socket nodes end up containing the CPU cores and memory zones in the corresponding socket's dies, and the root node ends up containing all CPU cores and memory zones in all sockets.
With this setup, each pool in the tree has a topologically aligned set of CPU and memory resources. The amount of available resources gradually increases in the tree from bottom to top, while the strictness of alignment is gradually relaxed. In other words, as one moves from bottom to top in the tree, it is getting gradually easier to fit in a workload, but the price paid for this is a gradually increasing maximum potential cost or penalty for memory access and data transfer between CPU cores.
Another property of this setup is that the resource sets of sibling pools at the same depth in the tree are disjoint, while the resource sets of descendant pools along the same path in the tree partially overlap, with the intersection decreasing as the distance between pools increases. This makes it easy to isolate workloads from each other. As long as workloads are assigned to pools which have no common ancestor other than the root, the resources of these workloads should be as well isolated from each other as possible on the given hardware.
With such an arrangement, this policy should handle topology-aware alignment of resources without any special or extra configuration. When allocating resources, the policy
filters out all pools with insufficient free capacity
runs a scoring algorithm for the remaining ones
picks the one with the best score
assigns resources to the workload from there
Although the details of the scoring algorithm are subject to change as the implementation evolves, its basic principles are roughly
prefer pools lower in the tree, IOW stricter alignment and lower latency
prefer idle pools over busy ones, IOW more remaining free capacity and fewer workloads
prefer pools with better overall device alignment
Key Features
The topology-aware policy has the following features:
Topologically aligned allocation of CPU and memory
Assign CPU and memory to workloads with tightest available alignment
Aligned allocation of devices
Pick pool for workload based on locality of devices already assigned
Shared allocation of CPU cores
Assign workload to shared subset of pool CPUs
Exclusive allocation of CPU cores
Dynamically slice off CPU cores from shared subset and assign to workload
Mixed allocation of CPU cores
Assign both exclusive and shared CPU cores to workload
Kernel-isolated CPU support (‘isolcpus’)
Use kernel-isolated CPU cores for exclusively assigned CPU cores
Resource exposure
Expose assigned resources to workloads and notify about changes
Dynamic memory relaxation
Dynamically widen workload memory set to avoid pool/workload OOM
Multi-tier memory allocation
Assign workloads to memory zones of their preferred type
Support for DRAM, PMEM (such as Intel® Optane™ memory), and HBM
Cold start support
Pin workload exclusively to PMEM for an initial warm-up period
Organization of this document
This document is organized as follows:
Installation and configuration describes installation and basic configuration
Configuration options covers detailed configuration options
Cookbook provides recipes for common use cases
Troubleshooting offers troubleshooting guidance
Integration with Kubernetes
The topology-aware policy integrates with Kubernetes through the Node Resource Interface (NRI). It uses its configuration stored as a Kubernetes Custom Resource together with optional policy-specific Pod annotations to control resource allocation behavior.
Installation and Configuration
Prerequisites
Kubernetes cluster with NRI-enabled container runtime
NRI plugins support enabled in the container runtime configuration
Installing with Helm
The topology-aware policy can be installed using Helm charts. Refer to the topology-aware Helm documentation for detailed instructions.
Managing Configuration with kubectl
The policy configuration can be managed using kubectl and the TopologyAware
dynamic configuration custom resource. Configuration
changes are applied dynamically without requiring pod restarts.
Example configuration commands:
# List all topology-aware policy configurations (in the kube-system namespace)
kubectl -n kube-system get topologyawarepolicies.config.nri
# View the default configuration
kubectl -n kube-system get topologyawarepolicies.config.nri/default -o yaml
# Edit/update the default configuration
kubectl -n kube-system edit topologyawarepolicies.config.nri/default
Replace kube-system with the namespace where the plugin is deployed.
Configuration Scopes
The topology-aware policy supports three levels of configuration precedence:
Default configuration (lowest precedence): Applies to all nodes without more specific configuration
Resource name: default
Group-specific configuration: Applies to nodes labeled with a configuration group
Resource name: group.$GROUP_NAME
Node label: config.nri/group=$GROUP_NAME
Node-specific configuration (highest precedence): Applies to a single named node
Resource name: node.$NODE_NAME
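As an illustrative sketch of the group-specific scope, a configuration resource for nodes labeled config.nri/group=hi-perf could look like the following (the group name is hypothetical; the kind and apiVersion follow the custom resource shown later in this document):

```yaml
apiVersion: config.nri/v1alpha1
kind: TopologyAwarePolicy
metadata:
  # group-specific resource, applies to nodes labeled config.nri/group=hi-perf
  name: group.hi-perf
  namespace: kube-system
spec:
  reservedResources:
    cpu: 750m
```

Nodes without the label keep using the default resource; labeled nodes pick up this one instead.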
Configuration Options
Policy-Level Settings
The following policy-level configuration options affect its default behavior. These options can be supplied as part of the effective dynamic configuration custom resource.
pinCPU: whether to pin workloads to assigned pool CPU sets
pinMemory: whether to pin workloads to assigned pool memory zones
preferIsolatedCPUs: whether isolated CPUs are preferred by default for workloads that are eligible for exclusive CPU allocation
preferSharedCPUs: whether shared allocation is preferred by default for workloads that would otherwise be eligible for exclusive CPU allocation
reservedPoolNamespaces: list of extra namespaces (or glob patterns) whose workloads will be allocated to reserved CPUs
colocatePods: whether to try to allocate containers in a pod to the same or nearby topology pools
colocateNamespaces: whether to try to allocate containers in a namespace to the same or nearby topology pools
defaultCPUPriority: the default CPU prioritization, used when a container has not been annotated with any other CPU preferences. The possible values are high, normal, low, and none. Currently this option only affects exclusive CPU allocations. For a more detailed discussion of CPU prioritization see the CPU allocator documentation.
unlimitedBurstable: the default topology level preference for containers with unlimited burstability. The policy will try to allocate Burstable containers with no CPU limit to a pool at this topology level. The possible values are system, package, die, numa, and l3cache.
schedulingClasses: defines the scheduling classes recognized by the policy. A scheduling class has the following set of associated Linux scheduling policy and I/O priority attributes:
name: the name of the scheduling class.
policy: the Linux scheduling policy. Supported policies are none, other, fifo, rr, batch, idle, and deadline.
priority: the scheduling priority. Refer to the sched_setscheduler(2) documentation for valid values depending on the policy.
flags: a list of scheduling flags. Supported flags are reset-on-fork, reclaim, dl-overrun, keep-policy, keep-params, util-clamp-min, and util-clamp-max.
nice: nice value for the container process.
runtime: runtime value for the deadline scheduling policy (in microseconds).
deadline: deadline value for the deadline scheduling policy (in microseconds).
period: period value for the deadline scheduling policy (in microseconds).
ioClass: I/O class for the container process. Supported classes are none, rt for realtime, be for best-effort, and idle.
ioPriority: I/O priority for the container process. Refer to the ionice(1) documentation for valid values.
These attributes are applied to containers which get assigned to the class. Use the scheduling-class.resource-policy.nri.io annotation key to annotate a pod or a container to a class.
namespaceSchedulingClasses: assigns default scheduling classes to namespaces. If a container is not annotated to use a specific scheduling class but its namespace has a default scheduling class, that class applies to the container.
podQoSSchedulingClasses: assigns default scheduling classes to Pod QoS classes. If a container is neither annotated to use a specific scheduling class nor its namespace has a default scheduling class, but its Pod QoS class has a default scheduling class, that class applies to the container.
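Putting some of these policy-level options together, a configuration fragment might look like the following sketch (the values are illustrative, and the assumption here is that these keys sit at the top level of the custom resource spec):

```yaml
spec:
  # pin workloads to their assigned pool CPU set and memory zones
  pinCPU: true
  pinMemory: true
  # prefer isolated CPUs for exclusive allocations by default
  preferIsolatedCPUs: true
  preferSharedCPUs: false
  # default prioritization for exclusive CPU allocations
  defaultCPUPriority: normal
  # allocate CPU-unlimited Burstable containers at NUMA node level
  unlimitedBurstable: numa
```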
Additionally, the following sub-configuration is available for instrumentation:
instrumentation: configures runtime instrumentation.
httpEndpoint: the address the HTTP server listens on. Example: ":8891".
prometheusExport: if set to true, metrics about system and topology zone resource assignment are readable through /metrics from the configured httpEndpoint.
reportPeriod: the /metrics aggregation interval for polled metrics.
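For instance, a minimal instrumentation fragment enabling Prometheus export could look like this (the report period value is illustrative):

```yaml
spec:
  instrumentation:
    # serve /metrics on port 8891 on all interfaces
    httpEndpoint: ":8891"
    prometheusExport: true
    # aggregation interval for polled metrics (illustrative value)
    reportPeriod: 60s
```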
Reserved and Available Resources
Available and reserved resources are set up using the availableResources
and reservedResources configuration options.
Available resources can be used to set aside otherwise available CPU and memory resources and prevent the policy from assigning them to any workload. If available resources are omitted the policy can and will assign resources to workloads from the full available set.
Reserved resources reserve resources for the control plane, which by default
is all workloads in the kube-system namespace. Reserved resources cannot be
omitted. At least some amount of CPU must be reserved.
Both available and reserved resources can contain CPU and memory. Memory can be defined using the same quantity notation as in container resource requests and limits. CPU can be defined as quantity, amount of CPU to allocate to be available or reserved, or as an explicit CPU set either to be allocated or to be excluded from the allocated set.
A common use case for reserved resources is to set aside CPU for the control plane by quantity. The following fragment reserves 0.75 CPUs:
...
reservedResources:
cpu: 750m
...
Alternatively, one can reserve full CPU cores explicitly. For instance this configuration fragment reserves CPUs 0-1 for the control plane.
...
reservedResources:
cpu: cpuset:0-1
...
A typical use case for available resources is to set aside some CPUs for system daemons unrelated to Kubernetes. On a system with 128 CPUs, the following configuration leaves CPUs 0-1 out of the available set, reserving them for use by other daemons:
...
availableResources:
cpu: cpuset:2-127
...
One can also specify available CPUs by explicitly excluding the ones that should be set aside for other use:
...
availableResources:
cpu: exclude-cpuset:0-1
...
Note that both for the available and reserved resources you should make sure that the policy settings match any comparable settings of the node agent, the kubelet.
CPU Allocation Preferences
There are a number of workload properties this policy actively checks to decide if the workload could potentially benefit from extra resource allocation optimizations. Unless configured differently, containers fulfilling certain corresponding criteria are considered eligible for these optimizations. This will be reflected in the assigned resources whenever that is possible at the time the container’s creation / resource allocation request hits the policy.
The set of these extra optimizations consists of
assignment of kube-reserved CPUs
assignment of exclusively allocated CPU cores
usage of kernel-isolated CPU cores (for exclusive allocation)
The policy uses a combination of the QoS class and the resource requirements of the container to decide if any of these extra allocation preferences should be applied. Containers are divided into five groups, with each group having a slightly different set of criteria for eligibility.
kube-system group: all containers in the kube-system namespace
low-priority group: containers in the BestEffort or Burstable QoS class
sub-core group: Guaranteed QoS class containers with CPU request < 1 CPU
mixed group: Guaranteed QoS class containers with 1 <= CPU request < 2
multi-core group: Guaranteed QoS class containers with CPU request >= 2
The eligibility rules for extra optimization are slightly different among these groups.
kube-system
not eligible for extra optimizations
eligible to run on kube-reserved CPU cores
always run on shared CPU cores
low-priority
not eligible for extra optimizations
always run on shared CPU cores
sub-core
not eligible for extra optimizations
always run on shared CPU cores
mixed
by default eligible for exclusive and isolated allocation
not eligible for either if preferSharedCPUs is set to true
not eligible for either if annotated to opt out from exclusive allocation
not eligible for isolated allocation if annotated to opt out
multi-core
CPU request fractional (CPU request % 1000 milli-CPU != 0):
by default not eligible for extra optimizations
eligible for exclusive and isolated allocation if annotated to opt in
CPU request not fractional:
by default eligible for exclusive allocation
by default not eligible for isolated allocation
not eligible for exclusive allocation if annotated to opt out
eligible for isolated allocation if annotated to opt in
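The opt-in and opt-out annotations mentioned above could be used roughly as follows. This is a sketch: the prefer-isolated-cpus key appears later in this document, while the prefer-shared-cpus key is assumed to follow the same naming pattern; verify both against your plugin version.

```yaml
metadata:
  annotations:
    # opt container C1 out of exclusive allocation, keeping it on shared CPUs
    prefer-shared-cpus.resource-policy.nri.io/container.C1: "true"
    # opt container C2 in to isolated CPU allocation
    prefer-isolated-cpus.resource-policy.nri.io/container.C2: "true"
```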
Eligibility for kube-reserved CPU core allocation should always be possible to
honor. If this is not the case, it is probably due to an incorrect configuration
which underdeclares ReservedResources. In that case, ordinary shared CPU cores
will be used instead of kube-reserved ones.
Eligibility for exclusive CPU allocation should always be possible to honor. Eligibility for isolated core allocation is only honored if there are enough isolated cores available to fulfill the exclusive part of the container’s CPU request with isolated cores alone. Otherwise ordinary CPUs will be allocated, by slicing them off for exclusive usage from the shared subset of CPU cores in the container’s assigned pool.
Containers in the kube-system group are pinned to share all kube-reserved CPU cores. Containers in the low-priority or sub-core groups, and containers which are only eligible for shared CPU core allocation in the mixed and multi-core groups, are all pinned to run on the shared subset of CPU cores in the container’s assigned pool. This shared subset can and usually does change dynamically as exclusive CPU cores are allocated and released in the pool.
Preferred Topology Level for Burstable Containers Without CPU Limit
CPU-unlimited burstable containers are by default preferred to allocate to a
pool at the topology level specified by the unlimitedBurstable configuration
option. This global default can be overridden per pod or container using the
unlimited-burstable.resource-policy.nri.io annotation. This annotation accepts
the same values as the unlimitedBurstable configuration option.
metadata:
annotations:
# prefer to allocate container within the pod by default to a die
unlimited-burstable.resource-policy.nri.io/pod: "die"
# prefer to allocate C1 to a single NUMA node
unlimited-burstable.resource-policy.nri.io/container.C1: "numa"
# prefer to allocate C2 to a single socket
unlimited-burstable.resource-policy.nri.io/container.C2: "package"
# prefer to allocate C3 to all sockets in the system
unlimited-burstable.resource-policy.nri.io/container.C3: "system"
# prefer to allocate C4 to a single L3 cache
unlimited-burstable.resource-policy.nri.io/container.C4: "l3cache"
# any other containers in the pod will prefer allocation to a single die
Selectively Disabling Hyperthreading
If a container opts to hide hyperthreads, it is allowed to use only one hyperthread from every physical CPU core allocated to it. Note that as a result the container may be allowed to run on only half of the CPUs it has requested. For workloads that do not benefit from hyperthreading this nevertheless results in better performance compared to running on all hyperthreads of the same CPU cores. If the container's CPU allocation is exclusive, no other container can run on the hidden hyperthreads either.
metadata:
annotations:
# allow the "LLM" container to use only single thread per physical CPU core
hide-hyperthreads.resource-policy.nri.io/container.LLM: "true"
Assigning Containers to Scheduling Classes
A container can be assigned to a known ‘scheduling class’ by name using the
scheduling-class.resource-policy.nri.io effective annotation key. The value
of the annotation is the name of the class for the container or the pod. The
class itself needs to be defined in the active policy configuration using the
schedulingClasses configuration option. For instance the following Helm
configuration fragment defines two classes, realtime and idle, with the
corresponding scheduling and I/O priority attributes.
config:
reservedResources:
cpu: 2
...
schedulingClasses:
- name: realtime
policy: fifo # SCHED_FIFO
priority: 42
- name: idle
policy: idle # SCHED_IDLE
nice: 17
ioClass: be
ioPriority: 6
...
The following pod annotation will assign the container c0 to the realtime class:
metadata:
annotations:
scheduling-class.resource-policy.nri.io/container.c0: realtime
Inherited Default Scheduling Classes
If a container is not assigned to a scheduling class by annotation, it inherits the default scheduling class for its namespace or Pod QoS class, in this order of precedence, if either or both is set.
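A sketch of how such defaults might be declared in the configuration is shown below. The exact mapping syntax is an assumption, as are the namespace and class names; check the TopologyAwarePolicy CRD schema for the authoritative format.

```yaml
spec:
  # default scheduling class per namespace (illustrative names)
  namespaceSchedulingClasses:
    batch-jobs: idle
  # default scheduling class per Pod QoS class (illustrative mapping)
  podQoSSchedulingClasses:
    BestEffort: idle
```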
Implicit Topological Co-location for Pods and Namespaces
The colocatePods and colocateNamespaces configuration options control whether
the policy will try to co-locate, that is, allocate topologically close to each
other, containers within the same Pod or Kubernetes namespace.
Both of these options are false by default. Setting them to true is a shorthand for adding to each container an affinity of weight 10 for all other containers in the same pod or namespace.
Containers with user-defined affinities are never extended with either of these co-location affinities. However, such containers can still have affinity effects on other containers that do get extended with co-location. Therefore, mixing user-defined affinities with implicit co-location requires both careful consideration and a thorough understanding of affinity evaluation, or it should be avoided altogether.
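For instance, enabling both forms of implicit co-location could look like this minimal fragment (assuming the options sit at the top level of the custom resource spec):

```yaml
spec:
  # pull containers of the same pod topologically close to each other
  colocatePods: true
  # likewise for containers in the same namespace
  colocateNamespaces: true
```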
CPU and Memory Pinning Controls
Some containers may need to run on all CPUs or access all memories without restrictions. Annotate these pods and containers to prevent the resource policy from touching their CPU or memory pinning.
cpu.preserve.resource-policy.nri.io/container.CONTAINER_NAME: "true"
cpu.preserve.resource-policy.nri.io/pod: "true"
cpu.preserve.resource-policy.nri.io: "true"
memory.preserve.resource-policy.nri.io/container.CONTAINER_NAME: "true"
memory.preserve.resource-policy.nri.io/pod: "true"
memory.preserve.resource-policy.nri.io: "true"
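For example, the following sketch keeps the policy from touching the CPU pinning of any container in the pod, while additionally preserving memory pinning for one container (the container name is illustrative):

```yaml
metadata:
  annotations:
    # leave CPU pinning of all containers in this pod untouched
    cpu.preserve.resource-policy.nri.io/pod: "true"
    # additionally leave memory pinning of this one container untouched
    memory.preserve.resource-policy.nri.io/container.monitoring-agent: "true"
```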
Memory Configuration
It is not possible for the policy to accurately determine memory requests for
pods in the Burstable QoS class. If high accuracy is critical for such containers
you can annotate the pod with exact per container resource requirements, or use the
resource annotator webhook to do this for you. See the related Helm chart
documentation for more details.
Cold Start
The topology-aware policy supports "cold start" functionality. When cold start
is enabled and the workload is allocated to a topology node with both DRAM and
PMEM memory, the workload initially gets only the PMEM memory controller. The
DRAM controller is added to the workload's memory set only after the cold start
timeout has expired. The effect of this is that large memory areas which are
allocated but initially unused do not need to be migrated to PMEM later,
because they were allocated there to begin with. Cold start is configured like
this in the pod metadata:
metadata:
annotations:
memory-type.resource-policy.nri.io/container.container1: dram,pmem
cold-start.resource-policy.nri.io/container.container1: |
duration: 60s
Alternatively, you can use the following deprecated Pod annotation syntax to achieve the same effect, but support for this syntax may be dropped in a future release:
metadata:
annotations:
resource-policy.nri.io/memory-type: |
container1: dram,pmem
resource-policy.nri.io/cold-start: |
container1:
duration: 60s
In the above example, container1 would initially be granted only the PMEM
memory controller, but after 60 seconds the DRAM controller would be added
to the container's memory set.
Reserved Resources
Users can mark certain namespaces to have a reserved CPU allocation.
Containers belonging to such namespaces will only run on CPUs set aside
according to the global CPU reservation, as configured by the
reservedResources configuration option in the policy section.
The reservedPoolNamespaces option is a list of namespace globs that will be
allocated to reserved CPU class.
For example:
reservedPoolNamespaces: ["my-pool","reserved-*"]
In this setup, all workloads in the my-pool namespace and in namespaces
whose names start with the reserved- prefix are allocated to the reserved
CPU class. Workloads in kube-system are automatically assigned to the
reserved CPU class, so there is no need to list kube-system here.
Users can also mark individual pods and containers to have a reserved CPU allocation by using annotations. Containers with such an annotation will only run on CPUs set aside according to the global CPU reservation, as configured by the reservedResources configuration option in the policy section.
For example:
metadata:
annotations:
prefer-reserved-cpus.resource-policy.nri.io/pod: "true"
prefer-reserved-cpus.resource-policy.nri.io/container.special: "false"
Topology Hints
NRI Resource Policy automatically generates HW Topology Hints for devices
assigned to a container, prior to handing the container off to the active policy
for resource allocation. The topology-aware policy is hint-aware and normally
takes topology hints into account when picking the best pool to allocate resources.
Hints indicate optimal HW locality for device access and they can alter
significantly which pool gets picked for a container.
Since device topology hints are implicitly generated, there are cases where one would like the policy to disregard them altogether. For instance, when a local volume is used by a container but not in any performance critical manner.
Containers can be annotated to opt out from and selectively opt in to hint-aware pool selection using the following Pod annotations.
metadata:
annotations:
# only disregard hints for container C1
topologyhints.resource-policy.nri.io/container.C1: "false"
# disregard hints for all containers by default
topologyhints.resource-policy.nri.io/pod: "false"
# but take hints into account for container C2
topologyhints.resource-policy.nri.io/container.C2: "true"
Topology hint generation is globally enabled by default. Therefore, using the Pod annotation as an opt-in only has an effect when the whole pod is annotated to opt out of hint-aware pool selection.
It is possible to control whether and what kind of topology hints are generated using extra pod annotations. By default hints are generated from mounts and devices injected into the container. If pod resource API queries are enabled, query replies are also used for hint generation.
Enabling Or Disabling Selected Types of Topology Hints
The topologyhints.resource-policy.nri.io annotation key can be used
to enable or disable topology hint generation for one or more containers
altogether, or selectively for mounts, devices, and pod resources types.
More than one type can be specified as a comma-separated list. Additionally,
the all and none types are recognized to mean all or none of these
types. If no hint type annotations are present, all types of hints are
enabled.
For example:
metadata:
annotations:
# disable topology hint generation for all containers by default
topologyhints.resource-policy.nri.io/pod: none
# enable mount-based hints for the 'diskwriter' container
topologyhints.resource-policy.nri.io/container.diskwriter: mounts
# enable device-based hints for the 'videoencoder' container
topologyhints.resource-policy.nri.io/container.videoencoder: devices
# enable pod resource-based hints for the 'dpdk' container
topologyhints.resource-policy.nri.io/container.dpdk: pod-resources
# enable device and pod resource-based hints for 'networkpump' container
topologyhints.resource-policy.nri.io/container.networkpump: devices,pod-resources
Note that for pod resource based hints, you also need to enable pod resource API queries using the corresponding configuration option, like this:
apiVersion: config.nri/v1alpha1
kind: TopologyAwarePolicy
metadata:
name: default
...
spec:
...
agent:
podResourceAPI: true
Controlling Topology Hints by Path
It is also possible to enable and disable topology hint generation based
on mount or device path, using allow and deny lists. When the policy
is generating topology hints, it consults these lists to decide whether
hints for a particular mount or device are enabled. The deny list is
consulted first, followed by the allow list. A common usage pattern is
to deny all paths, then allow only selected ones.
Two types of allow and deny lists are supported: glob and prefix.
A path matches a prefix list if it starts with any entry in the list. A
path matches a glob list, whose entries may include shell-style wildcards,
if the full path matches any glob pattern in the list.
For example:
metadata:
annotations:
# Deny all hints by default.
deny.topologyhints.resource-policy.nri.io/pod: |+
type: prefix
paths:
- /
# Allow hints from /sys/devices/pci*/*d7:00.0/*:d8:00.1 for container ctr1.
allow.topologyhints.resource-policy.nri.io/container.ctr1: |+
type: glob
paths:
- /sys/devices/pci*/*d7:00.0/*:d8:00.*
# Allow hints for local NVMe block devices for container ctr2.
allow.topologyhints.resource-policy.nri.io/container.ctr2: |+
type: prefix
paths:
- /dev/nvme
Using Pod Resource API for Extra Topology Hints
If access to the kubelet’s Pod Resource API is enabled in the
Node Agent’s configuration, and pod resource-based hints are not explicitly
disabled by annotation, per-container topology hints are automatically
generated whenever a device with locality to a NUMA node is advertised by
the API. Annotated allow and deny lists can be used to selectively disable
or enable per-resource hints, using podresapi:$RESOURCE_NAME as the path
for the resource.
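Building on the allow/deny annotation format shown earlier, per-resource hints could be disabled for a device-plugin resource like this (the resource name is illustrative):

```yaml
metadata:
  annotations:
    # disable pod resource API hints for vendor.com/net-device in all containers
    deny.topologyhints.resource-policy.nri.io/pod: |+
      type: prefix
      paths:
        - podresapi:vendor.com/net-device
```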
Picking CPU And Memory By Topology Hints
Normally topology hints are only used to pick the assigned pool for a workload. Once a pool is selected the available resources within the pool are considered equally good for satisfying the topology hints. When the policy is allocating exclusive CPUs and picking pinned memory for the workload, only other potential criteria and attributes are considered for picking the individual resources.
When multiple devices are allocated to a single container, it is possible that this default assumption of all resources within the pool being topologically equal is not true. If a container is allocated misaligned devices, IOW devices with different memory or CPU locality, it is possible that only some of the CPU and memory in the selected pool satisfy the device hints and therefore have the desired locality.
For instance, in a two-socket system where socket #0 has NUMA nodes #0 and #1 and socket #1 has NUMA nodes #2 and #3, if a container is allocated two devices, one with locality to node #0 and another with locality to node #3, the only pool fulfilling the topology hints for both devices is the root node. However, only the resources close to node #0 are well aligned for the first device, only those close to node #3 for the second, and no resources in the pool are well aligned for both.
A container can be annotated to prefer hint based selection and pinning of CPU
and memory resources using the pick-resources-by-hints.resource-policy.nri.io
annotation. For example,
apiVersion: v1
kind: Pod
metadata:
name: data-pump
annotations:
k8s.v1.cni.cncf.io/networks: sriov-net1
prefer-isolated-cpus.resource-policy.nri.io/container.ctr0: "true"
pick-resources-by-hints.resource-policy.nri.io/container.ctr0: "true"
spec:
containers:
- name: ctr0
image: dpdk-pump
imagePullPolicy: Always
resources:
requests:
cpu: 2
memory: 100M
vendor.com/sriov_netdevice_A: '1'
vendor.com/sriov_netdevice_B: '1'
limits:
vendor.com/sriov_netdevice_A: '1'
vendor.com/sriov_netdevice_B: '1'
cpu: 2
memory: 100M
When annotated like that, the policy will try to pick one exclusive isolated CPU with locality to one device and another with locality to the other. It will also try to pick and pin to memory aligned with these devices. If this succeeds for all devices, the effective resources for the container will be the union of the individually picked resources. If picking resources by hints fails for any of the devices, the policy falls back to picking resources from the pool without considering device hints.
Container Affinity and Anti-Affinity
The topology-aware resource policy allows users to give hints about how particular containers should be co-located within a node. In particular, these hints express whether containers should be located ‘close’ to each other or ‘far away’ from each other, in a hardware topology sense.
Since these hints are interpreted always by a particular policy implementation, the exact definitions of ‘close’ and ‘far’ are also somewhat policy-specific. However as a general rule of thumb containers running
on CPUs within the same NUMA nodes are considered ‘close’ to each other,
on CPUs within different NUMA nodes in the same socket are ‘farther’, and
on CPUs within different sockets are ‘far’ from each other
These hints are expressed by container affinity annotations on the Pod.
There are two types of affinities:
- affinity (or positive affinity): causes affected containers to pull each other closer
- anti-affinity (or negative affinity): causes affected containers to push each other further away
Policies try to place a container
close to those the container has affinity towards
far from those the container has anti-affinity towards.
Affinity Annotation Syntax
Affinities are defined as the resource-policy.nri.io/affinity annotation.
Anti-affinities are defined as the resource-policy.nri.io/anti-affinity
annotation. They are specified in the metadata section of the Pod YAML, under
annotations as a dictionary, with each dictionary key being the name of the
container within the Pod to which the annotation belongs.
metadata:
  annotations:
    resource-policy.nri.io/affinity: |
      container1:
      - scope:
          key: key-ref
          operator: op
          values:
          - value1
          ...
          - valueN
        match:
          key: key-ref
          operator: op
          values:
          - value1
          ...
          - valueN
        weight: w
An anti-affinity is defined similarly but using resource-policy.nri.io/anti-affinity
as the annotation key.
metadata:
  annotations:
    resource-policy.nri.io/anti-affinity: |
      container1:
      - scope:
          key: key-ref
          operator: op
          values:
          - value1
          ...
          - valueN
        match:
          key: key-ref
          operator: op
          values:
          - value1
          ...
          - valueN
        weight: w
Affinity Semantics
An affinity consists of three parts:
- scope expression: defines which containers this affinity is evaluated against
- match expression: defines which containers (within the scope) the affinity applies to
- weight: defines how strong a pull or a push the affinity causes
Affinities are also sometimes referred to as positive affinities, while anti-affinities are referred to as negative affinities. The reason is that the only difference between the two is that affinities have a positive weight while anti-affinities have a negative weight.
The scope of an affinity defines the bounding set of containers the affinity can apply to. The match expression is evaluated against the containers in scope and selects the containers the affinity really has an effect on. The weight specifies whether the effect is a pull or a push: positive weights cause a pull while negative weights cause a push. Additionally, the weight specifies how strong the push or the pull is. This is useful in situations where the policy needs to make some compromises because an optimal placement is not possible. The weight then also acts as a way to specify priorities among the various compromises: the heavier the weight, the stronger the pull or push, and the larger the probability that it will be honored, if this is possible at all.
The scope can be omitted from an affinity, in which case it implies Pod scope, in other words the scope of all containers that belong to the same Pod as the container for which the affinity is defined.
The weight can also be omitted in which case it defaults to -1 for anti-affinities and +1 for affinities. Weights are currently limited to the range [-1000,1000].
Both the affinity scope and the match expression select containers, so they share an identical syntax: both are expressions. An expression consists of three parts:
key: specifies what metadata to pick from a container for evaluation
operation (op): specifies what logical operation the expression evaluates
values: a set of strings to evaluate the value of the key against
The supported keys are:
for pods:
- name
- namespace
- qosclass
- labels/<label-key>
- id
- uid
for containers:
- pod/<pod-key>
- name
- namespace
- qosclass
- labels/<label-key>
- tags/<tag-key>
- id
Essentially, an expression defines a logical operation of the form (key op values). Evaluating the expression takes the value of the key and applies the operation to it against the given values, yielding a boolean true/false result. Currently the following operations are supported:
- Equals: equality, true if the value of key equals the single item in values
- NotEqual: inequality, true if the value of key is not equal to the single item in values
- In: membership, true if the value of key equals any item in values
- NotIn: negated membership, true if the value of key is not equal to any item in values
- Exists: true if the given key exists with any value
- NotExists: true if the given key does not exist
- AlwaysTrue: always evaluates to true, can be used to denote node-global scope (all containers)
- Matches: true if the value of key matches the globbing pattern in values
- MatchesNot: true if the value of key does not match the globbing pattern in values
- MatchesAny: true if the value of key matches any of the globbing patterns in values
- MatchesNone: true if the value of key does not match any of the globbing patterns in values
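As an illustrative model of how such an expression might be evaluated, the sketch below uses Python's fnmatch to stand in for the globbing; the operator names follow the list above, but everything else (function names, data shapes, handling of missing keys) is an assumption, not the plugin's actual implementation:

```python
# Illustrative expression evaluator for (key op values); hypothetical,
# not the nri-plugins implementation. A missing key is treated as absent.
from fnmatch import fnmatch

def evaluate(expr, metadata):
    """Evaluate one expression against a flat container-metadata dict."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    present = key in metadata
    value = metadata.get(key, "")
    return {
        "Equals":      lambda: present and value == values[0],
        "NotEqual":    lambda: not present or value != values[0],
        "In":          lambda: present and value in values,
        "NotIn":       lambda: not present or value not in values,
        "Exists":      lambda: present,
        "NotExists":   lambda: not present,
        "AlwaysTrue":  lambda: True,
        "Matches":     lambda: present and fnmatch(value, values[0]),
        "MatchesNot":  lambda: not (present and fnmatch(value, values[0])),
        "MatchesAny":  lambda: present and any(fnmatch(value, p) for p in values),
        "MatchesNone": lambda: not present or not any(fnmatch(value, p) for p in values),
    }[op]()

meta = {"name": "sheep", "pod/qosclass": "Guaranteed"}
print(evaluate({"key": "name", "operator": "Matches", "values": ["she*"]}, meta))  # True
```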
The effective affinity between containers C_1 and C_2, A(C_1, C_2) is the sum of the weights of all pairwise in-scope matching affinities W(C_1, C_2). To put it another way, evaluating an affinity for a container C_1 is done by first using the scope (expression) to determine which containers are in the scope of the affinity. Then, for each in-scope container C_2 for which the match expression evaluates to true, taking the weight of the affinity and adding it to the effective affinity A(C_1, C_2).
Note that currently (for the topology-aware policy) this evaluation is asymmetric: A(C_1, C_2) and A(C_2, C_1) can and will be different unless the affinity annotations are crafted to prevent this (by making them fully symmetric). Moreover, A(C_1, C_2) is calculated and taken into consideration during resource allocation for C_1, while A(C_2, C_1) is calculated and taken into account during resource allocation for C_2. This might be changed in a future version.
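The summation defining the effective affinity can be sketched as follows. Representing scope and match as plain predicates is a simplification for illustration, not the plugin's actual data model:

```python
# Simplified model of effective affinity A(C1, C2): the sum of the weights
# of all affinities annotated for C1 whose scope and match expressions both
# select C2. Predicate-based scope/match is illustrative only.

def effective_affinity(affinities, c2):
    """Sum the weights of every affinity whose scope and match cover c2."""
    return sum(
        a["weight"]
        for a in affinities
        if a["scope"](c2) and a["match"](c2)
    )

# Affinities annotated for container C1: pull towards 'sheep' (+5),
# push away from 'wolf' (-5); scope defaults to the whole Pod.
c1_affinities = [
    {"scope": lambda c: True, "match": lambda c: c["name"] == "sheep", "weight": 5},
    {"scope": lambda c: True, "match": lambda c: c["name"] == "wolf", "weight": -5},
]

print(effective_affinity(c1_affinities, {"name": "sheep"}))  # 5
print(effective_affinity(c1_affinities, {"name": "wolf"}))   # -5
```

Note how the asymmetry mentioned above falls out of this model: the sum runs only over the affinities annotated for C1, so unless C2 carries mirror-image annotations, A(C2, C1) is computed from a different set of weights.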
Currently affinity expressions lack support for boolean operators (and, or, not). Sometimes this limitation can be overcome by using joint keys, especially with matching operators. The joint key syntax allows joining the value of several keys with a separator into a single value. A joint key can be specified in a simple or full format:
- simple: <colon-separated-subkeys>, which is equivalent to :::<colon-separated-subkeys>
- full: <ksep><vsep><ksep-separated-keylist>
A joint key evaluates to the values of all the <ksep>-separated subkeys
joined by <vsep>. A non-existent subkey evaluates to the empty string. For
instance the joint key
:pod/qosclass:pod/name:name
evaluates to
<qosclass>:<pod name>:<container name>
For existence operators, a joint key is considered to exist if any of its subkeys exists.
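Simple-format joint-key evaluation can be sketched like this. The sketch is illustrative only and covers just the colon-separated simple format; the full <ksep><vsep> format and the existence semantics are left out:

```python
# Illustrative simple-format joint-key evaluation: subkeys separated by the
# key separator are looked up and joined with the value separator. A
# non-existent subkey evaluates to the empty string. Hypothetical code.

def eval_joint_key(joint, metadata, ksep=":", vsep=":"):
    subkeys = [k for k in joint.split(ksep) if k]  # tolerate a leading separator
    return vsep.join(metadata.get(k, "") for k in subkeys)

meta = {"pod/qosclass": "Guaranteed", "pod/name": "data-pump", "name": "ctr0"}
print(eval_joint_key(":pod/qosclass:pod/name:name", meta))
# -> Guaranteed:data-pump:ctr0
```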
Examples
Put the container peter close to the container sheep but far away from the
container wolf.
resource-policy.nri.io/affinity: |
  peter:
  - match:
      key: name
      operator: Equals
      values:
      - sheep
    weight: 5
resource-policy.nri.io/anti-affinity: |
  peter:
  - match:
      key: name
      operator: Equals
      values:
      - wolf
    weight: 5
Shorthand Notation
There is an alternative shorthand syntax for what is considered to be the most common case: defining affinities between containers within the same pod. With this notation one needs to give just the names of the containers, like in the example below.
annotations:
  resource-policy.nri.io/affinity: |
    container3: [ container1 ]
  resource-policy.nri.io/anti-affinity: |
    container3: [ container2 ]
    container4: [ container2, container3 ]
This shorthand notation defines:
- container3 having
  - affinity (weight 1) to container1
  - anti-affinity (weight -1) to container2
- container4 having anti-affinity (weight -1) to container2 and container3
The equivalent annotation in full syntax would be
metadata:
  annotations:
    resource-policy.nri.io/affinity: |+
      container3:
      - match:
          key: labels/io.kubernetes.container.name
          operator: In
          values:
          - container1
    resource-policy.nri.io/anti-affinity: |+
      container3:
      - match:
          key: labels/io.kubernetes.container.name
          operator: In
          values:
          - container2
      container4:
      - match:
          key: labels/io.kubernetes.container.name
          operator: In
          values:
          - container2
          - container3
Cookbook
Mixed Workloads with Different QoS Requirements
Deploy multiple containers with varying QoS on the same node, using pod annotations and policy configuration:
# High-priority realtime container
# - SCHED_FIFO scheduling policy
# - prefer isolated CPUs for the single requested one
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prefer-isolated-cpus.resource-policy.nri.io/pod: "true"
    scheduling-class.resource-policy.nri.io/pod: "realtime"
spec:
  containers:
  - name: realtime-app
    resources:
      requests:
        cpu: "1"
        memory: "8Gi"
      limits:
        cpu: "1"
        memory: "8Gi"
# Prioritized task with elevated priority
# - SCHED_OTHER scheduling policy with elevated priority
# - burstable container (without resource limits)
# - burstability limited to a NUMA node
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling-class.resource-policy.nri.io/pod: "prioritized"
    unlimited-burstable.resource-policy.nri.io/pod: "numa"
spec:
  containers:
  - name: prioritized-app
    resources:
      requests:
        cpu: "1500m"
        memory: "1Gi"
# Best-effort background task
# - SCHED_IDLE to run only when nothing else needs CPU in the same pool
---
apiVersion: v1
kind: Pod
metadata:
  annotations:
    scheduling-class.resource-policy.nri.io/pod: "idle"
spec:
  containers:
  - name: background-task
    resources:
      requests:
        cpu: "100m"
        memory: "256Mi"
apiVersion: config.nri/v1alpha1
kind: TopologyAwarePolicy
metadata:
  name: default
spec:
  preferIsolatedCPUs: false
  schedulingClasses:
  - name: realtime
    policy: fifo
    priority: 42
  - name: prioritized
    policy: other
    priority: 30
  - name: idle
    policy: idle
Multi-Tier Memory Applications
Use PMEM for warm-up and DRAM for active working set with pod annotations. If you want this as a default, set it via your admission policy or template the pod annotations at deployment time.
apiVersion: v1
kind: Pod
metadata:
  annotations:
    memory-type.resource-policy.nri.io/container.app: dram,pmem
    cold-start.resource-policy.nri.io/container.app: |
      duration: 60s
spec:
  containers:
  - name: app
    resources:
      requests:
        memory: "32Gi"
      limits:
        memory: "32Gi"
Co-located Pod Workloads
Prefer co-location of containers within the same pod by enabling the policy-level setting, or use explicit affinity annotations:
apiVersion: config.nri/v1alpha1
kind: TopologyAwarePolicy
metadata:
  name: default
spec:
  colocatePods: true
apiVersion: v1
kind: Pod
metadata:
  annotations:
    resource-policy.nri.io/affinity: |
      backend:
      - match:
          key: name
          operator: Equals
          values:
          - frontend
        weight: 10
spec:
  containers:
  - name: frontend
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
  - name: backend
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
For more affinity options, see Container Affinity and Anti-Affinity.
Troubleshooting
To enable more verbose logging and metrics export from the topology-aware policy, enable instrumentation and policy debugging in the nri-resource-policy global config:
instrumentation:
  # The topology-aware policy can export various system and topology
  # zone utilisation metrics, accessible on the command line with
  # curl --silent http://$localhost_or_pod_IP:8891/metrics
  httpEndpoint: :8891
  prometheusExport: true
  metrics:
    enabled: # use '*' instead for all available metrics
    - policy
logger:
  debug:
  - policy