Balloons Policy
Overview
What Problems Does the Balloons Policy Solve?
The balloons policy addresses CPU, device, and container affinity requirements by organizing workloads into isolated CPU pools called “balloons.” This approach solves several key challenges:
Flexible node resource partitioning: Isolates and regroups containers from the same or different pods, multi-pod applications, and namespaces to share balloons.
Realtime requirements and latency-sensitivity: Minimizes latencies by isolating critical workloads, tuning cache and physical CPU core sharing, memory and device locality, CPU frequencies and powersaving states, and process scheduling parameters, including realtime scheduling policies and I/O priorities.
Maximal server throughput: Optimizes resource utilization by spreading memory- and memory-bandwidth-hungry workloads across sockets, NUMA nodes, and cache domains. Tunes the balance of memory accesses, from lowest-latency access to the closest memories only, towards maximal bandwidth by using more memory channels: through balanced cross-NUMA balloons within the same socket, or even further by crossing socket boundaries. Controls allowed memory types (DRAM, HBM, PMEM).
Different workloads, different servers, different rules: Allocates CPUs to different containers based on different CPU allocation preferences, configurable per worker node or worker node group. Supports both pre-allocation of CPUs and on-demand allocation as containers are created, and both statically fixed and dynamically growing/shrinking balloons.
Organization of this document
This document is organized so that you can either read it top to bottom or jump directly to the sections that follow.
Integration with Kubernetes
The balloons policy integrates into the Kubernetes stack as an NRI (Node Resource Interface) plugin that works with container runtimes (containerd or CRI-O):
Runtime Integration: Runs as a DaemonSet on each node, intercepting container lifecycle events through NRI.
Dynamic Configuration: Uses Kubernetes Custom Resources (CRs) for configuration, supporting cluster-wide, node-group, and node-specific settings.
Pod Annotations: Allows fine-grained control through pod and container annotations.
Topology Awareness: Can expose balloon topology through NodeResourceTopology CRs for scheduler integration and for cluster-wide inspection of containers' CPU affinity.
The policy evaluates each container when it starts, assigns it to an appropriate balloon based on configuration rules, and sets its CPU and memory affinity accordingly.
Installation and Configuration
Prerequisites
Kubernetes 1.24+
Helm 3.0.0+
Container runtime with NRI support:
containerd 1.7.0+ or CRI-O 1.26.0+
NRI feature enabled in the runtime (this is the default in up-to-date runtimes).
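As a quick sanity check on a node, you can look for the NRI plugin socket; /var/run/nri/nri.sock is the default location, assuming it has not been changed in the runtime configuration:
ls -l /var/run/nri/nri.sock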
Installing with Helm
Add the NRI plugins Helm repository:
helm repo add nri-plugins https://containers.github.io/nri-plugins
helm repo update
Install the balloons policy with default configuration:
helm install nri-resource-policy-balloons nri-plugins/nri-resource-policy-balloons --namespace kube-system
Install with a custom configuration from a values file:
cat > balloons.values.helm.yaml <<EOF
nri:
  runtime:
    patchConfig: true
  plugin:
    index: 10
config:
  agent:
    nodeResourceTopology: true
    podResourceAPI: false
  allocatorTopologyBalancing: true
  pinCPU: true
  pinMemory: false
  reservedResources:
    cpu: cpuset:0
  balloonTypes:
    - name: high-priority
      minCPUs: 4
      maxCPUs: 8
      namespaces:
        - production
      preferNewBalloons: true
  control:
    rdt:
      enable: false
      partitions:
      options:
  log:
    debug:
      - policy
    klog:
      skip_headers: true
    source: true
EOF
helm install nri-resource-policy-balloons nri-plugins/nri-resource-policy-balloons \
--namespace kube-system \
-f balloons.values.helm.yaml
Uninstalling with Helm
Uninstalling will not restore CPU and memory pinning of running containers. See Reset CPU and Memory Pinning below to allow containers to use any CPU and memory again.
Uninstalling leaves behind the BalloonsPolicy custom resource definition (CRD). Because installation skips installing a new BalloonsPolicy CRD if one already exists, it is recommended to remove the CRD manually after uninstalling:
helm uninstall nri-resource-policy-balloons -n kube-system
kubectl delete crd balloonspolicies.config.nri
Managing Configuration with kubectl
View the current balloons policy configuration:
# List all balloons policy configurations
kubectl get balloonspolicies.config.nri -n kube-system
# View the default configuration
kubectl get balloonspolicies.config.nri/default -n kube-system -o yaml
Edit the configuration:
kubectl edit balloonspolicies.config.nri/default -n kube-system
The policy watches for configuration changes and automatically reconfigures itself when the configuration is updated.
Configuration Scopes
The balloons policy supports three levels of configuration precedence:
Default configuration (lowest precedence): Applies to all nodes without a more specific configuration.
  Resource name: default
Group-specific configuration: Applies to nodes labeled with a configuration group.
  Resource name: group.$GROUP_NAME
  Node label: config.nri/group=$GROUP_NAME
Node-specific configuration (highest precedence): Applies to a single named node.
  Resource name: node.$NODE_NAME
Example: Create a group-specific configuration for high-memory nodes:
# Create the configuration
kubectl apply -f - <<EOF
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: group.high-memory
  namespace: kube-system
spec:
  reservedResources:
    cpu: cpuset:0-1
  balloonTypes:
    - name: memory-intensive
      shareIdleCPUsInSame: numa
      maxBalloons: 4
      preferNewBalloons: true
      allocatorTopologyBalancing: true
      namespaces:
        - analytics
EOF
# Label nodes to use this configuration
kubectl label node node-1 config.nri/group=high-memory
kubectl label node node-2 config.nri/group=high-memory
Example: Create a node-specific configuration:
kubectl apply -f - <<EOF
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: node.special-node
  namespace: kube-system
spec:
  reservedResources:
    cpu: 4000m
  balloonTypes:
    - name: gpu-workloads
      preferCloseToDevices:
        - /sys/class/drm/card0
EOF
Configuration Options
Core Concepts
A balloon is a set of CPUs shared by a set of containers. Key principles:
Every container is assigned to exactly one balloon.
Every CPU belongs to at most one balloon.
A container is allowed to use all CPUs in its balloon.
Optionally, containers can share “idle CPUs” (CPUs not in any balloon).
Containers never access CPUs from other balloons.
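Idle CPU sharing is controlled by the shareIdleCPUsInSame balloon-type option, used in several examples later in this document with the values numa, package, and system. A minimal sketch (the balloon type name is illustrative):
balloonTypes:
  - name: bursty
    shareIdleCPUsInSame: numa # containers may also run on idle CPUs within their NUMA node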
Container-to-Balloon Assignment
These options control which containers are assigned to which balloons. The policy first chooses a balloon type for a container, and then an existing or a new balloon instance of that type.
Choosing Balloon Type
Balloon type selection for a new container, in precedence order:
1. The balloon type named in the container's pod annotation. This skips matching against user-defined and built-in balloon types.
2. The implicit built-in reserved balloon type, which matches kube-system containers unless configured otherwise.
3. The first balloon type in the balloonTypes list where a matchExpression or a namespace matches the container.
4. The implicit built-in default balloon type, which matches any container as a fallback, unless configured otherwise.
If nothing matches, creating the container fails.
namespaces (list of strings)
Assigns containers from matching namespaces to this balloon type.
Supports wildcards (e.g., "prod-*", "*").
balloonTypes:
  - name: production
    namespaces:
      - prod-*
      - critical
matchExpressions (list of expressions)
Evaluates expressions against container attributes.
First balloon type with a matching expression is selected.
Supports keys:
- name (container name)
- namespace
- qosclass (Guaranteed, Burstable, or BestEffort)
- labels/KEY (container's label)
- pod/name (pod's name)
- pod/qosclass
- pod/labels/KEY
- pod/id
- pod/uid
Supports operators:
- Equals, NotEqual: key value is (not) equal to the only value
- In, NotIn: key value is (not) in the listed values
- Exists, NotExists: key is (not) present, values ignored
- AlwaysTrue
- Matches, MatchesNot: key value matches (does not match) a single globbing pattern
- MatchesAny, MatchesNone: key value matches any/none of a set of globbing patterns
Example: first pick logger and monitor containers into low-priority balloons from any pod, including high and critical priority pods. Then pick all (remaining) containers from high priority pods into latency-critical balloons.
balloonTypes:
  - name: low-priority
    matchExpressions:
      - key: name
        operator: In
        values:
          - logger
          - monitor
  - name: latency-critical
    matchExpressions:
      - key: pod/labels/priority
        operator: In
        values:
          - high
          - critical
Composite balloons with componentCreation: balance-balloons:
When a balloon type has a components list, componentCreation controls which component balloon types are used for allocating its CPUs. The default is all of them.
componentCreation: balance-balloons creates only one component: the one whose balloon type has the fewest instances.
Use this to alternate container assignments across multiple balloon types.
balloonTypes:
  - name: balanced-a-b
    components:
      - balloonType: type-a
      - balloonType: type-b
    componentCreation: balance-balloons
  - name: type-a
    ...
  - name: type-b
    ...
Choosing Balloon Instance
Once the policy has chosen a balloon type for a container, the following options determine which instance of that type receives the container.
If no instance, new or existing, can get enough CPUs to fit the container, the container creation fails.
groupBy (string)
Groups containers into the same balloon instance based on expression evaluation.
The expression supports substitution: ${pod/labels/mylabel} is replaced with the label's value. Substitution supports the same keys as matchExpressions above.
Containers with the same expression value go to the same balloon.
balloonTypes:
  - name: app-instances
    groupBy: "${pod/labels/app}-${pod/labels/instance}"
Pod spreading options:
- preferSpreadingPods: false (default): Containers from the same pod prefer the same balloon.
- preferSpreadingPods: true: Containers from the same pod prefer different balloons.
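A minimal sketch of spreading a pod's containers over separate balloons (the balloon type name is illustrative):
balloonTypes:
  - name: spread-pods
    preferSpreadingPods: true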
Namespace grouping:
- preferPerNamespaceBalloon: false (default): Namespace has no effect on balloon selection.
- preferPerNamespaceBalloon: true: Containers in the same namespace prefer the same balloon, and different namespaces prefer different balloons.
Example: assign all containers into per-namespace balloons. Use the same balloon type for every container by placing it before built-in types in the list.
balloonTypes:
  - name: per-namespace-balloon
    namespaces:
      - "*"
    preferPerNamespaceBalloon: true
    allocatorTopologyBalancing: true
  - name: reserved
  - name: default
preferNewBalloons (boolean, default: false)
- false: Prefer filling existing balloons; inflate them if needed up to the maximum size; create a new balloon only as a last resort.
- true: Prefer creating new balloons for exclusive CPU access (if CPUs are available).
balloonTypes:
  - name: exclusive-cpus
    allocatorTopologyBalancing: true
    preferNewBalloons: true
CPUs-to-Balloon Selection
These options control which CPUs are selected when creating or resizing balloons.
Static CPU Preferences
preferIsolCpus (boolean, default: false)
- true: Prefer kernel-isolated CPUs (from the isolcpus kernel parameter).
Tip: use with minCPUs, maxCPUs, and minBalloons to pre-allocate CPUs when using this option.
Warning: If insufficient isolated CPUs exist, balloons may include non-isolated CPUs, too. If the application is unaware of this, unwanted Linux scheduling behavior is likely.
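To check which CPUs are kernel-isolated on a node, read the kernel's sysfs:
cat /sys/devices/system/cpu/isolated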
Example: pre-allocate two fixed-size one-CPU-balloons from kernel-isolated CPUs.
balloonTypes:
  - name: kernel-isolated-cpu
    preferIsolCpus: true
    minCPUs: 1
    maxCPUs: 1
    minBalloons: 2
preferCoreType (string: "efficient" or "performance")
On hybrid architectures (P/E cores), prefers the specified core type.
"performance": Select high-performance cores."efficient": Select power-efficient cores.
balloonTypes:
  - name: background-tasks
    preferCoreType: efficient
preferCloseToDevices (list of strings)
Prefers CPUs topologically close to specified devices.
Device paths like:
- /sys/class/net/eth0
- /sys/class/drm/card0
- /sys/devices/system/cpu/cpu14/cache/index2
- /sys/devices/system/node/node0
First device in list has highest priority.
Automatically adds anti-affinity between listed devices and other balloon types.
balloonTypes:
  - name: gpu-workloads
    preferCloseToDevices:
      - /sys/class/drm/card0
  - name: network-io
    preferCloseToDevices:
      - /sys/class/net/eth0
Dynamic CPU Preferences
allocatorTopologyBalancing (boolean, inherits from the policy-level setting if not specified)
- true: Spread balloons across the hardware topology (the NUMA node/die/package with the most free CPUs). Reduces interference when the system is partially loaded, and helps with future balloon inflation within the same NUMA node/die/package.
- false (default): Pack balloons tightly into the same hardware topology elements. Keeps large portions of the hardware idle for power saving, but allows more interference between balloons.
preferSpreadOnPhysicalCores (boolean, inherits from the policy-level setting if not specified)
- true: Allocate logical CPUs from separate physical cores. Prevents containers from competing for the resources of the same physical core, but allows more interference between different balloons.
- false (default): Pack logical CPUs tightly onto a minimum number of physical cores. Reduces inter-balloon interference, but containers in the same balloon share physical core resources.
Recommendation: use loads and loadClasses for better control over sharing physical cores and low-level caches, and/or hideHyperthreads to prevent any process in any balloon from using the other hyperthread of the same physical core.
loads (list of strings)
Marks logical CPUs and their surroundings (physical core, cache) as loaded by certain balloons.
Prevents selecting CPUs from those surroundings for other balloons that generate the same load.
Every load is defined in loadClasses (see below).
balloonTypes:
  - name: compute-heavy
    loads:
      - avx-compute
loadClasses (list, policy-level configuration):
Defines system load characteristics that balloons can generate. The CPU allocator uses this information to avoid overloading hardware resources.
Each load class defines:
- name (string): Load class identifier (referenced in balloon types' loads lists).
- level (string): Hardware topology level affected: "core" (load affects physical CPU core resources) or "l2cache" (load affects an L2 cache block).
- overloadsLevelInBalloon (boolean, default: false):
  - false: CPUs within a balloon can be from the same core/cache (locality prioritized).
  - true: CPUs within a balloon should avoid the same core/cache (avoid self-interference).
How load classes affect CPU allocation:
Balloon type declares it generates certain loads.
When allocating CPUs for this balloon, allocator marks affected topology levels as loaded.
When allocating CPUs for another balloon with the same load class (same or different balloon type), allocator avoids loaded topology levels.
Result: Balloons generating similar loads get CPUs from separate cores/caches, while balloons generating different loads are allowed to share the same cores/caches.
Example: containers in compute-heavy-1 and compute-heavy-2
balloons should get only one hyperthread from every physical core, and
they should not share physical cores. Containers in light-compute
are marked to load cores, too, but only to avoid taking both
hyperthreads from the same cores. This leaves the other hyperthread
free for compute heavy workloads. A heavy-avx and a light-avx load
can share the same physical core, but two heavy-avx or two
light-avx loads should not.
balloonTypes:
  - name: compute-heavy-1
    loads:
      - heavy-avx
  - name: compute-heavy-2
    loads:
      - heavy-avx
  - name: light-compute
    loads:
      - light-avx
loadClasses:
  - name: heavy-avx
    level: core
    overloadsLevelInBalloon: true # Each CPU from a different physical core
  - name: light-avx
    level: core
    overloadsLevelInBalloon: true # Spread on different physical cores to leave
                                  # the other thread free for compute-heavy balloons
components (list of objects):
For balloons with diverse CPU requirements, use composite balloons where each component specifies different requirements:
balloonTypes:
  - name: hybrid-workload
    components:
      - balloonType: near-gpu
      - balloonType: near-network
  - name: near-gpu
    preferCloseToDevices:
      - /sys/class/drm/card0
  - name: near-network
    preferCloseToDevices:
      - /sys/class/net/eth0
Balloon Size Control
minCPUs (integer, default: 0)
Minimum number of CPUs in any balloon of this type.
When a balloon is created or deflated, it always has at least this many CPUs.
Useful for:
- ensuring minimum performance guarantees
- allocating exactly the wanted number of special CPUs (from isolcpus, the reserved CPU set, efficient cores, CPUs local to a device, …)
- partitioning hardware block-by-block (taking all CPUs of a physical core, L2 cache domain, NUMA node, compute die, or socket at once)
maxCPUs (integer, default: 0 = unlimited)
Maximum number of CPUs in any balloon of this type.
A balloon will not inflate beyond this limit; the policy has to create a new balloon instead.
Set to -1 to prevent containers matching this balloon from running on the node (see cookbook).
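A sketch of the maxCPUs: -1 trick (the balloon type name and label key are illustrative): matching containers cannot fit into any balloon of this type, so they are effectively blocked from running on nodes using this configuration.
balloonTypes:
  - name: not-on-this-node
    maxCPUs: -1 # no balloon of this type can hold any container
    matchExpressions:
      - key: pod/labels/blocked
        operator: Equals
        values:
          - "true"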
Fully dynamic sizing:
Set minCPUs: 0 and maxCPUs: 0, or leave them undefined.
The balloon size is then determined entirely by container CPU requests.
Fixed size:
Set minCPUs and maxCPUs to the same value.
balloonTypes:
  - name: fixed-quad
    minCPUs: 4
    maxCPUs: 4
  - name: dynamic-small
    minCPUs: 1
    maxCPUs: 8
  - name: unlimited
    minCPUs: 2
    maxCPUs: 0
CPU Allocation Priority
minBalloons (integer, default: 0)
Number of balloon instances pre-created when policy starts or reconfigures.
allocatorPriority and the order in the balloonTypes list affect the order in which minBalloons of different balloon types are instantiated (see below).
Ensures critical balloons always exist and have CPUs before other balloons.
While new balloon instances of this type may still be dynamically created and destroyed, the number of instances never goes below minBalloons.
maxBalloons (integer, default: 0 = unlimited)
Maximum number of balloon instances allowed to co-exist.
Prevents creating new balloons beyond this limit.
allocatorPriority (string: "high", "normal" (default), "low", "none")
At policy initialization, balloons are pre-created in this order:
1. Balloon types with allocatorPriority: high (in list order)
2. Balloon types with allocatorPriority: normal (in list order)
3. Balloon types with allocatorPriority: low (in list order)
balloonTypes:
  - name: critical-service
    minBalloons: 2
    maxBalloons: 2
    minCPUs: 4
    maxCPUs: 4
    allocatorPriority: high
  - name: best-effort
    allocatorPriority: low
Memories-to-Balloon Selection
If pinMemory: true, the policy allows containers to use memory only
from NUMA nodes closest to their balloon’s CPUs. This node set may be
larger if there is not enough memory in the closest nodes
alone. Pinning can be fine-tuned to include only certain memory types
for certain balloons and containers.
memoryTypes (list of strings):
Restricts memory types available to containers: "DRAM", "HBM", "PMEM".
Default: all memory types in the system are allowed.
Can be overridden per container with the memory-type.resource-policy.nri.io annotation.
Effective only when pinMemory: true.
pinMemory: true
balloonTypes:
  - name: hbm-only
    memoryTypes:
      - HBM
  - name: flexible
    memoryTypes:
      - DRAM
      - PMEM
  - name: default
    pinMemory: false
Container Tuning
These options configure how containers behave within their balloons.
Scheduling and Priority
schedulingClass (string)
References a scheduling class defined in the schedulingClasses list.
Sets the Linux scheduling policy, priority, nice value, and I/O class for containers.
Can be overridden with the scheduling-class.resource-policy.nri.io pod annotation.
schedulingClasses (list, policy-level configuration):
Each scheduling class defines:
- name (string): Class name referenced by schedulingClass in balloon types.
- policy (string): Linux scheduling policy: "none", "other", "fifo", "rr", "batch", "idle", "deadline".
- priority (integer): Scheduling priority; depends on policy, see sched_setscheduler(2).
- flags (list): Scheduling flags: "reset-on-fork", "reclaim", "dl-overrun", "keep-policy", "keep-params", "util-clamp-min", "util-clamp-max".
- nice (integer): Nice value for the container process (-20 to 19).
- runtime (integer): Runtime for the deadline policy (microseconds).
- deadline (integer): Deadline for the deadline policy (microseconds).
- period (integer): Period for the deadline policy (microseconds).
- ioClass (string): I/O class: "none", "rt" (realtime), "be" (best-effort), "idle".
- ioPriority (integer): I/O priority, see ionice(1).
balloonTypes:
  - name: high-priority
    schedulingClass: critical
schedulingClasses:
  - name: critical
    policy: rr
    priority: 50
    ioClass: rt
    ioPriority: 0
  - name: background
    policy: idle
    ioClass: idle
Hyperthread Visibility
hideHyperthreads (boolean, default: false)
- true: Containers can use only one hyperthread from each physical core in the balloon. Hidden hyperthreads remain completely idle (not available to any container). Useful for workloads that don't benefit from hyperthreading. A balloon with 16 logical CPUs from 8 cores allows using only 8 CPUs.
- false: Containers can use all hyperthreads.
Can be overridden with the hide-hyperthreads.resource-policy.nri.io pod annotation, which hides hyperthreads only from the annotated containers rather than from all containers of a balloon type.
balloonTypes:
  - name: compute-intensive
    hideHyperthreads: true
Pod Annotations for Container Overrides
All pod annotations below apply either to all containers in a pod (annotation key ending in .../pod, or with no suffix at all), or to named containers in the pod (key ending in .../container.CONTAINER_NAME). The latter overrides the former.
Balloon type selection:
balloon.balloons.resource-policy.nri.io: BALLOON_TYPE
balloon.balloons.resource-policy.nri.io/pod: BALLOON_TYPE
balloon.balloons.resource-policy.nri.io/container.CONTAINER_NAME: BALLOON_TYPE
Scheduling class:
scheduling-class.resource-policy.nri.io/container.CONTAINER_NAME: CLASS_NAME
Hyperthread hiding:
hide-hyperthreads.resource-policy.nri.io/container.CONTAINER_NAME: "true"
Preserve existing pinning (opt-out of policy management):
cpu.preserve.resource-policy.nri.io/container.CONTAINER_NAME: "true"
memory.preserve.resource-policy.nri.io/container.CONTAINER_NAME: "true"
Memory type:
memory-type.resource-policy.nri.io/container.CONTAINER_NAME: HBM,DRAM
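Putting these together, a sketch of a pod that selects a balloon type for all its containers and hides hyperthreads from one of them (the balloon type, pod, container, and image names are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: annotated-app
  annotations:
    # all containers of this pod go to a 'latency-critical' balloon
    balloon.balloons.resource-policy.nri.io/pod: latency-critical
    # only the 'worker' container sees one hyperthread per physical core
    hide-hyperthreads.resource-policy.nri.io/container.worker: "true"
spec:
  containers:
    - name: worker
      image: my-app:latest
      resources:
        requests:
          cpu: "2"
    - name: sidecar
      image: my-sidecar:latest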
CPU Tuning
These options configure CPU behavior and power management.
cpuClass (string)
References a CPU class defined in control.cpu.classes (policy-level configuration).
Applied when a balloon is created, inflated, or deflated.
Configures frequency scaling and C-states for CPUs in the balloon.
idleCPUClass (string, policy-level configuration)
CPU class for idle CPUs (not in any balloon).
Applied when CPUs are removed from balloons.
control.cpu.classes (object, policy-level configuration):
Each CPU class (keyed by name) can define:
- minFreq (integer): Minimum CPU frequency in kHz.
- maxFreq (integer): Maximum CPU frequency in kHz.
- uncoreMinFreq (integer): Minimum uncore frequency in kHz.
- uncoreMaxFreq (integer): Maximum uncore frequency in kHz.
- disabledCstates (list): C-state names to disable (e.g., ["C6", "C8"]). Disabling deep C-states reduces latency by preventing deep sleep. Disabling intermediate C-states keeps the CPU responsive for longer after use, while still allowing it to enter deeper power-saving states when idle.
List available C-states: grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
balloonTypes:
  - name: latency-critical
    cpuClass: turbo
  - name: best-effort
    cpuClass: normal
idleCPUClass: powersave
control:
  cpu:
    classes:
      turbo:
        minFreq: 3000000
        maxFreq: 3600000
        uncoreMinFreq: 2000000
        uncoreMaxFreq: 2400000
        disabledCstates: [C6, C8, C10]
      normal:
        minFreq: 1200000
        maxFreq: 3000000
      powersave:
        minFreq: 800000
        maxFreq: 1200000
Built-in Balloon Types
The policy includes two built-in balloon types that can be customized and whose position in the balloon type list can be changed.
Reserved Balloon
Purpose: Runs system containers (typically from kube-system namespace)
Default behavior (when not explicitly defined):
Automatically placed first in balloon types list.
Captures containers from the kube-system namespace and from namespaces matching reservedPoolNamespaces.
Uses CPUs specified in reservedResources.cpu. If a cpuset is defined, those CPUs are preferred when inflating the reserved balloon, and correspondingly avoided by other balloons. The reserved balloon can inflate beyond this cpuset if required by its containers.
reservedResources (policy-level configuration):
- cpu (string): CPUs for the reserved balloon.
  - Preferred CPUs: "cpuset:0,48" prefers using CPUs 0 and 48. Uses as many of them as needed to satisfy container CPU requests, but no more.
  - Excluded CPUs: "exclude-cpuset:4-63" prefers using any CPUs present in availableResources.cpu other than CPUs 4-63. E.g., if CPUs 0-63 are available, this results in CPU set 0-3.
  - Quantity: "2000m" or "2" uses at least 2 CPUs.
  - If minCPUs is explicitly set for the reserved balloon type, it overrides the quantity.
reservedPoolNamespaces (list of strings, policy-level configuration):
Additional namespaces (beyond kube-system) assigned to the reserved balloon.
Supports wildcards.
Customizing reserved balloon:
# Policy-level settings
reservedResources:
  cpu: cpuset:0,48 # preferred specific CPUs
reservedPoolNamespaces:
  - kube-system
  - monitoring
# Explicitly define the 'reserved' balloon type for more control.
# Because "reserved" is moved below "own-cpus", kube-system and
# monitoring pods with a priority=high label will get their own CPUs
# instead of sharing CPUs with other reserved pods.
balloonTypes:
  - name: own-cpus
    preferNewBalloons: true
    matchExpressions:
      - key: pod/labels/priority
        operator: In
        values:
          - high
  - name: reserved # Must be named 'reserved'
    shareIdleCPUsInSame: numa
Default Balloon
Purpose: Catches all containers not matched by user-defined balloon types.
Default behavior (when not explicitly defined):
Automatically placed last in balloon types list.
Captures all remaining containers.
Uses any remaining CPUs not allocated to other balloons.
Customizing default balloon:
balloonTypes:
  - name: production
    namespaces:
      - prod-*
  - name: default # Must be named 'default'
    minCPUs: 1
    shareIdleCPUsInSame: package # Allow bursting without crossing the socket boundary.
    minBalloons: 2 # Create one balloon on each socket of a 2-socket system.
    maxBalloons: 2
    namespaces:
      - "*" # Match all containers not matched by the types above.
Toggle and Reset Pinning of Memory, CPUs, and Containers
Memory Pinning
pinMemory (boolean, policy-level, default: true)
- true: Pin containers to the NUMA nodes closest to their balloon's CPUs.
- false: Allow containers to use memory from any NUMA node.
Can be overridden per balloon type.
Warning: Pinning may cause OOM kills if pinned memory nodes have insufficient memory.
CPU Pinning
pinCPU (boolean, policy-level, default: true)
- true: Restrict containers to their balloon's CPUs (and optionally shared idle CPUs).
- false: Containers can use any CPU in the system.
Usually left as true for proper balloon isolation.
Managed CPUs
availableResources (object, policy-level configuration):
- cpu (string): The CPU set managed by the policy.
  - Explicit CPUs to use: "cpuset:4-48" sets the available CPUs to the given set, CPUs 4-48.
  - CPUs to exclude: "exclude-cpuset:0-3" sets the available CPUs to all but the given set. E.g., if the system has CPUs 0-127, this results in CPUs 4-127.
All balloons use only CPUs from this set.
Useful for reserving CPUs for non-policy-managed workloads.
availableResources:
  cpu: cpuset:48-95,144-191 # Use only socket 1 in a 2-socket system
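To see which CPUs belong to which socket and NUMA node when composing such a cpuset, lscpu can print a per-CPU table:
lscpu -e=CPU,NODE,SOCKET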
Managed Containers
The balloons policy can completely ignore selected containers.
preserve (object, policy-level configuration):
- matchExpressions (list): Container match expressions. Containers matching these expressions are not managed by the policy. Their existing CPU/memory pinning (or lack thereof) is preserved.
Useful for:
- analyzers and other containers that need access to every CPU:
preserve:
  matchExpressions:
    - key: name
      operator: In
      values:
        - analyzer
        - debugger
- special containers managed by external policies, for instance on CPUs not in availableResources:
preserve:
  matchExpressions:
    - key: pod/labels/non-balloon-cpus
      operator: Equals
      values:
        - "true"
- configurations where balloons manages only a few special containers, while all others can run unrestricted on any CPU in the system:
preserve:
  matchExpressions:
    - key: pod/labels/workload-type
      operator: NotIn
      values:
        - "idle"
balloonTypes:
  - name: idle-workloads
    matchExpressions:
      - key: pod/labels/workload-type
        operator: In
        values:
          - "idle"
    schedulingClass: idle
schedulingClasses:
  - name: idle
    policy: idle
    ioClass: idle
pinCPU: false
pinMemory: false
Reset CPU and Memory Pinning
Running containers can be “reset” to allow accessing all CPUs and memories in the system by applying a configuration that pins them accordingly.
kubectl apply -f - <<EOF
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
  namespace: kube-system
spec:
  balloonTypes:
    - name: reserved
      namespaces:
        - "*"
      shareIdleCPUsInSame: system
      # preferIsolCpus: true # uncomment to use kernel isolcpus, too
  reservedResources:
    cpu: 1000m
  pinCPU: true
  pinMemory: true
EOF
Visibility, Scheduling, Metrics, Logging, Debugging
These options control the observability of balloons and their containers, including making them visible to the Kubernetes scheduler.
NodeResourceTopology Integration
agent.nodeResourceTopology (boolean, policy-level, default: false)
- true: Expose balloons as topology zones in noderesourcetopologies.topology.node.k8s.io CRs. Enables topology-aware scheduling; created balloon instances are exposed as zones.
showContainersInNrt (boolean, can be set at the policy level and overridden in balloon types, default: false)
- true: Include container names and resource affinities in NodeResourceTopology CRs.
The policy-level setting applies to all balloons by default; a balloon-type setting overrides it.
Warning: May generate significant API server traffic with many containers.
# Enable topology exposure at the policy level
agent:
  nodeResourceTopology: true
# Do not show containers in NodeResourceTopology by default.
showContainersInNrt: false
# Override for a specific balloon type
balloonTypes:
  - name: devel
    namespaces:
      - "devel-*"
    showContainersInNrt: true # Show development container CPU affinities for debugging
Example: print all balloons in all nodes in a single table
kubectl get noderesourcetopologies.topology.node.k8s.io -o json | jq -r '
  ["NODE","BALLOON","CPUSET","SHARED_CPUSET"],
  (
    .items.[] as $node
    | $node.zones[]
    | select(.type == "balloon")
    | [
        $node.metadata.name,
        .name,
        (.attributes[] | select(.name=="cpuset") | .value),
        (.attributes[] | select(.name=="shared cpuset") | .value)
      ]
  )
  | @tsv'
Example: print all containers whose balloon type has effective showContainersInNrt: true, on all nodes
kubectl get noderesourcetopologies.topology.node.k8s.io -o json | jq -r '
  ["NODE","BALLOON","CONTAINER","CPUS","MEMS"],
  (
    .items.[] as $node
    | $node.zones[]
    | select(.type == "allocation for container")
    | [
        $node.metadata.name,
        .parent,
        .name,
        (.attributes[] | select(.name=="cpuset") | .value),
        (.attributes[] | select(.name=="memory set") | .value)
      ]
  )
  | @tsv'
NodeResourceTopology information can be used in Kubernetes scheduling through the topology-aware scheduler plugin.
Instrumentation and Metrics
instrumentation (object, policy-level configuration):
- httpEndpoint (string): HTTP server address (e.g., ":8891").
- prometheusExport (boolean): Enable Prometheus metrics at the /metrics endpoint.
- reportPeriod (string): Aggregation interval for polled metrics (e.g., "10s").
- metrics.enabled (list): Glob patterns for metrics to collect (e.g., ["policy"]).
- samplingRatePerMillion (integer): Tracing sample rate (e.g., 100000 for 10%).
- tracingCollector (string): External tracing endpoint (e.g., "otlp-http").
Available metrics:
Balloon instances and their CPUs.
Containers assigned to each balloon.
CPU allocation and utilization.
instrumentation:
  httpEndpoint: :8891
  prometheusExport: true
  reportPeriod: 30s
  metrics:
    enabled:
      - policy
      - '*' # All metrics
  samplingRatePerMillion: 100000
Accessing metrics:
# Direct access (if pod IPs are reachable)
for podip in $(kubectl get pod -n kube-system -l "app.kubernetes.io/instance=nri-resource-policy-balloons" -o=jsonpath='{.items[*].status.podIP}'); do
  curl -s http://$podip:8891/metrics
done
# Port forwarding
kubectl port-forward -n kube-system daemonset/nri-resource-policy-balloons 8891:8891 &
curl http://localhost:8891/metrics
Logging and Debugging
log (object, policy-level configuration):
- debug (list): Components to enable debug logging for:
  - nri-plugin: NRI communication with the container runtime
  - policy: the balloons policy
  - expression: matchExpressions and groupBy expressions
  - cache: the container and pod cache
  - cpu: low-level CPU tuning controls (frequencies, C-states)
  - cpuallocator: the lowest-level CPU allocator below the policy
  - sysfs: hardware discovery
  - "*": all components
- source (boolean): Prefix messages with the logger source name.
- klog (object): klog-specific options (see the klog documentation).
log:
  debug:
    - policy
    - nri-plugin
  source: true
  klog:
    skip_headers: true
View logs:
Show the policy's decision when assigning container CTRNAME of pod PODNAME in namespace NAMESPACE to a balloon:
kubectl logs -n kube-system daemonset/nri-resource-policy-balloons | grep 'assigning container NAMESPACE/PODNAME/CTRNAME'
Inspect container information and cgroups:
Inspect container CTRNAME in PODNAME in NAMESPACE:
CTR_ID=$(kubectl get pod -n NAMESPACE PODNAME -o jsonpath='{.status.containerStatuses[?(@.name=="CTRNAME")].containerID}' | sed 's|^[^:]*://||')
crictl inspect $CTR_ID | jq .info.runtimeSpec
Read effective CPU and memory sets from cgroups on the local node with the kube-cgroups script.
curl -OL https://raw.githubusercontent.com/containers/nri-plugins/refs/heads/main/scripts/testing/kube-cgroups
chmod a+rx kube-cgroups
./kube-cgroups -n NAMESPACE -p PODNAME -c CTRNAME -f 'cpuset.cpus.effective|cpuset.mems.effective'
Cookbook
Latency-Critical Containers
Goal: Minimize latency by preventing cache sharing and optimizing CPU configuration.
Requirements:
Containers should not share L2 cache to avoid cache contention.
CPUs should run at maximum frequency.
Disable deep C-states to avoid wakeup latency.
Each container gets dedicated CPUs (no sharing between containers).
Configuration:
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
  namespace: kube-system
spec:
  reservedResources:
    cpu: "2"
  pinCPU: true
  pinMemory: true
  allocatorTopologyBalancing: false # Pack tightly for power efficiency
  idleCPUClass: powersave # Force minimal frequencies for external processes
  balloonTypes:
    - name: ultra-low-latency
      # Match latency-critical containers
      matchExpressions:
        - key: pod/labels/latency
          operator: In
          values:
            - critical
      # Fixed size, pre-created balloons
      minBalloons: 4
      minCPUs: 4
      allocatorPriority: high
      # Each container gets its own balloon
      preferNewBalloons: true
      # CPU configuration for minimal latency
      cpuClass: ultra-low-latency
      schedulingClass: realtime
      # Mark CPUs as generating cache load
      loads:
        - l2-intensive
    - name: default
      namespaces:
        - "*"
      loads:
        - l2-intensive # avoid sharing L2 domains with critical containers
      cpuClass: normal # low-to-mid GHz to save power budget for critical containers
  # Define L2 cache load to prevent containers in other balloons from sharing cache
  loadClasses:
    - name: l2-intensive
      level: l2cache
      overloadsLevelInBalloon: false # Share L2 between CPUs within a balloon
  # CPU classes for frequency and C-state control
  control:
    cpu:
      classes:
        ultra-low-latency:
          minFreq: 3500000
          maxFreq: 3900000
          uncoreMinFreq: 2400000
          uncoreMaxFreq: 2400000
          disabledCstates: [C6, C7, C8, C10]
        normal:
          minFreq: 800000
          maxFreq: 2500000
        powersave:
          minFreq: 800000
          maxFreq: 800000
  # Scheduling for high priority
  schedulingClasses:
    - name: realtime
      policy: fifo
      priority: 80
      ioClass: rt
      ioPriority: 0
Pod label example:
apiVersion: v1
kind: Pod
metadata:
  name: latency-critical-app
  labels:
    latency: critical
spec:
  containers:
    - name: app
      image: my-app:latest
      resources:
        requests:
          cpu: "4"
Maximum Memory Bandwidth Containers
Goal: Maximize memory bandwidth by two alternative means:
1. balancing memory-bandwidth balloons across multiple NUMA nodes, each balloon using only the local (lowest-latency) memory nodes.
2. balancing each memory-bandwidth balloon to have an equal number of CPUs from both NUMA nodes of a single socket; these balloons are then balanced between sockets.
Requirements:
CPUs from multiple NUMA nodes to access multiple memory channels.
Balanced distribution across topology for maximum bandwidth.
Containers can be large, using many CPUs.
Configuration 1: Local NUMA accesses only
For a 2-socket, 4-NUMA-node system where pods labeled "memory-intensive" should be restricted to a single local NUMA node for both CPU and memory. Other pods should share all CPUs of either socket, except for the CPUs reserved for memory-intensive workloads.
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
  namespace: kube-system
spec:
  reservedResources:
    cpu: cpuset:0
  pinCPU: true
  pinMemory: true
  allocatorTopologyBalancing: true # Spread across NUMA nodes
  balloonTypes:
    - name: memory-bandwidth
      matchExpressions:
        - key: pod/labels/workload-type
          operator: In
          values:
            - memory-intensive
      # Each workload gets its own balloon
      preferNewBalloons: true
    - name: default
      # Create one default balloon per socket for other workloads
      minBalloons: 2
      shareIdleCPUsInSame: package
Configuration 2: balanced CPUs from both NUMAs of either socket
For a 2-socket, 4-NUMA-node system where each container needs an equal number of CPUs from all NUMA nodes of either package, but must avoid cross-socket allocation.
balloonTypes:
  - name: max-bandwidth-either-package
    # Composite balloon combining CPUs from all NUMA nodes of either package
    components:
      - balloonType: max-bandwidth-package0
      - balloonType: max-bandwidth-package1
    componentCreation: balance-balloons
    preferNewBalloons: true
    matchExpressions:
      - key: pod/labels/workload-type
        operator: In
        values:
          - max-bandwidth
  # Component balloon types - one per package
  - name: max-bandwidth-package0
    components:
      - balloonType: numa0
      - balloonType: numa1
  - name: max-bandwidth-package1
    components:
      - balloonType: numa2
      - balloonType: numa3
  # Component balloon types - one per NUMA node
  - name: numa0
    preferCloseToDevices:
      - /sys/devices/system/node/node0
  - name: numa1
    preferCloseToDevices:
      - /sys/devices/system/node/node1
  - name: numa2
    preferCloseToDevices:
      - /sys/devices/system/node/node2
  - name: numa3
    preferCloseToDevices:
      - /sys/devices/system/node/node3
Workload-Aware Hyperthread Sharing
Goal: Optimize physical core utilization by controlling which workload types share cores.
Scenario:
Type A workloads: High CPU utilization, always running (e.g., batch processing), no bursts.
Type B workloads: Bursty, short-duration spikes (e.g., request handling). Benefit from temporarily using more CPUs than requested.
Type C workloads: Low-priority background tasks. Benefit from extra CPUs but must not slow down Type B workloads.
Strategy:
Prevent two Type A workloads from sharing physical cores (would compete heavily).
Allow Type A + Type B on same physical cores.
Allow Type A + Type C on same physical cores.
Allow Types B and C to use extra available CPUs.
Prioritize running Type B over Type C on shared CPUs.
Configuration:
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
  namespace: kube-system
spec:
  reservedResources:
    cpu: "2"
  pinCPU: true
  pinMemory: false
  allocatorTopologyBalancing: true
  balloonTypes:
    - name: always-running-batch
      matchExpressions:
        - key: pod/labels/workload-pattern
          operator: In
          values:
            - always-running
      # Mark as generating sustained core load
      loads:
        - sustained-compute
    - name: bursty-requests
      matchExpressions:
        - key: pod/labels/workload-pattern
          operator: In
          values:
            - bursty
      schedulingClass: high-priority
      # Allow bursting within NUMA node scope.
      shareIdleCPUsInSame: numa
    - name: background-tasks
      matchExpressions:
        - key: pod/labels/workload-pattern
          operator: In
          values:
            - background
      # No loads specified - can share cores with anything.
      schedulingClass: low-priority
      # Allow bursting within NUMA node scope.
      shareIdleCPUsInSame: numa
  # Define sustained compute load
  loadClasses:
    - name: sustained-compute
      level: core
      overloadsLevelInBalloon: true # Each balloon's CPUs from separate physical cores
  schedulingClasses:
    - name: high-priority
      nice: -10
    - name: low-priority
      policy: batch
      nice: 10
      ioClass: idle
How it works:
always-running-batch balloons declare the sustained-compute load with overloadsLevelInBalloon: true. CPUs are allocated from different physical cores within each balloon, and when multiple such balloons exist, they avoid each other's cores.
bursty-requests and background-tasks balloons don't declare any loads, so they can be allocated hyperthreads on cores already used by always-running-batch.
Containers in bursty-requests and background-tasks balloons share idle CPUs in the same NUMA nodes. They get CPU scheduling priority (bursty) or depriority (background) to manage competition.
Result: Physical cores are shared between complementary workloads, maximizing utilization without excessive competition.
Pod label examples:
# Always-running batch job
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing
spec:
  template:
    metadata:
      labels:
        workload-pattern: always-running
    spec:
      restartPolicy: Never # required for Job pod templates
      containers:
        - name: processor
          image: batch-processor:latest
          resources:
            requests:
              cpu: "4"
---
# Bursty request handler
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    metadata:
      labels:
        workload-pattern: bursty
    spec:
      containers:
        - name: server
          image: api-server:latest
          resources:
            requests:
              cpu: "2"
---
# Background task
apiVersion: batch/v1
kind: CronJob
metadata:
  name: log-aggregator
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            workload-pattern: background
        spec:
          restartPolicy: Never # required for Job pod templates
          containers:
            - name: aggregator
              image: log-aggregator:latest
              resources:
                requests:
                  cpu: "1"
Advanced: Fine-grained control with L2 cache awareness
For workloads where L2 cache contention is also a concern:
balloonTypes:
  - name: cache-sensitive-batch
    matchExpressions:
      - key: pod/labels/workload-pattern
        operator: In
        values:
          - cache-sensitive
    loads:
      - sustained-compute
      - l2-cache-intensive
loadClasses:
  - name: sustained-compute
    level: core
    overloadsLevelInBalloon: true
  - name: l2-cache-intensive
    level: l2cache
    overloadsLevelInBalloon: false # Allow cache sharing within a balloon
This prevents cache-sensitive workloads from sharing L2 caches across balloons while still allowing it within a balloon.
Troubleshooting
Issue: Containers not assigned to the expected balloons, or balloons not getting the desired CPUs
Solution: Check the balloon type matching order and CPU allocation from logs with
log:
  debug:
    - policy
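Then follow the policy's assignment decisions in the plugin log, for example:
kubectl logs -n kube-system daemonset/nri-resource-policy-balloons | grep 'assigning container'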
Issue: Out of memory errors with pinMemory: true
Solution: Either disable memory pinning or ensure sufficient memory on each NUMA node:
pinMemory: false
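To check the size and free memory of each NUMA node on the node, one option (assuming numactl is installed) is:
numactl --hardware | grep -E '^node [0-9]+ (size|free):'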
Issue: Policy not applying configuration changes
Solution:
Verify that the nri-resource-policy-balloons-... pod is running:
kubectl get -n kube-system daemonset/nri-resource-policy-balloons
kubectl get pod -n kube-system nri-resource-policy-balloons-...
Verify that the configuration object (BalloonsPolicy) exists, is valid, is named correctly (default, group.GROUPNAME, or node.NODENAME), and is in the same namespace as the pod:
kubectl get balloonspolicies.config.nri -n kube-system
Verify node labels. If a node has a config.nri/group=GROUPNAME label, it will not use the BalloonsPolicy named default, only group.GROUPNAME or node.NODENAME:
kubectl get node NODENAME --show-labels
If necessary, delete the policy pod, let the DaemonSet re-create it and look for “config” issues in the log.
kubectl delete pod -n kube-system nri-resource-policy-balloons-...
kubectl logs -n kube-system nri-resource-policy-balloons-...
Issue: System unresponsive or performance drop after configuration change
Solution:
Delete all workloads, especially those with a large memory footprint, before changing a configuration. Pinning their memory to different nodes may cause large node-to-node memory migrations, causing unresponsiveness that can last up to minutes.
Redeploy performance-critical workloads only after configuration updates. That ensures their processes get started and their data aligned in memory as configured.
Make sure important containers request CPUs. Check that BestEffort and Burstable containers without CPU requests are still allowed to use enough CPUs (see shareIdleCPUsInSame) instead of all getting packed onto a single CPU.
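For example, a sketch that lets containers in default balloons burst onto idle CPUs within their NUMA node:
balloonTypes:
  - name: default
    shareIdleCPUsInSame: numa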