.. SPDX-License-Identifier: GPL-2.0

====================
Considering hardware
====================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

The way a workload is handled can be influenced by the hardware it runs on.
Key components include the CPU, memory, and the buses that connect them.
These resources are shared among all applications on the system.
As a result, heavy utilization of one resource by a single application
can affect the deterministic handling of workloads in other applications.

Below is a brief overview.

System memory and cache
-----------------------

Main memory and the associated caches are the most common shared resources among
tasks in a system. One task can dominate the available caches, forcing another
task to wait until a cache line is written back to main memory before it can
proceed. The impact of this contention varies based on write patterns and the
size of the caches available. Larger caches may reduce stalls because more lines
can be buffered before being written back. Conversely, certain write patterns
may trigger the cache controller to flush many lines at once, causing
applications to stall until the operation completes.

This issue can be partly mitigated if applications do not share the same CPU
cache. The kernel is aware of the cache topology and exports this information to
user space. Tools such as **lstopo** from the Portable Hardware Locality (hwloc)
project (https://www.open-mpi.org/projects/hwloc/) can visualize the hierarchy.

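The same information can also be read directly from sysfs. The following sketch
assumes the standard ``/sys/devices/system/cpu/cpuN/cache/indexN`` layout and
prints which CPUs share each cache of CPU 0; it is an illustration only, not a
replacement for **lstopo**::

  #include <stdio.h>
  #include <string.h>

  #define CACHE_FMT "/sys/devices/system/cpu/cpu0/cache/index%d/"

  /* Read one sysfs attribute of cache "index<idx>" into buf. */
  static int read_attr(const char *fmt, int idx, char *buf, size_t len)
  {
          char path[128];
          FILE *f;

          snprintf(path, sizeof(path), fmt, idx);
          f = fopen(path, "r");
          if (!f)
                  return -1;
          if (!fgets(buf, len, f))
                  buf[0] = '\0';
          fclose(f);
          buf[strcspn(buf, "\n")] = '\0';
          return 0;
  }

  int main(void)
  {
          char level[16], type[32], shared[256];
          int idx;

          /* Walk index0, index1, ... until no further cache is exported. */
          for (idx = 0; ; idx++) {
                  if (read_attr(CACHE_FMT "level", idx, level, sizeof(level)))
                          break;
                  read_attr(CACHE_FMT "type", idx, type, sizeof(type));
                  read_attr(CACHE_FMT "shared_cpu_list", idx, shared,
                            sizeof(shared));
                  printf("L%s %-12s shared by CPUs %s\n", level, type, shared);
          }
          return 0;
  }
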
Avoiding shared L2 or L3 caches is not always possible. Even when cache sharing
is minimized, bottlenecks can still occur when accessing system memory. Memory
is used not only by the CPU but also by peripheral devices via DMA, such as
graphics cards or network adapters.

In some cases, cache and memory bottlenecks can be controlled if the hardware
provides the necessary support. On x86 systems, Intel offers Cache Allocation
Technology (CAT), which enables cache partitioning among applications and
provides control over the interconnect. AMD provides similar functionality under
Platform Quality of Service (PQoS). On Arm64, the equivalent is Memory
System Resource Partitioning and Monitoring (MPAM).

These features can be configured through the Linux Resource Control interface.
For details, see Documentation/filesystems/resctrl.rst.

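As a rough sketch of how the interface is used, the example below creates a
control group, assigns it a share of the L3 cache, and moves the current task
into it. It assumes an x86 CPU with CAT support and the resctrl filesystem
mounted at ``/sys/fs/resctrl``; the group name ``rt-group`` and the bit mask in
the ``schemata`` line are placeholders that depend on the actual hardware::

  #include <errno.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* Write a short string to a resctrl file; returns 0 on success. */
  static int write_str(const char *path, const char *val)
  {
          FILE *f = fopen(path, "w");

          if (!f)
                  return -1;
          if (fputs(val, f) == EOF) {
                  fclose(f);
                  return -1;
          }
          return fclose(f);
  }

  int main(void)
  {
          char pid[32];

          /* Create a new control group; it starts with the default schemata. */
          if (mkdir("/sys/fs/resctrl/rt-group", 0755) && errno != EEXIST)
                  return 1;

          /*
           * Limit the group to the lowest four ways of L3 cache domain 0.
           * Depending on the kernel version, all cache domains may have to
           * be listed in a single write.
           */
          if (write_str("/sys/fs/resctrl/rt-group/schemata", "L3:0=f\n"))
                  return 1;

          /* Move the calling task into the group. */
          snprintf(pid, sizeof(pid), "%d\n", (int)getpid());
          return write_str("/sys/fs/resctrl/rt-group/tasks", pid) ? 1 : 0;
  }
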
The perf tool can be used to monitor cache behavior. It can analyze
cache misses of an application and compare how they change under
different workloads on a neighboring CPU. Going further, the perf c2c
tool can help identify cache-to-cache contention (false sharing), where
multiple CPU cores repeatedly access and modify data on the same cache
line.

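perf is usually driven from the command line, but the same hardware counters
can also be read from within an application via the ``perf_event_open()``
system call. The following self-contained sketch counts
``PERF_COUNT_HW_CACHE_MISSES`` for the calling thread around a section of
interest; which events are actually available depends on the CPU's performance
monitoring unit::

  #include <linux/perf_event.h>
  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Thin wrapper; glibc does not provide one for this system call. */
  static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                              int cpu, int group_fd, unsigned long flags)
  {
          return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
  }

  int main(void)
  {
          struct perf_event_attr attr = { 0 };
          uint64_t count;
          int fd;

          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = PERF_COUNT_HW_CACHE_MISSES;
          attr.disabled = 1;
          attr.exclude_kernel = 1;

          /* Count cache misses of the calling thread, on any CPU. */
          fd = perf_event_open(&attr, 0, -1, -1, 0);
          if (fd < 0) {
                  perror("perf_event_open");
                  return 1;
          }

          ioctl(fd, PERF_EVENT_IOC_RESET, 0);
          ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

          /* ... the code section to be measured runs here ... */

          ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
          if (read(fd, &count, sizeof(count)) == sizeof(count))
                  printf("cache misses: %llu\n", (unsigned long long)count);

          close(fd);
          return 0;
  }
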
Hardware buses
--------------

Real-time systems often need to access hardware directly to perform their work.
Any latency in this process is undesirable, as it can affect the outcome of the
task. For example, on an I/O bus, a changed output may not become immediately
visible but instead appear with variable delay depending on the latency of the
bus used for communication.

A bus such as PCI is relatively simple because register accesses are routed
directly to the connected device. In the worst case, a read operation stalls the
CPU until the device responds.

A bus such as USB is more complex, involving multiple layers. A register read
or write is wrapped in a USB Request Block (URB), which is then sent by the
USB host controller to the device. Timing and latency are influenced by the
underlying USB bus. Requests cannot be sent immediately; they must align with
the next frame boundary according to the endpoint type and the host controller's
scheduling rules. This can introduce delays and additional latency. For example,
a network device connected via USB may still deliver sufficient throughput, but
the added latency when sending or receiving packets may fail to meet the
requirements of certain real-time use cases.

Additional restrictions on bus latency can arise from power management. For
instance, PCIe with Active State Power Management (ASPM) enabled can suspend
the link between the device and the host. While this behavior is beneficial for
power savings, it delays device access and adds latency to responses. This issue
is not limited to PCIe; internal buses within a System-on-Chip (SoC) can also be
affected by power management mechanisms.

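Whether ASPM is allowed to put links to sleep can be inspected and, if latency
matters more than power, changed through the global policy knob in sysfs. The
following minimal sketch assumes ``CONFIG_PCIEASPM`` is enabled and that the
firmware grants the operating system control over ASPM::

  #include <stdio.h>

  #define ASPM_POLICY "/sys/module/pcie_aspm/parameters/policy"

  int main(void)
  {
          char buf[128];
          FILE *f;

          /* The currently selected policy is shown in square brackets. */
          f = fopen(ASPM_POLICY, "r");
          if (!f) {
                  perror(ASPM_POLICY);
                  return 1;
          }
          if (fgets(buf, sizeof(buf), f))
                  printf("current policy: %s", buf);
          fclose(f);

          /* "performance" keeps the links from entering low-power states. */
          f = fopen(ASPM_POLICY, "w");
          if (!f || fputs("performance", f) == EOF || fclose(f))
                  return 1;
          return 0;
  }
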
Virtualization
--------------

In a virtualized environment such as KVM, each guest CPU is represented as a
thread on the host. If such a thread runs with real-time priority, the system
should be tested to confirm it can sustain this behavior over extended periods.
Because of its priority, the thread will not be preempted by lower-priority
threads (such as SCHED_OTHER), which may then receive no CPU time. This can
cause problems if a lower-priority thread is pinned to a CPU already occupied by
a real-time task and is therefore unable to make progress. Even if a CPU has
been isolated, the system may still (accidentally) start a per-CPU thread on
that CPU. Ensuring that a guest CPU goes idle is difficult, as it requires
avoiding both task scheduling and interrupt handling. Furthermore, if the guest
system is booted with the option **idle=poll**, the guest CPU never enters an
idle state even when it has nothing to do; instead it spins until an event
arrives.

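Giving a vCPU thread a real-time priority is normally done from the host, for
example with **chrt** or through the VMM configuration. Purely as an
illustration of the underlying mechanism, the sketch below assigns SCHED_FIFO
to a host thread; the TID used here is a made-up placeholder::

  #include <sched.h>
  #include <stdio.h>
  #include <sys/types.h>

  int main(void)
  {
          /* Placeholder TID; in practice taken from the VMM at runtime. */
          const pid_t vcpu_tid = 12345;
          const struct sched_param sp = { .sched_priority = 50 };

          if (sched_setscheduler(vcpu_tid, SCHED_FIFO, &sp)) {
                  perror("sched_setscheduler");
                  return 1;
          }
          return 0;
  }
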
Device handling introduces additional considerations. Emulated PCI devices or
VirtIO devices require a counterpart on the host to complete requests. This
adds latency because the host must intercept and either process the request
directly or schedule a thread for its completion. These delays can be avoided if
the required PCI device is passed directly through to the guest. Some devices,
such as networking or storage controllers, support the PCIe SR-IOV feature.
SR-IOV allows a single PCIe device to be divided into multiple virtual functions,
which can then be assigned to different guests.

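Virtual functions are created from the host through sysfs before they can be
assigned to guests. In the sketch below the PCI address ``0000:01:00.0`` is
only a placeholder; the write succeeds only if the device and its driver
actually support SR-IOV::

  #include <stdio.h>

  int main(void)
  {
          /* Placeholder PCI address of an SR-IOV capable physical function. */
          const char *path = "/sys/bus/pci/devices/0000:01:00.0/sriov_numvfs";
          FILE *f = fopen(path, "w");

          if (!f) {
                  perror(path);
                  return 1;
          }
          /* Request four VFs; an existing non-zero count must be reset to 0
           * before it can be changed. */
          fprintf(f, "4\n");
          return fclose(f) ? 1 : 0;
  }
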
Networking
----------

For low-latency networking, the full networking stack may be undesirable, as it
can introduce additional sources of delay. In this context, XDP can be used
as a shortcut to bypass much of the stack while still relying on the kernel's
network driver.

The requirements are that the network driver must support XDP (preferably using
an "skb pool") and that the application must use an XDP socket. Additional
configuration may involve BPF filters, tuning networking queues, or configuring
qdiscs for time-based transmission. These techniques are often
applied in Time-Sensitive Networking (TSN) environments.

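To give an idea of the shape of such a setup, the following sketch shows a
minimal XDP program that redirects packets from a receive queue to an AF_XDP
socket registered in an ``XSKMAP``. It follows the usual clang/libbpf build
conventions; the user-space side (UMEM registration, socket creation, filling
the map) is omitted here and is covered by the documentation referenced below::

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* One AF_XDP socket per receive queue, installed from user space. */
  struct {
          __uint(type, BPF_MAP_TYPE_XSKMAP);
          __uint(max_entries, 64);
          __type(key, __u32);
          __type(value, __u32);
  } xsks_map SEC(".maps");

  SEC("xdp")
  int xdp_redirect_xsk(struct xdp_md *ctx)
  {
          __u32 queue = ctx->rx_queue_index;

          /*
           * Hand the frame to the AF_XDP socket bound to this queue, if
           * any; otherwise let it travel through the regular stack.
           */
          if (bpf_map_lookup_elem(&xsks_map, &queue))
                  return bpf_redirect_map(&xsks_map, queue, XDP_PASS);

          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";
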
Documenting all required steps exceeds the scope of this text. For detailed
guidance, see the TSN documentation at https://tsn.readthedocs.io.

Another useful resource is the Linux Real-Time Communication Testbench
(https://github.com/Linutronix/RTC-Testbench).
The goal of this project is to validate real-time network communication. It can
be thought of as a "cyclictest" for networking and also serves as a starting
point for application development.