.. SPDX-License-Identifier: GPL-2.0

====================
Considering hardware
====================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

The way a workload is handled can be influenced by the hardware it runs on.
Key components include the CPU, memory, and the buses that connect them.
These resources are shared among all applications on the system.
As a result, heavy utilization of one resource by a single application
can affect the deterministic handling of workloads in other applications.

Below is a brief overview.

System memory and cache
-----------------------

Main memory and the associated caches are the most common shared resources among
tasks in a system. One task can dominate the available caches, forcing another
task to wait until a cache line is written back to main memory before it can
proceed. The impact of this contention varies based on write patterns and the
size of the caches available. Larger caches may reduce stalls because more lines
can be buffered before being written back. Conversely, certain write patterns
may trigger the cache controller to flush many lines at once, causing
applications to stall until the operation completes.

This issue can be partly mitigated if applications do not share the same CPU
cache. The kernel is aware of the cache topology and exports this information to
user space. Tools such as **lstopo** from the Portable Hardware Locality (hwloc)
project (https://www.open-mpi.org/projects/hwloc/) can visualize the hierarchy.

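The same information can also be read directly from sysfs. The following sketch
assumes the standard ``/sys/devices/system/cpu/cpuN/cache/indexN`` layout and
prints which CPUs share each cache of CPU 0; it is an illustration only, not a
replacement for **lstopo**::

  #include <stdio.h>
  #include <string.h>

  #define CACHE_FMT "/sys/devices/system/cpu/cpu0/cache/index%d/"

  /* Read one sysfs attribute of cache "index<idx>" into buf. */
  static int read_attr(const char *fmt, int idx, char *buf, size_t len)
  {
          char path[128];
          FILE *f;

          snprintf(path, sizeof(path), fmt, idx);
          f = fopen(path, "r");
          if (!f)
                  return -1;
          if (!fgets(buf, len, f))
                  buf[0] = '\0';
          fclose(f);
          buf[strcspn(buf, "\n")] = '\0';
          return 0;
  }

  int main(void)
  {
          char level[16], type[32], shared[256];
          int idx;

          /* Walk index0, index1, ... until no further cache is exported. */
          for (idx = 0; ; idx++) {
                  if (read_attr(CACHE_FMT "level", idx, level, sizeof(level)))
                          break;
                  read_attr(CACHE_FMT "type", idx, type, sizeof(type));
                  read_attr(CACHE_FMT "shared_cpu_list", idx, shared,
                            sizeof(shared));
                  printf("L%s %-12s shared by CPUs %s\n", level, type, shared);
          }
          return 0;
  }
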
Avoiding shared L2 or L3 caches is not always possible. Even when cache sharing
is minimized, bottlenecks can still occur when accessing system memory. Memory
is used not only by the CPU but also by peripheral devices via DMA, such as
graphics cards or network adapters.

In some cases, cache and memory bottlenecks can be controlled if the hardware
provides the necessary support. On x86 systems, Intel offers Cache Allocation
Technology (CAT), which enables cache partitioning among applications and
provides control over the interconnect. AMD provides similar functionality under
Platform Quality of Service (PQoS). On Arm64, the equivalent is Memory
System Resource Partitioning and Monitoring (MPAM).

These features can be configured through the Linux Resource Control interface.
For details, see Documentation/filesystems/resctrl.rst.

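As a rough sketch of how the interface is used, the example below creates a
control group, assigns it a share of the L3 cache, and moves the current task
into it. It assumes an x86 CPU with CAT support and the resctrl filesystem
mounted at ``/sys/fs/resctrl``; the group name ``rt-group`` and the bit mask in
the ``schemata`` line are placeholders that depend on the actual hardware::

  #include <errno.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* Write a short string to a resctrl file; returns 0 on success. */
  static int write_str(const char *path, const char *val)
  {
          FILE *f = fopen(path, "w");

          if (!f)
                  return -1;
          if (fputs(val, f) == EOF) {
                  fclose(f);
                  return -1;
          }
          return fclose(f);
  }

  int main(void)
  {
          char pid[32];

          /* Create a new control group; it starts with the default schemata. */
          if (mkdir("/sys/fs/resctrl/rt-group", 0755) && errno != EEXIST)
                  return 1;

          /*
           * Limit the group to the lowest four ways of L3 cache domain 0.
           * Depending on the kernel version, all cache domains may have to
           * be listed in a single write.
           */
          if (write_str("/sys/fs/resctrl/rt-group/schemata", "L3:0=f\n"))
                  return 1;

          /* Move the calling task into the group. */
          snprintf(pid, sizeof(pid), "%d\n", (int)getpid());
          return write_str("/sys/fs/resctrl/rt-group/tasks", pid) ? 1 : 0;
  }
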
The perf tool can be used to monitor cache behavior. It can analyze
cache misses of an application and compare how they change under
different workloads on a neighboring CPU. Going further, the perf c2c
tool can help identify cache-to-cache contention (false sharing), where
multiple CPU cores repeatedly access and modify data on the same cache
line.

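perf is usually driven from the command line, but the same hardware counters
can also be read from within an application via the ``perf_event_open()``
system call. The following self-contained sketch counts
``PERF_COUNT_HW_CACHE_MISSES`` for the calling thread around a section of
interest; which events are actually available depends on the CPU's performance
monitoring unit::

  #include <linux/perf_event.h>
  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Thin wrapper; glibc does not provide one for this system call. */
  static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                              int cpu, int group_fd, unsigned long flags)
  {
          return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
  }

  int main(void)
  {
          struct perf_event_attr attr = { 0 };
          uint64_t count;
          int fd;

          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;
          attr.config = PERF_COUNT_HW_CACHE_MISSES;
          attr.disabled = 1;
          attr.exclude_kernel = 1;

          /* Count cache misses of the calling thread, on any CPU. */
          fd = perf_event_open(&attr, 0, -1, -1, 0);
          if (fd < 0) {
                  perror("perf_event_open");
                  return 1;
          }

          ioctl(fd, PERF_EVENT_IOC_RESET, 0);
          ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

          /* ... the code section to be measured runs here ... */

          ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
          if (read(fd, &count, sizeof(count)) == sizeof(count))
                  printf("cache misses: %llu\n", (unsigned long long)count);

          close(fd);
          return 0;
  }
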
Hardware buses
--------------

Real-time systems often need to access hardware directly to perform their work.
Any latency in this process is undesirable, as it can affect the outcome of the
task. For example, on an I/O bus, a changed output may not become immediately
visible but instead appear with variable delay depending on the latency of the
bus used for communication.

A bus such as PCI is relatively simple because register accesses are routed
directly to the connected device. In the worst case, a read operation stalls the
CPU until the device responds.

A bus such as USB is more complex, involving multiple layers. A register read
or write is wrapped in a USB Request Block (URB), which is then sent by the
USB host controller to the device. Timing and latency are influenced by the
underlying USB bus. Requests cannot be sent immediately; they must align with
the next frame boundary according to the endpoint type and the host controller's
scheduling rules. This can introduce delays and additional latency. For example,
a network device connected via USB may still deliver sufficient throughput, but
the added latency when sending or receiving packets may fail to meet the
requirements of certain real-time use cases.

Additional restrictions on bus latency can arise from power management. For
instance, PCIe with Active State Power Management (ASPM) enabled can suspend
the link between the device and the host. While this behavior is beneficial for
power savings, it delays device access and adds latency to responses. This issue
is not limited to PCIe; internal buses within a System-on-Chip (SoC) can also be
affected by power management mechanisms.

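Whether ASPM is allowed to put links to sleep can be inspected and, if latency
matters more than power, changed through the global policy knob in sysfs. The
following minimal sketch assumes ``CONFIG_PCIEASPM`` is enabled and that the
firmware grants the operating system control over ASPM::

  #include <stdio.h>

  #define ASPM_POLICY "/sys/module/pcie_aspm/parameters/policy"

  int main(void)
  {
          char buf[128];
          FILE *f;

          /* The currently selected policy is shown in square brackets. */
          f = fopen(ASPM_POLICY, "r");
          if (!f) {
                  perror(ASPM_POLICY);
                  return 1;
          }
          if (fgets(buf, sizeof(buf), f))
                  printf("current policy: %s", buf);
          fclose(f);

          /* "performance" keeps the links from entering low-power states. */
          f = fopen(ASPM_POLICY, "w");
          if (!f || fputs("performance", f) == EOF || fclose(f))
                  return 1;
          return 0;
  }
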
Virtualization
--------------

In a virtualized environment such as KVM, each guest CPU is represented as a
thread on the host. If such a thread runs with real-time priority, the system
should be tested to confirm it can sustain this behavior over extended periods.
Because of its priority, the thread will not be preempted by lower-priority
threads (such as SCHED_OTHER), which may then receive no CPU time. This can
cause problems if a lower-priority thread is pinned to a CPU already occupied by
a real-time task and is therefore unable to make progress. Even if a CPU has
been isolated, the system may still (accidentally) start a per-CPU thread on
that CPU. Ensuring that a guest CPU goes idle is difficult, as it requires
avoiding both task scheduling and interrupt handling. Furthermore, if the guest
system is booted with the option **idle=poll**, the guest CPU never enters an
idle state even when it has nothing to do; instead it spins until an event
arrives.

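Giving a vCPU thread a real-time priority is normally done from the host, for
example with **chrt** or through the VMM configuration. Purely as an
illustration of the underlying mechanism, the sketch below assigns SCHED_FIFO
to a host thread; the TID used here is a made-up placeholder::

  #include <sched.h>
  #include <stdio.h>
  #include <sys/types.h>

  int main(void)
  {
          /* Placeholder TID; in practice taken from the VMM at runtime. */
          const pid_t vcpu_tid = 12345;
          const struct sched_param sp = { .sched_priority = 50 };

          if (sched_setscheduler(vcpu_tid, SCHED_FIFO, &sp)) {
                  perror("sched_setscheduler");
                  return 1;
          }
          return 0;
  }
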
Device handling introduces additional considerations. Emulated PCI devices or
VirtIO devices require a counterpart on the host to complete requests. This
adds latency because the host must intercept and either process the request
directly or schedule a thread for its completion. These delays can be avoided if
the required PCI device is passed directly through to the guest. Some devices,
such as networking or storage controllers, support the PCIe SR-IOV feature.
SR-IOV allows a single PCIe device to be divided into multiple virtual functions,
which can then be assigned to different guests.

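Virtual functions are created from the host through sysfs before they can be
assigned to guests. In the sketch below the PCI address ``0000:01:00.0`` is
only a placeholder; the write succeeds only if the device and its driver
actually support SR-IOV::

  #include <stdio.h>

  int main(void)
  {
          /* Placeholder PCI address of an SR-IOV capable physical function. */
          const char *path = "/sys/bus/pci/devices/0000:01:00.0/sriov_numvfs";
          FILE *f = fopen(path, "w");

          if (!f) {
                  perror(path);
                  return 1;
          }
          /* Request four VFs; an existing non-zero count must be reset to 0
           * before it can be changed. */
          fprintf(f, "4\n");
          return fclose(f) ? 1 : 0;
  }
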
Networking
----------

For low-latency networking, the full networking stack may be undesirable, as it
can introduce additional sources of delay. In this context, XDP can be used
as a shortcut to bypass much of the stack while still relying on the kernel's
network driver.

The requirements are that the network driver must support XDP (preferably using
an "skb pool") and that the application must use an XDP socket. Additional
configuration may involve BPF filters, tuning networking queues, or configuring
qdiscs for time-based transmission. These techniques are often
applied in Time-Sensitive Networking (TSN) environments.

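To give an idea of the shape of such a setup, the following sketch shows a
minimal XDP program that redirects packets from a receive queue to an AF_XDP
socket registered in an ``XSKMAP``. It follows the usual clang/libbpf build
conventions; the user-space side (UMEM registration, socket creation, filling
the map) is omitted here and is covered by the documentation referenced below::

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* One AF_XDP socket per receive queue, installed from user space. */
  struct {
          __uint(type, BPF_MAP_TYPE_XSKMAP);
          __uint(max_entries, 64);
          __type(key, __u32);
          __type(value, __u32);
  } xsks_map SEC(".maps");

  SEC("xdp")
  int xdp_redirect_xsk(struct xdp_md *ctx)
  {
          __u32 queue = ctx->rx_queue_index;

          /*
           * Hand the frame to the AF_XDP socket bound to this queue, if
           * any; otherwise let it travel through the regular stack.
           */
          if (bpf_map_lookup_elem(&xsks_map, &queue))
                  return bpf_redirect_map(&xsks_map, queue, XDP_PASS);

          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";
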
Documenting all required steps exceeds the scope of this text. For detailed
guidance, see the TSN documentation at https://tsn.readthedocs.io.

Another useful resource is the Linux Real-Time Communication Testbench
(https://github.com/Linutronix/RTC-Testbench).
The goal of this project is to validate real-time network communication. It can
be thought of as a "cyclictest" for networking and also serves as a starting
point for application development.