.. SPDX-License-Identifier: GPL-2.0

===========================
How realtime kernels differ
===========================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Preface
=======

With forced-threaded interrupts and sleeping spin locks, code paths that
previously caused long scheduling latencies have been made preemptible and
moved into process context. This allows the scheduler to manage them more
effectively and respond to higher-priority tasks with reduced latency.

The following chapters provide an overview of key differences between a
PREEMPT_RT kernel and a standard, non-PREEMPT_RT kernel.

Locking
=======

Spinning locks such as spinlock_t are used to provide synchronization for data
structures accessed from both interrupt context and process context. For this
reason, locking functions are also available with the _irq() or _irqsave()
suffixes, which disable interrupts before acquiring the lock. This ensures that
the lock can be safely acquired in process context when interrupts are enabled.

However, on a PREEMPT_RT system, interrupts are forced-threaded and no longer
run in hard IRQ context. As a result, there is no need to disable interrupts as
part of the locking procedure when using spinlock_t.

For low-level core components such as interrupt handling, the scheduler, or the
timer subsystem, the kernel uses raw_spinlock_t. This lock type preserves
traditional semantics: it disables preemption and, when used with _irq() or
_irqsave(), also disables interrupts. This ensures proper synchronization in
critical sections that must remain non-preemptible or with interrupts disabled.
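
The difference can be illustrated with a short sketch. The data structure and
function names below are invented for the example; only the locking calls
matter. On a PREEMPT_RT kernel, spin_lock_irqsave() neither disables interrupts
nor preemption and the lock may sleep, while raw_spin_lock_irqsave() behaves
the same on both kernel flavors::

  #include <linux/spinlock.h>

  static DEFINE_SPINLOCK(queue_lock);          /* sleeping lock on PREEMPT_RT */
  static DEFINE_RAW_SPINLOCK(core_state_lock); /* always a spinning lock */

  static void queue_event(void)
  {
          unsigned long flags;

          /*
           * Non-PREEMPT_RT: disables interrupts. PREEMPT_RT: acquires a
           * sleeping lock and leaves interrupts enabled, which is sufficient
           * because the interrupt handler sharing this data runs in a thread.
           */
          spin_lock_irqsave(&queue_lock, flags);
          /* ... update data shared with the (threaded) interrupt handler ... */
          spin_unlock_irqrestore(&queue_lock, flags);
  }

  static void update_core_state(void)
  {
          unsigned long flags;

          /* Disables interrupts and preemption on every kernel flavor. */
          raw_spin_lock_irqsave(&core_state_lock, flags);
          /* ... short, bounded, non-sleeping critical section ... */
          raw_spin_unlock_irqrestore(&core_state_lock, flags);
  }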

Execution context
=================

Interrupt handling in a PREEMPT_RT system is invoked in process context through
the use of threaded interrupts. Other parts of the kernel also shift their
execution into threaded context by different mechanisms. The goal is to keep
execution paths preemptible, allowing the scheduler to interrupt them when a
higher-priority task needs to run.

Below is an overview of the kernel subsystems involved in this transition to
threaded, preemptible execution.

Interrupt handling
------------------

All interrupts are forced-threaded in a PREEMPT_RT system. The exceptions are
interrupts that are requested with the IRQF_NO_THREAD, IRQF_PERCPU, or
IRQF_ONESHOT flags.

The IRQF_ONESHOT flag is used together with threaded interrupts, meaning those
registered using request_threaded_irq() and providing only a threaded handler.
Its purpose is to keep the interrupt line masked until the threaded handler has
completed.

If a primary handler is also provided in this case, it is essential that the
handler does not acquire any sleeping locks, as it will not be threaded. The
handler should be minimal and must avoid introducing delays, such as
busy-waiting on hardware registers.
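
A minimal sketch of the common pattern follows; the device structure and
handler names are made up for the example. Passing NULL as the primary handler
together with IRQF_ONESHOT keeps the interrupt line masked until the threaded
handler has run::

  #include <linux/interrupt.h>

  struct my_dev {
          int irq;        /* illustrative device state */
  };

  static irqreturn_t my_dev_irq_thread(int irq, void *dev_id)
  {
          /* Runs in a kernel thread; sleeping locks and blocking I/O are allowed. */
          return IRQ_HANDLED;
  }

  static int my_dev_setup_irq(struct my_dev *dev)
  {
          /*
           * No primary handler: the line stays masked (IRQF_ONESHOT) until
           * my_dev_irq_thread() has completed.
           */
          return request_threaded_irq(dev->irq, NULL, my_dev_irq_thread,
                                      IRQF_ONESHOT, "my_dev", dev);
  }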


Soft interrupts, bottom half handling
-------------------------------------

Soft interrupts are raised by the interrupt handler and are executed after the
handler returns. Since they run in thread context, they can be preempted by
other threads. Do not assume that softirq context runs with preemption
disabled. This means you must not rely on mechanisms like local_bh_disable() in
process context to protect per-CPU variables. Because softirq handlers are
preemptible under PREEMPT_RT, this approach does not provide reliable
synchronization.

If this kind of protection is required for performance reasons, consider using
local_lock_nested_bh(). On non-PREEMPT_RT kernels, this allows lockdep to
verify that bottom halves are disabled. On PREEMPT_RT systems, it adds the
necessary locking to ensure proper protection.

Using local_lock_nested_bh() also makes the locking scope explicit and easier
for readers and maintainers to understand.
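
A minimal sketch of the pattern, assuming the code is reached from bottom-half
context; the structure and field names are invented for the example::

  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  struct bh_stats {
          local_lock_t lock;
          unsigned long events;
  };

  static DEFINE_PER_CPU(struct bh_stats, bh_stats) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  /* Called from bottom-half context, e.g. from a softirq handler. */
  static void bh_stats_inc(void)
  {
          /*
           * Non-PREEMPT_RT: a lockdep annotation verifying that bottom
           * halves are disabled. PREEMPT_RT: acquires a real per-CPU lock
           * protecting the data below.
           */
          local_lock_nested_bh(&bh_stats.lock);
          __this_cpu_inc(bh_stats.events);
          local_unlock_nested_bh(&bh_stats.lock);
  }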


per-CPU variables
-----------------

Protecting access to per-CPU variables solely by using preempt_disable() should
be avoided, especially if the critical section has unbounded runtime or may
call APIs that can sleep.

If using a spinlock_t is considered too costly for performance reasons,
consider using local_lock_t. On non-PREEMPT_RT configurations, this introduces
no runtime overhead when lockdep is disabled. With lockdep enabled, it verifies
that the lock is only acquired in process context and never from softirq or
hard IRQ context.

On a PREEMPT_RT kernel, local_lock_t is implemented using a per-CPU spinlock_t,
which provides safe local protection for per-CPU data while keeping the system
preemptible.
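
For illustration, a minimal sketch of a per-CPU counter protected by
local_lock_t; the variable names are made up, and the same source works on both
kernel flavors::

  #include <linux/local_lock.h>
  #include <linux/percpu.h>

  struct cpu_counter {
          local_lock_t lock;
          unsigned long value;
  };

  static DEFINE_PER_CPU(struct cpu_counter, cpu_counter) = {
          .lock = INIT_LOCAL_LOCK(lock),
  };

  static void counter_add(unsigned long amount)
  {
          /*
           * Non-PREEMPT_RT: disables preemption (plus lockdep checks).
           * PREEMPT_RT: acquires the per-CPU spinlock_t backing the lock.
           */
          local_lock(&cpu_counter.lock);
          __this_cpu_add(cpu_counter.value, amount);
          local_unlock(&cpu_counter.lock);
  }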

Because spinlock_t on PREEMPT_RT does not disable preemption, it cannot be used
to protect per-CPU data by relying on implicit preemption disabling. If this
inherited preemption disabling is essential, and if local_lock_t cannot be used
due to performance constraints, brevity of the code, or abstraction boundaries
within an API, then preempt_disable_nested() may be a suitable alternative. On
non-PREEMPT_RT kernels, it verifies with lockdep that preemption is already
disabled. On PREEMPT_RT, it explicitly disables preemption.
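
A hedged sketch of that situation: the caller is assumed to already hold a
spinlock_t, so on a non-PREEMPT_RT kernel preemption is implicitly disabled
while on PREEMPT_RT it is not; the per-CPU variable name is illustrative::

  #include <linux/percpu.h>
  #include <linux/preempt.h>

  static DEFINE_PER_CPU(unsigned long, stats_updates);

  /* Caller holds a spinlock_t protecting the surrounding data structure. */
  static void stats_note_update(void)
  {
          /*
           * Non-PREEMPT_RT: lockdep asserts that preemption is already
           * disabled (by the spinlock_t held by the caller).
           * PREEMPT_RT: explicitly disables preemption for this update.
           */
          preempt_disable_nested();
          __this_cpu_inc(stats_updates);
          preempt_enable_nested();
  }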

Timers
------

By default, an hrtimer is executed in hard interrupt context. The exception is
timers initialized with the HRTIMER_MODE_SOFT flag, which are executed in
softirq context.

On a PREEMPT_RT kernel, this behavior is reversed: hrtimers are executed in
softirq context by default, typically within the ktimersd thread. This thread
runs at the lowest real-time priority, ensuring it executes before any
SCHED_OTHER tasks but does not interfere with higher-priority real-time
threads. To explicitly request execution in hard interrupt context on
PREEMPT_RT, the timer must be marked with the HRTIMER_MODE_HARD flag.
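
As a sketch (timer and callback names are invented for the example), a timer
that must fire in hard interrupt context even on PREEMPT_RT is set up with the
_HARD variant of the mode; its callback must then be short and must not take
sleeping locks::

  #include <linux/hrtimer.h>
  #include <linux/ktime.h>

  static struct hrtimer watchdog_timer;

  static enum hrtimer_restart watchdog_fn(struct hrtimer *timer)
  {
          /* Runs in hard interrupt context on every kernel flavor. */
          return HRTIMER_NORESTART;
  }

  static void watchdog_start(void)
  {
          hrtimer_init(&watchdog_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_HARD);
          watchdog_timer.function = watchdog_fn;
          hrtimer_start(&watchdog_timer, ms_to_ktime(10), HRTIMER_MODE_REL_HARD);
  }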

Memory allocation
-----------------

The memory allocation APIs, such as kmalloc() and alloc_pages(), require a
gfp_t flag to indicate the allocation context. On non-PREEMPT_RT kernels, it is
necessary to use GFP_ATOMIC when allocating memory from interrupt context or
from sections where preemption is disabled. This is because the allocator must
not sleep in these contexts waiting for memory to become available.

However, this approach does not work on PREEMPT_RT kernels. The memory
allocator in PREEMPT_RT uses sleeping locks internally, which cannot be
acquired when preemption is disabled. Fortunately, this is generally not a
problem, because PREEMPT_RT moves most contexts that would traditionally run
with preemption or interrupts disabled into threaded context, where sleeping is
allowed.

What remains problematic is code that explicitly disables preemption or
interrupts. In such cases, memory allocation must be performed outside the
critical section.

This restriction also applies to memory deallocation routines such as kfree()
and free_pages(), which may also involve internal locking and must not be
called from non-preemptible contexts.
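
A sketch of the resulting pattern (structure, list, and lock names are
invented): the allocation happens before the raw_spinlock_t section is entered,
and nothing is allocated or freed while preemption is disabled::

  #include <linux/slab.h>
  #include <linux/spinlock.h>
  #include <linux/list.h>

  struct entry {
          struct list_head node;
          int value;
  };

  static DEFINE_RAW_SPINLOCK(table_lock);
  static LIST_HEAD(table);

  static int table_insert(int value)
  {
          struct entry *e;
          unsigned long flags;

          /* Preemption is still enabled here, so the allocator may sleep. */
          e = kmalloc(sizeof(*e), GFP_KERNEL);
          if (!e)
                  return -ENOMEM;
          e->value = value;

          raw_spin_lock_irqsave(&table_lock, flags);
          /* Only publish the entry here; no kmalloc()/kfree() in this section. */
          list_add(&e->node, &table);
          raw_spin_unlock_irqrestore(&table_lock, flags);
          return 0;
  }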

IRQ work
--------

The irq_work API provides a mechanism to schedule a callback in interrupt
context. It is designed for use in contexts where traditional scheduling is not
possible, such as from within NMI handlers or from inside the scheduler, where
using a workqueue would be unsafe.

On non-PREEMPT_RT systems, all irq_work items are executed immediately in
interrupt context. Items marked with IRQ_WORK_LAZY are deferred until the next
timer tick but are still executed in interrupt context.

On PREEMPT_RT systems, the execution model changes. Because irq_work callbacks
may acquire sleeping locks or have unbounded execution time, they are handled
in thread context by a per-CPU irq_work kernel thread. This thread runs at the
lowest real-time priority, ensuring it executes before any SCHED_OTHER tasks
but does not interfere with higher-priority real-time threads.

The exceptions are work items marked with IRQ_WORK_HARD_IRQ, which are still
executed in hard interrupt context. Lazy items (IRQ_WORK_LAZY) continue to be
deferred until the next timer tick and are also executed by the per-CPU
irq_work kernel thread.
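
A short, hedged sketch of both variants (callback and item names are made up):
the default item is executed in the irq_work thread on PREEMPT_RT, while the
_HARD item keeps hard interrupt execution and therefore must not sleep::

  #include <linux/irq_work.h>

  static void deferred_wakeup_cb(struct irq_work *work)
  {
          /* PREEMPT_RT: runs in the per-CPU irq_work kernel thread. */
  }

  static void hard_cb(struct irq_work *work)
  {
          /* Runs in hard interrupt context on every kernel flavor. */
  }

  static struct irq_work deferred_wakeup = IRQ_WORK_INIT(deferred_wakeup_cb);
  static struct irq_work hard_item = IRQ_WORK_INIT_HARD(hard_cb);

  static void from_nmi_or_scheduler_context(void)
  {
          irq_work_queue(&deferred_wakeup);
          irq_work_queue(&hard_item);
  }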

RCU callbacks
-------------

RCU callbacks are invoked by default in softirq context. Their execution is
important because, depending on the use case, they either free memory or ensure
progress in state transitions. Running these callbacks as part of the softirq
chain can lead to undesired situations, such as contention for CPU resources
with other SCHED_OTHER tasks when executed within ksoftirqd.

To avoid running callbacks in softirq context, the RCU subsystem provides a
mechanism to execute them in process context instead. This behavior can be
enabled by setting the boot command-line parameter rcutree.use_softirq=0. This
setting is enforced in kernels configured with PREEMPT_RT.
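
For context, a typical memory-freeing callback looks like the following sketch
(the structure and function names are illustrative); the difference on
PREEMPT_RT is only where the callback runs, not how it is registered::

  #include <linux/rcupdate.h>
  #include <linux/slab.h>

  struct item {
          struct rcu_head rcu;
          int value;
  };

  static void item_free_rcu(struct rcu_head *head)
  {
          /* With rcutree.use_softirq=0 this runs in process context. */
          kfree(container_of(head, struct item, rcu));
  }

  static void item_release(struct item *item)
  {
          /* Frees the item after all pre-existing RCU readers have finished. */
          call_rcu(&item->rcu, item_free_rcu);
  }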

Spin until ready
================

The "spin until ready" pattern involves repeatedly checking (spinning on) the
state of a data structure until it becomes available. This pattern assumes that
preemption, soft interrupts, or interrupts are disabled. If the data structure
is marked busy, it is presumed to be in use by another CPU, and spinning should
eventually succeed as that CPU makes progress.

Some examples are hrtimer_cancel() or timer_delete_sync(). These functions
cancel timers that execute with interrupts or soft interrupts disabled. If a
thread attempts to cancel a timer and finds it active, spinning until the
callback completes is safe because the callback can only run on another CPU and
will eventually finish.

On PREEMPT_RT kernels, however, timer callbacks run in thread context. This
introduces a challenge: a higher-priority thread attempting to cancel the timer
may preempt the timer callback thread. Since the scheduler cannot migrate the
callback thread to another CPU due to affinity constraints, spinning can result
in livelock even on multiprocessor systems.

To avoid this, both the canceling and callback sides must use a handshake
mechanism that supports priority inheritance. This allows the canceling thread
to suspend until the callback completes, ensuring forward progress without
risking livelock.

To solve the problem at the API level, the sequence locks were extended to
allow a proper handover between the spinning reader and the possibly blocked
writer.

Sequence locks
--------------

Sequence counters and sequential locks are documented in
Documentation/locking/seqlock.rst.

The interface has been extended to ensure proper preemption states for the
writer and spinning reader contexts. This is achieved by embedding the writer
serialization lock directly into the sequence counter type, resulting in
composite types such as seqcount_spinlock_t or seqcount_mutex_t.

These composite types allow readers to detect an ongoing write and actively
boost the writer's priority to help it complete its update instead of spinning
and waiting for its completion.

If the plain seqcount_t is used, extra care must be taken to synchronize the
reader with the writer during updates. The writer must ensure its update is
serialized and non-preemptible relative to the reader. This cannot be achieved
using a regular spinlock_t because spinlock_t on PREEMPT_RT does not disable
preemption. In such cases, using seqcount_spinlock_t is the preferred solution.
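
A minimal sketch of the seqcount_spinlock_t pattern (the variable names are
invented): the spinlock_t serializes writers and, because it is associated with
the sequence counter, allows a reader on PREEMPT_RT to boost a preempted writer
instead of spinning::

  #include <linux/seqlock.h>
  #include <linux/spinlock.h>

  static DEFINE_SPINLOCK(cfg_lock);
  static seqcount_spinlock_t cfg_seq = SEQCNT_SPINLOCK_ZERO(cfg_seq, &cfg_lock);
  static u64 cfg_a, cfg_b;

  static void cfg_update(u64 a, u64 b)
  {
          spin_lock(&cfg_lock);           /* serializes writers */
          write_seqcount_begin(&cfg_seq);
          cfg_a = a;
          cfg_b = b;
          write_seqcount_end(&cfg_seq);
          spin_unlock(&cfg_lock);
  }

  static u64 cfg_read_sum(void)
  {
          unsigned int seq;
          u64 a, b;

          do {
                  seq = read_seqcount_begin(&cfg_seq);
                  a = cfg_a;
                  b = cfg_b;
          } while (read_seqcount_retry(&cfg_seq, seq));

          return a + b;
  }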

However, if there is no spinning involved, i.e., if the reader only needs to
detect whether a write has started and not serialize against it, then using
seqcount_t is reasonable.