
 ==============================
 Using the tracer for debugging
 ==============================

 Copyright 2024 Google LLC.

 :Author:   Steven Rostedt <rostedt@goodmis.org>
 :License:  The GNU Free Documentation License, Version 1.2
           (dual licensed under the GPL v2)

 - Written for: 6.12

 Introduction
 ------------
 The tracing infrastructure can be very useful for debugging the Linux
 kernel. This document is a place to add various methods of using the tracer
 for debugging.

 First, make sure that the tracefs file system is mounted::

  $ sudo mount -t tracefs tracefs /sys/kernel/tracing
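
 Many systems mount tracefs automatically, so it can be worth checking
 first. The following is a minimal sketch that only inspects /proc/mounts and
 assumes the usual mount point of /sys/kernel/tracing; it does not mount
 anything itself:

```shell
# Look for a tracefs mount at the usual location in /proc/mounts.
if grep -qs ' /sys/kernel/tracing tracefs ' /proc/mounts; then
    mounted=yes
else
    mounted=no
    # To mount it (requires root): mount -t tracefs tracefs /sys/kernel/tracing
fi
echo "tracefs mounted at /sys/kernel/tracing: ${mounted}"
```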


 Using trace_printk()
 --------------------

 trace_printk() is a very lightweight utility that can be used in any context
 inside the kernel, with the exception of "noinstr" sections. It can be used
 in normal, softirq, interrupt and even NMI context. The trace data is
 written to the tracing ring buffer in a lockless way. To make it even
 lighter weight, when possible, it will only record the pointer to the format
 string, and save the raw arguments into the buffer. The format and the
 arguments will be post processed when the ring buffer is read. This way the
 trace_printk() format conversions are not done during the hot path, where
 the trace is being recorded.

 trace_printk() is meant only for debugging, and should never be added into
 a subsystem of the kernel. If you need debugging traces, add trace events
 instead. If a trace_printk() is found in the kernel, the following will
 appear in dmesg::

   **********************************************************
   **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
   **                                                      **
   ** trace_printk() being used. Allocating extra memory.  **
   **                                                      **
   ** This means that this is a DEBUG kernel and it is     **
   ** unsafe for production use.                           **
   **                                                      **
   ** If you see this message and you are not debugging    **
   ** the kernel, report this immediately to your vendor!  **
   **                                                      **
   **   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **
   **********************************************************

 Debugging kernel crashes
 ------------------------
 There are various methods of acquiring the state of the system when a kernel
 crash occurs. This could be from the oops message in printk, or one could
 use kexec/kdump. But these just show what happened at the time of the crash.
 It can be very useful to know what happened up to the point of the crash.
 The tracing ring buffer, by default, is a circular buffer that will
 overwrite older events with newer ones. When a crash happens, the content of
 the ring buffer will be all the events that led up to the crash.

 There are several kernel command line parameters that can be used to help
 with this. The first is "ftrace_dump_on_oops". This will dump the tracing
 ring buffer to the console when an oops occurs. This can be useful if the
 console is being logged somewhere. If a serial console is used, it may be
 prudent to make sure the ring buffer is relatively small, otherwise the
 dumping of the ring buffer may take several minutes to hours to finish.
 Here's an example of the kernel command line::

   ftrace_dump_on_oops trace_buf_size=50K

 Note, the tracing buffer is made up of per CPU buffers where each of these
 buffers is broken up into sub-buffers that are by default PAGE_SIZE. The
 trace_buf_size option above sets each of the per CPU buffers to 50K,
 so, on a machine with 8 CPUs, that's actually 400K total.
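
 The arithmetic above can be checked with a short shell sketch. The CPU count
 of 8 is the example's assumption, not a value read from the machine, and the
 kernel may round the requested size. The tracefs files buffer_size_kb and
 buffer_total_size_kb are only read if they are accessible, which normally
 requires root:

```shell
# Per-CPU size from the example command line (trace_buf_size=50K), in KiB.
per_cpu_kb=50
# Assumed CPU count for illustration; on a live system use "$(nproc)".
cpus=8
total_kb=$((per_cpu_kb * cpus))
echo "total ring buffer size: ${total_kb}K"

# If tracefs is readable, show what the kernel actually reports.
tracefs=/sys/kernel/tracing
if [ -r "$tracefs/buffer_size_kb" ] && [ -r "$tracefs/buffer_total_size_kb" ]; then
    cat "$tracefs/buffer_size_kb" "$tracefs/buffer_total_size_kb"
fi
```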

 Persistent buffers across boots
 -------------------------------
 If the system memory allows it, the tracing ring buffer can be placed at
 a specific location in memory. If the location is the same across boots and
 the memory is not modified, the tracing buffer can be retrieved from the
 following boot. There are two ways to reserve memory for the use of the ring
 buffer.

 The more reliable way (on x86) is to reserve memory with the "memmap" kernel
 command line option and then use that memory for the trace_instance. This
 requires a bit of knowledge of the physical memory layout of the system. The
 advantage of using this method is that the memory for the ring buffer will
 always be the same::

   memmap=12M$0x284500000 trace_instance=boot_map@0x284500000:12M

 The memmap above reserves 12 megabytes of memory at the physical memory
 location 0x284500000. Then the trace_instance option will create a trace
 instance "boot_map" at that same location with the same amount of memory
 reserved. As the ring buffer is broken up into per CPU buffers, the 12
 megabytes will be broken up evenly between those CPUs. If you have 8 CPUs,
 each per CPU ring buffer will be 1.5 megabytes in size. Note, that also
 includes metadata, so the amount of memory actually used by the ring buffer
 will be slightly smaller.
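
 As a quick check of the split described above, the following sketch divides
 the 12 megabyte reservation across an assumed 8 CPUs. It ignores the
 metadata overhead, so the usable size per CPU will be somewhat less than the
 number it prints:

```shell
# Size of the reserved region from the example command line, in MiB.
total_mb=12
# Assumed CPU count for illustration; on a live system use "$(nproc)".
cpus=8
# Work in KiB so that the 1.5M result stays an integer.
per_cpu_kb=$((total_mb * 1024 / cpus))
echo "per-CPU share before metadata: ${per_cpu_kb}K"
```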

 Another more generic but less robust way to allocate a ring buffer mapping
 at boot is with the "reserve_mem" option::

   reserve_mem=12M:4096:trace trace_instance=boot_map@trace

 The reserve_mem option above will find 12 megabytes that are available at
 boot up, and align it by 4096 bytes. It will label this memory as "trace"
 that can be used by later command line options.

 The trace_instance option creates a "boot_map" instance and will use the
 memory reserved by reserve_mem that was labeled as "trace". This method is
 more generic but may not be as reliable. Due to KASLR, the memory reserved
 by reserve_mem may not be located at the same location. If this happens,
 then the ring buffer will not be from the previous boot and will be reset.

 Sometimes, using a larger alignment can keep KASLR from moving things
 around in such a way that it changes the location of the reserve_mem
 region. With a larger alignment, you may find that the buffer is placed
 more consistently from boot to boot::

   reserve_mem=12M:0x2000000:trace trace_instance=boot_map@trace

 On boot up, the memory reserved for the ring buffer is validated. It will go
 through a series of tests to make sure that the ring buffer contains valid
 data. If the data is valid, the buffer will be set up to be read from the
 instance. If it fails any of the tests, the entire ring buffer will be
 cleared and initialized as new.

 The layout of this mapped memory may not be consistent from kernel to
 kernel, so only the same kernel is guaranteed to work if the mapping is
 preserved. Switching to a different kernel version may find a different
 layout and mark the buffer as invalid.

 NB: Both the mapped address and size must be page aligned for the architecture.
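
 The alignment requirement can be checked before editing the kernel command
 line. The sketch below uses the address and size from the memmap example
 above and the page size of the running system as reported by getconf. It is
 only an illustration; the kernel performs its own checks at boot:

```shell
# Address and size taken from the memmap example in this document.
addr=0x284500000
size=$((12 * 1024 * 1024))
# Page size of the running system, in bytes.
page_size=$(getconf PAGESIZE)

# Both values must be whole multiples of the page size.
if [ $(($addr % $page_size)) -eq 0 ] && [ $(($size % $page_size)) -eq 0 ]; then
    aligned=yes
else
    aligned=no
fi
echo "page size ${page_size}: aligned=${aligned}"
```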

 Using trace_printk() in the boot instance
 -----------------------------------------
 By default, the content of trace_printk() goes into the top level tracing
 instance. But this instance is never preserved across boots. To have the
 trace_printk() content, and some other internal tracing (like stack dumps),
 go to the preserved buffer, either set the instance to be the trace_printk()
 destination from the kernel command line, or set it after boot up via the
 trace_printk_dest option.

 After boot up::

   echo 1 > /sys/kernel/tracing/instances/boot_map/options/trace_printk_dest

 From the kernel command line::

   reserve_mem=12M:4096:trace trace_instance=boot_map^traceprintk^traceoff@trace

 If setting it from the kernel command line, it is recommended to also
 disable tracing with the "traceoff" flag, and enable tracing after boot up.
 Otherwise the trace from the most recent boot will be mixed with the trace
 from the previous boot, and may make it confusing to read.
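
 After a reboot with "traceoff" set, one possible sequence is to save the
 previous boot's events and then turn tracing back on. This is a sketch under
 assumptions: the instance is named boot_map as in the examples above, the
 output file /tmp/prev_boot_trace.txt is a hypothetical name, and the
 commands need root:

```shell
inst=/sys/kernel/tracing/instances/boot_map
# Hypothetical location for the saved trace.
out=/tmp/prev_boot_trace.txt

if [ -r "$inst/trace" ] && [ -w "$inst/tracing_on" ]; then
    # Save the events left over from the previous boot first.
    cat "$inst/trace" > "$out"
    # Then resume tracing into the persistent instance.
    echo 1 > "$inst/tracing_on"
    status=saved
else
    status=unavailable
    echo "boot_map instance not accessible (not root, or not configured)"
fi
```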