|  | ############### | 
|  | Timerlat tracer | 
|  | ############### | 
|  |  | 
|  | The timerlat tracer helps preemptive kernel developers find the | 
|  | sources of wakeup latencies of real-time threads. Like cyclictest, | 
|  | the tracer sets a periodic timer that wakes up a thread. The thread then | 
|  | computes a *wakeup latency* value as the difference between the *current | 
|  | time* and the *absolute time* that the timer was set to expire. The main | 
|  | goal of timerlat is to trace this latency in a way that helps kernel | 
|  | developers. | 
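
The latency computation described above can be sketched in user space. This is only an illustration under assumptions: the helper names ``timespec_diff_ns`` and ``wakeup_latency_ns`` are hypothetical, and the sketch sleeps with ``clock_nanosleep()``, whereas the tracer itself arms a timer inside the kernel.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: return (a - b) in nanoseconds. */
static int64_t timespec_diff_ns(const struct timespec *a,
				const struct timespec *b)
{
	return (int64_t)(a->tv_sec - b->tv_sec) * 1000000000LL +
	       (a->tv_nsec - b->tv_nsec);
}

/*
 * Hypothetical helper: sleep until the absolute time *expiry, then
 * return (current time - expiry), i.e. the wakeup latency in ns.
 */
static int64_t wakeup_latency_ns(const struct timespec *expiry)
{
	struct timespec now;

	/* Retry if a signal interrupts the sleep. */
	while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
			       expiry, NULL) == EINTR)
		;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return timespec_diff_ns(&now, expiry);
}
```

A periodic loop would add the period to ``expiry`` after each call, so that the next expiration is always an absolute time and delays do not accumulate.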
|  |  | 
|  | Usage | 
|  | ----- | 
|  |  | 
|  | Write the ASCII text "timerlat" into the current_tracer file of the | 
|  | tracing system (generally mounted at /sys/kernel/tracing). | 
|  |  | 
|  | For example:: | 
|  |  | 
|  | [root@f32 ~]# cd /sys/kernel/tracing/ | 
|  | [root@f32 tracing]# echo timerlat > current_tracer | 
|  |  | 
|  | It is possible to follow the trace by reading the trace file:: | 
|  |  | 
|  | [root@f32 tracing]# cat trace | 
|  | # tracer: timerlat | 
|  | # | 
|  | #                              _-----=> irqs-off | 
|  | #                             / _----=> need-resched | 
|  | #                            | / _---=> hardirq/softirq | 
|  | #                            || / _--=> preempt-depth | 
|  | #                            || / | 
|  | #                            ||||             ACTIVATION | 
|  | #         TASK-PID      CPU# ||||   TIMESTAMP    ID            CONTEXT                LATENCY | 
|  | #            | |         |   ||||      |         |                  |                       | | 
|  | <idle>-0       [000] d.h1    54.029328: #1     context    irq timer_latency       932 ns | 
|  | <...>-867     [000] ....    54.029339: #1     context thread timer_latency     11700 ns | 
|  | <idle>-0       [001] dNh1    54.029346: #1     context    irq timer_latency      2833 ns | 
|  | <...>-868     [001] ....    54.029353: #1     context thread timer_latency      9820 ns | 
|  | <idle>-0       [000] d.h1    54.030328: #2     context    irq timer_latency       769 ns | 
|  | <...>-867     [000] ....    54.030330: #2     context thread timer_latency      3070 ns | 
|  | <idle>-0       [001] d.h1    54.030344: #2     context    irq timer_latency       935 ns | 
|  | <...>-868     [001] ....    54.030347: #2     context thread timer_latency      4351 ns | 
|  |  | 
|  |  | 
|  | The tracer creates a per-cpu kernel thread with real-time priority that | 
|  | prints two lines at every activation. The first is the *timer latency* | 
|  | observed at the *hardirq* context before the activation of the thread. | 
|  | The second is the *timer latency* observed by the thread. The ACTIVATION | 
|  | ID field serves to relate the *irq* execution to its respective *thread* | 
|  | execution. | 
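
To relate the two lines of an activation programmatically, the ACTIVATION ID, CONTEXT and LATENCY columns can be extracted from each trace line. The following is a minimal sketch that assumes the line layout shown in the example output above (the layout may differ between kernel versions); ``struct timerlat_sample`` and ``parse_timerlat_line`` are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one timerlat trace line. */
struct timerlat_sample {
	unsigned long id;		/* ACTIVATION ID column */
	char context[8];		/* "irq" or "thread" */
	unsigned long long latency_ns;	/* LATENCY column, in ns */
};

/*
 * Parse a line such as:
 *   <idle>-0 [000] d.h1 54.029328: #1 context irq timer_latency 932 ns
 * Returns 0 on success, or -1 if the line is not a timerlat sample
 * (for example a header line or an osnoise: event).
 */
static int parse_timerlat_line(const char *line, struct timerlat_sample *s)
{
	/* The sample starts right after the timestamp, at ": #<id>". */
	const char *p = strstr(line, ": #");

	if (!p)
		return -1;

	if (sscanf(p, ": #%lu context %7s timer_latency %llu ns",
		   &s->id, s->context, &s->latency_ns) != 3)
		return -1;

	return 0;
}
```

Two samples from the same CPU with the same ``id``, one with context ``irq`` and one with context ``thread``, belong to the same activation.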
|  |  | 
|  | The *irq*/*thread* split is important to clarify which context an | 
|  | unexpectedly high value comes from. The *irq* context can be | 
|  | delayed by hardware-related actions, such as SMIs, NMIs, IRQs, | 
|  | or by threads masking interrupts. Once the timer fires, the delay | 
|  | can also be influenced by blocking caused by threads, for example, by | 
|  | postponing the scheduler execution via preempt_disable(), by the scheduler | 
|  | execution itself, or by masking interrupts. Threads can also be delayed by | 
|  | interference from other threads and IRQs. | 
|  |  | 
|  | Tracer options | 
|  | --------------------- | 
|  |  | 
|  | The timerlat tracer is built on top of the osnoise tracer, so its | 
|  | configuration is also done in the osnoise/ config directory. | 
|  | The timerlat configs are: | 
|  |  | 
|  | - cpus: CPUs at which a timerlat thread will execute. | 
|  | - timerlat_period_us: the period of the timerlat thread. | 
|  | - stop_tracing_us: stop the system tracing if a | 
|  | timer latency at the *irq* context higher than the configured | 
|  | value happens. Writing 0 disables this option. | 
|  | - stop_tracing_total_us: stop the system tracing if a | 
|  | timer latency at the *thread* context higher than the configured | 
|  | value happens. Writing 0 disables this option. | 
|  | - print_stack: save the stack of the IRQ occurrence. The stack is printed | 
|  | after the *thread context* event, or at the IRQ handler if *stop_tracing_us* | 
|  | is hit. | 
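
These options are set by writing a value to the corresponding file, as the shell examples in this document do with ``echo``. The same can be done from C. The sketch below is an illustration only: ``timerlat_set_option`` is a hypothetical helper, and the directory is passed as a parameter because the tracing mount point can vary.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write an integer value to one option file
 * inside the osnoise/ config directory.
 * Returns 0 on success, -1 on error.
 */
static int timerlat_set_option(const char *osnoise_dir, const char *option,
			       long long value)
{
	char path[512];
	FILE *fp;
	int len;
	int ret = 0;

	len = snprintf(path, sizeof(path), "%s/%s", osnoise_dir, option);
	if (len < 0 || len >= (int)sizeof(path))
		return -1;

	fp = fopen(path, "w");
	if (!fp)
		return -1;

	if (fprintf(fp, "%lld\n", value) < 0)
		ret = -1;

	/* fclose() flushes the buffer, so a failed write shows up here. */
	if (fclose(fp) != 0)
		ret = -1;

	return ret;
}
```

For instance, ``timerlat_set_option("/sys/kernel/tracing/osnoise", "stop_tracing_total_us", 0)`` would disable the thread-context stop condition, assuming the tracing system is mounted at that path and the caller has permission to write to it.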
|  |  | 
|  | timerlat and osnoise | 
|  | ---------------------------- | 
|  |  | 
|  | The timerlat tracer can also take advantage of the osnoise: trace events. | 
|  | For example:: | 
|  |  | 
|  | [root@f32 ~]# cd /sys/kernel/tracing/ | 
|  | [root@f32 tracing]# echo timerlat > current_tracer | 
|  | [root@f32 tracing]# echo 1 > events/osnoise/enable | 
|  | [root@f32 tracing]# echo 25 > osnoise/stop_tracing_total_us | 
|  | [root@f32 tracing]# tail -10 trace | 
|  | cc1-87882   [005] d..h...   548.771078: #402268 context    irq timer_latency     13585 ns | 
|  | cc1-87882   [005] dNLh1..   548.771082: irq_noise: local_timer:236 start 548.771077442 duration 7597 ns | 
|  | cc1-87882   [005] dNLh2..   548.771099: irq_noise: qxl:21 start 548.771085017 duration 7139 ns | 
|  | cc1-87882   [005] d...3..   548.771102: thread_noise:      cc1:87882 start 548.771078243 duration 9909 ns | 
|  | timerlat/5-1035    [005] .......   548.771104: #402268 context thread timer_latency     39960 ns | 
|  |  | 
|  | In this case, the timer latency does not have a single root cause | 
|  | but multiple ones. Firstly, the timer IRQ was delayed | 
|  | for 13 us, which may point to a long IRQ disabled section (see IRQ | 
|  | stacktrace section). Then the timer interrupt that wakes up the timerlat | 
|  | thread took 7597 ns, and the qxl:21 device IRQ took 7139 ns. Finally, | 
|  | the cc1 thread noise took 9909 ns of time before the context switch. | 
|  | Such pieces of evidence are useful for the developer to use other | 
|  | tracing methods to figure out how to debug and optimize the system. | 
|  |  | 
|  | It is worth mentioning that the *duration* values reported | 
|  | by the osnoise: events are *net* values. For example, the | 
|  | thread_noise does not include the duration of the overhead caused | 
|  | by the IRQ execution (which accounted for 7597 + 7139 = 14736 ns). But | 
|  | the values reported by the timerlat tracer (timer_latency) | 
|  | are *gross* values. | 
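
Because the osnoise: durations are net values, they can simply be added up without counting any interval twice. A minimal sketch of that arithmetic follows; ``net_noise_total_ns`` is a hypothetical helper name, not part of any tracing API.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: add up the net durations (in ns) reported by
 * a set of osnoise: events such as irq_noise and thread_noise.
 */
static uint64_t net_noise_total_ns(const uint64_t *net_ns, size_t count)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < count; i++)
		total += net_ns[i];

	return total;
}
```

For the trace above, the three net durations (7597 ns, 7139 ns and 9909 ns) add up to 24645 ns. The gross thread timer_latency of 39960 ns is larger because it is measured from the timer expiration and therefore also includes the time before the IRQ handler ran.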
|  |  | 
|  | The art below illustrates a CPU timeline and how the timerlat tracer | 
|  | observes it at the top and the osnoise: events at the bottom. Each "-" | 
|  | in the timelines means circa 1 us, and the time moves ==>:: | 
|  |  | 
|  |     External     timer irq                   thread | 
|  |      clock        latency                    latency | 
|  |      event        13585 ns                   39960 ns | 
|  |        |             ^                         ^ | 
|  |        v             |                         | | 
|  |        |-------------|                         | | 
|  |        |-------------+-------------------------| | 
|  |                      ^                         ^ | 
|  | ======================================================================== | 
|  |                   [tmr irq]  [dev irq] | 
|  | [another thread...^       v..^       v.......][timerlat/ thread]  <-- CPU timeline | 
|  | ========================================================================= | 
|  |                   |-------|  |-------| | 
|  |                           |--^       v-------| | 
|  |                           |          |       | | 
|  |                           |          |       + thread_noise: 9909 ns | 
|  |                           |          +-> irq_noise: 7139 ns | 
|  |                           +-> irq_noise: 7597 ns | 
|  |  | 
|  | IRQ stacktrace | 
|  | --------------------------- | 
|  |  | 
|  | The osnoise/print_stack option is helpful when thread noise is the | 
|  | major factor in the timer latency because preemption or IRQs are | 
|  | disabled. For example:: | 
|  |  | 
|  | [root@f32 tracing]# echo 500 > osnoise/stop_tracing_total_us | 
|  | [root@f32 tracing]# echo 500 > osnoise/print_stack | 
|  | [root@f32 tracing]# echo timerlat > current_tracer | 
|  | [root@f32 tracing]# tail -21 per_cpu/cpu7/trace | 
|  | insmod-1026    [007] dN.h1..   200.201948: irq_noise: local_timer:236 start 200.201939376 duration 7872 ns | 
|  | insmod-1026    [007] d..h1..   200.202587: #29800 context    irq timer_latency      1616 ns | 
|  | insmod-1026    [007] dN.h2..   200.202598: irq_noise: local_timer:236 start 200.202586162 duration 11855 ns | 
|  | insmod-1026    [007] dN.h3..   200.202947: irq_noise: local_timer:236 start 200.202939174 duration 7318 ns | 
|  | insmod-1026    [007] d...3..   200.203444: thread_noise:   insmod:1026 start 200.202586933 duration 838681 ns | 
|  | timerlat/7-1001    [007] .......   200.203445: #29800 context thread timer_latency    859978 ns | 
|  | timerlat/7-1001    [007] ....1..   200.203446: <stack trace> | 
|  | => timerlat_irq | 
|  | => __hrtimer_run_queues | 
|  | => hrtimer_interrupt | 
|  | => __sysvec_apic_timer_interrupt | 
|  | => asm_call_irq_on_stack | 
|  | => sysvec_apic_timer_interrupt | 
|  | => asm_sysvec_apic_timer_interrupt | 
|  | => delay_tsc | 
|  | => dummy_load_1ms_pd_init | 
|  | => do_one_initcall | 
|  | => do_init_module | 
|  | => __do_sys_finit_module | 
|  | => do_syscall_64 | 
|  | => entry_SYSCALL_64_after_hwframe | 
|  |  | 
|  | In this case, it is possible to see that the thread added the highest | 
|  | contribution to the *timer latency* and the stack trace, saved during | 
|  | the timerlat IRQ handler, points to a function named | 
|  | dummy_load_1ms_pd_init, which had the following code (on purpose):: | 
|  |  | 
|  | static int __init dummy_load_1ms_pd_init(void) | 
|  | { | 
|  | 	preempt_disable(); | 
|  | 	mdelay(1); | 
|  | 	preempt_enable(); | 
|  | 	return 0; | 
|  | } | 
|  |  | 
|  | User-space interface | 
|  | --------------------------- | 
|  |  | 
|  | Timerlat allows user-space threads to use the timerlat infrastructure to | 
|  | measure scheduling latency. This interface is accessible via a per-CPU | 
|  | file descriptor inside $tracing_dir/osnoise/per_cpu/cpu$ID/timerlat_fd. | 
|  |  | 
|  | This interface is accessible under the following conditions: | 
|  |  | 
|  | - The timerlat tracer is enabled | 
|  | - osnoise workload option is set to NO_OSNOISE_WORKLOAD | 
|  | - The user-space thread is affined to a single processor | 
|  | - The thread opens the file associated with its single processor | 
|  | - Only one thread can access the file at a time | 
|  |  | 
|  | The open() syscall will fail if any of these conditions are not met. | 
|  | After opening the file descriptor, the user space can read from it. | 
|  |  | 
|  | The read() system call runs timerlat code that arms the timer in the | 
|  | future and waits for it, as the regular kernel thread does. | 
|  |  | 
|  | When the timer IRQ fires, the timerlat IRQ handler executes, reports the | 
|  | IRQ latency, and wakes up the thread waiting in the read(). The thread is | 
|  | then scheduled and reports the thread latency via the tracer, as the kernel | 
|  | thread does. | 
|  |  | 
|  | The difference from the in-kernel timerlat is that, instead of re-arming | 
|  | the timer, timerlat will return to the read() system call. At this point, | 
|  | the user can run any code. | 
|  |  | 
|  | If the application rereads the timerlat file descriptor, the tracer | 
|  | will report the return from user-space latency, which is the total | 
|  | latency. If this is the end of the work, it can be interpreted as the | 
|  | response time for the request. | 
|  |  | 
|  | After reporting the total latency, timerlat will restart the cycle, arm | 
|  | a timer, and go to sleep for the following activation. | 
|  |  | 
|  | If at any time one of the conditions is broken, e.g., the thread migrates | 
|  | while in user space, or the timerlat tracer is disabled, the SIGKILL | 
|  | signal will be sent to the user-space thread. | 
|  |  | 
|  | Here is a basic example of user-space code for timerlat:: | 
|  |  | 
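|  | #define _GNU_SOURCE	/* for sched_setaffinity(), CPU_SET() and gettid() */ | 
|  | #include <errno.h> | 
|  | #include <fcntl.h> | 
|  | #include <sched.h> | 
|  | #include <stdio.h> | 
|  | #include <stdlib.h> | 
|  | #include <string.h> | 
|  | #include <unistd.h> | 
|  |  | 
|  | int main(void) | 
|  | { | 
|  | 	char buffer[1024]; | 
|  | 	int timerlat_fd; | 
|  | 	int retval; | 
|  | 	long cpu = 0;   /* place in CPU 0 */ | 
|  | 	cpu_set_t set; | 
|  |  | 
|  | 	/* The thread must be affined to a single CPU before opening the file. */ | 
|  | 	CPU_ZERO(&set); | 
|  | 	CPU_SET(cpu, &set); | 
|  |  | 
|  | 	if (sched_setaffinity(gettid(), sizeof(set), &set) == -1) | 
|  | 		return 1; | 
|  |  | 
|  | 	snprintf(buffer, sizeof(buffer), | 
|  | 		 "/sys/kernel/tracing/osnoise/per_cpu/cpu%ld/timerlat_fd", | 
|  | 		 cpu); | 
|  |  | 
|  | 	timerlat_fd = open(buffer, O_RDONLY); | 
|  | 	if (timerlat_fd < 0) { | 
|  | 		printf("error opening %s: %s\n", buffer, strerror(errno)); | 
|  | 		exit(1); | 
|  | 	} | 
|  |  | 
|  | 	/* Each read() arms the timer and waits for the next activation. */ | 
|  | 	for (;;) { | 
|  | 		retval = read(timerlat_fd, buffer, sizeof(buffer)); | 
|  | 		if (retval < 0) | 
|  | 			break; | 
|  | 	} | 
|  |  | 
|  | 	close(timerlat_fd); | 
|  | 	exit(0); | 
|  | } | 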