trivial: unnecessary (void*) cast removal in sound/oss/msnd.c
[GitHub/mt8127/android_kernel_alcatel_ttab.git] / Documentation / trace / ftrace.txt
CommitLineData
eb6d42ea
SR
1 ftrace - Function Tracer
2 ========================
3
4Copyright 2008 Red Hat Inc.
a41eebab
SR
5 Author: Steven Rostedt <srostedt@redhat.com>
6 License: The GNU Free Documentation License, Version 1.2
a97762a7 7 (dual licensed under the GPL v2)
f2d9c740
SR
8Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton,
9 John Kacur, and David Teigland.
eb6d42ea 10
42ec632e 11Written for: 2.6.28-rc2
eb6d42ea
SR
12
13Introduction
14------------
15
16Ftrace is an internal tracer designed to help out developers and
17designers of systems to find what is going on inside the kernel.
5752674e
IM
18It can be used for debugging or analyzing latencies and
19performance issues that take place outside of user-space.
eb6d42ea
SR
20
21Although ftrace is the function tracer, it also includes an
5752674e
IM
22infrastructure that allows for other types of tracing. Some of
23the tracers that are currently in ftrace include a tracer to
24trace context switches, the time it takes for a high priority
25task to run after it was woken up, the time interrupts are
26disabled, and more (ftrace allows for tracer plugins, which
27means that the list of tracers can always grow).
eb6d42ea
SR
28
29
30The File System
31---------------
32
5752674e
IM
33Ftrace uses the debugfs file system to hold the control files as
34well as the files to display output.
eb6d42ea
SR
35
36To mount the debugfs system:
37
38 # mkdir /debug
39 # mount -t debugfs nodev /debug
40
5752674e
IM
41( Note: it is more common to mount at /sys/kernel/debug, but for
42 simplicity this document will use /debug)
eb6d42ea
SR
43
44That's it! (assuming that you have ftrace configured into your kernel)
45
46After mounting the debugfs, you can see a directory called
47"tracing". This directory contains the control and output files
48of ftrace. Here is a list of some of the key files:
49
50
51 Note: all time values are in microseconds.
52
5752674e
IM
53 current_tracer:
54
55 This is used to set or display the current tracer
56 that is configured.
57
58 available_tracers:
59
60 This holds the different types of tracers that
61 have been compiled into the kernel. The
62 tracers listed here can be configured by
63 echoing their name into current_tracer.
64
65 tracing_enabled:
66
67 This sets or displays whether the current_tracer
68 is activated and tracing or not. Echo 0 into this
69 file to disable the tracer or 1 to enable it.
70
71 trace:
72
73 This file holds the output of the trace in a human
74 readable format (described below).
75
76 latency_trace:
77
78 This file shows the same trace but the information
79 is organized more to display possible latencies
80 in the system (described below).
81
82 trace_pipe:
83
84 The output is the same as the "trace" file but this
85 file is meant to be streamed with live tracing.
86 Reads from this file will block until new data
87 is retrieved. Unlike the "trace" and "latency_trace"
88 files, this file is a consumer. This means reading
89 from this file causes sequential reads to display
90 more current data. Once data is read from this
91 file, it is consumed, and will not be read
92 again with a sequential read. The "trace" and
93 "latency_trace" files are static, and if the
94 tracer is not adding more data, they will display
95 the same information every time they are read.
96
97 trace_options:
98
99 This file lets the user control the amount of data
100 that is displayed in one of the above output
101 files.
102
42b40b3d 103 tracing_max_latency:
5752674e
IM
104
105 Some of the tracers record the max latency.
106 For example, the time interrupts are disabled.
107 This time is saved in this file. The max trace
108 will also be stored, and displayed by either
109 "trace" or "latency_trace". A new max trace will
110 only be recorded if the latency is greater than
111 the value in this file. (in microseconds)
112
113 buffer_size_kb:
114
115 This sets or displays the number of kilobytes each CPU
116 buffer can hold. The tracer buffers are the same size
117 for each CPU. The displayed number is the size of the
118 CPU buffer and not total size of all buffers. The
119 trace buffers are allocated in pages (blocks of memory
120 that the kernel uses for allocation, usually 4 KB in size).
121 If the last page allocated has room for more bytes
122 than requested, the rest of the page will be used,
123 making the actual allocation bigger than requested.
124 ( Note, the size may not be a multiple of the page size
125 due to buffer managment overhead. )
126
127 This can only be updated when the current_tracer
128 is set to "nop".
129
130 tracing_cpumask:
131
132 This is a mask that lets the user only trace
133 on specified CPUS. The format is a hex string
134 representing the CPUS.
135
136 set_ftrace_filter:
137
138 When dynamic ftrace is configured in (see the
139 section below "dynamic ftrace"), the code is dynamically
140 modified (code text rewrite) to disable calling of the
141 function profiler (mcount). This lets tracing be configured
142 in with practically no overhead in performance. This also
143 has a side effect of enabling or disabling specific functions
144 to be traced. Echoing names of functions into this file
145 will limit the trace to only those functions.
146
147 set_ftrace_notrace:
148
149 This has an effect opposite to that of
150 set_ftrace_filter. Any function that is added here will not
151 be traced. If a function exists in both set_ftrace_filter
152 and set_ftrace_notrace, the function will _not_ be traced.
153
154 set_ftrace_pid:
155
156 Have the function tracer only trace a single thread.
157
158 set_graph_function:
159
160 Set a "trigger" function where tracing should start
161 with the function graph tracer (See the section
162 "dynamic ftrace" for more details).
163
164 available_filter_functions:
165
166 This lists the functions that ftrace
167 has processed and can trace. These are the function
168 names that you can pass to "set_ftrace_filter" or
169 "set_ftrace_notrace". (See the section "dynamic ftrace"
170 below for more details.)
eb6d42ea
SR
171
172
173The Tracers
174-----------
175
f2d9c740 176Here is the list of current tracers that may be configured.
eb6d42ea 177
5752674e
IM
178 "function"
179
180 Function call tracer to trace all kernel functions.
181
bc5c6c04 182 "function_graph"
5752674e
IM
183
184 Similar to the function tracer except that the
185 function tracer probes the functions on their entry
186 whereas the function graph tracer traces on both entry
187 and exit of the functions. It then provides the ability
188 to draw a graph of function calls similar to C code
189 source.
190
191 "sched_switch"
192
193 Traces the context switches and wakeups between tasks.
194
195 "irqsoff"
196
197 Traces the areas that disable interrupts and saves
198 the trace with the longest max latency.
199 See tracing_max_latency. When a new max is recorded,
200 it replaces the old trace. It is best to view this
201 trace via the latency_trace file.
eb6d42ea 202
5752674e 203 "preemptoff"
985ec20a 204
5752674e
IM
205 Similar to irqsoff but traces and records the amount of
206 time for which preemption is disabled.
eb6d42ea 207
5752674e 208 "preemptirqsoff"
eb6d42ea 209
5752674e
IM
210 Similar to irqsoff and preemptoff, but traces and
211 records the largest time for which irqs and/or preemption
212 is disabled.
eb6d42ea 213
5752674e 214 "wakeup"
eb6d42ea 215
5752674e
IM
216 Traces and records the max latency that it takes for
217 the highest priority task to get scheduled after
218 it has been woken up.
eb6d42ea 219
5752674e 220 "hw-branch-tracer"
eb6d42ea 221
5752674e
IM
222 Uses the BTS CPU feature on x86 CPUs to traces all
223 branches executed.
224
225 "nop"
226
227 This is the "trace nothing" tracer. To remove all
228 tracers from tracing simply echo "nop" into
229 current_tracer.
e2ea5399 230
eb6d42ea
SR
231
232Examples of using the tracer
233----------------------------
234
5752674e
IM
235Here are typical examples of using the tracers when controlling
236them only with the debugfs interface (without using any
237user-land utilities).
eb6d42ea
SR
238
239Output format:
240--------------
241
f2d9c740 242Here is an example of the output format of the file "trace"
eb6d42ea
SR
243
244 --------
9b803c0f 245# tracer: function
eb6d42ea
SR
246#
247# TASK-PID CPU# TIMESTAMP FUNCTION
248# | | | | |
249 bash-4251 [01] 10152.583854: path_put <-path_walk
250 bash-4251 [01] 10152.583855: dput <-path_put
251 bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput
252 --------
253
5752674e
IM
254A header is printed with the tracer name that is represented by
255the trace. In this case the tracer is "function". Then a header
256showing the format. Task name "bash", the task PID "4251", the
257CPU that it was running on "01", the timestamp in <secs>.<usecs>
258format, the function name that was traced "path_put" and the
259parent function that called this function "path_walk". The
260timestamp is the time at which the function was entered.
eb6d42ea 261
5752674e
IM
262The sched_switch tracer also includes tracing of task wakeups
263and context switches.
eb6d42ea
SR
264
265 ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S
266 ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 10:115:S
267 ksoftirqd/1-7 [01] 1453.070013: 7:115:R ==> 10:115:R
268 events/1-10 [01] 1453.070013: 10:115:S ==> 2916:115:R
269 kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R
270 ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R
271
5752674e
IM
272Wake ups are represented by a "+" and the context switches are
273shown as "==>". The format is:
eb6d42ea
SR
274
275 Context switches:
276
277 Previous task Next Task
278
279 <pid>:<prio>:<state> ==> <pid>:<prio>:<state>
280
281 Wake ups:
282
283 Current task Task waking up
284
285 <pid>:<prio>:<state> + <pid>:<prio>:<state>
286
5752674e
IM
287The prio is the internal kernel priority, which is the inverse
288of the priority that is usually displayed by user-space tools.
289Zero represents the highest priority (99). Prio 100 starts the
290"nice" priorities with 100 being equal to nice -20 and 139 being
291nice 19. The prio "140" is reserved for the idle task which is
292the lowest priority thread (pid 0).
eb6d42ea
SR
293
294
295Latency trace format
296--------------------
297
5752674e
IM
298For traces that display latency times, the latency_trace file
299gives somewhat more information to see why a latency happened.
300Here is a typical trace.
eb6d42ea
SR
301
302# tracer: irqsoff
303#
304irqsoff latency trace v1.1.5 on 2.6.26-rc8
305--------------------------------------------------------------------
306 latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
307 -----------------
308 | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
309 -----------------
310 => started at: apic_timer_interrupt
311 => ended at: do_softirq
312
313# _------=> CPU#
314# / _-----=> irqs-off
315# | / _----=> need-resched
316# || / _---=> hardirq/softirq
317# ||| / _--=> preempt-depth
318# |||| /
319# ||||| delay
320# cmd pid ||||| time | caller
321# \ / ||||| \ | /
322 <idle>-0 0d..1 0us+: trace_hardirqs_off_thunk (apic_timer_interrupt)
323 <idle>-0 0d.s. 97us : __do_softirq (do_softirq)
324 <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq)
325
326
5752674e
IM
327This shows that the current tracer is "irqsoff" tracing the time
328for which interrupts were disabled. It gives the trace version
329and the version of the kernel upon which this was executed on
330(2.6.26-rc8). Then it displays the max latency in microsecs (97
331us). The number of trace entries displayed and the total number
332recorded (both are three: #3/3). The type of preemption that was
333used (PREEMPT). VP, KP, SP, and HP are always zero and are
334reserved for later use. #P is the number of online CPUS (#P:2).
eb6d42ea 335
5752674e
IM
336The task is the process that was running when the latency
337occurred. (swapper pid: 0).
eb6d42ea 338
5752674e
IM
339The start and stop (the functions in which the interrupts were
340disabled and enabled respectively) that caused the latencies:
eb6d42ea
SR
341
342 apic_timer_interrupt is where the interrupts were disabled.
343 do_softirq is where they were enabled again.
344
345The next lines after the header are the trace itself. The header
346explains which is which.
347
348 cmd: The name of the process in the trace.
349
350 pid: The PID of that process.
351
f2d9c740 352 CPU#: The CPU which the process was running on.
eb6d42ea
SR
353
354 irqs-off: 'd' interrupts are disabled. '.' otherwise.
9244489a
SR
355 Note: If the architecture does not support a way to
356 read the irq flags variable, an 'X' will always
357 be printed here.
eb6d42ea
SR
358
359 need-resched: 'N' task need_resched is set, '.' otherwise.
360
361 hardirq/softirq:
f2d9c740 362 'H' - hard irq occurred inside a softirq.
eb6d42ea
SR
363 'h' - hard irq is running
364 's' - soft irq is running
365 '.' - normal context.
366
367 preempt-depth: The level of preempt_disabled
368
369The above is mostly meaningful for kernel developers.
370
a41eebab 371 time: This differs from the trace file output. The trace file output
f2d9c740 372 includes an absolute timestamp. The timestamp used by the
a41eebab 373 latency_trace file is relative to the start of the trace.
eb6d42ea
SR
374
375 delay: This is just to help catch your eye a bit better. And
5752674e
IM
376 needs to be fixed to be only relative to the same CPU.
377 The marks are determined by the difference between this
378 current trace and the next trace.
379 '!' - greater than preempt_mark_thresh (default 100)
380 '+' - greater than 1 microsecond
381 ' ' - less than or equal to 1 microsecond.
eb6d42ea
SR
382
383 The rest is the same as the 'trace' file.
384
385
ee6bce52
SR
386trace_options
387-------------
eb6d42ea 388
5752674e
IM
389The trace_options file is used to control what gets printed in
390the trace output. To see what is available, simply cat the file:
eb6d42ea 391
ee6bce52 392 cat /debug/tracing/trace_options
eb6d42ea 393 print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
5752674e 394 noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj
eb6d42ea 395
5752674e
IM
396To disable one of the options, echo in the option prepended with
397"no".
eb6d42ea 398
ee6bce52 399 echo noprint-parent > /debug/tracing/trace_options
eb6d42ea
SR
400
401To enable an option, leave off the "no".
402
ee6bce52 403 echo sym-offset > /debug/tracing/trace_options
eb6d42ea
SR
404
405Here are the available options:
406
5752674e
IM
407 print-parent - On function traces, display the calling (parent)
408 function as well as the function being traced.
eb6d42ea
SR
409
410 print-parent:
411 bash-4000 [01] 1477.606694: simple_strtoul <-strict_strtoul
412
413 noprint-parent:
414 bash-4000 [01] 1477.606694: simple_strtoul
415
416
5752674e
IM
417 sym-offset - Display not only the function name, but also the
418 offset in the function. For example, instead of
419 seeing just "ktime_get", you will see
420 "ktime_get+0xb/0x20".
eb6d42ea
SR
421
422 sym-offset:
423 bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0
424
5752674e
IM
425 sym-addr - this will also display the function address as well
426 as the function name.
eb6d42ea
SR
427
428 sym-addr:
429 bash-4000 [01] 1477.606694: simple_strtoul <c0339346>
430
431 verbose - This deals with the latency_trace file.
432
433 bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
434 (+0.000ms): simple_strtoul (strict_strtoul)
435
5752674e
IM
436 raw - This will display raw numbers. This option is best for
437 use with user applications that can translate the raw
438 numbers better than having it done in the kernel.
eb6d42ea 439
5752674e
IM
440 hex - Similar to raw, but the numbers will be in a hexadecimal
441 format.
eb6d42ea
SR
442
443 bin - This will print out the formats in raw binary.
444
445 block - TBD (needs update)
446
5752674e
IM
447 stacktrace - This is one of the options that changes the trace
448 itself. When a trace is recorded, so is the stack
449 of functions. This allows for back traces of
450 trace sites.
eb6d42ea 451
5752674e
IM
452 userstacktrace - This option changes the trace. It records a
453 stacktrace of the current userspace thread.
02b67518 454
5752674e
IM
455 sym-userobj - when user stacktrace are enabled, look up which
456 object the address belongs to, and print a
457 relative address. This is especially useful when
458 ASLR is on, otherwise you don't get a chance to
459 resolve the address to object/file/line after
460 the app is no longer running
b54d3de9 461
5752674e
IM
462 The lookup is performed when you read
463 trace,trace_pipe,latency_trace. Example:
b54d3de9
TE
464
465 a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
466x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
467
5752674e
IM
468 sched-tree - trace all tasks that are on the runqueue, at
469 every scheduling event. Will add overhead if
470 there's a lot of tasks running at once.
eb6d42ea
SR
471
472
473sched_switch
474------------
475
f2d9c740 476This tracer simply records schedule switches. Here is an example
a41eebab 477of how to use it.
eb6d42ea
SR
478
479 # echo sched_switch > /debug/tracing/current_tracer
480 # echo 1 > /debug/tracing/tracing_enabled
481 # sleep 1
482 # echo 0 > /debug/tracing/tracing_enabled
483 # cat /debug/tracing/trace
484
485# tracer: sched_switch
486#
487# TASK-PID CPU# TIMESTAMP FUNCTION
488# | | | | |
489 bash-3997 [01] 240.132281: 3997:120:R + 4055:120:R
490 bash-3997 [01] 240.132284: 3997:120:R ==> 4055:120:R
491 sleep-4055 [01] 240.132371: 4055:120:S ==> 3997:120:R
492 bash-3997 [01] 240.132454: 3997:120:R + 4055:120:S
493 bash-3997 [01] 240.132457: 3997:120:R ==> 4055:120:R
494 sleep-4055 [01] 240.132460: 4055:120:D ==> 3997:120:R
495 bash-3997 [01] 240.132463: 3997:120:R + 4055:120:D
496 bash-3997 [01] 240.132465: 3997:120:R ==> 4055:120:R
497 <idle>-0 [00] 240.132589: 0:140:R + 4:115:S
498 <idle>-0 [00] 240.132591: 0:140:R ==> 4:115:R
499 ksoftirqd/0-4 [00] 240.132595: 4:115:S ==> 0:140:R
500 <idle>-0 [00] 240.132598: 0:140:R + 4:115:S
501 <idle>-0 [00] 240.132599: 0:140:R ==> 4:115:R
502 ksoftirqd/0-4 [00] 240.132603: 4:115:S ==> 0:140:R
503 sleep-4055 [01] 240.133058: 4055:120:S ==> 3997:120:R
504 [...]
505
506
5752674e
IM
507As we have discussed previously about this format, the header
508shows the name of the trace and points to the options. The
509"FUNCTION" is a misnomer since here it represents the wake ups
510and context switches.
eb6d42ea 511
5752674e
IM
512The sched_switch file only lists the wake ups (represented with
513'+') and context switches ('==>') with the previous task or
514current task first followed by the next task or task waking up.
515The format for both of these is PID:KERNEL-PRIO:TASK-STATE.
516Remember that the KERNEL-PRIO is the inverse of the actual
517priority with zero (0) being the highest priority and the nice
518values starting at 100 (nice -20). Below is a quick chart to map
519the kernel priority to user land priorities.
eb6d42ea 520
294ae401
GL
521 Kernel Space User Space
522 ===============================================================
523 0(high) to 98(low) user RT priority 99(high) to 1(low)
524 with SCHED_RR or SCHED_FIFO
525 ---------------------------------------------------------------
526 99 sched_priority is not used in scheduling
527 decisions(it must be specified as 0)
528 ---------------------------------------------------------------
529 100(high) to 139(low) user nice -20(high) to 19(low)
530 ---------------------------------------------------------------
531 140 idle task priority
532 ---------------------------------------------------------------
eb6d42ea
SR
533
534The task states are:
535
536 R - running : wants to run, may not actually be running
537 S - sleep : process is waiting to be woken up (handles signals)
f2d9c740
SR
538 D - disk sleep (uninterruptible sleep) : process must be woken up
539 (ignores signals)
eb6d42ea
SR
540 T - stopped : process suspended
541 t - traced : process is being traced (with something like gdb)
542 Z - zombie : process waiting to be cleaned up
543 X - unknown
544
545
546ftrace_enabled
547--------------
548
5752674e
IM
549The following tracers (listed below) give different output
550depending on whether or not the sysctl ftrace_enabled is set. To
551set ftrace_enabled, one can either use the sysctl function or
552set it via the proc file system interface.
eb6d42ea
SR
553
554 sysctl kernel.ftrace_enabled=1
555
556 or
557
558 echo 1 > /proc/sys/kernel/ftrace_enabled
559
5752674e
IM
560To disable ftrace_enabled simply replace the '1' with '0' in the
561above commands.
eb6d42ea 562
5752674e
IM
563When ftrace_enabled is set the tracers will also record the
564functions that are within the trace. The descriptions of the
565tracers will also show an example with ftrace enabled.
eb6d42ea
SR
566
567
568irqsoff
569-------
570
571When interrupts are disabled, the CPU can not react to any other
572external event (besides NMIs and SMIs). This prevents the timer
5752674e
IM
573interrupt from triggering or the mouse interrupt from letting
574the kernel know of a new mouse event. The result is a latency
575with the reaction time.
eb6d42ea 576
5752674e
IM
577The irqsoff tracer tracks the time for which interrupts are
578disabled. When a new maximum latency is hit, the tracer saves
579the trace leading up to that latency point so that every time a
580new maximum is reached, the old saved trace is discarded and the
581new trace is saved.
eb6d42ea 582
5752674e
IM
583To reset the maximum, echo 0 into tracing_max_latency. Here is
584an example:
eb6d42ea
SR
585
586 # echo irqsoff > /debug/tracing/current_tracer
587 # echo 0 > /debug/tracing/tracing_max_latency
588 # echo 1 > /debug/tracing/tracing_enabled
589 # ls -ltr
590 [...]
591 # echo 0 > /debug/tracing/tracing_enabled
592 # cat /debug/tracing/latency_trace
593# tracer: irqsoff
594#
f2d9c740 595irqsoff latency trace v1.1.5 on 2.6.26
eb6d42ea 596--------------------------------------------------------------------
f2d9c740 597 latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
eb6d42ea 598 -----------------
f2d9c740 599 | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0)
eb6d42ea 600 -----------------
f2d9c740
SR
601 => started at: sys_setpgid
602 => ended at: sys_setpgid
eb6d42ea
SR
603
604# _------=> CPU#
605# / _-----=> irqs-off
606# | / _----=> need-resched
607# || / _---=> hardirq/softirq
608# ||| / _--=> preempt-depth
609# |||| /
610# ||||| delay
611# cmd pid ||||| time | caller
612# \ / ||||| \ | /
f2d9c740
SR
613 bash-3730 1d... 0us : _write_lock_irq (sys_setpgid)
614 bash-3730 1d..1 1us+: _write_unlock_irq (sys_setpgid)
615 bash-3730 1d..2 14us : trace_hardirqs_on (sys_setpgid)
eb6d42ea 616
eb6d42ea 617
f2d9c740 618Here we see that that we had a latency of 12 microsecs (which is
5752674e
IM
619very good). The _write_lock_irq in sys_setpgid disabled
620interrupts. The difference between the 12 and the displayed
621timestamp 14us occurred because the clock was incremented
622between the time of recording the max latency and the time of
623recording the function that had that latency.
eb6d42ea 624
f2d9c740
SR
625Note the above example had ftrace_enabled not set. If we set the
626ftrace_enabled, we get a much larger output:
eb6d42ea
SR
627
628# tracer: irqsoff
629#
630irqsoff latency trace v1.1.5 on 2.6.26-rc8
631--------------------------------------------------------------------
632 latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
633 -----------------
634 | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0)
635 -----------------
636 => started at: __alloc_pages_internal
637 => ended at: __alloc_pages_internal
638
639# _------=> CPU#
640# / _-----=> irqs-off
641# | / _----=> need-resched
642# || / _---=> hardirq/softirq
643# ||| / _--=> preempt-depth
644# |||| /
645# ||||| delay
646# cmd pid ||||| time | caller
647# \ / ||||| \ | /
648 ls-4339 0...1 0us+: get_page_from_freelist (__alloc_pages_internal)
649 ls-4339 0d..1 3us : rmqueue_bulk (get_page_from_freelist)
650 ls-4339 0d..1 3us : _spin_lock (rmqueue_bulk)
651 ls-4339 0d..1 4us : add_preempt_count (_spin_lock)
652 ls-4339 0d..2 4us : __rmqueue (rmqueue_bulk)
653 ls-4339 0d..2 5us : __rmqueue_smallest (__rmqueue)
654 ls-4339 0d..2 5us : __mod_zone_page_state (__rmqueue_smallest)
655 ls-4339 0d..2 6us : __rmqueue (rmqueue_bulk)
656 ls-4339 0d..2 6us : __rmqueue_smallest (__rmqueue)
657 ls-4339 0d..2 7us : __mod_zone_page_state (__rmqueue_smallest)
658 ls-4339 0d..2 7us : __rmqueue (rmqueue_bulk)
659 ls-4339 0d..2 8us : __rmqueue_smallest (__rmqueue)
660[...]
661 ls-4339 0d..2 46us : __rmqueue_smallest (__rmqueue)
662 ls-4339 0d..2 47us : __mod_zone_page_state (__rmqueue_smallest)
663 ls-4339 0d..2 47us : __rmqueue (rmqueue_bulk)
664 ls-4339 0d..2 48us : __rmqueue_smallest (__rmqueue)
665 ls-4339 0d..2 48us : __mod_zone_page_state (__rmqueue_smallest)
666 ls-4339 0d..2 49us : _spin_unlock (rmqueue_bulk)
667 ls-4339 0d..2 49us : sub_preempt_count (_spin_unlock)
668 ls-4339 0d..1 50us : get_page_from_freelist (__alloc_pages_internal)
669 ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal)
670
671
eb6d42ea
SR
672
673Here we traced a 50 microsecond latency. But we also see all the
5752674e
IM
674functions that were called during that time. Note that by
675enabling function tracing, we incur an added overhead. This
676overhead may extend the latency times. But nevertheless, this
677trace has provided some very helpful debugging information.
eb6d42ea
SR
678
679
680preemptoff
681----------
682
5752674e
IM
683When preemption is disabled, we may be able to receive
684interrupts but the task cannot be preempted and a higher
685priority task must wait for preemption to be enabled again
686before it can preempt a lower priority task.
eb6d42ea 687
a41eebab 688The preemptoff tracer traces the places that disable preemption.
5752674e
IM
689Like the irqsoff tracer, it records the maximum latency for
690which preemption was disabled. The control of preemptoff tracer
691is much like the irqsoff tracer.
eb6d42ea
SR
692
693 # echo preemptoff > /debug/tracing/current_tracer
694 # echo 0 > /debug/tracing/tracing_max_latency
695 # echo 1 > /debug/tracing/tracing_enabled
696 # ls -ltr
697 [...]
698 # echo 0 > /debug/tracing/tracing_enabled
699 # cat /debug/tracing/latency_trace
700# tracer: preemptoff
701#
702preemptoff latency trace v1.1.5 on 2.6.26-rc8
703--------------------------------------------------------------------
704 latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
705 -----------------
706 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
707 -----------------
708 => started at: do_IRQ
709 => ended at: __do_softirq
710
711# _------=> CPU#
712# / _-----=> irqs-off
713# | / _----=> need-resched
714# || / _---=> hardirq/softirq
715# ||| / _--=> preempt-depth
716# |||| /
717# ||||| delay
718# cmd pid ||||| time | caller
719# \ / ||||| \ | /
720 sshd-4261 0d.h. 0us+: irq_enter (do_IRQ)
721 sshd-4261 0d.s. 29us : _local_bh_enable (__do_softirq)
722 sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq)
723
724
5752674e
IM
725This has some more changes. Preemption was disabled when an
726interrupt came in (notice the 'h'), and was enabled while doing
727a softirq. (notice the 's'). But we also see that interrupts
728have been disabled when entering the preempt off section and
729leaving it (the 'd'). We do not know if interrupts were enabled
730in the mean time.
eb6d42ea
SR
731
732# tracer: preemptoff
733#
734preemptoff latency trace v1.1.5 on 2.6.26-rc8
735--------------------------------------------------------------------
736 latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
737 -----------------
738 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
739 -----------------
740 => started at: remove_wait_queue
741 => ended at: __do_softirq
742
743# _------=> CPU#
744# / _-----=> irqs-off
745# | / _----=> need-resched
746# || / _---=> hardirq/softirq
747# ||| / _--=> preempt-depth
748# |||| /
749# ||||| delay
750# cmd pid ||||| time | caller
751# \ / ||||| \ | /
752 sshd-4261 0d..1 0us : _spin_lock_irqsave (remove_wait_queue)
753 sshd-4261 0d..1 1us : _spin_unlock_irqrestore (remove_wait_queue)
754 sshd-4261 0d..1 2us : do_IRQ (common_interrupt)
755 sshd-4261 0d..1 2us : irq_enter (do_IRQ)
756 sshd-4261 0d..1 2us : idle_cpu (irq_enter)
757 sshd-4261 0d..1 3us : add_preempt_count (irq_enter)
758 sshd-4261 0d.h1 3us : idle_cpu (irq_enter)
759 sshd-4261 0d.h. 4us : handle_fasteoi_irq (do_IRQ)
760[...]
761 sshd-4261 0d.h. 12us : add_preempt_count (_spin_lock)
762 sshd-4261 0d.h1 12us : ack_ioapic_quirk_irq (handle_fasteoi_irq)
763 sshd-4261 0d.h1 13us : move_native_irq (ack_ioapic_quirk_irq)
764 sshd-4261 0d.h1 13us : _spin_unlock (handle_fasteoi_irq)
765 sshd-4261 0d.h1 14us : sub_preempt_count (_spin_unlock)
766 sshd-4261 0d.h1 14us : irq_exit (do_IRQ)
767 sshd-4261 0d.h1 15us : sub_preempt_count (irq_exit)
768 sshd-4261 0d..2 15us : do_softirq (irq_exit)
769 sshd-4261 0d... 15us : __do_softirq (do_softirq)
770 sshd-4261 0d... 16us : __local_bh_disable (__do_softirq)
771 sshd-4261 0d... 16us+: add_preempt_count (__local_bh_disable)
772 sshd-4261 0d.s4 20us : add_preempt_count (__local_bh_disable)
773 sshd-4261 0d.s4 21us : sub_preempt_count (local_bh_enable)
774 sshd-4261 0d.s5 21us : sub_preempt_count (local_bh_enable)
775[...]
776 sshd-4261 0d.s6 41us : add_preempt_count (__local_bh_disable)
777 sshd-4261 0d.s6 42us : sub_preempt_count (local_bh_enable)
778 sshd-4261 0d.s7 42us : sub_preempt_count (local_bh_enable)
779 sshd-4261 0d.s5 43us : add_preempt_count (__local_bh_disable)
780 sshd-4261 0d.s5 43us : sub_preempt_count (local_bh_enable_ip)
781 sshd-4261 0d.s6 44us : sub_preempt_count (local_bh_enable_ip)
782 sshd-4261 0d.s5 44us : add_preempt_count (__local_bh_disable)
783 sshd-4261 0d.s5 45us : sub_preempt_count (local_bh_enable)
784[...]
785 sshd-4261 0d.s. 63us : _local_bh_enable (__do_softirq)
786 sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq)
787
788
5752674e
IM
789The above is an example of the preemptoff trace with
790ftrace_enabled set. Here we see that interrupts were disabled
791the entire time. The irq_enter code lets us know that we entered
792an interrupt 'h'. Before that, the functions being traced still
793show that it is not in an interrupt, but we can see from the
794functions themselves that this is not the case.
eb6d42ea 795
5752674e
IM
796Notice that __do_softirq when called does not have a
797preempt_count. It may seem that we missed a preempt enabling.
798What really happened is that the preempt count is held on the
799thread's stack and we switched to the softirq stack (4K stacks
800in effect). The code does not copy the preempt count, but
801because interrupts are disabled, we do not need to worry about
802it. Having a tracer like this is good for letting people know
803what really happens inside the kernel.
eb6d42ea
SR
804
805
806preemptirqsoff
807--------------
808
5752674e
IM
809Knowing the locations that have interrupts disabled or
810preemption disabled for the longest times is helpful. But
811sometimes we would like to know when either preemption and/or
812interrupts are disabled.
eb6d42ea 813
f2d9c740 814Consider the following code:
eb6d42ea
SR
815
816 local_irq_disable();
817 call_function_with_irqs_off();
818 preempt_disable();
819 call_function_with_irqs_and_preemption_off();
820 local_irq_enable();
821 call_function_with_preemption_off();
822 preempt_enable();
823
824The irqsoff tracer will record the total length of
825call_function_with_irqs_off() and
826call_function_with_irqs_and_preemption_off().
827
828The preemptoff tracer will record the total length of
829call_function_with_irqs_and_preemption_off() and
830call_function_with_preemption_off().
831
5752674e
IM
832But neither will trace the time that interrupts and/or
833preemption is disabled. This total time is the time that we can
834not schedule. To record this time, use the preemptirqsoff
835tracer.
eb6d42ea 836
5752674e
IM
837Again, using this trace is much like the irqsoff and preemptoff
838tracers.
eb6d42ea 839
a41eebab 840 # echo preemptirqsoff > /debug/tracing/current_tracer
eb6d42ea
SR
841 # echo 0 > /debug/tracing/tracing_max_latency
842 # echo 1 > /debug/tracing/tracing_enabled
843 # ls -ltr
844 [...]
845 # echo 0 > /debug/tracing/tracing_enabled
846 # cat /debug/tracing/latency_trace
847# tracer: preemptirqsoff
848#
849preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
850--------------------------------------------------------------------
851 latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
852 -----------------
853 | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0)
854 -----------------
855 => started at: apic_timer_interrupt
856 => ended at: __do_softirq
857
858# _------=> CPU#
859# / _-----=> irqs-off
860# | / _----=> need-resched
861# || / _---=> hardirq/softirq
862# ||| / _--=> preempt-depth
863# |||| /
864# ||||| delay
865# cmd pid ||||| time | caller
866# \ / ||||| \ | /
867 ls-4860 0d... 0us!: trace_hardirqs_off_thunk (apic_timer_interrupt)
868 ls-4860 0d.s. 294us : _local_bh_enable (__do_softirq)
869 ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq)
870
871
eb6d42ea
SR
872
873The trace_hardirqs_off_thunk is called from assembly on x86 when
5752674e
IM
874interrupts are disabled in the assembly code. Without the
875function tracing, we do not know if interrupts were enabled
876within the preemption points. We do see that it started with
877preemption enabled.
eb6d42ea
SR
878
879Here is a trace with ftrace_enabled set:
880
881
882# tracer: preemptirqsoff
883#
884preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
885--------------------------------------------------------------------
886 latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
887 -----------------
888 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
889 -----------------
890 => started at: write_chan
891 => ended at: __do_softirq
892
893# _------=> CPU#
894# / _-----=> irqs-off
895# | / _----=> need-resched
896# || / _---=> hardirq/softirq
897# ||| / _--=> preempt-depth
898# |||| /
899# ||||| delay
900# cmd pid ||||| time | caller
901# \ / ||||| \ | /
902 ls-4473 0.N.. 0us : preempt_schedule (write_chan)
903 ls-4473 0dN.1 1us : _spin_lock (schedule)
904 ls-4473 0dN.1 2us : add_preempt_count (_spin_lock)
905 ls-4473 0d..2 2us : put_prev_task_fair (schedule)
906[...]
907 ls-4473 0d..2 13us : set_normalized_timespec (ktime_get_ts)
908 ls-4473 0d..2 13us : __switch_to (schedule)
909 sshd-4261 0d..2 14us : finish_task_switch (schedule)
910 sshd-4261 0d..2 14us : _spin_unlock_irq (finish_task_switch)
911 sshd-4261 0d..1 15us : add_preempt_count (_spin_lock_irqsave)
912 sshd-4261 0d..2 16us : _spin_unlock_irqrestore (hrtick_set)
913 sshd-4261 0d..2 16us : do_IRQ (common_interrupt)
914 sshd-4261 0d..2 17us : irq_enter (do_IRQ)
915 sshd-4261 0d..2 17us : idle_cpu (irq_enter)
916 sshd-4261 0d..2 18us : add_preempt_count (irq_enter)
917 sshd-4261 0d.h2 18us : idle_cpu (irq_enter)
918 sshd-4261 0d.h. 18us : handle_fasteoi_irq (do_IRQ)
919 sshd-4261 0d.h. 19us : _spin_lock (handle_fasteoi_irq)
920 sshd-4261 0d.h. 19us : add_preempt_count (_spin_lock)
921 sshd-4261 0d.h1 20us : _spin_unlock (handle_fasteoi_irq)
922 sshd-4261 0d.h1 20us : sub_preempt_count (_spin_unlock)
923[...]
924 sshd-4261 0d.h1 28us : _spin_unlock (handle_fasteoi_irq)
925 sshd-4261 0d.h1 29us : sub_preempt_count (_spin_unlock)
926 sshd-4261 0d.h2 29us : irq_exit (do_IRQ)
927 sshd-4261 0d.h2 29us : sub_preempt_count (irq_exit)
928 sshd-4261 0d..3 30us : do_softirq (irq_exit)
929 sshd-4261 0d... 30us : __do_softirq (do_softirq)
930 sshd-4261 0d... 31us : __local_bh_disable (__do_softirq)
931 sshd-4261 0d... 31us+: add_preempt_count (__local_bh_disable)
932 sshd-4261 0d.s4 34us : add_preempt_count (__local_bh_disable)
933[...]
934 sshd-4261 0d.s3 43us : sub_preempt_count (local_bh_enable_ip)
935 sshd-4261 0d.s4 44us : sub_preempt_count (local_bh_enable_ip)
936 sshd-4261 0d.s3 44us : smp_apic_timer_interrupt (apic_timer_interrupt)
937 sshd-4261 0d.s3 45us : irq_enter (smp_apic_timer_interrupt)
938 sshd-4261 0d.s3 45us : idle_cpu (irq_enter)
939 sshd-4261 0d.s3 46us : add_preempt_count (irq_enter)
940 sshd-4261 0d.H3 46us : idle_cpu (irq_enter)
941 sshd-4261 0d.H3 47us : hrtimer_interrupt (smp_apic_timer_interrupt)
942 sshd-4261 0d.H3 47us : ktime_get (hrtimer_interrupt)
943[...]
944 sshd-4261 0d.H3 81us : tick_program_event (hrtimer_interrupt)
945 sshd-4261 0d.H3 82us : ktime_get (tick_program_event)
946 sshd-4261 0d.H3 82us : ktime_get_ts (ktime_get)
947 sshd-4261 0d.H3 83us : getnstimeofday (ktime_get_ts)
948 sshd-4261 0d.H3 83us : set_normalized_timespec (ktime_get_ts)
949 sshd-4261 0d.H3 84us : clockevents_program_event (tick_program_event)
950 sshd-4261 0d.H3 84us : lapic_next_event (clockevents_program_event)
951 sshd-4261 0d.H3 85us : irq_exit (smp_apic_timer_interrupt)
952 sshd-4261 0d.H3 85us : sub_preempt_count (irq_exit)
953 sshd-4261 0d.s4 86us : sub_preempt_count (irq_exit)
954 sshd-4261 0d.s3 86us : add_preempt_count (__local_bh_disable)
955[...]
956 sshd-4261 0d.s1 98us : sub_preempt_count (net_rx_action)
957 sshd-4261 0d.s. 99us : add_preempt_count (_spin_lock_irq)
958 sshd-4261 0d.s1 99us+: _spin_unlock_irq (run_timer_softirq)
959 sshd-4261 0d.s. 104us : _local_bh_enable (__do_softirq)
960 sshd-4261 0d.s. 104us : sub_preempt_count (_local_bh_enable)
961 sshd-4261 0d.s. 105us : _local_bh_enable (__do_softirq)
962 sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq)
963
964
5752674e
IM
965This is a very interesting trace. It started with the preemption
966of the ls task. We see that the task had the "need_resched" bit
967set via the 'N' in the trace. Interrupts were disabled before
968the spin_lock at the beginning of the trace. We see that a
969schedule took place to run sshd. When the interrupts were
970enabled, we took an interrupt. On return from the interrupt
971handler, the softirq ran. We took another interrupt while
972running the softirq as we see from the capital 'H'.
eb6d42ea
SR
973
974
975wakeup
976------
977
5752674e
IM
978In a Real-Time environment it is very important to know the
979wakeup time it takes for the highest priority task that is woken
980up to the time that it executes. This is also known as "schedule
981latency". I stress the point that this is about RT tasks. It is
982also important to know the scheduling latency of non-RT tasks,
983but the average schedule latency is better for non-RT tasks.
984Tools like LatencyTop are more appropriate for such
985measurements.
eb6d42ea 986
a41eebab 987Real-Time environments are interested in the worst case latency.
5752674e
IM
988That is the longest latency it takes for something to happen,
989and not the average. We can have a very fast scheduler that may
990only have a large latency once in a while, but that would not
991work well with Real-Time tasks. The wakeup tracer was designed
992to record the worst case wakeups of RT tasks. Non-RT tasks are
993not recorded because the tracer only records one worst case and
994tracing non-RT tasks that are unpredictable will overwrite the
995worst case latency of RT tasks.
996
997Since this tracer only deals with RT tasks, we will run this
998slightly differently than we did with the previous tracers.
999Instead of performing an 'ls', we will run 'sleep 1' under
1000'chrt' which changes the priority of the task.
eb6d42ea
SR
1001
1002 # echo wakeup > /debug/tracing/current_tracer
1003 # echo 0 > /debug/tracing/tracing_max_latency
1004 # echo 1 > /debug/tracing/tracing_enabled
1005 # chrt -f 5 sleep 1
1006 # echo 0 > /debug/tracing/tracing_enabled
1007 # cat /debug/tracing/latency_trace
1008# tracer: wakeup
1009#
1010wakeup latency trace v1.1.5 on 2.6.26-rc8
1011--------------------------------------------------------------------
1012 latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
1013 -----------------
1014 | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5)
1015 -----------------
1016
1017# _------=> CPU#
1018# / _-----=> irqs-off
1019# | / _----=> need-resched
1020# || / _---=> hardirq/softirq
1021# ||| / _--=> preempt-depth
1022# |||| /
1023# ||||| delay
1024# cmd pid ||||| time | caller
1025# \ / ||||| \ | /
1026 <idle>-0 1d.h4 0us+: try_to_wake_up (wake_up_process)
1027 <idle>-0 1d..4 4us : schedule (cpu_idle)
1028
1029
5752674e
IM
1030Running this on an idle system, we see that it only took 4
1031microseconds to perform the task switch. Note, since the trace
1032marker in the schedule is before the actual "switch", we stop
1033the tracing when the recorded task is about to schedule in. This
1034may change if we add a new marker at the end of the scheduler.
eb6d42ea 1035
5752674e
IM
1036Notice that the recorded task is 'sleep' with the PID of 4901
1037and it has an rt_prio of 5. This priority is user-space priority
1038and not the internal kernel priority. The policy is 1 for
1039SCHED_FIFO and 2 for SCHED_RR.
eb6d42ea
SR
1040
1041Doing the same with chrt -r 5 and ftrace_enabled set.
1042
1043# tracer: wakeup
1044#
1045wakeup latency trace v1.1.5 on 2.6.26-rc8
1046--------------------------------------------------------------------
1047 latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
1048 -----------------
1049 | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5)
1050 -----------------
1051
1052# _------=> CPU#
1053# / _-----=> irqs-off
1054# | / _----=> need-resched
1055# || / _---=> hardirq/softirq
1056# ||| / _--=> preempt-depth
1057# |||| /
1058# ||||| delay
1059# cmd pid ||||| time | caller
1060# \ / ||||| \ | /
1061ksoftirq-7 1d.H3 0us : try_to_wake_up (wake_up_process)
1062ksoftirq-7 1d.H4 1us : sub_preempt_count (marker_probe_cb)
1063ksoftirq-7 1d.H3 2us : check_preempt_wakeup (try_to_wake_up)
1064ksoftirq-7 1d.H3 3us : update_curr (check_preempt_wakeup)
1065ksoftirq-7 1d.H3 4us : calc_delta_mine (update_curr)
1066ksoftirq-7 1d.H3 5us : __resched_task (check_preempt_wakeup)
1067ksoftirq-7 1d.H3 6us : task_wake_up_rt (try_to_wake_up)
1068ksoftirq-7 1d.H3 7us : _spin_unlock_irqrestore (try_to_wake_up)
1069[...]
1070ksoftirq-7 1d.H2 17us : irq_exit (smp_apic_timer_interrupt)
1071ksoftirq-7 1d.H2 18us : sub_preempt_count (irq_exit)
1072ksoftirq-7 1d.s3 19us : sub_preempt_count (irq_exit)
1073ksoftirq-7 1..s2 20us : rcu_process_callbacks (__do_softirq)
1074[...]
1075ksoftirq-7 1..s2 26us : __rcu_process_callbacks (rcu_process_callbacks)
1076ksoftirq-7 1d.s2 27us : _local_bh_enable (__do_softirq)
1077ksoftirq-7 1d.s2 28us : sub_preempt_count (_local_bh_enable)
1078ksoftirq-7 1.N.3 29us : sub_preempt_count (ksoftirqd)
1079ksoftirq-7 1.N.2 30us : _cond_resched (ksoftirqd)
1080ksoftirq-7 1.N.2 31us : __cond_resched (_cond_resched)
1081ksoftirq-7 1.N.2 32us : add_preempt_count (__cond_resched)
1082ksoftirq-7 1.N.2 33us : schedule (__cond_resched)
1083ksoftirq-7 1.N.2 33us : add_preempt_count (schedule)
1084ksoftirq-7 1.N.3 34us : hrtick_clear (schedule)
1085ksoftirq-7 1dN.3 35us : _spin_lock (schedule)
1086ksoftirq-7 1dN.3 36us : add_preempt_count (_spin_lock)
1087ksoftirq-7 1d..4 37us : put_prev_task_fair (schedule)
1088ksoftirq-7 1d..4 38us : update_curr (put_prev_task_fair)
1089[...]
1090ksoftirq-7 1d..5 47us : _spin_trylock (tracing_record_cmdline)
1091ksoftirq-7 1d..5 48us : add_preempt_count (_spin_trylock)
1092ksoftirq-7 1d..6 49us : _spin_unlock (tracing_record_cmdline)
1093ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock)
1094ksoftirq-7 1d..4 50us : schedule (__cond_resched)
1095
5752674e
IM
1096The interrupt went off while running ksoftirqd. This task runs
1097at SCHED_OTHER. Why did not we see the 'N' set early? This may
1098be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K
1099stacks configured, the interrupt and softirq run with their own
1100stack. Some information is held on the top of the task's stack
1101(need_resched and preempt_count are both stored there). The
1102setting of the NEED_RESCHED bit is done directly to the task's
1103stack, but the reading of the NEED_RESCHED is done by looking at
1104the current stack, which in this case is the stack for the hard
1105interrupt. This hides the fact that NEED_RESCHED has been set.
1106We do not see the 'N' until we switch back to the task's
a41eebab 1107assigned stack.
eb6d42ea 1108
9b803c0f
SR
1109function
1110--------
eb6d42ea 1111
9b803c0f 1112This tracer is the function tracer. Enabling the function tracer
5752674e
IM
1113can be done from the debug file system. Make sure the
1114ftrace_enabled is set; otherwise this tracer is a nop.
eb6d42ea
SR
1115
1116 # sysctl kernel.ftrace_enabled=1
9b803c0f 1117 # echo function > /debug/tracing/current_tracer
eb6d42ea
SR
1118 # echo 1 > /debug/tracing/tracing_enabled
1119 # usleep 1
1120 # echo 0 > /debug/tracing/tracing_enabled
1121 # cat /debug/tracing/trace
9b803c0f 1122# tracer: function
eb6d42ea
SR
1123#
1124# TASK-PID CPU# TIMESTAMP FUNCTION
1125# | | | | |
1126 bash-4003 [00] 123.638713: finish_task_switch <-schedule
1127 bash-4003 [00] 123.638714: _spin_unlock_irq <-finish_task_switch
1128 bash-4003 [00] 123.638714: sub_preempt_count <-_spin_unlock_irq
1129 bash-4003 [00] 123.638715: hrtick_set <-schedule
1130 bash-4003 [00] 123.638715: _spin_lock_irqsave <-hrtick_set
1131 bash-4003 [00] 123.638716: add_preempt_count <-_spin_lock_irqsave
1132 bash-4003 [00] 123.638716: _spin_unlock_irqrestore <-hrtick_set
1133 bash-4003 [00] 123.638717: sub_preempt_count <-_spin_unlock_irqrestore
1134 bash-4003 [00] 123.638717: hrtick_clear <-hrtick_set
1135 bash-4003 [00] 123.638718: sub_preempt_count <-schedule
1136 bash-4003 [00] 123.638718: sub_preempt_count <-preempt_schedule
1137 bash-4003 [00] 123.638719: wait_for_completion <-__stop_machine_run
1138 bash-4003 [00] 123.638719: wait_for_common <-wait_for_completion
1139 bash-4003 [00] 123.638720: _spin_lock_irq <-wait_for_common
1140 bash-4003 [00] 123.638720: add_preempt_count <-_spin_lock_irq
1141[...]
1142
1143
5752674e
IM
1144Note: function tracer uses ring buffers to store the above
1145entries. The newest data may overwrite the oldest data.
1146Sometimes using echo to stop the trace is not sufficient because
1147the tracing could have overwritten the data that you wanted to
1148record. For this reason, it is sometimes better to disable
1149tracing directly from a program. This allows you to stop the
1150tracing at the point that you hit the part that you are
1151interested in. To disable the tracing directly from a C program,
1152something like following code snippet can be used:
eb6d42ea
SR
1153
1154int trace_fd;
1155[...]
1156int main(int argc, char *argv[]) {
1157 [...]
1158 trace_fd = open("/debug/tracing/tracing_enabled", O_WRONLY);
1159 [...]
1160 if (condition_hit()) {
f2d9c740 1161 write(trace_fd, "0", 1);
eb6d42ea
SR
1162 }
1163 [...]
1164}
1165
f2d9c740 1166Note: Here we hard coded the path name. The debugfs mount is not
5752674e
IM
1167guaranteed to be at /debug (and is more commonly at
1168/sys/kernel/debug). For simple one time traces, the above is
1169sufficent. For anything else, a search through /proc/mounts may
1170be needed to find where the debugfs file-system is mounted.
eb6d42ea 1171
df4fc315
SR
1172
1173Single thread tracing
1174---------------------
1175
1176By writing into /debug/tracing/set_ftrace_pid you can trace a
1177single thread. For example:
1178
1179# cat /debug/tracing/set_ftrace_pid
1180no pid
1181# echo 3111 > /debug/tracing/set_ftrace_pid
1182# cat /debug/tracing/set_ftrace_pid
11833111
1184# echo function > /debug/tracing/current_tracer
1185# cat /debug/tracing/trace | head
1186 # tracer: function
1187 #
1188 # TASK-PID CPU# TIMESTAMP FUNCTION
1189 # | | | | |
1190 yum-updatesd-3111 [003] 1637.254676: finish_task_switch <-thread_return
1191 yum-updatesd-3111 [003] 1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
1192 yum-updatesd-3111 [003] 1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
1193 yum-updatesd-3111 [003] 1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
1194 yum-updatesd-3111 [003] 1637.254685: fget_light <-do_sys_poll
1195 yum-updatesd-3111 [003] 1637.254686: pipe_poll <-do_sys_poll
1196# echo -1 > /debug/tracing/set_ftrace_pid
1197# cat /debug/tracing/trace |head
1198 # tracer: function
1199 #
1200 # TASK-PID CPU# TIMESTAMP FUNCTION
1201 # | | | | |
1202 ##### CPU 3 buffer started ####
1203 yum-updatesd-3111 [003] 1701.957688: free_poll_entry <-poll_freewait
1204 yum-updatesd-3111 [003] 1701.957689: remove_wait_queue <-free_poll_entry
1205 yum-updatesd-3111 [003] 1701.957691: fput <-free_poll_entry
1206 yum-updatesd-3111 [003] 1701.957692: audit_syscall_exit <-sysret_audit
1207 yum-updatesd-3111 [003] 1701.957693: path_put <-audit_syscall_exit
1208
1209If you want to trace a function when executing, you could use
1210something like this simple program:
1211
1212#include <stdio.h>
1213#include <stdlib.h>
1214#include <sys/types.h>
1215#include <sys/stat.h>
1216#include <fcntl.h>
1217#include <unistd.h>
1218
1219int main (int argc, char **argv)
1220{
1221 if (argc < 1)
1222 exit(-1);
1223
1224 if (fork() > 0) {
1225 int fd, ffd;
1226 char line[64];
1227 int s;
1228
1229 ffd = open("/debug/tracing/current_tracer", O_WRONLY);
1230 if (ffd < 0)
1231 exit(-1);
1232 write(ffd, "nop", 3);
1233
1234 fd = open("/debug/tracing/set_ftrace_pid", O_WRONLY);
1235 s = sprintf(line, "%d\n", getpid());
1236 write(fd, line, s);
1237
1238 write(ffd, "function", 8);
1239
1240 close(fd);
1241 close(ffd);
1242
1243 execvp(argv[1], argv+1);
1244 }
1245
1246 return 0;
1247}
1248
e2ea5399
MM
1249
1250hw-branch-tracer (x86 only)
1251---------------------------
1252
1253This tracer uses the x86 last branch tracing hardware feature to
1254collect a branch trace on all cpus with relatively low overhead.
1255
1256The tracer uses a fixed-size circular buffer per cpu and only
1257traces ring 0 branches. The trace file dumps that buffer in the
1258following format:
1259
1260# tracer: hw-branch-tracer
1261#
1262# CPU# TO <- FROM
1263 0 scheduler_tick+0xb5/0x1bf <- task_tick_idle+0x5/0x6
1264 2 run_posix_cpu_timers+0x2b/0x72a <- run_posix_cpu_timers+0x25/0x72a
1265 0 scheduler_tick+0x139/0x1bf <- scheduler_tick+0xed/0x1bf
1266 0 scheduler_tick+0x17c/0x1bf <- scheduler_tick+0x148/0x1bf
1267 2 run_posix_cpu_timers+0x9e/0x72a <- run_posix_cpu_timers+0x5e/0x72a
1268 0 scheduler_tick+0x1b6/0x1bf <- scheduler_tick+0x1aa/0x1bf
1269
1270
5752674e
IM
1271The tracer may be used to dump the trace for the oops'ing cpu on
1272a kernel oops into the system log. To enable this,
1273ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one
1274can either use the sysctl function or set it via the proc system
1275interface.
e2ea5399
MM
1276
1277 sysctl kernel.ftrace_dump_on_oops=1
1278
1279or
1280
1281 echo 1 > /proc/sys/kernel/ftrace_dump_on_oops
1282
1283
5752674e
IM
1284Here's an example of such a dump after a null pointer
1285dereference in a kernel module:
e2ea5399
MM
1286
1287[57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
1288[57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops]
1289[57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0
1290[57848.106019] Oops: 0002 [#1] SMP
1291[57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus
1292[57848.106019] Dumping ftrace buffer:
1293[57848.106019] ---------------------------------
1294[...]
1295[57848.106019] 0 chrdev_open+0xe6/0x165 <- cdev_put+0x23/0x24
1296[57848.106019] 0 chrdev_open+0x117/0x165 <- chrdev_open+0xfa/0x165
1297[57848.106019] 0 chrdev_open+0x120/0x165 <- chrdev_open+0x11c/0x165
1298[57848.106019] 0 chrdev_open+0x134/0x165 <- chrdev_open+0x12b/0x165
1299[57848.106019] 0 open+0x0/0x14 [oops] <- chrdev_open+0x144/0x165
1300[57848.106019] 0 page_fault+0x0/0x30 <- open+0x6/0x14 [oops]
1301[57848.106019] 0 error_entry+0x0/0x5b <- page_fault+0x4/0x30
1302[57848.106019] 0 error_kernelspace+0x0/0x31 <- error_entry+0x59/0x5b
1303[57848.106019] 0 error_sti+0x0/0x1 <- error_kernelspace+0x2d/0x31
1304[57848.106019] 0 page_fault+0x9/0x30 <- error_sti+0x0/0x1
1305[57848.106019] 0 do_page_fault+0x0/0x881 <- page_fault+0x1a/0x30
1306[...]
1307[57848.106019] 0 do_page_fault+0x66b/0x881 <- is_prefetch+0x1ee/0x1f2
1308[57848.106019] 0 do_page_fault+0x6e0/0x881 <- do_page_fault+0x67a/0x881
1309[57848.106019] 0 oops_begin+0x0/0x96 <- do_page_fault+0x6e0/0x881
1310[57848.106019] 0 trace_hw_branch_oops+0x0/0x2d <- oops_begin+0x9/0x96
1311[...]
1312[57848.106019] 0 ds_suspend_bts+0x2a/0xe3 <- ds_suspend_bts+0x1a/0xe3
1313[57848.106019] ---------------------------------
1314[57848.106019] CPU 0
1315[57848.106019] Modules linked in: oops
1316[57848.106019] Pid: 5542, comm: cat Tainted: G W 2.6.28 #23
1317[57848.106019] RIP: 0010:[<ffffffffa0000006>] [<ffffffffa0000006>] open+0x6/0x14 [oops]
1318[57848.106019] RSP: 0018:ffff880235457d48 EFLAGS: 00010246
1319[...]
1320
1321
985ec20a
FW
1322function graph tracer
1323---------------------------
1324
5752674e
IM
1325This tracer is similar to the function tracer except that it
1326probes a function on its entry and its exit. This is done by
1327using a dynamically allocated stack of return addresses in each
1328task_struct. On function entry the tracer overwrites the return
1329address of each function traced to set a custom probe. Thus the
1330original return address is stored on the stack of return address
1331in the task_struct.
985ec20a 1332
5752674e
IM
1333Probing on both ends of a function leads to special features
1334such as:
985ec20a 1335
5752674e
IM
1336- measure of a function's time execution
1337- having a reliable call stack to draw function calls graph
985ec20a
FW
1338
1339This tracer is useful in several situations:
1340
5752674e
IM
1341- you want to find the reason of a strange kernel behavior and
1342 need to see what happens in detail on any areas (or specific
1343 ones).
1344
1345- you are experiencing weird latencies but it's difficult to
1346 find its origin.
1347
1348- you want to find quickly which path is taken by a specific
1349 function
1350
1351- you just want to peek inside a working kernel and want to see
1352 what happens there.
985ec20a
FW
1353
1354# tracer: function_graph
1355#
1356# CPU DURATION FUNCTION CALLS
1357# | | | | | | |
1358
1359 0) | sys_open() {
1360 0) | do_sys_open() {
1361 0) | getname() {
1362 0) | kmem_cache_alloc() {
1363 0) 1.382 us | __might_sleep();
1364 0) 2.478 us | }
1365 0) | strncpy_from_user() {
1366 0) | might_fault() {
1367 0) 1.389 us | __might_sleep();
1368 0) 2.553 us | }
1369 0) 3.807 us | }
1370 0) 7.876 us | }
1371 0) | alloc_fd() {
1372 0) 0.668 us | _spin_lock();
1373 0) 0.570 us | expand_files();
1374 0) 0.586 us | _spin_unlock();
1375
1376
5752674e
IM
1377There are several columns that can be dynamically
1378enabled/disabled. You can use every combination of options you
1379want, depending on your needs.
985ec20a 1380
5752674e
IM
1381- The cpu number on which the function executed is default
1382 enabled. It is sometimes better to only trace one cpu (see
1383 tracing_cpu_mask file) or you might sometimes see unordered
1384 function calls while cpu tracing switch.
985ec20a
FW
1385
1386 hide: echo nofuncgraph-cpu > /debug/tracing/trace_options
1387 show: echo funcgraph-cpu > /debug/tracing/trace_options
1388
5752674e
IM
1389- The duration (function's time of execution) is displayed on
1390 the closing bracket line of a function or on the same line
1391 than the current function in case of a leaf one. It is default
1392 enabled.
985ec20a
FW
1393
1394 hide: echo nofuncgraph-duration > /debug/tracing/trace_options
1395 show: echo funcgraph-duration > /debug/tracing/trace_options
1396
5752674e
IM
1397- The overhead field precedes the duration field in case of
1398 reached duration thresholds.
985ec20a
FW
1399
1400 hide: echo nofuncgraph-overhead > /debug/tracing/trace_options
1401 show: echo funcgraph-overhead > /debug/tracing/trace_options
1402 depends on: funcgraph-duration
1403
1404 ie:
1405
1406 0) | up_write() {
1407 0) 0.646 us | _spin_lock_irqsave();
1408 0) 0.684 us | _spin_unlock_irqrestore();
1409 0) 3.123 us | }
1410 0) 0.548 us | fput();
1411 0) + 58.628 us | }
1412
1413 [...]
1414
1415 0) | putname() {
1416 0) | kmem_cache_free() {
1417 0) 0.518 us | __phys_addr();
1418 0) 1.757 us | }
1419 0) 2.861 us | }
1420 0) ! 115.305 us | }
1421 0) ! 116.402 us | }
1422
1423 + means that the function exceeded 10 usecs.
1424 ! means that the function exceeded 100 usecs.
1425
1426
5752674e
IM
1427- The task/pid field displays the thread cmdline and pid which
1428 executed the function. It is default disabled.
985ec20a
FW
1429
1430 hide: echo nofuncgraph-proc > /debug/tracing/trace_options
1431 show: echo funcgraph-proc > /debug/tracing/trace_options
1432
1433 ie:
1434
1435 # tracer: function_graph
1436 #
1437 # CPU TASK/PID DURATION FUNCTION CALLS
1438 # | | | | | | | | |
1439 0) sh-4802 | | d_free() {
1440 0) sh-4802 | | call_rcu() {
1441 0) sh-4802 | | __call_rcu() {
1442 0) sh-4802 | 0.616 us | rcu_process_gp_end();
1443 0) sh-4802 | 0.586 us | check_for_new_grace_period();
1444 0) sh-4802 | 2.899 us | }
1445 0) sh-4802 | 4.040 us | }
1446 0) sh-4802 | 5.151 us | }
1447 0) sh-4802 | + 49.370 us | }
1448
1449
5752674e
IM
1450- The absolute time field is an absolute timestamp given by the
1451 system clock since it started. A snapshot of this time is
1452 given on each entry/exit of functions
985ec20a
FW
1453
1454 hide: echo nofuncgraph-abstime > /debug/tracing/trace_options
1455 show: echo funcgraph-abstime > /debug/tracing/trace_options
1456
1457 ie:
1458
1459 #
1460 # TIME CPU DURATION FUNCTION CALLS
1461 # | | | | | | | |
1462 360.774522 | 1) 0.541 us | }
1463 360.774522 | 1) 4.663 us | }
1464 360.774523 | 1) 0.541 us | __wake_up_bit();
1465 360.774524 | 1) 6.796 us | }
1466 360.774524 | 1) 7.952 us | }
1467 360.774525 | 1) 9.063 us | }
1468 360.774525 | 1) 0.615 us | journal_mark_dirty();
1469 360.774527 | 1) 0.578 us | __brelse();
1470 360.774528 | 1) | reiserfs_prepare_for_journal() {
1471 360.774528 | 1) | unlock_buffer() {
1472 360.774529 | 1) | wake_up_bit() {
1473 360.774529 | 1) | bit_waitqueue() {
1474 360.774530 | 1) 0.594 us | __phys_addr();
1475
1476
5752674e 1477You can put some comments on specific functions by using
5e1607a0 1478trace_printk() For example, if you want to put a comment inside
5752674e 1479the __might_sleep() function, you just have to include
5e1607a0 1480<linux/ftrace.h> and call trace_printk() inside __might_sleep()
985ec20a 1481
5e1607a0 1482trace_printk("I'm a comment!\n")
985ec20a
FW
1483
1484will produce:
1485
1486 1) | __might_sleep() {
1487 1) | /* I'm a comment! */
1488 1) 1.449 us | }
1489
1490
5752674e
IM
1491You might find other useful features for this tracer in the
1492following "dynamic ftrace" section such as tracing only specific
1493functions or tasks.
985ec20a 1494
eb6d42ea
SR
1495dynamic ftrace
1496--------------
1497
f2d9c740 1498If CONFIG_DYNAMIC_FTRACE is set, the system will run with
eb6d42ea
SR
1499virtually no overhead when function tracing is disabled. The way
1500this works is the mcount function call (placed at the start of
5752674e
IM
1501every kernel function, produced by the -pg switch in gcc),
1502starts of pointing to a simple return. (Enabling FTRACE will
1503include the -pg switch in the compiling of the kernel.)
eb6d42ea 1504
9b803c0f
SR
1505At compile time every C file object is run through the
1506recordmcount.pl script (located in the scripts directory). This
1507script will process the C object using objdump to find all the
5752674e
IM
1508locations in the .text section that call mcount. (Note, only the
1509.text section is processed, since processing other sections like
1510.init.text may cause races due to those sections being freed).
9b803c0f 1511
5752674e
IM
1512A new section called "__mcount_loc" is created that holds
1513references to all the mcount call sites in the .text section.
1514This section is compiled back into the original object. The
1515final linker will add all these references into a single table.
9b803c0f
SR
1516
1517On boot up, before SMP is initialized, the dynamic ftrace code
5752674e
IM
1518scans this table and updates all the locations into nops. It
1519also records the locations, which are added to the
1520available_filter_functions list. Modules are processed as they
1521are loaded and before they are executed. When a module is
1522unloaded, it also removes its functions from the ftrace function
1523list. This is automatic in the module unload code, and the
1524module author does not need to worry about it.
1525
1526When tracing is enabled, kstop_machine is called to prevent
1527races with the CPUS executing code being modified (which can
1528cause the CPU to do undesireable things), and the nops are
1529patched back to calls. But this time, they do not call mcount
1530(which is just a function stub). They now call into the ftrace
1531infrastructure.
eb6d42ea
SR
1532
1533One special side-effect to the recording of the functions being
f2d9c740 1534traced is that we can now selectively choose which functions we
5752674e
IM
1535wish to trace and which ones we want the mcount calls to remain
1536as nops.
eb6d42ea 1537
5752674e
IM
1538Two files are used, one for enabling and one for disabling the
1539tracing of specified functions. They are:
eb6d42ea
SR
1540
1541 set_ftrace_filter
1542
1543and
1544
1545 set_ftrace_notrace
1546
5752674e
IM
1547A list of available functions that you can add to these files is
1548listed in:
eb6d42ea
SR
1549
1550 available_filter_functions
1551
1552 # cat /debug/tracing/available_filter_functions
1553put_prev_task_idle
1554kmem_cache_create
1555pick_next_task_rt
1556get_online_cpus
1557pick_next_task_fair
1558mutex_lock
1559[...]
1560
f2d9c740 1561If I am only interested in sys_nanosleep and hrtimer_interrupt:
eb6d42ea
SR
1562
1563 # echo sys_nanosleep hrtimer_interrupt \
1564 > /debug/tracing/set_ftrace_filter
1565 # echo ftrace > /debug/tracing/current_tracer
1566 # echo 1 > /debug/tracing/tracing_enabled
1567 # usleep 1
1568 # echo 0 > /debug/tracing/tracing_enabled
1569 # cat /debug/tracing/trace
1570# tracer: ftrace
1571#
1572# TASK-PID CPU# TIMESTAMP FUNCTION
1573# | | | | |
1574 usleep-4134 [00] 1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt
1575 usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call
1576 <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt
1577
f2d9c740 1578To see which functions are being traced, you can cat the file:
eb6d42ea
SR
1579
1580 # cat /debug/tracing/set_ftrace_filter
1581hrtimer_interrupt
1582sys_nanosleep
1583
1584
5752674e
IM
1585Perhaps this is not enough. The filters also allow simple wild
1586cards. Only the following are currently available
eb6d42ea 1587
a41eebab 1588 <match>* - will match functions that begin with <match>
eb6d42ea
SR
1589 *<match> - will match functions that end with <match>
1590 *<match>* - will match functions that have <match> in it
1591
f2d9c740 1592These are the only wild cards which are supported.
eb6d42ea
SR
1593
1594 <match>*<match> will not work.
1595
5752674e
IM
1596Note: It is better to use quotes to enclose the wild cards,
1597 otherwise the shell may expand the parameters into names
1598 of files in the local directory.
c072c249 1599
1600 # echo 'hrtimer_*' > /debug/tracing/set_ftrace_filter
eb6d42ea
SR
1601
1602Produces:
1603
1604# tracer: ftrace
1605#
1606# TASK-PID CPU# TIMESTAMP FUNCTION
1607# | | | | |
1608 bash-4003 [00] 1480.611794: hrtimer_init <-copy_process
1609 bash-4003 [00] 1480.611941: hrtimer_start <-hrtick_set
1610 bash-4003 [00] 1480.611956: hrtimer_cancel <-hrtick_clear
1611 bash-4003 [00] 1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel
1612 <idle>-0 [00] 1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt
1613 <idle>-0 [00] 1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt
1614 <idle>-0 [00] 1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt
1615 <idle>-0 [00] 1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt
1616 <idle>-0 [00] 1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt
1617
1618
1619Notice that we lost the sys_nanosleep.
1620
1621 # cat /debug/tracing/set_ftrace_filter
1622hrtimer_run_queues
1623hrtimer_run_pending
1624hrtimer_init
1625hrtimer_cancel
1626hrtimer_try_to_cancel
1627hrtimer_forward
1628hrtimer_start
1629hrtimer_reprogram
1630hrtimer_force_reprogram
1631hrtimer_get_next_event
1632hrtimer_interrupt
1633hrtimer_nanosleep
1634hrtimer_wakeup
1635hrtimer_get_remaining
1636hrtimer_get_res
1637hrtimer_init_sleeper
1638
1639
1640This is because the '>' and '>>' act just like they do in bash.
1641To rewrite the filters, use '>'
1642To append to the filters, use '>>'
1643
5752674e
IM
1644To clear out a filter so that all functions will be recorded
1645again:
eb6d42ea
SR
1646
1647 # echo > /debug/tracing/set_ftrace_filter
1648 # cat /debug/tracing/set_ftrace_filter
1649 #
1650
1651Again, now we want to append.
1652
1653 # echo sys_nanosleep > /debug/tracing/set_ftrace_filter
1654 # cat /debug/tracing/set_ftrace_filter
1655sys_nanosleep
c072c249 1656 # echo 'hrtimer_*' >> /debug/tracing/set_ftrace_filter
eb6d42ea
SR
1657 # cat /debug/tracing/set_ftrace_filter
1658hrtimer_run_queues
1659hrtimer_run_pending
1660hrtimer_init
1661hrtimer_cancel
1662hrtimer_try_to_cancel
1663hrtimer_forward
1664hrtimer_start
1665hrtimer_reprogram
1666hrtimer_force_reprogram
1667hrtimer_get_next_event
1668hrtimer_interrupt
1669sys_nanosleep
1670hrtimer_nanosleep
1671hrtimer_wakeup
1672hrtimer_get_remaining
1673hrtimer_get_res
1674hrtimer_init_sleeper
1675
1676
5752674e
IM
1677The set_ftrace_notrace prevents those functions from being
1678traced.
eb6d42ea
SR
1679
1680 # echo '*preempt*' '*lock*' > /debug/tracing/set_ftrace_notrace
1681
1682Produces:
1683
1684# tracer: ftrace
1685#
1686# TASK-PID CPU# TIMESTAMP FUNCTION
1687# | | | | |
1688 bash-4043 [01] 115.281644: finish_task_switch <-schedule
1689 bash-4043 [01] 115.281645: hrtick_set <-schedule
1690 bash-4043 [01] 115.281645: hrtick_clear <-hrtick_set
1691 bash-4043 [01] 115.281646: wait_for_completion <-__stop_machine_run
1692 bash-4043 [01] 115.281647: wait_for_common <-wait_for_completion
1693 bash-4043 [01] 115.281647: kthread_stop <-stop_machine_run
1694 bash-4043 [01] 115.281648: init_waitqueue_head <-kthread_stop
1695 bash-4043 [01] 115.281648: wake_up_process <-kthread_stop
1696 bash-4043 [01] 115.281649: try_to_wake_up <-wake_up_process
1697
1698We can see that there's no more lock or preempt tracing.
1699
985ec20a 1700
5752674e
IM
1701Dynamic ftrace with the function graph tracer
1702---------------------------------------------
985ec20a 1703
5752674e
IM
1704Although what has been explained above concerns both the
1705function tracer and the function-graph-tracer, there are some
1706special features only available in the function-graph tracer.
985ec20a 1707
5752674e
IM
1708If you want to trace only one function and all of its children,
1709you just have to echo its name into set_graph_function:
985ec20a 1710
5752674e 1711 echo __do_fault > set_graph_function
985ec20a 1712
5752674e
IM
1713will produce the following "expanded" trace of the __do_fault()
1714function:
985ec20a
FW
1715
1716 0) | __do_fault() {
1717 0) | filemap_fault() {
1718 0) | find_lock_page() {
1719 0) 0.804 us | find_get_page();
1720 0) | __might_sleep() {
1721 0) 1.329 us | }
1722 0) 3.904 us | }
1723 0) 4.979 us | }
1724 0) 0.653 us | _spin_lock();
1725 0) 0.578 us | page_add_file_rmap();
1726 0) 0.525 us | native_set_pte_at();
1727 0) 0.585 us | _spin_unlock();
1728 0) | unlock_page() {
1729 0) 0.541 us | page_waitqueue();
1730 0) 0.639 us | __wake_up_bit();
1731 0) 2.786 us | }
1732 0) + 14.237 us | }
1733 0) | __do_fault() {
1734 0) | filemap_fault() {
1735 0) | find_lock_page() {
1736 0) 0.698 us | find_get_page();
1737 0) | __might_sleep() {
1738 0) 1.412 us | }
1739 0) 3.950 us | }
1740 0) 5.098 us | }
1741 0) 0.631 us | _spin_lock();
1742 0) 0.571 us | page_add_file_rmap();
1743 0) 0.526 us | native_set_pte_at();
1744 0) 0.586 us | _spin_unlock();
1745 0) | unlock_page() {
1746 0) 0.533 us | page_waitqueue();
1747 0) 0.638 us | __wake_up_bit();
1748 0) 2.793 us | }
1749 0) + 14.012 us | }
1750
5752674e 1751You can also expand several functions at once:
985ec20a 1752
5752674e
IM
1753 echo sys_open > set_graph_function
1754 echo sys_close >> set_graph_function
985ec20a 1755
5752674e
IM
1756Now if you want to go back to trace all functions you can clear
1757this special filter via:
985ec20a 1758
5752674e 1759 echo > set_graph_function
985ec20a
FW
1760
1761
eb6d42ea
SR
1762trace_pipe
1763----------
1764
5752674e
IM
1765The trace_pipe outputs the same content as the trace file, but
1766the effect on the tracing is different. Every read from
1767trace_pipe is consumed. This means that subsequent reads will be
1768different. The trace is live.
eb6d42ea 1769
9b803c0f 1770 # echo function > /debug/tracing/current_tracer
eb6d42ea
SR
1771 # cat /debug/tracing/trace_pipe > /tmp/trace.out &
1772[1] 4153
1773 # echo 1 > /debug/tracing/tracing_enabled
1774 # usleep 1
1775 # echo 0 > /debug/tracing/tracing_enabled
1776 # cat /debug/tracing/trace
9b803c0f 1777# tracer: function
eb6d42ea
SR
1778#
1779# TASK-PID CPU# TIMESTAMP FUNCTION
1780# | | | | |
1781
1782 #
1783 # cat /tmp/trace.out
1784 bash-4043 [00] 41.267106: finish_task_switch <-schedule
1785 bash-4043 [00] 41.267106: hrtick_set <-schedule
1786 bash-4043 [00] 41.267107: hrtick_clear <-hrtick_set
1787 bash-4043 [00] 41.267108: wait_for_completion <-__stop_machine_run
1788 bash-4043 [00] 41.267108: wait_for_common <-wait_for_completion
1789 bash-4043 [00] 41.267109: kthread_stop <-stop_machine_run
1790 bash-4043 [00] 41.267109: init_waitqueue_head <-kthread_stop
1791 bash-4043 [00] 41.267110: wake_up_process <-kthread_stop
1792 bash-4043 [00] 41.267110: try_to_wake_up <-wake_up_process
1793 bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up
1794
1795
5752674e
IM
1796Note, reading the trace_pipe file will block until more input is
1797added. By changing the tracer, trace_pipe will issue an EOF. We
1798needed to set the function tracer _before_ we "cat" the
1799trace_pipe file.
eb6d42ea
SR
1800
1801
1802trace entries
1803-------------
1804
5752674e
IM
1805Having too much or not enough data can be troublesome in
1806diagnosing an issue in the kernel. The file buffer_size_kb is
1807used to modify the size of the internal trace buffers. The
1808number listed is the number of entries that can be recorded per
1809CPU. To know the full size, multiply the number of possible CPUS
1810with the number of entries.
eb6d42ea 1811
a94c80e7 1812 # cat /debug/tracing/buffer_size_kb
1696b2b0 18131408 (units kilobytes)
eb6d42ea 1814
5752674e
IM
1815Note, to modify this, you must have tracing completely disabled.
1816To do that, echo "nop" into the current_tracer. If the
1817current_tracer is not set to "nop", an EINVAL error will be
1818returned.
eb6d42ea 1819
9b803c0f 1820 # echo nop > /debug/tracing/current_tracer
1696b2b0 1821 # echo 10000 > /debug/tracing/buffer_size_kb
a94c80e7 1822 # cat /debug/tracing/buffer_size_kb
1696b2b0 182310000 (units kilobytes)
eb6d42ea 1824
5752674e
IM
1825The number of pages which will be allocated is limited to a
1826percentage of available memory. Allocating too much will produce
1827an error.
eb6d42ea 1828
a94c80e7 1829 # echo 1000000000000 > /debug/tracing/buffer_size_kb
eb6d42ea 1830-bash: echo: write error: Cannot allocate memory
a94c80e7 1831 # cat /debug/tracing/buffer_size_kb
eb6d42ea
SR
183285
1833
5752674e
IM
1834-----------
1835
1836More details can be found in the source code, in the
1837kernel/tracing/*.c files.