Merge branch 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux...
[GitHub/mt8127/android_kernel_alcatel_ttab.git] / Documentation / ftrace.txt
CommitLineData
eb6d42ea
SR
1 ftrace - Function Tracer
2 ========================
3
4Copyright 2008 Red Hat Inc.
a41eebab
SR
5 Author: Steven Rostedt <srostedt@redhat.com>
6 License: The GNU Free Documentation License, Version 1.2
7Reviewers: Elias Oltmanns and Randy Dunlap
eb6d42ea 8
a41eebab 9Writen for: 2.6.26-rc8 linux-2.6-tip.git tip/tracing/ftrace branch
eb6d42ea
SR
10
11Introduction
12------------
13
14Ftrace is an internal tracer designed to help out developers and
15designers of systems to find what is going on inside the kernel.
16It can be used for debugging or analyzing latencies and performance
17issues that take place outside of user-space.
18
19Although ftrace is the function tracer, it also includes an
20infrastructure that allows for other types of tracing. Some of the
21tracers that are currently in ftrace is a tracer to trace
22context switches, the time it takes for a high priority task to
23run after it was woken up, the time interrupts are disabled, and
24more.
25
26
27The File System
28---------------
29
30Ftrace uses the debugfs file system to hold the control files as well
31as the files to display output.
32
33To mount the debugfs system:
34
35 # mkdir /debug
36 # mount -t debugfs nodev /debug
37
38
39That's it! (assuming that you have ftrace configured into your kernel)
40
41After mounting the debugfs, you can see a directory called
42"tracing". This directory contains the control and output files
43of ftrace. Here is a list of some of the key files:
44
45
46 Note: all time values are in microseconds.
47
48 current_tracer : This is used to set or display the current tracer
49 that is configured.
50
51 available_tracers : This holds the different types of tracers that
a41eebab 52 have been compiled into the kernel. The tracers
eb6d42ea
SR
53 listed here can be configured by echoing in their
54 name into current_tracer.
55
56 tracing_enabled : This sets or displays whether the current_tracer
57 is activated and tracing or not. Echo 0 into this
58 file to disable the tracer or 1 (or non-zero) to
59 enable it.
60
61 trace : This file holds the output of the trace in a human readable
62 format.
63
64 latency_trace : This file shows the same trace but the information
65 is organized more to display possible latencies
66 in the system.
67
68 trace_pipe : The output is the same as the "trace" file but this
69 file is meant to be streamed with live tracing.
70 Reads from this file will block until new data
71 is retrieved. Unlike the "trace" and "latency_trace"
72 files, this file is a consumer. This means reading
73 from this file causes sequential reads to display
74 more current data. Once data is read from this
75 file, it is consumed, and will not be read
76 again with a sequential read. The "trace" and
77 "latency_trace" files are static, and if the
78 tracer isn't adding more data, they will display
79 the same information every time they are read.
80
81 iter_ctrl : This file lets the user control the amount of data
82 that is displayed in one of the above output
83 files.
84
85 trace_max_latency : Some of the tracers record the max latency.
86 For example, the time interrupts are disabled.
87 This time is saved in this file. The max trace
88 will also be stored, and displayed by either
89 "trace" or "latency_trace". A new max trace will
90 only be recorded if the latency is greater than
91 the value in this file. (in microseconds)
92
93 trace_entries : This sets or displays the number of trace
94 entries each CPU buffer can hold. The tracer buffers
95 are the same size for each CPU, so care must be
a41eebab
SR
96 taken when modifying the trace_entries. The trace
97 buffers are allocated in pages (blocks of memory that
98 the kernel uses for allocation, usually 4 KB in size).
99 Since each entry is smaller than a page, if the last
100 allocated page has room for more entries than were
101 requested, the rest of the page is used to allocate
102 entries.
eb6d42ea
SR
103
104 This can only be updated when the current_tracer
105 is set to "none".
106
107 NOTE: It is planned on changing the allocated buffers
108 from being the number of possible CPUS to
109 the number of online CPUS.
110
111 tracing_cpumask : This is a mask that lets the user only trace
112 on specified CPUS. The format is a hex string
113 representing the CPUS.
114
115 set_ftrace_filter : When dynamic ftrace is configured in, the
116 code is dynamically modified to disable calling
117 of the function profiler (mcount). This lets
118 tracing be configured in with practically no overhead
119 in performance. This also has a side effect of
120 enabling or disabling specific functions to be
121 traced. Echoing in names of functions into this
a41eebab 122 file will limit the trace to only these functions.
eb6d42ea
SR
123
124 set_ftrace_notrace: This has the opposite effect that
125 set_ftrace_filter has. Any function that is added
126 here will not be traced. If a function exists
a41eebab
SR
127 in both set_ftrace_filter and set_ftrace_notrace,
128 the function will _not_ be traced.
eb6d42ea
SR
129
130 available_filter_functions : When a function is encountered the first
131 time by the dynamic tracer, it is recorded and
132 later the call is converted into a nop. This file
133 lists the functions that have been recorded
134 by the dynamic tracer and these functions can
135 be used to set the ftrace filter by the above
136 "set_ftrace_filter" file.
137
138
139The Tracers
140-----------
141
142Here are the list of current tracers that can be configured.
143
144 ftrace - function tracer that uses mcount to trace all functions.
145 It is possible to filter out which functions that are
a41eebab 146 to be traced when dynamic ftrace is configured in.
eb6d42ea
SR
147
148 sched_switch - traces the context switches between tasks.
149
150 irqsoff - traces the areas that disable interrupts and saves off
151 the trace with the longest max latency.
152 See tracing_max_latency. When a new max is recorded,
153 it replaces the old trace. It is best to view this
154 trace with the latency_trace file.
155
156 preemptoff - Similar to irqsoff but traces and records the time
157 preemption is disabled.
158
159 preemptirqsoff - Similar to irqsoff and preemptoff, but traces and
160 records the largest time irqs and/or preemption is
161 disabled.
162
163 wakeup - Traces and records the max latency that it takes for
164 the highest priority task to get scheduled after
165 it has been woken up.
166
167 none - This is not a tracer. To remove all tracers from tracing
168 simply echo "none" into current_tracer.
169
170
171Examples of using the tracer
172----------------------------
173
174Here are typical examples of using the tracers with only controlling
175them with the debugfs interface (without using any user-land utilities).
176
177Output format:
178--------------
179
180Here's an example of the output format of the file "trace"
181
182 --------
183# tracer: ftrace
184#
185# TASK-PID CPU# TIMESTAMP FUNCTION
186# | | | | |
187 bash-4251 [01] 10152.583854: path_put <-path_walk
188 bash-4251 [01] 10152.583855: dput <-path_put
189 bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput
190 --------
191
192A header is printed with the trace that is represented. In this case
193the tracer is "ftrace". Then a header showing the format. Task name
194"bash", the task PID "4251", the CPU that it was running on
195"01", the timestamp in <secs>.<usecs> format, the function name that was
196traced "path_put" and the parent function that called this function
197"path_walk".
198
199The sched_switch tracer also includes tracing of task wake ups and
200context switches.
201
202 ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S
203 ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 10:115:S
204 ksoftirqd/1-7 [01] 1453.070013: 7:115:R ==> 10:115:R
205 events/1-10 [01] 1453.070013: 10:115:S ==> 2916:115:R
206 kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R
207 ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R
208
209Wake ups are represented by a "+" and the context switches show
210"==>". The format is:
211
212 Context switches:
213
214 Previous task Next Task
215
216 <pid>:<prio>:<state> ==> <pid>:<prio>:<state>
217
218 Wake ups:
219
220 Current task Task waking up
221
222 <pid>:<prio>:<state> + <pid>:<prio>:<state>
223
224The prio is the internal kernel priority, which is inverse to the
225priority that is usually displayed by user-space tools. Zero represents
226the highest priority (99). Prio 100 starts the "nice" priorities with
227100 being equal to nice -20 and 139 being nice 19. The prio "140" is
228reserved for the idle task which is the lowest priority thread (pid 0).
229
230
231Latency trace format
232--------------------
233
234For traces that display latency times, the latency_trace file gives
235a bit more information to see why a latency happened. Here's a typical
236trace.
237
238# tracer: irqsoff
239#
240irqsoff latency trace v1.1.5 on 2.6.26-rc8
241--------------------------------------------------------------------
242 latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
243 -----------------
244 | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
245 -----------------
246 => started at: apic_timer_interrupt
247 => ended at: do_softirq
248
249# _------=> CPU#
250# / _-----=> irqs-off
251# | / _----=> need-resched
252# || / _---=> hardirq/softirq
253# ||| / _--=> preempt-depth
254# |||| /
255# ||||| delay
256# cmd pid ||||| time | caller
257# \ / ||||| \ | /
258 <idle>-0 0d..1 0us+: trace_hardirqs_off_thunk (apic_timer_interrupt)
259 <idle>-0 0d.s. 97us : __do_softirq (do_softirq)
260 <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq)
261
262
263vim:ft=help
264
265
266This shows that the current tracer is "irqsoff" tracing the time
267interrupts are disabled. It gives the trace version and the kernel
268this was executed on (2.6.26-rc8). Then it displays the max latency
269in microsecs (97 us). The number of trace entries displayed
270by the total number recorded (both are three: #3/3). The type of
271preemption that was used (PREEMPT). VP, KP, SP, and HP are always zero
272and reserved for later use. #P is the number of online CPUS (#P:2).
273
274The task is the process that was running when the latency happened.
275(swapper pid: 0).
276
277The start and stop that caused the latencies:
278
279 apic_timer_interrupt is where the interrupts were disabled.
280 do_softirq is where they were enabled again.
281
282The next lines after the header are the trace itself. The header
283explains which is which.
284
285 cmd: The name of the process in the trace.
286
287 pid: The PID of that process.
288
289 CPU#: The CPU that the process was running on.
290
291 irqs-off: 'd' interrupts are disabled. '.' otherwise.
292
293 need-resched: 'N' task need_resched is set, '.' otherwise.
294
295 hardirq/softirq:
296 'H' - hard irq happened inside a softirq.
297 'h' - hard irq is running
298 's' - soft irq is running
299 '.' - normal context.
300
301 preempt-depth: The level of preempt_disabled
302
303The above is mostly meaningful for kernel developers.
304
a41eebab
SR
305 time: This differs from the trace file output. The trace file output
306 included an absolute timestamp. The timestamp used by the
307 latency_trace file is relative to the start of the trace.
eb6d42ea
SR
308
309 delay: This is just to help catch your eye a bit better. And
310 needs to be fixed to be only relative to the same CPU.
a41eebab 311 The marks are determined by the difference between this
eb6d42ea
SR
312 current trace and the next trace.
313 '!' - greater than preempt_mark_thresh (default 100)
314 '+' - greater than 1 microsecond
315 ' ' - less than or equal to 1 microsecond.
316
317 The rest is the same as the 'trace' file.
318
319
320iter_ctrl
321---------
322
323The iter_ctrl file is used to control what gets printed in the trace
324output. To see what is available, simply cat the file:
325
326 cat /debug/tracing/iter_ctrl
327 print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
328 noblock nostacktrace nosched-tree
329
a41eebab 330To disable one of the options, echo in the option prepended with "no".
eb6d42ea
SR
331
332 echo noprint-parent > /debug/tracing/iter_ctrl
333
334To enable an option, leave off the "no".
335
a41eebab 336 echo sym-offset > /debug/tracing/iter_ctrl
eb6d42ea
SR
337
338Here are the available options:
339
340 print-parent - On function traces, display the calling function
341 as well as the function being traced.
342
343 print-parent:
344 bash-4000 [01] 1477.606694: simple_strtoul <-strict_strtoul
345
346 noprint-parent:
347 bash-4000 [01] 1477.606694: simple_strtoul
348
349
350 sym-offset - Display not only the function name, but also the offset
351 in the function. For example, instead of seeing just
a41eebab 352 "ktime_get", you will see "ktime_get+0xb/0x20".
eb6d42ea
SR
353
354 sym-offset:
355 bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0
356
357 sym-addr - this will also display the function address as well as
358 the function name.
359
360 sym-addr:
361 bash-4000 [01] 1477.606694: simple_strtoul <c0339346>
362
363 verbose - This deals with the latency_trace file.
364
365 bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
366 (+0.000ms): simple_strtoul (strict_strtoul)
367
368 raw - This will display raw numbers. This option is best for use with
369 user applications that can translate the raw numbers better than
370 having it done in the kernel.
371
a41eebab 372 hex - Similar to raw, but the numbers will be in a hexadecimal format.
eb6d42ea
SR
373
374 bin - This will print out the formats in raw binary.
375
376 block - TBD (needs update)
377
378 stacktrace - This is one of the options that changes the trace itself.
379 When a trace is recorded, so is the stack of functions.
380 This allows for back traces of trace sites.
381
382 sched-tree - TBD (any users??)
383
384
385sched_switch
386------------
387
388This tracer simply records schedule switches. Here's an example
a41eebab 389of how to use it.
eb6d42ea
SR
390
391 # echo sched_switch > /debug/tracing/current_tracer
392 # echo 1 > /debug/tracing/tracing_enabled
393 # sleep 1
394 # echo 0 > /debug/tracing/tracing_enabled
395 # cat /debug/tracing/trace
396
397# tracer: sched_switch
398#
399# TASK-PID CPU# TIMESTAMP FUNCTION
400# | | | | |
401 bash-3997 [01] 240.132281: 3997:120:R + 4055:120:R
402 bash-3997 [01] 240.132284: 3997:120:R ==> 4055:120:R
403 sleep-4055 [01] 240.132371: 4055:120:S ==> 3997:120:R
404 bash-3997 [01] 240.132454: 3997:120:R + 4055:120:S
405 bash-3997 [01] 240.132457: 3997:120:R ==> 4055:120:R
406 sleep-4055 [01] 240.132460: 4055:120:D ==> 3997:120:R
407 bash-3997 [01] 240.132463: 3997:120:R + 4055:120:D
408 bash-3997 [01] 240.132465: 3997:120:R ==> 4055:120:R
409 <idle>-0 [00] 240.132589: 0:140:R + 4:115:S
410 <idle>-0 [00] 240.132591: 0:140:R ==> 4:115:R
411 ksoftirqd/0-4 [00] 240.132595: 4:115:S ==> 0:140:R
412 <idle>-0 [00] 240.132598: 0:140:R + 4:115:S
413 <idle>-0 [00] 240.132599: 0:140:R ==> 4:115:R
414 ksoftirqd/0-4 [00] 240.132603: 4:115:S ==> 0:140:R
415 sleep-4055 [01] 240.133058: 4055:120:S ==> 3997:120:R
416 [...]
417
418
419As we have discussed previously about this format, the header shows
420the name of the trace and points to the options. The "FUNCTION"
421is a misnomer since here it represents the wake ups and context
422switches.
423
424The sched_switch only lists the wake ups (represented with '+')
425and context switches ('==>') with the previous task or current
426first followed by the next task or task waking up. The format for both
427of these is PID:KERNEL-PRIO:TASK-STATE. Remember that the KERNEL-PRIO
428is the inverse of the actual priority with zero (0) being the highest
429priority and the nice values starting at 100 (nice -20). Below is
430a quick chart to map the kernel priority to user land priorities.
431
432 Kernel priority: 0 to 99 ==> user RT priority 99 to 0
433 Kernel priority: 100 to 139 ==> user nice -20 to 19
434 Kernel priority: 140 ==> idle task priority
435
436The task states are:
437
438 R - running : wants to run, may not actually be running
439 S - sleep : process is waiting to be woken up (handles signals)
440 D - deep sleep : process must be woken up (ignores signals)
441 T - stopped : process suspended
442 t - traced : process is being traced (with something like gdb)
443 Z - zombie : process waiting to be cleaned up
444 X - unknown
445
446
447ftrace_enabled
448--------------
449
450The following tracers give different output depending on whether
451or not the sysctl ftrace_enabled is set. To set ftrace_enabled,
452one can either use the sysctl function or set it via the proc
453file system interface.
454
455 sysctl kernel.ftrace_enabled=1
456
457 or
458
459 echo 1 > /proc/sys/kernel/ftrace_enabled
460
461To disable ftrace_enabled simply replace the '1' with '0' in
462the above commands.
463
464When ftrace_enabled is set the tracers will also record the functions
465that are within the trace. The descriptions of the tracers
466will also show an example with ftrace enabled.
467
468
469irqsoff
470-------
471
472When interrupts are disabled, the CPU can not react to any other
473external event (besides NMIs and SMIs). This prevents the timer
474interrupt from triggering or the mouse interrupt from letting the
475kernel know of a new mouse event. The result is a latency with the
476reaction time.
477
a41eebab 478The irqsoff tracer tracks the time interrupts are disabled to the time
eb6d42ea
SR
479they are re-enabled. When a new maximum latency is hit, it saves off
480the trace so that it may be retrieved at a later time. Every time a
481new maximum in reached, the old saved trace is discarded and the new
482trace is saved.
483
484To reset the maximum, echo 0 into tracing_max_latency. Here's an
485example:
486
487 # echo irqsoff > /debug/tracing/current_tracer
488 # echo 0 > /debug/tracing/tracing_max_latency
489 # echo 1 > /debug/tracing/tracing_enabled
490 # ls -ltr
491 [...]
492 # echo 0 > /debug/tracing/tracing_enabled
493 # cat /debug/tracing/latency_trace
494# tracer: irqsoff
495#
496irqsoff latency trace v1.1.5 on 2.6.26-rc8
497--------------------------------------------------------------------
498 latency: 6 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
499 -----------------
500 | task: bash-4269 (uid:0 nice:0 policy:0 rt_prio:0)
501 -----------------
502 => started at: copy_page_range
503 => ended at: copy_page_range
504
505# _------=> CPU#
506# / _-----=> irqs-off
507# | / _----=> need-resched
508# || / _---=> hardirq/softirq
509# ||| / _--=> preempt-depth
510# |||| /
511# ||||| delay
512# cmd pid ||||| time | caller
513# \ / ||||| \ | /
514 bash-4269 1...1 0us+: _spin_lock (copy_page_range)
515 bash-4269 1...1 7us : _spin_unlock (copy_page_range)
516 bash-4269 1...2 7us : trace_preempt_on (copy_page_range)
517
518
519vim:ft=help
520
521Here we see that that we had a latency of 6 microsecs (which is
522very good). The spin_lock in copy_page_range disabled interrupts.
523The difference between the 6 and the displayed timestamp 7us is
524because the clock must have incremented between the time of recording
525the max latency and recording the function that had that latency.
526
a41eebab 527Note the above had ftrace_enabled not set. If we set the ftrace_enabled,
eb6d42ea
SR
528we get a much larger output:
529
530# tracer: irqsoff
531#
532irqsoff latency trace v1.1.5 on 2.6.26-rc8
533--------------------------------------------------------------------
534 latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
535 -----------------
536 | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0)
537 -----------------
538 => started at: __alloc_pages_internal
539 => ended at: __alloc_pages_internal
540
541# _------=> CPU#
542# / _-----=> irqs-off
543# | / _----=> need-resched
544# || / _---=> hardirq/softirq
545# ||| / _--=> preempt-depth
546# |||| /
547# ||||| delay
548# cmd pid ||||| time | caller
549# \ / ||||| \ | /
550 ls-4339 0...1 0us+: get_page_from_freelist (__alloc_pages_internal)
551 ls-4339 0d..1 3us : rmqueue_bulk (get_page_from_freelist)
552 ls-4339 0d..1 3us : _spin_lock (rmqueue_bulk)
553 ls-4339 0d..1 4us : add_preempt_count (_spin_lock)
554 ls-4339 0d..2 4us : __rmqueue (rmqueue_bulk)
555 ls-4339 0d..2 5us : __rmqueue_smallest (__rmqueue)
556 ls-4339 0d..2 5us : __mod_zone_page_state (__rmqueue_smallest)
557 ls-4339 0d..2 6us : __rmqueue (rmqueue_bulk)
558 ls-4339 0d..2 6us : __rmqueue_smallest (__rmqueue)
559 ls-4339 0d..2 7us : __mod_zone_page_state (__rmqueue_smallest)
560 ls-4339 0d..2 7us : __rmqueue (rmqueue_bulk)
561 ls-4339 0d..2 8us : __rmqueue_smallest (__rmqueue)
562[...]
563 ls-4339 0d..2 46us : __rmqueue_smallest (__rmqueue)
564 ls-4339 0d..2 47us : __mod_zone_page_state (__rmqueue_smallest)
565 ls-4339 0d..2 47us : __rmqueue (rmqueue_bulk)
566 ls-4339 0d..2 48us : __rmqueue_smallest (__rmqueue)
567 ls-4339 0d..2 48us : __mod_zone_page_state (__rmqueue_smallest)
568 ls-4339 0d..2 49us : _spin_unlock (rmqueue_bulk)
569 ls-4339 0d..2 49us : sub_preempt_count (_spin_unlock)
570 ls-4339 0d..1 50us : get_page_from_freelist (__alloc_pages_internal)
571 ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal)
572
573
574vim:ft=help
575
576
577Here we traced a 50 microsecond latency. But we also see all the
a41eebab
SR
578functions that were called during that time. Note that by enabling
579function tracing, we endure an added overhead. This overhead may
580extend the latency times. But nevertheless, this trace has provided
581some very helpful debugging information.
eb6d42ea
SR
582
583
584preemptoff
585----------
586
a41eebab
SR
587When preemption is disabled, we may be able to receive interrupts but
588the task cannot be preempted and a higher priority task must wait
eb6d42ea
SR
589for preemption to be enabled again before it can preempt a lower
590priority task.
591
a41eebab 592The preemptoff tracer traces the places that disable preemption.
eb6d42ea
SR
593Like the irqsoff, it records the maximum latency that preemption
594was disabled. The control of preemptoff is much like the irqsoff.
595
596 # echo preemptoff > /debug/tracing/current_tracer
597 # echo 0 > /debug/tracing/tracing_max_latency
598 # echo 1 > /debug/tracing/tracing_enabled
599 # ls -ltr
600 [...]
601 # echo 0 > /debug/tracing/tracing_enabled
602 # cat /debug/tracing/latency_trace
603# tracer: preemptoff
604#
605preemptoff latency trace v1.1.5 on 2.6.26-rc8
606--------------------------------------------------------------------
607 latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
608 -----------------
609 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
610 -----------------
611 => started at: do_IRQ
612 => ended at: __do_softirq
613
614# _------=> CPU#
615# / _-----=> irqs-off
616# | / _----=> need-resched
617# || / _---=> hardirq/softirq
618# ||| / _--=> preempt-depth
619# |||| /
620# ||||| delay
621# cmd pid ||||| time | caller
622# \ / ||||| \ | /
623 sshd-4261 0d.h. 0us+: irq_enter (do_IRQ)
624 sshd-4261 0d.s. 29us : _local_bh_enable (__do_softirq)
625 sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq)
626
627
628vim:ft=help
629
630This has some more changes. Preemption was disabled when an interrupt
631came in (notice the 'h'), and was enabled while doing a softirq.
632(notice the 's'). But we also see that interrupts have been disabled
633when entering the preempt off section and leaving it (the 'd').
634We do not know if interrupts were enabled in the mean time.
635
636# tracer: preemptoff
637#
638preemptoff latency trace v1.1.5 on 2.6.26-rc8
639--------------------------------------------------------------------
640 latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
641 -----------------
642 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
643 -----------------
644 => started at: remove_wait_queue
645 => ended at: __do_softirq
646
647# _------=> CPU#
648# / _-----=> irqs-off
649# | / _----=> need-resched
650# || / _---=> hardirq/softirq
651# ||| / _--=> preempt-depth
652# |||| /
653# ||||| delay
654# cmd pid ||||| time | caller
655# \ / ||||| \ | /
656 sshd-4261 0d..1 0us : _spin_lock_irqsave (remove_wait_queue)
657 sshd-4261 0d..1 1us : _spin_unlock_irqrestore (remove_wait_queue)
658 sshd-4261 0d..1 2us : do_IRQ (common_interrupt)
659 sshd-4261 0d..1 2us : irq_enter (do_IRQ)
660 sshd-4261 0d..1 2us : idle_cpu (irq_enter)
661 sshd-4261 0d..1 3us : add_preempt_count (irq_enter)
662 sshd-4261 0d.h1 3us : idle_cpu (irq_enter)
663 sshd-4261 0d.h. 4us : handle_fasteoi_irq (do_IRQ)
664[...]
665 sshd-4261 0d.h. 12us : add_preempt_count (_spin_lock)
666 sshd-4261 0d.h1 12us : ack_ioapic_quirk_irq (handle_fasteoi_irq)
667 sshd-4261 0d.h1 13us : move_native_irq (ack_ioapic_quirk_irq)
668 sshd-4261 0d.h1 13us : _spin_unlock (handle_fasteoi_irq)
669 sshd-4261 0d.h1 14us : sub_preempt_count (_spin_unlock)
670 sshd-4261 0d.h1 14us : irq_exit (do_IRQ)
671 sshd-4261 0d.h1 15us : sub_preempt_count (irq_exit)
672 sshd-4261 0d..2 15us : do_softirq (irq_exit)
673 sshd-4261 0d... 15us : __do_softirq (do_softirq)
674 sshd-4261 0d... 16us : __local_bh_disable (__do_softirq)
675 sshd-4261 0d... 16us+: add_preempt_count (__local_bh_disable)
676 sshd-4261 0d.s4 20us : add_preempt_count (__local_bh_disable)
677 sshd-4261 0d.s4 21us : sub_preempt_count (local_bh_enable)
678 sshd-4261 0d.s5 21us : sub_preempt_count (local_bh_enable)
679[...]
680 sshd-4261 0d.s6 41us : add_preempt_count (__local_bh_disable)
681 sshd-4261 0d.s6 42us : sub_preempt_count (local_bh_enable)
682 sshd-4261 0d.s7 42us : sub_preempt_count (local_bh_enable)
683 sshd-4261 0d.s5 43us : add_preempt_count (__local_bh_disable)
684 sshd-4261 0d.s5 43us : sub_preempt_count (local_bh_enable_ip)
685 sshd-4261 0d.s6 44us : sub_preempt_count (local_bh_enable_ip)
686 sshd-4261 0d.s5 44us : add_preempt_count (__local_bh_disable)
687 sshd-4261 0d.s5 45us : sub_preempt_count (local_bh_enable)
688[...]
689 sshd-4261 0d.s. 63us : _local_bh_enable (__do_softirq)
690 sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq)
691
692
693The above is an example of the preemptoff trace with ftrace_enabled
694set. Here we see that interrupts were disabled the entire time.
695The irq_enter code lets us know that we entered an interrupt 'h'.
696Before that, the functions being traced still show that it is not
697in an interrupt, but we can see by the functions themselves that
698this is not the case.
699
700Notice that the __do_softirq when called doesn't have a preempt_count.
701It may seem that we missed a preempt enabled. What really happened
702is that the preempt count is held on the threads stack and we
703switched to the softirq stack (4K stacks in effect). The code
a41eebab 704does not copy the preempt count, but because interrupts are disabled,
eb6d42ea
SR
705we don't need to worry about it. Having a tracer like this is good
706to let people know what really happens inside the kernel.
707
708
709preemptirqsoff
710--------------
711
712Knowing the locations that have interrupts disabled or preemption
713disabled for the longest times is helpful. But sometimes we would
714like to know when either preemption and/or interrupts are disabled.
715
716The following code:
717
718 local_irq_disable();
719 call_function_with_irqs_off();
720 preempt_disable();
721 call_function_with_irqs_and_preemption_off();
722 local_irq_enable();
723 call_function_with_preemption_off();
724 preempt_enable();
725
726The irqsoff tracer will record the total length of
727call_function_with_irqs_off() and
728call_function_with_irqs_and_preemption_off().
729
730The preemptoff tracer will record the total length of
731call_function_with_irqs_and_preemption_off() and
732call_function_with_preemption_off().
733
734But neither will trace the time that interrupts and/or preemption
735is disabled. This total time is the time that we can not schedule.
736To record this time, use the preemptirqsoff tracer.
737
738Again, using this trace is much like the irqsoff and preemptoff tracers.
739
a41eebab 740 # echo preemptirqsoff > /debug/tracing/current_tracer
eb6d42ea
SR
741 # echo 0 > /debug/tracing/tracing_max_latency
742 # echo 1 > /debug/tracing/tracing_enabled
743 # ls -ltr
744 [...]
745 # echo 0 > /debug/tracing/tracing_enabled
746 # cat /debug/tracing/latency_trace
747# tracer: preemptirqsoff
748#
749preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
750--------------------------------------------------------------------
751 latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
752 -----------------
753 | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0)
754 -----------------
755 => started at: apic_timer_interrupt
756 => ended at: __do_softirq
757
758# _------=> CPU#
759# / _-----=> irqs-off
760# | / _----=> need-resched
761# || / _---=> hardirq/softirq
762# ||| / _--=> preempt-depth
763# |||| /
764# ||||| delay
765# cmd pid ||||| time | caller
766# \ / ||||| \ | /
767 ls-4860 0d... 0us!: trace_hardirqs_off_thunk (apic_timer_interrupt)
768 ls-4860 0d.s. 294us : _local_bh_enable (__do_softirq)
769 ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq)
770
771
772vim:ft=help
773
774
775The trace_hardirqs_off_thunk is called from assembly on x86 when
776interrupts are disabled in the assembly code. Without the function
777tracing, we don't know if interrupts were enabled within the preemption
778points. We do see that it started with preemption enabled.
779
780Here is a trace with ftrace_enabled set:
781
782
783# tracer: preemptirqsoff
784#
785preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
786--------------------------------------------------------------------
787 latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
788 -----------------
789 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
790 -----------------
791 => started at: write_chan
792 => ended at: __do_softirq
793
794# _------=> CPU#
795# / _-----=> irqs-off
796# | / _----=> need-resched
797# || / _---=> hardirq/softirq
798# ||| / _--=> preempt-depth
799# |||| /
800# ||||| delay
801# cmd pid ||||| time | caller
802# \ / ||||| \ | /
803 ls-4473 0.N.. 0us : preempt_schedule (write_chan)
804 ls-4473 0dN.1 1us : _spin_lock (schedule)
805 ls-4473 0dN.1 2us : add_preempt_count (_spin_lock)
806 ls-4473 0d..2 2us : put_prev_task_fair (schedule)
807[...]
808 ls-4473 0d..2 13us : set_normalized_timespec (ktime_get_ts)
809 ls-4473 0d..2 13us : __switch_to (schedule)
810 sshd-4261 0d..2 14us : finish_task_switch (schedule)
811 sshd-4261 0d..2 14us : _spin_unlock_irq (finish_task_switch)
812 sshd-4261 0d..1 15us : add_preempt_count (_spin_lock_irqsave)
813 sshd-4261 0d..2 16us : _spin_unlock_irqrestore (hrtick_set)
814 sshd-4261 0d..2 16us : do_IRQ (common_interrupt)
815 sshd-4261 0d..2 17us : irq_enter (do_IRQ)
816 sshd-4261 0d..2 17us : idle_cpu (irq_enter)
817 sshd-4261 0d..2 18us : add_preempt_count (irq_enter)
818 sshd-4261 0d.h2 18us : idle_cpu (irq_enter)
819 sshd-4261 0d.h. 18us : handle_fasteoi_irq (do_IRQ)
820 sshd-4261 0d.h. 19us : _spin_lock (handle_fasteoi_irq)
821 sshd-4261 0d.h. 19us : add_preempt_count (_spin_lock)
822 sshd-4261 0d.h1 20us : _spin_unlock (handle_fasteoi_irq)
823 sshd-4261 0d.h1 20us : sub_preempt_count (_spin_unlock)
824[...]
825 sshd-4261 0d.h1 28us : _spin_unlock (handle_fasteoi_irq)
826 sshd-4261 0d.h1 29us : sub_preempt_count (_spin_unlock)
827 sshd-4261 0d.h2 29us : irq_exit (do_IRQ)
828 sshd-4261 0d.h2 29us : sub_preempt_count (irq_exit)
829 sshd-4261 0d..3 30us : do_softirq (irq_exit)
830 sshd-4261 0d... 30us : __do_softirq (do_softirq)
831 sshd-4261 0d... 31us : __local_bh_disable (__do_softirq)
832 sshd-4261 0d... 31us+: add_preempt_count (__local_bh_disable)
833 sshd-4261 0d.s4 34us : add_preempt_count (__local_bh_disable)
834[...]
835 sshd-4261 0d.s3 43us : sub_preempt_count (local_bh_enable_ip)
836 sshd-4261 0d.s4 44us : sub_preempt_count (local_bh_enable_ip)
837 sshd-4261 0d.s3 44us : smp_apic_timer_interrupt (apic_timer_interrupt)
838 sshd-4261 0d.s3 45us : irq_enter (smp_apic_timer_interrupt)
839 sshd-4261 0d.s3 45us : idle_cpu (irq_enter)
840 sshd-4261 0d.s3 46us : add_preempt_count (irq_enter)
841 sshd-4261 0d.H3 46us : idle_cpu (irq_enter)
842 sshd-4261 0d.H3 47us : hrtimer_interrupt (smp_apic_timer_interrupt)
843 sshd-4261 0d.H3 47us : ktime_get (hrtimer_interrupt)
844[...]
845 sshd-4261 0d.H3 81us : tick_program_event (hrtimer_interrupt)
846 sshd-4261 0d.H3 82us : ktime_get (tick_program_event)
847 sshd-4261 0d.H3 82us : ktime_get_ts (ktime_get)
848 sshd-4261 0d.H3 83us : getnstimeofday (ktime_get_ts)
849 sshd-4261 0d.H3 83us : set_normalized_timespec (ktime_get_ts)
850 sshd-4261 0d.H3 84us : clockevents_program_event (tick_program_event)
851 sshd-4261 0d.H3 84us : lapic_next_event (clockevents_program_event)
852 sshd-4261 0d.H3 85us : irq_exit (smp_apic_timer_interrupt)
853 sshd-4261 0d.H3 85us : sub_preempt_count (irq_exit)
854 sshd-4261 0d.s4 86us : sub_preempt_count (irq_exit)
855 sshd-4261 0d.s3 86us : add_preempt_count (__local_bh_disable)
856[...]
857 sshd-4261 0d.s1 98us : sub_preempt_count (net_rx_action)
858 sshd-4261 0d.s. 99us : add_preempt_count (_spin_lock_irq)
859 sshd-4261 0d.s1 99us+: _spin_unlock_irq (run_timer_softirq)
860 sshd-4261 0d.s. 104us : _local_bh_enable (__do_softirq)
861 sshd-4261 0d.s. 104us : sub_preempt_count (_local_bh_enable)
862 sshd-4261 0d.s. 105us : _local_bh_enable (__do_softirq)
863 sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq)
864
865
866This is a very interesting trace. It started with the preemption of
867the ls task. We see that the task had the "need_resched" bit set
868with the 'N' in the trace. Interrupts are disabled in the spin_lock
869and the trace started. We see that a schedule took place to run
a41eebab
SR
870sshd. When the interrupts were enabled, we took an interrupt.
871On return from the interrupt handler, the softirq ran. We took another
872interrupt while running the softirq as we see with the capital 'H'.
eb6d42ea
SR
873
874
875wakeup
876------
877
878In Real-Time environment it is very important to know the wakeup
879time it takes for the highest priority task that wakes up to the
880time it executes. This is also known as "schedule latency".
881I stress the point that this is about RT tasks. It is also important
882to know the scheduling latency of non-RT tasks, but the average
883schedule latency is better for non-RT tasks. Tools like
a41eebab 884LatencyTop are more appropriate for such measurements.
eb6d42ea 885
a41eebab 886Real-Time environments are interested in the worst case latency.
eb6d42ea
SR
887That is the longest latency it takes for something to happen, and
888not the average. We can have a very fast scheduler that may only
889have a large latency once in a while, but that would not work well
890with Real-Time tasks. The wakeup tracer was designed to record
891the worst case wakeups of RT tasks. Non-RT tasks are not recorded
892because the tracer only records one worst case and tracing non-RT
893tasks that are unpredictable will overwrite the worst case latency
894of RT tasks.
895
896Since this tracer only deals with RT tasks, we will run this slightly
a41eebab
SR
897differently than we did with the previous tracers. Instead of performing
898an 'ls', we will run 'sleep 1' under 'chrt' which changes the
eb6d42ea
SR
899priority of the task.
900
901 # echo wakeup > /debug/tracing/current_tracer
902 # echo 0 > /debug/tracing/tracing_max_latency
903 # echo 1 > /debug/tracing/tracing_enabled
904 # chrt -f 5 sleep 1
905 # echo 0 > /debug/tracing/tracing_enabled
906 # cat /debug/tracing/latency_trace
907# tracer: wakeup
908#
909wakeup latency trace v1.1.5 on 2.6.26-rc8
910--------------------------------------------------------------------
911 latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
912 -----------------
913 | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5)
914 -----------------
915
916# _------=> CPU#
917# / _-----=> irqs-off
918# | / _----=> need-resched
919# || / _---=> hardirq/softirq
920# ||| / _--=> preempt-depth
921# |||| /
922# ||||| delay
923# cmd pid ||||| time | caller
924# \ / ||||| \ | /
925 <idle>-0 1d.h4 0us+: try_to_wake_up (wake_up_process)
926 <idle>-0 1d..4 4us : schedule (cpu_idle)
927
928
929vim:ft=help
930
931
a41eebab 932Running this on an idle system, we see that it only took 4 microseconds
eb6d42ea 933to perform the task switch. Note, since the trace marker in the
a41eebab 934schedule is before the actual "switch", we stop the tracing when
eb6d42ea
SR
935the recorded task is about to schedule in. This may change if
936we add a new marker at the end of the scheduler.
937
938Notice that the recorded task is 'sleep' with the PID of 4901 and it
939has an rt_prio of 5. This priority is user-space priority and not
940the internal kernel priority. The policy is 1 for SCHED_FIFO and 2
941for SCHED_RR.
942
943Doing the same with chrt -r 5 and ftrace_enabled set.
944
945# tracer: wakeup
946#
947wakeup latency trace v1.1.5 on 2.6.26-rc8
948--------------------------------------------------------------------
949 latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
950 -----------------
951 | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5)
952 -----------------
953
954# _------=> CPU#
955# / _-----=> irqs-off
956# | / _----=> need-resched
957# || / _---=> hardirq/softirq
958# ||| / _--=> preempt-depth
959# |||| /
960# ||||| delay
961# cmd pid ||||| time | caller
962# \ / ||||| \ | /
963ksoftirq-7 1d.H3 0us : try_to_wake_up (wake_up_process)
964ksoftirq-7 1d.H4 1us : sub_preempt_count (marker_probe_cb)
965ksoftirq-7 1d.H3 2us : check_preempt_wakeup (try_to_wake_up)
966ksoftirq-7 1d.H3 3us : update_curr (check_preempt_wakeup)
967ksoftirq-7 1d.H3 4us : calc_delta_mine (update_curr)
968ksoftirq-7 1d.H3 5us : __resched_task (check_preempt_wakeup)
969ksoftirq-7 1d.H3 6us : task_wake_up_rt (try_to_wake_up)
970ksoftirq-7 1d.H3 7us : _spin_unlock_irqrestore (try_to_wake_up)
971[...]
972ksoftirq-7 1d.H2 17us : irq_exit (smp_apic_timer_interrupt)
973ksoftirq-7 1d.H2 18us : sub_preempt_count (irq_exit)
974ksoftirq-7 1d.s3 19us : sub_preempt_count (irq_exit)
975ksoftirq-7 1..s2 20us : rcu_process_callbacks (__do_softirq)
976[...]
977ksoftirq-7 1..s2 26us : __rcu_process_callbacks (rcu_process_callbacks)
978ksoftirq-7 1d.s2 27us : _local_bh_enable (__do_softirq)
979ksoftirq-7 1d.s2 28us : sub_preempt_count (_local_bh_enable)
980ksoftirq-7 1.N.3 29us : sub_preempt_count (ksoftirqd)
981ksoftirq-7 1.N.2 30us : _cond_resched (ksoftirqd)
982ksoftirq-7 1.N.2 31us : __cond_resched (_cond_resched)
983ksoftirq-7 1.N.2 32us : add_preempt_count (__cond_resched)
984ksoftirq-7 1.N.2 33us : schedule (__cond_resched)
985ksoftirq-7 1.N.2 33us : add_preempt_count (schedule)
986ksoftirq-7 1.N.3 34us : hrtick_clear (schedule)
987ksoftirq-7 1dN.3 35us : _spin_lock (schedule)
988ksoftirq-7 1dN.3 36us : add_preempt_count (_spin_lock)
989ksoftirq-7 1d..4 37us : put_prev_task_fair (schedule)
990ksoftirq-7 1d..4 38us : update_curr (put_prev_task_fair)
991[...]
992ksoftirq-7 1d..5 47us : _spin_trylock (tracing_record_cmdline)
993ksoftirq-7 1d..5 48us : add_preempt_count (_spin_trylock)
994ksoftirq-7 1d..6 49us : _spin_unlock (tracing_record_cmdline)
995ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock)
996ksoftirq-7 1d..4 50us : schedule (__cond_resched)
997
998The interrupt went off while running ksoftirqd. This task runs at
999SCHED_OTHER. Why didn't we see the 'N' set early? This may be
a41eebab
SR
1000a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K stacks
1001configured, the interrupt and softirq runs with their own stack.
1002Some information is held on the top of the task's stack (need_resched
1003and preempt_count are both stored there). The setting of the NEED_RESCHED
1004bit is done directly to the task's stack, but the reading of the
1005NEED_RESCHED is done by looking at the current stack, which in this case
1006is the stack for the hard interrupt. This hides the fact that NEED_RESCHED
1007has been set. We don't see the 'N' until we switch back to the task's
1008assigned stack.
eb6d42ea
SR
1009
1010ftrace
1011------
1012
1013ftrace is not only the name of the tracing infrastructure, but it
1014is also a name of one of the tracers. The tracer is the function
1015tracer. Enabling the function tracer can be done from the
1016debug file system. Make sure the ftrace_enabled is set otherwise
1017this tracer is a nop.
1018
1019 # sysctl kernel.ftrace_enabled=1
1020 # echo ftrace > /debug/tracing/current_tracer
1021 # echo 1 > /debug/tracing/tracing_enabled
1022 # usleep 1
1023 # echo 0 > /debug/tracing/tracing_enabled
1024 # cat /debug/tracing/trace
1025# tracer: ftrace
1026#
1027# TASK-PID CPU# TIMESTAMP FUNCTION
1028# | | | | |
1029 bash-4003 [00] 123.638713: finish_task_switch <-schedule
1030 bash-4003 [00] 123.638714: _spin_unlock_irq <-finish_task_switch
1031 bash-4003 [00] 123.638714: sub_preempt_count <-_spin_unlock_irq
1032 bash-4003 [00] 123.638715: hrtick_set <-schedule
1033 bash-4003 [00] 123.638715: _spin_lock_irqsave <-hrtick_set
1034 bash-4003 [00] 123.638716: add_preempt_count <-_spin_lock_irqsave
1035 bash-4003 [00] 123.638716: _spin_unlock_irqrestore <-hrtick_set
1036 bash-4003 [00] 123.638717: sub_preempt_count <-_spin_unlock_irqrestore
1037 bash-4003 [00] 123.638717: hrtick_clear <-hrtick_set
1038 bash-4003 [00] 123.638718: sub_preempt_count <-schedule
1039 bash-4003 [00] 123.638718: sub_preempt_count <-preempt_schedule
1040 bash-4003 [00] 123.638719: wait_for_completion <-__stop_machine_run
1041 bash-4003 [00] 123.638719: wait_for_common <-wait_for_completion
1042 bash-4003 [00] 123.638720: _spin_lock_irq <-wait_for_common
1043 bash-4003 [00] 123.638720: add_preempt_count <-_spin_lock_irq
1044[...]
1045
1046
1047Note: It is sometimes better to enable or disable tracing directly from
1048a program, because the buffer may be overflowed by the echo commands
1049before you get to the point you want to trace. It is also easier to
1050stop the tracing at the point that you hit the part that you are
1051interested in. Since the ftrace buffer is a ring buffer with the
1052oldest data being overwritten, usually it is sufficient to start the
1053tracer with an echo command but have you code stop it. Something
1054like the following is usually appropriate for this.
1055
1056int trace_fd;
1057[...]
1058int main(int argc, char *argv[]) {
1059 [...]
1060 trace_fd = open("/debug/tracing/tracing_enabled", O_WRONLY);
1061 [...]
1062 if (condition_hit()) {
1063 write(trace_fd, "0", 1);
1064 }
1065 [...]
1066}
1067
1068
1069dynamic ftrace
1070--------------
1071
1072If CONFIG_DYNAMIC_FTRACE is set, then the system will run with
1073virtually no overhead when function tracing is disabled. The way
1074this works is the mcount function call (placed at the start of
1075every kernel function, produced by the -pg switch in gcc), starts
1076of pointing to a simple return.
1077
a41eebab
SR
1078When dynamic ftrace is initialized, it calls kstop_machine to make
1079the machine act like a uniprocessor so that it can freely modify code
1080without worrying about other processors executing that same code. At
1081initialization, the mcount calls are changed to call a "record_ip"
eb6d42ea
SR
1082function. After this, the first time a kernel function is called,
1083it has the calling address saved in a hash table.
1084
1085Later on the ftraced kernel thread is awoken and will again call
1086kstop_machine if new functions have been recorded. The ftraced thread
1087will change all calls to mcount to "nop". Just calling mcount
1088and having mcount return has shown a 10% overhead. By converting
1089it to a nop, there is no recordable overhead to the system.
1090
1091One special side-effect to the recording of the functions being
1092traced, is that we can now selectively choose which functions we
1093want to trace and which ones we want the mcount calls to remain as
1094nops.
1095
a41eebab
SR
1096Two files are used, one for enabling and one for disabling the tracing
1097of recorded functions. They are:
eb6d42ea
SR
1098
1099 set_ftrace_filter
1100
1101and
1102
1103 set_ftrace_notrace
1104
a41eebab 1105A list of available functions that you can add to these files is listed
eb6d42ea
SR
1106in:
1107
1108 available_filter_functions
1109
1110 # cat /debug/tracing/available_filter_functions
1111put_prev_task_idle
1112kmem_cache_create
1113pick_next_task_rt
1114get_online_cpus
1115pick_next_task_fair
1116mutex_lock
1117[...]
1118
1119If I'm only interested in sys_nanosleep and hrtimer_interrupt:
1120
1121 # echo sys_nanosleep hrtimer_interrupt \
1122 > /debug/tracing/set_ftrace_filter
1123 # echo ftrace > /debug/tracing/current_tracer
1124 # echo 1 > /debug/tracing/tracing_enabled
1125 # usleep 1
1126 # echo 0 > /debug/tracing/tracing_enabled
1127 # cat /debug/tracing/trace
1128# tracer: ftrace
1129#
1130# TASK-PID CPU# TIMESTAMP FUNCTION
1131# | | | | |
1132 usleep-4134 [00] 1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt
1133 usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call
1134 <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt
1135
1136To see what functions are being traced, you can cat the file:
1137
1138 # cat /debug/tracing/set_ftrace_filter
1139hrtimer_interrupt
1140sys_nanosleep
1141
1142
1143Perhaps this isn't enough. The filters also allow simple wild cards.
a41eebab 1144Only the following are currently available
eb6d42ea 1145
a41eebab 1146 <match>* - will match functions that begin with <match>
eb6d42ea
SR
1147 *<match> - will match functions that end with <match>
1148 *<match>* - will match functions that have <match> in it
1149
1150Thats all the wild cards that are allowed.
1151
1152 <match>*<match> will not work.
1153
1154 # echo hrtimer_* > /debug/tracing/set_ftrace_filter
1155
1156Produces:
1157
1158# tracer: ftrace
1159#
1160# TASK-PID CPU# TIMESTAMP FUNCTION
1161# | | | | |
1162 bash-4003 [00] 1480.611794: hrtimer_init <-copy_process
1163 bash-4003 [00] 1480.611941: hrtimer_start <-hrtick_set
1164 bash-4003 [00] 1480.611956: hrtimer_cancel <-hrtick_clear
1165 bash-4003 [00] 1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel
1166 <idle>-0 [00] 1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt
1167 <idle>-0 [00] 1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt
1168 <idle>-0 [00] 1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt
1169 <idle>-0 [00] 1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt
1170 <idle>-0 [00] 1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt
1171
1172
1173Notice that we lost the sys_nanosleep.
1174
1175 # cat /debug/tracing/set_ftrace_filter
1176hrtimer_run_queues
1177hrtimer_run_pending
1178hrtimer_init
1179hrtimer_cancel
1180hrtimer_try_to_cancel
1181hrtimer_forward
1182hrtimer_start
1183hrtimer_reprogram
1184hrtimer_force_reprogram
1185hrtimer_get_next_event
1186hrtimer_interrupt
1187hrtimer_nanosleep
1188hrtimer_wakeup
1189hrtimer_get_remaining
1190hrtimer_get_res
1191hrtimer_init_sleeper
1192
1193
1194This is because the '>' and '>>' act just like they do in bash.
1195To rewrite the filters, use '>'
1196To append to the filters, use '>>'
1197
a41eebab 1198To clear out a filter so that all functions will be recorded again:
eb6d42ea
SR
1199
1200 # echo > /debug/tracing/set_ftrace_filter
1201 # cat /debug/tracing/set_ftrace_filter
1202 #
1203
1204Again, now we want to append.
1205
1206 # echo sys_nanosleep > /debug/tracing/set_ftrace_filter
1207 # cat /debug/tracing/set_ftrace_filter
1208sys_nanosleep
1209 # echo hrtimer_* >> /debug/tracing/set_ftrace_filter
1210 # cat /debug/tracing/set_ftrace_filter
1211hrtimer_run_queues
1212hrtimer_run_pending
1213hrtimer_init
1214hrtimer_cancel
1215hrtimer_try_to_cancel
1216hrtimer_forward
1217hrtimer_start
1218hrtimer_reprogram
1219hrtimer_force_reprogram
1220hrtimer_get_next_event
1221hrtimer_interrupt
1222sys_nanosleep
1223hrtimer_nanosleep
1224hrtimer_wakeup
1225hrtimer_get_remaining
1226hrtimer_get_res
1227hrtimer_init_sleeper
1228
1229
1230The set_ftrace_notrace prevents those functions from being traced.
1231
1232 # echo '*preempt*' '*lock*' > /debug/tracing/set_ftrace_notrace
1233
1234Produces:
1235
1236# tracer: ftrace
1237#
1238# TASK-PID CPU# TIMESTAMP FUNCTION
1239# | | | | |
1240 bash-4043 [01] 115.281644: finish_task_switch <-schedule
1241 bash-4043 [01] 115.281645: hrtick_set <-schedule
1242 bash-4043 [01] 115.281645: hrtick_clear <-hrtick_set
1243 bash-4043 [01] 115.281646: wait_for_completion <-__stop_machine_run
1244 bash-4043 [01] 115.281647: wait_for_common <-wait_for_completion
1245 bash-4043 [01] 115.281647: kthread_stop <-stop_machine_run
1246 bash-4043 [01] 115.281648: init_waitqueue_head <-kthread_stop
1247 bash-4043 [01] 115.281648: wake_up_process <-kthread_stop
1248 bash-4043 [01] 115.281649: try_to_wake_up <-wake_up_process
1249
1250We can see that there's no more lock or preempt tracing.
1251
1252ftraced
1253-------
1254
1255As mentioned above, when dynamic ftrace is configured in, a kernel
1256thread wakes up once a second and checks to see if there are mcount
a41eebab
SR
1257calls that need to be converted into nops. If there are not any, then
1258it simply goes back to sleep. But if there are some, it will call
eb6d42ea
SR
1259kstop_machine to convert the calls to nops.
1260
1261There may be a case that you do not want this added latency.
1262Perhaps you are doing some audio recording and this activity might
1263cause skips in the playback. There is an interface to disable
1264and enable the ftraced kernel thread.
1265
1266 # echo 0 > /debug/tracing/ftraced_enabled
1267
1268This will disable the calling of the kstop_machine to update the
1269mcount calls to nops. Remember that there's a large overhead
1270to calling mcount. Without this kernel thread, that overhead will
1271exist.
1272
a41eebab
SR
1273If there are recorded calls to mcount, any write to the ftraced_enabled
1274file will cause the kstop_machine to run. This means that a
eb6d42ea
SR
1275user can manually perform the updates when they want to by simply
1276echoing a '0' into the ftraced_enabled file.
1277
1278The updates are also done at the beginning of enabling a tracer
1279that uses ftrace function recording.
1280
1281
1282trace_pipe
1283----------
1284
1285The trace_pipe outputs the same as trace, but the effect on the
1286tracing is different. Every read from trace_pipe is consumed.
1287This means that subsequent reads will be different. The trace
1288is live.
1289
1290 # echo ftrace > /debug/tracing/current_tracer
1291 # cat /debug/tracing/trace_pipe > /tmp/trace.out &
1292[1] 4153
1293 # echo 1 > /debug/tracing/tracing_enabled
1294 # usleep 1
1295 # echo 0 > /debug/tracing/tracing_enabled
1296 # cat /debug/tracing/trace
1297# tracer: ftrace
1298#
1299# TASK-PID CPU# TIMESTAMP FUNCTION
1300# | | | | |
1301
1302 #
1303 # cat /tmp/trace.out
1304 bash-4043 [00] 41.267106: finish_task_switch <-schedule
1305 bash-4043 [00] 41.267106: hrtick_set <-schedule
1306 bash-4043 [00] 41.267107: hrtick_clear <-hrtick_set
1307 bash-4043 [00] 41.267108: wait_for_completion <-__stop_machine_run
1308 bash-4043 [00] 41.267108: wait_for_common <-wait_for_completion
1309 bash-4043 [00] 41.267109: kthread_stop <-stop_machine_run
1310 bash-4043 [00] 41.267109: init_waitqueue_head <-kthread_stop
1311 bash-4043 [00] 41.267110: wake_up_process <-kthread_stop
1312 bash-4043 [00] 41.267110: try_to_wake_up <-wake_up_process
1313 bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up
1314
1315
1316Note, reading the trace_pipe will block until more input is added.
1317By changing the tracer, trace_pipe will issue an EOF. We needed
1318to set the ftrace tracer _before_ cating the trace_pipe file.
1319
1320
1321trace entries
1322-------------
1323
1324Having too much or not enough data can be troublesome in diagnosing
1325some issue in the kernel. The file trace_entries is used to modify
a41eebab 1326the size of the internal trace buffers. The number listed
eb6d42ea
SR
1327is the number of entries that can be recorded per CPU. To know
1328the full size, multiply the number of possible CPUS with the
1329number of entries.
1330
1331 # cat /debug/tracing/trace_entries
133265620
1333
a41eebab 1334Note, to modify this, you must have tracing completely disabled. To do that,
eb6d42ea
SR
1335echo "none" into the current_tracer.
1336
1337 # echo none > /debug/tracing/current_tracer
1338 # echo 100000 > /debug/tracing/trace_entries
1339 # cat /debug/tracing/trace_entries
1340100045
1341
1342
1343Notice that we echoed in 100,000 but the size is 100,045. The entries
1344are held by individual pages. It allocates the number of pages it takes
1345to fulfill the request. If more entries may fit on the last page
1346it will add them.
1347
1348 # echo 1 > /debug/tracing/trace_entries
1349 # cat /debug/tracing/trace_entries
135085
1351
1352This shows us that 85 entries can fit on a single page.
1353
1354The number of pages that will be allocated is a percentage of available
a41eebab 1355memory. Allocating too much will produce an error.
eb6d42ea
SR
1356
1357 # echo 1000000000000 > /debug/tracing/trace_entries
1358-bash: echo: write error: Cannot allocate memory
1359 # cat /debug/tracing/trace_entries
136085
1361