doc: Update stallwarn.txt to make causes more prominent

author Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Wed, 8 Feb 2017 22:30:15 +0000 (14:30 -0800)

committer Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Wed, 12 Apr 2017 15:23:40 +0000 (08:23 -0700)
author Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Wed, 8 Feb 2017 22:30:15 +0000 (14:30 -0800)
committer Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Wed, 12 Apr 2017 15:23:40 +0000 (08:23 -0700)
diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt

index e93d04133fe7ae58cfb6c3da3e0717b0640f3b9a..96a3d81837e1b120098eccfd1b95b0bfc4947be9 100644 (file)
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -1,9 +1,102 @@
  Using RCU's CPU Stall Detector
  
-The rcu_cpu_stall_suppress module parameter enables RCU's CPU stall
-detector, which detects conditions that unduly delay RCU grace periods.
-This module parameter enables CPU stall detection by default, but
-may be overridden via boot-time parameter or at runtime via sysfs.
+This document first discusses what sorts of issues RCU's CPU stall
+detector can locate, and then discusses kernel parameters and Kconfig
+options that can be used to fine-tune the detector's operation.  Finally,
+this document explains the stall detector's "splat" format.
+
+
+What Causes RCU CPU Stall Warnings?
+
+So your kernel printed an RCU CPU stall warning.  The next question is
+"What caused it?"  The following problems can result in RCU CPU stall
+warnings:
+
+o      A CPU looping in an RCU read-side critical section.
+
+o      A CPU looping with interrupts disabled.
+
+o      A CPU looping with preemption disabled.  This condition can
+       result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
+       stalls.
+
+o      A CPU looping with bottom halves disabled.  This condition can
+       result in RCU-sched and RCU-bh stalls.
+
+o      For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the
+       kernel without invoking schedule().  Note that cond_resched()
+       does not necessarily prevent RCU CPU stall warnings.  Therefore,
+       if the looping in the kernel is really expected and desirable
+       behavior, you might need to replace some of the cond_resched()
+       calls with calls to cond_resched_rcu_qs().
+
+o      Booting Linux using a console connection that is too slow to
+       keep up with the boot-time console-message rate.  For example,
+       a 115Kbaud serial console can be -way- too slow to keep up
+       with boot-time message rates, and will frequently result in
+       RCU CPU stall warning messages.  Especially if you have added
+       debug printk()s.
+
+o      Anything that prevents RCU's grace-period kthreads from running.
+       This can result in the "All QSes seen" console-log message.
+       This message will include information on when the kthread last
+       ran and how often it should be expected to run.
+
+o      A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
+       happen to preempt a low-priority task in the middle of an RCU
+       read-side critical section.   This is especially damaging if
+       that low-priority task is not permitted to run on any other CPU,
+       in which case the next RCU grace period can never complete, which
+       will eventually cause the system to run out of memory and hang.
+       While the system is in the process of running itself out of
+       memory, you might see stall-warning messages.
+
+o      A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
+       is running at a higher priority than the RCU softirq threads.
+       This will prevent RCU callbacks from ever being invoked,
+       and in a CONFIG_PREEMPT_RCU kernel will further prevent
+       RCU grace periods from ever completing.  Either way, the
+       system will eventually run out of memory and hang.  In the
+       CONFIG_PREEMPT_RCU case, you might see stall-warning
+       messages.
+
+o      A hardware or software issue shuts off the scheduler-clock
+       interrupt on a CPU that is not in dyntick-idle mode.  This
+       problem really has happened, and seems to be most likely to
+       result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.
+
+o      A bug in the RCU implementation.
+
+o      A hardware failure.  This is quite unlikely, but has occurred
+       at least once in real life.  A CPU failed in a running system,
+       becoming unresponsive, but not causing an immediate crash.
+       This resulted in a series of RCU CPU stall warnings, eventually
+       leading the realization that the CPU had failed.
+
+The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall
+warning.  Note that SRCU does -not- have CPU stall warnings.  Please note
+that RCU only detects CPU stalls when there is a grace period in progress.
+No grace period, no CPU stall warnings.
+
+To diagnose the cause of the stall, inspect the stack traces.
+The offending function will usually be near the top of the stack.
+If you have a series of stall warnings from a single extended stall,
+comparing the stack traces can often help determine where the stall
+is occurring, which will usually be in the function nearest the top of
+that portion of the stack which remains the same from trace to trace.
+If you can reliably trigger the stall, ftrace can be quite helpful.
+
+RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE
+and with RCU's event tracing.  For information on RCU's event tracing,
+see include/trace/events/rcu.h.
+
+
+Fine-Tuning the RCU CPU Stall Detector
+
+The rcuupdate.rcu_cpu_stall_suppress module parameter disables RCU's
+CPU stall detector, which detects conditions that unduly delay RCU grace
+periods.  This module parameter enables CPU stall detection by default,
+but may be overridden via boot-time parameter or at runtime via sysfs.
  The stall detector's idea of what constitutes "unduly delayed" is
  controlled by a set of kernel configuration variables and cpp macros:
  
@@ -56,6 +149,9 @@ rcupdate.rcu_task_stall_timeout
         And continues with the output of sched_show_task() for each
         task stalling the current RCU-tasks grace period.
  
+
+Interpreting RCU's CPU Stall-Detector "Splats"
+
  For non-RCU-tasks flavors of RCU, when a CPU detects that it is stalling,
  it will print a message similar to the following:
  
@@ -178,89 +274,3 @@ grace period is in flight.
  
  It is entirely possible to see stall warnings from normal and from
  expedited grace periods at about the same time from the same run.
-
-
-What Causes RCU CPU Stall Warnings?
-
-So your kernel printed an RCU CPU stall warning.  The next question is
-"What caused it?"  The following problems can result in RCU CPU stall
-warnings:
-
-o      A CPU looping in an RCU read-side critical section.
-       
-o      A CPU looping with interrupts disabled.  This condition can
-       result in RCU-sched and RCU-bh stalls.
-
-o      A CPU looping with preemption disabled.  This condition can
-       result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
-       stalls.
-
-o      A CPU looping with bottom halves disabled.  This condition can
-       result in RCU-sched and RCU-bh stalls.
-
-o      For !CONFIG_PREEMPT kernels, a CPU looping anywhere in the
-       kernel without invoking schedule().  Note that cond_resched()
-       does not necessarily prevent RCU CPU stall warnings.  Therefore,
-       if the looping in the kernel is really expected and desirable
-       behavior, you might need to replace some of the cond_resched()
-       calls with calls to cond_resched_rcu_qs().
-
-o      Booting Linux using a console connection that is too slow to
-       keep up with the boot-time console-message rate.  For example,
-       a 115Kbaud serial console can be -way- too slow to keep up
-       with boot-time message rates, and will frequently result in
-       RCU CPU stall warning messages.  Especially if you have added
-       debug printk()s.
-
-o      Anything that prevents RCU's grace-period kthreads from running.
-       This can result in the "All QSes seen" console-log message.
-       This message will include information on when the kthread last
-       ran and how often it should be expected to run.
-
-o      A CPU-bound real-time task in a CONFIG_PREEMPT kernel, which might
-       happen to preempt a low-priority task in the middle of an RCU
-       read-side critical section.   This is especially damaging if
-       that low-priority task is not permitted to run on any other CPU,
-       in which case the next RCU grace period can never complete, which
-       will eventually cause the system to run out of memory and hang.
-       While the system is in the process of running itself out of
-       memory, you might see stall-warning messages.
-
-o      A CPU-bound real-time task in a CONFIG_PREEMPT_RT kernel that
-       is running at a higher priority than the RCU softirq threads.
-       This will prevent RCU callbacks from ever being invoked,
-       and in a CONFIG_PREEMPT_RCU kernel will further prevent
-       RCU grace periods from ever completing.  Either way, the
-       system will eventually run out of memory and hang.  In the
-       CONFIG_PREEMPT_RCU case, you might see stall-warning
-       messages.
-
-o      A hardware or software issue shuts off the scheduler-clock
-       interrupt on a CPU that is not in dyntick-idle mode.  This
-       problem really has happened, and seems to be most likely to
-       result in RCU CPU stall warnings for CONFIG_NO_HZ_COMMON=n kernels.
-
-o      A bug in the RCU implementation.
-
-o      A hardware failure.  This is quite unlikely, but has occurred
-       at least once in real life.  A CPU failed in a running system,
-       becoming unresponsive, but not causing an immediate crash.
-       This resulted in a series of RCU CPU stall warnings, eventually
-       leading the realization that the CPU had failed.
-
-The RCU, RCU-sched, RCU-bh, and RCU-tasks implementations have CPU stall
-warning.  Note that SRCU does -not- have CPU stall warnings.  Please note
-that RCU only detects CPU stalls when there is a grace period in progress.
-No grace period, no CPU stall warnings.
-
-To diagnose the cause of the stall, inspect the stack traces.
-The offending function will usually be near the top of the stack.
-If you have a series of stall warnings from a single extended stall,
-comparing the stack traces can often help determine where the stall
-is occurring, which will usually be in the function nearest the top of
-that portion of the stack which remains the same from trace to trace.
-If you can reliably trigger the stall, ftrace can be quite helpful.
-
-RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE
-and with RCU's event tracing.  For information on RCU's event tracing,
-see include/trace/events/rcu.h.
author	Paul E. McKenney <paulmck@linux.vnet.ibm.com>
	Wed, 8 Feb 2017 22:30:15 +0000 (14:30 -0800)
committer	Paul E. McKenney <paulmck@linux.vnet.ibm.com>
	Wed, 12 Apr 2017 15:23:40 +0000 (08:23 -0700)