From b6c17ea4eff360359d1741272028610035bb2da9 Mon Sep 17 00:00:00 2001
From: Rusty Russell
Date: Fri, 9 Sep 2005 13:10:11 -0700
Subject: [PATCH] [PATCH] Update Documentation/DocBook/kernel-hacking.tmpl

Update the hacking guide, before CONFIG_PREEMPT_RT goes in and it needs
rewriting again.

Changes include modernization of quotes, removal of most references to
bottom halves (some mention required because we still use bh in places to
mean softirq).

It would be nice to have a discussion of sparse and various annotations.

Please send patches straight to akpm.

Signed-off-by: Rusty Russell (authored)
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 Documentation/DocBook/kernel-hacking.tmpl | 310 ++++++++++------------
 1 file changed, 144 insertions(+), 166 deletions(-)

diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl
index 49a9ef82d575..6367bba32d22 100644
--- a/Documentation/DocBook/kernel-hacking.tmpl
+++ b/Documentation/DocBook/kernel-hacking.tmpl
@@ -8,8 +8,7 @@
- Paul
- Rusty
+ Rusty
 Russell
@@ -20,7 +19,7 @@
- 2001
+ 2005
 Rusty Russell
@@ -64,7 +63,7 @@
 Introduction
- Welcome, gentle reader, to Rusty's Unreliable Guide to Linux
+ Welcome, gentle reader, to Rusty's Remarkably Unreliable Guide to Linux
 Kernel Hacking. This document describes the common routines and general requirements for kernel code: its goal is to serve as a primer for Linux kernel development for experienced C
@@ -96,13 +95,13 @@
- not associated with any process, serving a softirq, tasklet or bh;
+ not associated with any process, serving a softirq or tasklet;
- running in kernel space, associated with a process;
+ running in kernel space, associated with a process (user context);
@@ -114,11 +113,12 @@
- There is a strict ordering between these: other than the last
- category (userspace) each can only be pre-empted by those above.
- For example, while a softirq is running on a CPU, no other
- softirq will pre-empt it, but a hardware interrupt can. However,
- any other CPUs in the system execute independently.
+ There is an ordering between these. The bottom two can preempt
+ each other, but above that is a strict hierarchy: each can only be
+ preempted by the ones above it. For example, while a softirq is
+ running on a CPU, no other softirq will preempt it, but a hardware
+ interrupt can. However, any other CPUs in the system execute
+ independently.
@@ -130,10 +130,10 @@
 User Context
- User context is when you are coming in from a system call or
- other trap: you can sleep, and you own the CPU (except for
- interrupts) until you call schedule().
- In other words, user context (unlike userspace) is not pre-emptable.
+ User context is when you are coming in from a system call or other
+ trap: like userspace, you can be preempted by more important tasks
+ and by interrupts. You can sleep, by calling
+ schedule().
@@ -153,7 +153,7 @@
- Beware that if you have interrupts or bottom halves disabled
+ Beware that if you have preemption or softirqs disabled
 (see below), in_interrupt() will return a false positive.
@@ -168,10 +168,10 @@
 keyboard are examples of real hardware which produce interrupts at any time. The kernel runs interrupt handlers, which services the hardware. The kernel
- guarantees that this handler is never re-entered: if another
+ guarantees that this handler is never re-entered: if the same
 interrupt arrives, it is queued (or dropped). Because it disables interrupts, this handler has to be fast: frequently it
- simply acknowledges the interrupt, marks a `software interrupt'
+ simply acknowledges the interrupt, marks a 'software interrupt'
 for execution and exits.
@@ -188,60 +188,52 @@
- Software Interrupt Context: Bottom Halves, Tasklets, softirqs
+ Software Interrupt Context: Softirqs and Tasklets
 Whenever a system call is about to return to userspace, or a
- hardware interrupt handler exits, any `software interrupts'
+ hardware interrupt handler exits, any 'software interrupts'
 which are marked pending (usually by hardware interrupts) are run (kernel/softirq.c). Much of the real interrupt handling work is done here. Early in
- the transition to SMP, there were only `bottom
+ the transition to SMP, there were only 'bottom
 halves' (BHs), which didn't take advantage of multiple CPUs. Shortly after we switched from wind-up computers made of match-sticks and snot,
- we abandoned this limitation.
+ we abandoned this limitation and switched to 'softirqs'.
 include/linux/interrupt.h lists the
- different BH's. No matter how many CPUs you have, no two BHs will run at
- the same time. This made the transition to SMP simpler, but sucks hard for
- scalable performance. A very important bottom half is the timer
- BH (include/linux/timer.h): you
- can register to have it call functions for you in a given length of time.
+ different softirqs.
A very important softirq is the
+ timer softirq (include/linux/timer.h): you can
+ register to have it call functions for you in a given length of
+ time.
- 2.3.43 introduced softirqs, and re-implemented the (now
- deprecated) BHs underneath them. Softirqs are fully-SMP
- versions of BHs: they can run on as many CPUs at once as
- required. This means they need to deal with any races in shared
- data using their own locks. A bitmask is used to keep track of
- which are enabled, so the 32 available softirqs should not be
- used up lightly. (Yes, people will
- notice).
-
-
-
- tasklets (include/linux/interrupt.h)
- are like softirqs, except they are dynamically-registrable (meaning you
- can have as many as you want), and they also guarantee that any tasklet
- will only run on one CPU at any time, although different tasklets can
- run simultaneously (unlike different BHs).
+ Softirqs are often a pain to deal with, since the same softirq
+ will run simultaneously on more than one CPU. For this reason,
+ tasklets (include/linux/interrupt.h) are more
+ often used: they are dynamically-registrable (meaning you can have
+ as many as you want), and they also guarantee that any tasklet
+ will only run on one CPU at any time, although different tasklets
+ can run simultaneously.
- The name `tasklet' is misleading: they have nothing to do with `tasks',
+ The name 'tasklet' is misleading: they have nothing to do with 'tasks',
 and probably more to do with some bad vodka Alexey Kuznetsov had at the time.
- You can tell you are in a softirq (or bottom half, or tasklet)
+ You can tell you are in a softirq (or tasklet)
 using the in_softirq() macro (include/linux/interrupt.h).
@@ -288,11 +280,10 @@
 A rigid stack limit
- The kernel stack is about 6K in 2.2 (for most
- architectures: it's about 14K on the Alpha), and shared
- with interrupts so you can't use it all. Avoid deep
- recursion and huge local arrays on the stack (allocate
- them dynamically instead).
+ Depending on configuration options the kernel stack is about 3K to 6K for most 32-bit architectures: it's
+ about 14K on most 64-bit archs, and often shared with interrupts
+ so you can't use it all. Avoid deep recursion and huge local
+ arrays on the stack (allocate them dynamically instead).
@@ -339,7 +330,7 @@ asmlinkage long sys_mycall(int arg)
 If all your routine does is read or write some parameter, consider
- implementing a sysctl interface instead.
+ implementing a sysfs interface instead.
@@ -417,7 +408,10 @@ cond_resched(); /* Will sleep */
- You will eventually lock up your box if you break these rules.
+ You should always compile your kernel
+ CONFIG_DEBUG_SPINLOCK_SLEEP on, and it will warn
+ you if you break these rules. If you do break
+ the rules, you will eventually lock up your box.
@@ -515,8 +509,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 success).
- [Yes, this moronic interface makes me cringe. Please submit a
- patch and become my hero --RR.]
+ [Yes, this moronic interface makes me cringe. The flamewar comes up every year or so. --RR.]
 The functions may sleep implicitly. This should never be called
@@ -587,10 +580,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
- If you see a kmem_grow: Called nonatomically from int
- warning message you called a memory allocation function
- from interrupt context without GFP_ATOMIC.
- You should really fix that. Run, don't walk.
+ If you see a sleeping function called from invalid
+ context warning message, then maybe you called a
+ sleeping allocation function from interrupt context without
+ GFP_ATOMIC. You should really fix that.
+ Run, don't walk.
@@ -639,16 +633,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
- <function>udelay()</function>/<function>mdelay()</function>
+ <title><function>mdelay()</function>/<function>udelay()</function>
 <filename class="headerfile">include/asm/delay.h</filename>
 <filename class="headerfile">include/linux/delay.h</filename>
- The udelay() function can be used for small pauses.
- Do not use large values with udelay() as you risk
+ The udelay() and ndelay() functions can be used for small pauses.
+ Do not use large values with them as you risk
 overflow - the helper function mdelay() is useful
- here, or even consider schedule_timeout().
+ here, or consider msleep().
@@ -698,8 +692,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 These routines disable soft interrupts on the local CPU, and restore them. They are reentrant; if soft interrupts were disabled before, they will still be disabled after this pair
- of functions has been called. They prevent softirqs, tasklets
- and bottom halves from running on the current CPU.
+ of functions has been called. They prevent softirqs and tasklets
+ from running on the current CPU.
@@ -708,10 +702,16 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 include/asm/smp.h
- smp_processor_id() returns the current
- processor number, between 0 and NR_CPUS (the
- maximum number of CPUs supported by Linux, currently 32). These
- values are not necessarily continuous.
+ get_cpu() disables preemption (so you won't
+ suddenly get moved to another CPU) and returns the current
+ processor number, between 0 and NR_CPUS. Note
+ that the CPU numbers are not necessarily continuous. You return
+ it again with put_cpu() when you are done.
+
+
+ If you know you cannot be preempted by another task (ie. you are
+ in interrupt context, or have preemption disabled) you can use
+ smp_processor_id().
@@ -722,19 +722,14 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 After boot, the kernel frees up a special section; functions marked with __init and data structures marked with
- __initdata are dropped after boot is complete (within
- modules this directive is currently ignored). __exit
+ __initdata are dropped after boot is complete: similarly
+ modules discard this memory after initialization. __exit
 is used to declare a function which is only required on exit: the function will be dropped if this file is not compiled as a module. See the header file for use. Note that it makes no sense for a function marked with __init to be exported to modules with EXPORT_SYMBOL() - this will break.
-
- Static data structures marked as __initdata must be initialised
- (as opposed to ordinary static data which is zeroed BSS) and cannot be
- const.
-
@@ -762,9 +757,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 The function can return a negative error number to cause module loading to fail (unfortunately, this has no effect if
- the module is compiled into the kernel). For modules, this is
- called in user context, with interrupts enabled, and the
- kernel lock held, so it can sleep.
+ the module is compiled into the kernel). This function is
+ called in user context with interrupts enabled, so it can sleep.
@@ -779,6 +773,34 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 reached zero. This function can also sleep, but cannot fail: everything must be cleaned up by the time it returns.
+
+
+ Note that this macro is optional: if it is not present, your
+ module will not be removable (except for 'rmmod -f').
+
+
+
+
+ <function>try_module_get()</function>/<function>module_put()</function>
+ <filename class="headerfile">include/linux/module.h</filename>
+
+
+ These manipulate the module usage count, to protect against
+ removal (a module also can't be removed if another module uses one
+ of its exported symbols: see below).
Before calling into module
+ code, you should call try_module_get() on
+ that module: if it fails, then the module is being removed and you
+ should act as if it wasn't there. Otherwise, you can safely enter
+ the module, and call module_put() when you're
+ finished.
+
+
+
+ Most registerable structures have an
+ owner field, such as in the
+ file_operations structure. Set this field
+ to the macro THIS_MODULE.
+
@@ -821,7 +843,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 There is a macro to do this: wait_event_interruptible()
- include/linux/sched.h The
+ include/linux/wait.h The
 first argument is the wait queue head, and the second is an expression which is evaluated; the macro returns 0 when this expression is true, or
@@ -847,10 +869,11 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 Call wake_up()
- include/linux/sched.h;,
+ include/linux/wait.h;,
 which will wake up every process in the queue. The exception is if one has TASK_EXCLUSIVE set, in which case
- the remainder of the queue will not be woken.
+ the remainder of the queue will not be woken. There are other variants
+ of this basic function available in the same header.
@@ -863,7 +886,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 first class of operations work on atomic_t include/asm/atomic.h; this
- contains a signed integer (at least 24 bits long), and you must use
+ contains a signed integer (at least 32 bits long), and you must use
 these functions to manipulate or read atomic_t variables. atomic_read() and atomic_set() get and set the counter,
@@ -882,13 +905,12 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 Note that these functions are slower than normal arithmetic, and
- so should not be used unnecessarily. On some platforms they
- are much slower, like 32-bit Sparc where they use a spinlock.
+ so should not be used unnecessarily.
- The second class of atomic operations is atomic bit operations on a
- long, defined in
+ The second class of atomic operations is atomic bit operations on an
+ unsigned long, defined in
 include/linux/bitops.h. These operations generally take a pointer to the bit pattern, and a bit
@@ -899,7 +921,7 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 test_and_clear_bit() and test_and_change_bit() do the same thing, except return true if the bit was previously set; these are
- particularly useful for very simple locking.
+ particularly useful for atomically setting flags.
@@ -907,12 +929,6 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 than BITS_PER_LONG. The resulting behavior is strange on big-endian platforms though so it is a good idea not to do this.
-
-
- Note that the order of bits depends on the architecture, and in
- particular, the bitfield passed to these operations must be at
- least as large as a long.
-
@@ -932,11 +948,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 include/linux/module.h
- This is the classic method of exporting a symbol, and it works
- for both modules and non-modules. In the kernel all these
- declarations are often bundled into a single file to help
- genksyms (which searches source files for these declarations).
- See the comment on genksyms and Makefiles below.
+ This is the classic method of exporting a symbol: dynamically
+ loaded modules will be able to use the symbol as normal.
@@ -949,7 +962,8 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 symbols exported by EXPORT_SYMBOL_GPL() can only be seen by modules with a MODULE_LICENSE() that specifies a GPL
- compatible license.
+ compatible license. It implies that the function is considered
+ an internal implementation issue, and not really an interface.
@@ -962,12 +976,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 include/linux/list.h
- There are three sets of linked-list routines in the kernel
- headers, but this one seems to be winning out (and Linus has
- used it). If you don't have some particular pressing need for
- a single list, it's a good choice. In fact, I don't care
- whether it's a good choice or not, just use it so we can get
- rid of the others.
+ There used to be three sets of linked-list routines in the kernel
+ headers, but this one is the winner. If you don't have some
+ particular pressing need for a single list, it's a good choice.
+
+
+
+ In particular, list_for_each_entry is useful.
@@ -979,14 +994,13 @@ printk(KERN_INFO "my ip: %d.%d.%d.%d\n", NIPQUAD(ipaddress));
 convention, and return 0 for success, and a negative error number (eg. -EFAULT) for failure. This can be
- unintuitive at first, but it's fairly widespread in the networking
- code, for example.
+ unintuitive at first, but it's fairly widespread in the kernel.
- The filesystem code uses ERR_PTR()
+ Using ERR_PTR()
- include/linux/fs.h; to
+ include/linux/err.h; to
 encode a negative error number into a pointer, and IS_ERR() and PTR_ERR() to get it back out again: avoids a separate pointer parameter for
@@ -1040,7 +1054,7 @@ static struct block_device_operations opt_fops = {
 supported, due to lack of general use, but the following are considered standard (see the GCC info page section "C Extensions" for more details - Yes, really the info page, the
- man page is only a short summary of the stuff in info):
+ man page is only a short summary of the stuff in info).
@@ -1091,7 +1105,7 @@ static struct block_device_operations opt_fops = {
- Function names as strings (__FUNCTION__)
+ Function names as strings (__func__).
@@ -1164,63 +1178,35 @@ static struct block_device_operations opt_fops = {
 Usually you want a configuration option for your kernel hack.
- Edit Config.in in the appropriate directory
- (but under arch/ it's called
- config.in). The Config Language used is not
- bash, even though it looks like bash; the safe way is to use only
- the constructs that you already see in
- Config.in files (see
- Documentation/kbuild/kconfig-language.txt).
- It's good to run "make xconfig" at least once to test (because
- it's the only one with a static parser).
-
-
-
- Variables which can be Y or N use bool followed by a
- tagline and the config define name (which must start with
- CONFIG_). The tristate function is the same, but
- allows the answer M (which defines
- CONFIG_foo_MODULE in your source, instead of
- CONFIG_FOO) if CONFIG_MODULES
- is enabled.
+ Edit Kconfig in the appropriate directory.
+ The Config language is simple to use by cut and paste, and there's
+ complete documentation in
+ Documentation/kbuild/kconfig-language.txt.
 You may well want to make your CONFIG option only visible if CONFIG_EXPERIMENTAL is enabled: this serves as a warning to users. There many other fancy things you can do: see
- the various Config.in files for ideas.
+ the various Kconfig files for ideas.
-
-
- Edit the Makefile: the CONFIG variables are
- exported here so you can conditionalize compilation with `ifeq'.
- If your file exports symbols then add the names to
- export-objs so that genksyms will find them.
-
-
- There is a restriction on the kernel build system that objects
- which export symbols must have globally unique names.
- If your object does not have a globally unique name then the
- standard fix is to move the
- EXPORT_SYMBOL() statements to their own
- object with a unique name.
- This is why several systems have separate exporting objects,
- usually suffixed with ksyms.
-
-
+ In your description of the option, make sure you address both the
+ expert user and the user who knows nothing about your feature. Mention
+ incompatibilities and issues here.
Definitely
+ end your description with if in doubt, say N
+ (or, occasionally, `Y'); this is for people who have no
+ idea what you are talking about.
- Document your option in Documentation/Configure.help. Mention
- incompatibilities and issues here. Definitely
- end your description with if in doubt, say N
- (or, occasionally, `Y'); this is for people who have no
- idea what you are talking about.
+ Edit the Makefile: the CONFIG variables are
+ exported here so you can usually just add a "obj-$(CONFIG_xxx) +=
+ xxx.o" line. The syntax is documented in
+ Documentation/kbuild/makefiles.txt.
@@ -1253,20 +1239,12 @@ static struct block_device_operations opt_fops = {
- include/linux/brlock.h:
+ include/asm-i386/delay.h:
-extern inline void br_read_lock (enum brlock_indices idx)
-{
-	/*
-	 * This causes a link-time bug message if an
-	 * invalid index is used:
-	 */
-	if (idx >= __BR_END)
-		__br_lock_usage_bug();
-
-	read_lock(&__brlock_array[smp_processor_id()][idx]);
-}
+#define ndelay(n) (__builtin_constant_p(n) ? \
+	((n) > 20000 ? __bad_ndelay() : __const_udelay((n) * 5ul)) : \
+	__ndelay(n))
-- 
2.20.1
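
Note on the final hunk: the ndelay() macro picks its implementation at compile time with __builtin_constant_p(). What follows is a minimal userspace sketch of that dispatch, assuming GCC or Clang. Every demo_* name is a hypothetical stand-in and not a kernel symbol. The kernel macro appears to rely on __bad_ndelay() being declared but never defined, so that an oversized constant fails at link time; the stand-in here is defined so the sketch always builds.

```c
#include <assert.h>
#include <stdio.h>

/* All demo_* names are hypothetical stand-ins, not kernel symbols. */
static int const_path_calls;
static int runtime_path_calls;

/* Taken when the argument is a small compile-time constant. */
static void demo_const_delay(unsigned long ns)
{
	assert(ns <= 20000);
	const_path_calls++;
}

/* Taken when the argument is only known at run time. */
static void demo_runtime_delay(unsigned long ns)
{
	(void)ns;
	runtime_path_calls++;
}

/*
 * Assumption: the kernel declares __bad_ndelay() without defining it, so an
 * oversized constant breaks the link. This stand-in is defined so the sketch
 * always builds; it only reports the misuse.
 */
static void demo_bad_delay(void)
{
	fprintf(stderr, "demo_ndelay: constant argument too large\n");
}

/* Same shape as the ndelay() macro in the hunk above. */
#define demo_ndelay(n) (__builtin_constant_p(n) ? \
	((n) > 20000 ? demo_bad_delay() : demo_const_delay(n)) : \
	demo_runtime_delay(n))

/* Returns 1 if a literal argument is dispatched to the constant path. */
static int demo_literal_takes_const_path(void)
{
	int before = const_path_calls;

	demo_ndelay(100);
	return const_path_calls == before + 1;
}

/* Returns 1 if a volatile (never constant) argument takes the runtime path. */
static int demo_volatile_takes_runtime_path(void)
{
	volatile unsigned long ns = 100;
	int before = runtime_path_calls;

	demo_ndelay(ns);
	return runtime_path_calls == before + 1;
}
```

The volatile local is used for the runtime case because an ordinary variable may be constant-propagated at higher optimization levels, in which case __builtin_constant_p() can report it as constant.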