Commit | Line | Data |
---|---|---|
bbb5bbb0 RD |
1 | <?xml version="1.0" encoding="UTF-8"?> |
2 | <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" | |
3 | "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> | |
4 | ||
5 | <book id="Linux-filesystems-API"> | |
6 | <bookinfo> | |
7 | <title>Linux Filesystems API</title> | |
8 | ||
9 | <legalnotice> | |
10 | <para> | |
11 | This documentation is free software; you can redistribute | |
12 | it and/or modify it under the terms of the GNU General Public | |
13 | License as published by the Free Software Foundation; either | |
14 | version 2 of the License, or (at your option) any later | |
15 | version. | |
16 | </para> | |
17 | ||
18 | <para> | |
19 | This program is distributed in the hope that it will be | |
20 | useful, but WITHOUT ANY WARRANTY; without even the implied | |
21 | warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. | |
22 | See the GNU General Public License for more details. | |
23 | </para> | |
24 | ||
25 | <para> | |
26 | You should have received a copy of the GNU General Public | |
27 | License along with this program; if not, write to the Free | |
28 | Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, | |
29 | MA 02111-1307 USA | |
30 | </para> | |
31 | ||
32 | <para> | |
33 | For more details see the file COPYING in the source | |
34 | distribution of Linux. | |
35 | </para> | |
36 | </legalnotice> | |
37 | </bookinfo> | |
38 | ||
39 | <toc></toc> | |
40 | ||
41 | <chapter id="vfs"> | |
42 | <title>The Linux VFS</title> | |
5c3b4474 | 43 | <sect1 id="the_filesystem_types"><title>The Filesystem types</title> |
bbb5bbb0 RD |
44 | !Iinclude/linux/fs.h |
45 | </sect1> | |
5c3b4474 | 46 | <sect1 id="the_directory_cache"><title>The Directory Cache</title> |
bbb5bbb0 RD |
47 | !Efs/dcache.c |
48 | !Iinclude/linux/dcache.h | |
49 | </sect1> | |
5c3b4474 | 50 | <sect1 id="inode_handling"><title>Inode Handling</title> |
bbb5bbb0 RD |
51 | !Efs/inode.c |
52 | !Efs/bad_inode.c | |
53 | </sect1> | |
5c3b4474 | 54 | <sect1 id="registration_and_superblocks"><title>Registration and Superblocks</title> |
bbb5bbb0 RD |
55 | !Efs/super.c |
56 | </sect1> | |
5c3b4474 | 57 | <sect1 id="file_locks"><title>File Locks</title> |
bbb5bbb0 RD |
58 | !Efs/locks.c |
59 | !Ifs/locks.c | |
60 | </sect1> | |
5c3b4474 | 61 | <sect1 id="other_functions"><title>Other Functions</title> |
bbb5bbb0 RD |
62 | !Efs/mpage.c |
63 | !Efs/namei.c | |
64 | !Efs/buffer.c | |
65 | !Efs/bio.c | |
66 | !Efs/seq_file.c | |
67 | !Efs/filesystems.c | |
68 | !Efs/fs-writeback.c | |
69 | !Efs/block_dev.c | |
70 | </sect1> | |
71 | </chapter> | |
72 | ||
73 | <chapter id="proc"> | |
74 | <title>The proc filesystem</title> | |
75 | ||
5c3b4474 | 76 | <sect1 id="sysctl_interface"><title>sysctl interface</title> |
bbb5bbb0 RD |
77 | !Ekernel/sysctl.c |
78 | </sect1> | |
79 | ||
5c3b4474 | 80 | <sect1 id="proc_filesystem_interface"><title>proc filesystem interface</title> |
bbb5bbb0 RD |
81 | !Ifs/proc/base.c |
82 | </sect1> | |
83 | </chapter> | |
84 | ||
85 | <chapter id="sysfs"> | |
86 | <title>The Filesystem for Exporting Kernel Objects</title> | |
87 | !Efs/sysfs/file.c | |
88 | !Efs/sysfs/symlink.c | |
89 | !Efs/sysfs/bin.c | |
90 | </chapter> | |
91 | ||
92 | <chapter id="debugfs"> | |
93 | <title>The debugfs filesystem</title> | |
94 | ||
5c3b4474 | 95 | <sect1 id="debugfs_interface"><title>debugfs interface</title> |
bbb5bbb0 RD |
96 | !Efs/debugfs/inode.c |
97 | !Efs/debugfs/file.c | |
98 | </sect1> | |
99 | </chapter> | |
100 | ||
733b72c3 RD |
101 | <chapter id="LinuxJDBAPI"> |
102 | <chapterinfo> | |
103 | <title>The Linux Journalling API</title> | |
104 | ||
105 | <authorgroup> | |
106 | <author> | |
107 | <firstname>Roger</firstname> | |
108 | <surname>Gammans</surname> | |
109 | <affiliation> | |
110 | <address> | |
111 | <email>rgammans@computer-surgery.co.uk</email> | |
112 | </address> | |
113 | </affiliation> | |
114 | </author> | |
115 | </authorgroup> | |
116 | ||
117 | <authorgroup> | |
118 | <author> | |
119 | <firstname>Stephen</firstname> | |
120 | <surname>Tweedie</surname> | |
121 | <affiliation> | |
122 | <address> | |
123 | <email>sct@redhat.com</email> | |
124 | </address> | |
125 | </affiliation> | |
126 | </author> | |
127 | </authorgroup> | |
128 | ||
129 | <copyright> | |
130 | <year>2002</year> | |
131 | <holder>Roger Gammans</holder> | |
132 | </copyright> | |
133 | </chapterinfo> | |
134 | ||
135 | <title>The Linux Journalling API</title> | |
136 | ||
5c3b4474 | 137 | <sect1 id="journaling_overview"> |
733b72c3 | 138 | <title>Overview</title> |
5c3b4474 | 139 | <sect2 id="journaling_details"> |
733b72c3 RD |
140 | <title>Details</title> |
141 | <para> | |
142 | The journalling layer is easy to use. First you need to | |
143 | create a journal_t data structure. There are | |
144 | two calls for this, depending on how you decide to allocate the physical | |
145 | media on which the journal resides. The journal_init_inode() call | |
146 | is for journals stored in filesystem inodes, while the journal_init_dev() | |
147 | call can be used for journals stored on a raw device (in a contiguous range | |
148 | of blocks). A journal_t is a typedef for a struct that you handle through a pointer, so when | |
149 | you are finally finished make sure you call journal_destroy() on it | |
150 | to free up any used kernel memory. | |
151 | </para> | |
152 | ||
153 | <para> | |
154 | Once you have got your journal_t object you need to 'mount' or load the journal | |
155 | file, unless of course you haven't initialised it yet - in which case you | |
156 | need to call journal_create(). | |
157 | </para> | |
158 | ||
159 | <para> | |
160 | Most of the time, however, your journal file will already have been created, but | |
161 | before you load it you must call journal_wipe() to empty the journal file. | |
162 | Hang on, you say, what if the filesystem wasn't cleanly umount()'d? Well, it is the | |
163 | job of the client file system to detect this and skip the call to journal_wipe(). | |
164 | </para> | |
165 | ||
166 | <para> | |
167 | In either case the next call should be to journal_load() which prepares the | |
168 | journal file for use. Note that journal_wipe(..,0) calls journal_skip_recovery() | |
169 | for you if it detects any outstanding transactions in the journal and similarly | |
170 | journal_load() will call journal_recover() if necessary. | |
171 | I would advise reading fs/ext3/super.c for examples of this stage. | |
172 | [RGG: Why is the journal_wipe() call necessary? Doesn't this needlessly | |
173 | complicate the API? Or isn't it a good idea for the journal layer to hide | |
174 | dirty mounts from the client fs?] | |
175 | </para> | |
176 | ||
177 | <para> | |
178 | Now you can go ahead and start modifying the underlying | |
179 | filesystem. Almost. | |
180 | </para> | |
181 | ||
182 | <para> | |
183 | ||
184 | You still need to actually journal your filesystem changes; this | |
185 | is done by wrapping them into transactions. You | |
186 | also need to wrap the modification of each of the buffers | |
187 | with calls to the journal layer, so that it knows which modifications | |
188 | you are actually making. To begin a transaction use journal_start(), which | |
189 | returns a transaction handle. | |
190 | </para> | |
191 | ||
192 | <para> | |
193 | journal_start() | |
194 | and its counterpart journal_stop(), which indicates the end of a transaction, | |
195 | are nestable calls, so you can reenter a transaction if necessary, | |
196 | but remember that you must call journal_stop() the same number of times as | |
197 | journal_start() before the transaction is completed (or more accurately | |
198 | leaves the update phase). Ext3/VFS makes use of this feature to simplify | |
199 | quota support. | |
200 | </para> | |
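<para>
The nesting rule can be pictured with a toy user-space model. This is only a
sketch of the counting behaviour described above, not the jbd implementation:
struct toy_handle, toy_start() and toy_stop() are hypothetical names invented
for the illustration.
</para>
<programlisting><![CDATA[
```c
#include <assert.h>

/* Hypothetical stand-in for a task's transaction handle; not jbd's handle_t. */
struct toy_handle {
    int refs;      /* number of toy_start() calls not yet matched by toy_stop() */
    int updating;  /* 1 while the transaction is still in its update phase */
};

/* Start or re-enter the transaction described by *h. */
static struct toy_handle *toy_start(struct toy_handle *h)
{
    if (h->refs == 0)
        h->updating = 1;   /* the outermost start opens the update phase */
    h->refs++;
    return h;
}

/* Leave one nesting level; only the outermost stop ends the update phase. */
static void toy_stop(struct toy_handle *h)
{
    assert(h->refs > 0);   /* a stop without a matching start is a bug */
    h->refs--;
    if (h->refs == 0)
        h->updating = 0;
}
```
]]></programlisting>
<para>
Calling toy_start() twice and toy_stop() once leaves the model in its update
phase; only the second toy_stop() ends it.
</para>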
201 | ||
202 | <para> | |
203 | Inside each transaction you need to wrap the modifications to the | |
204 | individual buffers (blocks). Before you start to modify a buffer you | |
205 | need to call journal_get_{create,write,undo}_access() as appropriate; | |
206 | this allows the journalling layer to copy the unmodified data if it | |
207 | needs to. After all, the buffer may be part of a previously uncommitted | |
208 | transaction. | |
209 | At this point you are at last ready to modify a buffer, and once | |
210 | you have done so you need to call journal_dirty_{meta,}data(). | |
211 | Or, if you've asked for access to a buffer that you now know no longer | |
212 | needs to be pushed back to the device, you can call journal_forget() | |
213 | in much the same way as you might have used bforget() in the past. | |
214 | </para> | |
215 | ||
216 | <para> | |
217 | A journal_flush() may be called at any time to commit and checkpoint | |
218 | all your transactions. | |
219 | </para> | |
220 | ||
221 | <para> | |
222 | Then at umount time, in your put_super() (2.4) or write_super() (2.5), | |
223 | you can call journal_destroy() to clean up your in-core journal object. | |
224 | </para> | |
225 | ||
226 | <para> | |
227 | Unfortunately there are a couple of ways the journal layer can cause a deadlock. | |
228 | The first thing to note is that each task can only have | |
229 | a single outstanding transaction at any one time; remember that nothing | |
230 | commits until the outermost journal_stop(). This means | |
231 | you must complete the transaction at the end of each file/inode/address | |
232 | etc. operation you perform, so that the journalling system isn't re-entered | |
233 | on another journal. This matters because transactions can't be nested/batched | |
234 | across differing journals, and a filesystem other than | |
235 | yours (say ext3) may be modified in a later syscall. | |
236 | </para> | |
237 | ||
238 | <para> | |
239 | The second case to bear in mind is that journal_start() can | |
240 | block if there isn't enough space in the journal for your transaction | |
241 | (based on the passed nblocks param) - when it blocks it merely(!) needs to | |
242 | wait for transactions to complete and be committed from other tasks, | |
243 | so essentially we are waiting for journal_stop(). So to avoid | |
244 | deadlocks you must treat journal_start/stop() as if they | |
245 | were semaphores and include them in your semaphore ordering rules to prevent | |
246 | deadlocks. Note that journal_extend() has similar blocking behaviour to | |
247 | journal_start() so you can deadlock here just as easily as on journal_start(). | |
248 | </para> | |
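<para>
The semaphore analogy can be made concrete with a small user-space model.
This is a sketch under simplifying assumptions, not how jbd accounts for
journal space: struct toy_journal, toy_try_start() and toy_stop_and_commit()
are hypothetical names, and the model returns 0 at the point where the real
journal_start() would block.
</para>
<programlisting><![CDATA[
```c
#include <assert.h>

/* Hypothetical toy journal: only tracks how many blocks are still unreserved. */
struct toy_journal {
    int free_blocks;
};

/*
 * Try to reserve nblocks for a transaction. Returns 1 on success and 0
 * in the situation where the real journal_start() would sleep.
 */
static int toy_try_start(struct toy_journal *j, int nblocks)
{
    assert(nblocks >= 0);
    if (nblocks > j->free_blocks)
        return 0;              /* not enough space: the caller would block */
    j->free_blocks -= nblocks;
    return 1;
}

/* Give a reservation back, as if the transaction had stopped and committed. */
static void toy_stop_and_commit(struct toy_journal *j, int nblocks)
{
    j->free_blocks += nblocks;
}
```
]]></programlisting>
<para>
With 8 free blocks, a task that reserves 6 leaves a second task asking for 4
waiting until the first one stops. If the first task then waits for a lock
the second task holds, neither can make progress, which is why the calls
belong in your lock ordering rules.
</para>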
249 | ||
250 | <para> | |
251 | Try to reserve the right number of blocks the first time. ;-). This will | |
252 | be the maximum number of blocks you are going to touch in this transaction. | |
253 | I advise having a look at at least ext3_jbd.h to see the basis on which | |
254 | ext3 makes these decisions. | |
255 | </para> | |
256 | ||
257 | <para> | |
258 | Another wriggle to watch out for is your on-disk block allocation strategy. | |
259 | Why? Because, if you undo a delete, you need to ensure you haven't reused any | |
260 | of the freed blocks in a later transaction. One simple way of doing this | |
261 | is to make sure any blocks you allocate only have checkpointed transactions | |
262 | listed against them. Ext3 does this in ext3_test_allocatable(). | |
263 | </para> | |
264 | ||
265 | <para> | |
266 | Locking is also provided through journal_{un,}lock_updates(); | |
267 | ext3 uses this when it wants a window with a clean and stable fs for a moment, | |
268 | e.g. | |
269 | </para> | |
270 | ||
271 | <programlisting> | |
272 | /* "journal" stands for your journal_t pointer */ | |
273 | journal_lock_updates(journal);   /* stop new stuff happening.. */ | |
274 | journal_flush(journal);          /* checkpoint everything. */ | |
275 | /* ..do stuff on stable fs */ | |
276 | journal_unlock_updates(journal); /* carry on with filesystem use. */ | |
277 | </programlisting> | |
278 | ||
279 | <para> | |
280 | The opportunities for abuse and DOS attacks with this should be obvious, | |
281 | if you allow unprivileged userspace to trigger codepaths containing these | |
282 | calls. | |
283 | </para> | |
284 | ||
285 | <para> | |
286 | A new feature of jbd since 2.5.25 is commit callbacks. With the new | |
287 | journal_callback_set() function you can now ask the journalling layer | |
288 | to call you back when the transaction is finally committed to disk, so that | |
289 | you can do some of your own management. The key to this is the journal_callback | |
290 | struct, which maintains the internal callback information but which you can | |
291 | extend like this: | |
292 | </para> | |
293 | <programlisting> | |
294 | struct myfs_callback_s { | |
295 |     /* Data structure element required by jbd. */ | |
296 |     struct journal_callback for_jbd; | |
297 |     /* Stuff for myfs, allocated together. */ | |
298 |     myfs_inode *i_commited; | |
299 | ||
300 | }; | |
301 | </programlisting> | |
302 | ||
303 | <para> | |
304 | This would be useful if you needed to know when data was committed to a | |
305 | particular inode. | |
306 | </para> | |
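<para>
Embedding works because the callback is handed a pointer to the embedded
member, from which the enclosing structure can be recovered. The following
user-space sketch shows that pattern only. struct toy_journal_callback is a
hypothetical stand-in for the real journal_callback struct (whose actual
layout is in jbd.h), and toy_commit(), myfs_commit_done() and the fields of
struct myfs_inode are likewise invented for the illustration.
</para>
<programlisting><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for jbd's callback struct; see jbd.h for the real one. */
struct toy_journal_callback {
    void (*jcb_func)(struct toy_journal_callback *jcb, int error);
};

/* Hypothetical filesystem inode with a flag we want set on commit. */
struct myfs_inode {
    int committed;
};

struct myfs_callback_s {
    /* Embedded (not pointed to), so both parts are allocated together. */
    struct toy_journal_callback for_jbd;
    struct myfs_inode *i_commited;
};

/* Called with a pointer to the embedded member; recover the outer struct. */
static void myfs_commit_done(struct toy_journal_callback *jcb, int error)
{
    struct myfs_callback_s *cb = (struct myfs_callback_s *)
        ((char *)jcb - offsetof(struct myfs_callback_s, for_jbd));

    if (!error)
        cb->i_commited->committed = 1;
}

/* Toy "commit": the journal side only ever sees the embedded part. */
static void toy_commit(struct toy_journal_callback *jcb, int error)
{
    assert(jcb != NULL && jcb->jcb_func != NULL);
    jcb->jcb_func(jcb, error);
}
```
]]></programlisting>
<para>
The offsetof() arithmetic is the same idea as the kernel's container_of()
macro, so the embedded member does not have to come first in the structure.
</para>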
307 | ||
308 | </sect2> | |
309 | ||
5c3b4474 | 310 | <sect2 id="jbd_summary"> |
733b72c3 RD |
311 | <title>Summary</title> |
312 | <para> | |
313 | Using the journal is a matter of wrapping the different context changes, | |
314 | namely each mount, each modification (transaction) and each changed buffer, | |
315 | to tell the journalling layer about them. | |
316 | </para> | |
317 | ||
318 | <para> | |
319 | Here is some pseudo code to give you an idea of how it works, as | |
320 | an example. | |
321 | </para> | |
322 | ||
323 | <programlisting> | |
324 | journal_t *my_jrnl = journal_init_{dev,inode}(...); | |
325 | /* for a journal not yet initialised, call journal_create(my_jrnl) as described above */ | |
326 | if (clean) journal_wipe(my_jrnl, ...); | |
327 | journal_load(my_jrnl); | |
328 | ||
329 | foreach(transaction) { /* transactions must be | |
330 | completed before | |
331 | a syscall returns to | |
332 | userspace */ | |
333 | ||
334 |     handle_t *xact = journal_start(my_jrnl, nblocks); | |
335 |     foreach(bh) { | |
336 |         journal_get_{create,write,undo}_access(xact, bh); | |
337 |         if (myfs_modify(bh)) { /* returns true | |
338 | if makes changes */ | |
339 |             journal_dirty_{meta,}data(xact, bh); | |
340 |         } else { | |
341 |             journal_forget(xact, bh); | |
342 |         } | |
343 |     } | |
344 |     journal_stop(xact); | |
345 | } | |
346 | journal_destroy(my_jrnl); | |
347 | </programlisting> | |
348 | </sect2> | |
349 | ||
350 | </sect1> | |
351 | ||
5c3b4474 | 352 | <sect1 id="data_types"> |
733b72c3 RD |
353 | <title>Data Types</title> |
354 | <para> | |
355 | The journalling layer uses typedefs to 'hide' the concrete definitions | |
356 | of the structures used. As a client of the JBD layer you can | |
357 | just rely on using the pointer as a magic cookie of some sort. | |
358 | ||
359 | Obviously the hiding is not enforced as this is 'C'. | |
360 | </para> | |
5c3b4474 | 361 | <sect2 id="structures"><title>Structures</title> |
733b72c3 RD |
362 | !Iinclude/linux/jbd.h |
363 | </sect2> | |
364 | </sect1> | |
365 | ||
5c3b4474 | 366 | <sect1 id="functions"> |
733b72c3 RD |
367 | <title>Functions</title> |
368 | <para> | |
369 | The functions here are split into two groups: those that | |
370 | affect a journal as a whole, and those which are used to | |
371 | manage transactions. | |
372 | </para> | |
5c3b4474 | 373 | <sect2 id="journal_level"><title>Journal Level</title> |
733b72c3 RD |
374 | !Efs/jbd/journal.c |
375 | !Ifs/jbd/recovery.c | |
376 | </sect2> | |
5c3b4474 | 377 | <sect2 id="transaction_level"><title>Transaction Level</title> |
733b72c3 RD |
378 | !Efs/jbd/transaction.c |
379 | </sect2> | |
380 | </sect1> | |
5c3b4474 | 381 | <sect1 id="see_also"> |
733b72c3 RD |
382 | <title>See also</title> |
383 | <para> | |
384 | <citation> | |
385 | <ulink url="ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/journal-design.ps.gz"> | |
386 | Journaling the Linux ext2fs Filesystem, LinuxExpo 98, Stephen Tweedie | |
387 | </ulink> | |
388 | </citation> | |
389 | </para> | |
390 | <para> | |
391 | <citation> | |
392 | <ulink url="http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html"> | |
393 | Ext3 Journalling FileSystem, OLS 2000, Dr. Stephen Tweedie | |
394 | </ulink> | |
395 | </citation> | |
396 | </para> | |
397 | </sect1> | |
398 | ||
399 | </chapter> | |
400 | ||
073b86da RD |
401 | <chapter id="splice"> |
402 | <title>splice API</title> | |
403 | <para> | |
404 | splice is a method for moving blocks of data around inside the | |
405 | kernel, without continually transferring them between the kernel | |
406 | and user space. | |
407 | </para> | |
408 | !Ffs/splice.c | |
409 | </chapter> | |
410 | ||
411 | <chapter id="pipes"> | |
412 | <title>pipes API</title> | |
413 | <para> | |
414 | Pipe interfaces are all for in-kernel (builtin image) use. | |
415 | They are not exported for use by modules. | |
416 | </para> | |
417 | !Iinclude/linux/pipe_fs_i.h | |
418 | !Ffs/pipe.c | |
419 | </chapter> | |
420 | ||
bbb5bbb0 | 421 | </book> |