TuxOnIce 3.0 Internal Documentation.
Updated to 26 March 2009

1. Introduction.

TuxOnIce 3.0 is an addition to the Linux Kernel, designed to
allow the user to quickly shut down and quickly boot a computer, without
needing to close documents or programs. It is equivalent to the
hibernate facility in some laptops. This implementation, however,
requires no special BIOS or hardware support.

The code in these files is based upon the original implementation
prepared by Gabor Kuti and additional work by Pavel Machek and a
host of others. This code has been substantially reworked by Nigel
Cunningham, again with the help and testing of many others, not the
least of whom is Michael Frank. At its heart, however, the operation is
essentially the same as Gabor's version.

2. Overview of operation.

The basic sequence of operations is as follows:

a. Quiesce all other activity.
b. Ensure enough memory and storage space are available, and attempt
   to free memory/storage if necessary.
c. Allocate the required memory and storage space.
d. Write the image.
e. Power down.

There are a number of complicating factors which mean that things are
not as simple as the above would imply, however...

o The activity of each process must be stopped at a point where it will
  not be holding locks necessary for saving the image, or unexpectedly
  restart operations due to something like a timeout and thereby make
  our image inconsistent.

o It is desirable that we sync outstanding I/O to disk before calculating
  image statistics. This reduces corruption if one should suspend but
  then not resume, and also makes later parts of the operation safer (see
  below).

o We need to get as close as we can to an atomic copy of the data.
  Inconsistencies in the image will result in inconsistent memory contents at
  resume time, and thus in instability of the system and/or file system
  corruption. This would appear to imply a maximum image size of one half of
  the amount of RAM, but we have a solution... (again, below).

o In 2.6, we choose to play nicely with the other suspend-to-disk
  implementations.

3. Detailed description of internals.

a. Quiescing activity.

Safely quiescing the system is achieved using three separate but related
aspects.

First, we note that the vast majority of processes don't need to run during
suspend. They can be 'frozen'. We therefore implement a refrigerator
routine, which processes enter and in which they remain until the cycle is
complete. Processes enter the refrigerator via try_to_freeze() invocations
at appropriate places. A process cannot be frozen in any old place. It
must not be holding locks that will be needed for writing the image or
freezing other processes. For this reason, userspace processes generally
enter the refrigerator via the signal handling code, and kernel threads at
the place in their event loops where they drop locks and yield to other
processes or sleep.
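
As an illustration (a minimal sketch, not taken from the TuxOnIce sources;
example_thread() is a hypothetical name), a freezable kernel thread calls
try_to_freeze() at the point in its loop where it holds no locks:

#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/sched.h>

static int example_thread(void *data)
{
  set_freezable();                 /* opt in to being frozen */

  while (!kthread_should_stop()) {
    try_to_freeze();               /* enter the refrigerator if requested */

    /* ... perform one unit of work while holding no locks ... */

    schedule_timeout_interruptible(HZ);
  }
  return 0;
}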

The task of freezing processes is complicated by the fact that there can be
interdependencies between processes. Freezing process A before process B may
mean that process B cannot be frozen, because it blocks waiting for
process A rather than stopping in the refrigerator. This issue is seen where
userspace waits on freezeable kernel threads or fuse filesystem threads. To
address this issue, we implement the following algorithm for quiescing
activity:

- Freeze filesystems (including fuse - userspace programs starting
  new requests are immediately frozen; programs already running
  requests complete their work before being frozen in the next
  step)
- Freeze userspace
- Thaw filesystems (this is safe now that userspace is frozen and no
  fuse requests are outstanding).
- Invoke sys_sync (noop on fuse).
- Freeze filesystems
- Freeze kernel threads

If we need to free memory, we thaw kernel threads and filesystems, but not
userspace. We can then free caches without worrying about deadlocks due to
swap files being on frozen filesystems or such like.

b. Ensure enough memory & storage are available.

We have a number of constraints to meet in order to be able to successfully
suspend and resume.

First, the image will be written in two parts, described below. One of these
parts needs to have an atomic copy made, which of course implies a maximum
size of one half of the amount of system memory. The other part ('pageset')
is not atomically copied, and can therefore be as large or small as desired.

Second, we have constraints on the amount of storage available. In these
calculations, we may also consider any compression that will be done. The
cryptoapi module allows the user to configure an expected compression ratio.

Third, the user can specify an arbitrary limit on the image size, in
megabytes. This limit is treated as a soft limit, so that we don't fail the
attempt to suspend if we cannot meet this constraint.

c. Allocate the required memory and storage space.

Having done the initial freeze, we determine whether the above constraints
are met, and seek to allocate the metadata for the image. If the constraints
are not met, or we fail to allocate the required space for the metadata, we
seek to free the amount of memory that we calculate is needed and try again.
We allow up to four iterations of this loop before aborting the cycle. If we
do fail, it should only be because of a bug in TuxOnIce's calculations.
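
In outline, the loop looks something like the following sketch (illustrative
only - the function names here are placeholders, not actual TuxOnIce
symbols):

int attempts = 0, result;

do {
  result = prepare_image_metadata();     /* check constraints, allocate */
  if (result)
    free_memory(amount_still_needed());  /* free what we think we lack */
} while (result && ++attempts < 4);

if (result)
  abort_hibernation_cycle();             /* indicates a calculation bug */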

These steps are merged together in the prepare_image function, found in
prepare_image.c. The functions are merged because of the cyclical nature
of the problem of calculating how much memory and storage is needed. Since
the data structures containing the information about the image must
themselves take memory and use storage, the amount of memory and storage
required changes as we prepare the image. Since the changes are not large,
only one or two iterations will be required to achieve a solution.

The recursive nature of the algorithm is minimised by keeping user space
frozen while preparing the image, and by the fact that our records of which
pages are to be saved and which pageset they are saved in use bitmaps (so
that changes in number or fragmentation of the pages to be saved don't
feed back via changes in the amount of memory needed for metadata). The
recursiveness is thus limited to any extra slab pages allocated to store the
extents that record storage used, and the effects of seeking to free memory.

d. Write the image.

We previously mentioned the need to create an atomic copy of the data, and
the half-of-memory limitation that is implied in this. This limitation is
circumvented by dividing the memory to be saved into two parts, called
pagesets.

Pageset2 contains most of the page cache - the pages on the active and
inactive LRU lists that aren't needed or modified while TuxOnIce is
running, so they can be safely written without an atomic copy. They are
therefore saved first and reloaded last. While saving these pages,
TuxOnIce carefully ensures that the work of writing the pages doesn't make
the image inconsistent. With the support for Kernel (Video) Mode Setting
going into the kernel at the time of writing, we need to check for pages
on the LRU that are used by KMS, and exclude them from pageset2. They are
atomically copied as part of pageset 1.

Once pageset2 has been saved, we prepare to do the atomic copy of remaining
memory. As part of the preparation, we power down drivers, thereby providing
them with the opportunity to have their state recorded in the image. The
amount of memory allocated by drivers for this is usually negligible, but if
DRI is in use, video drivers may require significant amounts. Ideally we
would be able to query drivers while preparing the image as to the amount of
memory they will need. Unfortunately no such mechanism exists at the time of
writing. For this reason, TuxOnIce allows the user to set an
'extra_pages_allowance', which is used to seek to ensure sufficient memory
is available for drivers at this point. TuxOnIce also lets the user set this
value to 0. In this case, a test driver suspend is done while preparing the
image, and the difference (plus a margin) used instead. TuxOnIce will also
automatically restart the hibernation process (twice at most) if it finds
that the extra pages allowance is not sufficient. It will then use what was
actually needed (plus a margin, again). Failure to hibernate should thus
be an extremely rare occurrence.

Having suspended the drivers, we save the CPU context before making an
atomic copy of pageset1, resuming the drivers and saving the atomic copy.
After saving the two pagesets, we just need to save our metadata before
powering down.

As we mentioned earlier, the contents of pageset2 pages aren't needed once
they've been saved. We therefore use them as the destination of our atomic
copy. In the unlikely event that pageset1 is larger, extra pages are
allocated while the image is being prepared. This is normally only a real
possibility when the system has just been booted and the page cache is
small.

This is where we need to be careful about syncing, however. Pageset2 will
probably contain filesystem metadata. If this is overwritten with pageset1
and then a sync occurs, the filesystem will be corrupted - at least until
resume time and another sync of the restored data. Since there is a
possibility that the user might not resume or (may it never be!) that
TuxOnIce might oops, we do our utmost to avoid syncing filesystems after
copying pageset1.

e. Power down.

Powering down uses standard kernel routines. TuxOnIce supports powering down
using the ACPI S3, S4 and S5 methods or the kernel's non-ACPI power-off.
Supporting suspend to RAM (S3) as a power-off option might sound strange,
but it allows the user to quickly get their system up and running again if
the battery doesn't run out (we just need to re-read the overwritten pages),
while if the battery does run out (or the user removes power), they can
still resume.

4. Data Structures.

TuxOnIce uses three main structures to store its metadata and configuration
information:

a) Pageflags bitmaps.

TuxOnIce records which pages will be in pageset1, pageset2, the destination
of the atomic copy and the source of the atomically restored image using
bitmaps. The code used is that written for swsusp, with small improvements
to match TuxOnIce's requirements.

The pageset1 bitmap is thus easily stored in the image header for use at
resume time.

As mentioned above, using bitmaps also means that the amount of memory and
storage required for recording the above information is constant. This
greatly simplifies the work of preparing the image. In earlier versions of
TuxOnIce, extents were used to record which pages would be stored. In that
case, however, eating memory could result in greater fragmentation of the
lists of pages, which in turn required more memory to store the extents and
more storage in the image header. These could in turn require further
freeing of memory, and another iteration. All of this complexity is removed
by having bitmaps.

Bitmaps also make a lot of sense because TuxOnIce only ever iterates
through the lists. There is therefore no cost to not being able to find the
nth page in constant (O(1)) time. We only need to worry about the cost of
finding the n+1th page, given the location of the nth page. Bitwise
optimisations help here.
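
A sketch of such an iteration using the kernel's generic bitmap helpers
(illustrative only - the bitmap name and bound are assumptions, and
TuxOnIce's own iterator differs in detail):

#include <linux/bitops.h>

unsigned long pfn = find_first_bit(pageset1_map, max_pfn);

while (pfn < max_pfn) {
  /* ... save or otherwise process the page with this pfn ... */
  pfn = find_next_bit(pageset1_map, max_pfn, pfn + 1);
}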

b) Extents for block data.

TuxOnIce supports writing the image to multiple block devices. In the case
of swap, multiple partitions and/or files may be in use, and we happily use
them all (with the exception of compcache pages, which we allocate but do
not use). This use of multiple block devices is accomplished as follows:

Whatever the actual source of the allocated storage, the destination of the
image can be viewed in terms of one or more block devices, and on each
device, a list of sectors. To simplify matters, we only use contiguous,
PAGE_SIZE aligned sectors, like the swap code does.

Since sector numbers on each bdev may well not start at 0, it makes much
more sense to use extents here. Contiguous ranges of pages can thus be
represented in the extents by contiguous values.

Variations in block size are taken into account when transforming this data
into the parameters for bio submission.

We can thus implement a layer of abstraction wherein the core of TuxOnIce
doesn't have to worry about which device we're currently writing to or
where in the device we are. It simply requests that the next page in the
pageset or header be written, leaving the details to this lower layer.
The lower layer remembers where in the sequence of devices and blocks each
pageset starts. The header always starts at the beginning of the allocated
storage.


So extents are:

struct extent {
  unsigned long minimum, maximum;
  struct extent *next;
};

These are combined into chains of extents for a device:

struct extent_chain {
  int size;     /* size of the chain, i.e. sum (max-min+1) */
  int allocs, frees;
  char *name;
  struct extent *first, *last_touched;
};
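
For example, appending a block number to a chain might look roughly like
this (an illustrative sketch, not the TuxOnIce implementation; contiguous
values simply extend the last extent):

#include <linux/slab.h>

static int extent_chain_add(struct extent_chain *chain, unsigned long value)
{
  struct extent *last = chain->last_touched;

  if (last && value == last->maximum + 1) {
    last->maximum = value;                  /* contiguous: grow the extent */
  } else {
    struct extent *new = kzalloc(sizeof(*new), GFP_KERNEL);

    if (!new)
      return -ENOMEM;
    new->minimum = new->maximum = value;
    if (last)
      last->next = new;
    else
      chain->first = new;
    chain->last_touched = new;
    chain->allocs++;
  }
  chain->size++;
  return 0;
}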

For each bdev, we need to store a little more info:

struct suspend_bdev_info {
  struct block_device *bdev;
  dev_t dev_t;
  int bmap_shift;
  int blocks_per_page;
};

The dev_t is used to identify the device in the stored image. As a result,
we expect devices at resume time to have the same major and minor numbers
as they had while suspending. This is primarily a concern where the user
utilises LVM for storage, as they will need to dmsetup their partitions in
such a way as to maintain this consistency at resume time.

bmap_shift and blocks_per_page apply the effects of variations in blocks
per page settings for the filesystem and underlying bdev. For most
filesystems, these are the same, but for xfs, they can have independent
values.
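
As an example of how these fields might be applied (an assumption about
their use, not a quote from the code), turning a filesystem block number
into the starting sector for a bio could look like:

/* Illustrative: block_num and bdev_info are hypothetical locals. */
sector_t first_sector = (sector_t)block_num << bdev_info->bmap_shift;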

Combining these two structures together, we have everything we need to
record what devices and what blocks on each device are being used to
store the image, and to submit I/O using submit_bio.

The last elements in the picture are a means of recording how the storage
is being used.

We do this first and foremost by implementing a layer of abstraction on
top of the devices and extent chains which allows us to view however many
devices there might be as one long storage tape, with a single 'head' that
tracks a 'current position' on the tape:

struct extent_iterate_state {
  struct extent_chain *chains;
  int num_chains;
  int current_chain;
  struct extent *current_extent;
  unsigned long current_offset;
};

That is, *chains points to an array of size num_chains of extent chains.
For the filewriter, this is always a single chain. For the swapwriter, the
array is of size MAX_SWAPFILES.

current_chain, current_extent and current_offset thus point to the current
index in the chains array (and into a matching array of struct
suspend_bdev_info), the current extent in that chain (to optimise access),
and the current offset within that extent.
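
A sketch of advancing the head to the next allocated page (illustrative
only; the real iterator also keeps the matching suspend_bdev_info in step):

/* Returns 0 on success, -ENODATA when the allocated storage is exhausted. */
static int extent_state_next(struct extent_iterate_state *state)
{
  struct extent *ext = state->current_extent;

  if (state->current_offset < ext->maximum) {
    state->current_offset++;                /* next page in this extent */
    return 0;
  }

  if (ext->next) {                          /* next extent, same chain */
    state->current_extent = ext->next;
    state->current_offset = ext->next->minimum;
    return 0;
  }

  /* Move to the first extent of the next non-empty chain, if any. */
  while (++state->current_chain < state->num_chains) {
    struct extent_chain *chain = &state->chains[state->current_chain];

    if (chain->first) {
      state->current_extent = chain->first;
      state->current_offset = chain->first->minimum;
      return 0;
    }
  }
  return -ENODATA;
}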

The image is divided into three parts:
- The header
- Pageset 1
- Pageset 2

The header always starts at the first device and first block. We know its
size before we begin to save the image because we carefully account for
everything that will be stored in it.

The second pageset (LRU) is stored first. It begins on the next page after
the end of the header.

The first pageset is stored second. Its start location is only known once
pageset2 has been saved, since pageset2 may be compressed as it is written.
This location is thus recorded at the end of saving pageset2. It is also
page aligned.

Since this information is needed at resume time, and the location of extents
in memory will differ at resume time, this needs to be stored in a portable
way:

struct extent_iterate_saved_state {
  int chain_num;
  int extent_num;
  unsigned long offset;
};

We can thus implement a layer of abstraction wherein the core of TuxOnIce
doesn't have to worry about which device we're currently writing to or
where in the device we are. It simply requests that the next page in the
pageset or header be written, leaving the details to this layer, and
invokes the routines to remember and restore the position, without having
to worry about the details of how the data is arranged on disk or such like.
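
A sketch of capturing the head position in this portable form (again
illustrative; names other than the two structures are assumptions):

static void extent_state_save(struct extent_iterate_state *state,
                              struct extent_iterate_saved_state *saved)
{
  struct extent *ext;

  saved->chain_num = state->current_chain;
  saved->offset = state->current_offset;
  saved->extent_num = 0;

  /* Record the extent's index within its chain rather than its address,
   * since addresses will differ at resume time. */
  for (ext = state->chains[saved->chain_num].first;
       ext && ext != state->current_extent; ext = ext->next)
    saved->extent_num++;
}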

c) Modules

One aim in designing TuxOnIce was to make it flexible. We wanted to allow
for the implementation of different methods of transforming a page to be
written to disk and different methods of getting the pages stored.

In early versions (the betas and perhaps Suspend1), compression support was
inlined in the image writing code, and the data structures and code for
managing swap were intertwined with the rest of the code. A number of people
had expressed interest in implementing image encryption, and alternative
methods of storing the image.

In order to achieve this, TuxOnIce was given a modular design.

A module is a single file which encapsulates the functionality needed
to transform a pageset of data (encryption or compression, for example),
or to write the pageset to a device. The former type of module is called
a 'page-transformer', the latter a 'writer'.

Modules are linked together in pipeline fashion. There may be zero or more
page transformers in a pipeline, and there is always exactly one writer.
The pipeline follows this pattern:

---------------------------------
|         TuxOnIce Core         |
---------------------------------
                |
                |
---------------------------------
|      Page transformer 1       |
---------------------------------
                |
                |
---------------------------------
|      Page transformer 2       |
---------------------------------
                |
                |
---------------------------------
|            Writer             |
---------------------------------

During the writing of an image, the core code feeds pages one at a time
to the first module. This module performs whatever transformations it
implements on the incoming data, completely consuming the incoming data and
feeding output in a similar manner to the next module.

All routines are SMP safe, and the final result of the transformations is
written with an index (provided by the core) and size of the output by the
writer. As a result, we can have multithreaded I/O without needing to
worry about the sequence in which pages are written (or read).

During reading, the pipeline works in the reverse direction. The core code
calls the first module with the address of a buffer which should be filled.
(Note that the buffer size is always PAGE_SIZE at this time). This module
will in turn request data from the next module and so on down until the
writer is made to read from the stored image.

Part of the definition of the structure of a module thus looks like this:

int (*rw_init) (int rw, int stream_number);
int (*rw_cleanup) (int rw);
int (*write_chunk) (struct page *buffer_page);
int (*read_chunk) (struct page *buffer_page, int sync);

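By way of illustration, a do-nothing page transformer might fill in these
hooks as follows (hypothetical code: next_module_write_chunk() and
next_module_read_chunk() stand in for whatever mechanism passes data to and
from the next module in the pipeline):

static int null_rw_init(int rw, int stream_number)
{
  return 0;                       /* nothing to set up */
}

static int null_rw_cleanup(int rw)
{
  return 0;                       /* nothing to tear down */
}

static int null_write_chunk(struct page *buffer_page)
{
  /* A real transformer (compression, encryption) would consume the
   * incoming page here and emit transformed output; a pass-through
   * simply forwards the page to the next module. */
  return next_module_write_chunk(buffer_page);
}

static int null_read_chunk(struct page *buffer_page, int sync)
{
  /* On the read path the pipeline runs in reverse: request data from
   * the next module and return it unmodified. */
  return next_module_read_chunk(buffer_page, sync);
}
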
It should be noted that the _cleanup routine may be called before the
full stream of data has been read or written. While writing the image,
the user may (depending upon settings) choose to abort suspending, and
if we are in the midst of writing the last portion of the image, a portion
of the second pageset may be reread. This may also happen if an error
occurs and we seek to abort the process of writing the image.

The modular design is also useful in a number of other ways. It provides
a means whereby we can add support for:

- providing overall initialisation and cleanup routines;
- serialising configuration information in the image header;
- providing debugging information to the user;
- determining memory and image storage requirements;
- dis/enabling components at run-time;
- configuring the module (see below);

...and routines for writers specific to their work:
- Parsing a resume= location;
- Determining whether an image exists;
- Marking a resume as having been attempted;
- Invalidating an image;

Since some parts of the core - the user interface and storage manager
support - have use for some of these functions, they are registered as
'miscellaneous' modules as well.

d) Sysfs data structures.

This brings us naturally to support for configuring TuxOnIce. We desired to
provide a way to make TuxOnIce as flexible and configurable as possible.
The user shouldn't have to reboot just because they now want to hibernate to
a file instead of a partition, for example.

To accomplish this, TuxOnIce implements a very generic means whereby the
core and modules can register new sysfs entries. All TuxOnIce entries use
a single _store and _show routine, both of which are found in
tuxonice_sysfs.c in the kernel/power directory. These routines handle the
most common operations - getting and setting the values of bits, integers,
longs, unsigned longs and strings in one place, and allow overrides for
customised get and set options as well as side-effect routines for all
reads and writes.

When combined with some simple macros, a new sysfs entry can then be defined
in just a couple of lines:

SYSFS_INT("progress_granularity", SYSFS_RW, &progress_granularity, 1,
          2048, 0, NULL),

This defines a sysfs entry named "progress_granularity" which is rw and
allows the user to access an integer stored at &progress_granularity, giving
it a value between 1 and 2048 inclusive.

Sysfs entries are registered under /sys/power/tuxonice, and entries for
modules are located in a subdirectory named after the module.