1 | relay interface (formerly relayfs) |
2 | ================================== | |
3 | ||
4 | The relay interface provides a means for kernel applications to | |
5 | efficiently log and transfer large quantities of data from the kernel | |
6 | to userspace via user-defined 'relay channels'. | |
7 | ||
8 | A 'relay channel' is a kernel->user data relay mechanism implemented | |
9 | as a set of per-cpu kernel buffers ('channel buffers'), each | |
10 | represented as a regular file ('relay file') in user space. Kernel | |
11 | clients write into the channel buffers using efficient write | |
12 | functions; these automatically log into the current cpu's channel | |
13 | buffer. User space applications mmap() or read() from the relay files | |
14 | and retrieve the data as it becomes available. The relay files | |
15 | themselves are files created in a host filesystem, e.g. debugfs, and | |
16 | are associated with the channel buffers using the API described below. | |
17 | ||
18 | The format of the data logged into the channel buffers is completely | |
19 | up to the kernel client; the relay interface does however provide | |
20 | hooks which allow kernel clients to impose some structure on the | |
21 | buffer data. The relay interface doesn't implement any form of data | |
22 | filtering - this also is left to the kernel client. The purpose is to | |
23 | keep things as simple as possible. | |
24 | ||
25 | This document provides an overview of the relay interface API. The | |
26 | details of the function parameters are documented along with the | |
27 | functions in the relay interface code - please see that for details. | |
28 | ||
29 | Semantics | |
30 | ========= | |
31 | ||
32 | Each relay channel has one buffer per CPU; each buffer has one or more | |
33 | sub-buffers. Messages are written to the first sub-buffer until it is | |
34 | too full to contain a new message, in which case the message is written to | |
35 | the next (if available). Messages are never split across sub-buffers. | |
36 | At this point, userspace can be notified so it empties the first | |
37 | sub-buffer, while the kernel continues writing to the next. | |
38 | ||
39 | When notified that a sub-buffer is full, the kernel knows how many | |
40 | bytes of it are padding, i.e. unused space occurring because a complete | |
41 | message couldn't fit into a sub-buffer. Userspace can use this | |
42 | knowledge to copy only valid data. | |
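
The sub-buffer and padding rules described above can be sketched as a small user-space model. This is only an illustration, not the kernel implementation: struct model_buf and model_write() are hypothetical names, and the model ignores consumption and buffer-full handling.

```c
#include <stddef.h>

/* Hypothetical user-space model of one channel buffer; this is not the
 * kernel's struct rchan_buf. */
struct model_buf {
	size_t subbuf_size;	/* bytes per sub-buffer */
	size_t n_subbufs;	/* number of sub-buffers in the buffer */
	size_t cur;		/* index of the sub-buffer being written */
	size_t offset;		/* bytes already used in that sub-buffer */
	size_t last_padding;	/* padding left in the last closed sub-buffer */
};

/*
 * Account for one message of 'len' bytes.  Returns the index of the
 * sub-buffer that received it, or -1 if the message is larger than a
 * sub-buffer and so can never fit.  A message is never split: if it
 * does not fit in the remaining space, that space becomes padding and
 * the message starts the next sub-buffer.
 */
static long model_write(struct model_buf *b, size_t len)
{
	if (len > b->subbuf_size)
		return -1;

	if (b->offset + len > b->subbuf_size) {
		b->last_padding = b->subbuf_size - b->offset;
		b->cur = (b->cur + 1) % b->n_subbufs;
		b->offset = 0;
	}

	b->offset += len;
	return (long)b->cur;
}
```

With 16-byte sub-buffers, two 6-byte messages share the first sub-buffer; a third does not fit in the remaining 4 bytes, so those 4 bytes are recorded as padding and the message goes to the next sub-buffer.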
43 | ||
44 | After copying it, userspace can notify the kernel that a sub-buffer | |
45 | has been consumed. | |
46 | ||
47 | A relay channel can operate in a mode where it will overwrite data not | |
48 | yet collected by userspace, and not wait for it to be consumed. | |
49 | ||
50 | The relay channel itself does not provide for communication of such | |
51 | data between userspace and kernel, allowing the kernel side to remain | |
52 | simple and not impose a single interface on userspace. It does | |
53 | provide a set of examples and a separate helper though, described | |
54 | below. | |
55 | ||
56 | The read() interface both removes padding and internally consumes the | |
57 | read sub-buffers; thus in cases where read(2) is being used to drain | |
58 | the channel buffers, special-purpose communication between kernel and | |
59 | user isn't necessary for basic operation. | |
60 | ||
61 | One of the major goals of the relay interface is to provide a low | |
62 | overhead mechanism for conveying kernel data to userspace. While the | |
63 | read() interface is easy to use, it's not as efficient as the mmap() | |
64 | approach; the example code attempts to make the tradeoff between the | |
65 | two approaches as small as possible. | |
66 | ||
67 | klog and relay-apps example code | |
68 | ================================ | |
69 | ||
70 | The relay interface itself is ready to use, but to make things easier, | |
71 | a couple of simple utility functions and a set of examples are provided. | |
72 | ||
73 | The relay-apps example tarball, available on the relay sourceforge | |
74 | site, contains a set of self-contained examples, each consisting of a | |
75 | pair of .c files containing boilerplate code for each of the user and | |
76 | kernel sides of a relay application. When combined these two sets of | |
77 | boilerplate code provide glue to easily stream data to disk, without | |
78 | having to bother with mundane housekeeping chores. | |
79 | ||
80 | The 'klog debugging functions' patch (klog.patch in the relay-apps | |
81 | tarball) provides a couple of high-level logging functions to the | |
82 | kernel which allow writing formatted text or raw data to a channel, | |
83 | regardless of whether a channel to write into exists or not, or even | |
84 | whether the relay interface is compiled into the kernel or not. These | |
85 | functions allow you to put unconditional 'trace' statements anywhere | |
86 | in the kernel or kernel modules; only when there is a 'klog handler' | |
87 | registered will data actually be logged (see the klog and kleak | |
88 | examples for details). | |
89 | ||
90 | It is of course possible to use the relay interface from scratch, | |
91 | i.e. without using any of the relay-apps example code or klog, but | |
92 | you'll have to implement communication between userspace and kernel, | |
93 | allowing both to convey the state of buffers (full, empty, amount of | |
94 | padding). The read() interface both removes padding and internally | |
95 | consumes the read sub-buffers; thus in cases where read(2) is being | |
96 | used to drain the channel buffers, special-purpose communication | |
97 | between kernel and user isn't necessary for basic operation. Things | |
98 | such as buffer-full conditions would still need to be communicated via | |
99 | some channel though. | |
100 | ||
101 | klog and the relay-apps examples can be found in the relay-apps | |
102 | tarball on http://relayfs.sourceforge.net | |
103 | ||
104 | The relay interface user space API | |
105 | ================================== | |
106 | ||
107 | The relay interface implements basic file operations for user space | |
108 | access to relay channel buffer data. Here are the file operations | |
109 | that are available and some comments regarding their behavior: | |
110 | ||
111 | open() enables user to open an _existing_ channel buffer. | |
112 | ||
113 | mmap() results in channel buffer being mapped into the caller's | |
114 | memory space. Note that you can't do a partial mmap - you | |
115 | must map the entire file, which is NRBUF * SUBBUFSIZE. | |
116 | ||
117 | read() read the contents of a channel buffer. The bytes read are | |
118 | 'consumed' by the reader, i.e. they won't be available | |
119 | again to subsequent reads. If the channel is being used | |
120 | in no-overwrite mode (the default), it can be read at any | |
121 | time even if there's an active kernel writer. If the | |
122 | channel is being used in overwrite mode and there are | |
123 | active channel writers, results may be unpredictable - | |
124 | users should make sure that all logging to the channel has | |
125 | ended before using read() with overwrite mode. Sub-buffer | |
126 | padding is automatically removed and will not be seen by | |
127 | the reader. | |
128 | ||
129 | sendfile() transfer data from a channel buffer to an output file | |
130 | descriptor. Sub-buffer padding is automatically removed | |
131 | and will not be seen by the reader. | |
132 | ||
133 | poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are | |
134 | notified when sub-buffer boundaries are crossed. | |
135 | ||
136 | close() decrements the channel buffer's refcount. When the refcount | |
137 | reaches 0, i.e. when no process or kernel client has the | |
138 | buffer open, the channel buffer is freed. | |
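
As a rough illustration of the read() path listed above, a user-space consumer might drain a relay file with a loop like the following. This is a minimal sketch, not code from the relay-apps tarball: drain_fd() is a hypothetical helper, and it assumes that a read() returning 0 means no more data is available at the moment.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Copy everything currently readable from in_fd to out_fd.
 * Returns the total number of bytes copied, or -1 on error
 * (errno is left set by the failing call).
 */
static long drain_fd(int in_fd, int out_fd)
{
	char chunk[4096];
	long total = 0;

	for (;;) {
		ssize_t n = read(in_fd, chunk, sizeof(chunk));

		if (n == 0)
			return total;	/* nothing more to read right now */
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}

		/* write() may be partial, so loop until the chunk is out */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(out_fd, chunk + off,
					  (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		total += n;
	}
}
```

A caller would typically open one relay file per cpu with open(2) and O_RDONLY, wait for data with poll(), and then call the helper with the relay file descriptor as in_fd. Since read() already strips sub-buffer padding, the output contains only message data.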
139 | ||
140 | In order for a user application to make use of relay files, the | |
141 | host filesystem must be mounted. For example, | |
142 | ||
143 | mount -t debugfs debugfs /sys/kernel/debug | |
144 | |
145 | NOTE: the host filesystem doesn't need to be mounted for kernel | |
146 | clients to create or use channels - it only needs to be | |
147 | mounted when user space applications need access to the buffer | |
148 | data. | |
149 | ||
150 | ||
151 | The relay interface kernel API | |
152 | ============================== | |
153 | ||
154 | Here's a summary of the API the relay interface provides to in-kernel clients: | |
155 | ||
157 | channel management functions: | |
158 | ||
159 | relay_open(base_filename, parent, subbuf_size, n_subbufs, | |
160 | callbacks, private_data) | |
161 | relay_close(chan) |
162 | relay_flush(chan) | |
163 | relay_reset(chan) | |
164 | ||
165 | channel management typically called on instigation of userspace: | |
166 | ||
167 | relay_subbufs_consumed(chan, cpu, subbufs_consumed) | |
168 | ||
169 | write functions: | |
170 | ||
171 | relay_write(chan, data, length) | |
172 | __relay_write(chan, data, length) | |
173 | relay_reserve(chan, length) | |
174 | ||
175 | callbacks: | |
176 | ||
177 | subbuf_start(buf, subbuf, prev_subbuf, prev_padding) | |
178 | buf_mapped(buf, filp) | |
179 | buf_unmapped(buf, filp) | |
180 | create_buf_file(filename, parent, mode, buf, is_global) | |
181 | remove_buf_file(dentry) | |
182 | ||
183 | helper functions: | |
184 | ||
185 | relay_buf_full(buf) | |
186 | subbuf_start_reserve(buf, length) | |
187 | ||
188 | ||
189 | Creating a channel | |
190 | ------------------ | |
191 | ||
192 | relay_open() is used to create a channel, along with its per-cpu | |
193 | channel buffers. Each channel buffer will have an associated file | |
194 | created for it in the host filesystem, which can be mmapped or | |
195 | read from in user space. The files are named basename0...basenameN-1 | |
196 | where N is the number of online cpus, and by default will be created | |
197 | in the root of the filesystem (if the parent param is NULL). If you | |
198 | want a directory structure to contain your relay files, you should | |
199 | create it using the host filesystem's directory creation function, | |
200 | e.g. debugfs_create_dir(), and pass the parent directory to | |
201 | relay_open(). Users are responsible for cleaning up any directory | |
202 | structure they create, when the channel is closed - again the host | |
203 | filesystem's directory removal functions should be used for that, | |
204 | e.g. debugfs_remove(). | |
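
On the user-space side, the basename0...basenameN-1 naming scheme means the path of each per-cpu relay file can be derived from the mount point, the base filename and the cpu number. The helper below is a hypothetical sketch (relay_file_path() is not part of any relay API) and assumes the files were created directly under the directory passed in.

```c
#include <stdio.h>

/*
 * Build "<dir>/<base><cpu>" into 'out'.  Returns 0 on success or -1 if
 * the result would not fit in 'out_len' bytes (including the NUL).
 */
static int relay_file_path(char *out, size_t out_len, const char *dir,
			   const char *base, unsigned int cpu)
{
	int n = snprintf(out, out_len, "%s/%s%u", dir, base, cpu);

	return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

For a channel opened with the base filename "cpu" and a NULL parent, with debugfs mounted on /sys/kernel/debug, the file for cpu 3 would then be /sys/kernel/debug/cpu3.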
205 | ||
206 | In order for a channel to be created and the host filesystem's files | |
207 | associated with its channel buffers, the user must provide definitions | |
208 | for two callback functions, create_buf_file() and remove_buf_file(). | |
209 | create_buf_file() is called once for each per-cpu buffer from | |
210 | relay_open() and allows the user to create the file which will be used | |
211 | to represent the corresponding channel buffer. The callback should | |
212 | return the dentry of the file created to represent the channel buffer. | |
213 | remove_buf_file() must also be defined; it's responsible for deleting | |
214 | the file(s) created in create_buf_file() and is called during | |
215 | relay_close(). | |
216 | ||
217 | Here are some typical definitions for these callbacks, in this case | |
218 | using debugfs: | |
219 | ||
220 | /* | |
221 |  * create_buf_file() callback. Creates relay file in debugfs. | |
222 |  */ | |
223 | static struct dentry *create_buf_file_handler(const char *filename, | |
224 |                                               struct dentry *parent, | |
225 |                                               int mode, | |
226 |                                               struct rchan_buf *buf, | |
227 |                                               int *is_global) | |
228 | { | |
229 |         return debugfs_create_file(filename, mode, parent, buf, | |
230 |                                    &relay_file_operations); | |
231 | } | |
232 | ||
233 | /* | |
234 |  * remove_buf_file() callback. Removes relay file from debugfs. | |
235 |  */ | |
236 | static int remove_buf_file_handler(struct dentry *dentry) | |
237 | { | |
238 |         debugfs_remove(dentry); | |
239 | ||
240 |         return 0; | |
241 | } | |
242 | ||
243 | /* | |
244 |  * relay interface callbacks | |
245 |  */ | |
246 | static struct rchan_callbacks relay_callbacks = | |
247 | { | |
248 |         .create_buf_file = create_buf_file_handler, | |
249 |         .remove_buf_file = remove_buf_file_handler, | |
250 | }; | |
251 | ||
252 | And an example relay_open() invocation using them: | |
253 | ||
254 | chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL); | |
255 | |
256 | If the create_buf_file() callback fails, or isn't defined, channel | |
257 | creation and thus relay_open() will fail. | |
258 | ||
259 | The total size of each per-cpu buffer is calculated by multiplying the | |
260 | number of sub-buffers by the sub-buffer size passed into relay_open(). | |
261 | The idea behind sub-buffers is that they're basically an extension of | |
262 | double-buffering to N buffers, and they also allow applications to | |
263 | easily implement random-access-on-buffer-boundary schemes, which can | |
264 | be important for some high-volume applications. The number and size | |
265 | of sub-buffers is completely dependent on the application and even for | |
266 | the same application, different conditions will warrant different | |
267 | values for these parameters at different times. Typically, the right | |
268 | values to use are best decided after some experimentation; in general, | |
269 | though, it's safe to assume that having only 1 sub-buffer is a bad | |
270 | idea - you're guaranteed to either overwrite data or lose events | |
271 | depending on the channel mode being used. | |
272 | ||
273 | The create_buf_file() implementation can also be defined in such a way | |
274 | as to allow the creation of a single 'global' buffer instead of the | |
275 | default per-cpu set. This can be useful for applications interested | |
276 | mainly in seeing the relative ordering of system-wide events without | |
277 | the need to bother with saving explicit timestamps for the purpose of | |
278 | merging/sorting per-cpu files in a postprocessing step. | |
279 | ||
280 | To have relay_open() create a global buffer, the create_buf_file() | |
281 | implementation should set the value of the is_global outparam to a | |
282 | non-zero value in addition to creating the file that will be used to | |
283 | represent the single buffer. In the case of a global buffer, | |
284 | create_buf_file() and remove_buf_file() will be called only once. The | |
285 | normal channel-writing functions, e.g. relay_write(), can still be | |
286 | used - writes from any cpu will transparently end up in the global | |
287 | buffer - but since it is a global buffer, callers should make sure | |
288 | they use the proper locking for such a buffer, either by wrapping | |
289 | writes in a spinlock, or by copying a write function from relay.h and | |
290 | creating a local version that internally does the proper locking. | |
291 | ||
292 | The private_data passed into relay_open() allows clients to associate |
293 | user-defined data with a channel, and is immediately available | |
294 | (including in create_buf_file()) via chan->private_data or | |
295 | buf->chan->private_data. | |
296 | ||
297 | Channel 'modes' |
298 | --------------- | |
299 | ||
300 | Relay channels can be used in either of two modes - 'overwrite' or | |
301 | 'no-overwrite'. The mode is entirely determined by the implementation | |
302 | of the subbuf_start() callback, as described below. The default if no | |
303 | subbuf_start() callback is defined is 'no-overwrite' mode. If the | |
304 | default mode suits your needs, and you plan to use the read() | |
305 | interface to retrieve channel data, you can ignore the details of this | |
306 | section, as it pertains mainly to mmap() implementations. | |
307 | ||
308 | In 'overwrite' mode, also known as 'flight recorder' mode, writes | |
309 | continuously cycle around the buffer and will never fail, but will | |
310 | unconditionally overwrite old data regardless of whether it's actually | |
311 | been consumed. In no-overwrite mode, writes will fail, i.e. data will | |
312 | be lost, if the number of unconsumed sub-buffers equals the total | |
313 | number of sub-buffers in the channel. It should be clear that if | |
314 | there is no consumer or if the consumer can't consume sub-buffers fast | |
315 | enough, data will be lost in either case; the only difference is | |
316 | whether data is lost from the beginning or the end of a buffer. | |
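
The difference between the two modes can be illustrated with a simplified user-space model that only counts sub-buffers. It is a sketch under stated assumptions: struct mode_model and its helpers are hypothetical, and the 'full' test simply mirrors the rule given above (unconsumed sub-buffers equal to the total number of sub-buffers).

```c
#include <stddef.h>

/* Hypothetical bookkeeping for one channel buffer. */
struct mode_model {
	size_t n_subbufs;	/* total sub-buffers in the buffer */
	size_t produced;	/* sub-buffers filled by the writer */
	size_t consumed;	/* sub-buffers reported as consumed */
	size_t refused;		/* switches refused in no-overwrite mode */
};

/* Full when every sub-buffer holds data that has not been consumed. */
static int model_buf_full(const struct mode_model *m)
{
	return m->produced - m->consumed >= m->n_subbufs;
}

/*
 * The writer wants to fill another sub-buffer.  In no-overwrite mode
 * the request is refused while the buffer is full (new data is lost);
 * in overwrite mode it always succeeds (old data may be lost).
 * Returns 1 if the writer may proceed, 0 otherwise.
 */
static int model_switch(struct mode_model *m, int overwrite)
{
	if (!overwrite && model_buf_full(m)) {
		m->refused++;
		return 0;
	}
	m->produced++;
	return 1;
}

/* Sub-buffers whose unconsumed contents have been overwritten. */
static size_t model_overwritten(const struct mode_model *m)
{
	size_t pending = m->produced - m->consumed;

	return pending > m->n_subbufs ? pending - m->n_subbufs : 0;
}
```

With two sub-buffers and no consumer, the no-overwrite writer fills both and is then refused, so the newest data is lost. The overwrite writer keeps going, and after five sub-buffers' worth of data the three oldest have been overwritten.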
317 | ||
318 | As explained above, a relay channel is made up of one or more | |
319 | per-cpu channel buffers, each implemented as a circular buffer | |
320 | subdivided into one or more sub-buffers. Messages are written into | |
321 | the current sub-buffer of the channel's current per-cpu buffer via the | |
322 | write functions described below. Whenever a message can't fit into | |
323 | the current sub-buffer, because there's no room left for it, the | |
324 | client is notified via the subbuf_start() callback that a switch to a | |
325 | new sub-buffer is about to occur. The client uses this callback to 1) | |
326 | initialize the next sub-buffer if appropriate 2) finalize the previous | |
327 | sub-buffer if appropriate and 3) return a boolean value indicating | |
328 | whether or not to actually move on to the next sub-buffer. | |
329 | ||
330 | To implement 'no-overwrite' mode, the kernel client would provide | |
331 | an implementation of the subbuf_start() callback something like the | |
332 | following: | |
333 | ||
334 | static int subbuf_start(struct rchan_buf *buf, | |
335 |                         void *subbuf, | |
336 |                         void *prev_subbuf, | |
337 |                         unsigned int prev_padding) | |
338 | { | |
339 |         if (prev_subbuf) | |
340 |                 *((unsigned *)prev_subbuf) = prev_padding; | |
341 | ||
342 |         if (relay_buf_full(buf)) | |
343 |                 return 0; | |
344 | ||
345 |         subbuf_start_reserve(buf, sizeof(unsigned int)); | |
346 | ||
347 |         return 1; | |
348 | } | |
349 | ||
350 | If the current buffer is full, i.e. all sub-buffers remain unconsumed, | |
351 | the callback returns 0 to indicate that the buffer switch should not | |
352 | occur yet, i.e. until the consumer has had a chance to read the | |
353 | current set of ready sub-buffers. For the relay_buf_full() function | |
354 | to make sense, the consumer is responsible for notifying the relay | |
355 | interface when sub-buffers have been consumed via |
356 | relay_subbufs_consumed(). Any subsequent attempts to write into the | |
357 | buffer will again invoke the subbuf_start() callback with the same | |
358 | parameters; only when the consumer has consumed one or more of the | |
359 | ready sub-buffers will relay_buf_full() return 0, in which case the | |
360 | buffer switch can continue. | |
361 | ||
362 | The implementation of the subbuf_start() callback for 'overwrite' mode | |
363 | would be very similar: | |
364 | ||
365 | static int subbuf_start(struct rchan_buf *buf, | |
366 |                         void *subbuf, | |
367 |                         void *prev_subbuf, | |
368 |                         unsigned int prev_padding) | |
369 | { | |
370 |         if (prev_subbuf) | |
371 |                 *((unsigned *)prev_subbuf) = prev_padding; | |
372 | ||
373 |         subbuf_start_reserve(buf, sizeof(unsigned int)); | |
374 | ||
375 |         return 1; | |
376 | } | |
377 | ||
378 | In this case, the relay_buf_full() check is meaningless and the | |
379 | callback always returns 1, causing the buffer switch to occur | |
380 | unconditionally. It's also meaningless for the client to use the | |
381 | relay_subbufs_consumed() function in this mode, as it's never | |
382 | consulted. | |
383 | ||
384 | The default subbuf_start() implementation, used if the client doesn't | |
385 | define any callbacks, or doesn't define the subbuf_start() callback, | |
386 | implements the simplest possible 'no-overwrite' mode, i.e. it does | |
387 | nothing but return 0. | |
388 | ||
389 | Header information can be reserved at the beginning of each sub-buffer | |
390 | by calling the subbuf_start_reserve() helper function from within the | |
391 | subbuf_start() callback. This reserved area can be used to store | |
392 | whatever information the client wants. In the example above, room is | |
393 | reserved in each sub-buffer to store the padding count for that | |
394 | sub-buffer. This is filled in for the previous sub-buffer in the | |
395 | subbuf_start() implementation; the padding value for the previous | |
396 | sub-buffer is passed into the subbuf_start() callback along with a | |
397 | pointer to the previous sub-buffer, since the padding value isn't | |
398 | known until a sub-buffer is filled. The subbuf_start() callback is | |
399 | also called for the first sub-buffer when the channel is opened, to | |
400 | give the client a chance to reserve space in it. In this case the | |
401 | previous sub-buffer pointer passed into the callback will be NULL, so | |
402 | the client should check the value of the prev_subbuf pointer before | |
403 | writing into the previous sub-buffer. | |
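
A user-space reader of an mmapped buffer could use the header written by the example subbuf_start() callbacks to locate the valid data in a finished sub-buffer. The sketch below assumes exactly that layout (an unsigned int padding count at the start of each sub-buffer) and that the reader runs on the same machine as the kernel, so sizes and byte order match. struct subbuf_view and subbuf_payload() are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Location of the valid message data inside one sub-buffer. */
struct subbuf_view {
	const unsigned char *data;	/* first payload byte */
	size_t len;			/* number of valid payload bytes */
};

/*
 * Interpret a finished sub-buffer laid out as:
 *   [unsigned int padding][payload ...][padding bytes]
 * Returns 0 on success, or -1 if the header is inconsistent with the
 * sub-buffer size.
 */
static int subbuf_payload(const void *subbuf, size_t subbuf_size,
			  struct subbuf_view *out)
{
	unsigned int padding;

	if (subbuf_size < sizeof(padding))
		return -1;

	/* memcpy avoids any alignment assumptions about 'subbuf' */
	memcpy(&padding, subbuf, sizeof(padding));
	if (padding > subbuf_size - sizeof(padding))
		return -1;

	out->data = (const unsigned char *)subbuf + sizeof(padding);
	out->len = subbuf_size - sizeof(padding) - padding;
	return 0;
}
```

Note that in the example callbacks the padding count of a sub-buffer is only filled in when the following sub-buffer is started, so the header of the sub-buffer currently being written is not yet meaningful.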
404 | ||
405 | Writing to a channel | |
406 | -------------------- | |
407 | ||
408 | Kernel clients write data into the current cpu's channel buffer using | |
409 | relay_write() or __relay_write(). relay_write() is the main logging | |
410 | function - it uses local_irq_save() to protect the buffer and should be | |
411 | used if you might be logging from interrupt context. If you know | |
412 | you'll never be logging from interrupt context, you can use | |
413 | __relay_write(), which only disables preemption. These functions | |
414 | don't return a value, so you can't determine whether or not they | |
415 | failed - the assumption is that you wouldn't want to check a return | |
416 | value in the fast logging path anyway, and that they'll always succeed | |
417 | unless the buffer is full and no-overwrite mode is being used, in | |
418 | which case you can detect a failed write in the subbuf_start() | |
419 | callback by calling the relay_buf_full() helper function. | |
420 | ||
421 | relay_reserve() is used to reserve a slot in a channel buffer which | |
422 | can be written to later. This would typically be used in applications | |
423 | that need to write directly into a channel buffer without having to | |
424 | stage data in a temporary buffer beforehand. Because the actual write | |
425 | may not happen immediately after the slot is reserved, applications | |
426 | using relay_reserve() can keep a count of the number of bytes actually | |
427 | written, either in space reserved in the sub-buffers themselves or as | |
428 | a separate array. See the 'reserve' example in the relay-apps tarball | |
429 | at http://relayfs.sourceforge.net for an example of how this can be | |
430 | done. Because the write is under control of the client and is | |
431 | separated from the reserve, relay_reserve() doesn't protect the buffer | |
432 | at all - it's up to the client to provide the appropriate | |
433 | synchronization when using relay_reserve(). | |
434 | ||
435 | Closing a channel | |
436 | ----------------- | |
437 | ||
438 | The client calls relay_close() when it's finished using the channel. | |
439 | The channel and its associated buffers are destroyed when there are no | |
440 | longer any references to any of the channel buffers. relay_flush() | |
441 | forces a sub-buffer switch on all the channel buffers, and can be used | |
442 | to finalize and process the last sub-buffers before the channel is | |
443 | closed. | |
444 | ||
445 | Misc | |
446 | ---- | |
447 | ||
448 | Some applications may want to keep a channel around and re-use it | |
449 | rather than open and close a new channel for each use. relay_reset() | |
450 | can be used for this purpose - it resets a channel to its initial | |
451 | state without reallocating channel buffer memory or destroying | |
452 | existing mappings. It should however only be called when it's safe to | |
453 | do so, i.e. when the channel isn't currently being written to. | |
454 | ||
455 | Finally, there are a couple of utility callbacks that can be used for | |
456 | different purposes. buf_mapped() is called whenever a channel buffer | |
457 | is mmapped from user space and buf_unmapped() is called when it's | |
458 | unmapped. The client can use this notification to trigger actions | |
459 | within the kernel application, such as enabling/disabling logging to | |
460 | the channel. | |
461 | ||
462 | ||
463 | Resources | |
464 | ========= | |
465 | ||
466 | For news, example code, mailing list, etc. see the relay interface homepage: | |
467 | ||
468 | http://relayfs.sourceforge.net | |
469 | ||
470 | ||
471 | Credits | |
472 | ======= | |
473 | ||
474 | The ideas and specs for the relay interface came about as a result of | |
475 | discussions on tracing involving the following: | |
476 | ||
477 | Michel Dagenais <michel.dagenais@polymtl.ca> | |
478 | Richard Moore <richardj_moore@uk.ibm.com> | |
479 | Bob Wisniewski <bob@watson.ibm.com> | |
480 | Karim Yaghmour <karim@opersys.com> | |
481 | Tom Zanussi <zanussi@us.ibm.com> | |
482 | ||
483 | Also thanks to Hubertus Franke for a lot of useful suggestions and bug | |
484 | reports. |