Commit | Line | Data |
---|---|---|
7063fbf2 | 1 | |
6c28f2c0 | 2 | configfs - Userspace-driven kernel object configuration. |
7063fbf2 JB |
3 | |
4 | Joel Becker <joel.becker@oracle.com> | |
5 | ||
6 | Updated: 31 March 2005 | |
7 | ||
8 | Copyright (c) 2005 Oracle Corporation, | |
9 | Joel Becker <joel.becker@oracle.com> | |
10 | ||
11 | ||
12 | [What is configfs?] | |
13 | ||
14 | configfs is a RAM-based filesystem that provides the converse of | |
15 | sysfs's functionality. Where sysfs is a filesystem-based view of | |
16 | kernel objects, configfs is a filesystem-based manager of kernel | |
17 | objects, or config_items. | |
18 | ||
19 | With sysfs, an object is created in kernel (for example, when a device | |
20 | is discovered) and it is registered with sysfs. Its attributes then | |
21 | appear in sysfs, allowing userspace to read the attributes via | |
22 | readdir(3)/read(2). It may allow some attributes to be modified via | |
23 | write(2). The important point is that the object is created and | |
24 | destroyed in kernel, the kernel controls the lifecycle of the sysfs | |
25 | representation, and sysfs is merely a window on all this. | |
26 | ||
27 | A configfs config_item is created via an explicit userspace operation: | |
28 | mkdir(2). It is destroyed via rmdir(2). The attributes appear at | |
29 | mkdir(2) time, and can be read or modified via read(2) and write(2). | |
30 | As with sysfs, readdir(3) queries the list of items and/or attributes. | |
31 | symlink(2) can be used to group items together. Unlike sysfs, the | |
32 | lifetime of the representation is completely driven by userspace. The | |
33 | kernel modules backing the items must respond to this. | |
34 | ||
35 | Both sysfs and configfs can and should exist together on the same | |
36 | system. One is not a replacement for the other. | |
37 | ||
38 | [Using configfs] | |
39 | ||
40 | configfs can be compiled as a module or into the kernel. You can access | |
41 | it by doing | |
42 | ||
43 | mount -t configfs none /config | |
44 | ||
45 | The configfs tree will be empty unless client modules are also loaded. | |
46 | These are modules that register their item types with configfs as | |
47 | subsystems. Once a client subsystem is loaded, it will appear as a | |
48 | subdirectory (or more than one) under /config. Like sysfs, the | |
49 | configfs tree is always there, whether mounted on /config or not. | |
50 | ||
51 | An item is created via mkdir(2). The item's attributes will also | |
52 | appear at this time. readdir(3) can determine what the attributes are, | |
53 | read(2) can query their default values, and write(2) can store new | |
54 | values. Like sysfs, attributes should be ASCII text files, preferably | |
55 | with only one value per file. The same efficiency caveats from sysfs | |
56 | apply. Don't mix more than one attribute in one attribute file. | |
57 | ||
58 | Like sysfs, configfs expects write(2) to store the entire buffer at | |
59 | once. When writing to configfs attributes, userspace processes should | |
60 | first read the entire file, modify the portions they wish to change, and | |
61 | then write the entire buffer back. Attribute files have a maximum size | |
62 | of one page (PAGE_SIZE, 4096 on i386). | |
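The read, modify, write-back cycle described above might look like this
from a userspace C program. This is a minimal sketch: attr_load(),
attr_store() and ATTR_MAX are hypothetical names, not part of any configfs
interface, and the 4096-byte limit assumes the i386 page size mentioned
above.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed attribute size limit: one page, 4096 bytes on i386. */
#define ATTR_MAX 4096

/*
 * Read a whole attribute file into buf and NUL-terminate it.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t attr_load(const char *path, char *buf, size_t bufsz)
{
	size_t total = 0;
	ssize_t n = 0;
	int fd;

	if (bufsz == 0)
		return -1;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	/* Keep reading until EOF, an error, or the buffer is full. */
	while (total < bufsz - 1 &&
	       (n = read(fd, buf + total, bufsz - 1 - total)) > 0)
		total += (size_t)n;
	close(fd);
	if (n < 0)
		return -1;
	buf[total] = '\0';
	return (ssize_t)total;
}

/*
 * Store a complete new value with a single write(2), so the kernel
 * sees the entire buffer at once. Returns 0 on success, -1 on error.
 */
int attr_store(const char *path, const char *value)
{
	size_t len = strlen(value);
	ssize_t n;
	int fd;

	if (len > ATTR_MAX)
		return -1;
	fd = open(path, O_WRONLY | O_TRUNC);
	if (fd < 0)
		return -1;
	n = write(fd, value, len);
	close(fd);
	return n == (ssize_t)len ? 0 : -1;
}
```

A caller would fetch the current value with attr_load(), edit the buffer in
memory, and hand the whole buffer back to attr_store().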
63 | ||
64 | When an item needs to be destroyed, remove it with rmdir(2). An | |
65 | item cannot be destroyed if any other item has a link to it (via | |
66 | symlink(2)). Links can be removed via unlink(2). | |
67 | ||
68 | [Configuring FakeNBD: an Example] | |
69 | ||
70 | Imagine there's a Network Block Device (NBD) driver that allows you to | |
71 | access remote block devices. Call it FakeNBD. FakeNBD uses configfs | |
72 | for its configuration. Obviously, there will be a nice program that | |
73 | sysadmins use to configure FakeNBD, but somehow that program has to tell | |
74 | the driver about it. Here's where configfs comes in. | |
75 | ||
76 | When the FakeNBD driver is loaded, it registers itself with configfs. | |
77 | readdir(3) sees this just fine: | |
78 | ||
79 | # ls /config | |
80 | fakenbd | |
81 | ||
82 | A fakenbd connection can be created with mkdir(2). The name is | |
83 | arbitrary, but likely the tool will make some use of the name. Perhaps | |
84 | it is a uuid or a disk name: | |
85 | ||
86 | # mkdir /config/fakenbd/disk1 | |
87 | # ls /config/fakenbd/disk1 | |
88 | target device rw | |
89 | ||
90 | The target attribute contains the IP address of the server FakeNBD will | |
91 | connect to. The device attribute is the device on the server. | |
92 | Predictably, the rw attribute determines whether the connection is | |
93 | read-only or read-write. | |
94 | ||
95 | # echo 10.0.0.1 > /config/fakenbd/disk1/target | |
96 | # echo /dev/sda1 > /config/fakenbd/disk1/device | |
97 | # echo 1 > /config/fakenbd/disk1/rw | |
98 | ||
99 | That's it. That's all there is. Now the device is configured, via the | |
100 | shell no less. | |
101 | ||
102 | [Coding With configfs] | |
103 | ||
104 | Every object in configfs is a config_item. A config_item reflects an | |
105 | object in the subsystem. It has attributes that match values on that | |
106 | object. configfs handles the filesystem representation of that object | |
107 | and its attributes, allowing the subsystem to ignore all but the | |
108 | basic show/store interaction. | |
109 | ||
110 | Items are created and destroyed inside a config_group. A group is a | |
111 | collection of items that share the same attributes and operations. | |
112 | Items are created by mkdir(2) and removed by rmdir(2), but configfs | |
113 | handles that. The group has a set of operations to perform these tasks. | |
114 | ||
115 | A subsystem is the top level of a client module. During initialization, | |
116 | the client module registers the subsystem with configfs, and the subsystem | |
117 | appears as a directory at the top of the configfs filesystem. A | |
118 | subsystem is also a config_group, and can do everything a config_group | |
119 | can. | |
120 | ||
121 | [struct config_item] | |
122 | ||
123 | struct config_item { | |
124 | char *ci_name; | |
125 | char ci_namebuf[UOBJ_NAME_LEN]; | |
126 | struct kref ci_kref; | |
127 | struct list_head ci_entry; | |
128 | struct config_item *ci_parent; | |
129 | struct config_group *ci_group; | |
130 | struct config_item_type *ci_type; | |
131 | struct dentry *ci_dentry; | |
132 | }; | |
133 | ||
134 | void config_item_init(struct config_item *); | |
135 | void config_item_init_type_name(struct config_item *, | |
136 | const char *name, | |
137 | struct config_item_type *type); | |
138 | struct config_item *config_item_get(struct config_item *); | |
139 | void config_item_put(struct config_item *); | |
140 | ||
141 | Generally, struct config_item is embedded in a container structure, a | |
142 | structure that actually represents what the subsystem is doing. The | |
143 | config_item portion of that structure is how the object interacts with | |
144 | configfs. | |
145 | ||
146 | Whether statically defined in a source file or created by a parent | |
147 | config_group, a config_item must have one of the _init() functions | |
148 | called on it. This initializes the reference count and sets up the | |
149 | appropriate fields. | |
150 | ||
151 | All users of a config_item should have a reference on it via | |
152 | config_item_get(), and drop the reference when they are done via | |
153 | config_item_put(). | |
154 | ||
155 | By itself, a config_item cannot do much more than appear in configfs. | |
156 | Usually a subsystem wants the item to display and/or store attributes, | |
157 | among other things. For that, it needs a type. | |
158 | ||
159 | [struct config_item_type] | |
160 | ||
161 | struct configfs_item_operations { | |
162 | void (*release)(struct config_item *); | |
163 | ssize_t (*show_attribute)(struct config_item *, | |
164 | struct configfs_attribute *, | |
165 | char *); | |
166 | ssize_t (*store_attribute)(struct config_item *, | |
167 | struct configfs_attribute *, | |
168 | const char *, size_t); | |
169 | int (*allow_link)(struct config_item *src, | |
170 | struct config_item *target); | |
171 | int (*drop_link)(struct config_item *src, | |
172 | struct config_item *target); | |
173 | }; | |
174 | ||
175 | struct config_item_type { | |
176 | struct module *ct_owner; | |
177 | struct configfs_item_operations *ct_item_ops; | |
178 | struct configfs_group_operations *ct_group_ops; | |
179 | struct configfs_attribute **ct_attrs; | |
180 | }; | |
181 | ||
182 | The most basic function of a config_item_type is to define what | |
183 | operations can be performed on a config_item. All items that have been | |
184 | allocated dynamically will need to provide the ct_item_ops->release() | |
185 | method. This method is called when the config_item's reference count | |
186 | reaches zero. Items that wish to display an attribute need to provide | |
187 | the ct_item_ops->show_attribute() method. Similarly, storing a new | |
188 | attribute value uses the store_attribute() method. | |
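The embedding of an item in a container, and the release() call at
reference count zero, can be modelled outside the kernel. The sketch below
is a userspace stand-in, not kernel code: struct model_item,
model_item_init(), model_item_get() and model_item_put() are hypothetical
names that mimic config_item and its helpers, struct fakenbd_conn is an
invented container, and a plain integer replaces struct kref (so the model
is not thread-safe).

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for struct config_item (hypothetical). */
struct model_item {
	int refcount;                           /* stands in for ci_kref */
	void (*release)(struct model_item *);   /* called at refcount zero */
};

/* The container: what the subsystem actually cares about. */
struct fakenbd_conn {
	int rw;
	struct model_item item;                 /* embedded item */
};

/* Recover the container from a pointer to its embedded item. */
struct fakenbd_conn *to_conn(struct model_item *item)
{
	return (struct fakenbd_conn *)((char *)item -
				       offsetof(struct fakenbd_conn, item));
}

int conns_released;   /* counts release calls, for demonstration only */

/* Release method: frees the dynamically allocated container. */
void conn_release(struct model_item *item)
{
	free(to_conn(item));
	conns_released++;
}

/* Mimics an _init() call: the creator starts with one reference. */
void model_item_init(struct model_item *item,
		     void (*release)(struct model_item *))
{
	item->refcount = 1;
	item->release = release;
}

/* Take an additional reference. */
struct model_item *model_item_get(struct model_item *item)
{
	item->refcount++;
	return item;
}

/* Drop a reference; the last put triggers release(). */
void model_item_put(struct model_item *item)
{
	if (--item->refcount == 0 && item->release)
		item->release(item);
}
```

In this model, every model_item_get() must be balanced by a
model_item_put(), and the final put frees the container through the
release method.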
189 | ||
190 | [struct configfs_attribute] | |
191 | ||
192 | struct configfs_attribute { | |
193 | char *ca_name; | |
194 | struct module *ca_owner; | |
195 | mode_t ca_mode; | |
196 | }; | |
197 | ||
198 | When a config_item wants an attribute to appear as a file in the item's | |
199 | configfs directory, it must define a configfs_attribute describing it. | |
200 | It then adds the attribute to the NULL-terminated array | |
201 | config_item_type->ct_attrs. When the item appears in configfs, the | |
202 | attribute file will appear with the configfs_attribute->ca_name | |
203 | filename. configfs_attribute->ca_mode specifies the file permissions. | |
204 | ||
205 | If an attribute is readable and the config_item provides a | |
206 | ct_item_ops->show_attribute() method, that method will be called | |
207 | whenever userspace asks for a read(2) on the attribute. The converse | |
208 | will happen for write(2). | |
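Because one show/store pair serves every attribute of an item, the methods
typically compare the attribute pointer they receive against the
attributes the item defined. A userspace model of that dispatch follows.
It assumes hypothetical stand-in types (struct model_attr, struct conn)
rather than the kernel structures, and the "port" attribute is invented
for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Stand-in for struct configfs_attribute (hypothetical). */
struct model_attr {
	const char *name;
	mode_t mode;
};

/* Invented object whose values the attributes expose. */
struct conn {
	int rw;
	int port;
};

struct model_attr conn_attr_rw   = { "rw",   0644 };
struct model_attr conn_attr_port = { "port", 0444 };   /* read-only */

/* NULL-terminated, in the manner of config_item_type->ct_attrs. */
struct model_attr *conn_attrs[] = { &conn_attr_rw, &conn_attr_port, NULL };

/* Format the requested attribute into page; returns bytes written. */
ssize_t conn_show(struct conn *c, struct model_attr *attr, char *page)
{
	if (attr == &conn_attr_rw)
		return sprintf(page, "%d\n", c->rw);
	if (attr == &conn_attr_port)
		return sprintf(page, "%d\n", c->port);
	return -1;
}

/* Parse page and store it; returns count on success, -1 on error. */
ssize_t conn_store(struct conn *c, struct model_attr *attr,
		   const char *page, size_t count)
{
	char *end;
	long val;

	if (attr != &conn_attr_rw)
		return -1;              /* only rw is writable in this model */
	val = strtol(page, &end, 10);
	if (end == page || (val != 0 && val != 1))
		return -1;              /* reject anything but 0 or 1 */
	c->rw = (int)val;
	return (ssize_t)count;
}
```

Returning the full count from the store method mirrors the expectation
that a write(2) delivers the entire value in one buffer.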
209 | ||
210 | [struct config_group] | |
211 | ||
4ae0edc2 | 212 | A config_item cannot live in a vacuum. The only way one can be created |
7063fbf2 JB |
213 | is via mkdir(2) on a config_group. This will trigger creation of a |
214 | child item. | |
215 | ||
216 | struct config_group { | |
217 | struct config_item cg_item; | |
218 | struct list_head cg_children; | |
219 | struct configfs_subsystem *cg_subsys; | |
220 | struct config_group **default_groups; | |
221 | }; | |
222 | ||
223 | void config_group_init(struct config_group *group); | |
224 | void config_group_init_type_name(struct config_group *group, | |
225 | const char *name, | |
226 | struct config_item_type *type); | |
227 | ||
228 | ||
229 | The config_group structure contains a config_item. Properly configuring | |
230 | that item means that a group can behave as an item in its own right. | |
231 | However, it can do more: it can create child items or groups. This is | |
232 | accomplished via the group operations specified on the group's | |
233 | config_item_type. | |
234 | ||
235 | struct configfs_group_operations { | |
f89ab861 JB |
236 | struct config_item *(*make_item)(struct config_group *group, |
237 | const char *name); | |
238 | struct config_group *(*make_group)(struct config_group *group, | |
239 | const char *name); | |
7063fbf2 | 240 | int (*commit_item)(struct config_item *item); |
299894cc JB |
241 | void (*disconnect_notify)(struct config_group *group, |
242 | struct config_item *item); | |
7063fbf2 JB |
243 | void (*drop_item)(struct config_group *group, |
244 | struct config_item *item); | |
245 | }; | |
246 | ||
247 | A group creates child items by providing the ct_group_ops->make_item() | |
248 | method. If provided, this method is called from mkdir(2) in the group's | |
249 | directory. The subsystem allocates a new config_item (or more likely, its | |
250 | container structure), initializes it, and returns it to configfs. | |
251 | Configfs will then populate the filesystem tree to reflect the new item. | |
252 | ||
253 | If the subsystem wants the child to be a group itself, the subsystem | |
254 | provides ct_group_ops->make_group(). Everything else behaves the same, | |
255 | using the group _init() functions on the group. | |
256 | ||
257 | Finally, when userspace calls rmdir(2) on the item or group, | |
258 | ct_group_ops->drop_item() is called. As a config_group is also a | |
53cb4726 | 259 | config_item, it is not necessary for a separate drop_group() method. |
7063fbf2 JB |
260 | The subsystem must config_item_put() the reference that was initialized |
261 | upon item allocation. If a subsystem has no work to do, it may omit | |
262 | the ct_group_ops->drop_item() method, and configfs will call | |
263 | config_item_put() on the item on behalf of the subsystem. | |
264 | ||
265 | IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2) | |
266 | is called, configfs WILL remove the item from the filesystem tree | |
267 | (assuming that it has no children to keep it busy). The subsystem is | |
268 | responsible for responding to this. If the subsystem has references to | |
269 | the item in other threads, the memory is safe. It may take some time | |
270 | for the item to actually disappear from the subsystem's usage. But it | |
271 | is gone from configfs. | |
272 | ||
299894cc JB |
273 | When drop_item() is called, the item's linkage has already been torn |
274 | down. It no longer has a reference on its parent and has no place in | |
275 | the item hierarchy. If a client needs to do some cleanup before this | |
276 | teardown happens, the subsystem can implement the | |
277 | ct_group_ops->disconnect_notify() method. The method is called after | |
278 | configfs has removed the item from the filesystem view but before the | |
279 | item is removed from its parent group. Like drop_item(), | |
280 | disconnect_notify() is void and cannot fail. Client subsystems should | |
281 | not drop any references here, as they still must do it in drop_item(). | |
282 | ||
7063fbf2 JB |
283 | A config_group cannot be removed while it still has child items. This |
284 | is implemented in the configfs rmdir(2) code. ->drop_item() will not be | |
285 | called, as the item has not been dropped. rmdir(2) will fail, as the | |
286 | directory is not empty. | |
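From userspace, a refused removal is an ordinary rmdir(2) failure. The
sketch below shows a caller that tells the cases apart. remove_item() and
enum remove_result are hypothetical names, and mapping ENOTEMPTY to "group
still has children" is an assumption based on the description above.

```c
#include <errno.h>
#include <unistd.h>

enum remove_result { REMOVED, HAS_CHILDREN, BUSY, FAILED };

/* Try to remove an item directory and classify the outcome. */
enum remove_result remove_item(const char *path)
{
	if (rmdir(path) == 0)
		return REMOVED;
	switch (errno) {
	case ENOTEMPTY:
		return HAS_CHILDREN;   /* group still holds child items */
	case EBUSY:
		return BUSY;           /* something else pins the item */
	default:
		return FAILED;         /* e.g. ENOENT, EACCES */
	}
}
```

A tool would remove the children first and retry the parent when it sees
HAS_CHILDREN.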
287 | ||
288 | [struct configfs_subsystem] | |
289 | ||
4ae0edc2 | 290 | A subsystem must register itself, usually at module_init time. This |
7063fbf2 JB |
291 | tells configfs to make the subsystem appear in the file tree. |
292 | ||
293 | struct configfs_subsystem { | |
294 | struct config_group su_group; | |
e6bd07ae | 295 | struct mutex su_mutex; |
7063fbf2 JB |
296 | }; |
297 | ||
298 | int configfs_register_subsystem(struct configfs_subsystem *subsys); | |
299 | void configfs_unregister_subsystem(struct configfs_subsystem *subsys); | |
300 | ||
e6bd07ae | 301 | A subsystem consists of a toplevel config_group and a mutex. |
7063fbf2 JB |
302 | The group is where child config_items are created. For a subsystem, |
303 | this group is usually defined statically. Before calling | |
304 | configfs_register_subsystem(), the subsystem must have initialized the | |
305 | group via the usual group _init() functions, and it must also have | |
e6bd07ae | 306 | initialized the mutex. |
7063fbf2 JB |
307 | When the register call returns, the subsystem is live, and it |
308 | will be visible via configfs. At that point, mkdir(2) can be called and | |
309 | the subsystem must be ready for it. | |
310 | ||
311 | [An Example] | |
312 | ||
313 | The best example of these basic concepts is the simple_children | |
ecb3d28c JB |
314 | subsystem/group and the simple_child item in configfs_example_explicit.c |
315 | and configfs_example_macros.c. It shows a trivial object displaying and | |
316 | storing an attribute, and a simple group creating and destroying these | |
317 | children. | |
318 | ||
319 | The only difference between configfs_example_explicit.c and | |
320 | configfs_example_macros.c is how the attributes of the childless item | |
321 | are defined. The childless item has extended attributes, each with | |
322 | their own show()/store() operation. This follows a convention commonly | |
323 | used in sysfs. configfs_example_explicit.c creates these attributes | |
324 | by explicitly defining the structures involved. Conversely, | |
325 | configfs_example_macros.c uses some convenience macros from configfs.h | |
326 | to define the attributes. These macros are similar to their sysfs | |
327 | counterparts. | |
7063fbf2 | 328 | |
e6bd07ae | 329 | [Hierarchy Navigation and the Subsystem Mutex] |
7063fbf2 JB |
330 | |
331 | There is an extra bonus that configfs provides. The config_groups and | |
332 | config_items are arranged in a hierarchy due to the fact that they | |
333 | appear in a filesystem. A subsystem is NEVER to touch the filesystem | |
334 | parts, but the subsystem might be interested in this hierarchy. For | |
335 | this reason, the hierarchy is mirrored via the config_group->cg_children | |
336 | and config_item->ci_parent structure members. | |
337 | ||
338 | A subsystem can navigate the cg_children list and the ci_parent pointer | |
339 | to see the tree created by the subsystem. This can race with configfs' | |
e6bd07ae | 340 | management of the hierarchy, so configfs uses the subsystem mutex to |
7063fbf2 JB |
341 | protect modifications. Whenever a subsystem wants to navigate the |
342 | hierarchy, it must do so under the protection of the subsystem | |
e6bd07ae | 343 | mutex. |
7063fbf2 | 344 | |
e6bd07ae | 345 | A subsystem will be prevented from acquiring the mutex while a newly |
7063fbf2 | 346 | allocated item has not been linked into this hierarchy. Similarly, it |
e6bd07ae | 347 | will not be able to acquire the mutex while a dropping item has not |
7063fbf2 JB |
348 | yet been unlinked. This means that an item's ci_parent pointer will |
349 | never be NULL while the item is in configfs, and that an item will only | |
350 | be in its parent's cg_children list for the same duration. This allows | |
351 | a subsystem to trust ci_parent and cg_children while they hold the | |
e6bd07ae | 352 | mutex. |
7063fbf2 JB |
353 | |
354 | [Item Aggregation Via symlink(2)] | |
355 | ||
356 | configfs provides a simple group via the group->item parent/child | |
357 | relationship. Often, however, a larger environment requires aggregation | |
358 | outside of the parent/child connection. This is implemented via | |
359 | symlink(2). | |
360 | ||
361 | A config_item may provide the ct_item_ops->allow_link() and | |
362 | ct_item_ops->drop_link() methods. If the ->allow_link() method exists, | |
363 | symlink(2) may be called with the config_item as the source of the link. | |
364 | These links are only allowed between configfs config_items. Any | |
365 | symlink(2) attempt outside the configfs filesystem will be denied. | |
366 | ||
367 | When symlink(2) is called, the source config_item's ->allow_link() | |
368 | method is called with itself and a target item. If the source item | |
369 | allows linking to target item, it returns 0. A source item may wish to | |
370 | reject a link if it only wants links to a certain type of object (say, | |
371 | in its own subsystem). | |
372 | ||
373 | When unlink(2) is called on the symbolic link, the source item is | |
374 | notified via the ->drop_link() method. Like the ->drop_item() method, | |
375 | this is a void function and cannot return failure. The subsystem is | |
376 | responsible for responding to the change. | |
377 | ||
378 | A config_item cannot be removed while it links to any other item, nor | |
379 | can it be removed while an item links to it. Dangling symlinks are not | |
380 | allowed in configfs. | |
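Seen from userspace, aggregation is a plain symlink(2) whose link name
lives inside the source item's directory and whose target is the other
item's path. A minimal sketch, assuming hypothetical helpers item_link()
and item_unlink() and caller-supplied paths:

```c
#include <stdio.h>
#include <unistd.h>

/* Build "dir/name" into out; returns 0 on success, -1 if it won't fit. */
static int join_path(char *out, size_t outsz,
		     const char *dir, const char *name)
{
	int n = snprintf(out, outsz, "%s/%s", dir, name);

	return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}

/*
 * Create a link called link_name inside source_dir, pointing at
 * target_path. Returns 0 on success, -1 on error.
 */
int item_link(const char *source_dir, const char *link_name,
	      const char *target_path)
{
	char link_path[4096];

	if (join_path(link_path, sizeof(link_path), source_dir, link_name))
		return -1;
	return symlink(target_path, link_path);
}

/* Remove the link again. Returns 0 on success, -1 on error. */
int item_unlink(const char *source_dir, const char *link_name)
{
	char link_path[4096];

	if (join_path(link_path, sizeof(link_path), source_dir, link_name))
		return -1;
	return unlink(link_path);
}
```

On configfs, a failing item_link() would correspond to the source item's
->allow_link() rejecting the target, or to a target outside configfs.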
381 | ||
382 | [Automatically Created Subgroups] | |
383 | ||
384 | A new config_group may want to have two types of child config_items. | |
385 | While this could be codified by magic names in ->make_item(), it is much | |
386 | more explicit to have a method whereby userspace sees this divergence. | |
387 | ||
388 | Rather than have a group where some items behave differently than | |
389 | others, configfs provides a method whereby one or many subgroups are | |
390 | automatically created inside the parent at its creation. Thus, | |
48cc7ec9 | 391 | mkdir("parent") results in "parent", "parent/subgroup1", up through |
7063fbf2 JB |
392 | "parent/subgroupN". Items of type 1 can now be created in |
393 | "parent/subgroup1", and items of type N can be created in | |
394 | "parent/subgroupN". | |
395 | ||
396 | These automatic subgroups, or default groups, do not preclude other | |
397 | children of the parent group. If ct_group_ops->make_group() exists, | |
398 | other child groups can be created on the parent group directly. | |
399 | ||
400 | A configfs subsystem specifies default groups by filling in the | |
401 | NULL-terminated array default_groups on the config_group structure. | |
402 | Each group in that array is populated in the configfs tree at the same | |
403 | time as the parent group. Similarly, they are removed at the same time | |
404 | as the parent. No extra notification is provided. When a ->drop_item() | |
405 | method call notifies the subsystem the parent group is going away, it | |
406 | also means every default group child associated with that parent group. | |
407 | ||
408 | As a consequence of this, default_groups cannot be removed directly via | |
409 | rmdir(2). They also are not considered when rmdir(2) on the parent | |
410 | group is checking for children. | |
411 | ||
25985edc | 412 | [Dependent Subsystems] |
631d1feb JB |
413 | |
414 | Sometimes other drivers depend on particular configfs items. For | |
415 | example, ocfs2 mounts depend on a heartbeat region item. If that | |
416 | region item is removed with rmdir(2), the ocfs2 mount must BUG or go | |
417 | readonly. Not happy. | |
418 | ||
419 | configfs provides two additional API calls: configfs_depend_item() and | |
420 | configfs_undepend_item(). A client driver can call | |
421 | configfs_depend_item() on an existing item to tell configfs that it is | |
422 | depended on. configfs will then return -EBUSY from rmdir(2) for that | |
423 | item. When the item is no longer depended on, the client driver calls | |
424 | configfs_undepend_item() on it. | |
425 | ||
426 | These APIs cannot be called underneath any configfs callbacks, as | |
427 | they will conflict. They can block and allocate. A client driver | |
428 | probably shouldn't be calling them of its own gumption. Rather it should | |
429 | be providing an API that external subsystems call. | |
430 | ||
431 | How does this work? Imagine the ocfs2 mount process. When it mounts, | |
432 | it asks for a heartbeat region item. This is done via a call into the | |
433 | heartbeat code. Inside the heartbeat code, the region item is looked | |
434 | up. Here, the heartbeat code calls configfs_depend_item(). If it | |
435 | succeeds, then heartbeat knows the region is safe to give to ocfs2. | |
436 | If it fails, it was being torn down anyway, and heartbeat can gracefully | |
437 | pass up an error. | |
438 | ||
7063fbf2 JB |
439 | [Committable Items] |
440 | ||
441 | NOTE: Committable items are currently unimplemented. | |
442 | ||
443 | Some config_items cannot have a valid initial state. That is, no | |
444 | default values can be specified for the item's attributes such that the | |
445 | item can do its work. Userspace must configure one or more attributes, | |
446 | after which the subsystem can start whatever entity this item | |
447 | represents. | |
448 | ||
449 | Consider the FakeNBD device from above. Without a target address *and* | |
450 | a target device, the subsystem has no idea what block device to import. | |
451 | The simple example assumes that the subsystem merely waits until all the | |
452 | appropriate attributes are configured, and then connects. This will, | |
453 | indeed, work, but now every attribute store must check if the attributes | |
454 | are initialized. Every attribute store must fire off the connection if | |
455 | that condition is met. | |
456 | ||
457 | Far better would be an explicit action notifying the subsystem that the | |
458 | config_item is ready to go. More importantly, an explicit action allows | |
3f6dee9b | 459 | the subsystem to provide feedback as to whether the attributes are |
7063fbf2 JB |
460 | initialized in a way that makes sense. configfs provides this as |
461 | committable items. | |
462 | ||
463 | configfs still uses only normal filesystem operations. An item is | |
464 | committed via rename(2). The item is moved from a directory where it | |
465 | can be modified to a directory where it cannot. | |
466 | ||
467 | Any group that provides the ct_group_ops->commit_item() method has | |
468 | committable items. When this group appears in configfs, mkdir(2) will | |
469 | not work directly in the group. Instead, the group will have two | |
470 | subdirectories: "live" and "pending". The "live" directory does not | |
471 | support mkdir(2) or rmdir(2) either. It only allows rename(2). The | |
472 | "pending" directory does allow mkdir(2) and rmdir(2). An item is | |
473 | created in the "pending" directory. Its attributes can be modified at | |
474 | will. Userspace commits the item by renaming it into the "live" | |
d6bc8ac9 | 475 | directory. At this point, the subsystem receives the ->commit_item() |
7063fbf2 JB |
476 | callback. If all required attributes are filled to satisfaction, the |
477 | method returns zero and the item is moved to the "live" directory. | |
478 | ||
479 | As rmdir(2) does not work in the "live" directory, an item must be | |
480 | shut down, or "uncommitted". Again, this is done via rename(2), this | |
481 | time from the "live" directory back to the "pending" one. The subsystem | |
482 | is notified by the ct_group_ops->uncommit_object() method. | |
483 | ||
484 |