|  | ============================== | 
|  | General notification mechanism | 
|  | ============================== | 
|  |  | 
|  | The general notification mechanism is built on top of the standard pipe driver | 
|  | whereby it effectively splices notification messages from the kernel into pipes | 
|  | opened by userspace.  This can be used in conjunction with: | 
|  |  | 
|  | * Key/keyring notifications | 
|  |  | 
|  |  | 
|  | The notification buffers can be enabled by: | 
|  |  | 
|  | "General setup"/"General notification queue" | 
|  | (CONFIG_WATCH_QUEUE) | 
|  |  | 
|  | This document has the following sections: | 
|  |  | 
|  | .. contents:: :local: | 
|  |  | 
|  |  | 
|  | Overview | 
|  | ======== | 
|  |  | 
|  | This facility appears as a pipe that is opened in a special mode.  The pipe's | 
|  | internal ring buffer is used to hold messages that are generated by the kernel. | 
|  | These messages are then read out by read().  Splice and similar are disabled on | 
|  | such pipes because, under some circumstances, they need to revert their | 
|  | additions to the ring, and those additions might end up interleaved with | 
|  | notification messages. | 
|  |  | 
|  | The owner of the pipe has to tell the kernel which sources it would like to | 
|  | watch through that pipe.  Only sources that have been connected to a pipe will | 
|  | insert messages into it.  Note that a source may be bound to multiple pipes and | 
|  | insert messages into all of them simultaneously. | 
|  |  | 
|  | Filters may also be emplaced on a pipe so that certain source types and | 
|  | subevents can be ignored if they're not of interest. | 
|  |  | 
|  | A message will be discarded if there isn't a slot available in the ring or if | 
|  | no preallocated message buffer is available.  In both of these cases, read() | 
|  | will insert a WATCH_META_LOSS_NOTIFICATION message into the output buffer after | 
|  | the last message currently in the buffer has been read. | 
|  |  | 
|  | Note that when producing a notification, the kernel does not wait for the | 
|  | consumers to collect it, but rather just continues on.  This means that | 
|  | notifications can be generated whilst spinlocks are held and also protects the | 
|  | kernel from being held up indefinitely by a userspace malfunction. | 
|  |  | 
|  |  | 
|  | Message Structure | 
|  | ================= | 
|  |  | 
|  | Notification messages begin with a short header:: | 
|  |  | 
|  | struct watch_notification { | 
|  | __u32	type:24; | 
|  | __u32	subtype:8; | 
|  | __u32	info; | 
|  | }; | 
|  |  | 
|  | "type" indicates the source of the notification record and "subtype" indicates | 
|  | the type of record from that source (see the Watch Sources section below).  The | 
|  | type may also be "WATCH_TYPE_META".  This is a special record type generated | 
|  | internally by the watch queue itself.  There are two subtypes: | 
|  |  | 
|  | * WATCH_META_REMOVAL_NOTIFICATION | 
|  | * WATCH_META_LOSS_NOTIFICATION | 
|  |  | 
|  | The first indicates that an object on which a watch was installed was removed | 
|  | or destroyed and the second indicates that some messages have been lost. | 
|  |  | 
|  | "info" packs several fields, including: | 
|  |  | 
|  | * The length of the message in bytes, including the header (mask with | 
|  | WATCH_INFO_LENGTH and shift by WATCH_INFO_LENGTH__SHIFT).  This indicates | 
|  | the size of the record, which may be between 8 and 127 bytes. | 
|  |  | 
|  | * The watch ID (mask with WATCH_INFO_ID and shift by WATCH_INFO_ID__SHIFT). | 
|  | This indicates the caller's ID of the watch, which may be between 0 | 
|  | and 255.  Multiple watches may share a queue, and this provides a means to | 
|  | distinguish them. | 
|  |  | 
|  | * A type-specific field (WATCH_INFO_TYPE_INFO).  This is set by the | 
|  | notification producer to indicate some meaning specific to the type and | 
|  | subtype. | 
|  |  | 
|  | Everything in info apart from the length can be used for filtering. | 
|  |  | 
|  | The header can be followed by supplementary information.  The format of this is | 
|  | defined by the type and subtype. | 
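
As a minimal sketch of how the fields packed into "info" can be separated, the helpers below apply a mask and shift to each one.  The ``EX_``-prefixed constants are local stand-ins that are assumed to mirror the ``WATCH_INFO_*`` definitions in ``<linux/watch_queue.h>``; real code should include that header and use its names instead.

```c
#include <stdint.h>

/*
 * Local stand-ins, ASSUMED to mirror the WATCH_INFO_* masks and shifts
 * from <linux/watch_queue.h>.  They exist only so that this sketch
 * builds on its own.
 */
#define EX_INFO_LENGTH		0x0000007fu
#define EX_INFO_LENGTH__SHIFT	0
#define EX_INFO_ID		0x0000ff00u
#define EX_INFO_ID__SHIFT	8
#define EX_INFO_TYPE_INFO	0xffff0000u

/* Record length in bytes, including the header. */
static inline unsigned int ex_info_length(uint32_t info)
{
	return (info & EX_INFO_LENGTH) >> EX_INFO_LENGTH__SHIFT;
}

/* Caller-chosen watch ID (0-255). */
static inline unsigned int ex_info_id(uint32_t info)
{
	return (info & EX_INFO_ID) >> EX_INFO_ID__SHIFT;
}

/* Type-specific bits, returned in place (not shifted down). */
static inline uint32_t ex_info_type_info(uint32_t info)
{
	return info & EX_INFO_TYPE_INFO;
}
```

For instance, under these assumed values an info word of 0x00000110 describes a 16-byte record that was delivered through watch ID 1.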
|  |  | 
|  |  | 
|  | Watch List (Notification Source) API | 
|  | ==================================== | 
|  |  | 
|  | A "watch list" is a list of watchers that are subscribed to a source of | 
|  | notifications.  A list may be attached to an object (say a key or a superblock) | 
|  | or may be global (say for device events).  From a userspace perspective, a | 
|  | non-global watch list is typically referred to by reference to the object it | 
|  | belongs to (such as using KEYCTL_NOTIFY and giving it a key serial number to | 
|  | watch that specific key). | 
|  |  | 
|  | To manage a watch list, the following functions are provided: | 
|  |  | 
|  | * :: | 
|  |  | 
|  | void init_watch_list(struct watch_list *wlist, | 
|  | void (*release_watch)(struct watch *watch)); | 
|  |  | 
|  | Initialise a watch list.  If ``release_watch`` is not NULL, then this | 
|  | indicates a function that should be called when the watch_list object is | 
|  | destroyed to discard any references the watch list holds on the watched | 
|  | object. | 
|  |  | 
|  | * ``void remove_watch_list(struct watch_list *wlist);`` | 
|  |  | 
|  | This removes all of the watches subscribed to a watch_list and frees them | 
|  | and then destroys the watch_list object itself. | 
|  |  | 
|  |  | 
|  | Watch Queue (Notification Output) API | 
|  | ===================================== | 
|  |  | 
|  | A "watch queue" is the buffer allocated by an application that notification | 
|  | records will be written into.  The workings of this are hidden entirely inside | 
|  | of the pipe device driver, but it is necessary to gain a reference to it to set | 
|  | a watch.  These can be managed with: | 
|  |  | 
|  | * ``struct watch_queue *get_watch_queue(int fd);`` | 
|  |  | 
|  | Since watch queues are indicated to the kernel by the fd of the pipe that | 
|  | implements the buffer, userspace must hand that fd through a system call. | 
|  | This can be used to look up an opaque pointer to the watch queue from the | 
|  | system call. | 
|  |  | 
|  | * ``void put_watch_queue(struct watch_queue *wqueue);`` | 
|  |  | 
|  | This discards the reference obtained from ``get_watch_queue()``. | 
|  |  | 
|  |  | 
|  | Watch Subscription API | 
|  | ====================== | 
|  |  | 
|  | A "watch" is a subscription on a watch list, indicating the watch queue, and | 
|  | thus the buffer, into which notification records should be written.  The watch | 
|  | queue object may also carry filtering rules for that object, as set by | 
|  | userspace.  Some parts of the watch struct can be set by the driver:: | 
|  |  | 
|  | struct watch { | 
|  | union { | 
|  | u32		info_id;	/* ID to be OR'd in to info field */ | 
|  | ... | 
|  | }; | 
|  | void			*private;	/* Private data for the watched object */ | 
|  | u64			id;		/* Internal identifier */ | 
|  | ... | 
|  | }; | 
|  |  | 
|  | The ``info_id`` value should be an 8-bit number obtained from userspace and | 
|  | shifted by WATCH_INFO_ID__SHIFT.  This is OR'd into the WATCH_INFO_ID field of | 
|  | struct watch_notification::info when and if the notification is written into | 
|  | the associated watch queue buffer. | 
|  |  | 
|  | The ``private`` field is the driver's data associated with the watch_list and | 
|  | is cleaned up by the ``watch_list::release_watch()`` method. | 
|  |  | 
|  | The ``id`` field is the source's ID.  Notifications that are posted with a | 
|  | different ID are ignored. | 
|  |  | 
|  | The following functions are provided to manage watches: | 
|  |  | 
|  | * ``void init_watch(struct watch *watch, struct watch_queue *wqueue);`` | 
|  |  | 
|  | Initialise a watch object, setting its pointer to the watch queue, using | 
|  | appropriate barriering to avoid lockdep complaints. | 
|  |  | 
|  | * ``int add_watch_to_object(struct watch *watch, struct watch_list *wlist);`` | 
|  |  | 
|  | Subscribe a watch to a watch list (notification source).  The | 
|  | driver-settable fields in the watch struct must have been set before this | 
|  | is called. | 
|  |  | 
|  | * :: | 
|  |  | 
|  | int remove_watch_from_object(struct watch_list *wlist, | 
|  | struct watch_queue *wqueue, | 
|  | u64 id, false); | 
|  |  | 
|  | Remove a watch from a watch list, where the watch must match the specified | 
|  | watch queue (``wqueue``) and object identifier (``id``).  A notification | 
|  | (``WATCH_META_REMOVAL_NOTIFICATION``) is sent to the watch queue to | 
|  | indicate that the watch got removed. | 
|  |  | 
|  | * ``int remove_watch_from_object(struct watch_list *wlist, NULL, 0, true);`` | 
|  |  | 
|  | Remove all the watches from a watch list.  It is expected that this will be | 
|  | called preparatory to destruction and that the watch list will be | 
|  | inaccessible to new watches by this point.  A notification | 
|  | (``WATCH_META_REMOVAL_NOTIFICATION``) is sent to the watch queue of each | 
|  | subscribed watch to indicate that the watch got removed. | 
|  |  | 
|  |  | 
|  | Notification Posting API | 
|  | ======================== | 
|  |  | 
|  | To post a notification to a watch list so that the subscribed watches can see it, | 
|  | the following function should be used:: | 
|  |  | 
|  | void post_watch_notification(struct watch_list *wlist, | 
|  | struct watch_notification *n, | 
|  | const struct cred *cred, | 
|  | u64 id); | 
|  |  | 
|  | The notification should be preformatted and a pointer to the header (``n``) | 
|  | should be passed in.  The notification may be larger than the header, and its | 
|  | total size in bytes is noted in ``n->info & WATCH_INFO_LENGTH``. | 
|  |  | 
|  | The ``cred`` struct indicates the credentials of the source (subject) and is | 
|  | passed to the LSMs, such as SELinux, to allow or suppress the recording of the | 
|  | note in each individual queue according to the credentials of that queue | 
|  | (object). | 
|  |  | 
|  | The ``id`` is the ID of the source object (such as the serial number on a key). | 
|  | Only watches that have the same ID set in them will see this notification. | 
|  |  | 
|  |  | 
|  | Watch Sources | 
|  | ============= | 
|  |  | 
|  | Any particular buffer can be fed from multiple sources.  Sources include: | 
|  |  | 
|  | * WATCH_TYPE_KEY_NOTIFY | 
|  |  | 
|  | Notifications of this type indicate changes to keys and keyrings, including | 
|  | the changes of keyring contents or the attributes of keys. | 
|  |  | 
|  | See Documentation/security/keys/core.rst for more information. | 
|  |  | 
|  |  | 
|  | Event Filtering | 
|  | =============== | 
|  |  | 
|  | Once a watch queue has been created, a set of filters can be applied to limit | 
|  | the events that are received using:: | 
|  |  | 
|  | struct watch_notification_filter filter = { | 
|  | ... | 
|  | }; | 
|  | ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter); | 
|  |  | 
|  | The filter description is a variable of type:: | 
|  |  | 
|  | struct watch_notification_filter { | 
|  | __u32	nr_filters; | 
|  | __u32	__reserved; | 
|  | struct watch_notification_type_filter filters[]; | 
|  | }; | 
|  |  | 
|  | Where "nr_filters" is the number of filters in filters[] and "__reserved" | 
|  | should be 0.  The "filters" array has elements of the following type:: | 
|  |  | 
|  | struct watch_notification_type_filter { | 
|  | __u32	type; | 
|  | __u32	info_filter; | 
|  | __u32	info_mask; | 
|  | __u32	subtype_filter[8]; | 
|  | }; | 
|  |  | 
|  | Where: | 
|  |  | 
|  | * ``type`` is the event type to filter for and should be something like | 
|  | "WATCH_TYPE_KEY_NOTIFY" | 
|  |  | 
|  | * ``info_filter`` and ``info_mask`` act as a filter on the info field of the | 
|  | notification record.  The notification is only written into the buffer if:: | 
|  |  | 
|  | (watch.info & info_mask) == info_filter | 
|  |  | 
|  | This could be used, for example, to ignore events that are not exactly on | 
|  | the watched point in a mount tree. | 
|  |  | 
|  | * ``subtype_filter`` is a bitmask indicating the subtypes that are of | 
|  | interest.  Bit 0 of subtype_filter[0] corresponds to subtype 0, bit 1 to | 
|  | subtype 1, and so on. | 
|  |  | 
|  | If the argument to the ioctl() is NULL, then the filters will be removed and | 
|  | all events from the watched sources will come through. | 
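
The filter layout and matching rule described above can be sketched in plain C.  This is a userspace model under stated assumptions, not the kernel's implementation: the ``flt_``-prefixed structs are local copies of the two structures shown above (with fixed-width types), and ``flt_alloc()``, ``flt_want_subtype()`` and ``flt_passes()`` are hypothetical helpers introduced only for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Local copies of the documented filter structs (hypothetical names). */
struct flt_type_filter {
	uint32_t type;
	uint32_t info_filter;
	uint32_t info_mask;
	uint32_t subtype_filter[8];	/* 256-bit bitmap, one bit per subtype */
};

struct flt_filter {
	uint32_t nr_filters;
	uint32_t reserved;		/* must be 0 */
	struct flt_type_filter filters[];
};

/* Allocate a zeroed filter set with room for nr entries. */
static struct flt_filter *flt_alloc(uint32_t nr)
{
	struct flt_filter *f;

	f = calloc(1, sizeof(*f) + nr * sizeof(f->filters[0]));
	if (f)
		f->nr_filters = nr;
	return f;
}

/* Mark one subtype (0-255) as being of interest. */
static void flt_want_subtype(struct flt_type_filter *tf, unsigned int subtype)
{
	if (subtype > 255)
		return;
	tf->subtype_filter[subtype / 32] |= UINT32_C(1) << (subtype % 32);
}

/*
 * Model of the documented rule: a record passes one filter entry if the
 * type matches, the subtype bit is set and
 * (info & info_mask) == info_filter.
 */
static int flt_passes(const struct flt_type_filter *tf, uint32_t type,
		      unsigned int subtype, uint32_t info)
{
	if (tf->type != type || subtype > 255)
		return 0;
	if (!(tf->subtype_filter[subtype / 32] &
	      (UINT32_C(1) << (subtype % 32))))
		return 0;
	return (info & tf->info_mask) == tf->info_filter;
}
```

With the real header, a structure built this way would be the argument handed to ``ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, ...)``.  Leaving ``info_mask`` and ``info_filter`` at zero makes the info test always succeed, so only the type and subtype are filtered.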
|  |  | 
|  |  | 
|  | Userspace Code Example | 
|  | ====================== | 
|  |  | 
|  | A buffer is created with something like the following:: | 
|  |  | 
|  | pipe2(fds, O_NOTIFICATION_PIPE); | 
|  | ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, 256); | 
|  |  | 
|  | It can then be set to receive keyring change notifications:: | 
|  |  | 
|  | keyctl(KEYCTL_WATCH_KEY, KEY_SPEC_SESSION_KEYRING, fds[1], 0x01); | 
|  |  | 
|  | The notifications can then be consumed by something like the following:: | 
|  |  | 
|  | static void consumer(int rfd) | 
|  | { | 
|  | unsigned char buffer[128]; | 
|  | ssize_t buf_len; | 
|  |  | 
|  | while ((buf_len = read(rfd, buffer, sizeof(buffer))) > 0) { | 
|  | unsigned char *p = buffer; | 
|  | unsigned char *end = buffer + buf_len; | 
|  |  | 
|  | while (p < end) { | 
|  | union { | 
|  | struct watch_notification n; | 
|  | unsigned char buf1[128]; | 
|  | } n; | 
|  | size_t largest, len; | 
|  |  | 
|  | /* Copy the record out so that it is suitably aligned. */ | 
|  | largest = end - p; | 
|  | if (largest > 128) | 
|  | largest = 128; | 
|  | memcpy(&n, p, largest); | 
|  |  | 
|  | len = (n.n.info & WATCH_INFO_LENGTH) >> | 
|  | WATCH_INFO_LENGTH__SHIFT; | 
|  | if (len == 0 || len > largest) | 
|  | return; | 
|  |  | 
|  | switch (n.n.type) { | 
|  | case WATCH_TYPE_META: | 
|  | got_meta(&n.n); | 
|  | break; | 
|  | case WATCH_TYPE_KEY_NOTIFY: | 
|  | saw_key_change(&n.n); | 
|  | break; | 
|  | } | 
|  |  | 
|  | p += len; | 
|  | } | 
|  | } | 
|  | } | 
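
The record-walking logic can also be exercised without a real notification pipe.  The sketch below counts loss markers in a buffer of packed records such as one returned by read().  It is illustrative only: ``wlk_count_losses()`` is a hypothetical helper, and the ``WLK_`` constants and ``struct wlk_notification`` are local stand-ins assumed to mirror ``WATCH_TYPE_META``, ``WATCH_META_LOSS_NOTIFICATION``, ``WATCH_INFO_LENGTH`` and ``struct watch_notification`` from ``<linux/watch_queue.h>``.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins, ASSUMED to mirror <linux/watch_queue.h>. */
#define WLK_TYPE_META		0
#define WLK_META_LOSS		1
#define WLK_INFO_LENGTH		0x0000007fu

struct wlk_notification {
	uint32_t type:24;
	uint32_t subtype:8;
	uint32_t info;
};

/*
 * Walk len bytes of packed records and count the loss markers.
 * Returns -1 if a record is truncated or claims an impossible length.
 */
static int wlk_count_losses(const unsigned char *buf, size_t len)
{
	const unsigned char *p = buf, *end = buf + len;
	int losses = 0;

	while (p < end) {
		struct wlk_notification n;
		size_t rec_len;

		if ((size_t)(end - p) < sizeof(n))
			return -1;
		/* Copy the header out so that it is suitably aligned. */
		memcpy(&n, p, sizeof(n));

		rec_len = n.info & WLK_INFO_LENGTH;
		if (rec_len < sizeof(n) || rec_len > (size_t)(end - p))
			return -1;

		if (n.type == WLK_TYPE_META && n.subtype == WLK_META_LOSS)
			losses++;
		p += rec_len;
	}
	return losses;
}
```

A non-zero count tells the consumer that some notifications were discarded, so any state derived from earlier messages may need to be refreshed from the watched objects themselves.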