|  | .. SPDX-License-Identifier: GPL-2.0 | 
|  |  | 
|  | ================================= | 
|  | NETWORK FILESYSTEM HELPER LIBRARY | 
|  | ================================= | 
|  |  | 
|  | .. Contents: | 
|  |  | 
|  | - Overview. | 
|  | - Buffered read helpers. | 
|  | - Read helper functions. | 
|  | - Read helper structures. | 
|  | - Read helper operations. | 
|  | - Read helper procedure. | 
|  | - Read helper cache API. | 
|  |  | 
|  |  | 
|  | Overview | 
|  | ======== | 
|  |  | 
|  | The network filesystem helper library is a set of functions designed to aid a | 
|  | network filesystem in implementing VM/VFS operations.  For the moment, that | 
|  | just includes turning various VM buffered read operations into requests to read | 
|  | from the server.  The helper library, however, can also interpose other | 
|  | services, such as local caching or local data encryption. | 
|  |  | 
|  | Note that the library module doesn't link against local caching directly, so | 
|  | access must be provided by the netfs. | 
|  |  | 
|  |  | 
|  | Buffered Read Helpers | 
|  | ===================== | 
|  |  | 
|  | The library provides a set of read helpers that handle the ->readpage(), | 
|  | ->readahead() and much of the ->write_begin() VM operations and translate them | 
|  | into a common call framework. | 
|  |  | 
|  | The following services are provided: | 
|  |  | 
|  | * Handles transparent huge pages (THPs). | 
|  |  | 
|  | * Insulates the netfs from VM interface changes. | 
|  |  | 
|  | * Allows the netfs to arbitrarily split reads up into pieces, even ones that | 
|  | don't match page sizes or page alignments and that may cross pages. | 
|  |  | 
|  | * Allows the netfs to expand a readahead request in both directions to meet | 
|  | its needs. | 
|  |  | 
|  | * Allows the netfs to partially fulfil a read, which will then be resubmitted. | 
|  |  | 
|  | * Handles local caching, allowing cached data and server-read data to be | 
|  | interleaved for a single request. | 
|  |  | 
|  | * Handles clearing of bufferage that isn't on the server. | 
|  |  | 
|  | * Handles retrying of reads that failed, switching reads from the cache to the | 
|  | server as necessary. | 
|  |  | 
|  | * In the future, this is a place where other services can be performed, such as | 
|  | local encryption of data to be stored remotely or in the cache. | 
|  |  | 
|  | From the network filesystem, the helpers require a table of operations.  This | 
|  | includes a mandatory method to issue a read operation along with a number of | 
|  | optional methods. | 
|  |  | 
|  |  | 
|  | Read Helper Functions | 
|  | --------------------- | 
|  |  | 
|  | Three read helpers are provided:: | 
|  |  | 
|  | * void netfs_readahead(struct readahead_control *ractl, | 
|  | const struct netfs_read_request_ops *ops, | 
|  | void *netfs_priv); | 
|  | * int netfs_readpage(struct file *file, | 
|  | struct page *page, | 
|  | const struct netfs_read_request_ops *ops, | 
|  | void *netfs_priv); | 
|  | * int netfs_write_begin(struct file *file, | 
|  | struct address_space *mapping, | 
|  | loff_t pos, | 
|  | unsigned int len, | 
|  | unsigned int flags, | 
|  | struct page **_page, | 
|  | void **_fsdata, | 
|  | const struct netfs_read_request_ops *ops, | 
|  | void *netfs_priv); | 
|  |  | 
|  | Each corresponds to a VM operation, with the addition of a couple of parameters | 
|  | for the use of the read helpers: | 
|  |  | 
|  | * ``ops`` | 
|  |  | 
|  | A table of operations through which the helpers can talk to the filesystem. | 
|  |  | 
|  | * ``netfs_priv`` | 
|  |  | 
|  | Filesystem private data (can be NULL). | 
|  |  | 
|  | Both of these values will be stored into the read request structure. | 
|  |  | 
|  | For ->readahead() and ->readpage(), the network filesystem should just jump | 
|  | into the corresponding read helper; whereas for ->write_begin(), it may be a | 
|  | little more complicated as the network filesystem might want to flush | 
|  | conflicting writes or track dirty data and needs to put the acquired page if an | 
|  | error occurs after calling the helper. | 
|  |  | 
|  | The helpers manage the read request, calling back into the network filesystem | 
|  | through the supplied table of operations.  Waits will be performed as | 
|  | necessary before returning for helpers that are meant to be synchronous. | 
|  |  | 
|  | If an error occurs and netfs_priv is non-NULL, ops->cleanup() will be called to | 
|  | deal with it.  If some parts of the request are in progress when an error | 
|  | occurs, the request will get partially completed if sufficient data is read. | 
|  |  | 
|  | Additionally, there is:: | 
|  |  | 
|  | * void netfs_subreq_terminated(struct netfs_read_subrequest *subreq, | 
|  | ssize_t transferred_or_error, | 
|  | bool was_async); | 
|  |  | 
|  | which should be called to complete a read subrequest.  This is given the number | 
|  | of bytes transferred or a negative error code, plus a flag indicating whether | 
|  | the operation was asynchronous (ie. whether the follow-on processing can be | 
|  | done in the current context, given this may involve sleeping). | 
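
The calling convention can be illustrated with a small userspace model.  This
is only a sketch: ``struct model_subreq``, ``model_subreq_terminated()``,
``model_issue_op()`` and the in-memory ``model_server`` are hypothetical
stand-ins invented for the example, not kernel interfaces.  It shows a read
that resumes at ``start + transferred`` and reports either a byte count or a
negative error code on completion:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t */

/* Userspace stand-in for a read subrequest (hypothetical, model only). */
struct model_subreq {
	long long	start;		/* File position of the slice */
	size_t		len;		/* Length of the slice */
	size_t		transferred;	/* Bytes read into the slice so far */
	int		error;		/* 0 or the negative error reported */
	char		*buf;		/* Destination buffer for the slice */
};

/* Stand-in for the termination call: record a byte count or an error. */
static void model_subreq_terminated(struct model_subreq *subreq,
				    ssize_t transferred_or_error,
				    bool was_async)
{
	(void)was_async;	/* The model always completes synchronously */
	if (transferred_or_error < 0) {
		subreq->error = (int)transferred_or_error;
		return;
	}
	subreq->transferred += (size_t)transferred_or_error;
	assert(subreq->transferred <= subreq->len);
}

/* Hypothetical "server": a flat 16-byte file held in memory. */
static const char model_server[] = "0123456789abcdef";

/* Read at most max_io bytes of what is still outstanding, then terminate. */
static void model_issue_op(struct model_subreq *subreq, size_t max_io)
{
	long long pos = subreq->start + (long long)subreq->transferred;
	size_t want = subreq->len - subreq->transferred;
	size_t file_size = sizeof(model_server) - 1;

	if (pos < 0 || (size_t)pos >= file_size) {
		model_subreq_terminated(subreq, -EIO, false);
		return;
	}
	if (want > file_size - (size_t)pos)
		want = file_size - (size_t)pos;
	if (want > max_io)
		want = max_io;	/* Simulates a short read */
	memcpy(subreq->buf + subreq->transferred, model_server + pos, want);
	model_subreq_terminated(subreq, (ssize_t)want, false);
}
```

Calling ``model_issue_op()`` a second time after a short read continues from
where the first call stopped, because the position is derived from the
updated ``transferred`` count.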
|  |  | 
|  |  | 
|  | Read Helper Structures | 
|  | ---------------------- | 
|  |  | 
|  | The read helpers make use of a couple of structures to maintain the state of | 
|  | the read.  The first is a structure that manages a read request as a whole:: | 
|  |  | 
|  | struct netfs_read_request { | 
|  | struct inode		*inode; | 
|  | struct address_space	*mapping; | 
|  | struct netfs_cache_resources cache_resources; | 
|  | void			*netfs_priv; | 
|  | loff_t			start; | 
|  | size_t			len; | 
|  | loff_t			i_size; | 
|  | const struct netfs_read_request_ops *netfs_ops; | 
|  | unsigned int		debug_id; | 
|  | ... | 
|  | }; | 
|  |  | 
|  | The above fields are the ones the netfs can use.  They are: | 
|  |  | 
|  | * ``inode`` | 
|  | * ``mapping`` | 
|  |  | 
|  | The inode and the address space of the file being read from.  The mapping | 
|  | may or may not point to inode->i_data. | 
|  |  | 
|  | * ``cache_resources`` | 
|  |  | 
|  | Resources for the local cache to use, if present. | 
|  |  | 
|  | * ``netfs_priv`` | 
|  |  | 
|  | The network filesystem's private data.  The value for this can be passed in | 
|  | to the helper functions or set during the request.  The ->cleanup() op will | 
|  | be called if this is non-NULL at the end. | 
|  |  | 
|  | * ``start`` | 
|  | * ``len`` | 
|  |  | 
|  | The file position of the start of the read request and the length.  These | 
|  | may be altered by the ->expand_readahead() op. | 
|  |  | 
|  | * ``i_size`` | 
|  |  | 
|  | The size of the file at the start of the request. | 
|  |  | 
|  | * ``netfs_ops`` | 
|  |  | 
|  | A pointer to the operation table.  The value for this is passed into the | 
|  | helper functions. | 
|  |  | 
|  | * ``debug_id`` | 
|  |  | 
|  | A number allocated to this operation that can be displayed in trace lines | 
|  | for reference. | 
|  |  | 
|  |  | 
|  | The second structure is used to manage individual slices of the overall read | 
|  | request:: | 
|  |  | 
|  | struct netfs_read_subrequest { | 
|  | struct netfs_read_request *rreq; | 
|  | loff_t			start; | 
|  | size_t			len; | 
|  | size_t			transferred; | 
|  | unsigned long		flags; | 
|  | unsigned short		debug_index; | 
|  | ... | 
|  | }; | 
|  |  | 
|  | Each subrequest is expected to access a single source, though the helpers will | 
|  | handle falling back from one source type to another.  The members are: | 
|  |  | 
|  | * ``rreq`` | 
|  |  | 
|  | A pointer to the read request. | 
|  |  | 
|  | * ``start`` | 
|  | * ``len`` | 
|  |  | 
|  | The file position of the start of this slice of the read request and the | 
|  | length. | 
|  |  | 
|  | * ``transferred`` | 
|  |  | 
|  | The amount of data transferred so far of the length of this slice.  The | 
|  | network filesystem or cache should start the operation this far into the | 
|  | slice.  If a short read occurs, the helpers will call again, having updated | 
|  | this to reflect the amount read so far. | 
|  |  | 
|  | * ``flags`` | 
|  |  | 
|  | Flags pertaining to the read.  There are two of interest to the filesystem | 
|  | or cache: | 
|  |  | 
|  | * ``NETFS_SREQ_CLEAR_TAIL`` | 
|  |  | 
|  | This can be set to indicate that the remainder of the slice, from | 
|  | transferred to len, should be cleared. | 
|  |  | 
|  | * ``NETFS_SREQ_SEEK_DATA_READ`` | 
|  |  | 
|  | This is a hint to the cache that it might want to try skipping ahead to | 
|  | the next data (ie. using SEEK_DATA). | 
|  |  | 
|  | * ``debug_index`` | 
|  |  | 
|  | A number allocated to this slice that can be displayed in trace lines for | 
|  | reference. | 
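
How ``start``, ``len``, ``transferred`` and the clear-tail flag relate can be
sketched in userspace C.  Everything here is an assumption made for
illustration: ``struct model_subreq`` mirrors only the fields listed above,
``MODEL_SREQ_CLEAR_TAIL`` is an arbitrary bit value (the real flag's value is
not given in this document), and treating a cleared slice as fully
transferred is a simplification:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Arbitrary stand-in bit for the clear-tail flag (model only). */
#define MODEL_SREQ_CLEAR_TAIL	0x1UL

/* Userspace stand-in carrying only the fields discussed above. */
struct model_subreq {
	long long	start;
	size_t		len;
	size_t		transferred;
	unsigned long	flags;
};

/* File position at which the next I/O for this slice should begin. */
static long long model_next_pos(const struct model_subreq *s)
{
	return s->start + (long long)s->transferred;
}

/* Number of bytes still outstanding in this slice. */
static size_t model_remaining(const struct model_subreq *s)
{
	assert(s->transferred <= s->len);
	return s->len - s->transferred;
}

/*
 * If the clear-tail flag is set, zero the range from transferred to len in
 * the slice's buffer and count the slice as complete.  Returns the number of
 * bytes cleared, or 0 if the flag was not set.
 */
static size_t model_clear_tail(struct model_subreq *s, unsigned char *slice_buf)
{
	size_t n;

	if (!(s->flags & MODEL_SREQ_CLEAR_TAIL))
		return 0;
	n = model_remaining(s);
	memset(slice_buf + s->transferred, 0, n);
	s->transferred = s->len;
	return n;
}
```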
|  |  | 
|  |  | 
|  | Read Helper Operations | 
|  | ---------------------- | 
|  |  | 
|  | The network filesystem must provide the read helpers with a table of operations | 
|  | through which it can issue requests and negotiate:: | 
|  |  | 
|  | struct netfs_read_request_ops { | 
|  | void (*init_rreq)(struct netfs_read_request *rreq, struct file *file); | 
|  | bool (*is_cache_enabled)(struct inode *inode); | 
|  | int (*begin_cache_operation)(struct netfs_read_request *rreq); | 
|  | void (*expand_readahead)(struct netfs_read_request *rreq); | 
|  | bool (*clamp_length)(struct netfs_read_subrequest *subreq); | 
|  | void (*issue_op)(struct netfs_read_subrequest *subreq); | 
|  | bool (*is_still_valid)(struct netfs_read_request *rreq); | 
|  | int (*check_write_begin)(struct file *file, loff_t pos, unsigned len, | 
|  | struct page *page, void **_fsdata); | 
|  | void (*done)(struct netfs_read_request *rreq); | 
|  | void (*cleanup)(struct address_space *mapping, void *netfs_priv); | 
|  | }; | 
|  |  | 
|  | The operations are as follows: | 
|  |  | 
|  | * ``init_rreq()`` | 
|  |  | 
|  | [Optional] This is called to initialise the request structure.  It is given | 
|  | the file for reference and can modify the ->netfs_priv value. | 
|  |  | 
|  | * ``is_cache_enabled()`` | 
|  |  | 
|  | [Required] This is called by netfs_write_begin() to ask if the file is being | 
|  | cached.  It should return true if it is being cached and false otherwise. | 
|  |  | 
|  | * ``begin_cache_operation()`` | 
|  |  | 
|  | [Optional] This is called to ask the network filesystem to call into the | 
|  | cache (if present) to initialise the caching state for this read.  The netfs | 
|  | library module cannot access the cache directly, so the filesystem should | 
|  | call something like fscache_begin_read_operation() to do this. | 
|  |  | 
|  | The cache gets to store its state in ->cache_resources and must set a table | 
|  | of operations of its own there (though of a different type). | 
|  |  | 
|  | This should return 0 on success and an error code otherwise.  If an error is | 
|  | reported, the operation may proceed anyway, just without local caching (only | 
|  | out of memory and interruption errors cause failure here). | 
|  |  | 
|  | * ``expand_readahead()`` | 
|  |  | 
|  | [Optional] This is called to allow the filesystem to expand the size of a | 
|  | readahead read request.  The filesystem gets to expand the request in both | 
|  | directions, though it's not permitted to reduce it as the numbers may | 
|  | represent an allocation already made.  If local caching is enabled, it gets | 
|  | to expand the request first. | 
|  |  | 
|  | Expansion is communicated by changing ->start and ->len in the request | 
|  | structure.  Note that if any change is made, ->len must be increased by at | 
|  | least as much as ->start is reduced. | 
|  |  | 
|  | * ``clamp_length()`` | 
|  |  | 
|  | [Optional] This is called to allow the filesystem to reduce the size of a | 
|  | subrequest.  The filesystem can use this, for example, to chop up a request | 
|  | that has to be split across multiple servers or to put multiple reads in | 
|  | flight. | 
|  |  | 
|  | This should return true on success and false on error, in line with the | 
|  | bool return type in the table above. | 
|  |  | 
|  | * ``issue_op()`` | 
|  |  | 
|  | [Required] The helpers use this to dispatch a subrequest to the server for | 
|  | reading.  In the subrequest, ->start, ->len and ->transferred indicate what | 
|  | data should be read from the server. | 
|  |  | 
|  | There is no return value; the netfs_subreq_terminated() function should be | 
|  | called to indicate whether or not the operation succeeded and how much data | 
|  | it transferred.  The filesystem also should not deal with setting pages | 
|  | uptodate, unlocking them or dropping their refs - the helpers need to deal | 
|  | with this as they have to coordinate with copying to the local cache. | 
|  |  | 
|  | Note that the helpers have the pages locked, but not pinned.  It is possible | 
|  | to use the ITER_XARRAY iov iterator to refer to the range of the inode that | 
|  | is being operated upon without the need to allocate large bvec tables. | 
|  |  | 
|  | * ``is_still_valid()`` | 
|  |  | 
|  | [Optional] This is called to find out if the data just read from the local | 
|  | cache is still valid.  It should return true if it is still valid and false | 
|  | if not.  If it's not still valid, it will be reread from the server. | 
|  |  | 
|  | * ``check_write_begin()`` | 
|  |  | 
|  | [Optional] This is called from the netfs_write_begin() helper once it has | 
|  | allocated/grabbed the page to be modified to allow the filesystem to flush | 
|  | conflicting state before allowing it to be modified. | 
|  |  | 
|  | It should return 0 if everything is now fine, -EAGAIN if the page should be | 
|  | regrabbed and any other error code to abort the operation. | 
|  |  | 
|  | * ``done()`` | 
|  |  | 
|  | [Optional] This is called after the pages in the request have all been | 
|  | unlocked (and marked uptodate if applicable). | 
|  |  | 
|  | * ``cleanup()`` | 
|  |  | 
|  | [Optional] This is called as the request is being deallocated so that the | 
|  | filesystem can clean up ->netfs_priv. | 
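
The constraints on ``expand_readahead()`` and ``clamp_length()`` can be
sketched with a userspace model.  The structures, ``MODEL_GRANULE`` and
``MODEL_RSIZE`` are hypothetical values chosen for the example, and returning
true from the clamp function to let the slice proceed is an assumption based
on the bool return type in the operations table:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins holding only the fields the two ops adjust. */
struct model_rreq   { long long start; size_t len; };
struct model_subreq { long long start; size_t len; };

#define MODEL_GRANULE	4096LL	/* Hypothetical preferred alignment */
#define MODEL_RSIZE	65536u	/* Hypothetical maximum read size */

/*
 * Expand the request outwards to granule boundaries.  The start only moves
 * down and the end only moves up, so the length grows by at least as much
 * as the start is reduced and the request is never shrunk.
 */
static void model_expand_readahead(struct model_rreq *rreq)
{
	long long end = rreq->start + (long long)rreq->len;
	long long new_start = rreq->start - (rreq->start % MODEL_GRANULE);
	long long new_end = (end + MODEL_GRANULE - 1) / MODEL_GRANULE *
			    MODEL_GRANULE;

	assert(rreq->start >= 0);
	rreq->start = new_start;
	rreq->len = (size_t)(new_end - new_start);
}

/*
 * Cap a slice at the maximum read size.  Returns true if the slice may
 * proceed (an assumed convention) and false for an unusable slice.
 */
static bool model_clamp_length(struct model_subreq *subreq)
{
	if (subreq->len == 0)
		return false;
	if (subreq->len > MODEL_RSIZE)
		subreq->len = MODEL_RSIZE;
	return true;
}
```

For instance, a request for 3000 bytes at position 5000 becomes a request for
4096 bytes at position 4096 under these assumed values.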
|  |  | 
|  |  | 
|  |  | 
|  | Read Helper Procedure | 
|  | --------------------- | 
|  |  | 
|  | The read helpers work by the following general procedure: | 
|  |  | 
|  | * Set up the request. | 
|  |  | 
|  | * For readahead, allow the local cache and then the network filesystem to | 
|  | propose expansions to the read request.  This is then proposed to the VM. | 
|  | If the VM cannot fully perform the expansion, a partially expanded read will | 
|  | be performed, though this may not get written to the cache in its entirety. | 
|  |  | 
|  | * Loop around slicing chunks off of the request to form subrequests: | 
|  |  | 
|  | * If a local cache is present, it gets to do the slicing, otherwise the | 
|  | helpers just try to generate maximal slices. | 
|  |  | 
|  | * The network filesystem gets to clamp the size of each slice if it is to be | 
|  | the source.  This allows rsize and chunking to be implemented. | 
|  |  | 
|  | * The helpers issue a read from the cache or a read from the server or just | 
|  | clear the slice as appropriate. | 
|  |  | 
|  | * The next slice begins at the end of the last one. | 
|  |  | 
|  | * As slices finish being read, they terminate. | 
|  |  | 
|  | * When all the subrequests have terminated, the subrequests are assessed and | 
|  | any that are short or have failed are reissued: | 
|  |  | 
|  | * Failed cache requests are issued against the server instead. | 
|  |  | 
|  | * Failed server requests just fail. | 
|  |  | 
|  | * Short reads against either source will be reissued against that source | 
|  | provided they have transferred some more data: | 
|  |  | 
|  | * The cache may need to skip holes that it can't do DIO from. | 
|  |  | 
|  | * If NETFS_SREQ_CLEAR_TAIL was set, a short read will be cleared to the | 
|  | end of the slice instead of reissuing. | 
|  |  | 
|  | * Once the data is read, the pages that have been fully read/cleared: | 
|  |  | 
|  | * Will be marked uptodate. | 
|  |  | 
|  | * If a cache is present, will be marked with PG_fscache. | 
|  |  | 
|  | * Will be unlocked. | 
|  |  | 
|  | * Any pages that need writing to the cache will then have DIO writes issued. | 
|  |  | 
|  | * Synchronous operations will wait for reading to be complete. | 
|  |  | 
|  | * Writes to the cache will proceed asynchronously and the pages will have the | 
|  | PG_fscache mark removed when that completes. | 
|  |  | 
|  | * The request structures will be cleaned up when everything has completed. | 
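
The slicing loop and the assessment of short reads can be modelled in a few
lines of userspace C.  This is a simplified sketch and not the helpers' actual
code: the names are hypothetical, a single maximum slice size stands in for
both cache slicing and ``clamp_length()``, and treating a short read that
made no progress as a failure is an assumption:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_SLICES 16	/* Arbitrary capacity for the example */

struct model_slice { long long start; size_t len; };

/*
 * Cut [start, start + len) into slices of at most max_slice bytes.  Each
 * slice begins where the previous one ended.  Returns the number of slices,
 * or -1 if max_slice is zero or the output array is too small.
 */
static int model_slice_request(long long start, size_t len, size_t max_slice,
			       struct model_slice *out, int max_out)
{
	int n = 0;

	if (max_slice == 0)
		return -1;
	while (len > 0) {
		size_t part = len < max_slice ? len : max_slice;

		if (n == max_out)
			return -1;
		out[n].start = start;
		out[n].len = part;
		n++;
		start += (long long)part;
		len -= part;
	}
	return n;
}

enum model_action {
	MODEL_DONE,		/* Slice fully read */
	MODEL_REISSUE,		/* Short read that made progress */
	MODEL_CLEAR_TAIL,	/* Short read, remainder to be cleared */
	MODEL_FAIL,		/* Short read with no progress (assumed) */
};

/* Decide what to do with a slice once its read has terminated. */
static enum model_action model_assess(size_t len, size_t transferred_before,
				      size_t transferred_now, int clear_tail)
{
	if (transferred_now >= len)
		return MODEL_DONE;
	if (clear_tail)
		return MODEL_CLEAR_TAIL;
	if (transferred_now > transferred_before)
		return MODEL_REISSUE;
	return MODEL_FAIL;
}
```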
|  |  | 
|  |  | 
|  | Read Helper Cache API | 
|  | --------------------- | 
|  |  | 
|  | When implementing a local cache to be used by the read helpers, two things are | 
|  | required: some way for the network filesystem to initialise the caching for a | 
|  | read request and a table of operations for the helpers to call. | 
|  |  | 
|  | The network filesystem's ->begin_cache_operation() method is called to set up a | 
|  | cache and this must call into the cache to do the work.  If using fscache, for | 
|  | example, the network filesystem would call:: | 
|  |  | 
|  | int fscache_begin_read_operation(struct netfs_read_request *rreq, | 
|  | struct fscache_cookie *cookie); | 
|  |  | 
|  | passing in the request pointer and the cookie corresponding to the file. | 
|  |  | 
|  | The netfs_read_request object contains a place for the cache to hang its | 
|  | state:: | 
|  |  | 
|  | struct netfs_cache_resources { | 
|  | const struct netfs_cache_ops	*ops; | 
|  | void				*cache_priv; | 
|  | void				*cache_priv2; | 
|  | }; | 
|  |  | 
|  | This contains an operations table pointer and two private pointers.  The | 
|  | operation table looks like the following:: | 
|  |  | 
|  | struct netfs_cache_ops { | 
|  | void (*end_operation)(struct netfs_cache_resources *cres); | 
|  |  | 
|  | void (*expand_readahead)(struct netfs_cache_resources *cres, | 
|  | loff_t *_start, size_t *_len, loff_t i_size); | 
|  |  | 
|  | enum netfs_read_source (*prepare_read)(struct netfs_read_subrequest *subreq, | 
|  | loff_t i_size); | 
|  |  | 
|  | int (*read)(struct netfs_cache_resources *cres, | 
|  | loff_t start_pos, | 
|  | struct iov_iter *iter, | 
|  | bool seek_data, | 
|  | netfs_io_terminated_t term_func, | 
|  | void *term_func_priv); | 
|  |  | 
|  | int (*write)(struct netfs_cache_resources *cres, | 
|  | loff_t start_pos, | 
|  | struct iov_iter *iter, | 
|  | netfs_io_terminated_t term_func, | 
|  | void *term_func_priv); | 
|  | }; | 
|  |  | 
|  | With a termination handler function pointer:: | 
|  |  | 
|  | typedef void (*netfs_io_terminated_t)(void *priv, | 
|  | ssize_t transferred_or_error, | 
|  | bool was_async); | 
|  |  | 
|  | The methods defined in the table are: | 
|  |  | 
|  | * ``end_operation()`` | 
|  |  | 
|  | [Required] Called to clean up the resources at the end of the read request. | 
|  |  | 
|  | * ``expand_readahead()`` | 
|  |  | 
|  | [Optional] Called at the beginning of a netfs_readahead() operation to allow | 
|  | the cache to expand a request in either direction.  This allows the cache to | 
|  | size the request appropriately for the cache granularity. | 
|  |  | 
|  | The function is passed pointers to the start and length in its parameters, | 
|  | plus the size of the file for reference, and adjusts the start and length | 
|  | appropriately. | 
|  |  | 
|  | * ``prepare_read()`` | 
|  |  | 
|  | [Required] Called to configure the next slice of a request.  ->start and | 
|  | ->len in the subrequest indicate where and how big the next slice can be; | 
|  | the cache gets to reduce the length to match its granularity requirements. | 
|  |  | 
|  | It should return one of: | 
|  |  | 
|  | * ``NETFS_FILL_WITH_ZEROES`` | 
|  | * ``NETFS_DOWNLOAD_FROM_SERVER`` | 
|  | * ``NETFS_READ_FROM_CACHE`` | 
|  | * ``NETFS_INVALID_READ`` | 
|  |  | 
|  | to indicate whether the slice should just be cleared or whether it should be | 
|  | downloaded from the server or read from the cache - or whether slicing | 
|  | should be given up at the current point. | 
|  |  | 
|  | * ``read()`` | 
|  |  | 
|  | [Required] Called to read from the cache.  The start file offset is given | 
|  | along with an iterator to read to, which gives the length also.  It can be | 
|  | given a hint requesting that it seek forward from that start position for | 
|  | data. | 
|  |  | 
|  | Also provided is a pointer to a termination handler function and private | 
|  | data to pass to that function.  The termination function should be called | 
|  | with the number of bytes transferred or an error code, plus a flag | 
|  | indicating whether the termination is definitely happening in the caller's | 
|  | context. | 
|  |  | 
|  | * ``write()`` | 
|  |  | 
|  | [Required] Called to write to the cache.  The start file offset is given | 
|  | along with an iterator to write from, which gives the length also. | 
|  |  | 
|  | Also provided is a pointer to a termination handler function and private | 
|  | data to pass to that function.  The termination function should be called | 
|  | with the number of bytes transferred or an error code, plus a flag | 
|  | indicating whether the termination is definitely happening in the caller's | 
|  | context. | 
|  |  | 
|  | Note that these methods are passed a pointer to the cache resource structure, | 
|  | not the read request structure as they could be used in other situations where | 
|  | there isn't a read request structure as well, such as writing dirty data to the | 
|  | cache. |
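
The cache side can be sketched with a userspace model as well.  All names
here are hypothetical: ``struct model_cache`` is a toy cache of four 4-byte
granules, a plain buffer and length replace the ``iov_iter``, and the enum
mirrors the source names listed above with arbitrary numeric values.  The
sketch shows a ``prepare_read()``-style function that trims a slice to one
granule and picks a source, and a ``read()``-style function that reports its
result through a termination handler:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>	/* ssize_t */

/* Same shape as the termination handler typedef shown above. */
typedef void (*model_io_terminated_t)(void *priv,
				      ssize_t transferred_or_error,
				      bool was_async);

enum model_read_source {	/* Numeric values are arbitrary here */
	MODEL_FILL_WITH_ZEROES,
	MODEL_DOWNLOAD_FROM_SERVER,
	MODEL_READ_FROM_CACHE,
	MODEL_INVALID_READ,
};

#define MODEL_GRANULE		4u	/* Toy cache granularity */
#define MODEL_NR_GRANULES	4u

struct model_cache {
	bool present[MODEL_NR_GRANULES];	/* Granule holds data? */
	char data[MODEL_GRANULE * MODEL_NR_GRANULES];
};

/* Trim *len so the slice stays within one granule, then pick a source. */
static enum model_read_source
model_prepare_read(const struct model_cache *c, long long start, size_t *len,
		   long long i_size)
{
	size_t g, room;

	if (start < 0 || *len == 0)
		return MODEL_INVALID_READ;
	if (start >= i_size)
		return MODEL_FILL_WITH_ZEROES;	/* Beyond the end of file */
	g = (size_t)start / MODEL_GRANULE;
	if (g >= MODEL_NR_GRANULES)
		return MODEL_DOWNLOAD_FROM_SERVER;
	room = MODEL_GRANULE - (size_t)start % MODEL_GRANULE;
	if (*len > room)
		*len = room;
	return c->present[g] ? MODEL_READ_FROM_CACHE :
			       MODEL_DOWNLOAD_FROM_SERVER;
}

/* Copy from the cache and report the outcome via the termination handler. */
static int model_cache_read(const struct model_cache *c, long long start_pos,
			    char *dst, size_t len,
			    model_io_terminated_t term_func,
			    void *term_func_priv)
{
	size_t size = sizeof(c->data);
	ssize_t ret;

	if (start_pos < 0 || (size_t)start_pos > size ||
	    len > size - (size_t)start_pos) {
		ret = -EIO;
	} else {
		memcpy(dst, c->data + start_pos, len);
		ret = (ssize_t)len;
	}
	if (term_func)
		term_func(term_func_priv, ret, false);
	return ret < 0 ? (int)ret : 0;
}

/* Example termination handler: store the result where priv points. */
static void model_record(void *priv, ssize_t transferred_or_error,
			 bool was_async)
{
	(void)was_async;
	*(ssize_t *)priv = transferred_or_error;
}
```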