|  | =============================== | 
|  | Documentation for /proc/sys/fs/ | 
|  | =============================== | 
|  |  | 
|  | kernel version 2.2.10 | 
|  |  | 
|  | Copyright (c) 1998, 1999,  Rik van Riel <riel@nl.linux.org> | 
|  |  | 
|  | Copyright (c) 2009,        Shen Feng <shen@cn.fujitsu.com> | 
|  |  | 
|  | For general info and legal blurb, please look in intro.rst. | 
|  |  | 
|  | ------------------------------------------------------------------------------ | 
|  |  | 
|  | This file contains documentation for the sysctl files in | 
|  | /proc/sys/fs/ and is valid for Linux kernel version 2.2. | 
|  |  | 
|  | The files in this directory can be used to tune and monitor | 
|  | miscellaneous and general things in the operation of the Linux | 
|  | kernel. Since some of the files _can_ be used to screw up your | 
|  | system, it is advisable to read both documentation and source | 
|  | before actually making adjustments. | 
|  |  | 
|  | 1. /proc/sys/fs | 
|  | =============== | 
|  |  | 
|  | Currently, these files are in /proc/sys/fs: | 
|  |  | 
|  | - aio-max-nr | 
|  | - aio-nr | 
|  | - dentry-state | 
|  | - dquot-max | 
|  | - dquot-nr | 
|  | - file-max | 
|  | - file-nr | 
|  | - inode-max | 
|  | - inode-nr | 
|  | - inode-state | 
|  | - mount-max | 
|  | - nr_open | 
|  | - overflowuid | 
|  | - overflowgid | 
|  | - pipe-user-pages-hard | 
|  | - pipe-user-pages-soft | 
|  | - protected_fifos | 
|  | - protected_hardlinks | 
|  | - protected_regular | 
|  | - protected_symlinks | 
|  | - suid_dumpable | 
|  | - super-max | 
|  | - super-nr | 
|  |  | 
|  |  | 
|  | aio-nr & aio-max-nr | 
|  | ------------------- | 
|  |  | 
|  | aio-nr is the running total of the number of events specified on the | 
|  | io_setup system call for all currently active aio contexts.  If aio-nr | 
|  | reaches aio-max-nr then io_setup will fail with EAGAIN.  Note that | 
|  | raising aio-max-nr does not result in the pre-allocation or re-sizing | 
|  | of any kernel data structures. | 
|  |  | 
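As a rough illustration of the relationship described above, the following
Python sketch estimates how many more events could be requested before aio-nr
reaches aio-max-nr. The helper names (read_int, aio_headroom) are invented for
this example, and the subtraction only approximates the kernel's own check.

```python
from pathlib import Path

FS = Path("/proc/sys/fs")  # directory documented in this file


def aio_headroom(aio_nr: int, aio_max_nr: int) -> int:
    """Events still available before aio-nr reaches aio-max-nr (never negative)."""
    return max(aio_max_nr - aio_nr, 0)


def read_int(name: str) -> int:
    """Read the first integer from a sysctl file such as aio-nr."""
    return int((FS / name).read_text().split()[0])


if __name__ == "__main__":
    try:
        used, limit = read_int("aio-nr"), read_int("aio-max-nr")
        print(f"aio-nr={used} aio-max-nr={limit} headroom={aio_headroom(used, limit)}")
    except (OSError, ValueError, IndexError):
        # Not running on Linux, or the files are not readable here.
        pass
```

A headroom of 0 suggests that further io_setup calls are likely to fail with
EAGAIN until existing aio contexts are destroyed or aio-max-nr is raised.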
|  |  | 
|  | dentry-state | 
|  | ------------ | 
|  |  | 
|  | From linux/include/linux/dcache.h:: | 
|  |  | 
|  |     struct dentry_stat_t { | 
|  |         long nr_dentry; | 
|  |         long nr_unused; | 
|  |         long age_limit;         /* age in seconds */ | 
|  |         long want_pages;        /* pages requested by system */ | 
|  |         long nr_negative;       /* # of unused negative dentries */ | 
|  |         long dummy;             /* Reserved for future use */ | 
|  |     }; | 
|  |  | 
|  | Dentries are dynamically allocated and deallocated. | 
|  |  | 
|  | nr_dentry shows the total number of dentries allocated (active | 
|  | + unused). nr_unused shows the number of dentries that are not | 
|  | actively used, but are saved in the LRU list for future reuse. | 
|  |  | 
|  | age_limit is the age in seconds after which dcache entries | 
|  | can be reclaimed when memory is short. want_pages is | 
|  | nonzero when shrink_dcache_pages() has been called and the | 
|  | dcache isn't pruned yet. | 
|  |  | 
|  | nr_negative shows the number of unused dentries that are also | 
|  | negative dentries, which do not map to any files. Instead, | 
|  | they help speed up rejection of lookups for nonexistent files | 
|  | requested by users. | 
|  |  | 
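The six fields can be read back in the order given by the structure above. The
sketch below is one possible way to parse them in Python. The DentryState
class and the function names are illustrative only, and the sample numbers in
the usage note are made up.

```python
from typing import NamedTuple


class DentryState(NamedTuple):
    """Fields of dentry-state, in the order of struct dentry_stat_t."""
    nr_dentry: int
    nr_unused: int
    age_limit: int
    want_pages: int
    nr_negative: int
    dummy: int


def parse_dentry_state(text: str) -> DentryState:
    """Parse the whitespace-separated contents of dentry-state."""
    fields = [int(x) for x in text.split()]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return DentryState(*fields)


def active_dentries(state: DentryState) -> int:
    """nr_dentry counts active + unused, so active is the difference."""
    return state.nr_dentry - state.nr_unused
```

For example, parse_dentry_state("81234 60210 45 0 1520 0") yields a state
whose active_dentries() value is 21024.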
|  |  | 
|  | dquot-max & dquot-nr | 
|  | -------------------- | 
|  |  | 
|  | The file dquot-max shows the maximum number of cached disk | 
|  | quota entries. | 
|  |  | 
|  | The file dquot-nr shows the number of allocated disk quota | 
|  | entries and the number of free disk quota entries. | 
|  |  | 
|  | If the number of free cached disk quotas is very low and | 
|  | you have some awesome number of simultaneous system users, | 
|  | you might want to raise the limit. | 
|  |  | 
|  |  | 
|  | file-max & file-nr | 
|  | ------------------ | 
|  |  | 
|  | The value in file-max denotes the maximum number of file | 
|  | handles that the Linux kernel will allocate. When you get lots | 
|  | of error messages about running out of file handles, you might | 
|  | want to increase this limit. | 
|  |  | 
|  | Historically, the kernel was able to allocate file handles | 
|  | dynamically, but not to free them again. The three values in | 
|  | file-nr denote the number of allocated file handles, the number | 
|  | of allocated but unused file handles, and the maximum number of | 
|  | file handles. Linux 2.6 and later always report 0 as the number | 
|  | of free file handles -- this is not an error, it just means that | 
|  | the number of allocated file handles exactly matches the number | 
|  | of used file handles. | 
|  |  | 
|  | Attempts to allocate more file descriptors than file-max are | 
|  | reported with printk; look for "VFS: file-max limit <number> | 
|  | reached" in the kernel log. | 
|  |  | 
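To make the three file-nr values concrete, here is a small Python sketch that
splits them and reports how much of file-max is in use. The function names are
hypothetical and the sample values in the usage note are invented.

```python
def parse_file_nr(text: str) -> tuple:
    """Return (allocated, unused, maximum) from the contents of file-nr."""
    allocated, unused, maximum = (int(x) for x in text.split())
    return allocated, unused, maximum


def handles_in_use_fraction(text: str) -> float:
    """Fraction of file-max consumed by handles that are actually in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / maximum
```

With the contents "1024 0 8192", the fraction is 0.125, meaning one eighth of
the file-max limit is in use.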
|  |  | 
|  | nr_open | 
|  | ------- | 
|  |  | 
|  | This denotes the maximum number of file handles a process can | 
|  | allocate. The default value is 1024*1024 (1048576), which should | 
|  | be enough for most machines. The actual limit depends on the | 
|  | RLIMIT_NOFILE resource limit. | 
|  |  | 
|  |  | 
|  | inode-max, inode-nr & inode-state | 
|  | --------------------------------- | 
|  |  | 
|  | As with file handles, the kernel allocates the inode structures | 
|  | dynamically, but can't free them yet. | 
|  |  | 
|  | The value in inode-max denotes the maximum number of inode | 
|  | handlers. This value should be 3-4 times larger than the value | 
|  | in file-max, since stdin, stdout and network sockets also | 
|  | need an inode struct to handle them. When you regularly run | 
|  | out of inodes, you need to increase this value. | 
|  |  | 
|  | The file inode-nr contains the first two items from | 
|  | inode-state, so we'll skip to that file... | 
|  |  | 
|  | Inode-state contains three actual numbers and four dummies. | 
|  | The actual numbers are, in order of appearance, nr_inodes, | 
|  | nr_free_inodes and preshrink. | 
|  |  | 
|  | Nr_inodes stands for the number of inodes the system has | 
|  | allocated. This can be slightly more than inode-max because | 
|  | Linux allocates them one pageful at a time. | 
|  |  | 
|  | Nr_free_inodes represents the number of free inodes (?) and | 
|  | preshrink is nonzero when nr_inodes > inode-max and the | 
|  | system needs to prune the inode list instead of allocating | 
|  | more. | 
|  |  | 
|  |  | 
|  | overflowgid & overflowuid | 
|  | ------------------------- | 
|  |  | 
|  | Some filesystems only support 16-bit UIDs and GIDs, although in Linux | 
|  | UIDs and GIDs are 32 bits. When one of these filesystems is mounted | 
|  | with writes enabled, any UID or GID that would exceed 65535 is translated | 
|  | to a fixed value before being written to disk. | 
|  |  | 
|  | These sysctls allow you to change the value of the fixed UID and GID. | 
|  | The default is 65534. | 
|  |  | 
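The translation described above can be sketched in a few lines of Python. The
function name to_16bit_id is invented for this example, and the code only
models the rule stated in the text rather than any particular filesystem
driver.

```python
MAX_16BIT_ID = 65535      # largest ID a 16-bit field can hold
DEFAULT_OVERFLOW = 65534  # default value of overflowuid / overflowgid


def to_16bit_id(id_value: int, overflow: int = DEFAULT_OVERFLOW) -> int:
    """Return the ID as it would be stored on a 16-bit-only filesystem."""
    return id_value if id_value <= MAX_16BIT_ID else overflow
```

Under this model, UID 1000 is stored unchanged, while UID 70000 is stored as
65534 unless overflowuid has been changed.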
|  |  | 
|  | pipe-user-pages-hard | 
|  | -------------------- | 
|  |  | 
|  | Maximum total number of pages a non-privileged user may allocate for pipes. | 
|  | Once this limit is reached, no new pipes may be allocated until usage goes | 
|  | below the limit again. When set to 0, no limit is applied, which is the default | 
|  | setting. | 
|  |  | 
|  |  | 
|  | pipe-user-pages-soft | 
|  | -------------------- | 
|  |  | 
|  | Maximum total number of pages a non-privileged user may allocate for pipes | 
|  | before the pipe size gets limited to a single page. Once this limit is reached, | 
|  | new pipes will be limited to a single page in size for this user in order to | 
|  | limit total memory usage, and trying to increase them using fcntl() will be | 
|  | denied until usage goes below the limit again. The default value allows | 
|  | allocating up to 1024 pipes at their default size. When set to 0, no limit | 
|  | is applied. | 
|  |  | 
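The "1024 pipes at their default size" figure can be turned into a page count
with simple arithmetic. The sketch below assumes a default pipe size of 16
pages and a 4096-byte page. Neither number is stated in this document, so
check both against your kernel and architecture.

```python
DEFAULT_PIPE_PAGES = 16  # assumption: default pipe capacity in pages
PAGE_SIZE = 4096         # assumption: page size in bytes


def soft_limit_pages(pipes: int = 1024,
                     pages_per_pipe: int = DEFAULT_PIPE_PAGES) -> int:
    """Pages needed for the given number of default-sized pipes."""
    return pipes * pages_per_pipe


def pages_to_bytes(pages: int, page_size: int = PAGE_SIZE) -> int:
    """Convert a page count into bytes."""
    return pages * page_size
```

Under these assumptions the default soft limit works out to 16384 pages, or
64 MiB of pipe buffers per user.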
|  |  | 
|  | protected_fifos | 
|  | --------------- | 
|  |  | 
|  | The intent of this protection is to avoid unintentional writes to | 
|  | an attacker-controlled FIFO, where a program expected to create a regular | 
|  | file. | 
|  |  | 
|  | When set to "0", writing to FIFOs is unrestricted. | 
|  |  | 
|  | When set to "1", O_CREAT opens are not allowed on FIFOs that we don't | 
|  | own in world-writable sticky directories, unless they are owned by the | 
|  | owner of the directory. | 
|  |  | 
|  | When set to "2", it also applies to group-writable sticky directories. | 
|  |  | 
|  | This protection is based on the restrictions in Openwall. | 
|  |  | 
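The rules above can be restated as a small decision function. This Python
sketch is only a model of the text. The function and parameter names are
invented, and the real check lives in the kernel's VFS code.

```python
def fifo_open_denied(protected_fifos: int, dir_sticky: bool,
                     dir_world_writable: bool, dir_group_writable: bool,
                     fifo_uid: int, opener_uid: int, dir_uid: int) -> bool:
    """Return True if an O_CREAT open of an existing FIFO would be refused."""
    if protected_fifos == 0 or not dir_sticky:
        return False  # protection disabled, or not a sticky directory
    if fifo_uid == opener_uid or fifo_uid == dir_uid:
        return False  # we own the FIFO, or the directory owner does
    if dir_world_writable:
        return True   # applies for settings "1" and "2"
    # Setting "2" extends the restriction to group-writable directories.
    return protected_fifos >= 2 and dir_group_writable
```

For instance, with protected_fifos set to 1, a FIFO owned by UID 1001 in a
world-writable sticky directory owned by root cannot be opened with O_CREAT by
UID 1000.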
|  |  | 
|  | protected_hardlinks | 
|  | -------------------- | 
|  |  | 
|  | A long-standing class of security issues is the hardlink-based | 
|  | time-of-check-time-of-use race, most commonly seen in world-writable | 
|  | directories like /tmp. The common method of exploitation of this flaw | 
|  | is to cross privilege boundaries when following a given hardlink (i.e. a | 
|  | root process follows a hardlink created by another user). Additionally, | 
|  | on systems without separated partitions, this stops unauthorized users | 
|  | from "pinning" vulnerable setuid/setgid files against being upgraded by | 
|  | the administrator, or linking to special files. | 
|  |  | 
|  | When set to "0", hardlink creation behavior is unrestricted. | 
|  |  | 
|  | When set to "1", hardlinks cannot be created by users if they do not | 
|  | already own the source file, or do not have read/write access to it. | 
|  |  | 
|  | This protection is based on the restrictions in Openwall and grsecurity. | 
|  |  | 
|  |  | 
|  | protected_regular | 
|  | ----------------- | 
|  |  | 
|  | This protection is similar to protected_fifos, but it | 
|  | avoids writes to an attacker-controlled regular file, where a program | 
|  | expected to create one. | 
|  |  | 
|  | When set to "0", writing to regular files is unrestricted. | 
|  |  | 
|  | When set to "1", O_CREAT opens are not allowed on regular files that we | 
|  | don't own in world-writable sticky directories, unless they are | 
|  | owned by the owner of the directory. | 
|  |  | 
|  | When set to "2", it also applies to group-writable sticky directories. | 
|  |  | 
|  |  | 
|  | protected_symlinks | 
|  | ------------------ | 
|  |  | 
|  | A long-standing class of security issues is the symlink-based | 
|  | time-of-check-time-of-use race, most commonly seen in world-writable | 
|  | directories like /tmp. The common method of exploitation of this flaw | 
|  | is to cross privilege boundaries when following a given symlink (i.e. a | 
|  | root process follows a symlink belonging to another user). For a likely | 
|  | incomplete list of hundreds of examples across the years, please see: | 
|  | https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp | 
|  |  | 
|  | When set to "0", symlink following behavior is unrestricted. | 
|  |  | 
|  | When set to "1", symlinks are permitted to be followed only when outside | 
|  | a sticky world-writable directory, or when the uid of the symlink and | 
|  | follower match, or when the directory owner matches the symlink's owner. | 
|  |  | 
|  | This protection is based on the restrictions in Openwall and grsecurity. | 
|  |  | 
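The three conditions under which a symlink may still be followed can be
written out as a short predicate. This Python sketch mirrors the wording above.
The function and parameter names are hypothetical and it is not the kernel
implementation.

```python
def may_follow_symlink(protected_symlinks: int, dir_sticky: bool,
                       dir_world_writable: bool, link_uid: int,
                       follower_uid: int, dir_uid: int) -> bool:
    """Return True if following the symlink is permitted."""
    if protected_symlinks == 0:
        return True  # unrestricted behavior
    if not (dir_sticky and dir_world_writable):
        return True  # outside a sticky world-writable directory
    if link_uid == follower_uid:
        return True  # symlink owner and follower match
    return link_uid == dir_uid  # directory owner matches symlink owner
```

In the classic /tmp race, root (UID 0) following a symlink planted by UID 1000
in a root-owned sticky world-writable directory is refused under this model.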
|  |  | 
|  | suid_dumpable | 
|  | ------------- | 
|  |  | 
|  | This value can be used to query and set the core dump mode for setuid | 
|  | or otherwise protected/tainted binaries. The modes are | 
|  |  | 
|  | =   ==========  =============================================================== | 
|  | 0   (default)   traditional behaviour. Any process which has changed | 
|  |                 privilege levels or is execute only will not be dumped. | 
|  | 1   (debug)     all processes dump core when possible. The core dump is | 
|  |                 owned by the current user and no security is applied. This is | 
|  |                 intended for system debugging situations only. | 
|  |                 Ptrace is unchecked. | 
|  |                 This is insecure as it allows regular users to examine the | 
|  |                 memory contents of privileged processes. | 
|  | 2   (suidsafe)  any binary which normally would not be dumped is dumped | 
|  |                 anyway, but only if the "core_pattern" kernel sysctl is set to | 
|  |                 either a pipe handler or a fully qualified path. (For more | 
|  |                 details on this limitation, see CVE-2006-2451.) This mode is | 
|  |                 appropriate when administrators are attempting to debug | 
|  |                 problems in a normal environment, and either have a core dump | 
|  |                 pipe handler that knows to treat privileged core dumps with | 
|  |                 care, or specific directory defined for catching core dumps. | 
|  |                 If a core dump happens without a pipe handler or fully | 
|  |                 qualified path, a message will be emitted to syslog warning | 
|  |                 about the lack of a correct setting. | 
|  | =   ==========  =============================================================== | 
|  |  | 
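The interaction between the three modes and core_pattern can be summarized in
a short Python sketch. The function names are invented for this example, and
the code models only the rules listed in the table above, ignoring every other
condition that affects whether a core is written.

```python
def pattern_is_suidsafe(core_pattern: str) -> bool:
    """True if core_pattern is a pipe handler or a fully qualified path."""
    pattern = core_pattern.strip()
    return pattern.startswith("|") or pattern.startswith("/")


def would_dump(suid_dumpable: int, privileged: bool, core_pattern: str) -> bool:
    """Model whether a process may dump core under the given mode."""
    if not privileged:
        return True   # ordinary processes are not restricted by this sysctl
    if suid_dumpable == 1:
        return True   # debug mode: everything dumps, no security applied
    if suid_dumpable == 2:
        return pattern_is_suidsafe(core_pattern)
    return False      # mode 0: privileged processes are not dumped
```

For example, in mode 2 a setuid binary dumps with a core_pattern of
"/var/crash/core.%p" but not with the relative pattern "core".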
|  |  | 
|  | super-max & super-nr | 
|  | -------------------- | 
|  |  | 
|  | These numbers control the maximum number of superblocks, and | 
|  | thus the maximum number of mounted filesystems the kernel | 
|  | can have. You only need to increase super-max if you need to | 
|  | mount more filesystems than the current value in super-max | 
|  | allows you to. | 
|  |  | 
|  |  | 
|  | aio-nr & aio-max-nr | 
|  | ------------------- | 
|  |  | 
|  | aio-nr shows the current system-wide number of asynchronous I/O | 
|  | requests.  aio-max-nr allows you to change the maximum value | 
|  | aio-nr can grow to. | 
|  |  | 
|  |  | 
|  | mount-max | 
|  | --------- | 
|  |  | 
|  | This denotes the maximum number of mounts that may exist | 
|  | in a mount namespace. | 
|  |  | 
|  |  | 
|  |  | 
|  | 2. /proc/sys/fs/binfmt_misc | 
|  | =========================== | 
|  |  | 
|  | Documentation for the files in /proc/sys/fs/binfmt_misc is | 
|  | in Documentation/admin-guide/binfmt-misc.rst. | 
|  |  | 
|  |  | 
|  | 3. /proc/sys/fs/mqueue - POSIX message queues filesystem | 
|  | ======================================================== | 
|  |  | 
|  |  | 
|  | The "mqueue" filesystem provides the necessary kernel features to enable the | 
|  | creation of a user space library that implements the POSIX message queues | 
|  | API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System | 
|  | Interfaces specification). | 
|  |  | 
|  | The "mqueue" filesystem contains values for determining/setting the amount of | 
|  | resources used by the filesystem. | 
|  |  | 
|  | /proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the | 
|  | maximum number of message queues allowed on the system. | 
|  |  | 
|  | /proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the | 
|  | maximum number of messages in a queue. It acts as the upper bound for another | 
|  | (user) limit, which is set in the mq_open invocation: that attribute of a | 
|  | queue must be less than or equal to msg_max. | 
|  |  | 
|  | /proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the | 
|  | maximum message size (it is an attribute of every message queue, set during | 
|  | its creation). | 
|  |  | 
|  | /proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the | 
|  | default number of messages in a queue if the attr parameter of mq_open(2) is | 
|  | NULL. If it exceeds msg_max, the default value is initialized to msg_max. | 
|  |  | 
|  | /proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting | 
|  | the default message size if the attr parameter of mq_open(2) is NULL. If it | 
|  | exceeds msgsize_max, the default value is initialized to msgsize_max. | 
|  |  | 
|  | 4. /proc/sys/fs/epoll - Configuration options for the epoll interface | 
|  | ===================================================================== | 
|  |  | 
|  | This directory contains configuration options for the epoll(7) interface. | 
|  |  | 
|  | max_user_watches | 
|  | ---------------- | 
|  |  | 
|  | Every epoll file descriptor can store a number of files to be monitored | 
|  | for event readiness. Each one of these monitored files constitutes a "watch". | 
|  | This configuration option sets the maximum number of "watches" that are | 
|  | allowed for each user. | 
|  | Each "watch" costs roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes | 
|  | on a 64-bit one. | 
|  | The current default value for max_user_watches is 1/25 (4%) of the | 
|  | available low memory, divided by the "watch" cost in bytes. | 
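
The default described above is plain integer arithmetic, which the following
Python sketch reproduces. The function name is invented, and the memory sizes
used in the usage note are example inputs, not values taken from any
particular machine.

```python
def default_max_user_watches(lowmem_bytes: int, watch_cost_bytes: int) -> int:
    """1/25 of low memory, divided by the per-watch cost in bytes."""
    return (lowmem_bytes // 25) // watch_cost_bytes


# Approximate per-watch costs quoted in the text above.
WATCH_COST_32BIT = 90
WATCH_COST_64BIT = 160
```

With 4 GiB of low memory on a 64-bit kernel this gives
default_max_user_watches(4 * 1024**3, WATCH_COST_64BIT) == 1073741. Because
the per-watch costs are only rough figures, a real system may report a
somewhat different default.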