- .. SPDX-License-Identifier: GPL-2.0
- ===================================
- Cache on Already Mounted Filesystem
- ===================================
- .. Contents:
- (*) Overview.
- (*) Requirements.
- (*) Configuration.
- (*) Starting the cache.
- (*) Things to avoid.
- (*) Cache culling.
- (*) Cache structure.
- (*) Security model and SELinux.
- (*) A note on security.
- (*) Statistical information.
- (*) Debugging.
- (*) On-demand Read.
- Overview
- ========
- CacheFiles is a caching backend that uses a directory on an already mounted
- filesystem of a local type (such as Ext3) as its cache.
- CacheFiles uses a userspace daemon to do some of the cache management - such as
- reaping stale nodes and culling. This is called cachefilesd and lives in
- /sbin.
- The filesystem and data integrity of the cache are only as good as those of the
- filesystem providing the backing services. Note that CacheFiles does not
- attempt to journal anything since the journalling interfaces of the various
- filesystems are very specific in nature.
- CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
- to communicate with the daemon. Only one thing may have this open at once,
- and while it is open, a cache is at least partially in existence. The daemon
- opens this and sends commands down it to control the cache.
- CacheFiles is currently limited to a single cache.
- CacheFiles attempts to maintain at least a certain percentage of free space on
- the filesystem, shrinking the cache by culling the objects it contains to make
- space if necessary - see the "Cache Culling" section. This means it can be
- placed on the same medium as a live set of data, and will expand to make use of
- spare space and automatically contract when the set of data requires more
- space.
- Requirements
- ============
- The use of CacheFiles and its daemon requires the following features to be
- available in the system and in the cache filesystem:
- - dnotify.
- - extended attributes (xattrs).
- - openat() and friends.
- - bmap() support on files in the filesystem (FIBMAP ioctl).
- - The use of bmap() to detect a partial page at the end of the file.
- It is strongly recommended that the "dir_index" option is enabled on Ext3
- filesystems being used as a cache.
- Configuration
- =============
- The cache is configured by a script in /etc/cachefilesd.conf. These commands
- set up the cache ready for use. The following script commands are available:
- brun <N>%, bcull <N>%, bstop <N>%, frun <N>%, fcull <N>%, fstop <N>%
- Configure the culling limits. Optional. See the section on culling.
- The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.
- The commands beginning with a 'b' are file space (block) limits, those
- beginning with an 'f' are file count limits.
- dir <path>
- Specify the directory containing the root of the cache. Mandatory.
- tag <name>
- Specify a tag for FS-Cache to use in distinguishing multiple caches.
- Optional. The default is "CacheFiles".
- debug <mask>
- Specify a numeric bitmask to control debugging in the kernel module.
- Optional. The default is zero (all off). The following values can be
- OR'd into the mask to collect various information:
- == =================================================
- 1 Turn on trace of function entry (_enter() macros)
- 2 Turn on trace of function exit (_leave() macros)
- 4 Turn on trace of internal debug points (_debug())
- == =================================================
- This mask can also be set through sysfs, e.g.::
- echo 5 > /sys/module/cachefiles/parameters/debug
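As a rough illustration of the command syntax described above, the following Python sketch parses configuration text into a dictionary. The function name ``parse_cachefilesd_conf``, the treatment of lines starting with ``#`` as comments, and the integer-only percentages are assumptions made for this example; it does not describe how cachefilesd itself reads its configuration.

```python
# Hypothetical helper: parse cachefilesd.conf-style text into a dict.
# Defaults follow the values stated above (7% run, 5% cull, 1% stop).
LIMIT_DEFAULTS = {"brun": 7, "bcull": 5, "bstop": 1,
                  "frun": 7, "fcull": 5, "fstop": 1}


def parse_cachefilesd_conf(text):
    conf = dict(LIMIT_DEFAULTS, tag="CacheFiles", debug=0)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        parts = line.split(None, 1)
        cmd = parts[0]
        arg = parts[1].strip() if len(parts) > 1 else ""
        if cmd in LIMIT_DEFAULTS:
            if not arg.endswith("%"):
                raise ValueError(f"{cmd}: expected '<N>%', got {arg!r}")
            conf[cmd] = int(arg[:-1])
        elif cmd in ("dir", "tag"):
            conf[cmd] = arg
        elif cmd == "debug":
            conf[cmd] = int(arg, 0)  # decimal or 0x-prefixed mask
        else:
            raise ValueError(f"unknown command: {cmd!r}")
    if "dir" not in conf:
        raise ValueError("'dir' is mandatory")
    return conf
```

Calling it on a file that contains only ``dir /var/fscache`` would return the stated defaults for every other setting.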
- Starting the Cache
- ==================
- The cache is started by running the daemon. The daemon opens the cache device,
- configures the cache and tells it to begin caching. At that point the cache
- binds to fscache and the cache becomes live.
- The daemon is run as follows::
- /sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>]
- The flags are:
- ``-d``
- Increase the debugging level. This can be specified multiple times and
- is cumulative with itself.
- ``-s``
- Send messages to stderr instead of syslog.
- ``-n``
- Don't daemonise and go into background.
- ``-f <configfile>``
- Use an alternative configuration file rather than the default one.
- Things to Avoid
- ===============
- Do not mount other things within the cache as this will cause problems. The
- kernel module contains its own very cut-down path walking facility that ignores
- mountpoints, but the daemon can't avoid them.
- Do not create, rename or unlink files and directories in the cache while the
- cache is active, as this may cause the state to become uncertain.
- Renaming files in the cache might make objects appear to be other objects (the
- filename is part of the lookup key).
- Do not change or remove the extended attributes attached to cache files by the
- cache as this will cause the cache state management to get confused.
- Do not create files or directories in the cache, lest the cache get confused or
- serve incorrect data.
- Do not chmod files in the cache. The module creates things with minimal
- permissions to prevent random users being able to access them directly.
- Cache Culling
- =============
- The cache may need culling occasionally to make space. This involves
- discarding objects from the cache that have been used less recently than
- anything else. Culling is based on the access time of data objects. Empty
- directories are culled if not in use.
- Cache culling is done on the basis of the percentage of blocks and the
- percentage of files available in the underlying filesystem. There are six
- "limits":
- brun, frun
- If the amount of free space and the number of available files in the cache
- rise above both these limits, then culling is turned off.
- bcull, fcull
- If the amount of available space or the number of available files in the
- cache falls below either of these limits, then culling is started.
- bstop, fstop
- If the amount of available space or the number of available files in the
- cache falls below either of these limits, then no further allocation of
- disk space or files is permitted until culling has raised things above
- these limits again.
- These must be configured as follows::
- 0 <= bstop < bcull < brun < 100
- 0 <= fstop < fcull < frun < 100
- Note that these are percentages of available space and available files, and do
- *not* appear as 100 minus the percentage displayed by the "df" program.
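The relationship between the six limits can be sketched in Python as follows. The function names and the dictionary layout are invented for the example, and the handling of values exactly equal to a limit is an assumption, since the text above only speaks of rising above or falling below.

```python
def check_limits(stop, cull, run):
    """Return True if 0 <= stop < cull < run < 100 holds."""
    return 0 <= stop < cull < run < 100


def update_culling(culling, bfree, ffree, limits):
    """Return (culling, allocation_allowed) for the current state.

    limits maps 'brun', 'bcull', 'bstop', 'frun', 'fcull', 'fstop' to
    percentages; bfree and ffree are the percentages of available
    blocks and files on the backing filesystem.
    """
    if bfree < limits["bcull"] or ffree < limits["fcull"]:
        culling = True    # either resource fell below its cull limit
    elif bfree > limits["brun"] and ffree > limits["frun"]:
        culling = False   # both resources rose above their run limits
    # Between the cull and run limits the previous state is kept.
    allowed = bfree >= limits["bstop"] and ffree >= limits["fstop"]
    return culling, allowed
```

Keeping the previous state between the cull and run limits gives the hysteresis implied by having separate start and stop thresholds.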
- The userspace daemon scans the cache to build up a table of cullable objects.
- These are then culled in least recently used order. A new scan of the cache is
- started as soon as space is made in the table. Objects will be skipped if
- their atimes have changed or if the kernel module says it is still using them.
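The least-recently-used selection can be illustrated with a short Python sketch. It is not the daemon's actual data structure; ``build_cull_table``, its parameters and the ``in_use`` set are hypothetical names used only to show the ordering by access time.

```python
import heapq


def build_cull_table(objects, table_size, in_use=frozenset()):
    """Pick up to table_size least recently used objects.

    objects is an iterable of (name, atime) pairs; in_use holds names
    that must be skipped. The result is ordered oldest atime first.
    """
    candidates = ((atime, name) for name, atime in objects
                  if name not in in_use)
    return [name for atime, name in heapq.nsmallest(table_size, candidates)]
```

For example, with objects whose atimes are 30, 10, 20 and 5, a table of size two would hold the objects with atimes 5 and 10, in that order.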
- Cache Structure
- ===============
- The CacheFiles module will create two directories in the directory it was
- given:
- * cache/
- * graveyard/
- The active cache objects all reside in the first directory. The CacheFiles
- kernel module moves any retired or culled objects that it can't simply unlink
- to the graveyard from which the daemon will actually delete them.
- The daemon uses dnotify to monitor the graveyard directory, and will delete
- anything that appears therein.
- The module represents index objects as directories with the filename "I..." or
- "J...". Note that the "cache/" directory is itself a special index.
- Data objects are represented as files if they have no children, or directories
- if they do. Their filenames all begin "D..." or "E...". If represented as a
- directory, data objects will have a file in the directory called "data" that
- actually holds the data.
- Special objects are similar to data objects, except their filenames begin
- "S..." or "T...".
- If an object has children, then it will be represented as a directory.
- Immediately in the representative directory are a collection of directories
- named for hash values of the child object keys with an '@' prepended. Into
- this directory, if possible, will be placed the representations of the child
- objects::
- /INDEX /INDEX /INDEX /DATA FILES
- /=========/==========/=================================/================
- cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
- cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
- cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
- cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry
- If the key is so long that it exceeds NAME_MAX with the decorations added on to
- it, then it will be cut into pieces, the first few of which will be used to
- make a nest of directories, and the last one of which will be the object
- inside the last directory. The names of the intermediate directories will have
- '+' prepended::
- J1223/@23/+xy...z/+kl...m/Epqr
- Note that keys are raw data, and not only may they exceed NAME_MAX in size,
- they may also contain things like '/' and NUL characters, and so they may not
- be suitable for turning directly into a filename.
- To handle this, CacheFiles will use a suitably printable filename directly and
- "base-64" encode ones that aren't directly suitable. The two versions of
- object filenames indicate the encoding:
- =============== =============== ===============
- OBJECT TYPE PRINTABLE ENCODED
- =============== =============== ===============
- Index "I..." "J..."
- Data "D..." "E..."
- Special "S..." "T..."
- =============== =============== ===============
- Intermediate directories are always "@" or "+" as appropriate.
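The naming scheme above can be summarised by a small lookup on the first character of an entry name. This Python sketch is only an illustration; the ``classify`` function and the descriptive strings it returns are made up for the example.

```python
# Prefix character -> (object type, encoding), per the table above.
PREFIXES = {
    "I": ("index", "printable"),
    "J": ("index", "encoded"),
    "D": ("data", "printable"),
    "E": ("data", "encoded"),
    "S": ("special", "printable"),
    "T": ("special", "encoded"),
    "@": ("hash directory", None),
    "+": ("key continuation directory", None),
}


def classify(name):
    """Return (kind, encoding) for a cache entry name, or None if the
    first character is not one of the documented prefixes."""
    return PREFIXES.get(name[:1])
```

A name such as ``I03nfs`` would classify as a printable index and ``@4a`` as a hash directory.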
- Each object in the cache has an extended attribute label that holds the object
- type ID (required to distinguish special objects) and the auxiliary data from
- the netfs. The latter is used to detect stale objects in the cache and update
- or retire them.
- Note that CacheFiles will erase from the cache any file it doesn't recognise or
- any file of an incorrect type (such as a FIFO file or a device file).
- Security Model and SELinux
- ==========================
- CacheFiles is implemented to deal properly with the LSM security features of
- the Linux kernel and the SELinux facility.
- One of the problems that CacheFiles faces is that it is generally acting on
- behalf of a process, and running in that process's context, and that includes a
- security context that is not appropriate for accessing the cache - either
- because the files in the cache are inaccessible to that process, or because if
- the process creates a file in the cache, that file may be inaccessible to other
- processes.
- The way CacheFiles works is to temporarily change the security context (fsuid,
- fsgid and actor security label) that the process acts as - without changing the
- security context of the process when it is the target of an operation performed by
- some other process (so signalling and suchlike still work correctly).
- When the CacheFiles module is asked to bind to its cache, it:
- (1) Finds the security label attached to the root cache directory and uses
- that as the security label with which it will create files. By default,
- this is::
- cachefiles_var_t
- (2) Finds the security label of the process which issued the bind request
- (presumed to be the cachefilesd daemon), which by default will be::
- cachefilesd_t
- and asks LSM to supply a security ID as which it should act given the
- daemon's label. By default, this will be::
- cachefiles_kernel_t
- SELinux transitions the daemon's security ID to the module's security ID
- based on a rule of this form in the policy::
- type_transition <daemon's-ID> kernel_t : process <module's-ID>;
- For instance::
- type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
- The module's security ID gives it permission to create, move and remove files
- and directories in the cache, to find and access directories and files in the
- cache, to set and access extended attributes on cache objects, and to read and
- write files in the cache.
- The daemon's security ID gives it only a very restricted set of permissions: it
- may scan directories, stat files and erase files and directories. It may
- not read or write files in the cache, and so it is precluded from accessing the
- data cached therein; nor is it permitted to create new files in the cache.
- There are policy source files available in:
- https://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2
- and later versions. In that tarball, see the files::
- cachefilesd.te
- cachefilesd.fc
- cachefilesd.if
- They are built and installed directly by the RPM.
- If a non-RPM based system is being used, then copy the above files to their own
- directory and run::
- make -f /usr/share/selinux/devel/Makefile
- semodule -i cachefilesd.pp
- You will need checkpolicy and selinux-policy-devel installed prior to the
- build.
- By default, the cache is located in /var/fscache, but if it is desirable that
- it should be elsewhere, then either the above policy files must be altered, or
- an auxiliary policy must be installed to label the alternate location of the
- cache.
- For instructions on how to add an auxiliary policy to enable the cache to be
- located elsewhere when SELinux is in enforcing mode, please see::
- /usr/share/doc/cachefilesd-*/move-cache.txt
- This file is available when the cachefilesd rpm is installed; alternatively,
- the document can be found in the sources.
- A Note on Security
- ==================
- CacheFiles makes use of the split security in the task_struct. It allocates
- its own task_security structure, and redirects current->cred to point to it
- when it acts on behalf of another process, in that process's context.
- The reason it does this is that it calls vfs_mkdir() and suchlike rather than
- bypassing security and calling inode ops directly. Therefore the VFS and LSM
- may deny the CacheFiles access to the cache data because under some
- circumstances the caching code is running in the security context of whatever
- process issued the original syscall on the netfs.
- Furthermore, should CacheFiles create a file or directory, the security
- parameters with which that object is created (UID, GID, security label) would
- be derived from the process that issued the system call, thus potentially
- preventing other processes from accessing the cache - including CacheFiles's
- cache management daemon (cachefilesd).
- What is required is to temporarily override the security of the process that
- issued the system call. We can't, however, just do an in-place change of the
- security data as that affects the process as an object, not just as a subject.
- This means it may lose signals or ptrace events for example, and affects what
- the process looks like in /proc.
- So CacheFiles makes use of a logical split in the security between the
- objective security (task->real_cred) and the subjective security (task->cred).
- The objective security holds the intrinsic security properties of a process and
- is never overridden. This is what appears in /proc, and is what is used when a
- process is the target of an operation by some other process (SIGKILL for
- example).
- The subjective security holds the active security properties of a process, and
- may be overridden. This is not seen externally, and is used when a process
- acts upon another object, for example SIGKILLing another process or opening a
- file.
- LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request
- for CacheFiles to run in a context of a specific security label, or to create
- files and directories with another security label.
- Statistical Information
- =======================
- If FS-Cache is compiled with the following option enabled::
- CONFIG_CACHEFILES_HISTOGRAM=y
- then it will gather certain statistics and display them through a proc file.
- /proc/fs/cachefiles/histogram
- ::
- cat /proc/fs/cachefiles/histogram
- JIFS SECS LOOKUPS MKDIRS CREATES
- ===== ===== ========= ========= =========
- This shows, for each duration between 0 jiffies and HZ-1 jiffies, the number
- of times that a variety of tasks took that long to run. The
- columns are as follows:
- ======= =======================================================
- COLUMN TIME MEASUREMENT
- ======= =======================================================
- LOOKUPS Length of time to perform a lookup on the backing fs
- MKDIRS Length of time to perform a mkdir on the backing fs
- CREATES Length of time to perform a create on the backing fs
- ======= =======================================================
- Each row shows the number of events that took a particular range of times.
- Each step is 1 jiffy in size. The JIFS column indicates the particular
- jiffy range covered, and the SECS field the equivalent number of seconds.
- Debugging
- =========
- If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime
- debugging enabled by adjusting the value in::
- /sys/module/cachefiles/parameters/debug
- This is a bitmask of debugging streams to enable:
- ======= ======= =============================== =======================
- BIT     VALUE   STREAM                          POINT
- ======= ======= =============================== =======================
- 0       1       General                         Function entry trace
- 1       2                                       Function exit trace
- 2       4                                       General
- ======= ======= =============================== =======================
- The appropriate set of values should be OR'd together and the result written to
- the control file. For example::
- echo $((1|2)) >/sys/module/cachefiles/parameters/debug
- will turn on function entry and exit tracing.
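For reference, combining the bit values from the table can be sketched in Python as below. The stream names ``enter``, ``leave`` and ``debug`` are labels invented for this example; only the numeric values 1, 2 and 4 come from the table.

```python
# Bit values from the table above; the key names are made up here.
DEBUG_BITS = {"enter": 1, "leave": 2, "debug": 4}


def debug_mask(*streams):
    """OR together the bit values of the named debugging streams."""
    mask = 0
    for name in streams:
        mask |= DEBUG_BITS[name]
    return mask
```

The resulting integer is the value that would be written to the control file, which normally requires root privileges.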
- On-demand Read
- ==============
- When working in its original mode, CacheFiles serves as a local cache for a
- remote networking fs - while in on-demand read mode, CacheFiles can speed up
- scenarios where on-demand read semantics are needed, e.g. container image
- distribution.
- The essential difference between these two modes is seen when a cache miss
- occurs: In the original mode, the netfs will fetch the data from the remote
- server and then write it to the cache file; in on-demand read mode, fetching
- the data and writing it into the cache is delegated to a user daemon.
- ``CONFIG_CACHEFILES_ONDEMAND`` should be enabled to support on-demand read mode.
- Protocol Communication
- ----------------------
- The on-demand read mode uses a simple protocol for communication between kernel
- and user daemon. The protocol can be modeled as::
- kernel --[request]--> user daemon --[reply]--> kernel
- CacheFiles will send requests to the user daemon when needed. The user daemon
- should poll the devnode ('/dev/cachefiles') to check if there's a pending
- request to be processed. A POLLIN event will be returned when there's a pending
- request.
- The user daemon then reads the devnode to fetch a request to process. It should
- be noted that each read only gets one request. When it has finished processing
- the request, the user daemon should write the reply to the devnode.
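The poll-then-read step might look like the following Python sketch. It works on any readable file descriptor; the function name, the timeout and the ``bufsize`` upper bound on a single request are assumptions for the example rather than values taken from the kernel interface.

```python
import os
import select


def read_request(fd, timeout_ms=1000, bufsize=4096):
    """Wait for POLLIN on fd and return one request as bytes.

    Returns None if nothing became readable within timeout_ms.
    bufsize is an assumed upper bound on the size of one request.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    for _fd, events in poller.poll(timeout_ms):
        if events & select.POLLIN:
            # On the devnode, each read returns a single request.
            return os.read(fd, bufsize)
    return None
```

A daemon would call this in a loop on the descriptor it opened for ``/dev/cachefiles`` and dispatch on the opcode of each returned request.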
- Each request starts with a message header of the form::
- struct cachefiles_msg {
- __u32 msg_id;
- __u32 opcode;
- __u32 len;
- __u32 object_id;
- __u8 data[];
- };
- where:
- * ``msg_id`` is a unique ID identifying this request among all pending
- requests.
- * ``opcode`` indicates the type of this request.
- * ``object_id`` is a unique ID identifying the cache file operated on.
- * ``data`` indicates the payload of this request.
- * ``len`` indicates the whole length of this request, including the
- header and following type-specific payload.
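A user daemon could decode this header along the following lines. The Python sketch assumes native byte order and no padding between the four ``__u32`` fields; ``parse_msg`` is a hypothetical name.

```python
import struct

# Four native-endian __u32 fields: msg_id, opcode, len, object_id.
MSG_HEADER = struct.Struct("=IIII")


def parse_msg(buf):
    """Split one request into its header fields and payload bytes."""
    msg_id, opcode, length, object_id = MSG_HEADER.unpack_from(buf)
    if length < MSG_HEADER.size or length > len(buf):
        raise ValueError("len field inconsistent with buffer")
    # len covers the header plus the type-specific payload.
    payload = buf[MSG_HEADER.size:length]
    return msg_id, opcode, object_id, payload
```

Because ``len`` includes the header, the payload is the slice between the end of the header and ``len``.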
- Turning on On-demand Mode
- -------------------------
- An optional parameter becomes available to the "bind" command::
- bind [ondemand]
- When the "bind" command is given no argument, it defaults to the original mode.
- When it is given the "ondemand" argument, i.e. "bind ondemand", on-demand read
- mode will be enabled.
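The sequence of commands a daemon might send to the devnode can be sketched as below. This Python example assumes that each command is issued as a separate write and that only ``dir``, ``tag`` and ``bind`` are needed; the function name and parameters are hypothetical.

```python
def startup_commands(cache_dir, tag=None, ondemand=False):
    """Return the command strings to send, one per write()."""
    cmds = [f"dir {cache_dir}"]
    if tag is not None:
        cmds.append(f"tag {tag}")
    # "bind" alone selects the original mode; "bind ondemand"
    # enables on-demand read mode.
    cmds.append("bind ondemand" if ondemand else "bind")
    return cmds
```

Each string would then be encoded and written to the descriptor opened on ``/dev/cachefiles``.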
- The OPEN Request
- ----------------
- When the netfs opens a cache file for the first time, a request with the
- CACHEFILES_OP_OPEN opcode, a.k.a. an OPEN request, will be sent to the user
- daemon. The payload format is of the form::
- struct cachefiles_open {
- __u32 volume_key_size;
- __u32 cookie_key_size;
- __u32 fd;
- __u32 flags;
- __u8 data[];
- };
- where:
- * ``data`` contains the volume_key followed directly by the cookie_key.
- The volume key is a NUL-terminated string; the cookie key is binary
- data.
- * ``volume_key_size`` indicates the size of the volume key in bytes.
- * ``cookie_key_size`` indicates the size of the cookie key in bytes.
- * ``fd`` indicates an anonymous fd referring to the cache file, through
- which the user daemon can perform write/llseek file operations on the
- cache file.
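Decoding the OPEN payload could look like this Python sketch. It assumes native byte order and simply slices ``data`` by the two declared sizes; whether ``volume_key_size`` counts the terminating NUL is not stated above, so the sketch makes no assumption about it. ``parse_open`` is a hypothetical name.

```python
import struct

# Four native-endian __u32 fields precede the key data.
OPEN_HEADER = struct.Struct("=IIII")


def parse_open(payload):
    """Return (volume_key, cookie_key, fd, flags) from an OPEN payload."""
    vsize, csize, fd, flags = OPEN_HEADER.unpack_from(payload)
    start = OPEN_HEADER.size
    if len(payload) < start + vsize + csize:
        raise ValueError("payload shorter than the declared key sizes")
    volume_key = payload[start:start + vsize]
    cookie_key = payload[start + vsize:start + vsize + csize]
    return volume_key, cookie_key, fd, flags
```

The returned ``fd`` is the anonymous fd that the daemon would keep for later writes to the cache file.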
- The user daemon can use the given (volume_key, cookie_key) pair to distinguish
- the requested cache file. With the given anonymous fd, the user daemon can
- fetch the data and write it to the cache file in the background, even when
- the kernel has not triggered a cache miss yet.
- Note that each cache file has a unique object_id, while it may have multiple
- anonymous fds. The user daemon may duplicate anonymous fds from the initial
- anonymous fd indicated by the @fd field through dup(). Thus each object_id can
- be mapped to multiple anonymous fds, while the user daemon itself needs to
- maintain the mapping.
- When implementing a user daemon, please be careful of RLIMIT_NOFILE,
- ``/proc/sys/fs/nr_open`` and ``/proc/sys/fs/file-max``. Typically these needn't
- be huge since they're related to the number of open device blobs rather than
- open files of each individual filesystem.
- The user daemon should reply to the OPEN request by issuing a "copen" (complete
- open) command on the devnode::
- copen <msg_id>,<cache_size>
- where:
- * ``msg_id`` must match the msg_id field of the OPEN request.
- * When >= 0, ``cache_size`` indicates the size of the cache file;
- when < 0, ``cache_size`` indicates any error code encountered by the
- user daemon.
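Building the reply string is straightforward; a minimal Python sketch follows. The helper name is invented, and the choice of ASCII encoding for the command is an assumption.

```python
def copen_command(msg_id, cache_size):
    """Build the reply to an OPEN request.

    cache_size >= 0 reports the size of the cache file; a negative
    value reports an error code from the user daemon.
    """
    return f"copen {msg_id},{cache_size}".encode("ascii")
```

The resulting bytes would be written to the devnode as a single command.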
- The CLOSE Request
- -----------------
- When a cookie is withdrawn, a CLOSE request (opcode CACHEFILES_OP_CLOSE) will be
- sent to the user daemon. This tells the user daemon to close all anonymous fds
- associated with the given object_id. The CLOSE request has no extra payload,
- and shouldn't be replied to.
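Since the user daemon has to keep track of which anonymous fds belong to which object_id, one possible bookkeeping structure is sketched below in Python. The class and method names are hypothetical and the design is only one of several reasonable choices.

```python
import os


class FdRegistry:
    """Track anonymous fds per object_id so CLOSE can release them all."""

    def __init__(self):
        self._fds = {}

    def add(self, object_id, fd):
        """Record fd (the original or a dup) under object_id."""
        self._fds.setdefault(object_id, set()).add(fd)

    def any_fd(self, object_id):
        """Return one fd of object_id; raises KeyError if unknown."""
        return next(iter(self._fds[object_id]))

    def close_object(self, object_id):
        """Handle CLOSE: close every fd of object_id; return the count."""
        fds = self._fds.pop(object_id, set())
        for fd in fds:
            os.close(fd)
        return len(fds)
```

On an OPEN request the daemon would call ``add()`` with the received fd, and on a CLOSE request it would call ``close_object()`` without sending any reply.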
- The READ Request
- ----------------
- When a cache miss is encountered in on-demand read mode, CacheFiles will send a
- READ request (opcode CACHEFILES_OP_READ) to the user daemon. This tells the user
- daemon to fetch the contents of the requested file range. The payload is of the
- form::
- struct cachefiles_read {
- __u64 off;
- __u64 len;
- };
- where:
- * ``off`` indicates the starting offset of the requested file range.
- * ``len`` indicates the length of the requested file range.
- When it receives a READ request, the user daemon should fetch the requested data
- and write it to the cache file identified by object_id.
- When it has finished processing the READ request, the user daemon should reply
- by using the CACHEFILES_IOC_READ_COMPLETE ioctl on one of the anonymous fds
- associated with the object_id given in the READ request. The ioctl is of the
- form::
- ioctl(fd, CACHEFILES_IOC_READ_COMPLETE, msg_id);
- where:
- * ``fd`` is one of the anonymous fds associated with the object_id
- given.
- * ``msg_id`` must match the msg_id field of the READ request.
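Putting the READ steps together, a handler might be sketched as follows in Python. The ``fetch`` and ``complete`` callables are supplied by the caller and are hypothetical: ``fetch`` obtains the data from wherever the daemon stores it, and ``complete`` is expected to issue the CACHEFILES_IOC_READ_COMPLETE ioctl, whose request number should be taken from the kernel's UAPI header rather than from this example.

```python
import os
import struct

# Two native-endian __u64 fields: off, len.
READ_PAYLOAD = struct.Struct("=QQ")


def handle_read(fd, msg_id, payload, fetch, complete):
    """Serve one READ request and return the (off, len) it asked for.

    fetch(off, length) -> bytes obtains the requested data;
    complete(fd, msg_id) signals completion to the kernel.
    """
    off, length = READ_PAYLOAD.unpack_from(payload)
    data = fetch(off, length)
    os.lseek(fd, off, os.SEEK_SET)
    written = 0
    while written < len(data):  # write() may be partial
        written += os.write(fd, data[written:])
    complete(fd, msg_id)
    return off, length
```

The sketch uses seek followed by write because those are the file operations the text above names for the anonymous fd.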