- .. SPDX-License-Identifier: GPL-2.0
- ======================================
- EROFS - Enhanced Read-Only File System
- ======================================
- Overview
- ========
- EROFS filesystem stands for Enhanced Read-Only File System. It aims to form a
- generic read-only filesystem solution for various read-only use cases instead
- of just focusing on storage space saving without considering any side effects
- of runtime performance.
- It is designed for flexibility, feature extensibility and user-payload
- friendliness. Beyond that, it is kept as a simple, random-access friendly,
- high-performance filesystem that avoids unneeded I/O amplification and
- memory-resident overhead compared to similar approaches.
- It is implemented to be a better choice for the following scenarios:
- - read-only storage media or
- - part of a fully trusted read-only solution, which means it needs to be
- immutable and bit-for-bit identical to the official golden image for
- their releases due to security or other considerations and
- - workloads that need to minimize extra storage space with guaranteed
- end-to-end performance by using compact layout, transparent file compression
- and direct access, especially embedded devices with limited memory and
- high-density hosts with numerous containers.
- Here are the main features of EROFS:
- - Little endian on-disk design;
- - Block-based distribution and file-based distribution over fscache are
- supported;
- - Support multiple devices to refer to external blobs, which can be used
- for container images;
- - 32-bit block addresses for each device, therefore 16TiB address space at
- most with 4KiB block size for now;
- - Two inode layouts for different requirements:
- =====================  ============  ======================================
-                        compact (v1)  extended (v2)
- =====================  ============  ======================================
- Inode metadata size    32 bytes      64 bytes
- Max file size          4 GiB         16 EiB (also limited by max. vol size)
- Max uids/gids          65536         4294967296
- Per-inode timestamp    no            yes (64 + 32-bit timestamp)
- Max hardlinks          65536         4294967296
- Metadata reserved      8 bytes       18 bytes
- =====================  ============  ======================================
- - Support extended attributes as an option;
- - Support a bloom filter that speeds up negative extended attribute lookups;
- - Support POSIX.1e ACLs by using extended attributes;
- - Support transparent data compression as an option:
- LZ4, MicroLZMA and DEFLATE algorithms can be used on a per-file basis. In
- addition, in-place decompression is supported to avoid bounce buffers for
- compressed data and unnecessary page cache thrashing;
- - Support chunk-based data deduplication and rolling-hash compressed data
- deduplication;
- - Support tail-packing inline data as an alternative to byte-addressed
- unaligned metadata or smaller block sizes;
- - Support merging tail-end data into a special inode as fragments;
- - Support large folios to make use of THPs (Transparent Hugepages);
- - Support direct I/O on uncompressed files to avoid double caching for loop
- devices;
- - Support FSDAX on uncompressed images for secure containers and ramdisks in
- order to get rid of unnecessary page cache.
- - Support file-based on-demand loading with the Fscache infrastructure.
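- The size limits quoted in the feature list follow directly from the field
- widths. The short Python sketch below is only a sanity check of that
- arithmetic; the helper name ``max_device_bytes`` is made up for illustration
- and is not part of any EROFS tool.

```python
# Back-of-the-envelope check of the limits quoted in the feature list.
BLOCK_SIZE = 4 * 1024              # 4 KiB block size
GIB, TIB = 1 << 30, 1 << 40


def max_device_bytes(blkaddr_bits=32, block_size=BLOCK_SIZE):
    """Largest addressable size of one device with fixed-width block addresses."""
    return (1 << blkaddr_bits) * block_size


# Values from the inode layout table, expressed as powers of two.
COMPACT_MAX_FILE_SIZE = 1 << 32    # 4 GiB
COMPACT_MAX_IDS = 1 << 16          # 65536 uids/gids (and hardlinks)
EXTENDED_MAX_IDS = 1 << 32         # 4294967296 uids/gids (and hardlinks)

print(max_device_bytes() // TIB, "TiB")   # prints "16 TiB"
```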
- The following git tree provides the file system user-space tools under
- development, such as a formatting tool (mkfs.erofs), an on-disk consistency &
- compatibility checking tool (fsck.erofs), and a debugging tool (dump.erofs):
- - git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git
- For more information, please also refer to the documentation site:
- - https://erofs.docs.kernel.org
- Bug reports and patches are welcome. Please send them to the linux-erofs
- mailing list:
- - linux-erofs mailing list <linux-erofs@lists.ozlabs.org>
- Mount options
- =============
- ===================    =========================================================
- (no)user_xattr         Set up Extended User Attributes. Note: xattr is enabled
-                        by default if CONFIG_EROFS_FS_XATTR is selected.
- (no)acl                Set up POSIX Access Control List. Note: acl is enabled
-                        by default if CONFIG_EROFS_FS_POSIX_ACL is selected.
- cache_strategy=%s      Select a strategy for cached decompression from now on:
-                        ==========  =============================================
-                        disabled    In-place I/O decompression only;
-                        readahead   Cache the last incomplete compressed physical
-                                    cluster for further reading. It still does
-                                    in-place I/O decompression for the remaining
-                                    compressed physical clusters;
-                        readaround  Cache both ends of incomplete compressed
-                                    physical clusters for further reading.
-                                    It still does in-place I/O decompression
-                                    for the remaining compressed physical
-                                    clusters.
-                        ==========  =============================================
- dax={always,never}     Use direct access (no page cache). See
-                        Documentation/filesystems/dax.rst.
- dax                    A legacy option which is an alias for ``dax=always``.
- device=%s              Specify a path to an extra device to be used together.
- fsid=%s                Specify a filesystem image ID for Fscache back-end.
- domain_id=%s           Specify a domain ID in fscache mode so that different
-                        images with the same blobs under a given domain ID can
-                        share storage.
- ===================    =========================================================
- Sysfs Entries
- =============
- Information about mounted EROFS file systems can be found in /sys/fs/erofs.
- Each mounted filesystem has a directory in /sys/fs/erofs named after its
- device (e.g., /sys/fs/erofs/sda).
- (see also Documentation/ABI/testing/sysfs-fs-erofs)
- On-disk details
- ===============
- Summary
- -------
- Different from other read-only file systems, an EROFS volume is designed
- to be as simple as possible::
-                               |-> aligned with the block size
-  ____________________________________________________________
- | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data |
- |_|__|_|_____|__________|_____|______|__________|_____|______|
- 0 +1K
- All data areas should be aligned with the block size, but metadata areas
- may not be. All metadata can now be observed in two different spaces (views):
- 1. Inode metadata space
- Each valid inode should be aligned with an inode slot, which is a fixed
- value (32 bytes) and designed to be kept in line with compact inode size.
- Each inode can be directly found with the following formula:
- inode offset = meta_blkaddr * block_size + 32 * nid
- ::
-                             |-> aligned with 8B
-                                        |-> followed closely
- + meta_blkaddr blocks                                      |-> another slot
-  _____________________________________________________________________
- |  ...   | inode |  xattrs  | extents  | data inline | ... | inode ...
- |________|_______|(optional)|(optional)|__(optional)_|_____|__________
-          |-> aligned with the inode slot size
-               .                   .
-             .                         .
-           .                              .
-         .                                    .
-       .                                         .
-     .                                              .
-   .____________________________________________________|-> aligned with 4B
-   | xattr_ibody_header | shared xattrs | inline xattrs |
-   |____________________|_______________|_______________|
-   |->    12 bytes    <-|->x * 4 bytes<-|               .
-                       .                .                 .
-                 .                      .                   .
-            .                           .                     .
-        ._______________________________.______________________.
-        | id | id | id | id |  ... | id | ent | ... | ent| ... |
-        |____|____|____|____|______|____|_____|_____|____|_____|
-                                        |-> aligned with 4B
-                                                    |-> aligned with 4B
- An inode can be 32 or 64 bytes in size, which can be determined from
- i_format, a field common to all inode versions::
-  __________________               __________________
- |     i_format     |             |     i_format     |
- |__________________|             |__________________|
- |        ...       |             |        ...       |
- |                  |             |                  |
- |__________________| 32 bytes    |                  |
-                                  |                  |
-                                  |__________________| 64 bytes
- Xattrs, extents and inline data are placed after the corresponding inode with
- proper alignment, and each of them is optional depending on the data mapping.
- Currently, a total of 5 data layouts are supported:
- == ====================================================================
- 0 flat file data without data inline (no extent);
- 1 fixed-sized output data compression (with non-compacted indexes);
- 2 flat file data with tail packing data inline (no extent);
- 3 fixed-sized output data compression (with compacted indexes, v5.3+);
- 4 chunk-based file (v5.15+).
- == ====================================================================
- The size of the optional xattrs is indicated by i_xattr_count in inode
- header. Large xattrs or xattrs shared by many different files can be
- stored in shared xattrs metadata rather than inlined right after inode.
- 2. Shared xattrs metadata space
- The shared xattrs space is similar to the inode space above. It starts at
- a specific block indicated by xattr_blkaddr, and its entries are organized
- one by one with proper alignment.
- Each shared xattr can also be directly found by the following formula:
- xattr offset = xattr_blkaddr * block_size + 4 * xattr_id
- ::
-          |-> aligned by 4 bytes
- + xattr_blkaddr blocks                     |-> aligned with 4 bytes
-  _________________________________________________________________________
- |  ...   | xattr_entry | xattr data  | ... | xattr_entry  | xattr data ...
- |________|_____________|_____________|_____|______________|_______________
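- As a rough illustration of the two lookup formulas above, the following
- Python sketch computes both offsets and decodes i_format. The function names
- are hypothetical, and the i_format bit split (bit 0 for the inode version,
- bits 1-3 for the data layout) is an assumption based on erofs_fs.h rather
- than something stated in this document.

```python
INODE_SLOT_SIZE = 32    # fixed inode slot size (bytes), as stated above
XATTR_ID_UNIT = 4       # shared xattr ids count 4-byte units


def inode_offset(meta_blkaddr, block_size, nid):
    """Byte offset of inode `nid` inside the image."""
    return meta_blkaddr * block_size + INODE_SLOT_SIZE * nid


def shared_xattr_offset(xattr_blkaddr, block_size, xattr_id):
    """Byte offset of shared xattr `xattr_id` inside the image."""
    return xattr_blkaddr * block_size + XATTR_ID_UNIT * xattr_id


def decode_i_format(i_format):
    """Return (inode size in bytes, data layout number).

    Assumption: bit 0 selects compact (0) or extended (1) inodes and
    bits 1-3 hold the data layout number (0-4 in the table above).
    """
    inode_size = 64 if i_format & 1 else 32
    data_layout = (i_format >> 1) & 0x7
    return inode_size, data_layout
```

- For example, with meta_blkaddr = 1 and a 4096-byte block size, nid 3 is
- located at byte offset 4096 + 32 * 3 = 4192.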
- Directories
- -----------
- All directories are now organized in a compact on-disk format. Note that
- each directory block is divided into index and name areas in order to support
- random file lookup, and all directory entries are _strictly_ recorded in
- alphabetical order so that an improved prefix binary search algorithm can be
- used (refer to the related source code for details).
- ::
-                  ___________________________
-                 /                           |
-                /              ______________|________________
-               /              /              | nameoff1       | nameoffN-1
-  ____________.______________._______________v________________v__________
- | dirent | dirent | ... | dirent | filename | filename | ... | filename |
- |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
-      \                           ^
-       \                          |                            * could have
-        \                         |                              trailing '\0'
-         \________________________| nameoff0
-                              Directory block
- Note that apart from the offset of the first filename, nameoff0 also indicates
- the total number of directory entries in this block, so there is no need to
- introduce another on-disk field for it.
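- The directory block layout can be sketched in Python as follows. This is an
- illustration only: it assumes a 12-byte little-endian dirent (64-bit nid,
- 16-bit nameoff, 8-bit file type, 8-bit reserved), the helper names are
- hypothetical, and the lookup is a plain binary search rather than the
- kernel's improved prefix binary search.

```python
import bisect
import struct

# Assumed 12-byte little-endian dirent: nid (u64), nameoff (u16),
# file_type (u8), reserved (u8).
DIRENT = struct.Struct("<QHBB")


def build_dir_block(entries, block_size=4096):
    """Pack (name, nid, file_type) tuples into one directory block."""
    entries = sorted(entries)                  # byte-wise name order
    nameoff0 = len(entries) * DIRENT.size      # names start after the index
    index, names = b"", b""
    for name, nid, file_type in entries:
        index += DIRENT.pack(nid, nameoff0 + len(names), file_type, 0)
        names += name
    return (index + names).ljust(block_size, b"\0")


def parse_dir_block(block):
    """Return [(name, nid, file_type)] in on-disk order."""
    nameoff0 = DIRENT.unpack_from(block, 0)[1]
    count = nameoff0 // DIRENT.size            # nameoff0 doubles as the count
    dirents = [DIRENT.unpack_from(block, i * DIRENT.size) for i in range(count)]
    out = []
    for i, (nid, nameoff, file_type, _) in enumerate(dirents):
        # A name ends where the next one starts; the last one ends at the
        # block end and may carry trailing '\0' padding.
        end = dirents[i + 1][1] if i + 1 < count else len(block)
        name = block[nameoff:end].split(b"\0", 1)[0]
        out.append((name, nid, file_type))
    return out


def lookup(block, name):
    """Binary-search the sorted names; return the nid or None."""
    entries = parse_dir_block(block)
    names = [e[0] for e in entries]
    i = bisect.bisect_left(names, name)
    return entries[i][1] if i < len(names) and names[i] == name else None
```

- With three entries, nameoff0 is 3 * 12 = 36, which is both the offset of the
- first filename and the entry count multiplied by the dirent size.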
- Chunk-based files
- -----------------
- In order to support chunk-based data deduplication, a new inode data layout has
- been supported since Linux v5.15: files are split into equal-sized data chunks,
- with the ``extents`` area of the inode metadata indicating how to get the chunk
- data. This area can be either a simple 4-byte block address array or an array
- in the 8-byte chunk index form (see struct erofs_inode_chunk_index in
- erofs_fs.h for more details).
- Note that chunk-based files are all uncompressed for now.
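- A minimal Python sketch of the 4-byte block address array form is shown
- below. It assumes that each array entry holds the start block address of one
- chunk and that entries are little-endian; holes and the 8-byte chunk index
- form are not handled, and the function names are hypothetical.

```python
import struct


def parse_blkaddr_array(extents, nchunks):
    """Decode `nchunks` little-endian 32-bit block addresses."""
    return list(struct.unpack_from("<%dI" % nchunks, extents))


def map_offset(chunk_blkaddrs, chunk_size, block_size, file_offset):
    """Map a file offset to (block address, offset inside that block)."""
    chunk_idx, in_chunk = divmod(file_offset, chunk_size)
    blkaddr = chunk_blkaddrs[chunk_idx] + in_chunk // block_size
    return blkaddr, in_chunk % block_size
```

- Two chunks with identical content can simply point to the same start block
- address, which is how deduplication shows up in this form.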
- Long extended attribute name prefixes
- -------------------------------------
- There are use cases where extended attributes with different values can have
- only a few common prefixes (such as overlayfs xattrs). The predefined prefixes
- work inefficiently in both image size and runtime performance in such cases.
- The long xattr name prefixes feature is introduced to address this issue. The
- overall idea is that, apart from the existing predefined prefixes, the xattr
- entry could also refer to user-specified long xattr name prefixes, e.g.
- "trusted.overlay.".
- When referring to a long xattr name prefix, the highest bit (bit 7) of
- erofs_xattr_entry.e_name_index is set, while the lower bits (bits 0-6) as a
- whole represent the index of the referenced long name prefix among all long
- name prefixes. Therefore, only the trailing part of the name after the long
- xattr name prefix is stored in erofs_xattr_entry.e_name, which may be empty if
- the full xattr name exactly matches its long xattr name prefix.
- All long xattr prefixes are stored one by one in the packed inode as long as
- the packed inode is valid, or in the meta inode otherwise. The
- xattr_prefix_count (of the on-disk superblock) indicates the total number of
- long xattr name prefixes, while (xattr_prefix_start * 4) indicates the start
- offset of long name prefixes in the packed/meta inode. Note that, long extended
- attribute name prefixes are disabled if xattr_prefix_count is 0.
- Each long name prefix is stored in the format: ALIGN({__le16 len, data}, 4),
- where len represents the total size of the data part. The data part is actually
- represented by 'struct erofs_xattr_long_prefix', where base_index represents the
- index of the predefined xattr name prefix, e.g. EROFS_XATTR_INDEX_TRUSTED for
- "trusted.overlay." long name prefix, while the infix string keeps the string
- after stripping the short prefix, e.g. "overlay." for the example above.
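- The encoding described above can be sketched in Python as follows. The
- helper names are hypothetical, the data part is assumed to be a one-byte
- base_index followed by the infix, and the predefined index values (1 for
- "user.", 4 for "trusted.", 6 for "security.") are assumptions taken from
- erofs_fs.h rather than from this document.

```python
import struct

LONG_PREFIX_FLAG = 0x80      # bit 7 of e_name_index
# Assumed subset of the predefined (short) prefix indexes.
PREDEFINED = {1: "user.", 4: "trusted.", 6: "security."}


def pack_long_prefix(base_index, infix):
    """Encode one prefix as ALIGN({__le16 len, data}, 4)."""
    data = bytes([base_index]) + infix
    raw = struct.pack("<H", len(data)) + data
    return raw.ljust((len(raw) + 3) & ~3, b"\0")


def parse_long_prefixes(buf, count):
    """Decode `count` prefixes into [(base_index, infix)]."""
    prefixes, pos = [], 0
    for _ in range(count):
        (length,) = struct.unpack_from("<H", buf, pos)
        data = buf[pos + 2:pos + 2 + length]
        prefixes.append((data[0], data[1:].decode()))
        pos = (pos + 2 + length + 3) & ~3      # next entry is 4-byte aligned
    return prefixes


def full_name(e_name_index, e_name, prefixes):
    """Rebuild the full xattr name from an entry's index and stored name."""
    if e_name_index & LONG_PREFIX_FLAG:
        base_index, infix = prefixes[e_name_index & 0x7F]
        return PREDEFINED[base_index] + infix + e_name
    return PREDEFINED[e_name_index] + e_name
```

- With "trusted.overlay." stored as long prefix 0, an entry with e_name_index
- 0x80 and e_name "opaque" resolves to "trusted.overlay.opaque".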
- Data compression
- ----------------
- EROFS implements fixed-sized output compression, which generates fixed-sized
- compressed data blocks from variable-sized input, in contrast to other existing
- fixed-sized input solutions. Fixed-sized output compression can achieve
- relatively higher compression ratios because today's popular data compression
- algorithms are mostly LZ77-based, and this approach benefits from the
- historical dictionary (a.k.a. the sliding window).
- In detail, the original (uncompressed) data is split into several
- variable-sized extents, each of which is compressed into a physical cluster
- (pcluster). In order to record each variable-sized extent, logical clusters
- (lclusters) are introduced as the basic unit of the compression indexes to
- indicate whether a new extent starts within the range (HEAD) or not (NONHEAD).
- Lclusters are currently fixed to the block size, as illustrated below::
-          |<-    variable-sized extent    ->|<-        VLE        ->|
-        clusterofs                        clusterofs              clusterofs
-          |                                 |                       |
- _________v_________________________________v_______________________v________
- ... |    .         |              |        .     |              |  .   ...
- ____|____._________|______________|________.___ _|______________|__.________
-     |-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-|
-          (HEAD)       (NONHEAD)        (HEAD)       (NONHEAD)     .
-          .             CBLKCNT             .                   .
-           .                               .                  .
-           .                               .               .
-     _______._____________________________.______________._________________
-        ... |              |              |              | ...
-     _______|______________|______________|______________|_________________
-            |->      big pcluster       <-|-> pcluster <-|
- A physical cluster can be seen as a container of physical compressed blocks
- which hold the compressed data. Previously, only lcluster-sized (4KB) pclusters
- were supported. Since the big pcluster feature was introduced (available since
- Linux v5.13), a pcluster can be a multiple of the lcluster size.
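- The idea of fixed-sized output compression can be illustrated with the
- following Python sketch. It is not how mkfs.erofs works: it uses zlib from
- the Python standard library as a stand-in compressor, picks each extent
- length with a simple binary search, and all function names are hypothetical.
- It only shows that variable-sized input extents end up in fixed-sized
- compressed clusters.

```python
import random
import zlib


def compress_fixed_output(data, out_size=4096):
    """Split data into variable-sized extents, each fitting in out_size bytes.

    Returns a list of (extent_length, padded_compressed_cluster).
    """
    pclusters, pos = [], 0
    while pos < len(data):
        lo, hi = 1, len(data) - pos        # an extent of length `lo` always fits
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if len(zlib.compress(data[pos:pos + mid])) <= out_size:
                lo = mid                   # mid fits, try a longer extent
            else:
                hi = mid - 1
        comp = zlib.compress(data[pos:pos + lo])
        pclusters.append((lo, comp.ljust(out_size, b"\0")))
        pos += lo
    return pclusters


def decompress_all(pclusters):
    """Decompress every cluster, ignoring the zero padding after the stream."""
    return b"".join(zlib.decompressobj().decompress(c) for _, c in pclusters)


def demo_data(n, seed=0):
    """Deterministic, moderately compressible sample input."""
    rng = random.Random(seed)
    return bytes(rng.choice(b"abcdefghijklmnop") for _ in range(n))
```

- Every cluster produced this way has exactly the same size, while the extent
- lengths differ depending on how well each part of the input compresses.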
- For each HEAD lcluster, clusterofs is recorded to indicate where a new extent
- starts and blkaddr is used to seek the compressed data. For each NONHEAD
- lcluster, delta0 and delta1 are available instead of blkaddr to indicate the
- distance to its HEAD lcluster and the next HEAD lcluster. A PLAIN lcluster is
- also a HEAD lcluster except that its data is uncompressed. See the comments
- around "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.
- If big pcluster is enabled, the pcluster size in lclusters needs to be recorded
- as well. To do so, the delta0 field of the first NONHEAD lcluster stores the
- compressed block count together with a special flag; such an lcluster is
- called a CBLKCNT NONHEAD lcluster. Its real delta0 is known to always be 1,
- as illustrated below::
-  __________________________________________________________
- | HEAD |  NONHEAD  | NONHEAD | ... | NONHEAD | HEAD | HEAD |
- |__:___|_(CBLKCNT)_|_________|_____|_________|__:___|____:_|
-    |<----- a big pcluster (with CBLKCNT) ------>|<--  -->|
-          a lcluster-sized pcluster (without CBLKCNT) ^
- If another HEAD follows a HEAD lcluster, there is no room to record CBLKCNT,
- but the size of such a pcluster is known to be 1 lcluster anyway.
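- The rules above for delta0 and CBLKCNT can be sketched with a small
- in-memory model in Python. This is not the on-disk index encoding: the
- ``LCluster`` class and both helper functions are hypothetical, and the
- CBLKCNT flag is modelled simply as a non-zero ``cblkcnt`` field.

```python
from dataclasses import dataclass


@dataclass
class LCluster:
    kind: str            # "HEAD", "PLAIN" or "NONHEAD"
    clusterofs: int = 0  # HEAD/PLAIN: where the new extent starts
    blkaddr: int = 0     # HEAD/PLAIN: where the compressed data lives
    delta0: int = 0      # NONHEAD: distance back to its HEAD lcluster
    delta1: int = 0      # NONHEAD: distance to the next HEAD lcluster
    cblkcnt: int = 0     # non-zero marks a CBLKCNT NONHEAD lcluster


def find_head(index, i):
    """Return the position of the HEAD lcluster that lcluster i belongs to."""
    lc = index[i]
    if lc.kind != "NONHEAD":
        return i
    if lc.cblkcnt:       # delta0 holds the block count; the real distance is 1
        return i - 1
    return i - lc.delta0


def pcluster_blocks(index, head):
    """Return the pcluster size (in blocks) for the HEAD at position `head`."""
    nxt = index[head + 1] if head + 1 < len(index) else None
    if nxt is not None and nxt.kind == "NONHEAD" and nxt.cblkcnt:
        return nxt.cblkcnt
    return 1             # HEAD followed by HEAD (or nothing): one lcluster
```

- In this model, a HEAD that is immediately followed by another HEAD always
- yields a pcluster size of 1, matching the case without CBLKCNT.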
- Since Linux v6.1, each pcluster can be used for multiple variable-sized extents,
- therefore it can be used for compressed data deduplication.