.. SPDX-License-Identifier: GPL-2.0-only

================
Design of dm-vdo
================

The dm-vdo (virtual data optimizer) target provides inline deduplication,
compression, zero-block elimination, and thin provisioning. A dm-vdo target
can be backed by up to 256TB of storage, and can present a logical size of
up to 4PB. This target was originally developed at Permabit Technology
Corp. starting in 2009. It was first released in 2013 and has been used in
production environments ever since. It was made open-source in 2017 after
Permabit was acquired by Red Hat. This document describes the design of
dm-vdo. For usage, see vdo.rst in the same directory as this file.

Because deduplication rates fall drastically as the block size increases, a
vdo target has a maximum block size of 4K. However, it can achieve
deduplication rates of 254:1, i.e. up to 254 copies of a given 4K block can
reference a single 4K block of actual storage. It can achieve compression
rates of 14:1. All zero blocks consume no storage at all.

Theory of Operation
===================

The design of dm-vdo is based on the idea that deduplication is a two-part
problem. The first is to recognize duplicate data. The second is to avoid
storing multiple copies of those duplicates. Therefore, dm-vdo has two main
parts: a deduplication index (called UDS) that is used to discover
duplicate data, and a data store with a reference counted block map that
maps from logical block addresses to the actual storage location of the
data.

Zones and Threading
-------------------

Due to the complexity of data optimization, the number of metadata
structures involved in a single write operation to a vdo target is larger
than in most other targets. Furthermore, because vdo must operate on small
block sizes in order to achieve good deduplication rates, acceptable
performance can only be achieved through parallelism. Therefore, vdo's
design attempts to be lock-free.

Most of a vdo's main data structures are designed to be easily divided into
"zones" such that any given bio must only access a single zone of any zoned
structure. Safety with minimal locking is achieved by ensuring that during
normal operation, each zone is assigned to a specific thread, and only that
thread will access the portion of the data structure in that zone.
Associated with each thread is a work queue. Each bio is associated with a
request object (the "data_vio") which will be added to a work queue when
the next phase of its operation requires access to the structures in the
zone associated with that queue.

Another way of thinking about this arrangement is that the work queue for
each zone has an implicit lock on the structures it manages for all its
operations, because vdo guarantees that no other thread will alter those
structures.

Although each structure is divided into zones, this division is not
reflected in the on-disk representation of each data structure. Therefore,
the number of zones for each structure, and hence the number of threads,
can be reconfigured each time a vdo target is started.

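As a rough illustration of this pattern, the sketch below routes a request
to the work queue of the zone that owns a given key. The names here
(struct work_queue, enqueue_work, launch_in_zone) are illustrative, not the
actual dm-vdo interfaces; the point is that zone ownership replaces locking
on the zone's portion of the structure::

  #include <stdint.h>

  struct data_vio;    /* request object; opaque in this sketch */
  struct work_queue;  /* one queue per zone, served by one dedicated thread */

  /* Assumed helper: append work to a queue (stub for this sketch). */
  void enqueue_work(struct work_queue *queue, struct data_vio *vio);

  struct zoned_structure {
          unsigned zone_count;
          struct work_queue **zone_queues;
  };

  /*
   * Hand the next phase of this request to the thread that owns the
   * key's zone. Only that thread touches the zone's portion of the
   * structure, so no lock on the structure itself is required.
   */
  static void launch_in_zone(struct zoned_structure *s, uint64_t key,
                             struct data_vio *vio)
  {
          enqueue_work(s->zone_queues[key % s->zone_count], vio);
  }
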
The Deduplication Index
-----------------------

In order to identify duplicate data efficiently, vdo was designed to
leverage some common characteristics of duplicate data. From empirical
observations, we gathered two key insights. The first is that in most data
sets with significant amounts of duplicate data, the duplicates tend to
have temporal locality. When a duplicate appears, it is more likely that
other duplicates will be detected, and that those duplicates will have been
written at about the same time. This is why the index keeps records in
temporal order. The second insight is that new data is more likely to
duplicate recent data than it is to duplicate older data, and in general,
there are diminishing returns to looking further back in time. Therefore,
when the index is full, it should cull its oldest records to make space for
new ones. Another important idea behind the design of the index is that the
ultimate goal of deduplication is to reduce storage costs. Since there is a
trade-off between the storage saved and the resources expended to achieve
those savings, vdo does not attempt to find every last duplicate block. It
is sufficient to find and eliminate most of the redundancy.

Each block of data is hashed to produce a 16-byte block name. An index
record consists of this block name paired with the presumed location of
that data on the underlying storage. However, it is not possible to
guarantee that the index is accurate. In the most common case, this occurs
because it is too costly to update the index when a block is over-written
or discarded. Doing so would require either storing the block name along
with the blocks, which is difficult to do efficiently in block-based
storage, or reading and rehashing each block before overwriting it.
Inaccuracy can also result from a hash collision where two different blocks
have the same name. In practice, this is extremely unlikely, but because
vdo does not use a cryptographic hash, a malicious workload could be
constructed. Because of these inaccuracies, vdo treats the locations in the
index as hints, and reads each indicated block to verify that it is indeed
a duplicate before sharing the existing block with a new one.

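The hint-then-verify flow can be sketched as follows. This is a toy model
with an in-memory store; read_block and struct index_record are
illustrative stand-ins, not the dm-vdo or UDS interfaces::

  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  #define BLOCK_SIZE 4096

  struct index_record {
          uint8_t name[16];   /* hash of the block's contents */
          uint64_t pbn;       /* presumed physical location of that data */
  };

  /* Toy backing store standing in for the real storage layer. */
  static uint8_t storage[16][BLOCK_SIZE];

  static void read_block(uint64_t pbn, uint8_t buf[BLOCK_SIZE])
  {
          memcpy(buf, storage[pbn], BLOCK_SIZE);
  }

  /*
   * The index may be stale (the block was overwritten or discarded) or
   * wrong (two blocks hashed to the same name), so its advice is only a
   * hint: the candidate block must be read and compared byte-for-byte
   * before it can be shared.
   */
  static bool verify_duplicate(const struct index_record *hint,
                               const uint8_t new_data[BLOCK_SIZE])
  {
          uint8_t candidate[BLOCK_SIZE];

          read_block(hint->pbn, candidate);
          return memcmp(candidate, new_data, BLOCK_SIZE) == 0;
  }
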
Records are collected into groups called chapters. New records are added to
the newest chapter, called the open chapter. This chapter is stored in a
format optimized for adding and modifying records, and the content of the
open chapter is not finalized until it runs out of space for new records.
When the open chapter fills up, it is closed and a new open chapter is
created to collect new records.

Closing a chapter converts it to a different format which is optimized for
reading. The records are written to a series of record pages based on the
order in which they were received. This means that records with temporal
locality should be on a small number of pages, reducing the I/O required to
retrieve them. The chapter also compiles an index that indicates which
record page contains any given name. This index means that a request for a
name can determine exactly which record page may contain that record,
without having to load the entire chapter from storage. This index uses
only a subset of the block name as its key, so it cannot guarantee that an
index entry refers to the desired block name. It can only guarantee that if
there is a record for this name, it will be on the indicated page. Closed
chapters are read-only structures and their contents are never altered in
any way.

Once enough records have been written to fill up all the available index
space, the oldest chapter is removed to make space for new chapters. Any
time a request finds a matching record in the index, that record is copied
into the open chapter. This ensures that useful block names remain
available in the index, while unreferenced block names are forgotten over
time.

In order to find records in older chapters, the index also maintains a
higher level structure called the volume index, which contains entries
mapping each block name to the chapter containing its newest record. This
mapping is updated as records for the block name are copied or updated,
ensuring that only the newest record for a given block name can be found.
An older record for a block name will no longer be found even though it has
not been deleted from its chapter. Like the chapter index, the volume index
uses only a subset of the block name as its key and cannot definitively say
that a record exists for a name. It can only say which chapter would
contain the record if a record exists. The volume index is stored entirely
in memory and is saved to storage only when the vdo target is shut down.

A request for a particular block name will first look up the name in the
volume index. This search will either indicate that the name is new, or
indicate which chapter to search. If it returns a chapter, the request
looks up its name in the chapter index. This will indicate either that the
name is new, or which record page to search. Finally, if it is not new, the
request will look for its name in the indicated record page.

This process may require up to two page reads per request (one for the
chapter index page and one for the record page). However, recently accessed
pages are cached so that these page reads can be amortized across many
block name requests.

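The shape of this lookup can be sketched as below; each level can only
narrow the search or report a miss, never confirm a match by itself. The
helper names are hypothetical stubs, not the UDS API::

  #include <stdbool.h>
  #include <stdint.h>

  /* Assumed stubs for each level of the index (not the real UDS calls). */
  bool volume_index_lookup(const uint8_t name[16], unsigned *chapter);
  bool chapter_index_lookup(unsigned chapter, const uint8_t name[16],
                            unsigned *record_page);
  bool record_page_search(unsigned chapter, unsigned record_page,
                          const uint8_t name[16], uint64_t *pbn);

  /* Returns true and fills *pbn only if a record is actually present. */
  static bool index_lookup(const uint8_t name[16], uint64_t *pbn)
  {
          unsigned chapter, record_page;

          if (!volume_index_lookup(name, &chapter))
                  return false;   /* name is new */
          if (!chapter_index_lookup(chapter, name, &record_page))
                  return false;   /* name is new */
          /* The record page is the only place a match can be confirmed. */
          return record_page_search(chapter, record_page, name, pbn);
  }
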
The volume index and the chapter indexes are implemented using a
memory-efficient structure called a delta index. Instead of storing the
entire block name (the key) for each entry, the entries are sorted by name
and only the difference between adjacent keys (the delta) is stored.
Because we expect the hashes to be randomly distributed, the size of the
deltas follows an exponential distribution. Because of this distribution,
the deltas are expressed using a Huffman code to take up even less space.
The entire sorted list of keys is called a delta list. This structure
allows the index to use many fewer bytes per entry than a traditional hash
table, but it is slightly more expensive to look up entries, because a
request must read every entry in a delta list to add up the deltas in order
to find the record it needs. The delta index reduces this lookup cost by
splitting its key space into many sub-lists, each starting at a fixed key
value, so that each individual list is short.

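The core encoding idea, minus the Huffman coding, can be shown in a few
lines. This standalone example uses plain integers as keys; it is a sketch
of the technique, not the vdo implementation::

  #include <stdint.h>
  #include <stdio.h>

  /* Store only the gaps between sorted keys instead of the keys. */
  static void encode_deltas(const uint32_t *keys, uint32_t *deltas, int n)
  {
          for (int i = 0; i < n; i++)
                  deltas[i] = keys[i] - (i ? keys[i - 1] : 0);
  }

  /*
   * A lookup must sum the deltas from the start of the list, which is
   * why the key space is split into many short sub-lists.
   */
  static int delta_list_lookup(const uint32_t *deltas, int n, uint32_t key)
  {
          uint32_t sum = 0;

          for (int i = 0; i < n; i++) {
                  sum += deltas[i];
                  if (sum == key)
                          return i;
                  if (sum > key)
                          break;
          }
          return -1;
  }

  int main(void)
  {
          uint32_t keys[] = { 7, 19, 21, 60, 61 };
          uint32_t deltas[5];

          encode_deltas(keys, deltas, 5);
          /* deltas is now { 7, 12, 2, 39, 1 }; key 60 is at index 3. */
          printf("%d\n", delta_list_lookup(deltas, 5, 60));
          return 0;
  }
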
The default index size can hold 64 million records, corresponding to about
256GB of data. This means that the index can identify duplicate data if the
original data was written within the last 256GB of writes. This range is
called the deduplication window. If new writes duplicate data that is older
than that, the index will not be able to find it because the records of the
older data have been removed. This means that if an application writes a
200GB file to a vdo target and then immediately writes it again, the two
copies will deduplicate perfectly. Doing the same with a 500GB file will
result in no deduplication, because the beginning of the file will no
longer be in the index by the time the second write begins (assuming there
is no duplication within the file itself).

If an application anticipates a data workload that will see useful
deduplication beyond the 256GB threshold, vdo can be configured to use a
larger index with a correspondingly larger deduplication window. (This
configuration can only be set when the target is created, not altered
later. It is important to consider the expected workload for a vdo target
before configuring it.) There are two ways to do this.

One way is to increase the memory size of the index, which also increases
the amount of backing storage required. Doubling the size of the index will
double the length of the deduplication window at the expense of doubling
the storage size and the memory requirements.

The other option is to enable sparse indexing. Sparse indexing increases
the deduplication window by a factor of 10, at the expense of also
increasing the storage size by a factor of 10. However, with sparse
indexing, the memory requirements do not increase. The trade-off is
slightly more computation per request and a slight decrease in the amount
of deduplication detected. For most workloads with significant amounts of
duplicate data, sparse indexing will detect 97-99% of the deduplication
that a standard index will detect.

The vio and data_vio Structures
-------------------------------

A vio (short for Vdo I/O) is conceptually similar to a bio, with additional
fields and data to track vdo-specific information. A struct vio maintains a
pointer to a bio but also tracks other fields specific to the operation of
vdo. The vio is kept separate from its related bio because there are many
circumstances where vdo completes the bio but must continue to do work
related to deduplication or compression.

Metadata reads and writes, and other writes that originate within vdo, use
a struct vio directly. Application reads and writes use a larger structure
called a data_vio to track information about their progress. A struct
data_vio contains a struct vio and also includes several other fields
related to deduplication and other vdo features. The data_vio is the
primary unit of application work in vdo. Each data_vio proceeds through a
set of steps to handle the application data, after which it is reset and
returned to a pool of data_vios for reuse.

There is a fixed pool of 2048 data_vios. This number was chosen to bound
the amount of work that is required to recover from a crash. In addition,
benchmarks have indicated that increasing the size of the pool does not
significantly improve performance.

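The relationship between the two structures looks roughly like the sketch
below. The fields shown are a simplified assumption for illustration; the
actual kernel definitions contain many more fields::

  #include <stdint.h>

  struct bio;                      /* kernel block I/O descriptor */

  struct vio {
          struct bio *bio;         /* the bio being carried, if any */
          /* ... vdo-specific tracking fields ... */
  };

  struct data_vio {
          struct vio vio;          /* embedded vio; metadata I/O uses a
                                      bare struct vio instead */
          uint64_t logical;        /* logical block address of the request */
          uint8_t record_name[16]; /* hash of the data (the "record name") */
          /* ... deduplication and compression state ... */
  };
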
The Data Store
--------------

The data store is implemented by three main data structures, all of which
work in concert to reduce or amortize metadata updates across as many data
writes as possible.

*The Slab Depot*

Most of the vdo volume belongs to the slab depot. The depot contains a
collection of slabs. The slabs can be up to 32GB, and are divided into
three sections. Most of a slab consists of a linear sequence of 4K blocks.
These blocks are used either to store data, or to hold portions of the
block map (see below). In addition to the data blocks, each slab has a set
of reference counters, using 1 byte for each data block. Finally, each slab
has a journal.

Reference updates are written to the slab journal. Slab journal blocks are
written out either when they are full, or when the recovery journal
requests that they do so in order to allow the main recovery journal (see
below) to free up space. The slab journal is used both to ensure that the
main recovery journal can regularly free up space, and also to amortize the
cost of updating individual reference blocks. The reference counters are
kept in memory and are written out, a block at a time in oldest-dirtied
order, only when there is a need to reclaim slab journal space. The write
operations are performed in the background as needed so they do not add
latency to particular I/O operations.

Each slab is independent of every other slab. Slabs are assigned to
"physical zones" in round-robin fashion. If there are P physical zones,
then slab n is assigned to zone n mod P.

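A simplified picture of a slab and its zone assignment follows. The field
names and the in-memory layout here are assumptions for illustration, not
the actual dm-vdo structures::

  #include <stdint.h>

  struct slab {
          uint64_t start_pbn;        /* first of the slab's 4K data blocks */
          uint64_t data_block_count;
          uint8_t *reference_counts; /* one byte per data block, held in
                                        memory and written back lazily */
          /* ... slab journal ... */
  };

  /* Slab n belongs to physical zone (n mod P). */
  static unsigned slab_zone(unsigned slab_number, unsigned physical_zones)
  {
          return slab_number % physical_zones;
  }
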
The slab depot maintains an additional small data structure, the "slab
summary," which is used to reduce the amount of work needed to come back
online after a crash. The slab summary maintains an entry for each slab
indicating whether or not the slab has ever been used, whether all of its
reference count updates have been persisted to storage, and approximately
how full it is. During recovery, each physical zone will attempt to recover
at least one slab, stopping whenever it has recovered a slab which has some
free blocks. Once each zone has some space, or has determined that none is
available, the target can resume normal operation in a degraded mode. Read
and write requests can be serviced, perhaps with degraded performance,
while the remainder of the dirty slabs are recovered.

*The Block Map*

The block map contains the logical to physical mapping. It can be thought
of as an array with one entry per logical address. Each entry is 5 bytes,
36 bits of which contain the physical block number which holds the data for
the given logical address. The other 4 bits are used to indicate the nature
of the mapping. Of the 16 possible states, one represents a logical address
which is unmapped (i.e. it has never been written, or has been discarded),
one represents an uncompressed block, and the other 14 states are used to
indicate that the mapped data is compressed, and which of the compression
slots in the compressed block contains the data for this logical address.

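One possible way to pack such a 5-byte entry (36-bit physical block number
plus 4-bit state) is sketched below. The exact bit layout is an assumption
for illustration and does not necessarily match the on-disk format::

  #include <stdint.h>

  enum mapping_state {
          MAPPING_UNMAPPED = 0,      /* never written, or discarded */
          MAPPING_UNCOMPRESSED = 1,  /* a full 4K block */
          /* states 2-15: which compression slot holds the data */
  };

  struct block_map_entry {
          uint8_t raw[5];            /* 36 + 4 = 40 bits */
  };

  static void pack_entry(struct block_map_entry *e, uint64_t pbn,
                         enum mapping_state state)
  {
          uint64_t packed = (pbn << 4) | ((uint64_t)state & 0xf);

          for (int i = 0; i < 5; i++)
                  e->raw[i] = (packed >> (8 * i)) & 0xff;
  }

  static uint64_t entry_pbn(const struct block_map_entry *e)
  {
          uint64_t packed = 0;

          for (int i = 0; i < 5; i++)
                  packed |= (uint64_t)e->raw[i] << (8 * i);
          return packed >> 4;
  }
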
In practice, the array of mapping entries is divided into "block map
pages," each of which fits in a single 4K block. Each block map page
consists of a header and 812 mapping entries. Each mapping page is actually
a leaf of a radix tree which consists of block map pages at each level.
There are 60 radix trees which are assigned to "logical zones" in round
robin fashion. (If there are L logical zones, tree n will belong to zone n
mod L.) At each level, the trees are interleaved, so logical addresses
0-811 belong to tree 0, logical addresses 812-1623 belong to tree 1, and so
on. The interleaving is maintained all the way up to the 60 root nodes.
Choosing 60 trees results in an evenly distributed number of trees per zone
for a large number of possible logical zone counts. The storage for the 60
tree roots is allocated at format time. All other block map pages are
allocated out of the slabs as needed. This flexible allocation avoids the
need to pre-allocate space for the entire set of logical mappings and also
makes growing the logical size of a vdo relatively easy.

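The leaf-level interleaving amounts to simple arithmetic: consecutive runs
of 812 logical addresses rotate through the 60 trees. The constants match
the text above; the helper names are illustrative::

  #include <stdint.h>

  #define ENTRIES_PER_PAGE 812
  #define TREE_COUNT 60

  /* Which of the 60 trees holds the leaf entry for a logical address. */
  static unsigned tree_for_lba(uint64_t lba)
  {
          return (lba / ENTRIES_PER_PAGE) % TREE_COUNT;
  }

  /* Which logical zone owns that tree, given L logical zones. */
  static unsigned zone_for_tree(unsigned tree, unsigned logical_zones)
  {
          return tree % logical_zones;
  }
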
In operation, the block map maintains two caches. It is prohibitively
expensive to keep the entire leaf level of the trees in memory, so each
logical zone maintains its own cache of leaf pages. The size of this cache
is configurable at target start time. The second cache is allocated at
start time, and is large enough to hold all the non-leaf pages of the
entire block map. This cache is populated as pages are needed.

*The Recovery Journal*

The recovery journal is used to amortize updates across the block map and
slab depot. Each write request causes an entry to be made in the journal.
Entries are either "data remappings" or "block map remappings." For a data
remapping, the journal records the logical address affected and its old and
new physical mappings. For a block map remapping, the journal records the
block map page number and the physical block allocated for it. Block map
pages are never reclaimed or repurposed, so the old mapping is always 0.

Each journal entry is an intent record summarizing the metadata updates
that are required for a data_vio. The recovery journal issues a flush
before each journal block write to ensure that the physical data for the
new block mappings in that block are stable on storage, and journal block
writes are all issued with the FUA bit set to ensure the recovery journal
entries themselves are stable. The journal entry and the data write it
represents must be stable on disk before the other metadata structures may
be updated to reflect the operation. These entries allow the vdo device to
reconstruct the logical to physical mappings after an unexpected
interruption such as a loss of power.

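Conceptually, a "data remapping" entry carries just three values, as in the
sketch below. This illustrates the information recorded, not the actual
on-disk encoding::

  #include <stdint.h>

  struct data_remapping_entry {
          uint64_t logical;  /* logical block address affected */
          uint64_t old_pbn;  /* prior physical mapping (0 if unmapped) */
          uint64_t new_pbn;  /* new physical mapping */
  };
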
*Write Path*

All write I/O to vdo is asynchronous. Each bio will be acknowledged as soon
as vdo has done enough work to guarantee that it can complete the write
eventually. Generally, the data for acknowledged but unflushed write I/O
can be treated as though it is cached in memory. If an application requires
data to be stable on storage, it must issue a flush or write the data with
the FUA bit set like any other asynchronous I/O. Shutting down the vdo
target will also flush any remaining I/O.

Application write bios follow the steps outlined below.

1.  A data_vio is obtained from the data_vio pool and associated with the
    application bio. If there are no data_vios available, the incoming bio
    will block until a data_vio is available. This provides back pressure
    to the application. The data_vio pool is protected by a spin lock.

    The newly acquired data_vio is reset and the bio's data is copied into
    the data_vio if it is a write and the data is not all zeroes. The data
    must be copied because the application bio can be acknowledged before
    the data_vio processing is complete, which means later processing steps
    will no longer have access to the application bio. The application bio
    may also be smaller than 4K, in which case the data_vio will have
    already read the underlying block and the data is instead copied over
    the relevant portion of the larger block.

2.  The data_vio places a claim (the "logical lock") on the logical address
    of the bio. It is vital to prevent simultaneous modifications of the
    same logical address, because deduplication involves sharing blocks.
    This claim is implemented as an entry in a hashtable where the key is
    the logical address and the value is a pointer to the data_vio
    currently handling that address. (A simplified sketch of this claim
    pattern appears after this list.)

    If a data_vio looks in the hashtable and finds that another data_vio is
    already operating on that logical address, it waits until the previous
    operation finishes. It also sends a message to inform the current lock
    holder that it is waiting. Most notably, a new data_vio waiting for a
    logical lock will flush the previous lock holder out of the compression
    packer (step 8e) rather than allowing it to continue waiting to be
    packed.

    This stage requires the data_vio to get an implicit lock on the
    appropriate logical zone to prevent concurrent modifications of the
    hashtable. This implicit locking is handled by the zone divisions
    described above.

3.  The data_vio traverses the block map tree to ensure that all the
    necessary internal tree nodes have been allocated, by trying to find
    the leaf page for its logical address. If any interior tree page is
    missing, it is allocated at this time out of the same physical storage
    pool used to store application data.

    a. If any page-node in the tree has not yet been allocated, it must be
       allocated before the write can continue. This step requires the
       data_vio to lock the page-node that needs to be allocated. This
       lock, like the logical block lock in step 2, is a hashtable entry
       that causes other data_vios to wait for the allocation process to
       complete.

       The implicit logical zone lock is released while the allocation is
       happening, in order to allow other operations in the same logical
       zone to proceed. The details of allocation are the same as in
       step 4. Once a new node has been allocated, that node is added to
       the tree using a similar process to adding a new data block mapping.
       The data_vio journals the intent to add the new node to the block
       map tree (step 10), updates the reference count of the new block
       (step 11), and reacquires the implicit logical zone lock to add the
       new mapping to the parent tree node (step 12). Once the tree is
       updated, the data_vio proceeds down the tree. Any other data_vios
       waiting on this allocation also proceed.

    b. In the steady-state case, the block map tree nodes will already be
       allocated, so the data_vio just traverses the tree until it finds
       the required leaf node. The location of the mapping (the "block map
       slot") is recorded in the data_vio so that later steps do not need
       to traverse the tree again. The data_vio then releases the implicit
       logical zone lock.

4.  If the block is a zero block, skip to step 9. Otherwise, an attempt is
    made to allocate a free data block. This allocation ensures that the
    data_vio can write its data somewhere even if deduplication and
    compression are not possible. This stage gets an implicit lock on a
    physical zone to search for free space within that zone.

    The data_vio will search each slab in a zone until it finds a free
    block or decides there are none. If the first zone has no free space,
    it will proceed to search the next physical zone by taking the implicit
    lock for that zone and releasing the previous one until it finds a free
    block or runs out of zones to search. The data_vio will acquire a
    struct pbn_lock (the "physical block lock") on the free block. The
    struct pbn_lock also has several fields to record the various kinds of
    claims that data_vios can have on physical blocks. The pbn_lock is
    added to a hashtable like the logical block locks in step 2. This
    hashtable is also covered by the implicit physical zone lock. The
    reference count of the free block is updated to prevent any other
    data_vio from considering it free. The reference counters are a
    sub-component of the slab and are thus also covered by the implicit
    physical zone lock.

5.  If an allocation was obtained, the data_vio has all the resources it
    needs to complete the write. The application bio can safely be
    acknowledged at this point. The acknowledgment happens on a separate
    thread to prevent the application callback from blocking other data_vio
    operations.

    If an allocation could not be obtained, the data_vio continues to
    attempt to deduplicate or compress the data, but the bio is not
    acknowledged because the vdo device may be out of space.

6.  At this point vdo must determine where to store the application data.
    The data_vio's data is hashed and the hash (the "record name") is
    recorded in the data_vio.

7.  The data_vio reserves or joins a struct hash_lock, which manages all of
    the data_vios currently writing the same data. Active hash locks are
    tracked in a hashtable similar to the way logical block locks are
    tracked in step 2. This hashtable is covered by the implicit lock on
    the hash zone.

    If there is no existing hash lock for this data_vio's record_name, the
    data_vio obtains a hash lock from the pool, adds it to the hashtable,
    and sets itself as the new hash lock's "agent." The hash_lock pool is
    also covered by the implicit hash zone lock. The hash lock agent will
    do all the work to decide where the application data will be written.
    If a hash lock for the data_vio's record_name already exists, and the
    data_vio's data is the same as the agent's data, the new data_vio will
    wait for the agent to complete its work and then share its result.

    In the rare case that a hash lock exists for the data_vio's hash but
    the data does not match the hash lock's agent, the data_vio skips to
    step 8h and attempts to write its data directly. This can happen if two
    different data blocks produce the same hash, for example.

8.  The hash lock agent attempts to deduplicate or compress its data with
    the following steps.

    a. The agent initializes and sends its embedded deduplication request
       (struct uds_request) to the deduplication index. This does not
       require the data_vio to get any locks because the index components
       manage their own locking. The data_vio waits until it either gets a
       response from the index or times out.

    b. If the deduplication index returns advice, the data_vio attempts to
       obtain a physical block lock on the indicated physical address, in
       order to read the data and verify that it is the same as the
       data_vio's data, and that it can accept more references. If the
       physical address is already locked by another data_vio, the data at
       that address may soon be overwritten so it is not safe to use the
       address for deduplication.

    c. If the data matches and the physical block can add references, the
       agent and any other data_vios waiting on it will record this
       physical block as their new physical address and proceed to step 9
       to record their new mapping. If there are more data_vios in the hash
       lock than there are references available, one of the remaining
       data_vios becomes the new agent and continues to step 8d as if no
       valid advice was returned.

    d. If no usable duplicate block was found, the agent first checks that
       it has an allocated physical block (from step 4) that it can write
       to. If the agent does not have an allocation, some other data_vio in
       the hash lock that does have an allocation takes over as agent. If
       none of the data_vios have an allocated physical block, these writes
       are out of space, so they proceed to step 13 for cleanup.

    e. The agent attempts to compress its data. If the data does not
       compress, the data_vio will continue to step 8h to write its data
       directly.

       If the compressed size is small enough, the agent will release the
       implicit hash zone lock and go to the packer (struct packer) where
       it will be placed in a bin (struct packer_bin) along with other
       data_vios. All compression operations require the implicit lock on
       the packer zone.

       The packer can combine up to 14 compressed blocks in a single 4K
       data block. Compression is only helpful if vdo can pack at least 2
       data_vios into a single data block. This means that a data_vio may
       wait in the packer for an arbitrarily long time for other data_vios
       to fill out the compressed block. There is a mechanism for vdo to
       evict waiting data_vios when continuing to wait would cause
       problems. Circumstances causing an eviction include an application
       flush, device shutdown, or a subsequent data_vio trying to overwrite
       the same logical block address. A data_vio may also be evicted from
       the packer if it cannot be paired with any other compressed block
       before more compressible blocks need to use its bin. An evicted
       data_vio will proceed to step 8h to write its data directly.

    f. If the agent fills a packer bin, either because all 14 of its slots
       are used or because it has no remaining space, it is written out
       using the allocated physical block from one of its data_vios. Step
       8d has already ensured that an allocation is available.

    g. Each data_vio sets the compressed block as its new physical address.
       The data_vio obtains an implicit lock on the physical zone and
       acquires the struct pbn_lock for the compressed block, which is
       modified to be a shared lock. Then it releases the implicit physical
       zone lock and proceeds to step 8i.

    h. Any data_vio evicted from the packer will have an allocation from
       step 4. It will write its data to that allocated physical block.

    i. After the data is written, if the data_vio is the agent of a hash
       lock, it will reacquire the implicit hash zone lock and share its
       physical address with as many other data_vios in the hash lock as
       possible. Each data_vio will then proceed to step 9 to record its
       new mapping.

    j. If the agent actually wrote new data (whether compressed or not),
       the deduplication index is updated to reflect the location of the
       new data. The agent then releases the implicit hash zone lock.

9.  The data_vio determines the previous mapping of the logical address.
    There is a cache for block map leaf pages (the "block map cache"),
    because there are usually too many block map leaf nodes to store
    entirely in memory. If the desired leaf page is not in the cache, the
    data_vio will reserve a slot in the cache and load the desired page
    into it, possibly evicting an older cached page. The data_vio then
    finds the current physical address for this logical address (the "old
    physical mapping"), if any, and records it. This step requires a lock
    on the block map cache structures, covered by the implicit logical zone
    lock.

10. The data_vio makes an entry in the recovery journal containing the
    logical block address, the old physical mapping, and the new physical
    mapping. Making this journal entry requires holding the implicit
    recovery journal lock. The data_vio will wait in the journal until all
    recovery journal blocks up to the one containing its entry have been
    written and flushed to ensure the transaction is stable on storage.

11. Once the recovery journal entry is stable, the data_vio makes two slab
    journal entries: an increment entry for the new mapping, and a
    decrement entry for the old mapping. These two operations each require
    holding a lock on the affected physical slab, covered by its implicit
    physical zone lock. For correctness during recovery, the slab journal
    entries in any given slab journal must be in the same order as the
    corresponding recovery journal entries. Therefore, if the two entries
    are in different zones, they are made concurrently, and if they are in
    the same zone, the increment is always made before the decrement in
    order to avoid underflow. After each slab journal entry is made in
    memory, the associated reference count is also updated in memory.

12. Once both of the reference count updates are done, the data_vio
    acquires the implicit logical zone lock and updates the
    logical-to-physical mapping in the block map to point to the new
    physical block. At this point the write operation is complete.

13. If the data_vio has a hash lock, it acquires the implicit hash zone
    lock and releases its hash lock to the pool.

    The data_vio then acquires the implicit physical zone lock and releases
    the struct pbn_lock it holds for its allocated block. If it had an
    allocation that it did not use, it also sets the reference count for
    that block back to zero to free it for use by subsequent data_vios.

    The data_vio then acquires the implicit logical zone lock and releases
    the logical block lock acquired in step 2.

    The application bio is then acknowledged if it has not previously been
    acknowledged, and the data_vio is returned to the pool.

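The logical lock claim from step 2 can be sketched as a simple
open-addressing table mapping a logical block address to the data_vio that
currently owns it. This is a toy model: the names are illustrative, lock
release is omitted, and the real implementation uses its own hashtable,
relying on the zone's single thread rather than any lock on the table::

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  #define LOCK_TABLE_SIZE 1024

  struct data_vio;  /* request object; opaque in this sketch */

  struct logical_lock_table {
          struct {
                  uint64_t lba;
                  struct data_vio *holder;
          } slots[LOCK_TABLE_SIZE];
  };

  /*
   * Attempt to claim the logical address. Returns NULL on success, or
   * the current holder that the caller must wait on. Only the owning
   * zone's thread ever calls this, so no locking is needed.
   */
  static struct data_vio *claim_logical_lock(struct logical_lock_table *t,
                                             uint64_t lba,
                                             struct data_vio *claimant)
  {
          size_t i = lba % LOCK_TABLE_SIZE;

          for (size_t probes = 0; probes < LOCK_TABLE_SIZE; probes++) {
                  if (t->slots[i].holder == NULL) {
                          t->slots[i].lba = lba;
                          t->slots[i].holder = claimant;
                          return NULL;  /* claim acquired */
                  }
                  if (t->slots[i].lba == lba)
                          return t->slots[i].holder;  /* wait on holder */
                  i = (i + 1) % LOCK_TABLE_SIZE;
          }
          assert(0 && "toy table full");  /* sketch: assume never full */
          return NULL;
  }
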
*Read Path*

An application read bio follows a much simpler set of steps. It does steps
1 and 2 in the write path to obtain a data_vio and lock its logical
address. If there is already a write data_vio in progress for that logical
address that is guaranteed to complete, the read data_vio will copy the
data from the write data_vio and return it. Otherwise, it will look up the
logical-to-physical mapping by traversing the block map tree as in step 3,
and then read and possibly decompress the indicated data at the indicated
physical block address. A read data_vio will not allocate block map tree
nodes if they are missing. If the interior block map nodes do not exist
yet, the logical address must still be unmapped and the read data_vio will
return all zeroes. A read data_vio handles cleanup and acknowledgment as in
step 13, although it only needs to release the logical lock and return
itself to the pool.

*Small Writes*

All storage within vdo is managed as 4KB blocks, but it can accept writes
as small as 512 bytes. Processing a write that is smaller than 4K requires
a read-modify-write operation that reads the relevant 4K block, copies the
new data over the appropriate sectors of the block, and then launches a
write operation for the modified data block. The read and write stages of
this operation are nearly identical to the normal read and write
operations, and a single data_vio is used throughout this operation.

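The read-modify-write splice can be sketched as below, using a toy
in-memory store; read_block, write_block, and small_write are illustrative
stand-ins for the real read and write paths::

  #include <stdint.h>
  #include <string.h>

  #define BLOCK_SIZE 4096
  #define SECTOR_SIZE 512

  /* Toy backing store standing in for the real storage layer. */
  static uint8_t storage[16][BLOCK_SIZE];

  static void read_block(uint64_t pbn, uint8_t buf[BLOCK_SIZE])
  {
          memcpy(buf, storage[pbn], BLOCK_SIZE);
  }

  static void write_block(uint64_t pbn, const uint8_t buf[BLOCK_SIZE])
  {
          memcpy(storage[pbn], buf, BLOCK_SIZE);
  }

  /* Read the containing 4K block, splice in the new sectors, write back. */
  static void small_write(uint64_t pbn, unsigned first_sector,
                          unsigned sector_count, const uint8_t *data)
  {
          uint8_t block[BLOCK_SIZE];

          read_block(pbn, block);  /* read phase */
          memcpy(block + (size_t)first_sector * SECTOR_SIZE, data,
                 (size_t)sector_count * SECTOR_SIZE);
          write_block(pbn, block); /* continues as a normal 4K write */
  }
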
*Recovery*

When a vdo is restarted after a crash, it will attempt to recover from the
recovery journal. During the pre-resume phase of the next start, the
recovery journal is read. The increment portion of each valid entry is
played into the block map. Next, valid entries are played, in order as
required, into the slab journals. Finally, each physical zone attempts to
replay at least one slab journal to reconstruct the reference counts of one
slab. Once each zone has some free space (or has determined that it has
none), the vdo comes back online, while the remainder of the slab journals
are used to reconstruct the rest of the reference counts in the background.

*Read-only Rebuild*

If a vdo encounters an unrecoverable error, it will enter read-only mode.
This mode indicates that some previously acknowledged data may have been
lost. The vdo may be instructed to rebuild as best it can in order to
return to a writable state. However, this is never done automatically due
to the possibility that data has been lost. During a read-only rebuild, the
block map is recovered from the recovery journal as before. However, the
reference counts are not rebuilt from the slab journals. Instead, the
reference counts are zeroed, the entire block map is traversed, and the
reference counts are updated from the block mappings. While this may lose
some data, it ensures that the block map and reference counts are
consistent with each other. This allows vdo to resume normal operation and
accept further writes.