.. SPDX-License-Identifier: GPL-2.0-only

================
Design of dm-vdo
================

The dm-vdo (virtual data optimizer) target provides inline deduplication,
compression, zero-block elimination, and thin provisioning. A dm-vdo target
can be backed by up to 256TB of storage, and can present a logical size of
up to 4PB. This target was originally developed at Permabit Technology
Corp. starting in 2009. It was first released in 2013 and has been used in
production environments ever since. It was made open-source in 2017 after
Permabit was acquired by Red Hat. This document describes the design of
dm-vdo. For usage, see vdo.rst in the same directory as this file.

Because deduplication rates fall drastically as the block size increases, a
vdo target has a maximum block size of 4K. However, it can achieve
deduplication rates of 254:1, i.e. up to 254 copies of a given 4K block can
reference a single 4K block of actual storage. It can achieve compression
rates of 14:1. All zero blocks consume no storage at all.

Theory of Operation
===================

The design of dm-vdo is based on the idea that deduplication is a two-part
problem. The first is to recognize duplicate data. The second is to avoid
storing multiple copies of those duplicates. Therefore, dm-vdo has two main
parts: a deduplication index (called UDS) that is used to discover
duplicate data, and a data store with a reference counted block map that
maps from logical block addresses to the actual storage location of the
data.

Zones and Threading
-------------------

Due to the complexity of data optimization, the number of metadata
structures involved in a single write operation to a vdo target is larger
than in most other targets. Furthermore, because vdo must operate on small
block sizes in order to achieve good deduplication rates, acceptable
performance can only be achieved through parallelism. Therefore, vdo's
design attempts to be lock-free.

Most of a vdo's main data structures are designed to be easily divided into
"zones" such that any given bio must only access a single zone of any zoned
structure. Safety with minimal locking is achieved by ensuring that during
normal operation, each zone is assigned to a specific thread, and only that
thread will access the portion of the data structure in that zone.
Associated with each thread is a work queue. Each bio is associated with a
request object (the "data_vio") which will be added to a work queue when
the next phase of its operation requires access to the structures in the
zone associated with that queue.

Another way of thinking about this arrangement is that the work queue for
each zone has an implicit lock on the structures it manages for all its
operations, because vdo guarantees that no other thread will alter those
structures.

Although each structure is divided into zones, this division is not
reflected in the on-disk representation of each data structure. Therefore,
the number of zones for each structure, and hence the number of threads,
can be reconfigured each time a vdo target is started.

The Deduplication Index
-----------------------

In order to identify duplicate data efficiently, vdo was designed to
leverage some common characteristics of duplicate data. From empirical
observations, we gathered two key insights. The first is that in most data
sets with significant amounts of duplicate data, the duplicates tend to
have temporal locality. When a duplicate appears, it is more likely that
other duplicates will be detected, and that those duplicates will have been
written at about the same time. This is why the index keeps records in
temporal order. The second insight is that new data is more likely to
duplicate recent data than it is to duplicate older data, and in general,
there are diminishing returns to looking further back in time. Therefore,
when the index is full, it should cull its oldest records to make space for
new ones.

Another important idea behind the design of the index is that the
ultimate goal of deduplication is to reduce storage costs. Since there is a
trade-off between the storage saved and the resources expended to achieve
those savings, vdo does not attempt to find every last duplicate block. It
is sufficient to find and eliminate most of the redundancy.

Each block of data is hashed to produce a 16-byte block name. An index
record consists of this block name paired with the presumed location of
that data on the underlying storage. However, it is not possible to
guarantee that the index is accurate. In the most common case, this occurs
because it is too costly to update the index when a block is over-written
or discarded. Doing so would require either storing the block name along
with the blocks, which is difficult to do efficiently in block-based
storage, or reading and rehashing each block before overwriting it.
Inaccuracy can also result from a hash collision, where two different
blocks have the same name. In practice, this is extremely unlikely, but
because vdo does not use a cryptographic hash, a malicious workload could
be constructed to produce collisions deliberately. Because of these
inaccuracies, vdo treats the locations in the index as hints, and reads
each indicated block to verify that it is indeed a duplicate before
sharing the existing block with a new one.
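
The hint-and-verify step can be sketched as follows. This is an
illustrative model only, not vdo's code: the dict-backed ``read_block``
callback and the function name are assumptions made for the sketch.

```python
def verify_and_share(new_data: bytes, hinted_pbn: int, read_block):
    """Treat the index location as a hint: read the block it names and
    confirm a byte-for-byte match before sharing it. A stale or colliding
    hint simply fails verification, and the data is written normally."""
    existing = read_block(hinted_pbn)
    if existing == new_data:
        return hinted_pbn    # verified duplicate: reference this block
    return None              # not a duplicate: allocate and write instead
```

Because verification reads the actual data, a wrong hint costs only one
extra read; it can never cause incorrect sharing.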

Records are collected into groups called chapters. New records are added to
the newest chapter, called the open chapter. This chapter is stored in a
format optimized for adding and modifying records, and the content of the
open chapter is not finalized until it runs out of space for new records.
When the open chapter fills up, it is closed and a new open chapter is
created to collect new records.

Closing a chapter converts it to a different format which is optimized for
reading. The records are written to a series of record pages based on the
order in which they were received. This means that records with temporal
locality should be on a small number of pages, reducing the I/O required to
retrieve them. The chapter also compiles an index that indicates which
record page contains any given name. This index means that a request for a
name can determine exactly which record page may contain that record,
without having to load the entire chapter from storage. This index uses
only a subset of the block name as its key, so it cannot guarantee that an
index entry refers to the desired block name. It can only guarantee that if
there is a record for this name, it will be on the indicated page. Closed
chapters are read-only structures and their contents are never altered in
any way.

Once enough records have been written to fill up all the available index
space, the oldest chapter is removed to make space for new chapters. Any
time a request finds a matching record in the index, that record is copied
into the open chapter. This ensures that useful block names remain available
in the index, while unreferenced block names are forgotten over time.
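
The chapter lifecycle described above (an open chapter, a bounded set of
read-only closed chapters, and copy-on-hit to keep useful names alive) can
be modeled in miniature. Everything here is a toy illustration: real
chapters hold millions of records, and the class and parameter names are
invented for the sketch.

```python
from collections import deque

class SketchIndex:
    """Miniature model of chapter rotation. Matched names are re-recorded
    in the open chapter, so names that keep appearing survive chapter
    retirement while unreferenced names age out."""

    def __init__(self, chapter_size, max_closed_chapters):
        self.chapter_size = chapter_size
        self.max_closed = max_closed_chapters
        self.open_chapter = {}        # block name -> storage location
        self.closed = deque()         # oldest closed chapter on the left

    def add(self, name, location):
        if len(self.open_chapter) >= self.chapter_size:
            self.closed.append(self.open_chapter)   # close the full chapter
            self.open_chapter = {}
            if len(self.closed) > self.max_closed:
                self.closed.popleft()               # forget the oldest
        self.open_chapter[name] = location

    def lookup(self, name):
        if name in self.open_chapter:
            return self.open_chapter[name]
        for chapter in self.closed:                 # closed chapters are
            if name in chapter:                     # never modified
                location = chapter[name]
                self.add(name, location)            # copy on hit
                return location
        return None
```

Note that a hit in a closed chapter re-records the name in the open
chapter rather than modifying the closed chapter, matching the rule that
closed chapters are immutable.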

In order to find records in older chapters, the index also maintains a
higher level structure called the volume index, which contains entries
mapping each block name to the chapter containing its newest record. This
mapping is updated as records for the block name are copied or updated,
ensuring that only the newest record for a given block name can be found.
An older record for a block name will no longer be found even though it has
not been deleted from its chapter. Like the chapter index, the volume index
uses only a subset of the block name as its key and cannot definitively
say that a record exists for a name. It can only say which chapter would
contain the record if a record exists. The volume index is stored entirely
in memory and is saved to storage only when the vdo target is shut down.

A request for a particular block name will first look up the name in the
volume index. This search will either indicate that the name is new, or
which chapter to search. If it returns a chapter, the request looks up its
name in the chapter index. This will indicate either that the name is new,
or which record page to search. Finally, if it is not new, the request
will look for its name in the indicated record page.

This process may require up to two page reads per request (one for the
chapter index page and one for the record page). However, recently
accessed pages are cached so that these page reads can be amortized across
many block name requests.
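
The lookup chain can be summarized in a short sketch. Plain dicts stand in
for the real delta-index structures, and the four-byte subkey is purely an
illustrative choice; only the shape of the search is taken from the text.

```python
def subkey(name: bytes) -> bytes:
    """Both indexes key on only a subset of the block name (the first
    four bytes here are an arbitrary choice for the sketch)."""
    return name[:4]

def find_record(name, volume_index, chapters):
    """Lookup chain: volume index -> chapter index -> record page. The
    first two steps can only say where a record *would* be; only the
    record page read is definitive."""
    chapter_number = volume_index.get(subkey(name))
    if chapter_number is None:
        return None                              # the name is new
    chapter = chapters[chapter_number]
    page_number = chapter["index"].get(subkey(name))
    if page_number is None:
        return None                              # the name is new
    # Only the record page holds full names and can confirm a match.
    return chapter["pages"][page_number].get(name)
```

The two dict lookups model the two possible page reads per request; the
caches mentioned above amortize those reads in practice.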

The volume index and the chapter indexes are implemented using a
memory-efficient structure called a delta index. Instead of storing the
entire block name (the key) for each entry, the entries are sorted by name
and only the difference between adjacent keys (the delta) is stored.
Because we expect the hashes to be randomly distributed, the size of the
deltas follows an exponential distribution. Because of this distribution,
the deltas are expressed using a Huffman code to take up even less space.
The entire sorted list of keys is called a delta list. This structure
allows the index to use many fewer bytes per entry than a traditional hash
table, but it is slightly more expensive to look up entries, because a
request must read every entry in a delta list to add up the deltas in order
to find the record it needs. The delta index reduces this lookup cost by
splitting its key space into many sub-lists, each starting at a fixed key
value, so that each individual list is short.
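
The core delta-list idea can be shown in a few lines. This sketch stores
deltas as plain integers; the real structure Huffman-codes them and splits
the key space into many short sub-lists, as described above.

```python
def encode_deltas(sorted_keys):
    """Store a sorted key list as successive differences. For randomly
    distributed keys the deltas are mostly small, which is what makes
    compact (Huffman) coding effective."""
    deltas, previous = [], 0
    for key in sorted_keys:
        deltas.append(key - previous)
        previous = key
    return deltas

def contains(deltas, key):
    """Look up a key by summing deltas from the start of the list."""
    total = 0
    for delta in deltas:
        total += delta
        if total == key:
            return True
        if total > key:      # keys are sorted, so we can stop early
            return False
    return False
```

The linear scan is the lookup cost the text mentions; keeping each
sub-list short bounds how many deltas any one lookup must sum.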

The default index size can hold 64 million records, corresponding to about
256GB of data. This means that the index can identify duplicate data if the
original data was written within the last 256GB of writes. This range is
called the deduplication window. If new writes duplicate data that is older
than that, the index will not be able to find it because the records of the
older data have been removed. This means that if an application writes a
200 GB file to a vdo target and then immediately writes it again, the two
copies will deduplicate perfectly. Doing the same with a 500 GB file will
result in no deduplication, because the beginning of the file will no
longer be in the index by the time the second write begins (assuming there
is no duplication within the file itself).
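
The quoted figures follow from simple arithmetic: the index holds one
record per 4K block written, so 64 million (here taken as 64 x 2^20)
records cover 256GB of prior writes.

```python
RECORDS = 64 * 2**20     # default index capacity: 64 million records
BLOCK_SIZE = 4096        # one record per 4K block written

deduplication_window = RECORDS * BLOCK_SIZE
assert deduplication_window == 256 * 2**30   # 256GB of prior writes

# Doubling the index doubles the window (at double the memory/storage).
assert 2 * RECORDS * BLOCK_SIZE == 512 * 2**30
```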

If an application anticipates a data workload that will see useful
deduplication beyond the 256GB threshold, vdo can be configured to use a
larger index with a correspondingly larger deduplication window. (This
configuration can only be set when the target is created, not altered
later. It is important to consider the expected workload for a vdo target
before configuring it.) There are two ways to do this.

One way is to increase the memory size of the index, which also increases
the amount of backing storage required. Doubling the size of the index will
double the length of the deduplication window at the expense of doubling
the storage size and the memory requirements.

The other option is to enable sparse indexing. Sparse indexing increases
the deduplication window by a factor of 10, at the expense of also
increasing the storage size by a factor of 10. However, with sparse
indexing, the memory requirements do not increase. The trade-off is
slightly more computation per request and a slight decrease in the amount
of deduplication detected. For most workloads with significant amounts of
duplicate data, sparse indexing will detect 97-99% of the deduplication
that a standard index will detect.

The vio and data_vio Structures
-------------------------------

A vio (short for Vdo I/O) is conceptually similar to a bio, with additional
fields and data to track vdo-specific information. A struct vio maintains a
pointer to a bio but also tracks other fields specific to the operation of
vdo. The vio is kept separate from its related bio because there are many
circumstances where vdo completes the bio but must continue to do work
related to deduplication or compression.

Metadata reads and writes, and other writes that originate within vdo, use
a struct vio directly. Application reads and writes use a larger structure
called a data_vio to track information about their progress. A struct
data_vio contains a struct vio and also includes several other fields
related to deduplication and other vdo features. The data_vio is the
primary unit of application work in vdo. Each data_vio proceeds through a
set of steps to handle the application data, after which it is reset and
returned to a pool of data_vios for reuse.

There is a fixed pool of 2048 data_vios. This number was chosen to bound
the amount of work that is required to recover from a crash. In addition,
benchmarks have indicated that increasing the size of the pool does not
significantly improve performance.

The Data Store
--------------

The data store is implemented by three main data structures, all of which
work in concert to reduce or amortize metadata updates across as many data
writes as possible.

*The Slab Depot*

Most of the vdo volume belongs to the slab depot. The depot contains a
collection of slabs. The slabs can be up to 32GB, and are divided into
three sections. Most of a slab consists of a linear sequence of 4K blocks.
These blocks are used either to store data, or to hold portions of the
block map (see below). In addition to the data blocks, each slab has a set
of reference counters, using 1 byte for each data block. Finally, each slab
has a journal.

Reference updates are written to the slab journal. Slab journal blocks are
written out either when they are full, or when the recovery journal
requests they do so in order to allow the main recovery journal (see below)
to free up space. The slab journal is used both to ensure that the main
recovery journal can regularly free up space, and also to amortize the cost
of updating individual reference blocks. The reference counters are kept in
memory and are written out, a block at a time in oldest-dirtied order, only
when there is a need to reclaim slab journal space. The write operations
are performed in the background as needed so they do not add latency to
particular I/O operations.

Each slab is independent of every other. They are assigned to "physical
zones" in round-robin fashion. If there are P physical zones, then slab n
is assigned to zone n mod P.

The slab depot maintains an additional small data structure, the "slab
summary," which is used to reduce the amount of work needed to come back
online after a crash. The slab summary maintains an entry for each slab
indicating whether or not the slab has ever been used, whether all of its
reference count updates have been persisted to storage, and approximately
how full it is. During recovery, each physical zone will attempt to recover
at least one slab, stopping whenever it has recovered a slab which has some
free blocks. Once each zone has some space, or has determined that none is
available, the target can resume normal operation in a degraded mode. Read
and write requests can be serviced, perhaps with degraded performance,
while the remainder of the dirty slabs are recovered.

*The Block Map*

The block map contains the logical to physical mapping. It can be thought
of as an array with one entry per logical address. Each entry is 5 bytes,
36 bits of which contain the physical block number which holds the data for
the given logical address. The other 4 bits are used to indicate the nature
of the mapping. Of the 16 possible states, one represents a logical address
which is unmapped (i.e. it has never been written, or has been discarded),
one represents an uncompressed block, and the other 14 states are used to
indicate that the mapped data is compressed, and which of the compression
slots in the compressed block contains the data for this logical address.
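
The 5-byte entry layout (36 bits of physical block number plus a 4-bit
mapping state) can be sketched as a bit-packing exercise. The bit and byte
order chosen here are assumptions for illustration only; the actual
on-disk layout is defined by the vdo code itself.

```python
PBN_BITS = 36
PBN_MASK = (1 << PBN_BITS) - 1

def pack_entry(pbn: int, state: int) -> bytes:
    """Pack one block map entry: 36-bit physical block number plus a
    4-bit mapping state, 40 bits (5 bytes) in all."""
    assert 0 <= pbn <= PBN_MASK and 0 <= state < 16
    return ((state << PBN_BITS) | pbn).to_bytes(5, "little")

def unpack_entry(entry: bytes):
    """Recover (pbn, state) from a packed 5-byte entry."""
    value = int.from_bytes(entry, "little")
    return value & PBN_MASK, value >> PBN_BITS
```

The 36-bit physical block number is what limits a vdo to 2^36 4K blocks,
i.e. the 256TB of backing storage quoted in the introduction.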

In practice, the array of mapping entries is divided into "block map
pages," each of which fits in a single 4K block. Each block map page
consists of a header and 812 mapping entries. Each mapping page is actually
a leaf of a radix tree which consists of block map pages at each level.
There are 60 radix trees which are assigned to "logical zones" in round
robin fashion. (If there are L logical zones, tree n will belong to zone n
mod L.) At each level, the trees are interleaved, so logical addresses
0-811 belong to tree 0, logical addresses 812-1623 belong to tree 1, and so
on. The interleaving is maintained all the way up to the 60 root nodes.

Choosing 60 trees results in an evenly distributed number of trees per zone
for a large number of possible logical zone counts. The storage for the 60
tree roots is allocated at format time. All other block map pages are
allocated out of the slabs as needed. This flexible allocation avoids the
need to pre-allocate space for the entire set of logical mappings and also
makes growing the logical size of a vdo relatively easy.
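
The interleaving and zone assignment above reduce to two lines of
arithmetic, sketched here with invented helper names:

```python
ENTRIES_PER_PAGE = 812   # mapping entries in one 4K block map page
TREE_COUNT = 60          # fixed number of radix trees

def tree_for_lba(lba: int) -> int:
    """Leaf pages are interleaved across the 60 trees: logical addresses
    0-811 map to tree 0, 812-1623 to tree 1, and so on, wrapping around
    after tree 59."""
    return (lba // ENTRIES_PER_PAGE) % TREE_COUNT

def zone_for_tree(tree: int, logical_zones: int) -> int:
    """Trees are assigned to logical zones round-robin."""
    return tree % logical_zones
```

Because 60 has many divisors (1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30, 60),
most practical logical zone counts receive the same number of trees.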

In operation, the block map maintains two caches. It is prohibitive to keep
the entire leaf level of the trees in memory, so each logical zone
maintains its own cache of leaf pages. The size of this cache is
configurable at target start time. The second cache is allocated at start
time, and is large enough to hold all the non-leaf pages of the entire
block map. This cache is populated as pages are needed.

*The Recovery Journal*

The recovery journal is used to amortize updates across the block map and
slab depot. Each write request causes an entry to be made in the journal.
Entries are either "data remappings" or "block map remappings." For a data
remapping, the journal records the logical address affected and its old and
new physical mappings. For a block map remapping, the journal records the
block map page number and the physical block allocated for it. Block map
pages are never reclaimed or repurposed, so the old mapping is always 0.

Each journal entry is an intent record summarizing the metadata updates
that are required for a data_vio. The recovery journal issues a flush
before each journal block write to ensure that the physical data for the
new block mappings in that block are stable on storage, and journal block
writes are all issued with the FUA bit set to ensure the recovery journal
entries themselves are stable. The journal entry and the data write it
represents must be stable on disk before the other metadata structures may
be updated to reflect the operation. These entries allow the vdo device to
reconstruct the logical to physical mappings after an unexpected
interruption such as a loss of power.
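
The replay of data-remapping intent records can be loosely modeled as
follows. This is a heavy simplification: real recovery also processes
block map remappings, rebuilds slab journals, and handles partially
written journal blocks, none of which this sketch attempts.

```python
def replay(entries, block_map, ref_counts):
    """Reapply each data-remapping intent: restore the logical->physical
    mapping and the reference count changes it implies (decrement the old
    physical block, increment the new one)."""
    for logical, old_physical, new_physical in entries:
        block_map[logical] = new_physical
        if old_physical:                       # 0 means "was unmapped"
            ref_counts[old_physical] -= 1
        ref_counts[new_physical] = ref_counts.get(new_physical, 0) + 1
```

Because entries record both old and new mappings, replay is idempotent in
intent: the journal describes what the metadata should become, not merely
what changed.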

*Write Path*

All write I/O to vdo is asynchronous. Each bio will be acknowledged as soon
as vdo has done enough work to guarantee that it can complete the write
eventually. Generally, the data for acknowledged but unflushed write I/O
can be treated as though it is cached in memory. If an application
requires data to be stable on storage, it must issue a flush or write the
data with the FUA bit set like any other asynchronous I/O. Shutting down
the vdo target will also flush any remaining I/O.

Application write bios follow the steps outlined below.

1. A data_vio is obtained from the data_vio pool and associated with the
   application bio. If there are no data_vios available, the incoming bio
   will block until a data_vio is available. This provides back pressure
   to the application. The data_vio pool is protected by a spin lock.

   The newly acquired data_vio is reset and the bio's data is copied into
   the data_vio if it is a write and the data is not all zeroes. The data
   must be copied because the application bio can be acknowledged before
   the data_vio processing is complete, which means later processing steps
   will no longer have access to the application bio. The application bio
   may also be smaller than 4K, in which case the data_vio will have
   already read the underlying block and the data is instead copied over
   the relevant portion of the larger block.

2. The data_vio places a claim (the "logical lock") on the logical address
   of the bio. It is vital to prevent simultaneous modifications of the
   same logical address, because deduplication involves sharing blocks.
   This claim is implemented as an entry in a hashtable where the key is
   the logical address and the value is a pointer to the data_vio
   currently handling that address.

   If a data_vio looks in the hashtable and finds that another data_vio is
   already operating on that logical address, it waits until the previous
   operation finishes. It also sends a message to inform the current
   lock holder that it is waiting. Most notably, a new data_vio waiting
   for a logical lock will flush the previous lock holder out of the
   compression packer (step 8d) rather than allowing it to continue
   waiting to be packed.

   This stage requires the data_vio to get an implicit lock on the
   appropriate logical zone to prevent concurrent modifications of the
   hashtable. This implicit locking is handled by the zone divisions
   described above.

3. The data_vio traverses the block map tree to ensure that all the
   necessary internal tree nodes have been allocated, by trying to find
   the leaf page for its logical address. If any interior tree page is
   missing, it is allocated at this time out of the same physical storage
   pool used to store application data.

   a. If any page-node in the tree has not yet been allocated, it must be
      allocated before the write can continue. This step requires the
      data_vio to lock the page-node that needs to be allocated. This
      lock, like the logical block lock in step 2, is a hashtable entry
      that causes other data_vios to wait for the allocation process to
      complete.

      The implicit logical zone lock is released while the allocation is
      happening, in order to allow other operations in the same logical
      zone to proceed. The details of allocation are the same as in
      step 4. Once a new node has been allocated, that node is added to
      the tree using a similar process to adding a new data block mapping.
      The data_vio journals the intent to add the new node to the block
      map tree (step 10), updates the reference count of the new block
      (step 11), and reacquires the implicit logical zone lock to add the
      new mapping to the parent tree node (step 12). Once the tree is
      updated, the data_vio proceeds down the tree. Any other data_vios
      waiting on this allocation also proceed.

   b. In the steady-state case, the block map tree nodes will already be
      allocated, so the data_vio just traverses the tree until it finds
      the required leaf node. The location of the mapping (the "block map
      slot") is recorded in the data_vio so that later steps do not need
      to traverse the tree again. The data_vio then releases the implicit
      logical zone lock.

4. If the block is a zero block, skip to step 9. Otherwise, an attempt is
   made to allocate a free data block. This allocation ensures that the
   data_vio can write its data somewhere even if deduplication and
   compression are not possible. This stage gets an implicit lock on a
   physical zone to search for free space within that zone.

   The data_vio will search each slab in a zone until it finds a free
   block or decides there are none. If the first zone has no free space,
   it will proceed to search the next physical zone by taking the implicit
   lock for that zone and releasing the previous one until it finds a
   free block or runs out of zones to search. The data_vio will acquire a
   struct pbn_lock (the "physical block lock") on the free block. The
   struct pbn_lock also has several fields to record the various kinds of
   claims that data_vios can have on physical blocks. The pbn_lock is
   added to a hashtable like the logical block locks in step 2. This
   hashtable is also covered by the implicit physical zone lock. The
   reference count of the free block is updated to prevent any other
   data_vio from considering it free. The reference counters are a
   sub-component of the slab and are thus also covered by the implicit
   physical zone lock.

5. If an allocation was obtained, the data_vio has all the resources it
   needs to complete the write. The application bio can safely be
   acknowledged at this point. The acknowledgment happens on a separate
   thread to prevent the application callback from blocking other data_vio
   operations.

   If an allocation could not be obtained, the data_vio continues to
   attempt to deduplicate or compress the data, but the bio is not
   acknowledged because the vdo device may be out of space.

6. At this point vdo must determine where to store the application data.
   The data_vio's data is hashed and the hash (the "record name") is
   recorded in the data_vio.

7. The data_vio reserves or joins a struct hash_lock, which manages all of
   the data_vios currently writing the same data. Active hash locks are
   tracked in a hashtable similar to the way logical block locks are
   tracked in step 2. This hashtable is covered by the implicit lock on
   the hash zone.

   If there is no existing hash lock for this data_vio's record_name, the
   data_vio obtains a hash lock from the pool, adds it to the hashtable,
   and sets itself as the new hash lock's "agent." The hash_lock pool is
   also covered by the implicit hash zone lock. The hash lock agent will
   do all the work to decide where the application data will be
   written. If a hash lock for the data_vio's record_name already exists,
   and the data_vio's data is the same as the agent's data, the new
   data_vio will wait for the agent to complete its work and then share
   its result.

   In the rare case that a hash lock exists for the data_vio's hash but
   the data does not match the hash lock's agent, the data_vio skips to
   step 8h and attempts to write its data directly. This can happen, for
   example, if two different data blocks produce the same hash.

8. The hash lock agent attempts to deduplicate or compress its data with
   the following steps.

   a. The agent initializes and sends its embedded deduplication request
      (struct uds_request) to the deduplication index. This does not
      require the data_vio to get any locks because the index components
      manage their own locking. The data_vio waits until it either gets a
      response from the index or times out.

   b. If the deduplication index returns advice, the data_vio attempts to
      obtain a physical block lock on the indicated physical address, in
      order to read the data and verify that it is the same as the
      data_vio's data, and that it can accept more references. If the
      physical address is already locked by another data_vio, the data at
      that address may soon be overwritten so it is not safe to use the
      address for deduplication.

   c. If the data matches and the physical block can add references, the
      agent and any other data_vios waiting on it will record this
      physical block as their new physical address and proceed to step 9
      to record their new mapping. If there are more data_vios in the hash
      lock than there are references available, one of the remaining
      data_vios becomes the new agent and continues to step 8d as if no
      valid advice was returned.

   d. If no usable duplicate block was found, the agent first checks that
      it has an allocated physical block (from step 4) that it can write
      to. If the agent does not have an allocation, some other data_vio in
      the hash lock that does have an allocation takes over as agent. If
      none of the data_vios have an allocated physical block, these writes
      are out of space, so they proceed to step 13 for cleanup.

   e. The agent attempts to compress its data. If the data does not
      compress, the data_vio will continue to step 8h to write its data
      directly.

      If the compressed size is small enough, the agent will release the
      implicit hash zone lock and go to the packer (struct packer) where
      it will be placed in a bin (struct packer_bin) along with other
      data_vios. All compression operations require the implicit lock on
      the packer zone.

      The packer can combine up to 14 compressed blocks in a single 4k
      data block. Compression is only helpful if vdo can pack at least 2
      data_vios into a single data block. This means that a data_vio may
      wait in the packer for an arbitrarily long time for other data_vios
      to fill out the compressed block. There is a mechanism for vdo to
      evict waiting data_vios when continuing to wait would cause
      problems. Circumstances causing an eviction include an application
      flush, device shutdown, or a subsequent data_vio trying to overwrite
      the same logical block address. A data_vio may also be evicted from
      the packer if it cannot be paired with any other compressed block
      before more compressible blocks need to use its bin. An evicted
      data_vio will proceed to step 8h to write its data directly.

   f. If the agent fills a packer bin, either because all 14 of its slots
      are used or because it has no remaining space, it is written out
      using the allocated physical block from one of its data_vios. Step
      8d has already ensured that an allocation is available.

   g. Each data_vio sets the compressed block as its new physical address.
      The data_vio obtains an implicit lock on the physical zone and
      acquires the struct pbn_lock for the compressed block, which is
      modified to be a shared lock. Then it releases the implicit physical
      zone lock and proceeds to step 8i.

   h. Any data_vio evicted from the packer will have an allocation from
      step 4. It will write its data to that allocated physical block.

   i. After the data is written, if the data_vio is the agent of a hash
      lock, it will reacquire the implicit hash zone lock and share its
  452. physical address with as many other data_vios in the hash lock as
  453. possible. Each data_vio will then proceed to step 9 to record its
  454. new mapping.
  455. j. If the agent actually wrote new data (whether compressed or not),
  456. the deduplication index is updated to reflect the location of the
  457. new data. The agent then releases the implicit hash zone lock.
9. The data_vio determines the previous mapping of the logical address.
   There is a cache for block map leaf pages (the "block map cache"),
   because there are usually too many block map leaf nodes to store
   entirely in memory. If the desired leaf page is not in the cache, the
   data_vio will reserve a slot in the cache and load the desired page
   into it, possibly evicting an older cached page. The data_vio then
   finds the current physical address for this logical address (the "old
   physical mapping"), if any, and records it. This step requires a lock
   on the block map cache structures, covered by the implicit logical
   zone lock.
10. The data_vio makes an entry in the recovery journal containing the
    logical block address, the old physical mapping, and the new physical
    mapping. Making this journal entry requires holding the implicit
    recovery journal lock. The data_vio will wait in the journal until
    all recovery blocks up to the one containing its entry have been
    written and flushed to ensure the transaction is stable on storage.
11. Once the recovery journal entry is stable, the data_vio makes two
    slab journal entries: an increment entry for the new mapping, and a
    decrement entry for the old mapping. These two operations each
    require holding a lock on the affected physical slab, covered by its
    implicit physical zone lock. For correctness during recovery, the
    slab journal entries in any given slab journal must be in the same
    order as the corresponding recovery journal entries. Therefore, if
    the two entries are in different zones, they are made concurrently,
    and if they are in the same zone, the increment is always made before
    the decrement in order to avoid underflow. After each slab journal
    entry is made in memory, the associated reference count is also
    updated in memory.
12. Once both of the reference count updates are done, the data_vio
    acquires the implicit logical zone lock and updates the
    logical-to-physical mapping in the block map to point to the new
    physical block. At this point the write operation is complete.
13. If the data_vio has a hash lock, it acquires the implicit hash zone
    lock and releases its hash lock to the pool.

    The data_vio then acquires the implicit physical zone lock and
    releases the struct pbn_lock it holds for its allocated block. If it
    had an allocation that it did not use, it also sets the reference
    count for that block back to zero to free it for use by subsequent
    data_vios.

    The data_vio then acquires the implicit logical zone lock and
    releases the logical block lock acquired in step 2.

    The application bio is then acknowledged if it has not previously
    been acknowledged, and the data_vio is returned to the pool.
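The bin-filling arithmetic described in steps 8e and 8f can be sketched
roughly as follows. This is a hypothetical illustration, not vdo's actual
code: the structure, function names, and per-block header size are
invented for the example; only the 4K block size and the 14-fragment
limit come from the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch of the packer's bin-fit check (steps 8e-8f).
 * A 4K data block holds at most 14 compressed fragments; the header
 * size here is a made-up placeholder, not vdo's real on-disk layout. */
#define VDO_BLOCK_SIZE   4096
#define MAX_FRAGMENTS    14
#define FAKE_HEADER_SIZE 64	/* placeholder for per-block metadata */

struct bin {
	size_t used;		/* bytes consumed by fragments so far */
	int fragment_count;
};

/* Returns true if a compressed fragment of the given size fits in the
 * bin without exceeding either the block size or the slot limit. */
static bool bin_accepts(const struct bin *bin, size_t fragment_size)
{
	if (bin->fragment_count >= MAX_FRAGMENTS)
		return false;
	return FAKE_HEADER_SIZE + bin->used + fragment_size <= VDO_BLOCK_SIZE;
}

static void bin_add(struct bin *bin, size_t fragment_size)
{
	bin->used += fragment_size;
	bin->fragment_count++;
}
```

A data_vio whose fragment does not fit any open bin would, per step 8e,
either wait for a fresh bin or be evicted and write its data directly.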
*Read Path*

An application read bio follows a much simpler set of steps. It does
steps 1 and 2 in the write path to obtain a data_vio and lock its logical
address. If there is already a write data_vio in progress for that
logical address that is guaranteed to complete, the read data_vio will
copy the data from the write data_vio and return it. Otherwise, it will
look up the logical-to-physical mapping by traversing the block map tree
as in step 3, and then read and possibly decompress the indicated data at
the indicated physical block address. A read data_vio will not allocate
block map tree nodes if they are missing. If the interior block map nodes
do not exist yet, the logical block map address must still be unmapped
and the read data_vio will return all zeroes. A read data_vio handles
cleanup and acknowledgment as in step 13, although it only needs to
release the logical lock and return itself to the pool.
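The read-path decision above (copy from an in-flight write, return zeroes
for an unmapped address, or read the mapped block) can be sketched as
follows. All names, the sentinel value, and the flat block array are
illustrative assumptions for the sketch, not vdo's real structures:

```c
#include <stddef.h>
#include <string.h>

#define VDO_BLOCK_SIZE 4096
#define UNMAPPED ((unsigned long)-1)	/* illustrative "no mapping" sentinel */

/* Fake backing storage for the sketch: a small array of 4K blocks. */
static char physical_blocks[16][VDO_BLOCK_SIZE];

/* Resolve a read. in_flight_data, if non-NULL, stands in for a
 * concurrent write data_vio whose data can be copied directly. */
static void read_logical_block(unsigned long mapped_pbn,
			       const char *in_flight_data,
			       char *out)
{
	if (in_flight_data != NULL)
		memcpy(out, in_flight_data, VDO_BLOCK_SIZE); /* copy from writer */
	else if (mapped_pbn == UNMAPPED)
		memset(out, 0, VDO_BLOCK_SIZE);	/* unmapped reads as zeroes */
	else
		memcpy(out, physical_blocks[mapped_pbn], VDO_BLOCK_SIZE);
}
```

The real read path must also decompress a fragment when the mapping
points into a compressed block, which this sketch omits.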
*Small Writes*

All storage within vdo is managed as 4KB blocks, but it can accept writes
as small as 512 bytes. Processing a write that is smaller than 4K
requires a read-modify-write operation that reads the relevant 4K block,
copies the new data over the appropriate sectors of the block, and then
launches a write operation for the modified data block. The read and
write stages of this operation are nearly identical to the normal read
and write operations, and a single data_vio is used throughout this
operation.
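The "modify" stage of this read-modify-write can be sketched as a simple
sector copy. The function name and parameters are hypothetical; only the
512-byte sector and 4K block sizes come from the text above:

```c
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE    512
#define VDO_BLOCK_SIZE 4096

/* Hypothetical sketch of the modify stage of a sub-4K write: the 4K
 * block has already been read into `block`, and `data` supplies
 * `sector_count` sectors starting at sector `first_sector` within it.
 * The modified block is then written out as a normal 4K write. */
static void apply_partial_write(char block[VDO_BLOCK_SIZE],
				unsigned int first_sector,
				unsigned int sector_count,
				const char *data)
{
	memcpy(block + (size_t)first_sector * SECTOR_SIZE,
	       data, (size_t)sector_count * SECTOR_SIZE);
}
```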
*Recovery*

When a vdo is restarted after a crash, it will attempt to recover from
the recovery journal. During the pre-resume phase of the next start, the
recovery journal is read. The increment portion of valid entries is
played into the block map. Next, valid entries are played, in order as
required, into the slab journals. Finally, each physical zone attempts to
replay at least one slab journal to reconstruct the reference counts of
one slab. Once each zone has some free space (or has determined that it
has none), the vdo comes back online, while the remainder of the slab
journals are used to reconstruct the rest of the reference counts in the
background.
*Read-only Rebuild*

If a vdo encounters an unrecoverable error, it will enter read-only mode.
This mode indicates that some previously acknowledged data may have been
lost. The vdo may be instructed to rebuild as best it can in order to
return to a writable state. However, this is never done automatically due
to the possibility that data has been lost. During a read-only rebuild,
the block map is recovered from the recovery journal as before. However,
the reference counts are not rebuilt from the slab journals. Instead, the
reference counts are zeroed, the entire block map is traversed, and the
reference counts are updated from the block mappings. While this may lose
some data, it ensures that the block map and reference counts are
consistent with each other. This allows vdo to resume normal operation
and accept further writes.
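The rebuild pass described above can be sketched as follows: zero every
reference count, then walk each logical-to-physical mapping in the block
map and count the referenced physical blocks. The arrays, sizes, and
sentinel here are illustrative stand-ins for vdo's real slab and block
map structures:

```c
#include <string.h>

#define NUM_LOGICAL  8
#define NUM_PHYSICAL 8
#define UNMAPPED     0xFFFFFFFFu	/* illustrative "no mapping" sentinel */

/* Hypothetical sketch of read-only rebuild: reconstruct reference
 * counts purely from the block mappings. Because of deduplication, a
 * shared physical block can end up with a count greater than one. */
static void rebuild_reference_counts(const unsigned int mapping[NUM_LOGICAL],
				     unsigned int refcount[NUM_PHYSICAL])
{
	memset(refcount, 0, NUM_PHYSICAL * sizeof(*refcount));
	for (int lbn = 0; lbn < NUM_LOGICAL; lbn++) {
		if (mapping[lbn] != UNMAPPED)
			refcount[mapping[lbn]]++;
	}
}
```

After this pass, any physical block with a zero count is free, and the
counts are consistent with the block map by construction.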