dm-zoned
========

The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.

For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):

http://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10 TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as few as 5 zones will be used
internally for storing metadata and performing reclaim operations.

dm-zoned target devices are formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm
=========

dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata.

The zones of the device are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as usable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
zones of the device may also be used for buffering user random writes.
Data in these zones may be directly mapped to the conventional zone, but
later moved to a sequential zone so that the conventional zone can be
reused for buffering incoming random writes.

dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This reduces the amount of metadata needed to
manage valid blocks (blocks written).

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the
super block, which describes the amount and position of the metadata
blocks on disk.

2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the zone size of the zoned block
device. The mapping table is indexed by chunk number and each mapping
entry indicates the zone number of the device storing the chunk of data.
Each mapping entry may also indicate the zone number of a conventional
zone used to buffer random modifications to the data zone.

3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is valid in only one place: either in the data zone
mapping the chunk or in the buffer zone of the chunk.

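As a rough illustration of how small this metadata is, the following
sketch estimates the size of one metadata set laid out as described
above. It is not taken from the dm-zoned sources: the 8-byte mapping
entry (two 32-bit zone numbers) and the one-bit-per-block bitmap are
assumptions, and the function and constant names are hypothetical.

```python
BLOCK_SIZE = 4096      # logical block size exposed by dm-zoned (from the text)
MAP_ENTRY_SIZE = 8     # assumption: data zone number + buffer zone number, 32 bits each


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def metadata_layout(capacity_bytes, zone_bytes):
    """Estimate the size of one metadata set, in 4096-byte blocks and in zones."""
    nr_zones = capacity_bytes // zone_bytes
    blocks_per_zone = zone_bytes // BLOCK_SIZE
    # Mapping table: one entry per chunk, and at most one chunk per zone.
    map_blocks = ceil_div(nr_zones * MAP_ENTRY_SIZE, BLOCK_SIZE)
    # Validity bitmaps: assumed to use one bit per block, stored per zone.
    bitmap_blocks = nr_zones * ceil_div(blocks_per_zone, 8 * BLOCK_SIZE)
    total_blocks = 1 + map_blocks + bitmap_blocks  # 1 block for the super block
    return {
        "nr_zones": nr_zones,
        "map_blocks": map_blocks,
        "bitmap_blocks": bitmap_blocks,
        "total_blocks": total_blocks,
        "zones": ceil_div(total_blocks, blocks_per_zone),
    }


# Example: a 10 TB (10 * 10**12 bytes) disk with 256 MiB zones.
layout = metadata_layout(10 * 10**12, 256 * 2**20)
```

Under these assumptions, the example disk needs 2 zones per metadata
set, almost all of it for the validity bitmaps.
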
For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the mapping zone is a
sequential zone, the write operation is processed directly only if the
write offset within the logical chunk is equal to the write pointer
offset within the sequential data zone (i.e. the write operation is
aligned on the zone write pointer). Otherwise, write operations are
processed indirectly using a buffer zone. In that case, an unused
conventional zone is allocated and assigned to the chunk being
accessed. Writing a block to the buffer zone of a chunk will
automatically invalidate the same block in the sequential zone mapping
the chunk. If all blocks of the sequential zone become invalid, the zone
is freed and the chunk buffer zone becomes the primary zone mapping the
chunk, resulting in native random write performance similar to a regular
block device.

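The write handling described above can be sketched as follows. This is
a simplified single-block model, not the kernel implementation: the
Zone and Chunk classes, the write_block() function and the
alloc_conventional_zone callback are all hypothetical names, and
validity is tracked with Python sets instead of on-disk bitmaps.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Set


@dataclass
class Zone:
    conventional: bool
    write_pointer: int = 0                         # meaningful for sequential zones only
    valid: Set[int] = field(default_factory=set)   # offsets of valid blocks


@dataclass
class Chunk:
    data_zone: Zone
    buffer_zone: Optional[Zone] = None


def write_block(chunk: Chunk, offset: int,
                alloc_conventional_zone: Callable[[], Zone]) -> Zone:
    """Write one block at `offset` in the chunk; return the zone that received it."""
    dzone = chunk.data_zone
    if dzone.conventional or offset == dzone.write_pointer:
        # Direct write: conventional zone, or aligned on the write pointer.
        target = dzone
        if not dzone.conventional:
            dzone.write_pointer += 1
        if chunk.buffer_zone is not None:
            # Keep each block valid in one zone only.
            chunk.buffer_zone.valid.discard(offset)
    else:
        # Unaligned write to a sequential zone: go through a buffer zone.
        if chunk.buffer_zone is None:
            chunk.buffer_zone = alloc_conventional_zone()
        target = chunk.buffer_zone
        dzone.valid.discard(offset)
        if not dzone.valid:
            # Sequential zone fully invalid: the buffer zone becomes primary.
            chunk.data_zone, chunk.buffer_zone = chunk.buffer_zone, None
    target.valid.add(offset)
    return target
```

For example, with a chunk mapped to an empty sequential zone, a write at
offset 0 is processed directly, while a following write at offset 5
allocates a buffer zone because the write pointer is then at offset 1.
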
Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or if the chunk is buffered, from
the buffer zone assigned. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation terminated.

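A minimal sketch of this read resolution is shown below. It assumes a
toy in-memory model in which each zone is a dictionary holding only its
valid blocks; read_block() and the shape of the mapping argument are
hypothetical and chosen for illustration only.

```python
BLOCK_SIZE = 4096  # logical block size exposed by dm-zoned (from the text)


def read_block(mapping, chunk_no, offset):
    """Return the content of one block of a chunk.

    mapping: chunk number -> (data_zone, buffer_zone or None), where each
    zone is a dict {block offset: bytes} containing only valid blocks.
    """
    entry = mapping.get(chunk_no)
    if entry is None:
        # Unmapped chunk: the read buffer is zeroed.
        return bytes(BLOCK_SIZE)
    data_zone, buffer_zone = entry
    # A block is valid in at most one of the two zones, so the lookup
    # order does not change the result.
    if buffer_zone is not None and offset in buffer_zone:
        return buffer_zone[offset]
    if offset in data_zone:
        return data_zone[offset]
    # Mapped chunk but invalid block: zeroed as well.
    return bytes(BLOCK_SIZE)
```
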
After some time, the limited number of conventional zones available may
be exhausted (all used to map chunks or buffer sequential zones) and
unaligned writes to unbuffered chunks become impossible. To avoid this
situation, a reclaim process regularly scans used conventional zones and
tries to reclaim the least recently used zones by copying the valid
blocks of the buffer zone to a free sequential zone. Once the copy
completes, the chunk mapping is updated to point to the sequential zone
and the buffer zone is freed for reuse.

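One reclaim pass over the least recently used conventional zone could be
sketched as below. This is an illustration under simplifying
assumptions, not the kernel algorithm: zones are dictionaries of valid
blocks, an OrderedDict stands in for the LRU list, gaps between valid
blocks are ignored, and reclaim_one() and its parameters are
hypothetical names.

```python
from collections import OrderedDict


def reclaim_one(used_conv, free_seq, free_conv, chunk_map):
    """Reclaim the least recently used conventional zone, if possible.

    used_conv: OrderedDict {chunk number: conventional zone}, oldest first.
    free_seq / free_conv: lists of free sequential / conventional zones.
    chunk_map: {chunk number: zone currently mapping the chunk}.
    Returns the reclaimed chunk number, or None if nothing could be done.
    """
    if not used_conv or not free_seq:
        return None
    chunk_no, conv_zone = used_conv.popitem(last=False)  # oldest entry
    seq_zone = free_seq.pop()
    # Copy valid blocks in ascending order, as a sequential zone requires.
    for offset in sorted(conv_zone):
        seq_zone[offset] = conv_zone[offset]
    # Only after the copy completes is the mapping switched over.
    chunk_map[chunk_no] = seq_zone
    conv_zone.clear()
    free_conv.append(conv_zone)  # the conventional zone can be reused
    return chunk_no
```

In this model, a caller would invoke used_conv.move_to_end(chunk_no) on
every access to a chunk so that the front of the dictionary always holds
the least recently used zone.
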
Metadata Protection
===================

To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block in the secondary
set; a generation counter is used to indicate that this set contains the
newest metadata. Once this operation completes, in-place updates of
metadata blocks can be done in the primary metadata set. This ensures
that one of the sets is always consistent (all modifications committed
or none at all). Flush operations are used as a commit point. Upon
reception of a flush request, metadata modification activity is
temporarily blocked (for both incoming BIO processing and the reclaim
process) and all dirty metadata blocks are staged and updated. Normal
operation is then resumed. Flushing metadata thus only temporarily
delays write and discard requests. Read requests can be processed
concurrently while a metadata flush is being executed.

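The staging scheme can be illustrated with the small model below. It is
a sketch of the ordering only, assuming that recovery selects the set
with the higher generation counter and copies it over the other set;
the MetadataSets class, its method names and the crash_after parameter
(used to simulate a power loss at a given step) are hypothetical.

```python
import copy


class MetadataSets:
    """Toy model of the primary and secondary metadata sets."""

    def __init__(self):
        self.primary = {"gen": 0, "blocks": {}}
        self.secondary = {"gen": 0, "blocks": {}}

    def flush(self, dirty, crash_after=None):
        """Commit dirty metadata blocks; optionally stop early to simulate a crash."""
        new_gen = self.primary["gen"] + 1
        # Step 1: stage the modified blocks in the secondary set.
        self.secondary["blocks"].update(dirty)
        if crash_after == "stage":
            return
        # Step 2: validate the staged blocks through the secondary super block.
        self.secondary["gen"] = new_gen
        if crash_after == "validate":
            return
        # Step 3: update the primary set in place, then its super block.
        self.primary["blocks"].update(dirty)
        if crash_after == "primary_blocks":
            return
        self.primary["gen"] = new_gen

    def recover(self):
        """Pick the newest consistent set and resynchronize the other one."""
        if self.secondary["gen"] > self.primary["gen"]:
            self.primary = copy.deepcopy(self.secondary)
        else:
            self.secondary = copy.deepcopy(self.primary)
        return dict(self.primary["blocks"])
```

A crash before step 2 leaves the primary set untouched, so recovery
returns the old metadata. A crash after step 2 leaves a fully staged
secondary set with a higher generation, so recovery returns the new
metadata. In both cases no partially applied update is ever selected.
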
Usage
=====

A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the
metadata sets on the device and initialize the metadata sets.

Ex:

dmzadm --format /dev/sdxx

For a formatted device, the target can be created normally with the
dmsetup utility. The only parameter that dm-zoned requires is the
underlying zoned block device name. Ex:

echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | dmsetup create dmz-`basename ${dev}`

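For scripts that assemble this command programmatically, the table line
and target name used above can be built as in the following sketch. The
helper names are hypothetical; the number of sectors is the device size
in 512-byte sectors, as reported by blockdev --getsize.

```python
import os


def zoned_table_line(dev_path, nr_sectors):
    """Build the device mapper table line: start, length, target type, device."""
    return f"0 {nr_sectors} zoned {dev_path}"


def target_name(dev_path):
    """Derive the mapped device name used in the example (dmz-<device name>)."""
    return "dmz-" + os.path.basename(dev_path)
```

The resulting string is what the example pipes into dmsetup create on
standard input.
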