diff --git a/doc/ftl.md b/doc/ftl.md
index 2ce99d790..b8bd7faeb 100644
--- a/doc/ftl.md
+++ b/doc/ftl.md
@@ -1,8 +1,9 @@
 # Flash Translation Layer {#ftl}
 
-The Flash Translation Layer library provides block device access on top of non-block SSDs
-implementing Open Channel interface. It handles the logical to physical address mapping, responds to
-the asynchronous media management events, and manages the defragmentation process.
+The Flash Translation Layer library provides block device access on top of devices
+implementing the bdev_zone interface.
+It handles the logical to physical address mapping, responds to asynchronous
+media management events, and manages the defragmentation process.
 
 # Terminology {#ftl_terminology}
 
@@ -10,32 +11,32 @@ the asynchronous media management events, and manages the defragmentation proces
 
 * Shorthand: L2P
 
-Contains the mapping of the logical addresses (LBA) to their on-disk physical location (PPA). The
-LBAs are contiguous and in range from 0 to the number of surfaced blocks (the number of spare blocks
+Contains the mapping of the logical addresses (LBA) to their on-disk physical location. The LBAs
+are contiguous and in range from 0 to the number of surfaced blocks (the number of spare blocks
 are calculated during device formation and are subtracted from the available address space). The
-spare blocks account for chunks going offline throughout the lifespan of the device as well as
+spare blocks account for zones going offline throughout the lifespan of the device as well as
 provide necessary buffer for data [defragmentation](#ftl_reloc).
 
 ## Band {#ftl_band}
 
-Band describes a collection of chunks, each belonging to a different parallel unit. All writes to
-the band follow the same pattern - a batch of logical blocks is written to one chunk, another batch
+A band describes a collection of zones, each belonging to a different parallel unit. All writes to
+a band follow the same pattern - a batch of logical blocks is written to one zone, another batch
 to the next one and so on. This ensures the parallelism of the write operations, as they can be
-executed independently on a different chunks. Each band keeps track of the LBAs it consists of, as
+executed independently on different zones. Each band keeps track of the LBAs it consists of, as
 well as their validity, as some of the data will be invalidated by subsequent writes to the same
 logical address. The L2P mapping can be restored from the SSD by reading this information in order
 from the oldest band to the youngest.
 
              +--------------+        +--------------+                        +--------------+
-    band 1   |   chunk 1    +--------+     chk 1    +---- --- --- --- --- ---+     chk 1    |
+    band 1   |    zone 1    +--------+    zone 1    +---- --- --- --- --- ---+    zone 1    |
              +--------------+        +--------------+                        +--------------+
-    band 2   |   chunk 2    +--------+     chk 2    +---- --- --- --- --- ---+     chk 2    |
+    band 2   |    zone 2    +--------+    zone 2    +---- --- --- --- --- ---+    zone 2    |
              +--------------+        +--------------+                        +--------------+
-    band 3   |   chunk 3    +--------+     chk 3    +---- --- --- --- --- ---+     chk 3    |
+    band 3   |    zone 3    +--------+    zone 3    +---- --- --- --- --- ---+    zone 3    |
              +--------------+        +--------------+                        +--------------+
              |     ...      |        |     ...      |                        |     ...      |
              +--------------+        +--------------+                        +--------------+
-    band m   |   chunk m    +--------+     chk m    +---- --- --- --- --- ---+     chk m    |
+    band m   |    zone m    +--------+    zone m    +---- --- --- --- --- ---+    zone m    |
              +--------------+        +--------------+                        +--------------+
              |     ...      |        |     ...      |                        |     ...      |
              +--------------+        +--------------+                        +--------------+
@@ -51,15 +52,15 @@ metadata is split in two parts:
 
      head metadata                band's data                tail metadata
-    +-------------------+-------------------------------+----------------------+
-    |chk 1|...|chk n|...|...|chk 1|...|      |  ...     |chk m-1 |chk m|
-    |lbk 1|   |lbk 1|   |   |lbk x|   |      |          |lblk y  |lblk y|
-    +-------------------+-------------+-----------------+----------------------+
+    +-------------------+-------------------------------+------------------------+
+    |zone 1 |...|zone n |...|...|zone 1 |...|     | ...  |zone m-1 |zone m |
+    |block 1|   |block 1|   |   |block x|   |     |      |block y  |block y|
+    +-------------------+-------------+-----------------+------------------------+
 
-Bands are being written sequentially (in a way that was described earlier). Before a band can be
-written to, all of its chunks need to be erased. During that time, the band is considered to be in a
-`PREP` state. After that is done, the band transitions to the `OPENING` state, in which head metadata
+Bands are written sequentially (in a way that was described earlier). Before a band can be written
+to, all of its zones need to be erased. During that time, the band is considered to be in a `PREP`
+state. After that is done, the band transitions to the `OPENING` state, in which head metadata
 is being written. Then the band moves to the `OPEN` state and actual user data can be written to the
 band. Once the whole available space is filled, tail metadata is written and the band transitions to
 `CLOSING` state. When that finishes the band becomes `CLOSED`.
 
@@ -103,7 +104,7 @@ servicing read requests from the buffer.
 
 Since a write to the same LBA invalidates its previous physical location, some of the blocks on a
 band might contain old data that basically wastes space. As there is no way to overwrite an already
-written block, this data will stay there until the whole chunk is reset. This might create a
+written block, this data will stay there until the whole zone is reset. This might create a
 situation in which all of the bands contain some valid data and no band can be erased, so no writes
 can be executed anymore. Therefore a mechanism is needed to move valid data and invalidate whole
 bands, so that they can be reused.
 
@@ -123,13 +124,13 @@ long time ago) or due to read disturb (media characteristic, that causes corrupt
 blocks during a read operation).
 
 Module responsible for data relocation is called `reloc`. When a band is chosen for defragmentation
-or an ANM (asynchronous NAND management) event is received, the appropriate blocks are marked as
+or a media management event is received, the appropriate blocks are marked as
 required to be moved. The `reloc` module takes a band that has some of such blocks marked, checks
 their validity and, if they're still valid, copies them.
 
 Choosing a band for defragmentation depends on several factors: its valid ratio (1) (proportion of
 valid blocks to all user blocks), its age (2) (when was it written) and its write count / wear level
-index of its chunks (3) (how many times the band was written to). The lower the ratio (1), the
+index of its zones (3) (how many times the band was written to). The lower the ratio (1), the
 higher its age (2) and the lower its write count (3), the higher the chance the band will be chosen
 for defrag.
 
@@ -137,9 +138,24 @@ for defrag.
 
 ## Prerequisites {#ftl_prereq}
 
-In order to use the FTL module, an Open Channel SSD is required. The easiest way to obtain one is to
-emulate it using QEMU. The QEMU with the patches providing Open Channel support can be found on the
-SPDK's QEMU fork on [spdk-3.0.0](https://github.com/spdk/qemu/tree/spdk-3.0.0) branch.
+In order to use the FTL module, a device exposing a zoned interface is required, e.g. a `zone_block`
+bdev or an OCSSD `nvme` bdev.
+
+## FTL bdev creation {#ftl_create}
+
+Similarly to other bdevs, FTL bdevs can be created either from JSON config files or via RPC;
+an illustrative config entry is sketched after the argument list below. Both interfaces require
+the same arguments, which are described by the `--help` option of the `bdev_ftl_create` RPC call:
+ - bdev's name
+ - base bdev's name (the base bdev must implement the bdev_zone API)
+ - UUID of the FTL device (if the FTL is to be restored from the SSD)
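+
+The sketch below shows what such a JSON config entry could look like. The parameter names mirror
+the RPC arguments listed above and the bdev names (`ftl0`, `zone1`) are placeholders; the exact
+schema should be verified with `scripts/rpc.py bdev_ftl_create --help` for the SPDK version in use.
+
+```
+# Illustrative sketch only - "ftl0" and "zone1" are placeholder bdev names.
+$ cat ftl.json
+{
+  "subsystems": [
+    {
+      "subsystem": "bdev",
+      "config": [
+        {
+          "method": "bdev_ftl_create",
+          "params": {
+            "name": "ftl0",
+            "base_bdev": "zone1"
+          }
+        }
+      ]
+    }
+  ]
+}
+```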
+
+## FTL usage with OCSSD nvme bdev {#ftl_ocssd}
+
+This option requires an Open Channel SSD, which can be emulated using QEMU.
+
+QEMU with the patches providing Open Channel support can be found on SPDK's QEMU fork
+on the [spdk-3.0.0](https://github.com/spdk/qemu/tree/spdk-3.0.0) branch.
 
 ## Configuring QEMU {#ftl_qemu_config}
 
@@ -223,39 +239,48 @@ Logical blks per chunk: 24576
 ```
 
-Similarly to other bdevs, the FTL bdevs can be created either based on config files or via RPC. Both
-interfaces require the same arguments which are described by the `--help` option of the
-`bdev_ftl_create` RPC call, which are:
- - bdev's name
- - transport type of the device (e.g. PCIe)
- - transport address of the device (e.g. `00:0a.0`)
- - parallel unit range
- - UUID of the FTL device (if the FTL is to be restored from the SSD)
-
-Example config:
+In order to create FTL on top of an Open Channel SSD, the following steps are required:
+1) Attach OCSSD NVMe controller
+2) Create OCSSD bdev on the controller attached in step 1 (the user can specify a parallel unit
+range and create multiple OCSSD bdevs on a single OCSSD NVMe controller)
+3) Create FTL bdev on top of the bdev created in step 2
 
+Example:
 ```
-[Ftl]
- TransportID "trtype:PCIe traddr:00:0a.0" nvme0 "0-3" 00000000-0000-0000-0000-000000000000
- TransportID "trtype:PCIe traddr:00:0a.0" nvme1 "4-5" e9825835-b03c-49d7-bc3e-5827cbde8a88
-```
+$ scripts/rpc.py bdev_nvme_attach_controller -b nvme0 -a 00:0a.0 -t pcie
 
-The above will result in creation of two devices:
- - `nvme0` on `00:0a.0` using parallel units 0-3, created from scratch
- - `nvme1` on the same device using parallel units 4-5, restored from the SSD using the UUID
-   provided
+$ scripts/rpc.py bdev_ocssd_create -c nvme0 -b nvme0n1
+ nvme0n1
 
-The same can be achieved with the following two RPC calls:
-
-```
-$ scripts/rpc.py bdev_ftl_create -b nvme0 -l 0-3 -a 00:0a.0
+$ scripts/rpc.py bdev_ftl_create -b ftl0 -d nvme0n1
 {
-  "name": "nvme0",
-  "uuid": "b4624a89-3174-476a-b9e5-5fd27d73e870"
-}
-$ scripts/rpc.py bdev_ftl_create -b nvme1 -l 0-3 -a 00:0a.0 -u e9825835-b03c-49d7-bc3e-5827cbde8a88
-{
-  "name": "nvme1",
-  "uuid": "e9825835-b03c-49d7-bc3e-5827cbde8a88"
+  "name": "ftl0",
+  "uuid": "3b469565-1fa5-4bfb-8341-747ec9fca9b9"
+}
+```
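+
+If an FTL device already exists on the underlying media, it can be restored instead of created from
+scratch by passing its UUID to `bdev_ftl_create` (the `-u` flag, matching the UUID argument listed
+in the section above). A sketch, reusing the bdev names from the example above:
+
+```
+# Illustrative only - substitute the UUID reported when the FTL bdev was first created.
+$ scripts/rpc.py bdev_ftl_create -b ftl0 -d nvme0n1 -u 3b469565-1fa5-4bfb-8341-747ec9fca9b9
+```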
+
+## FTL usage with zone block bdev {#ftl_zone_block}
+
+Zone block bdev is a bdev adapter between a regular `bdev` and `bdev_zone`. It emulates a zoned
+interface on top of a regular block device.
+
+In order to create FTL on top of a regular bdev:
+1) Create a regular bdev, e.g. `bdev_nvme`, `bdev_null` or `bdev_malloc`
+2) Create zone block bdev on top of the regular bdev created in step 1 (the user can specify the
+zone capacity and the optimal number of open zones)
+3) Create FTL bdev on top of the bdev created in step 2
+
+Example:
+```
+$ scripts/rpc.py bdev_nvme_attach_controller -b nvme0 -a 00:05.0 -t pcie
+ nvme0n1
+
+$ scripts/rpc.py bdev_zone_block_create -b zone1 -n nvme0n1 -z 4096 -o 32
+ zone1
+
+$ scripts/rpc.py bdev_ftl_create -b ftl0 -d zone1
+{
+  "name": "ftl0",
+  "uuid": "3b469565-1fa5-4bfb-8341-747ec9f3a9b9"
 }
 ```
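+
+Once created, the FTL bdev can be inspected like any other bdev. A sketch, assuming the `ftl0`
+name used in the examples above:
+
+```
+# Verify that the FTL bdev is registered with the bdev layer.
+$ scripts/rpc.py bdev_get_bdevs -b ftl0
+```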