FTL: Update documentation

Signed-off-by: Kozlowski Mateusz <mateusz.kozlowski@intel.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Change-Id: I78141ecf086d8ace07f9d8194c8eb9f64201a939
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/13393
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>

# Flash Translation Layer {#ftl}

The Flash Translation Layer library provides efficient 4K block device access on top of devices
with >4K write unit size (e.g. raid5f bdev) or devices with large indirection units (some
capacity-focused NAND drives), which don't handle 4K writes well. It handles the logical to
physical address mapping and manages the garbage collection process.
## Terminology {#ftl_terminology}

### Logical to physical address map

- Shorthand: `L2P`
Contains the mapping of the logical addresses (LBA) to their on-disk physical location. The LBAs
are contiguous and in range from 0 to the number of surfaced blocks (the number of spare blocks
is calculated during device formation and subtracted from the available address space). The
spare blocks account for zones going offline throughout the lifespan of the device as well as
provide the necessary buffer for data [garbage collection](#ftl_reloc).

Since the L2P would occupy a significant amount of DRAM (4B/LBA for drives smaller than 16TiB,
8B/LBA for bigger drives), FTL will, by default, keep only the 2GiB of most recently used L2P
addresses in memory (the amount is configurable) and page them in and out of the cache device
as necessary.
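
As a rough, illustrative calculation of why this paging is needed (the 4B/8B per LBA figures are the
ones quoted above; the exact in-memory layout of the L2P is not shown here):

```python
def l2p_size_bytes(drive_size_bytes, block_size=4096):
    """Approximate size of a fully resident L2P, using the per-LBA entry sizes above."""
    num_lbas = drive_size_bytes // block_size
    entry_size = 4 if drive_size_bytes < 16 * 2**40 else 8  # 4B/LBA below 16TiB, 8B/LBA above
    return num_lbas * entry_size

# Example: a 15.36TB drive would need roughly 15GB of DRAM to hold the whole L2P,
# which is why only a 2GiB window of it is kept in memory by default.
print(l2p_size_bytes(15_360_000_000_000) / 10**9)  # 15.0
```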
### Band {#ftl_band}
The address map (`P2L`) is saved as a part of the band's metadata, at the end of each band:

```text
                  band's data                             tail metadata
+-------------------+-------------------------------+------------------------+
|zone 1 |...|zone n |...|...|zone 1 |...|            | ... |zone m-1 | zone m |
|block 1|   |block 1|   |   |block x|   |            |     |block y  |block y |
+-------------------+-------------+-----------------+------------------------+
```
Bands are written sequentially (in a way that was described earlier). Before a band can be written
to, all of its zones need to be erased. During that time, the band is considered to be in a `PREP`
state. Then the band moves to the `OPEN` state and actual user data can be written to the band.
Once the whole available space is filled, tail metadata is written and the band transitions to the
`CLOSING` state. When that finishes, the band becomes `CLOSED`.
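
The lifecycle described above can be summarized as a small state machine. The sketch below is purely
illustrative - the state names follow this section, not the exact enums used in the FTL sources:

```python
from enum import Enum, auto

class BandState(Enum):
    FREE = auto()     # not in use, zones may still need erasing
    PREP = auto()     # zones are being erased
    OPEN = auto()     # user data is being written
    CLOSING = auto()  # tail metadata is being written
    CLOSED = auto()   # band is full and its metadata is persisted

# Allowed transitions, following the description above; a CLOSED band goes back
# to FREE once garbage collection has relocated its remaining valid data.
TRANSITIONS = {
    BandState.FREE: {BandState.PREP},
    BandState.PREP: {BandState.OPEN},
    BandState.OPEN: {BandState.CLOSING},
    BandState.CLOSING: {BandState.CLOSED},
    BandState.CLOSED: {BandState.FREE},
}
```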
### Non-volatile cache {#ftl_nvcache}

- Shorthand: `nvcache`

The nvcache is a bdev used for buffering user writes and for storing various metadata.
Its data space is divided into chunks, which are written in a sequential manner. When the number
of free chunks drops below an assigned threshold, data from fully written chunks is moved to the
base bdev. This process is called chunk compaction.
```text
                  nvcache
+-----------------------------------------+
|chunk 1                                  |
|  +----------------------------------+   |
|  |blk 1 + md| blk 2 + md| blk n + md|   |
|  +----------------------------------+   |
+-----------------------------------------+
|                   ...                   |
+-----------------------------------------+
+-----------------------------------------+
|chunk N                                  |
|  +----------------------------------+   |
|  |blk 1 + md| blk 2 + md| blk n + md|   |
|  +----------------------------------+   |
+-----------------------------------------+
```
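
A minimal sketch of the compaction trigger described above. The types, field names and the
`write_to_base_bdev` callback are illustrative assumptions, not the actual FTL implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    lba: int
    is_valid: bool = True          # becomes False once the LBA is rewritten elsewhere

@dataclass
class Chunk:
    state: str                     # "FREE", "OPEN" or "CLOSED"
    blocks: list = field(default_factory=list)

def compact(chunks, free_threshold, write_to_base_bdev):
    """When free chunks run low, move the still-valid blocks of fully written
    (CLOSED) chunks to the base bdev and mark those chunks free again."""
    while sum(c.state == "FREE" for c in chunks) < free_threshold:
        closed = next((c for c in chunks if c.state == "CLOSED"), None)
        if closed is None:
            break                              # nothing left to compact
        for blk in closed.blocks:
            if blk.is_valid:
                write_to_base_bdev(blk)        # relocate current data to the base device
        closed.state = "FREE"                  # the chunk can now accept new writes
```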
### Garbage collection and relocation {#ftl_reloc}

- Shorthand: `gc`, `reloc`

Since a write to the same LBA invalidates its previous physical location, some of the blocks on a
band might contain old data that basically wastes space. As there is no way to overwrite an already
written block for a ZNS drive, this data will stay there until the whole zone is reset. This might
create a situation in which all of the bands contain some valid data and no band can be erased, so
no writes can be executed anymore. Therefore, a mechanism is needed to move valid data and
invalidate whole bands, so that they can be reused.
Valid blocks are marked with an asterisk '\*'.

The module responsible for data relocation is called `reloc`. When a band is chosen for garbage
collection, the appropriate blocks are marked as required to be moved. The `reloc` module takes a
band that has some of these blocks marked, checks their validity and, if they're still valid,
copies them.

Choosing a band for garbage collection depends on its validity ratio (the proportion of valid
blocks to all user blocks). The lower the ratio, the higher the chance the band will be chosen
for gc.
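
A minimal sketch of that selection heuristic (the band fields and states are illustrative
assumptions, not the actual FTL structures):

```python
def pick_band_for_gc(bands):
    """Pick the closed band with the lowest validity ratio (valid blocks / user blocks)."""
    candidates = [b for b in bands if b.state == "CLOSED"]
    if not candidates:
        return None                 # nothing to garbage collect yet
    return min(candidates, key=lambda b: b.valid_blocks / b.user_blocks)
```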
## Metadata {#ftl_metadata}

In addition to the [L2P](#ftl_l2p), FTL stores additional metadata on both the cache and the
base device. The following types of metadata are persisted:
- Superblock - stores the global state of FTL; stored on cache, mirrored to the base device
- L2P - see the [L2P](#ftl_l2p) section for details
- Band - stores the state of bands - write pointers, their OPEN/FREE/CLOSE state; stored on cache, mirrored to a different section of the cache device
- Valid map - bitmask of all the valid physical addresses, used for improving [relocation](#ftl_reloc)
- Chunk - stores the state of chunks - write pointers, their OPEN/FREE/CLOSE state; stored on cache, mirrored to a different section of the cache device
- P2L - stores the address mapping (P2L, see [band](#ftl_band)) of currently open bands. This allows for the recovery of open
bands after dirty shutdown without needing VSS DIX metadata on the base device; stored on the cache device
- Trim - stores information about unmapped (trimmed) LBAs; stored on cache, mirrored to a different section of the cache device
## Dirty shutdown recovery {#ftl_dirty_shutdown}

After power failure, FTL needs to rebuild the whole L2P using the address maps (`P2L`) stored within each band/chunk.
This needs to be done, because while individual L2P pages may have been paged out and persisted to the cache device,
there's no way to tell which, if any, pages were dirty before the power failure occurred. The P2L consists of not only
the mapping itself, but also a sequence id (`seq_id`), which describes the relative age of a given logical block
(multiple writes to the same logical block would produce the same number of P2L entries, only the last one having the current data).

FTL will therefore rebuild the whole L2P by reading the P2L of all closed bands and chunks. For open bands, the P2L is stored on
the cache device, in a separate metadata region (see [the P2L section](#ftl_metadata)). Open chunks can be restored thanks to the
mapping being stored in the VSS DIX metadata, with which the cache device must be formatted.
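
Conceptually, the rebuild keeps, for every LBA, the mapping with the highest `seq_id`. A sketch,
assuming each P2L entry can be read as an `(lba, physical_address, seq_id)` tuple:

```python
def rebuild_l2p(p2l_entries):
    """Rebuild the L2P from the P2L entries of all closed bands and chunks.

    If the same LBA appears multiple times (it was rewritten), only the entry
    with the highest seq_id holds the current data.
    """
    l2p = {}
    newest_seq = {}
    for lba, physical_address, seq_id in p2l_entries:
        if lba not in newest_seq or seq_id > newest_seq[lba]:
            newest_seq[lba] = seq_id
            l2p[lba] = physical_address
    return l2p
```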
### Shared memory recovery {#ftl_shm_recovery}

In order to shorten the recovery after a crash of the target application, FTL also stores its metadata in shared memory (`shm`) - this
allows it to keep track of the dirtiness state of individual pages and shortens the recovery time dramatically, as FTL will only
need to mark any L2P pages which were being paged out at the time of the crash as dirty and reissue the writes. There's no need
to read the whole P2L in this case.
### Trim {#ftl_trim}

Due to metadata size constraints and the difficulty of maintaining consistent data returned before and after dirty shutdown, FTL
currently only allows for trims (unmaps) aligned to 4MiB (alignment concerns both the offset and length of the trim command).
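
With the 4KiB logical blocks described in the introduction, this means that both the offset and the
length of a trim must be multiples of 1024 blocks. An illustrative check:

```python
BLOCK_SIZE = 4096                    # 4KiB logical block
TRIM_ALIGNMENT = 4 * 1024 * 1024     # 4MiB

def is_valid_trim(offset_blocks, num_blocks):
    """Both the offset and the length of a trim must be 4MiB-aligned."""
    blocks_per_unit = TRIM_ALIGNMENT // BLOCK_SIZE   # 1024 blocks
    return offset_blocks % blocks_per_unit == 0 and num_blocks % blocks_per_unit == 0

assert is_valid_trim(0, 1024)        # 4MiB trim at offset 0 - accepted
assert not is_valid_trim(512, 1024)  # offset not 4MiB-aligned - rejected
```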
## Usage {#ftl_usage}

### Prerequisites {#ftl_prereq}

In order to use the FTL module, a cache device formatted with VSS DIX metadata is required.
### FTL bdev creation {#ftl_create}

Both the JSON config and the RPC interface require the same arguments, which are described by the
`--help` option of the `bdev_ftl_create` RPC call:

- bdev's name
- base bdev's name
- cache bdev's name (cache bdev must support VSS DIX mode - could be emulated by providing SPDK_FTL_VSS_EMU=1 flag to make;
emulating VSS should be done for testing purposes only, it is not power-fail safe)
- UUID of the FTL device (if the FTL is to be restored from the SSD)
## FTL bdev stack {#ftl_bdev_stack}
In order to create FTL on top of a regular bdev:

1) Create a regular bdev e.g. `bdev_nvme`, `bdev_null`, `bdev_malloc`
2) Create a second regular bdev for the nvcache
3) Create an FTL bdev on top of the bdevs created in steps 1 and 2
Example:

```
$ scripts/rpc.py bdev_nvme_attach_controller -b nvme0 -a 00:05.0 -t pcie
nvme0n1
$ scripts/rpc.py bdev_nvme_attach_controller -b nvme1 -a 00:06.0 -t pcie
nvme1n1
$ scripts/rpc.py bdev_ftl_create -b ftl0 -d nvme0n1 -c nvme1n1
{
  "name": "ftl0",
  "uuid": "3b469565-1fa5-4bfb-8341-747ec9f3a9b9"