FTL: Update documentation
Signed-off-by: Kozlowski Mateusz <mateusz.kozlowski@intel.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Change-Id: I78141ecf086d8ace07f9d8194c8eb9f64201a939
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/13393
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Jim Harris <james.r.harris@intel.com>

doc/ftl.md

@@ -1,21 +1,26 @@
# Flash Translation Layer {#ftl}

The Flash Translation Layer library provides efficient 4K block device access on top of devices
with >4K write unit size (e.g. raid5f bdev) or devices with large indirection units (some
capacity-focused NAND drives), which don't handle 4K writes well. It handles the logical to
physical address mapping and manages the garbage collection process.

## Terminology {#ftl_terminology}

### Logical to physical address map

- Shorthand: `L2P`

Contains the mapping of the logical addresses (LBA) to their on-disk physical location. The LBAs
are contiguous and in the range from 0 to the number of surfaced blocks (the number of spare blocks
is calculated during device formation and subtracted from the available address space). The
spare blocks account for zones going offline throughout the lifespan of the device as well as
provide the necessary buffer for data [garbage collection](#ftl_reloc).

Since the L2P would occupy a significant amount of DRAM (4B/LBA for drives smaller than 16TiB,
8B/LBA for bigger drives), FTL will, by default, store only the 2GiB of most recently used L2P
addresses in memory (the amount is configurable), and page them in and out of the cache device
as necessary.
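
As a rough, illustrative sizing example (assuming 4KiB logical blocks and a 3.84 TB base device;
the actual numbers depend on the device and configuration):

```text
3.84 TB / 4 KiB blocks      -> ~0.94 billion LBAs
~0.94e9 LBAs * 4 B per LBA  -> ~3.5 GiB of L2P in total
Default resident cache      -> 2 GiB kept in DRAM, the rest paged to the cache device
```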

### Band {#ftl_band}

@@ -45,71 +50,55 @@ from the oldest band to the youngest.
 parallel unit 1              pu 2                  pu n
```

The address map (`P2L`) is saved as a part of the band's metadata, at the end of each band:

```text
          band's data                                  tail metadata
+-------------------+-------------------------------+------------------------+
|zone 1 |...|zone n |...|...|zone 1 |...|           | ... |zone m-1 |zone m  |
|block 1|   |block 1|   |   |block x|   |           |     |block y  |block y |
+-------------------+-------------+-----------------+------------------------+
```

Bands are written sequentially (in a way that was described earlier). Before a band can be written
to, all of its zones need to be erased. During that time, the band is considered to be in a `PREP`
state. Then the band moves to the `OPEN` state and actual user data can be written to the
band. Once the whole available space is filled, tail metadata is written and the band transitions to
`CLOSING` state. When that finishes the band becomes `CLOSED`.
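
The state transitions described above can be summarized as follows:

```text
PREP (zones being erased) -> OPEN (user data written) -> CLOSING (tail metadata written) -> CLOSED
```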

### Non volatile cache {#ftl_nvcache}

- Shorthand: `nvcache`

Nvcache is a bdev that is used for buffering user writes and storing various metadata.
Nvcache data space is divided into chunks. Chunks are written in a sequential manner.
When the number of free chunks falls below an assigned threshold, data from fully written chunks
is moved to the `base_bdev`. This process is called chunk compaction.

```text
                  nvcache
+-----------------------------------------+
|chunk 1                                  |
|  +-----------------------------------+  |
|  |blk 1 + md| blk 2 + md| blk n + md |  |
|  +-----------------------------------+  |
+-----------------------------------------+
|                   ...                   |
+-----------------------------------------+
+-----------------------------------------+
|chunk N                                  |
|  +-----------------------------------+  |
|  |blk 1 + md| blk 2 + md| blk n + md |  |
|  +-----------------------------------+  |
+-----------------------------------------+
```

### Garbage collection and relocation {#ftl_reloc}

- Shorthand: `gc`, `reloc`

Since a write to the same LBA invalidates its previous physical location, some of the blocks on a
band might contain old data that effectively wastes space. As there is no way to overwrite an already
written block on a ZNS drive, this data will stay there until the whole zone is reset. This might create a
situation in which all of the bands contain some valid data and no band can be erased, so no writes
can be executed anymore. Therefore a mechanism is needed to move valid data and invalidate whole
bands, so that they can be reused.

@@ -125,28 +114,62 @@ bands, so that they can be reused.

Valid blocks are marked with an asterisk '\*'.

The module responsible for data relocation is called `reloc`. When a band is chosen for garbage collection,
the appropriate blocks are marked as required to be moved. The `reloc` module takes a band that has
some of these blocks marked, checks their validity and, if they're still valid, copies them.

Choosing a band for garbage collection depends on its validity ratio (the proportion of valid blocks to all
user blocks). The lower the ratio, the higher the chance the band will be chosen for `gc`.

## Metadata {#ftl_metadata}

In addition to the [L2P](#ftl_l2p), FTL will store additional metadata both on the cache, as
well as on the base devices. The following types of metadata are persisted:

- Superblock - stores the global state of FTL; stored on cache, mirrored to the base device

- L2P - see the [L2P](#ftl_l2p) section for details

- Band - stores the state of bands - write pointers, their OPEN/FREE/CLOSE state; stored on cache,
  mirrored to a different section of the cache device

- Valid map - bitmask of all the valid physical addresses, used for improving [relocation](#ftl_reloc)

- Chunk - stores the state of chunks - write pointers, their OPEN/FREE/CLOSE state; stored on cache,
  mirrored to a different section of the cache device

- P2L - stores the address mapping (P2L, see [band](#ftl_band)) of currently open bands. This allows for
  the recovery of open bands after dirty shutdown without needing VSS DIX metadata on the base device;
  stored on the cache device

- Trim - stores information about unmapped (trimmed) LBAs; stored on cache, mirrored to a different
  section of the cache device

## Dirty shutdown recovery {#ftl_dirty_shutdown}

After power failure, FTL needs to rebuild the whole L2P using the address maps (`P2L`) stored within each band/chunk.
This needs to be done because, while individual L2P pages may have been paged out and persisted to the cache device,
there's no way to tell which, if any, pages were dirty before the power failure occurred. The P2L consists of not only
the mapping itself, but also a sequence id (`seq_id`), which describes the relative age of a given logical block
(multiple writes to the same logical block produce multiple P2L entries, with only the last one holding the current data).
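
As an illustration (values are made up), if the same LBA was written three times, recovery will find three
P2L entries for it and the one with the highest `seq_id` wins:

```text
P2L entries for LBA 100:
  physical address 0x0001000, seq_id 17
  physical address 0x0002480, seq_id 42
  physical address 0x0003910, seq_id 58   <- newest entry, used to rebuild the L2P
```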

FTL will therefore rebuild the whole L2P by reading the P2L of all closed bands and chunks. For open bands, the P2L is stored on
the cache device, in a separate metadata region (see [the P2L section](#ftl_metadata)). Open chunks can be restored thanks to storing
the mapping in the VSS DIX metadata, which the cache device must be formatted with.

### Shared memory recovery {#ftl_shm_recovery}

In order to shorten the recovery after a crash of the target application, FTL also stores its metadata in shared memory (`shm`) - this
allows it to keep track of the dirtiness state of individual pages and shortens the recovery time dramatically, as FTL will only
need to mark any L2P pages that were potentially being paged out at the time of the crash as dirty and reissue the writes. There's no need
to read the whole P2L in this case.

### Trim {#ftl_trim}

Due to metadata size constraints and the difficulty of maintaining consistent data returned before and after dirty shutdown, FTL
currently only allows for trims (unmaps) aligned to 4MiB (the alignment applies to both the offset and the length of the trim command).
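
For example, with 4KiB logical blocks, 4MiB corresponds to 1024 blocks, so both the starting LBA and the
length of a trim request have to be multiples of 1024 blocks (illustrative values):

```text
4 MiB / 4 KiB = 1024 blocks

LBA 0,    length 8192 blocks -> allowed (offset and length are multiples of 1024)
LBA 1024, length 1024 blocks -> allowed
LBA 512,  length 1024 blocks -> not allowed (offset is not 4MiB-aligned)
```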

## Usage {#ftl_usage}

### Prerequisites {#ftl_prereq}

In order to use the FTL module, a cache device formatted with VSS DIX metadata is required.
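
If no such device is available, VSS DIX can be emulated for functional testing by building SPDK with the
`SPDK_FTL_VSS_EMU=1` flag mentioned in the next section. A minimal sketch, assuming the flag is passed as a
regular make variable (emulated VSS is not power-fail safe and should be used for testing only):

```
make SPDK_FTL_VSS_EMU=1
```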

### FTL bdev creation {#ftl_create}

@@ -155,51 +178,27 @@ Both interfaces require the same arguments which are described by the `--help` option of the
`bdev_ftl_create` RPC call, which are:

- bdev's name
- base bdev's name
- cache bdev's name (cache bdev must support VSS DIX mode - could be emulated by providing SPDK_FTL_VSS_EMU=1 flag to make;
  emulating VSS should be done for testing purposes only, it is not power-fail safe)
- UUID of the FTL device (if the FTL is to be restored from the SSD; see the example below)
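
For example, restoring an existing FTL instance from its UUID might look as follows (a sketch; `-u` is
assumed here to be the UUID option - consult `scripts/rpc.py bdev_ftl_create --help` for the exact argument names):

```
$ scripts/rpc.py bdev_ftl_create -b ftl0 -d nvme0n1 -c nvme1n1 -u 3b469565-1fa5-4bfb-8341-747ec9f3a9b9
```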

## FTL bdev stack {#ftl_bdev_stack}

In order to create FTL on top of a regular bdev:
1) Create a regular bdev, e.g. `bdev_nvme`, `bdev_null`, `bdev_malloc`
2) Create a second regular bdev for the nvcache
3) Create an FTL bdev on top of the bdevs created in steps 1 and 2

Example:
```
$ scripts/rpc.py bdev_nvme_attach_controller -b nvme0 -a 00:05.0 -t pcie
nvme0n1

$ scripts/rpc.py bdev_nvme_attach_controller -b nvme1 -a 00:06.0 -t pcie
nvme1n1

$ scripts/rpc.py bdev_ftl_create -b ftl0 -d nvme0n1 -c nvme1n1
{
    "name": "ftl0",
    "uuid": "3b469565-1fa5-4bfb-8341-747ec9f3a9b9"
}
```
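
To remove the FTL bdev afterwards (a sketch, assuming the `bdev_ftl_delete` RPC; check its `--help` output
for the exact arguments):

```
$ scripts/rpc.py bdev_ftl_delete -b ftl0
```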