# FIO plugin
## Compiling fio
First, clone the fio source repository from https://github.com/axboe/fio
```bash
git clone https://github.com/axboe/fio
```
Then check out the latest fio version and compile the code:
```bash
cd fio
make
```
## Compiling SPDK
First, clone the SPDK source repository from https://github.com/spdk/spdk
```bash
git clone https://github.com/spdk/spdk
git -C spdk submodule update --init
```
Then, run the SPDK configure script to enable fio (point it to the root of the fio repository):
```bash
cd spdk
./configure --with-fio=/path/to/fio/repo <other configuration options>
```
Finally, build SPDK:
```bash
make
```
**Note to advanced users**: These steps assume you're using the DPDK submodule. If you are using your
own version of DPDK, the fio plugin requires that DPDK be compiled with -fPIC. You can compile DPDK
with -fPIC by modifying your DPDK configuration file and adding the line:
```bash
EXTRA_CFLAGS=-fPIC
```
## Usage
To use the SPDK fio plugin with fio, specify the plugin binary using LD_PRELOAD when running
fio and set ioengine=spdk in the fio configuration file (see example_config.fio in the same
directory as this README).
```bash
LD_PRELOAD=<path to spdk repo>/build/fio/spdk_nvme fio
```
To select NVMe devices, you pass an SPDK Transport Identifier string as the filename. These are in the
form:
```bash
filename=key=value [key=value] ... ns=value
```
Specifically, for local PCIe NVMe devices it will look like this:
```bash
filename=trtype=PCIe traddr=0000.04.00.0 ns=1
```
And remote devices accessed via NVMe over Fabrics will look like this:
```bash
filename=trtype=RDMA adrfam=IPv4 traddr=192.168.100.8 trsvcid=4420 ns=1
```
**Note**: The specification of the PCIe address should not use the normal ':'
and instead only use '.'. This is a limitation in fio - it splits filenames on
':'. Also, the NVMe namespaces start at 1, not 0, and the namespace must be
specified at the end of the string.
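
As a minimal sketch of the rule above, the following shell function builds a filename value from a PCIe address written in the usual colon-separated form. The function name `spdk_fio_filename` is hypothetical and is not part of SPDK or fio.

```bash
# Hypothetical helper: build an SPDK fio "filename" value for a local PCIe device.
# $1 = PCIe address in the usual form (e.g. 0000:04:00.0)
# $2 = namespace ID (namespaces start at 1; defaults to 1)
spdk_fio_filename() {
    # fio splits filenames on ':', so replace every ':' with '.'.
    traddr=$(printf '%s' "$1" | tr ':' '.')
    # The namespace must come last in the string.
    printf 'trtype=PCIe traddr=%s ns=%s\n' "$traddr" "${2:-1}"
}

spdk_fio_filename 0000:04:00.0 1
```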

By default, fio forks a separate process for every job. It can also spawn a separate thread within the
same process for every job. The SPDK fio plugin supports only the thread model, so fio jobs must
specify thread=1 when using it. The plugin supports multiple threads; the "1" just means
"use thread mode".

fio currently also has a race condition on shutdown when the ioengine is loaded dynamically by specifying
the engine's full path via the ioengine parameter. LD_PRELOAD is recommended to avoid this race condition.

When testing random workloads, it is recommended to set norandommap=1. fio's random map
processing consumes extra CPU cycles, which degrades performance over time with
the fio_plugin because all I/O is submitted and completed on a single CPU core.

When testing fio on multiple NVMe SSDs with the SPDK plugin, it is recommended to use multiple jobs in the
fio configuration. A performance gap has been observed between fio (with the SPDK plugin enabled) and SPDK perf
(examples/nvme/perf/perf) when testing multiple NVMe SSDs. With a single job (i.e., one CPU core), fio
performs worse than SPDK perf (also using one CPU core) against many NVMe SSDs. With multiple jobs, fio's
performance is similar to that of SPDK perf. We believe this is caused by the fio architecture: fio scales
well with multiple threads (i.e., multiple CPU cores), but a single thread does not handle many I/O devices well.
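
The multiple-jobs recommendation can be sketched as a small script that emits one job section per device. This is only an illustration: the helper name `gen_spdk_fio_jobs`, the PCIe addresses, and the workload values (rw, bs, iodepth) are assumptions. Only ioengine=spdk, thread=1, norandommap=1, and the filename format come from the text above.

```bash
# Hypothetical helper: print a fio job file with one job (thread) per PCIe NVMe device.
# Arguments: PCIe addresses in the usual colon-separated form.
gen_spdk_fio_jobs() {
    # Settings shared by all jobs; the workload values are placeholders.
    printf '[global]\nioengine=spdk\nthread=1\nnorandommap=1\n'
    printf 'rw=randread\nbs=4k\niodepth=32\n'
    i=0
    for bdf in "$@"; do
        # fio splits filenames on ':', so use '.' in the PCIe address.
        traddr=$(printf '%s' "$bdf" | tr ':' '.')
        printf '\n[job%d]\nfilename=trtype=PCIe traddr=%s ns=1\n' "$i" "$traddr"
        i=$((i + 1))
    done
}

gen_spdk_fio_jobs 0000:04:00.0 0000:05:00.0
```

Redirect the output to a file and pass that file to fio, using LD_PRELOAD as described above.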
## End-to-end Data Protection (Optional)
To run with protection information (PI) enabled, the following steps are required.
First, format the device namespace with the proper PI setting. For example:
```bash
nvme format /dev/nvme0n1 -l 1 -i 1 -p 0 -m 1
```
In the fio configuration file, set PRACT (pi_act) and set PRCHK (pi_chk) using the flags GUARD, REFTAG, and APPTAG as appropriate. For example:
```bash
pi_act=0
pi_chk=GUARD
```
The block size should be set to the sum of the data and metadata sizes. For example, if the data block size is
512 bytes and the host-generated PI metadata is 8 bytes, the block size in the fio configuration file should be 520 bytes:
```bash
bs=520
```
The storage device may use a block format that requires separate metadata (DIX). In this scenario, the fio_plugin
will automatically allocate an extra 4KiB buffer per I/O to hold this metadata. For some cases, such as 512 byte
blocks with 32 metadata bytes per block and a 128KiB I/O size, 4KiB isn't large enough. In this case, the
`md_per_io_size` option may be specified to increase the size of the metadata buffer.
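
The sizing in that example can be checked with a little arithmetic. The sketch below computes the metadata bytes a single I/O needs from the I/O size, the data block size, and the metadata bytes per block. The helper name `md_bytes_per_io` is hypothetical.

```bash
# Hypothetical helper: metadata bytes needed for a single I/O.
# $1 = I/O size in bytes, $2 = data block size in bytes, $3 = metadata bytes per block
md_bytes_per_io() {
    echo $(( $1 / $2 * $3 ))
}

# 128KiB I/O, 512-byte blocks, 32 metadata bytes per block:
# 256 blocks * 32 bytes = 8192 bytes, which exceeds the 4096-byte default buffer.
md_bytes_per_io $((128 * 1024)) 512 32
```

For this case the metadata buffer would need at least 8192 bytes, assuming `md_per_io_size` takes a size in bytes.
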

The plugin exposes two options, 'apptag' and 'apptag_mask', which users can set in the configuration file when
using the application tag and application tag mask in end-to-end data protection. They default to 0x1234 and
0xFFFF, respectively.
## VMD (Optional)
To enable VMD enumeration, add the enable_vmd flag to the fio configuration file:
```bash
enable_vmd=1
```
## ZNS
To use Zoned Namespaces, build the io-engine against, and run it with, fio version 3.23 or later, and add the following to your fio script:
```bash
zonemode=zbd
```
Also have a look at the example scripts provided with fio:
```bash
fio/examples/zbd-seq-read.fio
fio/examples/zbd-rand-write.fio
```
### Maximum Open Zones
Zoned Namespaces have a resource constraint on the number of zones that can be in an opened state at
any point in time. You can control how many zones fio will keep in an open state by using the
``--max_open_zones`` option.
If you use a fio version newer than 3.26, fio will automatically detect and set the proper value.
If you use fio 3.26 or older, make sure to provide the proper --max_open_zones value yourself.
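
A script that drives fio could make this decision from the version string. The sketch below is an assumption-laden illustration: the helper name `fio_detects_max_open_zones` is hypothetical, and it assumes version strings of the form `fio-3.28`. Development builds such as `fio-3.26-12-g...` are treated like the 3.26 release.

```bash
# Hypothetical helper: succeed if a fio version string is newer than 3.26.
# $1 = version string, assumed to look like "fio-3.28"
fio_detects_max_open_zones() {
    ver=${1#fio-}            # strip the leading "fio-"
    major=${ver%%.*}         # text before the first '.'
    rest=${ver#*.}           # text after the first '.'
    minor=${rest%%[!0-9]*}   # leading digits of the minor version
    [ "${major:-0}" -gt 3 ] || { [ "${major:-0}" -eq 3 ] && [ "${minor:-0}" -gt 26 ]; }
}

fio_detects_max_open_zones fio-3.28 && echo "fio detects the max open zones limit itself"
```

When the check fails, pass --max_open_zones explicitly.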
### Maximum Active Zones
Zoned Namespaces have a resource constraint on the number of zones that can be active at any point in
time. Unlike ``max_open_zones``, fio currently does not manage this constraint, and there is thus
no option to limit it.

When running with the SPDK/NVMe fio io-engine, you may see error messages, in the form of
completion errors, with the NVMe status code 0xbd ("Too Many Active Zones"). To work around this,
you can reset all zones before fio starts running its jobs by using the engine option:
```bash
--initial_zone_reset=1
```
### Zone Append
When running FIO against a Zoned Namespace you need to specify --iodepth=1 to avoid
"Zone Invalid Write: The write to a zone was not at the write pointer." I/O errors.
However, if your controller supports Zone Append, you can use the engine option:
```bash
--zone_append=1
```
This sends zone append commands instead of write commands to the controller.
When using zone append, you can specify an --iodepth greater than 1.
### Shared Memory Increase
If your device has a lot of zones, fio can give you errors such as:
```bash
smalloc: OOM. Consider using --alloc-size to increase the shared memory available.
```
This is because fio needs to allocate memory for the zone report, that is, to retrieve the state of the
zones on the device, including auxiliary accounting information. To solve this, you can follow
fio's advice and increase ``--alloc-size``.