This patch adds a library, application and test scripts for extending
SPDK to present virtio-scsi controllers to QEMU-based VMs and
process I/O submitted to devices attached to those controllers.
This functionality is dependent on QEMU patches to enable
vhost-scsi in userspace - those patches are currently working their
way through the QEMU mailing list, but temporary patches to enable
this functionality in QEMU will be made available shortly through the
SPDK github repository.
Signed-off-by: Jim Harris <james.r.harris@intel.com>
Signed-off-by: Krzysztof Jakimiak <krzysztof.jakimiak@intel.com>
Signed-off-by: Michal Kosciowski <michal.kosciowski@intel.com>
Signed-off-by: Karol Latecki <karolx.latecki@intel.com>
Signed-off-by: Piotr Pelplinski <piotr.pelplinski@intel.com>
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Signed-off-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Signed-off-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
Signed-off-by: Krzysztof Jakimiak <krzysztof.jakimiak@intel.com>
Change-Id: I138e4021f0ac4b1cd9a6e4041783cdf06e6f0efb
This avoids registering PMDs that are not used by a given
application. For example, an app may wish to *not* use
ioat - in this case, ioat PMD would not be registered with
DPDK, and we would not waste time probing these devices
when probing other devices like NVMe.
Signed-off-by: Jim Harris <james.r.harris@intel.com>
Change-Id: If378e40bde9057c7808603aa1918bcfe80fa0e9d
These allocations need to be from memory registered with the SPDK env
library to allow future work on automatic ibverbs memory registration.
Change-Id: I6ec6999ecd6d6bf6ba4ab159630f7d01f3d46154
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
It is no longer used now that AER handling holds the request until it is
triggerred.
Change-Id: I71a75e86f82bc06f677cf26defa701e60b9aa1bd
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This will allow returning a different default value per mem map.
Change-Id: I94d3de197acfb2e6ad40092ab0588ba4e951af80
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Add a top-level structure that can be reused for other kinds of memory
address translations.
Change-Id: I046f98b76b4e98087d90095d6e9dea5cd6ab7898
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This patch make the function spdk_scsi_task_process_null_lun() as public and
finish the task immediately once we get task in iscsi layer.
Change-Id: I4ada027d3a324dce8ef0d0f7706dbc14184ead96
Signed-off-by: Cunyin Chang <cunyin.chang@intel.com>
After checking the code, aerl in our session is 0,
so there will be only 1 AER. So currently,
we will only handle 1 AER case.
When the AER event is triggered by real NVMe device owned
by the subsystem, it notifies all sessions belonging to
the subsystem.
Change-Id: Ia80fb0f03e893c20d8dd14afbed8db10db38301c
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
By the time the module is unloaded, the reactors
have already stopped. That means the event will never
actually fire. Simply remove it.
Change-Id: I4fe371ae7a679d51254d0267fbbbf74c3e9cf477
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
The wr_id should never be NULL - it will always correspond to a request
we previously posted. Convert the check to an assert() so we notice if
this ever happens (which would indicate a programming error somewhere
else).
While we're here, add a more robust check to make sure the request is
actually in the correct array of requests for the connection being
polled (also in an assert, since this should never fail in normal
execution).
Change-Id: I855763d7d827fb8cf00a775c7bc2ccb579db8d0f
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Kernel nvmf host always tries to connect nvmf target
when we does not issue nvme disconnect command. Thus,
we face rdma_create_qp issue, the reason is that we call
rdma_listen too early, and the event retrieved from
rdma_cm_get_event is too late.
And this patch solves this issue.
Change-Id: I153a8aea7420a86a236301dad9bd54af97f60865
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
The function spdk_iscsi_transfer_in will handle the task if the
status is not SPDK_SCSI_STATUS_GOOD.
Change-Id: I61155ffa056b3eac551f215d50e1808e5389fdb5
Signed-off-by: cunyinch <cunyin.chang@intel.com>
Calling spdk_nvmf_request_complete to complete spdk_nvmf_request
causes some fields in completion queue entry not set correctly.
Calling spdk_nvmf_request_complete fixes the problem.
This can be used for issuing an abort for the timed-out command.
Change-Id: I3c5727fdddc156cd7c8f99afbc3e6da8e73bba56
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Make sure the compiler arranges the fast path as the fallthrough case by
annotating the checks in spdk_vtophys().
Change-Id: If0fc3149297131894b5c7a94bff31bf8ee40326e
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Now that all DPDK memory is registered at startup, spdk_vtophys() never
needs to add new translations to the vtophys map. This means that any
lookup that fails to find an allocated map_1gb will always return
SPDK_VTOPHYS_ERROR rather than trying to allocate it and then failing
the lookup anyway.
Change-Id: I7e6f7af183199651f5808a17810a17970b0e3331
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
vtophys_get_paddr() and vtophys_get_dpdk_paddr() are doing similar
things; combine them into one function that works for all DPDK
memory addresses.
Part of the vtophys test is temporarily disabled until the next commit,
which will register all DPDK memory at startup and stop lookiing up
addresses at runtime.
Change-Id: I91312837aa1e6170bacaf3b0d2adbdc4391d3afa
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This just moves the lookup of the physical address up one level - now
_spdk_vtophys_register_one() is only responsible for filling out the
mapping table, not looking up the translation.
Change-Id: I9fd5b85da623e403fda0563b6bdebd4aaaf42864
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Rather than storing the page frame number, just store the full physical
address of each 2 MB page. This simplifies the lookup code and makes
the map generic (values are inserted and retrieved without any
modification) for future uses.
Change-Id: Ib1081513a0682f6b8b908f3401c00d87b00f484c
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
No need to build a whitelist and scan anymore - the NVMe
driver can directly attach to a specified device.
Change-Id: Ie60c09b6ab37a7f068c496f0cad53bfdc8617349
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Move the ibv_recv_wr initialization in
nvme_rdma_alloc_rsps. Thus we can save some
CPU times
Change-Id: Id449b2684290431f8b3ba97ec4058171d34038bf
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
We do not need to set it for submission since the contents
are same
Change-Id: I345094e2e8a858b318be73d28f09393566587d95
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
When performed limitation iSCSI tests, 128 target nodes with 1
connection for each target node, for IO bigger than 256KiB iSCSI
target will report run of out task pool issue sometimes. When
all the iSCSI parameters with default values, each connection
will consume maximum 189 tasks, we hardcoded the task pool with
16384, so 189 * 128 connection will exceed 16384. Increase the
default number from 16384 to 32768 will fix the issue.
With 1MiB block size and queue depth with 128 for each connection,
there will be 64 outstanding iSCSI commands in the iSCSI target,
for Writes, the maximum R2T number is 4, so the maximum tasks for
the 4 R2T is (1 + 16) * 4 = 68, 8KiB for the first burst task, 16
for the data segment. For Reads, the maximum 64 data in segment can
be used as 4 iSCSI Read commands. The rest 56 iSCSI commands will
cost 56 tasks, so the total number is 56 + 64 + 68 = 188, 1 additional
task for NOP task.
Change-Id: I945871cbe3076139f08c2ef647af2d9c84601dcb
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
This is consistent with the rest of the RPC calls that report a number
of blocks, and it matches the field in the split_disk structure.
Change-Id: Ie25534617112d65979c317fe13e05a6c32520a15
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The driver_specific object should contain a single object with the
blockdev driver's name so that the user can determine how to interpret
it. This matches the NVMe blockdev driver.
Change-Id: I434b910a95dd527363af78469dc900e9d19ec12e
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Now that namespace splitting support has been removed from the NVMe bdev
in commit efccac8 ("bdev/nvme: remove NvmeLunsPerNs and LunSizeInMB"),
the block_size and total_size fields in the NVMe bdev's driver_specific
config data are redundant. The generic get_bdevs num_blocks and
block_size fields provide the same information.
Change-Id: I080d2017d608716a593bb553ee667e9c4017ffb7
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Move cb_arg to the first argument to match the other NVMe callback
function signatures.
Change-Id: I4e699c8071dcb7ba4ce3cdb82ee985600208204c
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Whether a nvme command having data transfer cannot be completely
determined by command opcode. For set features command, some features
don't require data transfer.
Change spdk_nvmf_request_prep_data to fix this issue.
This has been reported for a number of different device
types. We suspect these devices are technically out of
spec, but they work with most other available NVMe
drivers on accident.
Change-Id: I529cfc03fc314cbab2a1cd40620bf1dd5b54182d
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reason: the 4 fields of struct ibv_recv_wr is already
set in the following 4 lines.
Change-Id: I97437ee2e4c6e944154813bb48b1740b182220df
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
The responsibility for writing the pid file should lie with the init
system, not the application itself.
This was also broken by the recent instance ID/shared memory ID rework;
the pid file was named based on the pid, making it fairly worthless.
Change-Id: Ifb4f2d3ce5cf132f2c2e8bd3d0ba605ff8ccd8fe
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This was added by mistake in commit 18d26e42a3 ("env: Move DPDK
intialization into the env library."). It is always dead code, because
shm_id is set to getpid() right above it, and it will never be -1.
Change-Id: I19c798a87bf7a3b12547d772b981b038857abcaa
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This patch makes spdk_scsi_lun_construct behave as documented.
spdk_scsi_lun_construct will return only newly created LUN.
If LUN with that name already exists, NULL will be returned.
Unit test relevant to this behaviour is now changed to show
this functionality is now working.
Signed-off-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
Change-Id: I002903d6f96555c638aba3fa99cc2c2504ced603
This is necessary to prevent claiming the same LUN twice
and properly cleanup in case of an error during spdk_scsi_dev_construct.
This patch addresses three issues:
- spdk_scsi_lun_claim error is correctly handled in spdk_scsi_dev_add_lun
- on error when constructing scsi dev, it is now correctly removed along with attached luns
- spdk_scsi_dev_destruct not only unclaims, but calls spdk_scsi_lun_destruct on each lun in dev
Unit tests relevant to this behaviour are changed to show this functionality is now working.
Signed-off-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
Change-Id: I111c320f875e5003e3f1f7748a2630097301ce1b
This patch adds two new unit tests for scsi device:
- creating two different devices, each containing the same lun
- creating one device, with the same lun twice
As noted in code, three asserts are incorrectly set to show functionality
that is not working currently.
Next patch in series implements that functionality and changes asserts
in the unit tests.
Signed-off-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
Change-Id: I2645401fee4f2cd986458e0a4db108ce4e1bf9db
By default, all SPDK applications will not share memory.
To share memory, start the applications with the same
shared memory id.
Change-Id: Ib6180369ef0ed12d05983a21d7943e467402b21a
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Funnel all of the return paths in the main parsing routine through the
code that sets the *end pointer so that all error cases set it.
Change-Id: I0565913f7b9488470ede79dc1af84eb4b9a03225
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The direct and virtual mode code is identical; move it to session.c like
the other virtualized get/set features.
Change-Id: I0a0e2dd795197c142ad5d9d0e4ddedb2aa5c8c2a
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Even for direct mode, each session should use its own
async event configuration like virtual mode instead of
passthrough.
Change-Id: I9c1175f3677c672c0cad684341b8a46a575d753e
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
Instance ID is too overloaded and the uses are beginning
to conflict. Separate the RPC configuration out.
Change-Id: I712731130339fee4fc8de4dc2d0fea7040773c58
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
The host and port output parameters point into the (non-const) char *ip,
so it makes more sense for them to be non-const as well.
This allows the flexibility to pass non-const char pointers as the
output parameters, which will be used in the nvmf_tgt/conf.c parsing
code.
Change-Id: I1d5b102fc389c06d36432904e4fda944437b659e
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
For namespaces with end-to-end protection information, metadata size
of exactly 8 bytes, and extended LBA configured, the NVMe driver would
calculate the size of the data block incorrectly. The NVMe spec has a
special provision for this specific case (8-byte metadata only) and
PRACT = 1 that requires that the host does not send the metadata as part
of the host memory buffer.
To fix this, clean up the calculation of the per-block data transfer
size by adding a new extended_lba_size field in the namespace, which
represents the total size of data to be transferred per block based on
the namespace's configured metadata size and whether it transfers
metadata as part of the data buffer. Then add the special case for
PRACT = 1 and PI configured and extended LBA in the R/W helper
functions.
Change-Id: I0b383a58c773cac06e6c018858b57129064c6059
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
These were repeated a few different places, so pull them into a common
header file.
Change-Id: Id807fa2cfec0de2e0363aeb081510fb801781985
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This removes one addition from the submission path (negligible, but a
nice side effect), but also opens up the possibility of reporting the
total time an I/O took - since we are always tracking the submission
time anyway, there is no extra cost to report it in the completion
callback.
Change-Id: I7129e7c09d20da8082042a7622d045846461dd9c
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The phenoemon is that we can not shutdown the nvmf tgt.
The solution is that we need to adjust the shutting down orders of
nvmf tgt subsystem and rdma trasport layer.
Change-Id: Ie39657370b1574960e0ee7cf604cc5872db0bed3
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
This function parses in place by inserting null terminators.
Change-Id: I61cb97b87ec05d0183fbaa993fd3d7580a188bde
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Split the ref_count field of the bottom level of the vtophys map tree to
a separate array so that the pfn_2mb field can be expanded to a full 64
bits again. This doesn't change behavior for the current use as a page
frame number; it is setup work for storing an arbitrary 64-bit pointer
value in the bottom level.
Change-Id: I0bc44df3edc9df4a479229d69c2f3884d43a340d
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Since the io_channel will be passed to the underlying bdev's
read/write/... functions later, we need to also acquire an io_channel
for the underlying bdev, not for the virtual bdev.
Change-Id: Ica13076973fef875ea636770fce8eb27017aa1c3
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
These strings are not modified by the functions they are passed to, so
they can be const char *.
Change-Id: I11532f232990a305d706c14aac1b0f8f93b8f576
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
For infinite timeout states, instead of printing UINT64_MAX as a
decimal number, interpret it as "no timeout" instead.
Change-Id: I579f5857f96286734940ab5f493261e60354c4fe
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The PCIe transport initializes the quirks directly, so the generic hook
to get PCI ID is no longer necessary. This path was dead code.
Change-Id: I25bdaa598db53e4312a264d9d8356d1b416696e5
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The logic to fail queue pairs when the controller is failed should be
handled in the generic code, not in the individual transports.
This also allows nvme_qpair_fail() to be private to nvme_qpair.c.
Change-Id: I6194576dceb35073b9af8847e59314900028637c
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Avoid a runtime check for the rte_ring type - we know that the event
ring is multi-producer/single-consumer at compile time.
Change-Id: I5d42aee9c635db86e545b661361a68818d80961d
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Avoid accessing the internals of the bdev_io from outside of the bdev
library.
Change-Id: I01dfc38b2520353ad42bcd8587b90f197eadf101
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This is more CPU efficient than only grabbing one
completion per call to ibv_poll_cq.
Change-Id: I0c70d33639f0f345482d9e7c810f9c6723937058
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
This virtual block device takes an underlying block device and splits it
into several smaller equal-sized block devices.
Change-Id: I6f6e686c1177b2e4885f7e88809ad329caae55bd
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
These were only intended for testing and should be replaced by a virtual
blockdev that can be layered on top of any kind of bdev.
Change-Id: I3ba2cc94630a6c6748d96e3401fee05aaabe20e0
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>