After running into multiple issues with missing packages when building bpftrace, I went to the bpftrace repo and used the install documentation to successfully build and install bpftrace. This patch updates the spdk documentation to point to the documentation in the bpftrace repo. Change-Id: Iab5ef901d282a26d6a3c4f59cceb46b317ee658a Signed-off-by: John Kariuki <John.K.Kariuki@intel.com> Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/14221 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Reviewed-by: Dong Yi <dongx.yi@intel.com> Reviewed-by: Ben Walker <benjamin.walker@intel.com> Reviewed-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
170 lines
7.7 KiB
Markdown
170 lines
7.7 KiB
Markdown
# Userspace DTrace (USDT) {#usdt}
|
|
|
|
## Building bpftrace
|
|
|
|
We have found issues with the packaged bpftrace on both Ubuntu 20.04
|
|
and Fedora 33. So bpftrace should be built and installed from source.
|
|
To build and install bpftrace go to [building bpftraces](https://github.com/iovisor/bpftrace/blob/master/INSTALL.md#building-bpftrace)
|
|
|
|
## bpftrace.sh
|
|
|
|
bpftrace.sh is a helper script that facilitates running bpftrace scripts
|
|
against a running SPDK application. Here is a typical usage:
|
|
|
|
```bash
|
|
scripts/bpftrace.sh `pidof spdk_tgt` scripts/bpf/nvmf.bt
|
|
```
|
|
|
|
Attaching to USDT probes requires the full path of the binary in the
|
|
probe description. SPDK bpftrace scripts can be written with an __EXE__
|
|
marker instead of a full path name, and bpftrace.sh will dynamically
|
|
replace that string with the full path name using information from procfs.
|
|
|
|
It is also useful to filter certain kernel events (such as system calls)
|
|
based on the PID of the SPDK application. SPDK bpftrace scripts can be
|
|
written with a __PID__ marker, and bpftrace.sh will dynamically replace
|
|
that string with the PID provided to the script.
|
|
|
|
## Configuring SPDK Build
|
|
|
|
```bash
|
|
./configure --with-usdt
|
|
```
|
|
|
|
## Start SPDK application and bpftrace script
|
|
|
|
From first terminal:
|
|
|
|
```bash
|
|
build/bin/spdk_tgt -m 0xC
|
|
```
|
|
|
|
From second terminal:
|
|
|
|
```bash
|
|
scripts/bpftrace.sh `pidof spdk_tgt` scripts/bpf/nvmf.bt
|
|
```
|
|
|
|
nvmf.bt will print information about nvmf subsystem and poll
|
|
group info state transitions.
|
|
|
|
From third terminal:
|
|
|
|
```bash
|
|
scripts/rpc.py <<EOF
|
|
nvmf_create_transport -t tcp
|
|
nvmf_create_subsystem nqn.2016-06.io.spdk:cnode1 -a -s SPDK00000000000001 -m 10
|
|
nvmf_subsystem_add_listener nqn.2016-06.io.spdk:cnode1 -t tcp -a 127.0.0.1 -s 4420
|
|
bdev_null_create null0 1000 512
|
|
nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 null0
|
|
EOF
|
|
```
|
|
|
|
This creates the nvmf tcp transport, a new nvmf subsystem that listens on a tcp
|
|
port, and a null bdev which is added as a namespace to the new nvmf subsystem.
|
|
|
|
You will see output from the second terminal that looks like this:
|
|
|
|
```bash
|
|
2110.935735: nvmf_tgt reached state NONE
|
|
2110.954316: nvmf_tgt reached state CREATE_TARGET
|
|
2110.967905: nvmf_tgt reached state CREATE_POLL_GROUPS
|
|
2111.235982: nvmf_tgt reached state START_SUBSYSTEMS
|
|
2111.253560: nqn.2014-08.org.nvmexpress.discovery change state from INACTIVE to ACTIVE start
|
|
2111.260278: nqn.2014-08.org.nvmexpress.discovery on thread 2 state to ACTIVE start
|
|
2111.264281: nqn.2014-08.org.nvmexpress.discovery on thread 2 state to ACTIVE done
|
|
2111.284083: nqn.2014-08.org.nvmexpress.discovery change state from INACTIVE to ACTIVE done
|
|
2111.289197: nvmf_tgt reached state RUNNING
|
|
2111.271573: nqn.2014-08.org.nvmexpress.discovery on thread 3 state to ACTIVE start
|
|
2111.279787: nqn.2014-08.org.nvmexpress.discovery on thread 3 state to ACTIVE done
|
|
2189.921492: nqn.2016-06.io.spdk:cnode1 change state from INACTIVE to ACTIVE start
|
|
2189.952508: nqn.2016-06.io.spdk:cnode1 on thread 2 state to ACTIVE start
|
|
2189.959125: nqn.2016-06.io.spdk:cnode1 on thread 2 state to ACTIVE done
|
|
2190.005832: nqn.2016-06.io.spdk:cnode1 change state from INACTIVE to ACTIVE done
|
|
2189.969058: nqn.2016-06.io.spdk:cnode1 on thread 3 state to ACTIVE start
|
|
2189.999889: nqn.2016-06.io.spdk:cnode1 on thread 3 state to ACTIVE done
|
|
2197.859104: nqn.2016-06.io.spdk:cnode1 change state from ACTIVE to PAUSED start
|
|
2197.879199: nqn.2016-06.io.spdk:cnode1 on thread 2 state to PAUSED start
|
|
2197.883416: nqn.2016-06.io.spdk:cnode1 on thread 2 state to PAUSED done
|
|
2197.902291: nqn.2016-06.io.spdk:cnode1 change state from ACTIVE to PAUSED done
|
|
2197.908939: nqn.2016-06.io.spdk:cnode1 change state from PAUSED to ACTIVE start
|
|
2197.912644: nqn.2016-06.io.spdk:cnode1 on thread 2 state to ACTIVE start
|
|
2197.927409: nqn.2016-06.io.spdk:cnode1 on thread 2 state to ACTIVE done
|
|
2197.949742: nqn.2016-06.io.spdk:cnode1 change state from PAUSED to ACTIVE done
|
|
2197.890812: nqn.2016-06.io.spdk:cnode1 on thread 3 state to PAUSED start
|
|
2197.897233: nqn.2016-06.io.spdk:cnode1 on thread 3 state to PAUSED done
|
|
2197.931278: nqn.2016-06.io.spdk:cnode1 on thread 3 state to ACTIVE start
|
|
2197.946124: nqn.2016-06.io.spdk:cnode1 on thread 3 state to ACTIVE done
|
|
2205.859904: nqn.2016-06.io.spdk:cnode1 change state from ACTIVE to PAUSED start
|
|
2205.891392: nqn.2016-06.io.spdk:cnode1 on thread 2 state to PAUSED start
|
|
2205.896588: nqn.2016-06.io.spdk:cnode1 on thread 2 state to PAUSED done
|
|
2205.920133: nqn.2016-06.io.spdk:cnode1 change state from ACTIVE to PAUSED done
|
|
2205.905900: nqn.2016-06.io.spdk:cnode1 on thread 3 state to PAUSED start
|
|
2205.914856: nqn.2016-06.io.spdk:cnode1 on thread 3 state to PAUSED done
|
|
2206.091084: nqn.2016-06.io.spdk:cnode1 change state from PAUSED to ACTIVE start
|
|
2206.099222: nqn.2016-06.io.spdk:cnode1 on thread 2 state to ACTIVE start
|
|
2206.105445: nqn.2016-06.io.spdk:cnode1 on thread 2 state to ACTIVE done
|
|
2206.119271: nqn.2016-06.io.spdk:cnode1 change state from PAUSED to ACTIVE done
|
|
2206.109144: nqn.2016-06.io.spdk:cnode1 on thread 3 state to ACTIVE start
|
|
2206.115636: nqn.2016-06.io.spdk:cnode1 on thread 3 state to ACTIVE done
|
|
```
|
|
|
|
Now stop the bpftrace.sh running in the second terminal, and start
|
|
it again with the send_msg.bt script. This script keeps a count of
|
|
functions executed as part of an spdk_for_each_channel or
|
|
spdk_thread_send_msg function call.
|
|
|
|
```bash
|
|
scripts/bpftrace.sh `pidof spdk_tgt` scripts/bpf/send_msg.bt
|
|
```
|
|
|
|
From the third terminal, create another null bdev and add it as a
|
|
namespace to the cnode1 subsystem.
|
|
|
|
```bash
|
|
scripts/rpc.py <<EOF
|
|
bdev_null_create null1 1000 512
|
|
nvmf_subsystem_add_ns nqn.2016-06.io.spdk:cnode1 null1
|
|
EOF
|
|
```
|
|
|
|
Now Ctrl-C the bpftrace.sh in the second terminal, and it will
|
|
print the final results of the maps.
|
|
|
|
```bash
|
|
@for_each_channel[subsystem_state_change_on_pg]: 2
|
|
|
|
@send_msg[_finish_unregister]: 1
|
|
@send_msg[_call_completion]: 2
|
|
@send_msg[put_io_channel]: 4
|
|
@send_msg[_call_channel]: 4
|
|
```
|
|
|
|
## TODOs and known limitations
|
|
|
|
- add definitions for common nvmf data structures - then we can pass the subsystem pointer
|
|
itself as a probe parameter and let the script decide which fields it wants to access
|
|
(Note: these would need to be kept up-to-date with the C definitions of the struct - it is
|
|
not possible to include the header files in a bpftrace script)
|
|
- investigate using pahole to generate data structure definitions that can be included in
|
|
bpftrace scripts; this would allow us to pass the subsystem pointer itself as a probe
|
|
argument, and let the script decide which fields it wants to access; for example,
|
|
`pahole -E -C spdk_nvmf_subsystem build/bin/spdk_tgt` gets us close to what we need,
|
|
but there are some limiters:
|
|
- our structures have char arrays (not char pointers) for things like subnqn; large
|
|
arrays like these cannot currently be passed to bpftrace printf without generating
|
|
a stack space error (probe points are limited to 512 bytes of stack); we could
|
|
modify SPDK to have the char array for storage, and a char pointer that points to
|
|
that storage, the latter could easily then be used in bpftrace scripts
|
|
- our structures include fields with their enum types instead of int; bpftrace will
|
|
complain it does not know about the enum (pahole doesn't print out enum
|
|
descriptions); information on enums can be found in the applications .debug_info
|
|
section, but we would need something that can convert that into a file we can
|
|
include in a bpftrace script
|
|
- Note that bpftrace prints are not always printed in exact chronological order; this can
|
|
be seen especially with spdk_for_each_channel iterations, where we execute trace points
|
|
on multiple threads in a very short period of time, and those may not get printed to the
|
|
console in exact order; this is why the nvmf.bt script prints out a msec.nsec timestamp
|
|
so that the user can understand the ordering and even pipe through sort if desired
|
|
- flesh out more DTrace probes in the nvmf code
|