DPDK patches and discussions
From: Tudor Cornea <tudor.cornea@gmail.com>
To: ferruh.yigit@intel.com
Cc: padraig.j.connolly@intel.com, thomas@monjalon.net,
	stephen@networkplumber.org, helin.zhang@intel.com, dev@dpdk.org,
	Tudor Cornea <tudor.cornea@gmail.com>,
	Padraig Connolly <Padraig.J.Connolly@intel.com>
Subject: [PATCH v6] kni: allow configuring the kni thread granularity
Date: Thu, 20 Jan 2022 14:41:34 +0200
Message-ID: <20220120124134.4123542-1-tudor.cornea@gmail.com>
In-Reply-To: <1642173499-59396-1-git-send-email-tudor.cornea@gmail.com>

The Kni kthreads seem to be re-scheduled at a granularity of roughly
1 millisecond right now, which seems to be insufficient for performing
tests involving a lot of control plane traffic.

Even if KNI_KTHREAD_RESCHEDULE_INTERVAL is set to 5 microseconds, it
seems that the existing code cannot reschedule at the desired granularity,
due to precision constraints of schedule_timeout_interruptible().

In our use case, we leverage the Linux Kernel for control plane, and
it is not uncommon to have 60K - 100K pps for some signaling protocols.

Since we are not in atomic context, the usleep_range() function seems to be
more appropriate for introducing smaller controlled delays, in the range of
5-10 microseconds [1]. Upon reading the existing code, it would seem that
this was the original intent. Adding sub-millisecond delays seems unfeasible
with a call to schedule_timeout_interruptible().


Below, we attempt a brief comparison between the existing implementation,
which uses schedule_timeout_interruptible(), and usleep_range().

We attempt to measure the CPU usage, and RTT between two Kni interfaces,
which are created on top of vmxnet3 adapters, connected by a vSwitch.

insmod rte_kni.ko kthread_mode=single carrier=on

schedule_timeout_interruptible() (existing implementation)
kni_single CPU usage: 2-4%
[root@localhost ~]# ping -I eth1
PING ( from eth1: 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=2.70 ms
64 bytes from icmp_seq=2 ttl=64 time=1.00 ms
64 bytes from icmp_seq=3 ttl=64 time=1.99 ms
64 bytes from icmp_seq=4 ttl=64 time=0.985 ms
64 bytes from icmp_seq=5 ttl=64 time=1.00 ms

usleep_range(5, 10)
kni_single CPU usage: 50%
64 bytes from icmp_seq=1 ttl=64 time=0.338 ms
64 bytes from icmp_seq=2 ttl=64 time=0.150 ms
64 bytes from icmp_seq=3 ttl=64 time=0.123 ms
64 bytes from icmp_seq=4 ttl=64 time=0.139 ms
64 bytes from icmp_seq=5 ttl=64 time=0.159 ms

usleep_range(20, 50)
kni_single CPU usage: 24%
64 bytes from icmp_seq=1 ttl=64 time=0.202 ms
64 bytes from icmp_seq=2 ttl=64 time=0.170 ms
64 bytes from icmp_seq=3 ttl=64 time=0.171 ms
64 bytes from icmp_seq=4 ttl=64 time=0.248 ms
64 bytes from icmp_seq=5 ttl=64 time=0.185 ms

usleep_range(50, 100)
kni_single CPU usage: 13%
64 bytes from icmp_seq=1 ttl=64 time=0.537 ms
64 bytes from icmp_seq=2 ttl=64 time=0.257 ms
64 bytes from icmp_seq=3 ttl=64 time=0.231 ms
64 bytes from icmp_seq=4 ttl=64 time=0.143 ms
64 bytes from icmp_seq=5 ttl=64 time=0.200 ms

usleep_range(100, 200)
kni_single CPU usage: 7%
64 bytes from icmp_seq=1 ttl=64 time=0.716 ms
64 bytes from icmp_seq=2 ttl=64 time=0.167 ms
64 bytes from icmp_seq=3 ttl=64 time=0.459 ms
64 bytes from icmp_seq=4 ttl=64 time=0.455 ms
64 bytes from icmp_seq=5 ttl=64 time=0.252 ms

usleep_range(1000, 1100)
kni_single CPU usage: 2%
64 bytes from icmp_seq=1 ttl=64 time=2.22 ms
64 bytes from icmp_seq=2 ttl=64 time=1.17 ms
64 bytes from icmp_seq=3 ttl=64 time=1.17 ms
64 bytes from icmp_seq=4 ttl=64 time=1.17 ms
64 bytes from icmp_seq=5 ttl=64 time=1.15 ms

Upon testing, usleep_range(1000, 1100) seems roughly equivalent in
latency and CPU usage to the variant with schedule_timeout_interruptible().
usleep_range(100, 200) seems to give a decent tradeoff between latency and
CPU usage, so it is used as the default, and users can tweak the limits
for improved precision if they have such use cases.

Interestingly, disabling RTE_KNI_PREEMPT_DEFAULT seems to lead to a
soft lockup on my kernel.

Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 1226 Comm: kni_single Tainted: G        W  O 3.10 #1
 <IRQ>  [<ffffffff814f84de>] dump_stack+0x19/0x1b
 [<ffffffff814f7891>] panic+0xcd/0x1e0
 [<ffffffff810993b0>] watchdog_timer_fn+0x160/0x160
 [<ffffffff810644b2>] __run_hrtimer.isra.4+0x42/0xd0
 [<ffffffff81064b57>] hrtimer_interrupt+0xe7/0x1f0
 [<ffffffff8102cd57>] smp_apic_timer_interrupt+0x67/0xa0
 [<ffffffff8150321d>] apic_timer_interrupt+0x6d/0x80

This patch therefore also removes the RTE_KNI_PREEMPT_DEFAULT option.

[1] https://www.kernel.org/doc/Documentation/timers/timers-howto.txt

Signed-off-by: Tudor Cornea <tudor.cornea@gmail.com>
Acked-by: Padraig Connolly <Padraig.J.Connolly@intel.com>
Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
* Removed tabs and newline in the description of the
  min_scheduling_interval and max_scheduling_interval
  parameters. They seem to be non-standard.
  In a quick glance over the Linux tree, I saw some (rare)
  usages of newlines and tabs:
      drivers/scsi/bnx2fc/bnx2fc_fcoe.c (debug_logging)
  Fixing the other parameters might mean that we have to chop
  some text, otherwise the line could probably get too big.
* Rebased the patch on top of the dpdk-next-net branch
* Removed RTE_KNI_PREEMPT_DEFAULT configuration option
* Fixed unwrapped commit description warning
* Changed from hrtimers to Linux High Precision Timers in docs
* Added two tabs at the beginning of the new params description.
  Stephen correctly pointed out that the descriptions of the parameters
  for the Kni module are nonstandard w.r.t existing kernel code.
  I was thinking to preserve compatibility with the existing parameters
  of the Kni module for the moment, while an additional clean-up patch
  could format the descriptions to be closer to the kernel standard.
* Fixed some spelling errors
 config/rte_config.h                            |  3 ---
 doc/guides/prog_guide/kernel_nic_interface.rst | 33 ++++++++++++++++++++++++++
 kernel/linux/kni/kni_dev.h                     |  2 +-
 kernel/linux/kni/kni_misc.c                    | 32 ++++++++++++++++++-------
 4 files changed, 58 insertions(+), 12 deletions(-)

diff --git a/config/rte_config.h b/config/rte_config.h
index cab4390..91d96ee 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -95,9 +95,6 @@
-/* KNI defines */
 /* rte_graph defines */
diff --git a/doc/guides/prog_guide/kernel_nic_interface.rst b/doc/guides/prog_guide/kernel_nic_interface.rst
index 771c7d7..a0763c5 100644
--- a/doc/guides/prog_guide/kernel_nic_interface.rst
+++ b/doc/guides/prog_guide/kernel_nic_interface.rst
@@ -61,6 +61,10 @@ can be specified when the module is loaded to control its behavior:
                     userspace callback and supporting async requests (default=off):
                     on    Enable request processing support for bifurcated drivers.
+    parm:           min_scheduling_interval: Kni thread min scheduling interval (default=100 microseconds)
+                     (long)
+    parm:           max_scheduling_interval: Kni thread max scheduling interval (default=200 microseconds)
+                     (long)
 Loading the ``rte_kni`` kernel module without any optional parameters is
@@ -202,6 +206,35 @@ Enabling bifurcated device support releases ``rtnl`` lock before calling
 callback and locks it back after callback. Also enables asynchronous request to
 support callbacks that requires rtnl lock to work (interface down).
+KNI Kthread Scheduling
+The ``min_scheduling_interval`` and ``max_scheduling_interval`` parameters
+control the rescheduling interval of the KNI kthreads.
+This might be useful if we have use cases in which we require improved
+latency or performance for control plane traffic.
+The implementation is backed by Linux High Precision Timers, and uses ``usleep_range``.
+Hence, it will have the same granularity constraints as this Linux subsystem.
+For Linux High Precision Timers, you can check the following resource: `Kernel Timers <http://www.kernel.org/doc/Documentation/timers/timers-howto.txt>`_
+To set the ``min_scheduling_interval`` to a value of 100 microseconds:
+.. code-block:: console
+    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko min_scheduling_interval=100
+To set the ``max_scheduling_interval`` to a value of 200 microseconds:
+.. code-block:: console
+    # insmod <build_dir>/kernel/linux/kni/rte_kni.ko max_scheduling_interval=200
+If the ``min_scheduling_interval`` and ``max_scheduling_interval`` parameters are
+not specified, the default interval limits will be set to *100* and *200* microseconds respectively.
 KNI Creation and Deletion
diff --git a/kernel/linux/kni/kni_dev.h b/kernel/linux/kni/kni_dev.h
index e863348..a2c6d9f 100644
--- a/kernel/linux/kni/kni_dev.h
+++ b/kernel/linux/kni/kni_dev.h
@@ -27,7 +27,7 @@
 #include <linux/list.h>
 #include <rte_kni_common.h>
 #define MBUF_BURST_SZ 32
diff --git a/kernel/linux/kni/kni_misc.c b/kernel/linux/kni/kni_misc.c
index f10dcd0..45ef4c5 100644
--- a/kernel/linux/kni/kni_misc.c
+++ b/kernel/linux/kni/kni_misc.c
@@ -45,6 +45,10 @@ uint32_t kni_dflt_carrier;
 static char *enable_bifurcated;
 uint32_t bifurcated_support;
+/* Kni thread scheduling interval */
+static long min_scheduling_interval = 100; /* us */
+static long max_scheduling_interval = 200; /* us */
 #define KNI_DEV_IN_USE_BIT_NUM 0 /* Bit number for device in use */
 static int kni_net_id;
@@ -132,11 +136,8 @@ kni_thread_single(void *data)
 		/* reschedule out for a while */
-		schedule_timeout_interruptible(
+		usleep_range(min_scheduling_interval, max_scheduling_interval);
 	return 0;
@@ -153,10 +154,7 @@ kni_thread_multiple(void *param)
-		schedule_timeout_interruptible(
+		usleep_range(min_scheduling_interval, max_scheduling_interval);
 	return 0;
@@ -617,6 +615,14 @@ kni_init(void)
 	if (bifurcated_support == 1)
 		pr_debug("bifurcated support is enabled.\n");
+	if (min_scheduling_interval < 0 || max_scheduling_interval < 0 ||
+		min_scheduling_interval > KNI_KTHREAD_MAX_RESCHEDULE_INTERVAL ||
+		max_scheduling_interval > KNI_KTHREAD_MAX_RESCHEDULE_INTERVAL ||
+		min_scheduling_interval >= max_scheduling_interval) {
+		pr_err("Invalid parameters for scheduling interval\n");
+		return -EINVAL;
+	}
 	rc = register_pernet_subsys(&kni_net_ops);
@@ -692,3 +698,13 @@ MODULE_PARM_DESC(enable_bifurcated,
 "\t\ton    Enable request processing support for bifurcated drivers.\n"
+module_param(min_scheduling_interval, long, 0644);
+MODULE_PARM_DESC(min_scheduling_interval,
+"Kni thread min scheduling interval (default=100 microseconds)"
+);
+module_param(max_scheduling_interval, long, 0644);
+MODULE_PARM_DESC(max_scheduling_interval,
+"Kni thread max scheduling interval (default=200 microseconds)"
+);

Thread overview: 18+ messages
2021-11-02 10:38 [dpdk-dev] [PATCH] " Tudor Cornea
2021-11-02 15:51 ` [dpdk-dev] [PATCH v2] " Tudor Cornea
2021-11-02 15:53   ` Stephen Hemminger
2021-11-03 20:40     ` Tudor Cornea
2021-11-03 22:18       ` Stephen Hemminger
2021-11-08 10:13   ` [dpdk-dev] [PATCH v3] " Tudor Cornea
2021-11-22 17:31     ` Ferruh Yigit
2021-11-23 17:08       ` Ferruh Yigit
2021-11-24 17:10         ` Tudor Cornea
2021-11-24 19:24     ` [PATCH v4] " Tudor Cornea
2022-01-14 13:53       ` Connolly, Padraig J
2022-01-14 14:13       ` Ferruh Yigit
2022-01-14 15:18       ` [PATCH v5] " Tudor Cornea
2022-01-14 16:24         ` Stephen Hemminger
2022-01-14 16:43           ` Ferruh Yigit
2022-01-17 16:24             ` Tudor Cornea
2022-01-20 12:41         ` Tudor Cornea [this message]
2022-02-02 19:30           ` [PATCH v6] " Thomas Monjalon
