* [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device
@ 2021-04-16 11:04 Chengchang Tang
  2021-04-16 11:04 ` [dpdk-dev] [RFC 1/2] net/bonding: add Tx prepare for bonding Chengchang Tang
                   ` (4 more replies)
  0 siblings, 5 replies; 61+ messages in thread
From: Chengchang Tang @ 2021-04-16 11:04 UTC (permalink / raw)
  To: dev; +Cc: linuxarm, chas3, humin29, ferruh.yigit
This patch set adds Tx prepare support for the bonding device.
Currently, the bonding driver does not implement the tx_prepare callback
behind rte_eth_tx_prepare(), so the Tx prepare function of the slave
devices is never invoked. With some drivers, when hardware offloads such
as CKSUM and TSO are enabled, tx_prepare must be called to adjust the
packets (for example, to set correct pseudo headers). Otherwise the
offload fails, and packets may even be sent incorrectly. Because of this
limitation, the bonded device cannot use these HW offloads in the Tx
direction.
Because the bonding PMD has many complex packet distribution algorithms,
it is hard to design a tx_prepare callback for it. This patch set
therefore does not implement the tx_prepare callback of the bonding PMD.
Instead, rte_eth_tx_prepare() is called inside the tx_burst callback, and
a per-device flag controls whether the bonded device calls it. Users who
need Tx offloads that depend on tx_prepare should enable the flag; the
bonded device then calls rte_eth_tx_prepare() on the fast-path packets in
its tx_burst callback.
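
As a rough illustration of the flow described above, here is a minimal
sketch in plain C. It is a toy model, not the driver code: struct pkt,
slave_tx_prepare(), slave_tx_burst() and bond_tx_burst() are hypothetical
stand-ins for the mbuf and for rte_eth_tx_prepare()/rte_eth_tx_burst() on
a slave port, and the "hardware refuses unprepared packets" rule is an
assumption made only to keep the example observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet: stands in for an mbuf with an offload requested. */
struct pkt {
	bool needs_fixup; /* e.g. pseudo-header checksum not yet written */
	bool valid;       /* false: the driver cannot send this packet */
};

/* Toy prepare: fix up packets in order, stop at the first one the driver
 * rejects, and return how many leading packets are ready to send. */
static uint16_t slave_tx_prepare(struct pkt *pkts, uint16_t n)
{
	uint16_t i;

	for (i = 0; i < n; i++) {
		if (!pkts[i].valid)
			break;
		pkts[i].needs_fixup = false;
	}
	return i;
}

/* Toy burst (assumption): only packets that were fixed up are sent. */
static uint16_t slave_tx_burst(const struct pkt *pkts, uint16_t n)
{
	uint16_t i;

	for (i = 0; i < n; i++)
		if (pkts[i].needs_fixup)
			break;
	return i;
}

/* Bonding-style tx_burst: prepare is gated by a per-device flag, so the
 * cost with the flag off is a single extra branch. */
static uint16_t bond_tx_burst(bool tx_prepare_enabled, struct pkt *pkts,
			      uint16_t n)
{
	uint16_t nb_prep = n;

	assert(pkts != NULL || n == 0);
	if (tx_prepare_enabled)
		nb_prep = slave_tx_prepare(pkts, n);
	/* Send only the packets that prepare accepted. */
	return slave_tx_burst(pkts, nb_prep);
}
```

In this model, a burst of packets that need fixing up is only sent once
the flag is on, and a packet rejected by prepare caps the burst at the
packets before it.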
Chengchang Tang (2):
  net/bonding: add Tx prepare for bonding
  app/testpmd: add cmd for bonding Tx prepare
 app/test-pmd/cmdline.c                      | 66 +++++++++++++++++++++++++++++
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |  9 ++++
 drivers/net/bonding/eth_bond_private.h      |  1 +
 drivers/net/bonding/rte_eth_bond.h          | 29 +++++++++++++
 drivers/net/bonding/rte_eth_bond_api.c      | 28 ++++++++++++
 drivers/net/bonding/rte_eth_bond_pmd.c      | 33 +++++++++++++--
 drivers/net/bonding/version.map             |  5 +++
 7 files changed, 167 insertions(+), 4 deletions(-)
--
2.7.4
^ permalink raw reply	[flat|nested] 61+ messages in thread
* [dpdk-dev] [RFC 1/2] net/bonding: add Tx prepare for bonding
  2021-04-16 11:04 [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device Chengchang Tang
@ 2021-04-16 11:04 ` Chengchang Tang
  2021-04-16 11:04 ` [dpdk-dev] [RFC 2/2] app/testpmd: add cmd for bonding Tx prepare Chengchang Tang
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 61+ messages in thread
From: Chengchang Tang @ 2021-04-16 11:04 UTC (permalink / raw)
  To: dev; +Cc: linuxarm, chas3, humin29, ferruh.yigit
To use HW offload capabilities (e.g. checksum and TSO) in the Tx
direction, upper-layer users need to call rte_eth_tx_prepare() to adjust
the packets before sending them (e.g. to process pseudo headers when Tx
checksum offload is enabled). However, the bonding driver does not
implement the tx_prepare callback, so these offloads cannot be used
unless the upper-layer users process the packets properly in their own
application, which hurts portability.
However, it is difficult to design a tx_prepare callback for the bonding
driver. When a bonded device sends packets, it distributes them to
different slave devices based on the real-time link status and the
bonding mode, so it is very difficult for the bonding device to determine
which slave device's prepare function should be invoked. In addition, if
the link status changes after the packets are prepared, the packets may
fail to be sent because the packet distribution may change.
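
To make the link-status problem concrete, here is a small sketch of a
round-robin style selector. pick_slave() and the port numbers used with it
are hypothetical and only illustrate the idea; the real bonding modes use
their own distribution logic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical round-robin selector: packet number `seq` goes to the
 * (seq modulo nb_active)-th entry of the currently active slave list. */
static uint16_t pick_slave(const uint16_t *active, uint16_t nb_active,
			   uint32_t seq)
{
	assert(nb_active > 0);
	return active[seq % nb_active];
}
```

With active slaves {10, 11, 12}, packet 4 maps to port 11. If port 11 goes
down and the list becomes {10, 12}, the same packet maps to port 10, so a
packet prepared in advance for port 11 would be sent by a port whose
driver never prepared it.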
So this patch does not implement the tx_prepare callback of the bonding
driver. Instead, the prepare function of the slave device is called
inside the tx_burst callback, and a per-device flag controls whether the
bonded device calls rte_eth_tx_prepare(). Users who need the related
offloads should enable the flag; the bonded device then calls
rte_eth_tx_prepare() on the fast-path packets in its tx_burst callback.
Note:
rte_eth_tx_prepare() is not added to bond mode 3 (Broadcast). In
broadcast mode, a packet needs to be sent by all slave ports, and
different PMDs process packets differently in tx_prepare, so the sent
packet may be incorrect.
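
The broadcast concern can be sketched as follows, assuming the same packet
buffer is handed to every slave. The two prepare functions and the
checksum conventions they follow are hypothetical and do not describe any
particular PMD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet with a single L4 checksum field. */
struct pkt {
	uint16_t l4_cksum;
};

/* Hypothetical PMD A: its hardware expects the pseudo-header checksum. */
static void pmd_a_prepare(struct pkt *p, uint16_t pseudo_hdr_cksum)
{
	p->l4_cksum = pseudo_hdr_cksum;
}

/* Hypothetical PMD B: its hardware expects the field to be zero. */
static void pmd_b_prepare(struct pkt *p)
{
	p->l4_cksum = 0;
}

/* Prepare one shared packet for both slaves, as broadcast would need.
 * Returns the checksum field that both slaves would then read. */
static uint16_t broadcast_prepare(struct pkt *p, uint16_t pseudo_hdr_cksum)
{
	assert(p != NULL);
	pmd_a_prepare(p, pseudo_hdr_cksum);
	pmd_b_prepare(p); /* overwrites the value PMD A relies on */
	return p->l4_cksum;
}
```

Whichever prepare runs last wins, so at most one of the two slaves sees
the packet in the form its hardware expects.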
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
---
 drivers/net/bonding/eth_bond_private.h |  1 +
 drivers/net/bonding/rte_eth_bond.h     | 29 +++++++++++++++++++++++++++++
 drivers/net/bonding/rte_eth_bond_api.c | 28 ++++++++++++++++++++++++++++
 drivers/net/bonding/rte_eth_bond_pmd.c | 33 +++++++++++++++++++++++++++++----
 drivers/net/bonding/version.map        |  5 +++++
 5 files changed, 92 insertions(+), 4 deletions(-)
diff --git a/drivers/net/bonding/eth_bond_private.h b/drivers/net/bonding/eth_bond_private.h
index 75fb8dc..72ec4a0 100644
--- a/drivers/net/bonding/eth_bond_private.h
+++ b/drivers/net/bonding/eth_bond_private.h
@@ -126,6 +126,7 @@ struct bond_dev_private {
 	/**< Flag for whether MAC address is user defined or not */
 	uint8_t link_status_polling_enabled;
+	uint8_t tx_prepare_enabled;
 	uint32_t link_status_polling_interval_ms;
 	uint32_t link_down_delay_ms;
diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
index 874aa91..8ec09eb 100644
--- a/drivers/net/bonding/rte_eth_bond.h
+++ b/drivers/net/bonding/rte_eth_bond.h
@@ -343,6 +343,35 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
 int
 rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
+/**
+ * Enable Tx prepare for bonded port
+ *
+ * To perform some HW offloads in the Tx direction, some PMDs need to call
+ * rte_eth_tx_prepare to do some adjustment for packets. This function
+ * enables packets preparation in the fast path for bonded device.
+ *
+ * @param bonded_port_id      Bonded device id
+ *
+ * @return
+ *   0 on success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_eth_bond_tx_prepare_enable(uint16_t bonded_port_id);
+
+/**
+ * Disable Tx prepare for bonded port
+ *
+ * This function disables Tx prepare for the fast path packets.
+ *
+ * @param bonded_port_id      Bonded device id
+ *
+ * @return
+ *   0 on success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_eth_bond_tx_prepare_disable(uint16_t bonded_port_id);
 #ifdef __cplusplus
 }
diff --git a/drivers/net/bonding/rte_eth_bond_api.c b/drivers/net/bonding/rte_eth_bond_api.c
index 17e6ff8..b04806a 100644
--- a/drivers/net/bonding/rte_eth_bond_api.c
+++ b/drivers/net/bonding/rte_eth_bond_api.c
@@ -1050,3 +1050,31 @@ rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id)
 	return internals->link_up_delay_ms;
 }
+
+int
+rte_eth_bond_tx_prepare_enable(uint16_t bonded_port_id)
+{
+	struct bond_dev_private *internals;
+
+	if (valid_bonded_port_id(bonded_port_id) != 0)
+		return -1;
+
+	internals = rte_eth_devices[bonded_port_id].data->dev_private;
+	internals->tx_prepare_enabled = 1;
+
+	return 0;
+}
+
+int
+rte_eth_bond_tx_prepare_disable(uint16_t bonded_port_id)
+{
+	struct bond_dev_private *internals;
+
+	if (valid_bonded_port_id(bonded_port_id) != 0)
+		return -1;
+
+	internals = rte_eth_devices[bonded_port_id].data->dev_private;
+	internals->tx_prepare_enabled = 0;
+
+	return 0;
+}
diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index 2e9cea5..3b7870f 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
 	/* Send packet burst on each slave device */
 	for (i = 0; i < num_of_slaves; i++) {
 		if (slave_nb_pkts[i] > 0) {
+			int nb_prep_pkts = slave_nb_pkts[i];
+			if (internals->tx_prepare_enabled)
+				nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
+						bd_tx_q->queue_id,
+						slave_bufs[i], nb_prep_pkts);
+
 			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
-					slave_bufs[i], slave_nb_pkts[i]);
+					slave_bufs[i], nb_prep_pkts);
 			/* if tx burst fails move packets to end of bufs */
 			if (unlikely(num_tx_slave < slave_nb_pkts[i])) {
@@ -632,6 +638,7 @@ bond_ethdev_tx_burst_active_backup(void *queue,
 {
 	struct bond_dev_private *internals;
 	struct bond_tx_queue *bd_tx_q;
+	int nb_prep_pkts = nb_pkts;
 	bd_tx_q = (struct bond_tx_queue *)queue;
 	internals = bd_tx_q->dev_private;
@@ -639,8 +646,13 @@ bond_ethdev_tx_burst_active_backup(void *queue,
 	if (internals->active_slave_count < 1)
 		return 0;
+	if (internals->tx_prepare_enabled)
+		nb_prep_pkts =
+			rte_eth_tx_prepare(internals->current_primary_port,
+				bd_tx_q->queue_id, bufs, nb_prep_pkts);
+
 	return rte_eth_tx_burst(internals->current_primary_port, bd_tx_q->queue_id,
-			bufs, nb_pkts);
+			bufs, nb_prep_pkts);
 }
 static inline uint16_t
@@ -939,6 +951,7 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	}
 	for (i = 0; i < num_of_slaves; i++) {
+		int nb_prep_pkts;
 		rte_eth_macaddr_get(slaves[i], &active_slave_addr);
 		for (j = num_tx_total; j < nb_pkts; j++) {
 			if (j + 3 < nb_pkts)
@@ -955,8 +968,14 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 #endif
 		}
+		nb_prep_pkts = nb_pkts - num_tx_total;
+		if (internals->tx_prepare_enabled)
+			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
+					bd_tx_q->queue_id, bufs + num_tx_total,
+					nb_prep_pkts);
+
 		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
-				bufs + num_tx_total, nb_pkts - num_tx_total);
+				bufs + num_tx_total, nb_prep_pkts);
 		if (num_tx_total == nb_pkts)
 			break;
@@ -1159,12 +1178,18 @@ tx_burst_balance(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
 	/* Send packet burst on each slave device */
 	for (i = 0; i < slave_count; i++) {
+		int nb_prep_pkts;
 		if (slave_nb_bufs[i] == 0)
 			continue;
+		nb_prep_pkts = slave_nb_bufs[i];
+		if (internals->tx_prepare_enabled)
+			nb_prep_pkts = rte_eth_tx_prepare(slave_port_ids[i],
+					bd_tx_q->queue_id, slave_bufs[i],
+					nb_prep_pkts);
 		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
 				bd_tx_q->queue_id, slave_bufs[i],
-				slave_nb_bufs[i]);
+				nb_prep_pkts);
 		total_tx_count += slave_tx_count;
diff --git a/drivers/net/bonding/version.map b/drivers/net/bonding/version.map
index df81ee7..b642729 100644
--- a/drivers/net/bonding/version.map
+++ b/drivers/net/bonding/version.map
@@ -31,3 +31,8 @@ DPDK_21 {
 	local: *;
 };
+
+EXPERIMENTAL {
+	rte_eth_bond_tx_prepare_disable;
+	rte_eth_bond_tx_prepare_enable;
+};
--
2.7.4
^ permalink raw reply	[flat|nested] 61+ messages in thread
* [dpdk-dev] [RFC 2/2] app/testpmd: add cmd for bonding Tx prepare
  2021-04-16 11:04 [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device Chengchang Tang
  2021-04-16 11:04 ` [dpdk-dev] [RFC 1/2] net/bonding: add Tx prepare for bonding Chengchang Tang
@ 2021-04-16 11:04 ` Chengchang Tang
  2021-04-16 11:12 ` [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device Min Hu (Connor)
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 61+ messages in thread
From: Chengchang Tang @ 2021-04-16 11:04 UTC (permalink / raw)
  To: dev; +Cc: linuxarm, chas3, humin29, ferruh.yigit
Add a new command to enable/disable Tx prepare on each slave of a bonded
device. This helps to test Tx HW offloads (e.g. checksum and TSO) for
bonded devices in testpmd. The command is as follows:
set bonding tx_prepare <port_id> (enable|disable)
When this option is enabled, the bonding driver calls rte_eth_tx_prepare()
in the fast path to adjust the packets so that they meet the device's
requirements for the enabled HW offloads (e.g. processing pseudo headers
when Tx checksum offload is enabled). This helps the bonded device to use
more Tx offloads.
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
---
 app/test-pmd/cmdline.c                      | 66 +++++++++++++++++++++++++++++
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |  9 ++++
 2 files changed, 75 insertions(+)
diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index f44116b..2d1b3b6 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -647,6 +647,9 @@ static void cmd_help_long_parsed(void *parsed_result,
 			"set bonding lacp dedicated_queues <port_id> (enable|disable)\n"
 			"	Enable/disable dedicated queues for LACP control traffic.\n\n"
+			"set bonding tx_prepare <port_id> (enable|disable)\n"
+			"	Enable/disable tx_prepare for fast path traffic.\n\n"
+
 #endif
 			"set link-up port (port_id)\n"
 			"	Set link up for a port.\n\n"
@@ -5886,6 +5889,68 @@ cmdline_parse_inst_t cmd_set_lacp_dedicated_queues = {
 		}
 };
+/* *** SET BONDING TX_PREPARE *** */
+struct cmd_set_bonding_tx_prepare_result {
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t bonding;
+	cmdline_fixed_string_t tx_prepare;
+	portid_t port_id;
+	cmdline_fixed_string_t mode;
+};
+
+static void cmd_set_bonding_tx_prepare_parsed(void *parsed_result,
+		__rte_unused  struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_set_bonding_tx_prepare_result *res = parsed_result;
+	portid_t port_id = res->port_id;
+
+	if (!strcmp(res->mode, "enable")) {
+		if (rte_eth_bond_tx_prepare_enable(port_id) == 0)
+			printf("Tx prepare for bonding device enabled\n");
+		else
+			printf("Enabling bonding device Tx prepare "
+					"on port %d failed\n", port_id);
+	} else if (!strcmp(res->mode, "disable")) {
+		if (rte_eth_bond_tx_prepare_disable(port_id) == 0)
+			printf("Tx prepare for bonding device disabled\n");
+		else
+			printf("Disabling bonding device Tx prepare "
+					"on port %d failed\n", port_id);
+	}
+}
+
+cmdline_parse_token_string_t cmd_setbonding_tx_prepare_set =
+TOKEN_STRING_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+		set, "set");
+cmdline_parse_token_string_t cmd_setbonding_tx_prepare_bonding =
+TOKEN_STRING_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+		bonding, "bonding");
+cmdline_parse_token_string_t cmd_setbonding_tx_prepare_tx_prepare =
+TOKEN_STRING_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+		tx_prepare, "tx_prepare");
+cmdline_parse_token_num_t cmd_setbonding_tx_prepare_port_id =
+TOKEN_NUM_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+		port_id, RTE_UINT16);
+cmdline_parse_token_string_t cmd_setbonding_tx_prepare_mode =
+TOKEN_STRING_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+		mode, "enable#disable");
+
+cmdline_parse_inst_t cmd_set_bond_tx_prepare = {
+		.f = cmd_set_bonding_tx_prepare_parsed,
+		.help_str = "set bonding tx_prepare <port_id> enable|disable: "
+			"Enable/disable tx_prepare for port_id",
+		.data = NULL,
+		.tokens = {
+			(void *)&cmd_setbonding_tx_prepare_set,
+			(void *)&cmd_setbonding_tx_prepare_bonding,
+			(void *)&cmd_setbonding_tx_prepare_tx_prepare,
+			(void *)&cmd_setbonding_tx_prepare_port_id,
+			(void *)&cmd_setbonding_tx_prepare_mode,
+			NULL
+		}
+};
+
 /* *** SET BALANCE XMIT POLICY *** */
 struct cmd_set_bonding_balance_xmit_policy_result {
 	cmdline_fixed_string_t set;
@@ -16966,6 +17031,7 @@ cmdline_parse_ctx_t main_ctx[] = {
 	(cmdline_parse_inst_t *) &cmd_set_balance_xmit_policy,
 	(cmdline_parse_inst_t *) &cmd_set_bond_mon_period,
 	(cmdline_parse_inst_t *) &cmd_set_lacp_dedicated_queues,
+	(cmdline_parse_inst_t *) &cmd_set_bond_tx_prepare,
 	(cmdline_parse_inst_t *) &cmd_set_bonding_agg_mode_policy,
 #endif
 	(cmdline_parse_inst_t *)&cmd_vlan_offload,
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index 36f0a32..bdbf1ea 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -2590,6 +2590,15 @@ when in mode 4 (link-aggregation-802.3ad)::
    testpmd> set bonding lacp dedicated_queues (port_id) (enable|disable)
+set bonding tx_prepare
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Enable Tx prepare on bonding devices to help the slave devices prepare the
+packets for some HW offloading (e.g. checksum and TSO)::
+
+   testpmd> set bonding tx_prepare (port_id) (enable|disable)
+
+
 set bonding agg_mode
 ~~~~~~~~~~~~~~~~~~~~
--
2.7.4
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device
  2021-04-16 11:04 [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device Chengchang Tang
  2021-04-16 11:04 ` [dpdk-dev] [RFC 1/2] net/bonding: add Tx prepare for bonding Chengchang Tang
  2021-04-16 11:04 ` [dpdk-dev] [RFC 2/2] app/testpmd: add cmd for bonding Tx prepare Chengchang Tang
@ 2021-04-16 11:12 ` Min Hu (Connor)
  2021-04-20  1:26 ` Ferruh Yigit
  2021-04-23  9:46 ` [dpdk-dev] [PATCH " Chengchang Tang
  4 siblings, 0 replies; 61+ messages in thread
From: Min Hu (Connor) @ 2021-04-16 11:12 UTC (permalink / raw)
  To: Chengchang Tang, dev; +Cc: linuxarm, chas3, ferruh.yigit
Looks good to me.
On 2021/4/16 19:04, Chengchang Tang wrote:
> This patch add Tx prepare for bonding device.
> 
> Currently, the bonding driver has not implemented the callback of
> rte_eth_tx_prepare function. Therefore, the TX prepare function of the
> slave devices will never be invoked. When hardware offloading such as
> CKSUM and TSO are enabled for some drivers, tx_prepare needs to be used
> to adjust packets (for example, set correct pseudo packet headers).
> Otherwise, related offloading fails and even packets are sent
> incorrectly. Due to this limitation, the bonded device cannot use these
> HW offloading in the Tx direction.
> 
> Because packet sending algorithms are numerous and complex in bond PMD,
> it is hard to design the callback for rte_eth_tx_prepare. In this patch,
> the tx_prepare callback of bonding PMD is not implemented. Instead,
> rte_eth_tx_prepare has been called in tx_burst callback. And a global
> variable is introduced to control whether the bonded device need call
> the rte_eth_tx_prepare. If upper-layer users need to use some TX
> offloading that depend on tx_prepare , they should enable the preparation
> function. In this way, the bonded device will call the rte_eth_tx_prepare
> for the fast path packets in the tx_burst callback.
> 
> Chengchang Tang (2):
>    net/bonding: add Tx prepare for bonding
>    app/testpmd: add cmd for bonding Tx prepare
> 
>   app/test-pmd/cmdline.c                      | 66 +++++++++++++++++++++++++++++
>   doc/guides/testpmd_app_ug/testpmd_funcs.rst |  9 ++++
>   drivers/net/bonding/eth_bond_private.h      |  1 +
>   drivers/net/bonding/rte_eth_bond.h          | 29 +++++++++++++
>   drivers/net/bonding/rte_eth_bond_api.c      | 28 ++++++++++++
>   drivers/net/bonding/rte_eth_bond_pmd.c      | 33 +++++++++++++--
>   drivers/net/bonding/version.map             |  5 +++
>   7 files changed, 167 insertions(+), 4 deletions(-)
> 
> --
> 2.7.4
> 
> .
> 
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device
  2021-04-16 11:04 [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device Chengchang Tang
                   ` (2 preceding siblings ...)
  2021-04-16 11:12 ` [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device Min Hu (Connor)
@ 2021-04-20  1:26 ` Ferruh Yigit
  2021-04-20  2:44   ` Chengchang Tang
  2021-04-23  9:46 ` [dpdk-dev] [PATCH " Chengchang Tang
  4 siblings, 1 reply; 61+ messages in thread
From: Ferruh Yigit @ 2021-04-20  1:26 UTC (permalink / raw)
  To: Chengchang Tang, dev; +Cc: linuxarm, chas3, humin29
On 4/16/2021 12:04 PM, Chengchang Tang wrote:
> This patch add Tx prepare for bonding device.
> 
> Currently, the bonding driver has not implemented the callback of
> rte_eth_tx_prepare function. Therefore, the TX prepare function of the
> slave devices will never be invoked. When hardware offloading such as
> CKSUM and TSO are enabled for some drivers, tx_prepare needs to be used
> to adjust packets (for example, set correct pseudo packet headers).
> Otherwise, related offloading fails and even packets are sent
> incorrectly. Due to this limitation, the bonded device cannot use these
> HW offloading in the Tx direction.
> 
> Because packet sending algorithms are numerous and complex in bond PMD,
> it is hard to design the callback for rte_eth_tx_prepare. In this patch,
> the tx_prepare callback of bonding PMD is not implemented. Instead,
> rte_eth_tx_prepare has been called in tx_burst callback. And a global
> variable is introduced to control whether the bonded device need call
> the rte_eth_tx_prepare. If upper-layer users need to use some TX
> offloading that depend on tx_prepare , they should enable the preparation
> function. In this way, the bonded device will call the rte_eth_tx_prepare
> for the fast path packets in the tx_burst callback.
> 
What do you think about adding a devarg to the bonding PMD to control tx_prepare?
It won't be as dynamic as an API, since an API makes it possible to change the
behavior after the application has started, but do we really need that?
> Chengchang Tang (2):
>    net/bonding: add Tx prepare for bonding
>    app/testpmd: add cmd for bonding Tx prepare
> 
>   app/test-pmd/cmdline.c                      | 66 +++++++++++++++++++++++++++++
>   doc/guides/testpmd_app_ug/testpmd_funcs.rst |  9 ++++
>   drivers/net/bonding/eth_bond_private.h      |  1 +
>   drivers/net/bonding/rte_eth_bond.h          | 29 +++++++++++++
>   drivers/net/bonding/rte_eth_bond_api.c      | 28 ++++++++++++
>   drivers/net/bonding/rte_eth_bond_pmd.c      | 33 +++++++++++++--
>   drivers/net/bonding/version.map             |  5 +++
>   7 files changed, 167 insertions(+), 4 deletions(-)
> 
> --
> 2.7.4
> 
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device
  2021-04-20  1:26 ` Ferruh Yigit
@ 2021-04-20  2:44   ` Chengchang Tang
  2021-04-20  8:33     ` Ananyev, Konstantin
  0 siblings, 1 reply; 61+ messages in thread
From: Chengchang Tang @ 2021-04-20  2:44 UTC (permalink / raw)
  To: Ferruh Yigit, dev; +Cc: linuxarm, chas3, humin29
On 2021/4/20 9:26, Ferruh Yigit wrote:
> On 4/16/2021 12:04 PM, Chengchang Tang wrote:
>> This patch add Tx prepare for bonding device.
>>
>> Currently, the bonding driver has not implemented the callback of
>> rte_eth_tx_prepare function. Therefore, the TX prepare function of the
>> slave devices will never be invoked. When hardware offloading such as
>> CKSUM and TSO are enabled for some drivers, tx_prepare needs to be used
>> to adjust packets (for example, set correct pseudo packet headers).
>> Otherwise, related offloading fails and even packets are sent
>> incorrectly. Due to this limitation, the bonded device cannot use these
>> HW offloading in the Tx direction.
>>
>> Because packet sending algorithms are numerous and complex in bond PMD,
>> it is hard to design the callback for rte_eth_tx_prepare. In this patch,
>> the tx_prepare callback of bonding PMD is not implemented. Instead,
>> rte_eth_tx_prepare has been called in tx_burst callback. And a global
>> variable is introduced to control whether the bonded device need call
>> the rte_eth_tx_prepare. If upper-layer users need to use some TX
>> offloading that depend on tx_prepare , they should enable the preparation
>> function. In this way, the bonded device will call the rte_eth_tx_prepare
>> for the fast path packets in the tx_burst callback.
>>
> 
> What do you think to add a devarg to bonding PMD to control the tx_prepare?
> It won't be as dynamic as API, since it can be possible to change the behavior after application is started with API, but do we really need this?
Without an API, unnecessary constraints may be introduced. If the bonding
device is created through the rte_eth_bond_create() interface instead of
the "vdev" devarg, this feature cannot be used, because devargs do not
take effect in that case. From an ease-of-use perspective, though, adding
a devarg is a good idea, and I will add it in the later official patches.
If I understand correctly, the community currently does not want to
introduce more PMD-private APIs. However, the absence of an API here would
introduce some unnecessary constraints, so I think adding an API is
necessary.
> 
>> Chengchang Tang (2):
>>    net/bonding: add Tx prepare for bonding
>>    app/testpmd: add cmd for bonding Tx prepare
>>
>>   app/test-pmd/cmdline.c                      | 66 +++++++++++++++++++++++++++++
>>   doc/guides/testpmd_app_ug/testpmd_funcs.rst |  9 ++++
>>   drivers/net/bonding/eth_bond_private.h      |  1 +
>>   drivers/net/bonding/rte_eth_bond.h          | 29 +++++++++++++
>>   drivers/net/bonding/rte_eth_bond_api.c      | 28 ++++++++++++
>>   drivers/net/bonding/rte_eth_bond_pmd.c      | 33 +++++++++++++--
>>   drivers/net/bonding/version.map             |  5 +++
>>   7 files changed, 167 insertions(+), 4 deletions(-)
>>
>> -- 
>> 2.7.4
>>
> 
> 
> .
> 
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device
  2021-04-20  2:44   ` Chengchang Tang
@ 2021-04-20  8:33     ` Ananyev, Konstantin
  2021-04-20 12:44       ` Chengchang Tang
  0 siblings, 1 reply; 61+ messages in thread
From: Ananyev, Konstantin @ 2021-04-20  8:33 UTC (permalink / raw)
  To: Chengchang Tang, Yigit, Ferruh, dev; +Cc: linuxarm, chas3, humin29
Hi everyone,
> 
> On 2021/4/20 9:26, Ferruh Yigit wrote:
> > On 4/16/2021 12:04 PM, Chengchang Tang wrote:
> >> This patch add Tx prepare for bonding device.
> >>
> >> Currently, the bonding driver has not implemented the callback of
> >> rte_eth_tx_prepare function. Therefore, the TX prepare function of the
> >> slave devices will never be invoked. When hardware offloading such as
> >> CKSUM and TSO are enabled for some drivers, tx_prepare needs to be used
> >> to adjust packets (for example, set correct pseudo packet headers).
> >> Otherwise, related offloading fails and even packets are sent
> >> incorrectly. Due to this limitation, the bonded device cannot use these
> >> HW offloading in the Tx direction.
> >>
> >> Because packet sending algorithms are numerous and complex in bond PMD,
> >> it is hard to design the callback for rte_eth_tx_prepare. In this patch,
> >> the tx_prepare callback of bonding PMD is not implemented. Instead,
> >> rte_eth_tx_prepare has been called in tx_burst callback. And a global
> >> variable is introduced to control whether the bonded device need call
> >> the rte_eth_tx_prepare. If upper-layer users need to use some TX
> >> offloading that depend on tx_prepare , they should enable the preparation
> >> function. In this way, the bonded device will call the rte_eth_tx_prepare
> >> for the fast path packets in the tx_burst callback.
I admit that I haven't looked at the implementation yet, but it sounds
overcomplicated to me. Can't we just have a new TX function for the bonding
PMD when TX offloads are enabled? Inside that function we would do:
tx_prepare(); tx_burst(); for the selected device.
We can select this function at the setup stage by analysing the TX offloads
requested by the user.
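
A minimal sketch of that idea, in plain C with stubs instead of the DPDK
calls. The offload bits, OFFLOADS_NEEDING_PREPARE, the stub_* functions
and select_tx_fn() are all hypothetical names chosen for illustration;
which offloads really require tx_prepare is exactly the open question in
this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical offload bits, not the real DPDK flag values. */
#define OFFLOAD_CKSUM (UINT64_C(1) << 0)
#define OFFLOAD_TSO   (UINT64_C(1) << 1)
/* Assumption: these are the offloads that need tx_prepare. */
#define OFFLOADS_NEEDING_PREPARE (OFFLOAD_CKSUM | OFFLOAD_TSO)

typedef uint16_t (*tx_fn_t)(uint16_t nb_pkts);

/* Counts prepare invocations so the choice of function is observable. */
static unsigned int prepare_calls;

/* Stubs standing in for the slave's tx_prepare and tx_burst. */
static uint16_t stub_prepare(uint16_t n) { prepare_calls++; return n; }
static uint16_t stub_burst(uint16_t n) { return n; }

/* Two Tx variants: without and with the prepare step. */
static uint16_t tx_plain(uint16_t n) { return stub_burst(n); }
static uint16_t tx_with_prepare(uint16_t n)
{
	return stub_burst(stub_prepare(n));
}

/* Setup stage: pick the Tx function once from the requested offloads. */
static tx_fn_t select_tx_fn(uint64_t requested_offloads)
{
	return (requested_offloads & OFFLOADS_NEEDING_PREPARE) ?
		tx_with_prepare : tx_plain;
}

/* Fast path: no per-burst flag check, just call the selected function. */
static uint16_t bond_tx(tx_fn_t fn, uint16_t n)
{
	assert(fn != NULL);
	return fn(n);
}
```

The design trade-off is that the fast path carries no extra branch, but
the driver has to know at setup time which offloads need the prepare
step.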
> >>
> >
> > What do you think to add a devarg to bonding PMD to control the tx_prepare?
> > It won't be as dynamic as API, since it can be possible to change the behavior after application is started with API, but do we really need
> this?
> 
> If an API is not added, unnecessary constraints may be introduced. If the
> bonding device is created through the rte_eth_bond_create interface instead
> devarg "vdev", this function cannot be used because devargs does not take effect
> in this case. But from an ease-of-use perspective, adding a devarg is a good
> idea. I will add related implementations in the later official patches.
I am also against introducing a new devarg to control tx_prepare() invocation.
I think that at the dev_config/queue_setup phase the PMD will have enough
information to decide.
> 
> If I understand correctly, the current community does not want to introduce
> more private APIs for PMDs. However, the absence of an API on this issue would
> introduce some unnecessary constraints, and from that point of view, I think
> adding an API seems necessary.
> >
> >> Chengchang Tang (2):
> >>    net/bonding: add Tx prepare for bonding
> >>    app/testpmd: add cmd for bonding Tx prepare
> >>
> >>   app/test-pmd/cmdline.c                      | 66 +++++++++++++++++++++++++++++
> >>   doc/guides/testpmd_app_ug/testpmd_funcs.rst |  9 ++++
> >>   drivers/net/bonding/eth_bond_private.h      |  1 +
> >>   drivers/net/bonding/rte_eth_bond.h          | 29 +++++++++++++
> >>   drivers/net/bonding/rte_eth_bond_api.c      | 28 ++++++++++++
> >>   drivers/net/bonding/rte_eth_bond_pmd.c      | 33 +++++++++++++--
> >>   drivers/net/bonding/version.map             |  5 +++
> >>   7 files changed, 167 insertions(+), 4 deletions(-)
> >>
> >> --
> >> 2.7.4
> >>
> >
> >
> > .
> >
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device
  2021-04-20  8:33     ` Ananyev, Konstantin
@ 2021-04-20 12:44       ` Chengchang Tang
  2021-04-20 13:18         ` Ananyev, Konstantin
  0 siblings, 1 reply; 61+ messages in thread
From: Chengchang Tang @ 2021-04-20 12:44 UTC (permalink / raw)
  To: Ananyev, Konstantin, Yigit, Ferruh, dev; +Cc: linuxarm, chas3, humin29
Hi
On 2021/4/20 16:33, Ananyev, Konstantin wrote:
> Hi everyone,
> 
>>
>> On 2021/4/20 9:26, Ferruh Yigit wrote:
>>> On 4/16/2021 12:04 PM, Chengchang Tang wrote:
>>>> This patch add Tx prepare for bonding device.
>>>>
>>>> Currently, the bonding driver has not implemented the callback of
>>>> rte_eth_tx_prepare function. Therefore, the TX prepare function of the
>>>> slave devices will never be invoked. When hardware offloading such as
>>>> CKSUM and TSO are enabled for some drivers, tx_prepare needs to be used
>>>> to adjust packets (for example, set correct pseudo packet headers).
>>>> Otherwise, related offloading fails and even packets are sent
>>>> incorrectly. Due to this limitation, the bonded device cannot use these
>>>> HW offloading in the Tx direction.
>>>>
>>>> Because packet sending algorithms are numerous and complex in bond PMD,
>>>> it is hard to design the callback for rte_eth_tx_prepare. In this patch,
>>>> the tx_prepare callback of bonding PMD is not implemented. Instead,
>>>> rte_eth_tx_prepare has been called in tx_burst callback. And a global
>>>> variable is introduced to control whether the bonded device need call
>>>> the rte_eth_tx_prepare. If upper-layer users need to use some TX
>>>> offloading that depend on tx_prepare , they should enable the preparation
>>>> function. In this way, the bonded device will call the rte_eth_tx_prepare
>>>> for the fast path packets in the tx_burst callback.
> 
> I admit that I didn't look at the implementation yet, but it sounds like 
> overcomplication to me. Can't we just have a new TX function for bonding PMD
> when TX offloads are enabled? And inside that function we will do:
> tx_prepare(); tx_burst(); for selected device.
The solution you mentioned is workable and may perform better. However, the
current solution is also simple and has a limited impact on performance. It
is effectively:
if (tx_prepare_enable)
	tx_prepare();
tx_burst();
Overall, when the related Tx offloads are not turned on, it adds little more
than one branch.
> We can select this function at setup stage analysing requested by user TX offloads. 
> 
In PMDs, it is common practice to select different Tx/Rx functions during the
setup phase. But for a 'vdev' device like bonding, we may need to think more
about it. The reasons are explained below.
> 
>>>>
>>>
>>> What do you think to add a devarg to bonding PMD to control the tx_prepare?
>>> It won't be as dynamic as API, since it can be possible to change the behavior after application is started with API, but do we really need
>> this?
>>
>> If an API is not added, unnecessary constraints may be introduced. If the
>> bonding device is created through the rte_eth_bond_create interface instead
>> devarg "vdev", this function cannot be used because devargs does not take effect
>> in this case. But from an ease-of-use perspective, adding a devarg is a good
>> idea. I will add related implementations in the later official patches.
> 
> I am also against introducing new devarg to control tx_prepare() invocation.
> I think at dev_config/queue_setup phase PMD will have enough information to decide.
> 
Currently, the community does not specify which Tx offloads need to invoke tx_prepare.
For vdev devices such as bonding, all NIC devices need to be considered. Generally,
tx_prepare is used for CKSUM and TSO. It is possible that some NIC devices do not need
tx_prepare even for CKSUM and TSO, while other NIC devices need it for additional
Tx offloads. From this perspective, leaving the choice to the user seems to be the
better option.
>>
>> If I understand correctly, the current community does not want to introduce
>> more private APIs for PMDs. However, the absence of an API on this issue would
>> introduce some unnecessary constraints, and from that point of view, I think
>> adding an API seems necessary.
>>>
>>>> Chengchang Tang (2):
>>>>    net/bonding: add Tx prepare for bonding
>>>>    app/testpmd: add cmd for bonding Tx prepare
>>>>
>>>>   app/test-pmd/cmdline.c                      | 66 +++++++++++++++++++++++++++++
>>>>   doc/guides/testpmd_app_ug/testpmd_funcs.rst |  9 ++++
>>>>   drivers/net/bonding/eth_bond_private.h      |  1 +
>>>>   drivers/net/bonding/rte_eth_bond.h          | 29 +++++++++++++
>>>>   drivers/net/bonding/rte_eth_bond_api.c      | 28 ++++++++++++
>>>>   drivers/net/bonding/rte_eth_bond_pmd.c      | 33 +++++++++++++--
>>>>   drivers/net/bonding/version.map             |  5 +++
>>>>   7 files changed, 167 insertions(+), 4 deletions(-)
>>>>
>>>> --
>>>> 2.7.4
>>>>
>>>
>>>
>>> .
>>>
> 
* Re: [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device
  2021-04-20 12:44       ` Chengchang Tang
@ 2021-04-20 13:18         ` Ananyev, Konstantin
  2021-04-20 14:06           ` Chengchang Tang
  0 siblings, 1 reply; 61+ messages in thread
From: Ananyev, Konstantin @ 2021-04-20 13:18 UTC (permalink / raw)
  To: Chengchang Tang, Yigit, Ferruh, dev; +Cc: linuxarm, chas3, humin29
> -----Original Message-----
> From: Chengchang Tang <tangchengchang@huawei.com>
> Sent: Tuesday, April 20, 2021 1:44 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; dev@dpdk.org
> Cc: linuxarm@huawei.com; chas3@att.com; humin29@huawei.com
> Subject: Re: [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device
> 
> Hi
> On 2021/4/20 16:33, Ananyev, Konstantin wrote:
> > Hi everyone,
> >
> >>
> >> On 2021/4/20 9:26, Ferruh Yigit wrote:
> >>> On 4/16/2021 12:04 PM, Chengchang Tang wrote:
> >>>> This patch add Tx prepare for bonding device.
> >>>>
> >>>> Currently, the bonding driver has not implemented the callback of
> >>>> rte_eth_tx_prepare function. Therefore, the TX prepare function of the
> >>>> slave devices will never be invoked. When hardware offloading such as
> >>>> CKSUM and TSO are enabled for some drivers, tx_prepare needs to be used
> >>>> to adjust packets (for example, set correct pseudo packet headers).
> >>>> Otherwise, related offloading fails and even packets are sent
> >>>> incorrectly. Due to this limitation, the bonded device cannot use these
> >>>> HW offloading in the Tx direction.
> >>>>
> >>>> Because packet sending algorithms are numerous and complex in bond PMD,
> >>>> it is hard to design the callback for rte_eth_tx_prepare. In this patch,
> >>>> the tx_prepare callback of bonding PMD is not implemented. Instead,
> >>>> rte_eth_tx_prepare has been called in tx_burst callback. And a global
> >>>> variable is introduced to control whether the bonded device need call
> >>>> the rte_eth_tx_prepare. If upper-layer users need to use some TX
> >>>> offloading that depend on tx_prepare , they should enable the preparation
> >>>> function. In this way, the bonded device will call the rte_eth_tx_prepare
> >>>> for the fast path packets in the tx_burst callback.
> >
> > I admit that I didn't look at the implementation yet, but it sounds like
> > overcomplication to me. Can't we just have a new TX function for bonding PMD
> > when TX offloads are enabled? And inside that function we will do:
> > tx_prepare(); tx_burst(); for selected device.
> 
> The solution you mentioned is workable and may perform better. However, the current
> solution is also simple and has a limited impact on performance. It is actually:
> if (tx_prepare_enable)
> 	tx_prepare();
> tx_burst();
> 
> Overall, it adds almost only one judgment to the case where the related Tx offloads
> is not turned on.
> 
> > We can select this function at setup stage analysing requested by user TX offloads.
> >
> 
> In PMDs, it is a common practice to select different Tx/Rx function during the setup
> phase. But for a 'vdev' device like Bonding, we may need to think more about it.
> The reasons are explained below.
> >
> >>>>
> >>>
> >>> What do you think to add a devarg to bonding PMD to control the tx_prepare?
> >>> It won't be as dynamic as API, since it can be possible to change the behavior after application is started with API, but do we really need
> >> this?
> >>
> >> If an API is not added, unnecessary constraints may be introduced. If the
> >> bonding device is created through the rte_eth_bond_create interface instead
> >> devarg "vdev", this function cannot be used because devargs does not take effect
> >> in this case. But from an ease-of-use perspective, adding a devarg is a good
> >> idea. I will add related implementations in the later official patches.
> >
> > I am also against introducing new devarg to control tx_prepare() invocation.
> > I think at dev_config/queue_setup phase PMD will have enough information to decide.
> >
> Currently, the community does not specify which Tx offloads need to invoke tx_prepare.
I think inside the bond PMD we can safely assume that any TX offload needs tx_prepare().
If that's not the case, then the slave dev tx_prepare pointer will be NULL and
rte_eth_tx_prepare() will be just a NOOP.
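A small sketch of the behaviour described above, under stated assumptions: `struct mock_dev`, `mock_tx_prepare()` and `needs_prepare()` are hypothetical names that model a wrapper in the style of rte_eth_tx_prepare(); this is not the ethdev implementation itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an mbuf; not a DPDK type. */
struct pkt {
	int adjusted; /* number of times a prepare callback touched the packet */
};

/* Hypothetical stand-in for a slave device: the prepare callback is optional. */
struct mock_dev {
	uint16_t (*tx_pkt_prepare)(struct pkt **pkts, uint16_t n);
};

/* Example driver callback for a device that needs packets adjusted. */
static uint16_t
needs_prepare(struct pkt **pkts, uint16_t n)
{
	for (uint16_t i = 0; i < n; i++)
		pkts[i]->adjusted++;
	return n;
}

/*
 * Wrapper modelled on the behaviour described in the thread: when the
 * driver registers no callback, report every packet as ready and return.
 */
static uint16_t
mock_tx_prepare(const struct mock_dev *dev, struct pkt **pkts, uint16_t n)
{
	assert(dev != NULL);
	if (dev->tx_pkt_prepare == NULL)
		return n; /* no-op: nothing to adjust for this device */
	return dev->tx_pkt_prepare(pkts, n);
}
```

Because the NULL check lives in the wrapper, the caller can invoke it unconditionally: a device without a callback pays only for the pointer test.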
> For Vdev devices such as bond, all NIC devices need to be considered. Generally,
> tx_prepare is used in CKSUM and TSO. It is possible that for some NIC devices, even
> CKSUM and TSO do not need to invoke tx_prepare, or for some NIC devices, there are
> other Tx offloads that need to call tx_prepare. From this perspective, leaving the
> choice to the user seems to be a better choice.
I wonder how the user will know when to enable/disable it?
As you said, it depends on the underlying HW/PMD and can change from system to system.
I think it is the PMD that needs to make this decision, and I think the safest bet might be to enable
it whenever any TX offload is enabled by the user.
> >>
> >> If I understand correctly, the current community does not want to introduce
> >> more private APIs for PMDs. However, the absence of an API on this issue would
> >> introduce some unnecessary constraints, and from that point of view, I think
> >> adding an API seems necessary.
> >>>
> >>>> Chengchang Tang (2):
> >>>>    net/bonding: add Tx prepare for bonding
> >>>>    app/testpmd: add cmd for bonding Tx prepare
> >>>>
> >>>>   app/test-pmd/cmdline.c                      | 66 +++++++++++++++++++++++++++++
> >>>>   doc/guides/testpmd_app_ug/testpmd_funcs.rst |  9 ++++
> >>>>   drivers/net/bonding/eth_bond_private.h      |  1 +
> >>>>   drivers/net/bonding/rte_eth_bond.h          | 29 +++++++++++++
> >>>>   drivers/net/bonding/rte_eth_bond_api.c      | 28 ++++++++++++
> >>>>   drivers/net/bonding/rte_eth_bond_pmd.c      | 33 +++++++++++++--
> >>>>   drivers/net/bonding/version.map             |  5 +++
> >>>>   7 files changed, 167 insertions(+), 4 deletions(-)
> >>>>
> >>>> --
> >>>> 2.7.4
> >>>>
> >>>
> >>>
> >>> .
> >>>
> >
* Re: [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device
  2021-04-20 13:18         ` Ananyev, Konstantin
@ 2021-04-20 14:06           ` Chengchang Tang
  0 siblings, 0 replies; 61+ messages in thread
From: Chengchang Tang @ 2021-04-20 14:06 UTC (permalink / raw)
  To: Ananyev, Konstantin, Yigit, Ferruh, dev; +Cc: linuxarm, chas3, humin29
On 2021/4/20 21:18, Ananyev, Konstantin wrote:
> 
> 
>> -----Original Message-----
>> From: Chengchang Tang <tangchengchang@huawei.com>
>> Sent: Tuesday, April 20, 2021 1:44 PM
>> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Yigit, Ferruh <ferruh.yigit@intel.com>; dev@dpdk.org
>> Cc: linuxarm@huawei.com; chas3@att.com; humin29@huawei.com
>> Subject: Re: [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device
>>
>> Hi
>> On 2021/4/20 16:33, Ananyev, Konstantin wrote:
>>> Hi everyone,
>>>
>>>>
>>>> On 2021/4/20 9:26, Ferruh Yigit wrote:
>>>>> On 4/16/2021 12:04 PM, Chengchang Tang wrote:
>>>>>> This patch add Tx prepare for bonding device.
>>>>>>
>>>>>> Currently, the bonding driver has not implemented the callback of
>>>>>> rte_eth_tx_prepare function. Therefore, the TX prepare function of the
>>>>>> slave devices will never be invoked. When hardware offloading such as
>>>>>> CKSUM and TSO are enabled for some drivers, tx_prepare needs to be used
>>>>>> to adjust packets (for example, set correct pseudo packet headers).
>>>>>> Otherwise, related offloading fails and even packets are sent
>>>>>> incorrectly. Due to this limitation, the bonded device cannot use these
>>>>>> HW offloading in the Tx direction.
>>>>>>
>>>>>> Because packet sending algorithms are numerous and complex in bond PMD,
>>>>>> it is hard to design the callback for rte_eth_tx_prepare. In this patch,
>>>>>> the tx_prepare callback of bonding PMD is not implemented. Instead,
>>>>>> rte_eth_tx_prepare has been called in tx_burst callback. And a global
>>>>>> variable is introduced to control whether the bonded device need call
>>>>>> the rte_eth_tx_prepare. If upper-layer users need to use some TX
>>>>>> offloading that depend on tx_prepare , they should enable the preparation
>>>>>> function. In this way, the bonded device will call the rte_eth_tx_prepare
>>>>>> for the fast path packets in the tx_burst callback.
>>>
>>> I admit that I didn't look at the implementation yet, but it sounds like
>>> overcomplication to me. Can't we just have a new TX function for bonding PMD
>>> when TX offloads are enabled? And inside that function we will do:
>>> tx_prepare(); tx_burst(); for selected device.
>>
>> The solution you mentioned is workable and may perform better. However, the current
>> solution is also simple and has a limited impact on performance. It is actually:
>> if (tx_prepare_enable)
>> 	tx_prepare();
>> tx_burst();
>>
>> Overall, it adds almost only one judgment to the case where the related Tx offloads
>> is not turned on.
>>
>>> We can select this function at setup stage analysing requested by user TX offloads.
>>>
>>
>> In PMDs, it is a common practice to select different Tx/Rx function during the setup
>> phase. But for a 'vdev' device like Bonding, we may need to think more about it.
>> The reasons are explained below.
>>>
>>>>>>
>>>>>
>>>>> What do you think to add a devarg to bonding PMD to control the tx_prepare?
>>>>> It won't be as dynamic as API, since it can be possible to change the behavior after application is started with API, but do we really need
>>>> this?
>>>>
>>>> If an API is not added, unnecessary constraints may be introduced. If the
>>>> bonding device is created through the rte_eth_bond_create interface instead
>>>> devarg "vdev", this function cannot be used because devargs does not take effect
>>>> in this case. But from an ease-of-use perspective, adding a devarg is a good
>>>> idea. I will add related implementations in the later official patches.
>>>
>>> I am also against introducing new devarg to control tx_prepare() invocation.
>>> I think at dev_config/queue_setup phase PMD will have enough information to decide.
>>>
>> Currently, the community does not specify which Tx offloads need to invoke tx_prepare.
> 
> I think inside bond PMD we can safely assume that any TX offload does need tx_prepare().
> If that's not the case then slave dev tx_prepare pointer will be NULL and rte_eth_tx_prepare()
> will be just a NOOP. 
Got it. I agree that these decisions should be left directly to the PMDs.
In the formal patch, the API used to control the enable state will be removed.
> 
>> For Vdev devices such as bond, all NIC devices need to be considered. Generally,
>> tx_prepare is used in CKSUM and TSO. It is possible that for some NIC devices, even
>> CKSUM and TSO do not need to invoke tx_prepare, or for some NIC devices, there are
>> other Tx offloads that need to call tx_prepare. From this perspective, leaving the
>> choice to the user seems to be a better choice.
> 
> Wonder how user will know when to enable/disable it?
> As you said it depends on the underlying HW/PMD and can change from system to system?
Generally, the decision has to be made based on debugging results, which is not good.
> I think it is PMD that needs to take this decision, and I think the safest bet might be to enable
> it when any TX offloads was enabled by user.
> 
I agree that these decisions should be made by the PMDs. I even think that tx_prepare()
should always be called in bonding; its impact on performance should be controlled
directly by the PMDs.
>>>>
>>>> If I understand correctly, the current community does not want to introduce
>>>> more private APIs for PMDs. However, the absence of an API on this issue would
>>>> introduce some unnecessary constraints, and from that point of view, I think
>>>> adding an API seems necessary.
>>>>>
>>>>>> Chengchang Tang (2):
>>>>>>    net/bonding: add Tx prepare for bonding
>>>>>>    app/testpmd: add cmd for bonding Tx prepare
>>>>>>
>>>>>>   app/test-pmd/cmdline.c                      | 66 +++++++++++++++++++++++++++++
>>>>>>   doc/guides/testpmd_app_ug/testpmd_funcs.rst |  9 ++++
>>>>>>   drivers/net/bonding/eth_bond_private.h      |  1 +
>>>>>>   drivers/net/bonding/rte_eth_bond.h          | 29 +++++++++++++
>>>>>>   drivers/net/bonding/rte_eth_bond_api.c      | 28 ++++++++++++
>>>>>>   drivers/net/bonding/rte_eth_bond_pmd.c      | 33 +++++++++++++--
>>>>>>   drivers/net/bonding/version.map             |  5 +++
>>>>>>   7 files changed, 167 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 2.7.4
>>>>>>
>>>>>
>>>>>
>>>>> .
>>>>>
>>>
> 
* [dpdk-dev] [PATCH 0/2] add Tx prepare support for bonding device
  2021-04-16 11:04 [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device Chengchang Tang
                   ` (3 preceding siblings ...)
  2021-04-20  1:26 ` Ferruh Yigit
@ 2021-04-23  9:46 ` Chengchang Tang
  2021-04-23  9:46   ` [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding Chengchang Tang
                     ` (3 more replies)
  4 siblings, 4 replies; 61+ messages in thread
From: Chengchang Tang @ 2021-04-23  9:46 UTC (permalink / raw)
  To: dev; +Cc: linuxarm, chas3, humin29, ferruh.yigit, konstantin.ananyev
This patch set adds Tx prepare support for the bonding device.

Currently, the bonding driver does not implement the callback for the
rte_eth_tx_prepare function. Therefore, the Tx prepare function of the
slave devices is never invoked. When hardware offloads such as
CKSUM and TSO are enabled, some drivers need tx_prepare
to adjust packets (for example, to set correct pseudo headers).
Otherwise, the related offloads fail and packets may even be sent
incorrectly. Due to this limitation, the bonded device cannot use these
HW offloads in the Tx direction.

Because the packet sending algorithms in the bond PMD are numerous and
complex, it is hard to design a callback for rte_eth_tx_prepare. In this
patchset, the tx_prepare callback of the bonding PMD is not implemented.
Instead, rte_eth_tx_prepare is called in the tx_burst callback. In
this way, all Tx offloads can be processed correctly for all NIC devices.
It is the responsibility of the slave PMDs to decide when the real
tx_prepare needs to be used. If tx_prepare is not required in some cases,
then the slave PMD's tx_prepare pointer should be NULL and rte_eth_tx_prepare()
will be just a NOOP. That is, the effectiveness and safety of tx_prepare
and its impact on performance depend on the design of the slave PMDs.

This patchset also adds support for configuring Tx offloads on the bonding
device. This removes the need to configure the slave devices one by one
when configuring Tx offloads.

Chengchang Tang (2):
  net/bonding: support Tx prepare for bonding
  net/bonding: support configuring Tx offloading for bonding
 drivers/net/bonding/rte_eth_bond.h     |  1 -
 drivers/net/bonding/rte_eth_bond_pmd.c | 41 ++++++++++++++++++++++++++++++----
 2 files changed, 37 insertions(+), 5 deletions(-)
--
2.7.4
* [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding
  2021-04-23  9:46 ` [dpdk-dev] [PATCH " Chengchang Tang
@ 2021-04-23  9:46   ` Chengchang Tang
  2021-06-08  9:49     ` Andrew Rybchenko
                       ` (4 more replies)
  2021-04-23  9:46   ` [dpdk-dev] [PATCH 2/2] net/bonding: support configuring Tx offloading for bonding Chengchang Tang
                     ` (2 subsequent siblings)
  3 siblings, 5 replies; 61+ messages in thread
From: Chengchang Tang @ 2021-04-23  9:46 UTC (permalink / raw)
  To: dev; +Cc: linuxarm, chas3, humin29, ferruh.yigit, konstantin.ananyev
To use HW offload capabilities (e.g. checksum and TSO) in the Tx
direction, upper-layer users need to call rte_eth_tx_prepare to make
some adjustments to the packets before sending them (e.g. processing
pseudo headers when Tx checksum offload is enabled). However, the tx_prepare
callback of the bonding driver is not implemented. Therefore, the related
offloads cannot be used unless upper-layer users process the packets
properly in their own application, which is bad for portability.

However, it is difficult to design the tx_prepare callback for the bonding
driver, because when a bonded device sends packets, it
distributes the packets to different slave devices based on the real-time
link status and the bonding mode. That is, it is very difficult for the
bonding device to determine which slave device's prepare function should
be invoked. In addition, if the link status changes after the packets are
prepared, the packets may fail to be sent because the packet distribution
may change.

So, in this patch, the tx_prepare callback of the bonding driver is not
implemented. Instead, rte_eth_tx_prepare() is called for
all the fast path packets in modes 0, 1, 2, 4, 5 and 6. In this way, all
Tx offloads can be processed correctly for all NIC devices in these modes.
If tx_prepare is not required in some cases, then the slave PMD's tx_prepare
pointer should be NULL and rte_eth_tx_prepare() will be just a NOOP.
In these cases, the impact on performance will be very limited. It is
the responsibility of the slave PMDs to decide when the real tx_prepare
needs to be used. The information from dev_config/queue_setup is
sufficient for them to make these decisions.

Note:
rte_eth_tx_prepare is not called in bond mode 3 (Broadcast). This is
because in broadcast mode, a packet needs to be sent by all slave ports,
and different PMDs process the packets differently in tx_prepare. As a
result, the sent packet may be incorrect.
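
The broadcast restriction in the note can be illustrated with a toy model. It assumes two made-up PMDs that expect different values in the same header field before transmission; the type, the function names and the values 0x1111 and 0x2222 are all hypothetical. Because broadcast mode hands the same packet to every slave, preparing it for the second port overwrites what the first port needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an mbuf shared by all slave ports in broadcast mode. */
struct pkt {
	uint16_t cksum_field; /* header field that each PMD pre-fills in its own way */
};

/* Two made-up PMDs that expect different values in the same field. */
enum { PMD_A_SEED = 0x1111, PMD_B_SEED = 0x2222 };

static void
prepare_for_a(struct pkt *p)
{
	assert(p != NULL);
	p->cksum_field = PMD_A_SEED;
}

static void
prepare_for_b(struct pkt *p)
{
	assert(p != NULL);
	p->cksum_field = PMD_B_SEED;
}

/* Return 1 if the packet is in the state that the given port expects. */
static int
valid_for_a(const struct pkt *p)
{
	return p->cksum_field == PMD_A_SEED;
}

static int
valid_for_b(const struct pkt *p)
{
	return p->cksum_field == PMD_B_SEED;
}

/*
 * Prepare one shared packet for both ports before sending, as a naive
 * broadcast implementation would. The last prepare wins, so the packet
 * is no longer in the state the first port needs.
 */
static void
prepare_broadcast(struct pkt *shared)
{
	prepare_for_a(shared);
	prepare_for_b(shared);
}
```

After `prepare_broadcast()`, the packet satisfies `valid_for_b()` but not `valid_for_a()`, which is the kind of conflict the note describes.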
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
---
 drivers/net/bonding/rte_eth_bond.h     |  1 -
 drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++----
 2 files changed, 24 insertions(+), 5 deletions(-)
diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
index 874aa91..1e6cc6d 100644
--- a/drivers/net/bonding/rte_eth_bond.h
+++ b/drivers/net/bonding/rte_eth_bond.h
@@ -343,7 +343,6 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
 int
 rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
-
 #ifdef __cplusplus
 }
 #endif
diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index 2e9cea5..84af348 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
 	/* Send packet burst on each slave device */
 	for (i = 0; i < num_of_slaves; i++) {
 		if (slave_nb_pkts[i] > 0) {
+			int nb_prep_pkts;
+
+			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
+					bd_tx_q->queue_id, slave_bufs[i],
+					slave_nb_pkts[i]);
+
 			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
-					slave_bufs[i], slave_nb_pkts[i]);
+					slave_bufs[i], nb_prep_pkts);
 			/* if tx burst fails move packets to end of bufs */
 			if (unlikely(num_tx_slave < slave_nb_pkts[i])) {
@@ -632,6 +638,7 @@ bond_ethdev_tx_burst_active_backup(void *queue,
 {
 	struct bond_dev_private *internals;
 	struct bond_tx_queue *bd_tx_q;
+	int nb_prep_pkts;
 	bd_tx_q = (struct bond_tx_queue *)queue;
 	internals = bd_tx_q->dev_private;
@@ -639,8 +646,11 @@ bond_ethdev_tx_burst_active_backup(void *queue,
 	if (internals->active_slave_count < 1)
 		return 0;
+	nb_prep_pkts = rte_eth_tx_prepare(internals->current_primary_port,
+				bd_tx_q->queue_id, bufs, nb_pkts);
+
 	return rte_eth_tx_burst(internals->current_primary_port, bd_tx_q->queue_id,
-			bufs, nb_pkts);
+			bufs, nb_prep_pkts);
 }
 static inline uint16_t
@@ -939,6 +949,8 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	}
 	for (i = 0; i < num_of_slaves; i++) {
+		int nb_prep_pkts;
+
 		rte_eth_macaddr_get(slaves[i], &active_slave_addr);
 		for (j = num_tx_total; j < nb_pkts; j++) {
 			if (j + 3 < nb_pkts)
@@ -955,9 +967,12 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 #endif
 		}
-		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+		nb_prep_pkts = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
 				bufs + num_tx_total, nb_pkts - num_tx_total);
+		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+				bufs + num_tx_total, nb_prep_pkts);
+
 		if (num_tx_total == nb_pkts)
 			break;
 	}
@@ -1159,12 +1174,17 @@ tx_burst_balance(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
 	/* Send packet burst on each slave device */
 	for (i = 0; i < slave_count; i++) {
+		int nb_prep_pkts;
+
 		if (slave_nb_bufs[i] == 0)
 			continue;
+		nb_prep_pkts = rte_eth_tx_prepare(slave_port_ids[i],
+				bd_tx_q->queue_id, slave_bufs[i],
+				slave_nb_bufs[i]);
 		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
 				bd_tx_q->queue_id, slave_bufs[i],
-				slave_nb_bufs[i]);
+				nb_prep_pkts);
 		total_tx_count += slave_tx_count;
--
2.7.4
* [dpdk-dev] [PATCH 2/2] net/bonding: support configuring Tx offloading for bonding
  2021-04-23  9:46 ` [dpdk-dev] [PATCH " Chengchang Tang
  2021-04-23  9:46   ` [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding Chengchang Tang
@ 2021-04-23  9:46   ` Chengchang Tang
  2021-06-08  9:49     ` Andrew Rybchenko
  2021-04-30  6:26   ` [dpdk-dev] [PATCH 0/2] add Tx prepare support for bonding device Chengchang Tang
  2021-06-03  1:44   ` Chengchang Tang
  3 siblings, 1 reply; 61+ messages in thread
From: Chengchang Tang @ 2021-04-23  9:46 UTC (permalink / raw)
  To: dev; +Cc: linuxarm, chas3, humin29, ferruh.yigit, konstantin.ananyev
Currently, Tx offloads configured on the bonding device through
dev_configure do not take effect, because the related configuration is not
delivered to the slave devices.

The Tx offload capability of the bonding device is the intersection of
the capabilities of all slave devices. Based on this, the following
behavior is added to the bonding driver:
1. If a Tx offload is within the capability of the bonding device (i.e.
all the slave devices support this Tx offload), its enabling status on
all slave devices follows the configuration of the bonding device.
2. If a Tx offload is not within the Tx offload capability of the bonding
device, its enabling status on the slave devices is independent of the
bonding device configuration and depends on the original configuration of
the slave devices.
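
The two rules above amount to a bitmask merge, which can be sketched as follows. The function names are hypothetical and the bit values used with them are arbitrary, not real offload flags. The first function walks the capability bits one at a time, the second computes the same result with two mask operations. `__builtin_ctzll` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rule 1: bits inside the bonding capability follow the bonding config.
 * Rule 2: bits outside it keep the slave's original setting.
 * Per-bit version: visit each capability bit, lowest first.
 */
static uint64_t
merge_tx_offloads_loop(uint64_t slave_cfg, uint64_t bond_cfg, uint64_t bond_capa)
{
	while (bond_capa != 0) {
		/* Isolate the lowest set capability bit. */
		uint64_t bit = 1ULL << __builtin_ctzll(bond_capa);

		if (bond_cfg & bit)
			slave_cfg |= bit;
		else
			slave_cfg &= ~bit;
		bond_capa &= ~bit;
	}
	return slave_cfg;
}

/* Closed form of the same merge. */
static uint64_t
merge_tx_offloads_mask(uint64_t slave_cfg, uint64_t bond_cfg, uint64_t bond_capa)
{
	return (slave_cfg & ~bond_capa) | (bond_cfg & bond_capa);
}

/* Debug helper: both forms must agree for any input. */
static void
check_merge(uint64_t slave_cfg, uint64_t bond_cfg, uint64_t bond_capa)
{
	assert(merge_tx_offloads_loop(slave_cfg, bond_cfg, bond_capa) ==
	       merge_tx_offloads_mask(slave_cfg, bond_cfg, bond_capa));
}
```

For example, with a slave configuration of 0xA, a bonding configuration of 0x1 and a bonding capability of 0x3, both functions return 0x9: bit 0 is switched on by the bonding device, bit 1 is switched off, and bit 3 keeps the slave's own setting.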
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
---
 drivers/net/bonding/rte_eth_bond_pmd.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index 84af348..9922657 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -1712,6 +1712,8 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
 	struct rte_flow_error flow_error;
 	struct bond_dev_private *internals = bonded_eth_dev->data->dev_private;
+	uint64_t tx_offload_cap = internals->tx_offload_capa;
+	uint64_t tx_offload;
 	/* Stop slave */
 	errval = rte_eth_dev_stop(slave_eth_dev->data->port_id);
@@ -1759,6 +1761,17 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
 		slave_eth_dev->data->dev_conf.rxmode.offloads &=
 				~DEV_RX_OFFLOAD_JUMBO_FRAME;
+	while (tx_offload_cap != 0) {
+		tx_offload = 1ULL << __builtin_ctzll(tx_offload_cap);
+		if (bonded_eth_dev->data->dev_conf.txmode.offloads & tx_offload)
+			slave_eth_dev->data->dev_conf.txmode.offloads |=
+				tx_offload;
+		else
+			slave_eth_dev->data->dev_conf.txmode.offloads &=
+				~tx_offload;
+		tx_offload_cap &= ~tx_offload;
+	}
+
 	nb_rx_queues = bonded_eth_dev->data->nb_rx_queues;
 	nb_tx_queues = bonded_eth_dev->data->nb_tx_queues;
--
2.7.4
* Re: [dpdk-dev] [PATCH 0/2] add Tx prepare support for bonding device
  2021-04-23  9:46 ` [dpdk-dev] [PATCH " Chengchang Tang
  2021-04-23  9:46   ` [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding Chengchang Tang
  2021-04-23  9:46   ` [dpdk-dev] [PATCH 2/2] net/bonding: support configuring Tx offloading for bonding Chengchang Tang
@ 2021-04-30  6:26   ` Chengchang Tang
  2021-04-30  6:47     ` Min Hu (Connor)
  2021-06-03  1:44   ` Chengchang Tang
  3 siblings, 1 reply; 61+ messages in thread
From: Chengchang Tang @ 2021-04-30  6:26 UTC (permalink / raw)
  To: dev; +Cc: linuxarm, chas3, humin29, ferruh.yigit, konstantin.ananyev
Hi all,
Any comments?
On 2021/4/23 17:46, Chengchang Tang wrote:
> This patch set add Tx prepare for bonding device.
> 
> Currently, the bonding driver has not implemented the callback of
> rte_eth_tx_prepare function. Therefore, the TX prepare function of the
> slave devices will never be invoked. When hardware offloading such as
> CKSUM and TSO are enabled for some drivers, tx_prepare needs to be used
> to adjust packets (for example, set correct pseudo packet headers).
> Otherwise, related offloading fails and even packets are sent
> incorrectly. Due to this limitation, the bonded device cannot use these
> HW offloading in the Tx direction.
> 
> Because packet sending algorithms are numerous and complex in bond PMD,
> it is hard to design the callback for rte_eth_tx_prepare. In this
> patchset, the tx_prepare callback of bonding PMD is not implemented.
> Instead, rte_eth_tx_prepare has been called in tx_burst callback. In
> this way, all tx_offloads can be processed correctly for all NIC devices.
> It is the responsibility of the slave PMDs to decide when the real
> tx_prepare needs to be used. If tx_prepare is not required in some cases,
> then slave PMDs tx_prepare pointer should be NULL and rte_eth_tx_prepare()
> will be just a NOOP. That is, the effectiveness and security of tx_prepare
> and its impact on performance depend on the design of slave PMDs.
> 
> And configuring Tx offloading for bonding is also added in this patchset.
> This solves the problem that we need to configure slave devices one by one
> when configuring Tx offloading.
> 
> Chengchang Tang (2):
>   net/bonding: support Tx prepare for bonding
>   net/bonding: support configuring Tx offloading for bonding
> 
>  drivers/net/bonding/rte_eth_bond.h     |  1 -
>  drivers/net/bonding/rte_eth_bond_pmd.c | 41 ++++++++++++++++++++++++++++++----
>  2 files changed, 37 insertions(+), 5 deletions(-)
> 
> --
> 2.7.4
> 
> 
> .
> 
* Re: [dpdk-dev] [PATCH 0/2] add Tx prepare support for bonding device
  2021-04-30  6:26   ` [dpdk-dev] [PATCH 0/2] add Tx prepare support for bonding device Chengchang Tang
@ 2021-04-30  6:47     ` Min Hu (Connor)
  0 siblings, 0 replies; 61+ messages in thread
From: Min Hu (Connor) @ 2021-04-30  6:47 UTC (permalink / raw)
  To: Chengchang Tang, dev; +Cc: linuxarm, chas3, ferruh.yigit, konstantin.ananyev
On 2021/4/30 14:26, Chengchang Tang wrote:
> Hi,all
> Any comments?
> 
> On 2021/4/23 17:46, Chengchang Tang wrote:
>> This patch set add Tx prepare for bonding device.
>>
>> Currently, the bonding driver has not implemented the callback of
>> rte_eth_tx_prepare function. Therefore, the TX prepare function of the
>> slave devices will never be invoked. When hardware offloading such as
>> CKSUM and TSO are enabled for some drivers, tx_prepare needs to be used
>> to adjust packets (for example, set correct pseudo packet headers).
>> Otherwise, related offloading fails and even packets are sent
>> incorrectly. Due to this limitation, the bonded device cannot use these
>> HW offloading in the Tx direction.
>>
>> Because packet sending algorithms are numerous and complex in bond PMD,
>> it is hard to design the callback for rte_eth_tx_prepare. In this
>> patchset, the tx_prepare callback of bonding PMD is not implemented.
>> Instead, rte_eth_tx_prepare has been called in tx_burst callback. In
>> this way, all tx_offloads can be processed correctly for all NIC devices.
>> It is the responsibility of the slave PMDs to decide when the real
>> tx_prepare needs to be used. If tx_prepare is not required in some cases,
>> then slave PMDs tx_prepare pointer should be NULL and rte_eth_tx_prepare()
>> will be just a NOOP. That is, the effectiveness and security of tx_prepare
>> and its impact on performance depend on the design of slave PMDs.
>>
>> And configuring Tx offloading for bonding is also added in this patchset.
>> This solves the problem that we need to configure slave devices one by one
>> when configuring Tx offloading.
>>
>> Chengchang Tang (2):
>>    net/bonding: support Tx prepare for bonding
>>    net/bonding: support configuring Tx offloading for bonding
>>
>>   drivers/net/bonding/rte_eth_bond.h     |  1 -
>>   drivers/net/bonding/rte_eth_bond_pmd.c | 41 ++++++++++++++++++++++++++++++----
>>   2 files changed, 37 insertions(+), 5 deletions(-)
>>
Acked-by: Min Hu (Connor) <humin29@huawei.com>
>> --
>> 2.7.4
>>
>>
>> .
>>
> 
> .
> 
* Re: [dpdk-dev] [PATCH 0/2] add Tx prepare support for bonding device
  2021-04-23  9:46 ` [dpdk-dev] [PATCH " Chengchang Tang
                     ` (2 preceding siblings ...)
  2021-04-30  6:26   ` [dpdk-dev] [PATCH 0/2] add Tx prepare support for bonding device Chengchang Tang
@ 2021-06-03  1:44   ` Chengchang Tang
  3 siblings, 0 replies; 61+ messages in thread
From: Chengchang Tang @ 2021-06-03  1:44 UTC (permalink / raw)
  To: dev; +Cc: linuxarm, chas3, humin29, ferruh.yigit, konstantin.ananyev
Hi all,
Any comments on these patches?
On 2021/4/23 17:46, Chengchang Tang wrote:
> This patch set add Tx prepare for bonding device.
> 
> Currently, the bonding driver has not implemented the callback of
> rte_eth_tx_prepare function. Therefore, the TX prepare function of the
> slave devices will never be invoked. When hardware offloading such as
> CKSUM and TSO are enabled for some drivers, tx_prepare needs to be used
> to adjust packets (for example, set correct pseudo packet headers).
> Otherwise, related offloading fails and even packets are sent
> incorrectly. Due to this limitation, the bonded device cannot use these
> HW offloading in the Tx direction.
> 
> Because packet sending algorithms are numerous and complex in bond PMD,
> it is hard to design the callback for rte_eth_tx_prepare. In this
> patchset, the tx_prepare callback of bonding PMD is not implemented.
> Instead, rte_eth_tx_prepare has been called in tx_burst callback. In
> this way, all tx_offloads can be processed correctly for all NIC devices.
> It is the responsibility of the slave PMDs to decide when the real
> tx_prepare needs to be used. If tx_prepare is not required in some cases,
> then slave PMDs tx_prepare pointer should be NULL and rte_eth_tx_prepare()
> will be just a NOOP. That is, the effectiveness and security of tx_prepare
> and its impact on performance depend on the design of slave PMDs.
> 
> And configuring Tx offloading for bonding is also added in this patchset.
> This solves the problem that we need to configure slave devices one by one
> when configuring Tx offloading.
> 
> Chengchang Tang (2):
>   net/bonding: support Tx prepare for bonding
>   net/bonding: support configuring Tx offloading for bonding
> 
>  drivers/net/bonding/rte_eth_bond.h     |  1 -
>  drivers/net/bonding/rte_eth_bond_pmd.c | 41 ++++++++++++++++++++++++++++++----
>  2 files changed, 37 insertions(+), 5 deletions(-)
> 
> --
> 2.7.4
> 
> 
> .
> 
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding
  2021-04-23  9:46   ` [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding Chengchang Tang
@ 2021-06-08  9:49     ` Andrew Rybchenko
  2021-06-09  6:42       ` Chengchang Tang
  2022-05-24 12:11       ` Min Hu (Connor)
  2022-07-25  4:08     ` [PATCH v2 0/3] add Tx prepare support for bonding driver Chengwen Feng
                       ` (3 subsequent siblings)
  4 siblings, 2 replies; 61+ messages in thread
From: Andrew Rybchenko @ 2021-06-08  9:49 UTC (permalink / raw)
  To: Chengchang Tang, dev
  Cc: linuxarm, chas3, humin29, ferruh.yigit, konstantin.ananyev
"for bonding" is redundant in the summary since it is already "net/bonding".
On 4/23/21 12:46 PM, Chengchang Tang wrote:
> To use the HW offloads capability (e.g. checksum and TSO) in the Tx
> direction, the upper-layer users need to call rte_eth_tx_prepare to do
> some adjustment to the packets before sending them (e.g. processing
> pseudo headers when Tx checksum offload is enabled). But the tx_prepare
> callback of the bond driver is not implemented. Therefore, related
> offloads can not be used unless the upper layer users process the packet
> properly in their own application. But that is bad for
> portability.
> 
> However, it is difficult to design the tx_prepare callback for bonding
> driver. Because when a bonded device sends packets, the bonded device
> allocates the packets to different slave devices based on the real-time
> link status and bonding mode. That is, it is very difficult for the
> bonding device to determine which slave device's prepare function should
> be invoked. In addition, if the link status changes after the packets are
> prepared, the packets may fail to be sent because packets allocation may
> change.
> 
> So, in this patch, the tx_prepare callback of the bonding driver is not
> implemented. Instead, rte_eth_tx_prepare() will be called for
> all the fast path packets in modes 0, 1, 2, 4, 5, 6. In this way, all
> tx_offloads can be processed correctly for all NIC devices in these modes.
> If tx_prepare is not required in some cases, then slave PMDs tx_prepare
> pointer should be NULL and rte_eth_tx_prepare() will be just a NOOP.
> In these cases, the impact on performance will be very limited. It is
> the responsibility of the slave PMDs to decide when the real tx_prepare
> needs to be used. The information from dev_config/queue_setup is
> sufficient for them to make these decisions.
> 
> Note:
> The rte_eth_tx_prepare is not added to bond mode 3(Broadcast). This is
> because in broadcast mode, a packet needs to be sent by all slave ports.
> Different PMDs process the packets differently in tx_prepare. As a result,
> the sent packet may be incorrect.
> 
> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
> ---
>  drivers/net/bonding/rte_eth_bond.h     |  1 -
>  drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++----
>  2 files changed, 24 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
> index 874aa91..1e6cc6d 100644
> --- a/drivers/net/bonding/rte_eth_bond.h
> +++ b/drivers/net/bonding/rte_eth_bond.h
> @@ -343,7 +343,6 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
>  int
>  rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
> 
> -
>  #ifdef __cplusplus
>  }
>  #endif
> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
> index 2e9cea5..84af348 100644
> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>  	/* Send packet burst on each slave device */
>  	for (i = 0; i < num_of_slaves; i++) {
>  		if (slave_nb_pkts[i] > 0) {
> +			int nb_prep_pkts;
> +
> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
> +					bd_tx_q->queue_id, slave_bufs[i],
> +					slave_nb_pkts[i]);
> +
Shouldn't it be called only if the queue Tx offloads are non-zero?
That would reduce the performance degradation when no
Tx offloads are enabled. Same in all cases below.
>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> -					slave_bufs[i], slave_nb_pkts[i]);
> +					slave_bufs[i], nb_prep_pkts);
In fact there is a problem here, and a really big one.
Tx prepare may fail and return fewer packets. Tx prepare
of some packet may always fail. If the application tries to
send packets in a loop until success, it will loop
forever here. Since the application calls Tx burst,
it is 100% legal behaviour for the function to return 0
if the Tx ring is full. That is not an error indication.
In the case of Tx prepare, however, it is an error
indication.
Should we change the Tx burst description and require callers
to check rte_errno? It sounds like a major change...
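The retry hazard described above can be sketched outside DPDK as
follows. This is only an illustration under stated assumptions:
struct pkt, mock_tx_prepare(), mock_bond_tx_burst() and
send_with_retry() are hypothetical stand-ins, not ethdev or bonding
API, and the retry cap exists only so that the sketch terminates.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an mbuf: only the field this sketch needs. */
struct pkt {
	int bad; /* non-zero: Tx prepare always rejects this packet */
};

/* Mock Tx prepare: returns the number of leading packets accepted. */
static uint16_t
mock_tx_prepare(struct pkt **pkts, uint16_t nb_pkts)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		if (pkts[i]->bad)
			break;
	return i;
}

/* Mock bonding Tx burst as in the patch: prepare, then send what passed. */
static uint16_t
mock_bond_tx_burst(struct pkt **pkts, uint16_t nb_pkts)
{
	/* The mock ring is never full, so every prepared packet is sent. */
	return mock_tx_prepare(pkts, nb_pkts);
}

/*
 * Typical application loop: retry until everything is sent. The cap on
 * the number of calls is an addition of this sketch; without it the loop
 * never ends once a packet is permanently rejected by prepare.
 * Returns the number of packets still unsent when the loop stops.
 */
static uint16_t
send_with_retry(struct pkt **pkts, uint16_t nb_pkts, unsigned int max_calls)
{
	uint16_t sent = 0;

	assert(max_calls > 0);
	while (sent < nb_pkts && max_calls-- > 0)
		sent += mock_bond_tx_burst(pkts + sent, nb_pkts - sent);
	return nb_pkts - sent;
}
```

With a burst of three packets where the second one is always rejected,
the first packet goes out and every later call returns 0, which the
caller cannot tell apart from a full ring.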
[snip]
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] net/bonding: support configuring Tx offloading for bonding
  2021-04-23  9:46   ` [dpdk-dev] [PATCH 2/2] net/bonding: support configuring Tx offloading for bonding Chengchang Tang
@ 2021-06-08  9:49     ` Andrew Rybchenko
  2021-06-09  6:57       ` Chengchang Tang
  0 siblings, 1 reply; 61+ messages in thread
From: Andrew Rybchenko @ 2021-06-08  9:49 UTC (permalink / raw)
  To: Chengchang Tang, dev
  Cc: linuxarm, chas3, humin29, ferruh.yigit, konstantin.ananyev
"for bonding" is redundant in the summary since it is already
"net/bonding"
On 4/23/21 12:46 PM, Chengchang Tang wrote:
> Currently, the TX offloading of the bonding device will not take effect by
TX -> Tx
> using dev_configure. Because the related configuration will not be
> delivered to the slave devices in this way.
I think it is a major problem that Tx offloads are actually
ignored. There should be a patch with a "Fixes:" tag which
addresses it.
> The Tx offloading capability of the bonding device is the intersection of
> the capability of all slave devices. Based on this, the following functions
> are added to the bonding driver:
> 1. If a Tx offloading is within the capability of the bonding device (i.e.
> all the slave devices support this Tx offloading), the enabling status of
> the offloading of all slave devices depends on the configuration of the
> bonding device.
> 
> 2. For the Tx offloading that is not within the Tx offloading capability
> of the bonding device, the enabling status of the offloading on the slave
> devices is irrelevant to the bonding device configuration. And it depends
> on the original configuration of the slave devices.
> 
> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
> ---
>  drivers/net/bonding/rte_eth_bond_pmd.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
> index 84af348..9922657 100644
> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
> @@ -1712,6 +1712,8 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
>  	struct rte_flow_error flow_error;
> 
>  	struct bond_dev_private *internals = bonded_eth_dev->data->dev_private;
> +	uint64_t tx_offload_cap = internals->tx_offload_capa;
> +	uint64_t tx_offload;
> 
>  	/* Stop slave */
>  	errval = rte_eth_dev_stop(slave_eth_dev->data->port_id);
> @@ -1759,6 +1761,17 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
>  		slave_eth_dev->data->dev_conf.rxmode.offloads &=
>  				~DEV_RX_OFFLOAD_JUMBO_FRAME;
> 
> +	while (tx_offload_cap != 0) {
> +		tx_offload = 1ULL << __builtin_ctzll(tx_offload_cap);
> +		if (bonded_eth_dev->data->dev_conf.txmode.offloads & tx_offload)
> +			slave_eth_dev->data->dev_conf.txmode.offloads |=
> +				tx_offload;
> +		else
> +			slave_eth_dev->data->dev_conf.txmode.offloads &=
> +				~tx_offload;
> +		tx_offload_cap &= ~tx_offload;
> +	}
> +
Frankly speaking, I don't understand why it is that complicated.
ethdev rejects unsupported Tx offloads. So, can't we simply do:
slave_eth_dev->data->dev_conf.txmode.offloads =
    bonded_eth_dev->data->dev_conf.txmode.offloads;
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding
  2021-06-08  9:49     ` Andrew Rybchenko
@ 2021-06-09  6:42       ` Chengchang Tang
  2021-06-09  9:35         ` Andrew Rybchenko
  2021-06-09 10:25         ` Ananyev, Konstantin
  2022-05-24 12:11       ` Min Hu (Connor)
  1 sibling, 2 replies; 61+ messages in thread
From: Chengchang Tang @ 2021-06-09  6:42 UTC (permalink / raw)
  To: Andrew Rybchenko, dev
  Cc: linuxarm, chas3, humin29, ferruh.yigit, konstantin.ananyev
Hi, Andrew and Ferruh
On 2021/6/8 17:49, Andrew Rybchenko wrote:
> "for bonding" is redundant in the summary since it is already "net/bonding".
> 
> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>> To use the HW offloads capability (e.g. checksum and TSO) in the Tx
>> direction, the upper-layer users need to call rte_eth_dev_prepare to do
>> some adjustment to the packets before sending them (e.g. processing
>> pseudo headers when Tx checksum offoad enabled). But, the tx_prepare
>> callback of the bond driver is not implemented. Therefore, related
>> offloads can not be used unless the upper layer users process the packet
>> properly in their own application. But it is bad for the
>> transplantability.
>>
>> However, it is difficult to design the tx_prepare callback for bonding
>> driver. Because when a bonded device sends packets, the bonded device
>> allocates the packets to different slave devices based on the real-time
>> link status and bonding mode. That is, it is very difficult for the
>> bonding device to determine which slave device's prepare function should
>> be invoked. In addition, if the link status changes after the packets are
>> prepared, the packets may fail to be sent because packets allocation may
>> change.
>>
>> So, in this patch, the tx_prepare callback of bonding driver is not
>> implemented. Instead, the rte_eth_dev_tx_prepare() will be called for
>> all the fast path packet in mode 0, 1, 2, 4, 5, 6. In this way, all
>> tx_offloads can be processed correctly for all NIC devices in these modes.
>> If tx_prepare is not required in some cases, then slave PMDs tx_prepare
>> pointer should be NULL and rte_eth_tx_prepare() will be just a NOOP.
>> In these cases, the impact on performance will be very limited. It is
>> the responsibility of the slave PMDs to decide when the real tx_prepare
>> needs to be used. The information from dev_config/queue_setup is
>> sufficient for them to make these decisions.
>>
>> Note:
>> The rte_eth_tx_prepare is not added to bond mode 3(Broadcast). This is
>> because in broadcast mode, a packet needs to be sent by all slave ports.
>> Different PMDs process the packets differently in tx_prepare. As a result,
>> the sent packet may be incorrect.
>>
>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>> ---
>>  drivers/net/bonding/rte_eth_bond.h     |  1 -
>>  drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++----
>>  2 files changed, 24 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
>> index 874aa91..1e6cc6d 100644
>> --- a/drivers/net/bonding/rte_eth_bond.h
>> +++ b/drivers/net/bonding/rte_eth_bond.h
>> @@ -343,7 +343,6 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
>>  int
>>  rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
>>
>> -
>>  #ifdef __cplusplus
>>  }
>>  #endif
>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
>> index 2e9cea5..84af348 100644
>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>>  	/* Send packet burst on each slave device */
>>  	for (i = 0; i < num_of_slaves; i++) {
>>  		if (slave_nb_pkts[i] > 0) {
>> +			int nb_prep_pkts;
>> +
>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
>> +					bd_tx_q->queue_id, slave_bufs[i],
>> +					slave_nb_pkts[i]);
>> +
> 
> Shouldn't it be called iff queue Tx offloads are not zero?
> It will allow to decrease performance degradation if no
> Tx offloads are enabled. Same in all cases below.
This point was discussed in the previous RFC:
https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/
Based on the Tx offload configuration of the device, a PMD can determine
whether tx_prepare is currently needed. If it is not needed, the PMD sets
tx_pkt_prepare to NULL, so that the actual tx_prepare processing is skipped
directly in rte_eth_tx_prepare().
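A simplified model of that dispatch, for illustration only. It is not
the ethdev code: struct mock_dev, mock_eth_tx_prepare() and
mock_dev_select_prepare() are hypothetical names, and choosing the
callback from a non-zero offload mask is just one possible PMD policy
assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pkt; /* opaque stand-in for struct rte_mbuf */

typedef uint16_t (*tx_prepare_fn)(struct pkt **pkts, uint16_t nb_pkts);

/* Hypothetical, stripped-down device: only the prepare callback. */
struct mock_dev {
	tx_prepare_fn tx_prepare; /* NULL when no preparation is needed */
};

static unsigned int prepare_calls; /* counts real callback invocations */

static uint16_t
counting_prepare(struct pkt **pkts, uint16_t nb_pkts)
{
	(void)pkts;
	prepare_calls++;
	return nb_pkts;
}

/* Model of the wrapper: a NULL callback accepts every packet as is. */
static uint16_t
mock_eth_tx_prepare(const struct mock_dev *dev, struct pkt **pkts,
		uint16_t nb_pkts)
{
	assert(dev != NULL);
	if (dev->tx_prepare == NULL)
		return nb_pkts;
	return dev->tx_prepare(pkts, nb_pkts);
}

/* What a PMD might do at configure time; offloads is the Tx offload mask. */
static void
mock_dev_select_prepare(struct mock_dev *dev, uint64_t offloads)
{
	assert(dev != NULL);
	dev->tx_prepare = (offloads != 0) ? counting_prepare : NULL;
}
```

In this model the caller pays only for one pointer check when no
offload is configured, which is the cost argued to be acceptable above.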
> 
>>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>> -					slave_bufs[i], slave_nb_pkts[i]);
>> +					slave_bufs[i], nb_prep_pkts);
> 
> In fact it is a problem here and really big problems.
> Tx prepare may fail and return less packets. Tx prepare
> of some packet may always fail. If application tries to
> send packets in a loop until success, it will be a
> forever loop here. Since application calls Tx burst,
> it is 100% legal behaviour of the function to return 0
> if Tx ring is full. It is not an error indication.
> However, in the case of Tx prepare it is an error
> indication.
> 
> Should we change Tx burst description and enforce callers
> to check for rte_errno? It sounds like a major change...
> 
I agree that a failure caused by a full Tx ring is legal behaviour.
But what about failures caused by other reasons? At present, some PMDs
can fail during tx_burst for other reasons. In that case,
repeated attempts to send will also fail.
I'm not sure whether all PMDs need to support sending packets in a
loop until it succeeds. If not, I think users can be warned about the current
problem by adding a note to the bonding documentation. If it is required, I think
the description of tx_burst should also be extended, so that PMD developers
can better understand how tx_burst should be designed, for example by putting all
hardware-related constraint checks into tx_prepare. Another prerequisite for
the above behaviour is that the packets must be prepared (i.e. checked by
rte_eth_tx_prepare()); otherwise, they may also fail to be sent. This means that
we would have to use rte_eth_tx_prepare() in more scenarios.
What's Ferruh's opinion on this?
> [snip]
> 
> .
> 
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] net/bonding: support configuring Tx offloading for bonding
  2021-06-08  9:49     ` Andrew Rybchenko
@ 2021-06-09  6:57       ` Chengchang Tang
  2021-06-09  9:11         ` Ananyev, Konstantin
  0 siblings, 1 reply; 61+ messages in thread
From: Chengchang Tang @ 2021-06-09  6:57 UTC (permalink / raw)
  To: Andrew Rybchenko, dev
  Cc: linuxarm, chas3, humin29, ferruh.yigit, konstantin.ananyev
On 2021/6/8 17:49, Andrew Rybchenko wrote:
> "for bonding" is redundant in the summary since it is already
> "net/bonding"
> 
> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>> Currently, the TX offloading of the bonding device will not take effect by
> 
> TX -> Tx
> 
>> using dev_configure. Because the related configuration will not be
>> delivered to the slave devices in this way.
> 
> I think it is a major problem that Tx offloads are actually
> ignored. It should be a patches with "Fixes:" which addresses
> it.
> 
>> The Tx offloading capability of the bonding device is the intersection of
>> the capability of all slave devices. Based on this, the following functions
>> are added to the bonding driver:
>> 1. If a Tx offloading is within the capability of the bonding device (i.e.
>> all the slave devices support this Tx offloading), the enabling status of
>> the offloading of all slave devices depends on the configuration of the
>> bonding device.
>>
>> 2. For the Tx offloading that is not within the Tx offloading capability
>> of the bonding device, the enabling status of the offloading on the slave
>> devices is irrelevant to the bonding device configuration. And it depends
>> on the original configuration of the slave devices.
>>
>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>> ---
>>  drivers/net/bonding/rte_eth_bond_pmd.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
>> index 84af348..9922657 100644
>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
>> @@ -1712,6 +1712,8 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
>>  	struct rte_flow_error flow_error;
>>
>>  	struct bond_dev_private *internals = bonded_eth_dev->data->dev_private;
>> +	uint64_t tx_offload_cap = internals->tx_offload_capa;
>> +	uint64_t tx_offload;
>>
>>  	/* Stop slave */
>>  	errval = rte_eth_dev_stop(slave_eth_dev->data->port_id);
>> @@ -1759,6 +1761,17 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
>>  		slave_eth_dev->data->dev_conf.rxmode.offloads &=
>>  				~DEV_RX_OFFLOAD_JUMBO_FRAME;
>>
>> +	while (tx_offload_cap != 0) {
>> +		tx_offload = 1ULL << __builtin_ctzll(tx_offload_cap);
>> +		if (bonded_eth_dev->data->dev_conf.txmode.offloads & tx_offload)
>> +			slave_eth_dev->data->dev_conf.txmode.offloads |=
>> +				tx_offload;
>> +		else
>> +			slave_eth_dev->data->dev_conf.txmode.offloads &=
>> +				~tx_offload;
>> +		tx_offload_cap &= ~tx_offload;
>> +	}
>> +
> 
> Frankly speaking I don't understand why it is that complicated.
> ethdev rejects of unsupported Tx offloads. So, can't we simply:
> slave_eth_dev->data->dev_conf.txmode.offloads =
>     bonded_eth_dev->data->dev_conf.txmode.offloads;
> 
The point of this more complicated method is to give the slave devices more
flexibility: their Tx offloads do not have to match those of the bond device
exactly. If some offloads are turned on without the bond device being aware of
them, they are retained in this case.
> 
> .
> 
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] net/bonding: support configuring Tx offloading for bonding
  2021-06-09  6:57       ` Chengchang Tang
@ 2021-06-09  9:11         ` Ananyev, Konstantin
  2021-06-09  9:37           ` Andrew Rybchenko
  0 siblings, 1 reply; 61+ messages in thread
From: Ananyev, Konstantin @ 2021-06-09  9:11 UTC (permalink / raw)
  To: Chengchang Tang, Andrew Rybchenko, dev
  Cc: linuxarm, chas3, humin29, Yigit, Ferruh
> 
> 
> On 2021/6/8 17:49, Andrew Rybchenko wrote:
> > "for bonding" is redundant in the summary since it is already
> > "net/bonding"
> >
> > On 4/23/21 12:46 PM, Chengchang Tang wrote:
> >> Currently, the TX offloading of the bonding device will not take effect by
> >
> > TX -> Tx
> >
> >> using dev_configure. Because the related configuration will not be
> >> delivered to the slave devices in this way.
> >
> > I think it is a major problem that Tx offloads are actually
> > ignored. It should be a patches with "Fixes:" which addresses
> > it.
> >
> >> The Tx offloading capability of the bonding device is the intersection of
> >> the capability of all slave devices. Based on this, the following functions
> >> are added to the bonding driver:
> >> 1. If a Tx offloading is within the capability of the bonding device (i.e.
> >> all the slave devices support this Tx offloading), the enabling status of
> >> the offloading of all slave devices depends on the configuration of the
> >> bonding device.
> >>
> >> 2. For the Tx offloading that is not within the Tx offloading capability
> >> of the bonding device, the enabling status of the offloading on the slave
> >> devices is irrelevant to the bonding device configuration. And it depends
> >> on the original configuration of the slave devices.
> >>
> >> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
> >> ---
> >>  drivers/net/bonding/rte_eth_bond_pmd.c | 13 +++++++++++++
> >>  1 file changed, 13 insertions(+)
> >>
> >> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
> >> index 84af348..9922657 100644
> >> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
> >> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
> >> @@ -1712,6 +1712,8 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
> >>  	struct rte_flow_error flow_error;
> >>
> >>  	struct bond_dev_private *internals = bonded_eth_dev->data->dev_private;
> >> +	uint64_t tx_offload_cap = internals->tx_offload_capa;
> >> +	uint64_t tx_offload;
> >>
> >>  	/* Stop slave */
> >>  	errval = rte_eth_dev_stop(slave_eth_dev->data->port_id);
> >> @@ -1759,6 +1761,17 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
> >>  		slave_eth_dev->data->dev_conf.rxmode.offloads &=
> >>  				~DEV_RX_OFFLOAD_JUMBO_FRAME;
> >>
> >> +	while (tx_offload_cap != 0) {
> >> +		tx_offload = 1ULL << __builtin_ctzll(tx_offload_cap);
> >> +		if (bonded_eth_dev->data->dev_conf.txmode.offloads & tx_offload)
> >> +			slave_eth_dev->data->dev_conf.txmode.offloads |=
> >> +				tx_offload;
> >> +		else
> >> +			slave_eth_dev->data->dev_conf.txmode.offloads &=
> >> +				~tx_offload;
> >> +		tx_offload_cap &= ~tx_offload;
> >> +	}
> >> +
> >
> > Frankly speaking I don't understand why it is that complicated.
> > ethdev rejects of unsupported Tx offloads. So, can't we simply:
> > slave_eth_dev->data->dev_conf.txmode.offloads =
> >     bonded_eth_dev->data->dev_conf.txmode.offloads;
> >
> 
> Using such a complicated method is to increase the flexibility of the slave devices,
> allowing the Tx offloading of the slave devices to be incompletely consistent with
> the bond device. If some offloading can be turned on without bond device awareness,
> they can be retained in this case.
Not sure how that can happen...
From my understanding, tx_offload for the bond device has to be the intersection
of the tx_offloads of all slaves, no? Otherwise the bond device might be misconfigured.
Anyway, for the code snippet above, wouldn't the same be achieved by:
slave_eth_dev->data->dev_conf.txmode.offloads &= internals->tx_offload_capa & bonded_eth_dev->data->dev_conf.txmode.offloads;
?
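To make the comparison concrete, here is a standalone sketch of the
patch loop as a pure function next to the one-line expression above.
The function names are hypothetical, and the closed form is my reading
of what the loop computes rather than anything stated in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-by-bit loop from the patch, rewritten as a pure function. */
static uint64_t
offloads_patch_loop(uint64_t slave, uint64_t bonded, uint64_t capa)
{
	uint64_t bit;

	while (capa != 0) {
		bit = capa & (~capa + 1); /* isolate the lowest set bit */
		assert(bit != 0);
		if (bonded & bit)
			slave |= bit;
		else
			slave &= ~bit;
		capa &= ~bit;
	}
	return slave;
}

/* Closed form: keep slave bits outside capa, copy bonded bits inside. */
static uint64_t
offloads_closed_form(uint64_t slave, uint64_t bonded, uint64_t capa)
{
	return (slave & ~capa) | (bonded & capa);
}

/* The one-line AND expression suggested above. */
static uint64_t
offloads_and_mask(uint64_t slave, uint64_t bonded, uint64_t capa)
{
	return slave & capa & bonded;
}
```

For slave = 0x9, bonded = 0x6 and capa = 0x7, the loop and the closed
form both give 0xE, while the AND expression gives 0x0. So the two are
only interchangeable if slave-only bits are not meant to be kept and
the slave already has every requested bit set.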
 
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding
  2021-06-09  6:42       ` Chengchang Tang
@ 2021-06-09  9:35         ` Andrew Rybchenko
  2021-06-10  7:32           ` Chengchang Tang
  2021-06-09 10:25         ` Ananyev, Konstantin
  1 sibling, 1 reply; 61+ messages in thread
From: Andrew Rybchenko @ 2021-06-09  9:35 UTC (permalink / raw)
  To: Chengchang Tang, dev
  Cc: linuxarm, chas3, humin29, ferruh.yigit, konstantin.ananyev
On 6/9/21 9:42 AM, Chengchang Tang wrote:
> Hi, Andrew and Ferruh
> 
> On 2021/6/8 17:49, Andrew Rybchenko wrote:
>> "for bonding" is redundant in the summary since it is already "net/bonding".
>>
>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>>> To use the HW offloads capability (e.g. checksum and TSO) in the Tx
>>> direction, the upper-layer users need to call rte_eth_dev_prepare to do
>>> some adjustment to the packets before sending them (e.g. processing
>>> pseudo headers when Tx checksum offoad enabled). But, the tx_prepare
>>> callback of the bond driver is not implemented. Therefore, related
>>> offloads can not be used unless the upper layer users process the packet
>>> properly in their own application. But it is bad for the
>>> transplantability.
>>>
>>> However, it is difficult to design the tx_prepare callback for bonding
>>> driver. Because when a bonded device sends packets, the bonded device
>>> allocates the packets to different slave devices based on the real-time
>>> link status and bonding mode. That is, it is very difficult for the
>>> bonding device to determine which slave device's prepare function should
>>> be invoked. In addition, if the link status changes after the packets are
>>> prepared, the packets may fail to be sent because packets allocation may
>>> change.
>>>
>>> So, in this patch, the tx_prepare callback of bonding driver is not
>>> implemented. Instead, the rte_eth_dev_tx_prepare() will be called for
>>> all the fast path packet in mode 0, 1, 2, 4, 5, 6. In this way, all
>>> tx_offloads can be processed correctly for all NIC devices in these modes.
>>> If tx_prepare is not required in some cases, then slave PMDs tx_prepare
>>> pointer should be NULL and rte_eth_tx_prepare() will be just a NOOP.
>>> In these cases, the impact on performance will be very limited. It is
>>> the responsibility of the slave PMDs to decide when the real tx_prepare
>>> needs to be used. The information from dev_config/queue_setup is
>>> sufficient for them to make these decisions.
>>>
>>> Note:
>>> The rte_eth_tx_prepare is not added to bond mode 3(Broadcast). This is
>>> because in broadcast mode, a packet needs to be sent by all slave ports.
>>> Different PMDs process the packets differently in tx_prepare. As a result,
>>> the sent packet may be incorrect.
>>>
>>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>>> ---
>>>  drivers/net/bonding/rte_eth_bond.h     |  1 -
>>>  drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++----
>>>  2 files changed, 24 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
>>> index 874aa91..1e6cc6d 100644
>>> --- a/drivers/net/bonding/rte_eth_bond.h
>>> +++ b/drivers/net/bonding/rte_eth_bond.h
>>> @@ -343,7 +343,6 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
>>>  int
>>>  rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
>>>
>>> -
>>>  #ifdef __cplusplus
>>>  }
>>>  #endif
>>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
>>> index 2e9cea5..84af348 100644
>>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
>>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
>>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>>>  	/* Send packet burst on each slave device */
>>>  	for (i = 0; i < num_of_slaves; i++) {
>>>  		if (slave_nb_pkts[i] > 0) {
>>> +			int nb_prep_pkts;
>>> +
>>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
>>> +					bd_tx_q->queue_id, slave_bufs[i],
>>> +					slave_nb_pkts[i]);
>>> +
>>
>> Shouldn't it be called iff queue Tx offloads are not zero?
>> It will allow to decrease performance degradation if no
>> Tx offloads are enabled. Same in all cases below.
> 
> Regarding this point, it has been discussed in the previous RFC:
> https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/
> 
> According to the TX_OFFLOAD status of the current device, PMDs can determine
> whether tx_prepare is currently needed. If it is not needed, set pkt_tx_prepare
> to NULL, so that the actual tx_prepare processing will be skipped directly in
> rte_eth_tx_prepare().
I still think that the following is right:
No Tx offloads at all => Tx prepare is not necessary
Am I wrong?
>>
>>>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>> -					slave_bufs[i], slave_nb_pkts[i]);
>>> +					slave_bufs[i], nb_prep_pkts);
>>
>> In fact it is a problem here and really big problems.
>> Tx prepare may fail and return less packets. Tx prepare
>> of some packet may always fail. If application tries to
>> send packets in a loop until success, it will be a
>> forever loop here. Since application calls Tx burst,
>> it is 100% legal behaviour of the function to return 0
>> if Tx ring is full. It is not an error indication.
>> However, in the case of Tx prepare it is an error
>> indication.
>>
>> Should we change Tx burst description and enforce callers
>> to check for rte_errno? It sounds like a major change...
>>
> 
> I agree that if the failure is caused by Tx ring full, it is a legal behaviour.
> But what about the failure caused by other reasons? At present, it is possible
> for some PMDs to fail during tx_burst due to other reasons. In this case,
> repeated tries to send will also fail.
If so, the packet should simply be dropped by Tx burst and Tx burst
should move on. If a packet cannot be transmitted, it must be
dropped (and counted) and Tx burst should move to the next packet.
> I'm not sure if all PMDs need to support the behavior of sending packets in a
> loop until it succeeds. If not, I think the current problem can be reminded to
> the user by adding a description to the bonding. If it is necessary, I think the
> description of tx_burst should also add related instructions, so that the developers
> of PMDs can better understand how tx_burst should be designed, such as putting all
> hardware-related constraint checks into tx_prepare. And another prerequisite for
> the above behavior is that the packets must be prepared (i.e. checked by
> rte_eth_tx_prepare()). Otherwise, it may also fail to send. This means that we have
> to use rte_eth_tx_prepare() in more scenarios.
IMHO any PMD-specific behaviour is a nightmare for application
developers and must be avoided. Ideally the application should not
care whether it is running on top of tap, virtio, failsafe or
bonding. It should talk to ethdev in terms of the ethdev API, and
that's it. I know that net/bonding is designed so that the application
has to know about it, but IMHO the places where it requires that
knowledge must be minimized to make applications more portable
across various PMDs/HW.
I think that the only sensible solution for the above problem is
to skip a packet which prepare dislikes, count it as dropped,
and try to prepare/transmit subsequent packets.
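A minimal sketch of that skip-and-count idea, under stated assumptions:
struct pkt and the mock_* helpers are hypothetical stand-ins rather
than ethdev API, the mock ring is never full, and a real driver would
also have to free the dropped mbuf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pkt {
	int bad;  /* non-zero: mock prepare rejects this packet */
	int sent; /* set by the mock Tx burst */
};

/* Mock Tx prepare: returns the number of leading packets accepted. */
static uint16_t
mock_tx_prepare(struct pkt **pkts, uint16_t nb_pkts)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		if (pkts[i]->bad)
			break;
	return i;
}

/* Mock Tx burst with an always-empty ring: sends everything it gets. */
static uint16_t
mock_tx_burst(struct pkt **pkts, uint16_t nb_pkts)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		pkts[i]->sent = 1;
	return nb_pkts;
}

/*
 * Prepare and send a burst. A packet rejected by prepare is counted in
 * *dropped and skipped, and processing continues with the next packet.
 * Returns the number of packets consumed (sent or dropped).
 */
static uint16_t
tx_burst_skip_bad(struct pkt **pkts, uint16_t nb_pkts, uint64_t *dropped)
{
	uint16_t done = 0;

	assert(dropped != NULL);
	while (done < nb_pkts) {
		uint16_t ok = mock_tx_prepare(pkts + done, nb_pkts - done);
		uint16_t tx = mock_tx_burst(pkts + done, ok);

		done += tx;
		if (tx < ok)
			break; /* ring full: let the caller retry the rest */
		if (done < nb_pkts) {
			(*dropped)++; /* pkts[done] failed prepare */
			done++;
		}
	}
	return done;
}
```

Because rejected packets are consumed instead of being left at the
head of the burst, a caller that retries until everything is sent
always makes progress.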
There is an interesting side effect of calling Tx prepare just
before Tx burst inside the bonding PMD: if Tx burst fails to send
something because the ring is full, a number of packets will
be processed by Tx prepare again and again. I guess it is
unavoidable.
> What's Ferruh's opinion on this?
> 
>> [snip]
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] net/bonding: support configuring Tx offloading for bonding
  2021-06-09  9:11         ` Ananyev, Konstantin
@ 2021-06-09  9:37           ` Andrew Rybchenko
  2021-06-10  6:29             ` Chengchang Tang
  0 siblings, 1 reply; 61+ messages in thread
From: Andrew Rybchenko @ 2021-06-09  9:37 UTC (permalink / raw)
  To: Ananyev, Konstantin, Chengchang Tang, dev
  Cc: linuxarm, chas3, humin29, Yigit, Ferruh
On 6/9/21 12:11 PM, Ananyev, Konstantin wrote:
> 
>>
>>
>> On 2021/6/8 17:49, Andrew Rybchenko wrote:
>>> "for bonding" is redundant in the summary since it is already
>>> "net/bonding"
>>>
>>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>>>> Currently, the TX offloading of the bonding device will not take effect by
>>>
>>> TX -> Tx
>>>
>>>> using dev_configure. Because the related configuration will not be
>>>> delivered to the slave devices in this way.
>>>
>>> I think it is a major problem that Tx offloads are actually
>>> ignored. It should be a patches with "Fixes:" which addresses
>>> it.
>>>
>>>> The Tx offloading capability of the bonding device is the intersection of
>>>> the capability of all slave devices. Based on this, the following functions
>>>> are added to the bonding driver:
>>>> 1. If a Tx offloading is within the capability of the bonding device (i.e.
>>>> all the slave devices support this Tx offloading), the enabling status of
>>>> the offloading of all slave devices depends on the configuration of the
>>>> bonding device.
>>>>
>>>> 2. For the Tx offloading that is not within the Tx offloading capability
>>>> of the bonding device, the enabling status of the offloading on the slave
>>>> devices is irrelevant to the bonding device configuration. And it depends
>>>> on the original configuration of the slave devices.
>>>>
>>>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>>>> ---
>>>>  drivers/net/bonding/rte_eth_bond_pmd.c | 13 +++++++++++++
>>>>  1 file changed, 13 insertions(+)
>>>>
>>>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
>>>> index 84af348..9922657 100644
>>>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
>>>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
>>>> @@ -1712,6 +1712,8 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
>>>>  	struct rte_flow_error flow_error;
>>>>
>>>>  	struct bond_dev_private *internals = bonded_eth_dev->data->dev_private;
>>>> +	uint64_t tx_offload_cap = internals->tx_offload_capa;
>>>> +	uint64_t tx_offload;
>>>>
>>>>  	/* Stop slave */
>>>>  	errval = rte_eth_dev_stop(slave_eth_dev->data->port_id);
>>>> @@ -1759,6 +1761,17 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
>>>>  		slave_eth_dev->data->dev_conf.rxmode.offloads &=
>>>>  				~DEV_RX_OFFLOAD_JUMBO_FRAME;
>>>>
>>>> +	while (tx_offload_cap != 0) {
>>>> +		tx_offload = 1ULL << __builtin_ctzll(tx_offload_cap);
>>>> +		if (bonded_eth_dev->data->dev_conf.txmode.offloads & tx_offload)
>>>> +			slave_eth_dev->data->dev_conf.txmode.offloads |=
>>>> +				tx_offload;
>>>> +		else
>>>> +			slave_eth_dev->data->dev_conf.txmode.offloads &=
>>>> +				~tx_offload;
>>>> +		tx_offload_cap &= ~tx_offload;
>>>> +	}
>>>> +
>>>
>>> Frankly speaking I don't understand why it is that complicated.
>>> ethdev rejects of unsupported Tx offloads. So, can't we simply:
>>> slave_eth_dev->data->dev_conf.txmode.offloads =
>>>     bonded_eth_dev->data->dev_conf.txmode.offloads;
>>>
>>
>> Using such a complicated method is to increase the flexibility of the slave devices,
>> allowing the Tx offloading of the slave devices to be incompletely consistent with
>> the bond device. If some offloading can be turned on without bond device awareness,
>> they can be retained in this case.
> 
> 
> Not sure how that can that happen...
+1
@Chengchang could you provide an example of how it could happen?
> From my understanding tx_offload for bond device has to be intersection of tx_offloads
> of all slaves, no? Otherwise bond device might be misconfigured.
> Anyway for that code snippet above, wouldn't the same be achieved by:
> slave_eth_dev->data->dev_conf.txmode.offloads &= internals->tx_offload_capa & bonded_eth_dev->data->dev_conf.txmode.offloads;
> ?
* Re: [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding
  2021-06-09  6:42       ` Chengchang Tang
  2021-06-09  9:35         ` Andrew Rybchenko
@ 2021-06-09 10:25         ` Ananyev, Konstantin
  2021-06-10  6:46           ` Chengchang Tang
  1 sibling, 1 reply; 61+ messages in thread
From: Ananyev, Konstantin @ 2021-06-09 10:25 UTC (permalink / raw)
  To: Chengchang Tang, Andrew Rybchenko, dev
  Cc: linuxarm, chas3, humin29, Yigit, Ferruh
> > On 4/23/21 12:46 PM, Chengchang Tang wrote:
> >> To use the HW offloads capability (e.g. checksum and TSO) in the Tx
> >> direction, the upper-layer users need to call rte_eth_tx_prepare() to do
> >> some adjustment to the packets before sending them (e.g. processing
> >> pseudo headers when Tx checksum offload is enabled). But the tx_prepare
> >> callback of the bond driver is not implemented. Therefore, related
> >> offloads cannot be used unless the upper-layer users process the packets
> >> properly in their own application, which is bad for
> >> portability.
> >>
> >> However, it is difficult to design the tx_prepare callback for bonding
> >> driver. Because when a bonded device sends packets, the bonded device
> >> allocates the packets to different slave devices based on the real-time
> >> link status and bonding mode. That is, it is very difficult for the
> >> bonding device to determine which slave device's prepare function should
> >> be invoked. In addition, if the link status changes after the packets are
> >> prepared, the packets may fail to be sent because packets allocation may
> >> change.
> >>
> >> So, in this patch, the tx_prepare callback of bonding driver is not
> >> implemented. Instead, the rte_eth_dev_tx_prepare() will be called for
> >> all the fast path packet in mode 0, 1, 2, 4, 5, 6. In this way, all
> >> tx_offloads can be processed correctly for all NIC devices in these modes.
> >> If tx_prepare is not required in some cases, then slave PMDs tx_prepare
> >> pointer should be NULL and rte_eth_tx_prepare() will be just a NOOP.
> >> In these cases, the impact on performance will be very limited. It is
> >> the responsibility of the slave PMDs to decide when the real tx_prepare
> >> needs to be used. The information from dev_config/queue_setup is
> >> sufficient for them to make these decisions.
> >>
> >> Note:
> >> The rte_eth_tx_prepare is not added to bond mode 3(Broadcast). This is
> >> because in broadcast mode, a packet needs to be sent by all slave ports.
> >> Different PMDs process the packets differently in tx_prepare. As a result,
> >> the sent packet may be incorrect.
> >>
> >> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
> >> ---
> >>  drivers/net/bonding/rte_eth_bond.h     |  1 -
> >>  drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++----
> >>  2 files changed, 24 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
> >> index 874aa91..1e6cc6d 100644
> >> --- a/drivers/net/bonding/rte_eth_bond.h
> >> +++ b/drivers/net/bonding/rte_eth_bond.h
> >> @@ -343,7 +343,6 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
> >>  int
> >>  rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
> >>
> >> -
> >>  #ifdef __cplusplus
> >>  }
> >>  #endif
> >> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
> >> index 2e9cea5..84af348 100644
> >> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
> >> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
> >> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
> >>  	/* Send packet burst on each slave device */
> >>  	for (i = 0; i < num_of_slaves; i++) {
> >>  		if (slave_nb_pkts[i] > 0) {
> >> +			int nb_prep_pkts;
> >> +
> >> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
> >> +					bd_tx_q->queue_id, slave_bufs[i],
> >> +					slave_nb_pkts[i]);
> >> +
> >
> > Shouldn't it be called iff queue Tx offloads are not zero?
> > It will allow to decrease performance degradation if no
> > Tx offloads are enabled. Same in all cases below.
> 
> Regarding this point, it has been discussed in the previous RFC:
> https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/
> 
> According to the TX_OFFLOAD status of the current device, PMDs can determine
> whether tx_prepare is currently needed. If it is not needed, set pkt_tx_prepare
> to NULL, so that the actual tx_prepare processing will be skipped directly in
> rte_eth_tx_prepare().
> 
> >
> >>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> >> -					slave_bufs[i], slave_nb_pkts[i]);
> >> +					slave_bufs[i], nb_prep_pkts);
> >
> > In fact it is a problem here and really big problems.
> > Tx prepare may fail and return less packets. Tx prepare
> > of some packet may always fail. If application tries to
> > send packets in a loop until success, it will be a
> > forever loop here. Since application calls Tx burst,
> > it is 100% legal behaviour of the function to return 0
> > if Tx ring is full. It is not an error indication.
> > However, in the case of Tx prepare it is an error
> > indication.
Yes, that sounds like a problem and existing apps might be affected.
> >
> > Should we change Tx burst description and enforce callers
> > to check for rte_errno? It sounds like a major change...
> >
Agree, rte_errno for tx_burst() is probably the simplest and sanest way,
but yes, it is a change in behaviour and apps will need to be updated.
Another option for the bond PMD: just silently free mbufs for which prepare()
fails (and probably update some stats counter).
Again it is a change in behaviour, but now just for one PMD, with Tx offloads enabled.
Also, as far as I can see, some tx_burst() functions for that PMD already free packets silently:
bond_ethdev_tx_burst_alb(), bond_ethdev_tx_burst_broadcast().
Actually another question: why does the patch add tx_prepare() only to some
Tx modes but not all?
Is that intended?
> 
> I agree that if the failure is caused by Tx ring full, it is a legal behaviour.
> But what about the failure caused by other reasons? At present, it is possible
> for some PMDs to fail during tx_burst due to other reasons. In this case,
> repeated tries to send will also fail.
> 
> I'm not sure if all PMDs need to support the behavior of sending packets in a
> loop until it succeeds. If not, I think the current problem can be reminded to
> the user by adding a description to the bonding. If it is necessary, I think the
> description of tx_burst should also add related instructions, so that the developers
> of PMDs can better understand how tx_burst should be designed, such as putting all
> hardware-related constraint checks into tx_prepare. And another prerequisite for
> the above behavior is that the packets must be prepared (i.e. checked by
> rte_eth_tx_prepare()). Otherwise, it may also fail to send. This means that we have
> to use rte_eth_tx_prepare() in more scenarios.
> 
> What's Ferruh's opinion on this?
> 
> > [snip]
> >
> > .
> >
* Re: [dpdk-dev] [PATCH 2/2] net/bonding: support configuring Tx offloading for bonding
  2021-06-09  9:37           ` Andrew Rybchenko
@ 2021-06-10  6:29             ` Chengchang Tang
  2021-06-14 11:05               ` Ananyev, Konstantin
  0 siblings, 1 reply; 61+ messages in thread
From: Chengchang Tang @ 2021-06-10  6:29 UTC (permalink / raw)
  To: Andrew Rybchenko, Ananyev, Konstantin, dev
  Cc: linuxarm, chas3, humin29, Yigit, Ferruh
Hi, Andrew and Ananyev
On 2021/6/9 17:37, Andrew Rybchenko wrote:
> On 6/9/21 12:11 PM, Ananyev, Konstantin wrote:
>>
>>>
>>>
>>> On 2021/6/8 17:49, Andrew Rybchenko wrote:
>>>> "for bonding" is redundant in the summary since it is already
>>>> "net/bonding"
>>>>
>>>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>>>>> Currently, the TX offloading of the bonding device will not take effect by
>>>>
>>>> TX -> Tx
>>>>
>>>>> using dev_configure. Because the related configuration will not be
>>>>> delivered to the slave devices in this way.
>>>>
>>>> I think it is a major problem that Tx offloads are actually
>>>> ignored. It should be a patches with "Fixes:" which addresses
>>>> it.
>>>>
>>>>> The Tx offloading capability of the bonding device is the intersection of
>>>>> the capability of all slave devices. Based on this, the following functions
>>>>> are added to the bonding driver:
>>>>> 1. If a Tx offloading is within the capability of the bonding device (i.e.
>>>>> all the slave devices support this Tx offloading), the enabling status of
>>>>> the offloading of all slave devices depends on the configuration of the
>>>>> bonding device.
>>>>>
>>>>> 2. For the Tx offloading that is not within the Tx offloading capability
>>>>> of the bonding device, the enabling status of the offloading on the slave
>>>>> devices is irrelevant to the bonding device configuration. And it depends
>>>>> on the original configuration of the slave devices.
>>>>>
>>>>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>>>>> ---
>>>>>  drivers/net/bonding/rte_eth_bond_pmd.c | 13 +++++++++++++
>>>>>  1 file changed, 13 insertions(+)
>>>>>
>>>>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
>>>>> index 84af348..9922657 100644
>>>>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
>>>>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
>>>>> @@ -1712,6 +1712,8 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
>>>>>  	struct rte_flow_error flow_error;
>>>>>
>>>>>  	struct bond_dev_private *internals = bonded_eth_dev->data->dev_private;
>>>>> +	uint64_t tx_offload_cap = internals->tx_offload_capa;
>>>>> +	uint64_t tx_offload;
>>>>>
>>>>>  	/* Stop slave */
>>>>>  	errval = rte_eth_dev_stop(slave_eth_dev->data->port_id);
>>>>> @@ -1759,6 +1761,17 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
>>>>>  		slave_eth_dev->data->dev_conf.rxmode.offloads &=
>>>>>  				~DEV_RX_OFFLOAD_JUMBO_FRAME;
>>>>>
>>>>> +	while (tx_offload_cap != 0) {
>>>>> +		tx_offload = 1ULL << __builtin_ctzll(tx_offload_cap);
>>>>> +		if (bonded_eth_dev->data->dev_conf.txmode.offloads & tx_offload)
>>>>> +			slave_eth_dev->data->dev_conf.txmode.offloads |=
>>>>> +				tx_offload;
>>>>> +		else
>>>>> +			slave_eth_dev->data->dev_conf.txmode.offloads &=
>>>>> +				~tx_offload;
>>>>> +		tx_offload_cap &= ~tx_offload;
>>>>> +	}
>>>>> +
>>>>
>>>> Frankly speaking I don't understand why it is that complicated.
>>>> ethdev rejects of unsupported Tx offloads. So, can't we simply:
>>>> slave_eth_dev->data->dev_conf.txmode.offloads =
>>>>     bonded_eth_dev->data->dev_conf.txmode.offloads;
>>>>
>>>
>>> Using such a complicated method is to increase the flexibility of the slave devices,
>>> allowing the Tx offloading of the slave devices to be incompletely consistent with
>>> the bond device. If some offloading can be turned on without bond device awareness,
>>> they can be retained in this case.
>>
>>
>> Not sure how that can that happen...
> 
> +1
> 
> @Chengchang could you provide an example how it could happen.
> 
For example:
device 1 capability: VLAN_INSERT | MBUF_FAST_FREE
device 2 capability: VLAN_INSERT
The capability of the bonded device will then be VLAN_INSERT, so we can only
set VLAN_INSERT for the bonded device. But what if we want to enable
MBUF_FAST_FREE on device 1 to improve performance? As long as the application
can guarantee that the mbuf ref_cnt is 1, it can run normally with
MBUF_FAST_FREE turned on.
With my logic, if device 1 has been configured with MBUF_FAST_FREE and is then
added to the bonded device as a slave, MBUF_FAST_FREE will be preserved.
>> From my understanding tx_offload for bond device has to be intersection of tx_offloads
>> of all slaves, no? Otherwise bond device might be misconfigured.
>> Anyway for that code snippet above, wouldn't the same be achieved by:
>> slave_eth_dev->data->dev_conf.txmode.offloads &= internals->tx_offload_capa & bonded_eth_dev->data->dev_conf.txmode.offloads;
>> ?
> 
I think it will not achieve my purpose in the scenario I mentioned above.
> .
> 
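A minimal sketch of the difference between the two approaches, using the device 1 scenario from this message. The flag values and the function names `merge_loop()`, `merge_mask()` and `merge_and()` are hypothetical stand-ins chosen for illustration; the real flags are the DEV_TX_OFFLOAD_* bits. `merge_loop()` mirrors the loop in the patch, `merge_mask()` is an equivalent closed form, and `merge_and()` mirrors the single `&=` statement suggested in the quoted review.

```c
#include <stdint.h>

/* Hypothetical stand-ins for Tx offload flag bits. */
#define OFFLOAD_VLAN_INSERT    (1ULL << 0)
#define OFFLOAD_MBUF_FAST_FREE (1ULL << 16)

/*
 * Bit-by-bit merge as in the patch: bits inside bond_capa follow the
 * bonded device configuration, bits outside it are left untouched.
 */
static uint64_t
merge_loop(uint64_t slave, uint64_t bond_conf, uint64_t bond_capa)
{
	while (bond_capa != 0) {
		uint64_t bit = 1ULL << __builtin_ctzll(bond_capa);

		if (bond_conf & bit)
			slave |= bit;
		else
			slave &= ~bit;
		bond_capa &= ~bit;
	}
	return slave;
}

/* Closed form of the same merge, without the loop. */
static uint64_t
merge_mask(uint64_t slave, uint64_t bond_conf, uint64_t bond_capa)
{
	return (slave & ~bond_capa) | (bond_conf & bond_capa);
}

/* The single-statement alternative: slave &= capa & conf. */
static uint64_t
merge_and(uint64_t slave, uint64_t bond_conf, uint64_t bond_capa)
{
	return slave & bond_capa & bond_conf;
}
```

With the slave preconfigured to MBUF_FAST_FREE only, and both the bonded capability and bonded configuration equal to VLAN_INSERT, the first two functions yield VLAN_INSERT | MBUF_FAST_FREE, while the `&=` form yields 0: it can only clear bits, so it neither enables VLAN_INSERT on the slave nor keeps MBUF_FAST_FREE.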
* Re: [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding
  2021-06-09 10:25         ` Ananyev, Konstantin
@ 2021-06-10  6:46           ` Chengchang Tang
  2021-06-14 11:36             ` Ananyev, Konstantin
  0 siblings, 1 reply; 61+ messages in thread
From: Chengchang Tang @ 2021-06-10  6:46 UTC (permalink / raw)
  To: Ananyev, Konstantin, Andrew Rybchenko, dev
  Cc: linuxarm, chas3, humin29, Yigit, Ferruh
On 2021/6/9 18:25, Ananyev, Konstantin wrote:
>>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>>>> To use the HW offloads capability (e.g. checksum and TSO) in the Tx
>>>> direction, the upper-layer users need to call rte_eth_tx_prepare() to do
>>>> some adjustment to the packets before sending them (e.g. processing
>>>> pseudo headers when Tx checksum offload is enabled). But the tx_prepare
>>>> callback of the bond driver is not implemented. Therefore, related
>>>> offloads cannot be used unless the upper-layer users process the packets
>>>> properly in their own application, which is bad for
>>>> portability.
>>>>
>>>> However, it is difficult to design the tx_prepare callback for bonding
>>>> driver. Because when a bonded device sends packets, the bonded device
>>>> allocates the packets to different slave devices based on the real-time
>>>> link status and bonding mode. That is, it is very difficult for the
>>>> bonding device to determine which slave device's prepare function should
>>>> be invoked. In addition, if the link status changes after the packets are
>>>> prepared, the packets may fail to be sent because packets allocation may
>>>> change.
>>>>
>>>> So, in this patch, the tx_prepare callback of bonding driver is not
>>>> implemented. Instead, the rte_eth_dev_tx_prepare() will be called for
>>>> all the fast path packet in mode 0, 1, 2, 4, 5, 6. In this way, all
>>>> tx_offloads can be processed correctly for all NIC devices in these modes.
>>>> If tx_prepare is not required in some cases, then slave PMDs tx_prepare
>>>> pointer should be NULL and rte_eth_tx_prepare() will be just a NOOP.
>>>> In these cases, the impact on performance will be very limited. It is
>>>> the responsibility of the slave PMDs to decide when the real tx_prepare
>>>> needs to be used. The information from dev_config/queue_setup is
>>>> sufficient for them to make these decisions.
>>>>
>>>> Note:
>>>> The rte_eth_tx_prepare is not added to bond mode 3(Broadcast). This is
>>>> because in broadcast mode, a packet needs to be sent by all slave ports.
>>>> Different PMDs process the packets differently in tx_prepare. As a result,
>>>> the sent packet may be incorrect.
>>>>
>>>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>>>> ---
>>>>  drivers/net/bonding/rte_eth_bond.h     |  1 -
>>>>  drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++----
>>>>  2 files changed, 24 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
>>>> index 874aa91..1e6cc6d 100644
>>>> --- a/drivers/net/bonding/rte_eth_bond.h
>>>> +++ b/drivers/net/bonding/rte_eth_bond.h
>>>> @@ -343,7 +343,6 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
>>>>  int
>>>>  rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
>>>>
>>>> -
>>>>  #ifdef __cplusplus
>>>>  }
>>>>  #endif
>>>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
>>>> index 2e9cea5..84af348 100644
>>>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
>>>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
>>>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>>>>  	/* Send packet burst on each slave device */
>>>>  	for (i = 0; i < num_of_slaves; i++) {
>>>>  		if (slave_nb_pkts[i] > 0) {
>>>> +			int nb_prep_pkts;
>>>> +
>>>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
>>>> +					bd_tx_q->queue_id, slave_bufs[i],
>>>> +					slave_nb_pkts[i]);
>>>> +
>>>
>>> Shouldn't it be called iff queue Tx offloads are not zero?
>>> It will allow to decrease performance degradation if no
>>> Tx offloads are enabled. Same in all cases below.
>>
>> Regarding this point, it has been discussed in the previous RFC:
>> https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/
>>
>> According to the TX_OFFLOAD status of the current device, PMDs can determine
>> whether tx_prepare is currently needed. If it is not needed, set pkt_tx_prepare
>> to NULL, so that the actual tx_prepare processing will be skipped directly in
>> rte_eth_tx_prepare().
>>
>>>
>>>>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>> -					slave_bufs[i], slave_nb_pkts[i]);
>>>> +					slave_bufs[i], nb_prep_pkts);
>>>
>>> In fact it is a problem here and really big problems.
>>> Tx prepare may fail and return less packets. Tx prepare
>>> of some packet may always fail. If application tries to
>>> send packets in a loop until success, it will be a
>>> forever loop here. Since application calls Tx burst,
>>> it is 100% legal behaviour of the function to return 0
>>> if Tx ring is full. It is not an error indication.
>>> However, in the case of Tx prepare it is an error
>>> indication.
> 
> Yes, that sounds like a problem and existing apps might be affected.
> 
>>>
>>> Should we change Tx burst description and enforce callers
>>> to check for rte_errno? It sounds like a major change...
>>>
> 
> Agree, rte_errno for tx_burst() is probably a simplest and sanest way,
> but yes, it is a change in behaviour and apps will need to be updated.  
> Another option for bond PMD - just silently free mbufs for which prepare()
> fails (and probably update some stats counter).
> Again it is a change in behaviour, but now just for one PMD, with tx offloads enabled.
> Also as, I can see some tx_burst() function for that PMD already free packets silently:
> bond_ethdev_tx_burst_alb(), bond_ethdev_tx_burst_broadcast().
> 
> Actually another question - why the patch adds tx_prepare() only to some
> TX modes but not all?
> Is that itended? 
> 
Yes. Currently, I have no idea how to perform tx_prepare() in broadcast mode with limited
impact on performance. In broadcast mode, the same packets are sent on several devices.
In this process, we only update the ref_cnt of the mbufs; the packets are not copied. As we know,
tx_prepare() may change the data, so it may cause problems if we perform tx_prepare()
several times on the same packet.
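A small sketch of why repeated prepare on a shared buffer is a concern. Everything here is hypothetical: `struct mock_mbuf`, the two `prepare_pmd_*()` functions and the checksum values only model the idea that two different PMDs may want different contents in the same field of a buffer that is shared by reference count rather than copied.

```c
#include <stdint.h>

/* Hypothetical buffer shared between slaves through a reference count. */
struct mock_mbuf {
	uint16_t refcnt;
	uint16_t l4_cksum; /* checksum field inside the packet data */
};

#define PSEUDO_HDR_CKSUM 0x1234 /* arbitrary value for illustration */

/* Hypothetical PMD A: expects the pseudo-header checksum in the field. */
static void
prepare_pmd_a(struct mock_mbuf *m)
{
	m->l4_cksum = PSEUDO_HDR_CKSUM;
}

/* Hypothetical PMD B: expects the field to be zeroed. */
static void
prepare_pmd_b(struct mock_mbuf *m)
{
	m->l4_cksum = 0;
}

/*
 * Broadcast without copying: only the reference count is bumped, then
 * each slave prepares the same data. Returns the field value that both
 * slaves will end up transmitting.
 */
static uint16_t
broadcast_prepare_shared(struct mock_mbuf *m)
{
	m->refcnt += 1;   /* second slave holds a reference, not a copy */
	prepare_pmd_a(m);
	prepare_pmd_b(m); /* overwrites what PMD A just wrote */
	return m->l4_cksum;
}
```

In this model the value left in the buffer is whatever the last prepare wrote, so the first slave would transmit data prepared for a different device. Avoiding that would need a per-slave copy of the modified data, which is the performance cost mentioned above.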
>>
>> I agree that if the failure is caused by Tx ring full, it is a legal behaviour.
>> But what about the failure caused by other reasons? At present, it is possible
>> for some PMDs to fail during tx_burst due to other reasons. In this case,
>> repeated tries to send will also fail.
>>
>> I'm not sure if all PMDs need to support the behavior of sending packets in a
>> loop until it succeeds. If not, I think the current problem can be reminded to
>> the user by adding a description to the bonding. If it is necessary, I think the
>> description of tx_burst should also add related instructions, so that the developers
>> of PMDs can better understand how tx_burst should be designed, such as putting all
>> hardware-related constraint checks into tx_prepare. And another prerequisite for
>> the above behavior is that the packets must be prepared (i.e. checked by
>> rte_eth_tx_prepare()). Otherwise, it may also fail to send. This means that we have
>> to use rte_eth_tx_prepare() in more scenarios.
>>
>> What's Ferruh's opinion on this?
>>
>>> [snip]
>>>
>>> .
>>>
> 
* Re: [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding
  2021-06-09  9:35         ` Andrew Rybchenko
@ 2021-06-10  7:32           ` Chengchang Tang
  2021-06-14 14:16             ` Andrew Rybchenko
  0 siblings, 1 reply; 61+ messages in thread
From: Chengchang Tang @ 2021-06-10  7:32 UTC (permalink / raw)
  To: Andrew Rybchenko, dev
  Cc: linuxarm, chas3, humin29, ferruh.yigit, konstantin.ananyev
On 2021/6/9 17:35, Andrew Rybchenko wrote:
> On 6/9/21 9:42 AM, Chengchang Tang wrote:
>> Hi, Andrew and Ferruh
>>
>> On 2021/6/8 17:49, Andrew Rybchenko wrote:
>>> "for bonding" is redundant in the summary since it is already "net/bonding".
>>>
>>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>>>> To use the HW offloads capability (e.g. checksum and TSO) in the Tx
>>>> direction, the upper-layer users need to call rte_eth_tx_prepare() to do
>>>> some adjustment to the packets before sending them (e.g. processing
>>>> pseudo headers when Tx checksum offload is enabled). But the tx_prepare
>>>> callback of the bond driver is not implemented. Therefore, related
>>>> offloads cannot be used unless the upper-layer users process the packets
>>>> properly in their own application, which is bad for
>>>> portability.
>>>>
>>>> However, it is difficult to design the tx_prepare callback for bonding
>>>> driver. Because when a bonded device sends packets, the bonded device
>>>> allocates the packets to different slave devices based on the real-time
>>>> link status and bonding mode. That is, it is very difficult for the
>>>> bonding device to determine which slave device's prepare function should
>>>> be invoked. In addition, if the link status changes after the packets are
>>>> prepared, the packets may fail to be sent because packets allocation may
>>>> change.
>>>>
>>>> So, in this patch, the tx_prepare callback of bonding driver is not
>>>> implemented. Instead, the rte_eth_dev_tx_prepare() will be called for
>>>> all the fast path packet in mode 0, 1, 2, 4, 5, 6. In this way, all
>>>> tx_offloads can be processed correctly for all NIC devices in these modes.
>>>> If tx_prepare is not required in some cases, then slave PMDs tx_prepare
>>>> pointer should be NULL and rte_eth_tx_prepare() will be just a NOOP.
>>>> In these cases, the impact on performance will be very limited. It is
>>>> the responsibility of the slave PMDs to decide when the real tx_prepare
>>>> needs to be used. The information from dev_config/queue_setup is
>>>> sufficient for them to make these decisions.
>>>>
>>>> Note:
>>>> The rte_eth_tx_prepare is not added to bond mode 3(Broadcast). This is
>>>> because in broadcast mode, a packet needs to be sent by all slave ports.
>>>> Different PMDs process the packets differently in tx_prepare. As a result,
>>>> the sent packet may be incorrect.
>>>>
>>>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>>>> ---
>>>>  drivers/net/bonding/rte_eth_bond.h     |  1 -
>>>>  drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++----
>>>>  2 files changed, 24 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
>>>> index 874aa91..1e6cc6d 100644
>>>> --- a/drivers/net/bonding/rte_eth_bond.h
>>>> +++ b/drivers/net/bonding/rte_eth_bond.h
>>>> @@ -343,7 +343,6 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
>>>>  int
>>>>  rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
>>>>
>>>> -
>>>>  #ifdef __cplusplus
>>>>  }
>>>>  #endif
>>>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
>>>> index 2e9cea5..84af348 100644
>>>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
>>>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
>>>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>>>>  	/* Send packet burst on each slave device */
>>>>  	for (i = 0; i < num_of_slaves; i++) {
>>>>  		if (slave_nb_pkts[i] > 0) {
>>>> +			int nb_prep_pkts;
>>>> +
>>>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
>>>> +					bd_tx_q->queue_id, slave_bufs[i],
>>>> +					slave_nb_pkts[i]);
>>>> +
>>>
>>> Shouldn't it be called iff queue Tx offloads are not zero?
>>> It will allow to decrease performance degradation if no
>>> Tx offloads are enabled. Same in all cases below.
>>
>> Regarding this point, it has been discussed in the previous RFC:
>> https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/
>>
>> According to the TX_OFFLOAD status of the current device, PMDs can determine
>> whether tx_prepare is currently needed. If it is not needed, set pkt_tx_prepare
>> to NULL, so that the actual tx_prepare processing will be skipped directly in
>> rte_eth_tx_prepare().
> 
> I still think that the following is right:
> No Tx offloads at all => Tx prepare is not necessary
> 
> Am I wrong?
> 
Letting PMDs determine whether tx_prepare() needs to be done could reduce the performance
loss in more scenarios. For example, some offloads do not need Tx prepare, and PMDs
could set tx_prepare to NULL in this scenario. Even if rte_eth_tx_prepare() is called,
it will not invoke the tx_prepare callback and simply returns. In this case, there is
only one check. If we also check whether tx_offloads is non-zero, one more check
is added.
Of course, some PMDs currently do not optimize tx_prepare, which may have a performance
impact. We hope to push them to optimize tx_prepare in this way, just as they optimize
tx_burst. This makes it easier for users to use tx_prepare(); they no longer need to
worry that using tx_prepare() will introduce unnecessary performance degradation.
IMHO tx_prepare() should be usable in all scenarios, and the impact on
performance should be optimized by PMDs. If the application has to decide when it should be
used and when it should not, in many cases it will not be used, which then introduces
problems.
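The argument above can be sketched as follows. This is a simplified model, not ethdev code: `struct mock_dev`, `mock_dev_configure()`, `mock_tx_prepare()` and the two offload bits are hypothetical names. It only illustrates the idea that the driver clears its prepare callback at configure time when none of the enabled offloads needs preparation, so the wrapper costs a single NULL check on the fast path.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical offload bits: one needs Tx prepare, one does not. */
#define OFFLOAD_NEEDS_PREP (1ULL << 0)
#define OFFLOAD_NO_PREP    (1ULL << 1)

typedef uint16_t (*tx_prepare_cb)(void *txq, void **pkts, uint16_t nb);

struct mock_dev {
	tx_prepare_cb tx_pkt_prepare; /* NULL when nothing needs preparing */
	uint64_t tx_offloads;
};

static unsigned int prepare_calls; /* counts real prepare invocations */

static uint16_t
real_prepare(void *txq, void **pkts, uint16_t nb)
{
	(void)txq;
	(void)pkts;
	prepare_calls++;
	return nb;
}

/* Driver side: decide once, at configure time, whether prepare is needed. */
static void
mock_dev_configure(struct mock_dev *dev, uint64_t tx_offloads)
{
	dev->tx_offloads = tx_offloads;
	dev->tx_pkt_prepare =
		(tx_offloads & OFFLOAD_NEEDS_PREP) ? real_prepare : NULL;
}

/* Wrapper side: a NULL callback makes the call a no-op. */
static uint16_t
mock_tx_prepare(struct mock_dev *dev, void *txq, void **pkts, uint16_t nb)
{
	if (dev->tx_pkt_prepare == NULL)
		return nb;
	return dev->tx_pkt_prepare(txq, pkts, nb);
}
```

Note that in this model a device with only OFFLOAD_NO_PREP enabled has non-zero offloads yet still skips prepare, which a plain "offloads != 0" test in the caller could not express.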
>>>
>>>>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>> -					slave_bufs[i], slave_nb_pkts[i]);
>>>> +					slave_bufs[i], nb_prep_pkts);
>>>
>>> In fact it is a problem here and really big problems.
>>> Tx prepare may fail and return less packets. Tx prepare
>>> of some packet may always fail. If application tries to
>>> send packets in a loop until success, it will be a
>>> forever loop here. Since application calls Tx burst,
>>> it is 100% legal behaviour of the function to return 0
>>> if Tx ring is full. It is not an error indication.
>>> However, in the case of Tx prepare it is an error
>>> indication.
>>>
>>> Should we change Tx burst description and enforce callers
>>> to check for rte_errno? It sounds like a major change...
>>>
>>
>> I agree that if the failure is caused by Tx ring full, it is a legal behaviour.
>> But what about the failure caused by other reasons? At present, it is possible
>> for some PMDs to fail during tx_burst due to other reasons. In this case,
>> repeated tries to send will also fail.
> 
> If so, packet should be simply dropped by Tx burst and Tx burst
> should move on. If a packet cannot be transmitted, it must be
> dropped (counted) and Tx burst should move to the next packet.
> 
>> I'm not sure if all PMDs need to support the behavior of sending packets in a
>> loop until it succeeds. If not, I think the current problem can be reminded to
>> the user by adding a description to the bonding. If it is necessary, I think the
>> description of tx_burst should also add related instructions, so that the developers
>> of PMDs can better understand how tx_burst should be designed, such as putting all
>> hardware-related constraint checks into tx_prepare. And another prerequisite for
>> the above behavior is that the packets must be prepared (i.e. checked by
>> rte_eth_tx_prepare()). Otherwise, it may also fail to send. This means that we have
>> to use rte_eth_tx_prepare() in more scenarios.
> 
> IMHO any PMD specific behaviour is a nightmare to application
> developer and must be avoided. Ideally application should not
> care if it is running on top of tap, virtio, failsafe or
> bonding. It should talk to ethdev in terms of ethdev API that's
> it. I know that net/bonding is designed that application should
> know about it, but IMHO the places where it requires the
> knowledge must be minimized to make applications more portable
> across various PMDs/HW.
> 
> I think that the only sensible solution for above problem is
> to skip a packet which prepare dislikes. count it as dropped
> and try to prepare/transmit subsequent packets.
Agree, I will fix this in the next version.
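A minimal sketch of the skip-and-drop handling agreed above, written against stand-ins rather than the real ethdev types. struct pkt, mock_tx_prepare() and prepare_skip_bad() are hypothetical names, not bonding PMD code. The sketch assumes the prepare call returns the count of leading packets that passed, so the packet at that index is the one that failed; that is my understanding of how rte_eth_tx_prepare() is documented to behave.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf; only what the sketch needs. */
struct pkt {
	int valid;	/* 0 means the mock prepare rejects this packet */
	int freed;	/* set to 1 when the packet is dropped */
};

/*
 * Mock with the assumed contract of rte_eth_tx_prepare(): returns the
 * number of leading packets that passed, so pkts[ret] is the first
 * packet that failed (if ret < nb_pkts).
 */
static uint16_t
mock_tx_prepare(struct pkt **pkts, uint16_t nb_pkts)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		if (!pkts[i]->valid)
			break;
	return i;
}

/*
 * Prepare a burst, dropping every packet that prepare rejects and
 * compacting the array in place. Returns the number of packets left
 * at the front of pkts[]; *dropped is incremented once per drop.
 */
static uint16_t
prepare_skip_bad(struct pkt **pkts, uint16_t nb_pkts, uint64_t *dropped)
{
	uint16_t done = 0;	/* packets prepared and kept so far */
	uint16_t next = 0;	/* next packet not yet looked at */

	assert(pkts != NULL && dropped != NULL);

	while (next < nb_pkts) {
		uint16_t ok = mock_tx_prepare(&pkts[next],
				(uint16_t)(nb_pkts - next));
		uint16_t i;

		/* Slide the good packets down over holes left by drops. */
		for (i = 0; i < ok; i++)
			pkts[done + i] = pkts[next + i];
		done = (uint16_t)(done + ok);
		next = (uint16_t)(next + ok);

		if (next < nb_pkts) {
			/* pkts[next] failed: free it, count it, move on. */
			pkts[next]->freed = 1;
			(*dropped)++;
			next++;
		}
	}
	return done;
}
```

The return value is what would then be handed to the Tx burst call, and the dropped count would feed whatever stats counter the PMD uses. With a burst of five packets where the 2nd and 4th are rejected, this returns 3 and counts 2 drops.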
> 
> It is an interesting effect of the Tx prepare just before
> Tx burst inside bonding PMD. If Tx burst fails to send
> something because ring is full, a number of packets will
> be processed by Tx prepare again and again. I guess it is
> unavoidable.
> 
>> What's Ferruh's opinion on this?
>>
>>> [snip]
> 
> 
> .
> 
* Re: [dpdk-dev] [PATCH 2/2] net/bonding: support configuring Tx offloading for bonding
  2021-06-10  6:29             ` Chengchang Tang
@ 2021-06-14 11:05               ` Ananyev, Konstantin
  2021-06-14 14:13                 ` Andrew Rybchenko
  0 siblings, 1 reply; 61+ messages in thread
From: Ananyev, Konstantin @ 2021-06-14 11:05 UTC (permalink / raw)
  To: Chengchang Tang, Andrew Rybchenko, dev
  Cc: linuxarm, chas3, humin29, Yigit, Ferruh
> Hi, Andrew and Ananyev
> 
> On 2021/6/9 17:37, Andrew Rybchenko wrote:
> > On 6/9/21 12:11 PM, Ananyev, Konstantin wrote:
> >>
> >>>
> >>>
> >>> On 2021/6/8 17:49, Andrew Rybchenko wrote:
> >>>> "for bonding" is redundant in the summary since it is already
> >>>> "net/bonding"
> >>>>
> >>>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
> >>>>> Currently, the TX offloading of the bonding device will not take effect by
> >>>>
> >>>> TX -> Tx
> >>>>
> >>>>> using dev_configure. Because the related configuration will not be
> >>>>> delivered to the slave devices in this way.
> >>>>
> >>>> I think it is a major problem that Tx offloads are actually
> >>>> ignored. It should be a patches with "Fixes:" which addresses
> >>>> it.
> >>>>
> >>>>> The Tx offloading capability of the bonding device is the intersection of
> >>>>> the capability of all slave devices. Based on this, the following functions
> >>>>> are added to the bonding driver:
> >>>>> 1. If a Tx offloading is within the capability of the bonding device (i.e.
> >>>>> all the slave devices support this Tx offloading), the enabling status of
> >>>>> the offloading of all slave devices depends on the configuration of the
> >>>>> bonding device.
> >>>>>
> >>>>> 2. For the Tx offloading that is not within the Tx offloading capability
> >>>>> of the bonding device, the enabling status of the offloading on the slave
> >>>>> devices is irrelevant to the bonding device configuration. And it depends
> >>>>> on the original configuration of the slave devices.
> >>>>>
> >>>>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
> >>>>> ---
> >>>>>  drivers/net/bonding/rte_eth_bond_pmd.c | 13 +++++++++++++
> >>>>>  1 file changed, 13 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
> >>>>> index 84af348..9922657 100644
> >>>>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
> >>>>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
> >>>>> @@ -1712,6 +1712,8 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
> >>>>>  	struct rte_flow_error flow_error;
> >>>>>
> >>>>>  	struct bond_dev_private *internals = bonded_eth_dev->data->dev_private;
> >>>>> +	uint64_t tx_offload_cap = internals->tx_offload_capa;
> >>>>> +	uint64_t tx_offload;
> >>>>>
> >>>>>  	/* Stop slave */
> >>>>>  	errval = rte_eth_dev_stop(slave_eth_dev->data->port_id);
> >>>>> @@ -1759,6 +1761,17 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
> >>>>>  		slave_eth_dev->data->dev_conf.rxmode.offloads &=
> >>>>>  				~DEV_RX_OFFLOAD_JUMBO_FRAME;
> >>>>>
> >>>>> +	while (tx_offload_cap != 0) {
> >>>>> +		tx_offload = 1ULL << __builtin_ctzll(tx_offload_cap);
> >>>>> +		if (bonded_eth_dev->data->dev_conf.txmode.offloads & tx_offload)
> >>>>> +			slave_eth_dev->data->dev_conf.txmode.offloads |=
> >>>>> +				tx_offload;
> >>>>> +		else
> >>>>> +			slave_eth_dev->data->dev_conf.txmode.offloads &=
> >>>>> +				~tx_offload;
> >>>>> +		tx_offload_cap &= ~tx_offload;
> >>>>> +	}
> >>>>> +
> >>>>
> >>>> Frankly speaking I don't understand why it is that complicated.
> >>>> ethdev rejects of unsupported Tx offloads. So, can't we simply:
> >>>> slave_eth_dev->data->dev_conf.txmode.offloads =
> >>>>     bonded_eth_dev->data->dev_conf.txmode.offloads;
> >>>>
> >>>
> >>> Using such a complicated method is to increase the flexibility of the slave devices,
> >>> allowing the Tx offloading of the slave devices to be incompletely consistent with
> >>> the bond device. If some offloading can be turned on without bond device awareness,
> >>> they can be retained in this case.
> >>
> >>
> >> Not sure how that can that happen...
> >
> > +1
> >
> > @Chengchang could you provide an example how it could happen.
> >
> 
> For example:
> device 1 capability: VLAN_INSERT | MBUF_FAST_FREE
> device 2 capability: VLAN_INSERT
> And the capability of bonded device will be VLAN_INSERT.
> So, we can only set VLAN_INSERT for the bonded device. So what if we want to enable
> MBUF_FAST_FREE in device 1 to improve performance? For the application, as long as it
> can guarantee the condition of MBUF ref_cnt = 1, then it can run normally if
> MBUF_FAST_FREE is turned on.
> 
> In my logic, if device 1 has been configured with MBUF_FAST_FREE and is then
> added to the bonded device as a slave, MBUF_FAST_FREE will be preserved.
So your intention is to allow a slave device to silently overrule the master's tx_offload settings?
If so, I don't think it is a good idea - it sounds like a potentially bogus and error-prone approach.
Second thing - I still don't see how the code above can help you with it.
From what I read in your code - you clear tx_offload bits that are not supported by the master.
> 
> >> From my understanding tx_offload for bond device has to be intersection of tx_offloads
> >> of all slaves, no? Otherwise bond device might be misconfigured.
> >> Anyway for that code snippet above, wouldn't the same be achieved by:
> >> slave_eth_dev->data->dev_conf.txmode.offloads &= internals->tx_offload_capa & bonded_eth_dev->data->dev_conf.txmode.offloads;
> >> ?
> >
> 
> I think it will not achieve my purpose in the scenario I mentioned above.
> 
> > .
> >
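For reference, a standalone sketch of the mask arithmetic under discussion. The offload bit values are hypothetical placeholders (not the real DEV_TX_OFFLOAD_* constants), and the function names are made up for illustration; only the loop body is taken from the patch. It compares the loop with the one-liner suggested in the review, using the scenario from the thread (device 1 already has MBUF_FAST_FREE set, the bond offers and enables only VLAN_INSERT).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define OFF_VLAN_INSERT		(1ULL << 0)
#define OFF_MBUF_FAST_FREE	(1ULL << 1)

/* The loop from the patch, lifted out of slave_configure(). */
static uint64_t
loop_from_patch(uint64_t slave, uint64_t bond_conf, uint64_t bond_capa)
{
	uint64_t capa = bond_capa;

	while (capa != 0) {
		/* Lowest capability bit not yet handled. */
		uint64_t bit = 1ULL << __builtin_ctzll(capa);

		if (bond_conf & bit)
			slave |= bit;
		else
			slave &= ~bit;
		capa &= ~bit;
	}
	return slave;
}

/* Loop-free form: copy bond bits inside capa, keep slave bits outside. */
static uint64_t
closed_form(uint64_t slave, uint64_t bond_conf, uint64_t bond_capa)
{
	return (slave & ~bond_capa) | (bond_conf & bond_capa);
}

/* The "&=" one-liner suggested in the review. */
static uint64_t
and_mask(uint64_t slave, uint64_t bond_conf, uint64_t bond_capa)
{
	return slave & (bond_capa & bond_conf);
}

/* Exhaustive comparison of loop and closed form over all 3-bit masks. */
static int
forms_agree(void)
{
	uint64_t s, b, c;

	for (s = 0; s < 8; s++)
		for (b = 0; b < 8; b++)
			for (c = 0; c < 8; c++)
				if (loop_from_patch(s, b, c) !=
				    closed_form(s, b, c))
					return 0;
	return 1;
}

/* Scenario from the thread. */
static void
check_thread_scenario(void)
{
	uint64_t capa = OFF_VLAN_INSERT;	/* intersection of slaves */
	uint64_t bond = OFF_VLAN_INSERT;	/* enabled on the bond */
	uint64_t slave = OFF_MBUF_FAST_FREE;	/* preset on device 1 */

	/* Loop: sets VLAN_INSERT and leaves MBUF_FAST_FREE alone. */
	assert(loop_from_patch(slave, bond, capa) ==
	       (OFF_VLAN_INSERT | OFF_MBUF_FAST_FREE));
	/* "&=" can only clear bits, so nothing is left enabled here. */
	assert(and_mask(slave, bond, capa) == 0);
}
```

Under these assumptions the loop only touches bits inside the bond capability and leaves bits outside it as the slave had them, whereas the "&=" form can never set a bit on the slave. Whether retaining the slave-only bits is desirable is the design question above; the sketch only shows what each expression computes.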
* Re: [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding
  2021-06-10  6:46           ` Chengchang Tang
@ 2021-06-14 11:36             ` Ananyev, Konstantin
  0 siblings, 0 replies; 61+ messages in thread
From: Ananyev, Konstantin @ 2021-06-14 11:36 UTC (permalink / raw)
  To: Chengchang Tang, Andrew Rybchenko, dev
  Cc: linuxarm, chas3, humin29, Yigit, Ferruh
> On 2021/6/9 18:25, Ananyev, Konstantin wrote:
> >>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
> >>>> To use the HW offloads capability (e.g. checksum and TSO) in the Tx
> >>>> direction, the upper-layer users need to call rte_eth_tx_prepare to do
> >>>> some adjustment to the packets before sending them (e.g. processing
> >>>> pseudo headers when Tx checksum offload is enabled). But the tx_prepare
> >>>> callback of the bond driver is not implemented. Therefore, related
> >>>> offloads cannot be used unless the upper-layer users process the packet
> >>>> properly in their own application. But this is bad for
> >>>> portability.
> >>>>
> >>>> However, it is difficult to design the tx_prepare callback for bonding
> >>>> driver. Because when a bonded device sends packets, the bonded device
> >>>> allocates the packets to different slave devices based on the real-time
> >>>> link status and bonding mode. That is, it is very difficult for the
> >>>> bonding device to determine which slave device's prepare function should
> >>>> be invoked. In addition, if the link status changes after the packets are
> >>>> prepared, the packets may fail to be sent because packets allocation may
> >>>> change.
> >>>>
> >>>> So, in this patch, the tx_prepare callback of bonding driver is not
> >>>> implemented. Instead, rte_eth_tx_prepare() will be called for
> >>>> all the fast path packets in modes 0, 1, 2, 4, 5, 6. In this way, all
> >>>> tx_offloads can be processed correctly for all NIC devices in these modes.
> >>>> If tx_prepare is not required in some cases, then slave PMDs tx_prepare
> >>>> pointer should be NULL and rte_eth_tx_prepare() will be just a NOOP.
> >>>> In these cases, the impact on performance will be very limited. It is
> >>>> the responsibility of the slave PMDs to decide when the real tx_prepare
> >>>> needs to be used. The information from dev_config/queue_setup is
> >>>> sufficient for them to make these decisions.
> >>>>
> >>>> Note:
> >>>> The rte_eth_tx_prepare is not added to bond mode 3(Broadcast). This is
> >>>> because in broadcast mode, a packet needs to be sent by all slave ports.
> >>>> Different PMDs process the packets differently in tx_prepare. As a result,
> >>>> the sent packet may be incorrect.
> >>>>
> >>>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
> >>>> ---
> >>>>  drivers/net/bonding/rte_eth_bond.h     |  1 -
> >>>>  drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++----
> >>>>  2 files changed, 24 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
> >>>> index 874aa91..1e6cc6d 100644
> >>>> --- a/drivers/net/bonding/rte_eth_bond.h
> >>>> +++ b/drivers/net/bonding/rte_eth_bond.h
> >>>> @@ -343,7 +343,6 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
> >>>>  int
> >>>>  rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
> >>>>
> >>>> -
> >>>>  #ifdef __cplusplus
> >>>>  }
> >>>>  #endif
> >>>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
> >>>> index 2e9cea5..84af348 100644
> >>>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
> >>>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
> >>>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
> >>>>  	/* Send packet burst on each slave device */
> >>>>  	for (i = 0; i < num_of_slaves; i++) {
> >>>>  		if (slave_nb_pkts[i] > 0) {
> >>>> +			int nb_prep_pkts;
> >>>> +
> >>>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
> >>>> +					bd_tx_q->queue_id, slave_bufs[i],
> >>>> +					slave_nb_pkts[i]);
> >>>> +
> >>>
> >>> Shouldn't it be called iff queue Tx offloads are not zero?
> >>> It will allow to decrease performance degradation if no
> >>> Tx offloads are enabled. Same in all cases below.
> >>
> >> Regarding this point, it has been discussed in the previous RFC:
> >> https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/
> >>
> >> According to the TX_OFFLOAD status of the current device, PMDs can determine
> >> whether tx_prepare is currently needed. If it is not needed, set pkt_tx_prepare
> >> to NULL, so that the actual tx_prepare processing will be skipped directly in
> >> rte_eth_tx_prepare().
> >>
> >>>
> >>>>  			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> >>>> -					slave_bufs[i], slave_nb_pkts[i]);
> >>>> +					slave_bufs[i], nb_prep_pkts);
> >>>
> >>> In fact it is a problem here and really big problems.
> >>> Tx prepare may fail and return less packets. Tx prepare
> >>> of some packet may always fail. If application tries to
> >>> send packets in a loop until success, it will be a
> >>> forever loop here. Since application calls Tx burst,
> >>> it is 100% legal behaviour of the function to return 0
> >>> if Tx ring is full. It is not an error indication.
> >>> However, in the case of Tx prepare it is an error
> >>> indication.
> >
> > Yes, that sounds like a problem and existing apps might be affected.
> >
> >>>
> >>> Should we change Tx burst description and enforce callers
> >>> to check for rte_errno? It sounds like a major change...
> >>>
> >
> > Agree, rte_errno for tx_burst() is probably a simplest and sanest way,
> > but yes, it is a change in behaviour and apps will need to be updated.
> > Another option for bond PMD - just silently free mbufs for which prepare()
> > fails (and probably update some stats counter).
> > Again it is a change in behaviour, but now just for one PMD, with tx offloads enabled.
> > Also as, I can see some tx_burst() function for that PMD already free packets silently:
> > bond_ethdev_tx_burst_alb(), bond_ethdev_tx_burst_broadcast().
> >
> > Actually another question - why does the patch add tx_prepare() only to some
> > Tx modes but not all?
> > Is that intended?
> >
> 
> Yes. Currently, I have no idea how to perform tx_prepare() in broadcast mode with limited
> impact on performance. In broadcast mode, the same packets are sent on several devices.
> In this process, we only update the ref_cnt of the mbufs; the packets are not copied.
> As we know, tx_prepare() may change the data, so it may cause problems if we perform
> tx_prepare() several times on the same packet.
You mean tx_prepare() for the second dev can invalidate changes made by tx_prepare() for the first dev?
I suppose in theory it is possible, even if it is probably not the case right now in practice
(at least I am not aware of such cases).
Actually that's an interesting topic - the same can happen even with a user implementing multicast
on his own (see examples/ipv4_multicast/).
I think these new limitations have to be documented clearly (at least).
Also, we probably need extra changes for the bond device dev_configure()/dev_get_info():
to check the currently selected mode and, based on that, allow/reject tx offloads.
The question arises (again): how to figure out for which tx offloads dev->tx_prepare()
modifies the packet, and for which it does not?
Any thoughts here?
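To make the dev_configure() idea concrete, here is a rough sketch of what such a mode-based check could look like. Everything in it is hypothetical: bond_tx_offloads_allowed() is not an existing function, MODE_BROADCAST is a local constant standing in for bond mode 3, and the prepare_modifies mask is exactly the piece of information that the question above says we do not know how to obtain today.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local stand-in for bond mode 3 (Broadcast). */
#define MODE_BROADCAST	3

/*
 * Hypothetical helper: decide whether the requested Tx offloads can be
 * accepted for the selected bonding mode.
 *
 * prepare_modifies is an assumed input: the set of offloads for which
 * some slave's tx_prepare() rewrites packet data. In broadcast mode the
 * same mbuf is shared by all slaves, so such offloads are rejected.
 *
 * Returns 0 if the configuration is acceptable, -ENOTSUP otherwise.
 */
static int
bond_tx_offloads_allowed(int mode, uint64_t requested,
		uint64_t prepare_modifies)
{
	assert(mode >= 0 && mode <= 6);

	if (mode == MODE_BROADCAST && (requested & prepare_modifies) != 0)
		return -ENOTSUP;
	return 0;
}
```

The same mask could in principle be used in dev_get_info() to hide those offloads from the reported capabilities while broadcast mode is selected, but that again depends on having a reliable way to build the mask.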
> 
> >>
> >> I agree that if the failure is caused by Tx ring full, it is a legal behaviour.
> >> But what about the failure caused by other reasons? At present, it is possible
> >> for some PMDs to fail during tx_burst due to other reasons. In this case,
> >> repeated tries to send will also fail.
> >>
> >> I'm not sure if all PMDs need to support the behavior of sending packets in a
> >> loop until it succeeds. If not, I think the current problem can be reminded to
> >> the user by adding a description to the bonding. If it is necessary, I think the
> >> description of tx_burst should also add related instructions, so that the developers
> >> of PMDs can better understand how tx_burst should be designed, such as putting all
> >> hardware-related constraint checks into tx_prepare. And another prerequisite for
> >> the above behavior is that the packets must be prepared (i.e. checked by
> >> rte_eth_tx_prepare()). Otherwise, it may also fail to send. This means that we have
> >> to use rte_eth_tx_prepare() in more scenarios.
> >>
> >> What's Ferruh's opinion on this?
> >>
> >>> [snip]
> >>>
> >>> .
> >>>
> >
* Re: [dpdk-dev] [PATCH 2/2] net/bonding: support configuring Tx offloading for bonding
  2021-06-14 11:05               ` Ananyev, Konstantin
@ 2021-06-14 14:13                 ` Andrew Rybchenko
  0 siblings, 0 replies; 61+ messages in thread
From: Andrew Rybchenko @ 2021-06-14 14:13 UTC (permalink / raw)
  To: Ananyev, Konstantin, Chengchang Tang, dev
  Cc: linuxarm, chas3, humin29, Yigit, Ferruh
On 6/14/21 2:05 PM, Ananyev, Konstantin wrote:
> 
> 
>> Hi, Andrew and Ananyev
>>
>> On 2021/6/9 17:37, Andrew Rybchenko wrote:
>>> On 6/9/21 12:11 PM, Ananyev, Konstantin wrote:
>>>>
>>>>>
>>>>>
>>>>> On 2021/6/8 17:49, Andrew Rybchenko wrote:
>>>>>> "for bonding" is redundant in the summary since it is already
>>>>>> "net/bonding"
>>>>>>
>>>>>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>>>>>>> Currently, the TX offloading of the bonding device will not take effect by
>>>>>>
>>>>>> TX -> Tx
>>>>>>
>>>>>>> using dev_configure. Because the related configuration will not be
>>>>>>> delivered to the slave devices in this way.
>>>>>>
>>>>>> I think it is a major problem that Tx offloads are actually
>>>>>> ignored. It should be a patches with "Fixes:" which addresses
>>>>>> it.
>>>>>>
>>>>>>> The Tx offloading capability of the bonding device is the intersection of
>>>>>>> the capability of all slave devices. Based on this, the following functions
>>>>>>> are added to the bonding driver:
>>>>>>> 1. If a Tx offloading is within the capability of the bonding device (i.e.
>>>>>>> all the slave devices support this Tx offloading), the enabling status of
>>>>>>> the offloading of all slave devices depends on the configuration of the
>>>>>>> bonding device.
>>>>>>>
>>>>>>> 2. For the Tx offloading that is not within the Tx offloading capability
>>>>>>> of the bonding device, the enabling status of the offloading on the slave
>>>>>>> devices is irrelevant to the bonding device configuration. And it depends
>>>>>>> on the original configuration of the slave devices.
>>>>>>>
>>>>>>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>>>>>>> ---
>>>>>>>   drivers/net/bonding/rte_eth_bond_pmd.c | 13 +++++++++++++
>>>>>>>   1 file changed, 13 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
>>>>>>> index 84af348..9922657 100644
>>>>>>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
>>>>>>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
>>>>>>> @@ -1712,6 +1712,8 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
>>>>>>>   	struct rte_flow_error flow_error;
>>>>>>>
>>>>>>>   	struct bond_dev_private *internals = bonded_eth_dev->data->dev_private;
>>>>>>> +	uint64_t tx_offload_cap = internals->tx_offload_capa;
>>>>>>> +	uint64_t tx_offload;
>>>>>>>
>>>>>>>   	/* Stop slave */
>>>>>>>   	errval = rte_eth_dev_stop(slave_eth_dev->data->port_id);
>>>>>>> @@ -1759,6 +1761,17 @@ slave_configure(struct rte_eth_dev *bonded_eth_dev,
>>>>>>>   		slave_eth_dev->data->dev_conf.rxmode.offloads &=
>>>>>>>   				~DEV_RX_OFFLOAD_JUMBO_FRAME;
>>>>>>>
>>>>>>> +	while (tx_offload_cap != 0) {
>>>>>>> +		tx_offload = 1ULL << __builtin_ctzll(tx_offload_cap);
>>>>>>> +		if (bonded_eth_dev->data->dev_conf.txmode.offloads & tx_offload)
>>>>>>> +			slave_eth_dev->data->dev_conf.txmode.offloads |=
>>>>>>> +				tx_offload;
>>>>>>> +		else
>>>>>>> +			slave_eth_dev->data->dev_conf.txmode.offloads &=
>>>>>>> +				~tx_offload;
>>>>>>> +		tx_offload_cap &= ~tx_offload;
>>>>>>> +	}
>>>>>>> +
>>>>>>
>>>>>> Frankly speaking I don't understand why it is that complicated.
>>>>>> ethdev rejects of unsupported Tx offloads. So, can't we simply:
>>>>>> slave_eth_dev->data->dev_conf.txmode.offloads =
>>>>>>      bonded_eth_dev->data->dev_conf.txmode.offloads;
>>>>>>
>>>>>
>>>>> Using such a complicated method is to increase the flexibility of the slave devices,
>>>>> allowing the Tx offloading of the slave devices to be incompletely consistent with
>>>>> the bond device. If some offloading can be turned on without bond device awareness,
>>>>> they can be retained in this case.
>>>>
>>>>
>>>> Not sure how that can that happen...
>>>
>>> +1
>>>
>>> @Chengchang could you provide an example how it could happen.
>>>
>>
>> For example:
>> device 1 capability: VLAN_INSERT | MBUF_FAST_FREE
>> device 2 capability: VLAN_INSERT
>> And the capability of bonded device will be VLAN_INSERT.
>> So, we can only set VLAN_INSERT for the bonded device. So what if we want to enable
>> MBUF_FAST_FREE in device 1 to improve performance? For the application, as long as it
>> can guarantee the condition of MBUF ref_cnt = 1, then it can run normally if
>> MBUF_FAST_FREE is turned on.
>>
>> In my logic, if device 1 has been configured with MBUF_FAST_FREE and is then
>> added to the bonded device as a slave, MBUF_FAST_FREE will be preserved.
> 
> So your intention is to allow slave device silently overrule master tx_offload settings?
> If so, I don't think it is a good idea - sounds like potentially bogus and error prone approach.
+1
> Second thing - I still don't see how the code above can help you with it.
>  From what I read in your code - you clear tx_offload bits that are not supported by the master.
+1
>>
>>>>  From my understanding tx_offload for bond device has to be intersection of tx_offloads
>>>> of all slaves, no? Otherwise bond device might be misconfigured.
>>>> Anyway for that code snippet above, wouldn't the same be achieved by:
>>>> slave_eth_dev->data->dev_conf.txmode.offloads &= internals->tx_offload_capa & bonded_eth_dev->data->dev_conf.txmode.offloads;
>>>> ?
>>>
>>
>> I think it will not achieve my purpose in the scenario I mentioned above.
>>
>>> .
>>>
> 
* Re: [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding
  2021-06-10  7:32           ` Chengchang Tang
@ 2021-06-14 14:16             ` Andrew Rybchenko
  0 siblings, 0 replies; 61+ messages in thread
From: Andrew Rybchenko @ 2021-06-14 14:16 UTC (permalink / raw)
  To: Chengchang Tang, dev
  Cc: linuxarm, chas3, humin29, ferruh.yigit, konstantin.ananyev
On 6/10/21 10:32 AM, Chengchang Tang wrote:
> On 2021/6/9 17:35, Andrew Rybchenko wrote:
>> On 6/9/21 9:42 AM, Chengchang Tang wrote:
>>> Hi, Andrew and Ferruh
>>>
>>> On 2021/6/8 17:49, Andrew Rybchenko wrote:
>>>> "for bonding" is redundant in the summary since it is already "net/bonding".
>>>>
>>>> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>>>>> To use the HW offloads capability (e.g. checksum and TSO) in the Tx
>>>>> direction, the upper-layer users need to call rte_eth_tx_prepare to do
>>>>> some adjustment to the packets before sending them (e.g. processing
>>>>> pseudo headers when Tx checksum offload is enabled). But the tx_prepare
>>>>> callback of the bond driver is not implemented. Therefore, related
>>>>> offloads cannot be used unless the upper-layer users process the packet
>>>>> properly in their own application. But this is bad for
>>>>> portability.
>>>>>
>>>>> However, it is difficult to design the tx_prepare callback for bonding
>>>>> driver. Because when a bonded device sends packets, the bonded device
>>>>> allocates the packets to different slave devices based on the real-time
>>>>> link status and bonding mode. That is, it is very difficult for the
>>>>> bonding device to determine which slave device's prepare function should
>>>>> be invoked. In addition, if the link status changes after the packets are
>>>>> prepared, the packets may fail to be sent because packets allocation may
>>>>> change.
>>>>>
>>>>> So, in this patch, the tx_prepare callback of bonding driver is not
>>>>> implemented. Instead, rte_eth_tx_prepare() will be called for
>>>>> all the fast path packets in modes 0, 1, 2, 4, 5, 6. In this way, all
>>>>> tx_offloads can be processed correctly for all NIC devices in these modes.
>>>>> If tx_prepare is not required in some cases, then slave PMDs tx_prepare
>>>>> pointer should be NULL and rte_eth_tx_prepare() will be just a NOOP.
>>>>> In these cases, the impact on performance will be very limited. It is
>>>>> the responsibility of the slave PMDs to decide when the real tx_prepare
>>>>> needs to be used. The information from dev_config/queue_setup is
>>>>> sufficient for them to make these decisions.
>>>>>
>>>>> Note:
>>>>> The rte_eth_tx_prepare is not added to bond mode 3(Broadcast). This is
>>>>> because in broadcast mode, a packet needs to be sent by all slave ports.
>>>>> Different PMDs process the packets differently in tx_prepare. As a result,
>>>>> the sent packet may be incorrect.
>>>>>
>>>>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>>>>> ---
>>>>>   drivers/net/bonding/rte_eth_bond.h     |  1 -
>>>>>   drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++----
>>>>>   2 files changed, 24 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
>>>>> index 874aa91..1e6cc6d 100644
>>>>> --- a/drivers/net/bonding/rte_eth_bond.h
>>>>> +++ b/drivers/net/bonding/rte_eth_bond.h
>>>>> @@ -343,7 +343,6 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
>>>>>   int
>>>>>   rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
>>>>>
>>>>> -
>>>>>   #ifdef __cplusplus
>>>>>   }
>>>>>   #endif
>>>>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
>>>>> index 2e9cea5..84af348 100644
>>>>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
>>>>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
>>>>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>>>>>   	/* Send packet burst on each slave device */
>>>>>   	for (i = 0; i < num_of_slaves; i++) {
>>>>>   		if (slave_nb_pkts[i] > 0) {
>>>>> +			int nb_prep_pkts;
>>>>> +
>>>>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
>>>>> +					bd_tx_q->queue_id, slave_bufs[i],
>>>>> +					slave_nb_pkts[i]);
>>>>> +
>>>>
>>>> Shouldn't it be called iff queue Tx offloads are not zero?
>>>> It will allow to decrease performance degradation if no
>>>> Tx offloads are enabled. Same in all cases below.
>>>
>>> Regarding this point, it has been discussed in the previous RFC:
>>> https://inbox.dpdk.org/dev/47f907cf-3933-1de9-9c45-6734b912eccd@huawei.com/
>>>
>>> According to the TX_OFFLOAD status of the current device, PMDs can determine
>>> whether tx_prepare is currently needed. If it is not needed, set pkt_tx_prepare
>>> to NULL, so that the actual tx_prepare processing will be skipped directly in
>>> rte_eth_tx_prepare().
>>
>> I still think that the following is right:
>> No Tx offloads at all => Tx prepare is not necessary
>>
>> Am I wrong?
>>
> 
> Letting PMDs determine whether tx_prepare() needs to be done could reduce the performance
> loss in more scenarios. For example, some offloads do not need Tx prepare, and PMDs
> could set tx_prepare to NULL in this scenario. Even if rte_eth_tx_prepare() is called,
> it will not invoke the tx_prepare callback and will simply return. In this case, there is
> only one conditional check. If we also check whether tx_offloads is non-zero, one more
> conditional check is added.
I'll wait for the net/bonding maintainers' decision here.
IMHO all the above assumptions should be proven by performance measurements.
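To pin down what would be compared in such a measurement, here is a small mock of the two guards being debated. The types and names (struct mock_txq, prepare_null_guard(), prepare_offload_guard()) are invented for this sketch and are not ethdev or bonding code; the NULL-callback shortcut mirrors the behaviour of rte_eth_tx_prepare() as described earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t (*prep_fn)(void **pkts, uint16_t nb_pkts);

/* Hypothetical per-queue state; not a real DPDK structure. */
struct mock_txq {
	prep_fn tx_prepare;	/* NULL when the PMD needs no preparation */
	uint64_t offloads;	/* Tx offloads enabled on the queue */
};

static unsigned int prepare_calls;	/* how often the callback ran */

/* Mock PMD callback that accepts every packet. */
static uint16_t
mock_prepare(void **pkts, uint16_t nb_pkts)
{
	(void)pkts;
	prepare_calls++;
	return nb_pkts;
}

/* Variant 1: rely only on the PMD leaving the callback NULL. */
static uint16_t
prepare_null_guard(const struct mock_txq *q, void **pkts, uint16_t nb_pkts)
{
	assert(q != NULL);
	if (q->tx_prepare == NULL)
		return nb_pkts;
	return q->tx_prepare(pkts, nb_pkts);
}

/* Variant 2: additionally skip prepare when no Tx offloads are enabled. */
static uint16_t
prepare_offload_guard(const struct mock_txq *q, void **pkts, uint16_t nb_pkts)
{
	assert(q != NULL);
	if (q->offloads == 0)
		return nb_pkts;
	return prepare_null_guard(q, pkts, nb_pkts);
}
```

For a PMD that sets the callback to NULL when no offloads are enabled, both variants skip the call and variant 2 only adds a branch. For a PMD that always installs the callback, variant 1 still pays for the call with zero offloads while variant 2 does not. Which effect dominates is what the measurements would have to show.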
> Of course, some PMDs currently do not optimize tx_prepare, which may have a performance
> impact. We hope to push them to optimize tx_prepare in this way, just like they optimize
> tx_burst. This makes it easier for users to use tx_prepare(), as they no longer need to
> consider whether using tx_prepare() will introduce unnecessary performance degradation.
> 
> IMHO tx_prepare() should be usable in all scenarios, and the impact on
> performance should be minimized by PMDs. If the application has to consider when it should
> be used and when it should not, in many cases it will not be used, which then introduces
> problems.
> 
>>>>
>>>>>   			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>> -					slave_bufs[i], slave_nb_pkts[i]);
>>>>> +					slave_bufs[i], nb_prep_pkts);
>>>>
>>>> In fact it is a problem here and really big problems.
>>>> Tx prepare may fail and return less packets. Tx prepare
>>>> of some packet may always fail. If application tries to
>>>> send packets in a loop until success, it will be a
>>>> forever loop here. Since application calls Tx burst,
>>>> it is 100% legal behaviour of the function to return 0
>>>> if Tx ring is full. It is not an error indication.
>>>> However, in the case of Tx prepare it is an error
>>>> indication.
>>>>
>>>> Should we change Tx burst description and enforce callers
>>>> to check for rte_errno? It sounds like a major change...
>>>>
>>>
>>> I agree that if the failure is caused by Tx ring full, it is a legal behaviour.
>>> But what about the failure caused by other reasons? At present, it is possible
>>> for some PMDs to fail during tx_burst due to other reasons. In this case,
>>> repeated tries to send will also fail.
>>
>> If so, packet should be simply dropped by Tx burst and Tx burst
>> should move on. If a packet cannot be transmitted, it must be
>> dropped (counted) and Tx burst should move to the next packet.
>>
>>> I'm not sure if all PMDs need to support the behavior of sending packets in a
>>> loop until it succeeds. If not, I think the current problem can be reminded to
>>> the user by adding a description to the bonding. If it is necessary, I think the
>>> description of tx_burst should also add related instructions, so that the developers
>>> of PMDs can better understand how tx_burst should be designed, such as putting all
>>> hardware-related constraint checks into tx_prepare. And another prerequisite for
>>> the above behavior is that the packets must be prepared (i.e. checked by
>>> rte_eth_tx_prepare()). Otherwise, it may also fail to send. This means that we have
>>> to use rte_eth_tx_prepare() in more scenarios.
>>
>> IMHO any PMD specific behaviour is a nightmare to application
>> developer and must be avoided. Ideally application should not
>> care if it is running on top of tap, virtio, failsafe or
>> bonding. It should talk to ethdev in terms of ethdev API that's
>> it. I know that net/bonding is designed that application should
>> know about it, but IMHO the places where it requires the
>> knowledge must be minimized to make applications more portable
>> across various PMDs/HW.
>>
>> I think that the only sensible solution for above problem is
>> to skip a packet which prepare dislikes. count it as dropped
>> and try to prepare/transmit subsequent packets.
> 
> Agree, I will fix this in the next version.
> 
>>
>> It is an interesting effect of the Tx prepare just before
>> Tx burst inside bonding PMD. If Tx burst fails to send
>> something because ring is full, a number of packets will
>> be processed by Tx prepare again and again. I guess it is
>> unavoidable.
>>
>>> What's Ferruh's opinion on this?
>>>
>>>> [snip]
* Re: [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding
  2021-06-08  9:49     ` Andrew Rybchenko
  2021-06-09  6:42       ` Chengchang Tang
@ 2022-05-24 12:11       ` Min Hu (Connor)
  1 sibling, 0 replies; 61+ messages in thread
From: Min Hu (Connor) @ 2022-05-24 12:11 UTC (permalink / raw)
  To: Andrew Rybchenko, Chengchang Tang, dev
  Cc: linuxarm, chas3, ferruh.yigit, konstantin.ananyev
Hi, Andrew,
On 2021/6/8 17:49, Andrew Rybchenko wrote:
> "for bonding" is redundant in the summary since it is already "net/bonding".
> 
> On 4/23/21 12:46 PM, Chengchang Tang wrote:
>> To use the HW offloads capability (e.g. checksum and TSO) in the Tx
>> direction, the upper-layer users need to call rte_eth_tx_prepare to do
>> some adjustment to the packets before sending them (e.g. processing
>> pseudo headers when Tx checksum offload is enabled). But the tx_prepare
>> callback of the bond driver is not implemented. Therefore, related
>> offloads cannot be used unless the upper-layer users process the packet
>> properly in their own application. But this is bad for
>> portability.
>>
>> However, it is difficult to design the tx_prepare callback for bonding
>> driver. Because when a bonded device sends packets, the bonded device
>> allocates the packets to different slave devices based on the real-time
>> link status and bonding mode. That is, it is very difficult for the
>> bonding device to determine which slave device's prepare function should
>> be invoked. In addition, if the link status changes after the packets are
>> prepared, the packets may fail to be sent because packets allocation may
>> change.
>>
>> So, in this patch, the tx_prepare callback of bonding driver is not
>> implemented. Instead, rte_eth_tx_prepare() will be called for
>> all the fast path packets in modes 0, 1, 2, 4, 5, 6. In this way, all
>> tx_offloads can be processed correctly for all NIC devices in these modes.
>> If tx_prepare is not required in some cases, then slave PMDs tx_prepare
>> pointer should be NULL and rte_eth_tx_prepare() will be just a NOOP.
>> In these cases, the impact on performance will be very limited. It is
>> the responsibility of the slave PMDs to decide when the real tx_prepare
>> needs to be used. The information from dev_config/queue_setup is
>> sufficient for them to make these decisions.
>>
>> Note:
>> The rte_eth_tx_prepare is not added to bond mode 3(Broadcast). This is
>> because in broadcast mode, a packet needs to be sent by all slave ports.
>> Different PMDs process the packets differently in tx_prepare. As a result,
>> the sent packet may be incorrect.
>>
>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>> ---
>>   drivers/net/bonding/rte_eth_bond.h     |  1 -
>>   drivers/net/bonding/rte_eth_bond_pmd.c | 28 ++++++++++++++++++++++++----
>>   2 files changed, 24 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
>> index 874aa91..1e6cc6d 100644
>> --- a/drivers/net/bonding/rte_eth_bond.h
>> +++ b/drivers/net/bonding/rte_eth_bond.h
>> @@ -343,7 +343,6 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
>>   int
>>   rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
>>
>> -
>>   #ifdef __cplusplus
>>   }
>>   #endif
>> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
>> index 2e9cea5..84af348 100644
>> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
>> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
>> @@ -606,8 +606,14 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>>   	/* Send packet burst on each slave device */
>>   	for (i = 0; i < num_of_slaves; i++) {
>>   		if (slave_nb_pkts[i] > 0) {
>> +			int nb_prep_pkts;
>> +
>> +			nb_prep_pkts = rte_eth_tx_prepare(slaves[i],
>> +					bd_tx_q->queue_id, slave_bufs[i],
>> +					slave_nb_pkts[i]);
>> +
> 
> Shouldn't it be called iff queue Tx offloads are not zero?
> It will allow to decrease performance degradation if no
> Tx offloads are enabled. Same in all cases below.
> 
>>   			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>> -					slave_bufs[i], slave_nb_pkts[i]);
>> +					slave_bufs[i], nb_prep_pkts);
> 
> In fact there is a problem here, and a really big one.
> Tx prepare may fail and return fewer packets. Tx prepare
> of some packet may always fail. If application tries to
> send packets in a loop until success, it will be a
> forever loop here. Since application calls Tx burst,
> it is 100% legal behaviour of the function to return 0
> if Tx ring is full. It is not an error indication.
> However, in the case of Tx prepare it is an error
> indication.
Regardless of this patch, I think the problem already exists in
'rte_eth_tx_burst'.
For example, suppose there is one 'bad' rte_mbuf in tx_pkts[].
Take the 'fm10k' net driver as an example: in 'fm10k_xmit_pkts',
when one rte_mbuf is 'bad' (mb->nb_segs == 0), the loop breaks
and returns the number of packets (< nb_pkts) that were successfully
transmitted.
The code looks like this:
"
uint16_t
fm10k_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
	uint16_t nb_pkts)
{
	....
	for (count = 0; count < nb_pkts; ++count) {
		/* sanity check to make sure the mbuf is valid */
		if ((mb->nb_segs == 0) ||
		    ((mb->nb_segs > 1) && (mb->next == NULL)))
			break;
	}
	return count;
}
"
So, if the application sends packets in a loop until success, it will
also loop forever here.
One solution is to handle this in the application itself, as testpmd
does: it bounds the number of retries and adds a delay between them:
"
	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_rx);
	/*
	 * Retry if necessary
	 */
	if (unlikely(nb_tx < nb_rx) && fs->retry_enabled) {
		retry = 0;
		while (nb_tx < nb_rx && retry++ < burst_tx_retry_num) {
			rte_delay_us(burst_tx_delay_time);
			nb_tx += rte_eth_tx_burst(fs->tx_port, fs->tx_queue,
					&pkts_burst[nb_tx], nb_rx - nb_tx);
		}
	}
"
What I mean is that this patch does not introduce new 'bugs' to
'rte_eth_tx_burst'. Also, the known problem in the retry situation can
be fixed in the application.
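The difference between an unbounded and a bounded retry can be shown with a
minimal standalone sketch. This is not DPDK code: mock_tx_burst() is a
hypothetical stand-in for a driver burst function that stops at the first
'bad' packet (modelled here as the value 0), and send_with_bounded_retry()
is an illustrative name, not an existing API. Packets are plain ints so the
example compiles without any library.

```c
#include <stdint.h>

/* Counts how many times the mock burst function was invoked. */
static unsigned int mock_calls;

/*
 * Hypothetical stand-in for a driver Tx burst: accepts packets until it
 * meets a 'bad' one (value 0), mirroring the early break shown in
 * fm10k_xmit_pkts above. Returns the number of packets accepted.
 */
static uint16_t
mock_tx_burst(const int *pkts, uint16_t nb_pkts)
{
	uint16_t count;

	mock_calls++;
	for (count = 0; count < nb_pkts; ++count) {
		if (pkts[count] == 0)
			break;
	}
	return count;
}

/*
 * Send a burst and retry the unsent tail at most max_retry times.
 * A 'bad' packet can never be sent, so without the bound this loop
 * would spin forever; with it, the caller gets back a short count.
 */
static uint16_t
send_with_bounded_retry(const int *pkts, uint16_t nb_pkts, uint32_t max_retry)
{
	uint16_t nb_tx = mock_tx_burst(pkts, nb_pkts);
	uint32_t retry = 0;

	while (nb_tx < nb_pkts && retry++ < max_retry)
		nb_tx += mock_tx_burst(&pkts[nb_tx],
				       (uint16_t)(nb_pkts - nb_tx));

	return nb_tx;
}
```

With a burst of {1, 1, 0, 1} and max_retry = 3, the function gives up after
four calls and reports two packets sent; the caller then decides what to do
with the rest (for example, drop and count them).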
> 
> Should we change Tx burst description and enforce callers
> to check for rte_errno? It sounds like a major change...
> 
> [snip]
> .
> 
^ permalink raw reply	[flat|nested] 61+ messages in thread
* [PATCH v2 0/3] add Tx prepare support for bonding driver
  2021-04-23  9:46   ` [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding Chengchang Tang
  2021-06-08  9:49     ` Andrew Rybchenko
@ 2022-07-25  4:08     ` Chengwen Feng
  2022-07-25  4:08       ` [PATCH v2 1/3] net/bonding: support Tx prepare Chengwen Feng
                         ` (4 more replies)
  2022-09-17  4:15     ` [PATCH v3 " Chengwen Feng
                       ` (2 subsequent siblings)
  4 siblings, 5 replies; 61+ messages in thread
From: Chengwen Feng @ 2022-07-25  4:08 UTC (permalink / raw)
  To: thomas, ferruh.yigit
  Cc: dev, chas3, humin29, andrew.rybchenko, konstantin.ananyev
This patchset adds Tx prepare for bonding driver.
Chengchang Tang (1):
  net/bonding: add testpmd cmd for Tx prepare
Chengwen Feng (2):
  net/bonding: support Tx prepare
  net/bonding: support Tx prepare fail stats
 .../link_bonding_poll_mode_drv_lib.rst        | 14 ++++
 doc/guides/rel_notes/release_22_11.rst        |  6 ++
 drivers/net/bonding/bonding_testpmd.c         | 73 ++++++++++++++++-
 drivers/net/bonding/eth_bond_private.h        |  8 ++
 drivers/net/bonding/rte_eth_bond.h            | 24 ++++++
 drivers/net/bonding/rte_eth_bond_api.c        | 32 ++++++++
 drivers/net/bonding/rte_eth_bond_pmd.c        | 81 +++++++++++++++++--
 drivers/net/bonding/version.map               |  5 ++
 8 files changed, 234 insertions(+), 9 deletions(-)
-- 
2.33.0
^ permalink raw reply	[flat|nested] 61+ messages in thread
* [PATCH v2 1/3] net/bonding: support Tx prepare
  2022-07-25  4:08     ` [PATCH v2 0/3] add Tx prepare support for bonding driver Chengwen Feng
@ 2022-07-25  4:08       ` Chengwen Feng
  2022-09-13 10:22         ` Ferruh Yigit
  2022-07-25  4:08       ` [PATCH v2 2/3] net/bonding: support Tx prepare fail stats Chengwen Feng
                         ` (3 subsequent siblings)
  4 siblings, 1 reply; 61+ messages in thread
From: Chengwen Feng @ 2022-07-25  4:08 UTC (permalink / raw)
  To: thomas, ferruh.yigit
  Cc: dev, chas3, humin29, andrew.rybchenko, konstantin.ananyev
Normally, to use HW offload capabilities (e.g. checksum and TSO) in
the Tx direction, the application needs to call rte_eth_tx_prepare() to
adjust the packets before sending them (e.g. processing pseudo headers
when Tx checksum offload is enabled). But the tx_prepare callback of the
bonding driver is not implemented. Therefore, the sent packets may have
errors (e.g. checksum errors).
However, it is difficult to design the tx_prepare callback for the
bonding driver, because when a bonded device sends packets, it
distributes the packets to different slave devices based on the real-time
link status and bonding mode. That is, it is very difficult for the
bonded device to determine which slave device's prepare function should
be invoked.
So, in this patch, the tx_prepare callback of the bonding driver is not
implemented. Instead, rte_eth_tx_prepare() will be called for
all the fast path packets in mode 0, 1, 2, 4, 5, 6 (mode 3 is not
included, see [1]). In this way, all tx_offloads can be processed
correctly for all NIC devices in these modes.
As previously discussed (see V1), if tx_prepare fails, the bonding
driver will free the corresponding packets internally, and only the
packets that pass tx_prepare are transmitted.
To minimize performance impact, this patch adds one new
'tx_prepare_enabled' field, and corresponding control and get API:
rte_eth_bond_tx_prepare_set() and rte_eth_bond_tx_prepare_get().
[1]: In bond mode 3 (broadcast), a packet needs to be sent by all slave
ports. Different slave PMDs process the packets differently in
tx_prepare. If tx_prepare were called before sending on each slave port,
the sent packet might be incorrect.
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 .../link_bonding_poll_mode_drv_lib.rst        |  5 ++
 doc/guides/rel_notes/release_22_11.rst        |  6 ++
 drivers/net/bonding/eth_bond_private.h        |  1 +
 drivers/net/bonding/rte_eth_bond.h            | 24 +++++++
 drivers/net/bonding/rte_eth_bond_api.c        | 32 +++++++++
 drivers/net/bonding/rte_eth_bond_pmd.c        | 65 ++++++++++++++++---
 drivers/net/bonding/version.map               |  5 ++
 7 files changed, 130 insertions(+), 8 deletions(-)
diff --git a/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst b/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
index 9510368103..a3d91b2091 100644
--- a/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
+++ b/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
@@ -359,6 +359,11 @@ The link status of a bonded device is dictated by that of its slaves, if all
 slave device link status are down or if all slaves are removed from the link
 bonding device then the link status of the bonding device will go down.
 
+Unlike normal PMD drivers, the Tx prepare for the bonding driver is controlled
+by ``rte_eth_bond_tx_prepare_set`` (all bond modes except mode 3 (broadcast)
+are supported). The ``rte_eth_bond_tx_prepare_get`` for querying the enabling
+status is provided.
+
 It is also possible to configure / query the configuration of the control
 parameters of a bonded device using the provided APIs
 ``rte_eth_bond_mode_set/ get``, ``rte_eth_bond_primary_set/get``,
diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
index 8c021cf050..6e28a6c0af 100644
--- a/doc/guides/rel_notes/release_22_11.rst
+++ b/doc/guides/rel_notes/release_22_11.rst
@@ -55,6 +55,12 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added Tx prepare for bonding driver.**
+
+  * Added ``rte_eth_bond_tx_prepare_set`` to set whether enable Tx prepare for bonded port.
+    All bond modes except mode 3 (broadcast) are supported.
+  * Added ``rte_eth_bond_tx_prepare_get`` to get whether Tx prepare enabled for bonded port.
+
 
 Removed Items
 -------------
diff --git a/drivers/net/bonding/eth_bond_private.h b/drivers/net/bonding/eth_bond_private.h
index 8222e3cd38..9996f6673c 100644
--- a/drivers/net/bonding/eth_bond_private.h
+++ b/drivers/net/bonding/eth_bond_private.h
@@ -117,6 +117,7 @@ struct bond_dev_private {
 	uint16_t user_defined_primary_port;
 	/**< Flag for whether primary port is user defined or not */
 
+	uint8_t tx_prepare_enabled;
 	uint8_t balance_xmit_policy;
 	/**< Transmit policy - l2 / l23 / l34 for operation in balance mode */
 	burst_xmit_hash_t burst_xmit_hash;
diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
index 874aa91a5f..deae2dd9ad 100644
--- a/drivers/net/bonding/rte_eth_bond.h
+++ b/drivers/net/bonding/rte_eth_bond.h
@@ -343,6 +343,30 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
 int
 rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
 
+/**
+ * Set whether enable Tx-prepare (rte_eth_tx_prepare) for bonded port
+ *
+ * @param bonded_port_id      Bonded device id
+ * @param en                  Enable flag
+ *
+ * @return
+ *   0 on success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_eth_bond_tx_prepare_set(uint16_t bonded_port_id, bool en);
+
+/**
+ * Get whether Tx-prepare (rte_eth_tx_prepare) is enabled for bonded port
+ *
+ * @param bonded_port_id      Bonded device id
+ *
+ * @return
+ *   0-disabled, 1-enabled, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_eth_bond_tx_prepare_get(uint16_t bonded_port_id);
 
 #ifdef __cplusplus
 }
diff --git a/drivers/net/bonding/rte_eth_bond_api.c b/drivers/net/bonding/rte_eth_bond_api.c
index 4ac191c468..47841289f4 100644
--- a/drivers/net/bonding/rte_eth_bond_api.c
+++ b/drivers/net/bonding/rte_eth_bond_api.c
@@ -1070,3 +1070,35 @@ rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id)
 
 	return internals->link_up_delay_ms;
 }
+
+int
+rte_eth_bond_tx_prepare_set(uint16_t bonded_port_id, bool en)
+{
+	struct bond_dev_private *internals;
+
+	if (valid_bonded_port_id(bonded_port_id) != 0)
+		return -1;
+
+	internals = rte_eth_devices[bonded_port_id].data->dev_private;
+	if (internals->mode == BONDING_MODE_BROADCAST) {
+		RTE_BOND_LOG(ERR, "Mode broadcast don't support to configure Tx-prepare");
+		return -ENOTSUP;
+	}
+
+	internals->tx_prepare_enabled = en ? 1 : 0;
+
+	return 0;
+}
+
+int
+rte_eth_bond_tx_prepare_get(uint16_t bonded_port_id)
+{
+	struct bond_dev_private *internals;
+
+	if (valid_bonded_port_id(bonded_port_id) != 0)
+		return -1;
+
+	internals = rte_eth_devices[bonded_port_id].data->dev_private;
+
+	return internals->tx_prepare_enabled;
+}
diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index 73e6972035..c32c7e6c6c 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -559,6 +559,56 @@ bond_ethdev_rx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	return nb_recv_pkts;
 }
 
+static inline uint16_t
+bond_ethdev_tx_wrap(struct bond_tx_queue *bd_tx_q, uint16_t slave_port_id,
+		    struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	struct bond_dev_private *internals = bd_tx_q->dev_private;
+	uint16_t queue_id = bd_tx_q->queue_id;
+	struct rte_mbuf *fail_pkts[nb_pkts];
+	uint8_t fail_mark[nb_pkts];
+	uint16_t nb_pre, index;
+	uint16_t fail_cnt = 0;
+	int i;
+
+	if (!internals->tx_prepare_enabled)
+		goto tx_burst;
+
+	nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id, tx_pkts, nb_pkts);
+	if (nb_pre == nb_pkts)
+		goto tx_burst;
+
+	fail_pkts[fail_cnt++] = tx_pkts[nb_pre];
+	memset(fail_mark, 0, sizeof(fail_mark));
+	fail_mark[nb_pre] = 1;
+	for (i = nb_pre + 1; i < nb_pkts; /* update in inner loop */) {
+		nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id,
+					    tx_pkts + i, nb_pkts - i);
+		if (nb_pre == nb_pkts - i)
+			break;
+		fail_pkts[fail_cnt++] = tx_pkts[i + nb_pre];
+		fail_mark[i + nb_pre] = 1;
+		i += nb_pre + 1;
+	}
+
+	/* move tx-prepare OK mbufs to the end */
+	for (i = index = nb_pkts - 1; i >= 0; i--) {
+		if (!fail_mark[i])
+			tx_pkts[index--] = tx_pkts[i];
+	}
+	/* move tx-prepare fail mbufs to the begin, and free them */
+	for (i = 0; i < fail_cnt; i++) {
+		tx_pkts[i] = fail_pkts[i];
+		rte_pktmbuf_free(fail_pkts[i]);
+	}
+
+	if (fail_cnt == nb_pkts)
+		return nb_pkts;
+tx_burst:
+	return fail_cnt + rte_eth_tx_burst(slave_port_id, queue_id,
+				tx_pkts + fail_cnt, nb_pkts - fail_cnt);
+}
+
 static uint16_t
 bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
 		uint16_t nb_pkts)
@@ -602,7 +652,7 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
 	/* Send packet burst on each slave device */
 	for (i = 0; i < num_of_slaves; i++) {
 		if (slave_nb_pkts[i] > 0) {
-			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+			num_tx_slave = bond_ethdev_tx_wrap(bd_tx_q, slaves[i],
 					slave_bufs[i], slave_nb_pkts[i]);
 
 			/* if tx burst fails move packets to end of bufs */
@@ -635,8 +685,8 @@ bond_ethdev_tx_burst_active_backup(void *queue,
 	if (internals->active_slave_count < 1)
 		return 0;
 
-	return rte_eth_tx_burst(internals->current_primary_port, bd_tx_q->queue_id,
-			bufs, nb_pkts);
+	return bond_ethdev_tx_wrap(bd_tx_q, internals->current_primary_port,
+				bufs, nb_pkts);
 }
 
 static inline uint16_t
@@ -951,8 +1001,8 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 #endif
 		}
 
-		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
-				bufs + num_tx_total, nb_pkts - num_tx_total);
+		num_tx_total += bond_ethdev_tx_wrap(bd_tx_q, slaves[i],
+					bufs + num_tx_total, nb_pkts - num_tx_total);
 
 		if (num_tx_total == nb_pkts)
 			break;
@@ -1158,9 +1208,8 @@ tx_burst_balance(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
 		if (slave_nb_bufs[i] == 0)
 			continue;
 
-		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
-				bd_tx_q->queue_id, slave_bufs[i],
-				slave_nb_bufs[i]);
+		slave_tx_count = bond_ethdev_tx_wrap(bd_tx_q,
+			slave_port_ids[i], slave_bufs[i], slave_nb_bufs[i]);
 
 		total_tx_count += slave_tx_count;
 
diff --git a/drivers/net/bonding/version.map b/drivers/net/bonding/version.map
index 9333923b4e..2c121f2559 100644
--- a/drivers/net/bonding/version.map
+++ b/drivers/net/bonding/version.map
@@ -31,3 +31,8 @@ DPDK_23 {
 
 	local: *;
 };
+
+EXPERIMENTAL {
+	rte_eth_bond_tx_prepare_get;
+	rte_eth_bond_tx_prepare_set;
+};
\ No newline at end of file
-- 
2.33.0
^ permalink raw reply	[flat|nested] 61+ messages in thread
* [PATCH v2 2/3] net/bonding: support Tx prepare fail stats
  2022-07-25  4:08     ` [PATCH v2 0/3] add Tx prepare support for bonding driver Chengwen Feng
  2022-07-25  4:08       ` [PATCH v2 1/3] net/bonding: support Tx prepare Chengwen Feng
@ 2022-07-25  4:08       ` Chengwen Feng
  2022-07-25  4:08       ` [PATCH v2 3/3] net/bonding: add testpmd cmd for Tx prepare Chengwen Feng
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 61+ messages in thread
From: Chengwen Feng @ 2022-07-25  4:08 UTC (permalink / raw)
  To: thomas, ferruh.yigit
  Cc: dev, chas3, humin29, andrew.rybchenko, konstantin.ananyev
If Tx prepare fails, the bonding driver will free the corresponding
packets internally, and only the packets that pass Tx prepare are
transmitted.
In this patch, the number of Tx prepare failures is counted, and the
result is added to the 'oerrors' field of 'struct rte_eth_stats'.
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 drivers/net/bonding/eth_bond_private.h |  7 +++++++
 drivers/net/bonding/rte_eth_bond_pmd.c | 16 ++++++++++++++++
 2 files changed, 23 insertions(+)
diff --git a/drivers/net/bonding/eth_bond_private.h b/drivers/net/bonding/eth_bond_private.h
index 9996f6673c..aa33fa0043 100644
--- a/drivers/net/bonding/eth_bond_private.h
+++ b/drivers/net/bonding/eth_bond_private.h
@@ -72,6 +72,13 @@ struct bond_tx_queue {
 	/**< Number of TX descriptors available for the queue */
 	struct rte_eth_txconf tx_conf;
 	/**< Copy of TX configuration structure for queue */
+
+	/*
+	 * The following fields are statistical value, and maybe update
+	 * at runtime, so start with one new cache line.
+	 */
+	uint64_t prepare_fails __rte_cache_aligned;
+	/**< Tx prepare fail cnt */
 };
 
 /** Bonded slave devices structure */
diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index c32c7e6c6c..84fd0e5a73 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -602,6 +602,7 @@ bond_ethdev_tx_wrap(struct bond_tx_queue *bd_tx_q, uint16_t slave_port_id,
 		rte_pktmbuf_free(fail_pkts[i]);
 	}
 
+	bd_tx_q->prepare_fails += fail_cnt;
 	if (fail_cnt == nb_pkts)
 		return nb_pkts;
 tx_burst:
@@ -2399,6 +2400,8 @@ bond_ethdev_tx_queue_setup(struct rte_eth_dev *dev, uint16_t tx_queue_id,
 	bd_tx_q->nb_tx_desc = nb_tx_desc;
 	memcpy(&(bd_tx_q->tx_conf), tx_conf, sizeof(bd_tx_q->tx_conf));
 
+	bd_tx_q->prepare_fails = 0;
+
 	dev->data->tx_queues[tx_queue_id] = bd_tx_q;
 
 	return 0;
@@ -2609,6 +2612,7 @@ bond_ethdev_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
 {
 	struct bond_dev_private *internals = dev->data->dev_private;
 	struct rte_eth_stats slave_stats;
+	struct bond_tx_queue *bd_tx_q;
 	int i, j;
 
 	for (i = 0; i < internals->slave_count; i++) {
@@ -2630,7 +2634,12 @@ bond_ethdev_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
 			stats->q_obytes[j] += slave_stats.q_obytes[j];
 			stats->q_errors[j] += slave_stats.q_errors[j];
 		}
+	}
 
+	for (i = 0; i < dev->data->nb_tx_queues; i++) {
+		bd_tx_q = (struct bond_tx_queue *)dev->data->tx_queues[i];
+		if (bd_tx_q)
+			stats->oerrors += bd_tx_q->prepare_fails;
 	}
 
 	return 0;
@@ -2640,6 +2649,7 @@ static int
 bond_ethdev_stats_reset(struct rte_eth_dev *dev)
 {
 	struct bond_dev_private *internals = dev->data->dev_private;
+	struct bond_tx_queue *bd_tx_q;
 	int i;
 	int err;
 	int ret;
@@ -2650,6 +2660,12 @@ bond_ethdev_stats_reset(struct rte_eth_dev *dev)
 			err = ret;
 	}
 
+	for (i = 0; i < dev->data->nb_tx_queues; i++) {
+		bd_tx_q = (struct bond_tx_queue *)dev->data->tx_queues[i];
+		if (bd_tx_q)
+			bd_tx_q->prepare_fails = 0;
+	}
+
 	return err;
 }
 
-- 
2.33.0
^ permalink raw reply	[flat|nested] 61+ messages in thread
* [PATCH v2 3/3] net/bonding: add testpmd cmd for Tx prepare
  2022-07-25  4:08     ` [PATCH v2 0/3] add Tx prepare support for bonding driver Chengwen Feng
  2022-07-25  4:08       ` [PATCH v2 1/3] net/bonding: support Tx prepare Chengwen Feng
  2022-07-25  4:08       ` [PATCH v2 2/3] net/bonding: support Tx prepare fail stats Chengwen Feng
@ 2022-07-25  4:08       ` Chengwen Feng
  2022-07-25  7:04       ` [PATCH v2 0/3] add Tx prepare support for bonding driver humin (Q)
  2022-09-13  1:41       ` fengchengwen
  4 siblings, 0 replies; 61+ messages in thread
From: Chengwen Feng @ 2022-07-25  4:08 UTC (permalink / raw)
  To: thomas, ferruh.yigit
  Cc: dev, chas3, humin29, andrew.rybchenko, konstantin.ananyev
From: Chengchang Tang <tangchengchang@huawei.com>
Add a new command to enable/disable Tx prepare for bonded
devices. This helps to test some Tx HW offloads (e.g. checksum and TSO)
for bonded devices in testpmd. The command is:
set bonding tx_prepare <port_id> (enable|disable)
This patch also supports displaying the Tx prepare enabling status in the
'show bonding config <port_id>' command.
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
---
 .../link_bonding_poll_mode_drv_lib.rst        |  9 +++
 drivers/net/bonding/bonding_testpmd.c         | 73 ++++++++++++++++++-
 2 files changed, 81 insertions(+), 1 deletion(-)
diff --git a/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst b/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
index a3d91b2091..428c7d67c7 100644
--- a/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
+++ b/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
@@ -623,6 +623,15 @@ Enable one of the specific aggregators mode when in mode 4 (link-aggregation-802
    testpmd> set bonding agg_mode (port_id) (bandwidth|count|stable)
 
 
+set bonding tx_prepare
+~~~~~~~~~~~~~~~~~~~~~~
+
+Enable Tx prepare on bonding devices to help the slave devices prepare the packets for
+some HW offloading (e.g. checksum and TSO)::
+
+   testpmd> set bonding tx_prepare (port_id) (enable|disable)
+
+
 show bonding config
 ~~~~~~~~~~~~~~~~~~~
 
diff --git a/drivers/net/bonding/bonding_testpmd.c b/drivers/net/bonding/bonding_testpmd.c
index 3941f4cf23..da3fe03f7e 100644
--- a/drivers/net/bonding/bonding_testpmd.c
+++ b/drivers/net/bonding/bonding_testpmd.c
@@ -413,7 +413,7 @@ static void cmd_show_bonding_config_parsed(void *parsed_result,
 	__rte_unused struct cmdline *cl, __rte_unused void *data)
 {
 	struct cmd_show_bonding_config_result *res = parsed_result;
-	int bonding_mode, agg_mode;
+	int bonding_mode, agg_mode, tx_prepare_flag;
 	portid_t slaves[RTE_MAX_ETHPORTS];
 	int num_slaves, num_active_slaves;
 	int primary_id;
@@ -429,6 +429,10 @@ static void cmd_show_bonding_config_parsed(void *parsed_result,
 	}
 	printf("\tBonding mode: %d\n", bonding_mode);
 
+	/* Display the Tx-prepare flag. */
+	tx_prepare_flag = rte_eth_bond_tx_prepare_get(port_id);
+	printf("\tTx-prepare state: %s\n", tx_prepare_flag == 1 ? "on" : "off");
+
 	if (bonding_mode == BONDING_MODE_BALANCE ||
 		bonding_mode == BONDING_MODE_8023AD) {
 		int balance_xmit_policy;
@@ -962,6 +966,68 @@ static cmdline_parse_inst_t cmd_set_bonding_agg_mode_policy = {
 	}
 };
 
+struct cmd_set_bonding_tx_prepare_result {
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t bonding;
+	cmdline_fixed_string_t tx_prepare;
+	portid_t port_id;
+	cmdline_fixed_string_t mode;
+};
+
+static void
+cmd_set_bonding_tx_prepare_parsed(void *parsed_result,
+		__rte_unused  struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_set_bonding_tx_prepare_result *res = parsed_result;
+	portid_t port_id = res->port_id;
+
+	if (!strcmp(res->mode, "enable")) {
+		if (rte_eth_bond_tx_prepare_set(port_id, true) == 0)
+			printf("Tx prepare for bonding device enabled\n");
+		else
+			printf("Enabling bonding device Tx prepare "
+					"on port %d failed\n", port_id);
+	} else if (!strcmp(res->mode, "disable")) {
+		if (rte_eth_bond_tx_prepare_set(port_id, false) == 0)
+			printf("Tx prepare for bonding device disabled\n");
+		else
+			printf("Disabling bonding device Tx prepare "
+					"on port %d failed\n", port_id);
+	}
+}
+
+static cmdline_parse_token_string_t cmd_setbonding_tx_prepare_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+			set, "set");
+static cmdline_parse_token_string_t cmd_setbonding_tx_prepare_bonding =
+	TOKEN_STRING_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+			bonding, "bonding");
+static cmdline_parse_token_string_t cmd_setbonding_tx_prepare_tx_prepare =
+	TOKEN_STRING_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+			tx_prepare, "tx_prepare");
+static cmdline_parse_token_num_t cmd_setbonding_tx_prepare_port_id =
+	TOKEN_NUM_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+			port_id, RTE_UINT16);
+static cmdline_parse_token_string_t cmd_setbonding_tx_prepare_mode =
+	TOKEN_STRING_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+			mode, "enable#disable");
+
+static cmdline_parse_inst_t cmd_set_bond_tx_prepare = {
+		.f = cmd_set_bonding_tx_prepare_parsed,
+		.help_str = "set bonding tx_prepare <port_id> enable|disable: "
+			"Enable/disable tx_prepare for port_id",
+		.data = NULL,
+		.tokens = {
+			(void *)&cmd_setbonding_tx_prepare_set,
+			(void *)&cmd_setbonding_tx_prepare_bonding,
+			(void *)&cmd_setbonding_tx_prepare_tx_prepare,
+			(void *)&cmd_setbonding_tx_prepare_port_id,
+			(void *)&cmd_setbonding_tx_prepare_mode,
+			NULL
+		}
+};
+
 static struct testpmd_driver_commands bonding_cmds = {
 	.commands = {
 	{
@@ -1024,6 +1090,11 @@ static struct testpmd_driver_commands bonding_cmds = {
 		"set bonding mode IEEE802.3AD aggregator policy (port_id) (agg_name)\n"
 		"	Set Aggregation mode for IEEE802.3AD (mode 4)\n",
 	},
+	{
+		&cmd_set_bond_tx_prepare,
+		"set bonding tx_prepare <port_id> (enable|disable)\n"
+		"	Enable/disable tx_prepare for bonded device\n",
+	},
 	{ NULL, NULL },
 	},
 };
-- 
2.33.0
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v2 0/3] add Tx prepare support for bonding driver
  2022-07-25  4:08     ` [PATCH v2 0/3] add Tx prepare support for bonding driver Chengwen Feng
                         ` (2 preceding siblings ...)
  2022-07-25  4:08       ` [PATCH v2 3/3] net/bonding: add testpmd cmd for Tx prepare Chengwen Feng
@ 2022-07-25  7:04       ` humin (Q)
  2022-09-13  1:41       ` fengchengwen
  4 siblings, 0 replies; 61+ messages in thread
From: humin (Q) @ 2022-07-25  7:04 UTC (permalink / raw)
  To: Chengwen Feng, thomas, ferruh.yigit
  Cc: dev, chas3, andrew.rybchenko, konstantin.ananyev
Reviewed-by: Min Hu (Connor) <humin29@huawei.com>
On 2022/7/25 12:08, Chengwen Feng wrote:
> This patchset adds Tx prepare for bonding driver.
>
> Chengchang Tang (1):
>    net/bonding: add testpmd cmd for Tx prepare
>
> Chengwen Feng (2):
>    net/bonding: support Tx prepare
>    net/bonding: support Tx prepare fail stats
>
>   .../link_bonding_poll_mode_drv_lib.rst        | 14 ++++
>   doc/guides/rel_notes/release_22_11.rst        |  6 ++
>   drivers/net/bonding/bonding_testpmd.c         | 73 ++++++++++++++++-
>   drivers/net/bonding/eth_bond_private.h        |  8 ++
>   drivers/net/bonding/rte_eth_bond.h            | 24 ++++++
>   drivers/net/bonding/rte_eth_bond_api.c        | 32 ++++++++
>   drivers/net/bonding/rte_eth_bond_pmd.c        | 81 +++++++++++++++++--
>   drivers/net/bonding/version.map               |  5 ++
>   8 files changed, 234 insertions(+), 9 deletions(-)
>
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v2 0/3] add Tx prepare support for bonding driver
  2022-07-25  4:08     ` [PATCH v2 0/3] add Tx prepare support for bonding driver Chengwen Feng
                         ` (3 preceding siblings ...)
  2022-07-25  7:04       ` [PATCH v2 0/3] add Tx prepare support for bonding driver humin (Q)
@ 2022-09-13  1:41       ` fengchengwen
  4 siblings, 0 replies; 61+ messages in thread
From: fengchengwen @ 2022-09-13  1:41 UTC (permalink / raw)
  To: thomas, ferruh.yigit, andrew.rybchenko
  Cc: dev, chas3, humin29, konstantin.ananyev
Kindly ping.
On 2022/7/25 12:08, Chengwen Feng wrote:
> This patchset adds Tx prepare for bonding driver.
> 
> Chengchang Tang (1):
>   net/bonding: add testpmd cmd for Tx prepare
> 
> Chengwen Feng (2):
>   net/bonding: support Tx prepare
>   net/bonding: support Tx prepare fail stats
> 
>  .../link_bonding_poll_mode_drv_lib.rst        | 14 ++++
>  doc/guides/rel_notes/release_22_11.rst        |  6 ++
>  drivers/net/bonding/bonding_testpmd.c         | 73 ++++++++++++++++-
>  drivers/net/bonding/eth_bond_private.h        |  8 ++
>  drivers/net/bonding/rte_eth_bond.h            | 24 ++++++
>  drivers/net/bonding/rte_eth_bond_api.c        | 32 ++++++++
>  drivers/net/bonding/rte_eth_bond_pmd.c        | 81 +++++++++++++++++--
>  drivers/net/bonding/version.map               |  5 ++
>  8 files changed, 234 insertions(+), 9 deletions(-)
> 
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v2 1/3] net/bonding: support Tx prepare
  2022-07-25  4:08       ` [PATCH v2 1/3] net/bonding: support Tx prepare Chengwen Feng
@ 2022-09-13 10:22         ` Ferruh Yigit
  2022-09-13 15:08           ` Chas Williams
  2022-09-14  0:46           ` fengchengwen
  0 siblings, 2 replies; 61+ messages in thread
From: Ferruh Yigit @ 2022-09-13 10:22 UTC (permalink / raw)
  To: Chengwen Feng, thomas, andrew.rybchenko, konstantin.ananyev
  Cc: dev, chas3, humin29
On 7/25/2022 5:08 AM, Chengwen Feng wrote:
> 
> Normally, to use HW offload capabilities (e.g. checksum and TSO) in
> the Tx direction, the application needs to call rte_eth_tx_prepare() to
> adjust the packets before sending them (e.g. processing pseudo headers
> when Tx checksum offload is enabled). But the tx_prepare callback of the
> bonding driver is not implemented. Therefore, the sent packets may have
> errors (e.g. checksum errors).
> 
> However, it is difficult to design the tx_prepare callback for bonding
> driver. Because when a bonded device sends packets, the bonded device
> allocates the packets to different slave devices based on the real-time
> link status and bonding mode. That is, it is very difficult for the
> bonded device to determine which slave device's prepare function should
> be invoked.
> 
> So, in this patch, the tx_prepare callback of the bonding driver is not
> implemented. Instead, rte_eth_tx_prepare() will be called for
> all the fast path packets in mode 0, 1, 2, 4, 5, 6 (mode 3 is not
> included, see [1]). In this way, all tx_offloads can be processed
> correctly for all NIC devices in these modes.
> 
> As previously discussed (see V1), if tx_prepare fails, the bonding
> driver will free the corresponding packets internally, and only the
> packets that pass tx_prepare are transmitted.
> 
Please provide link to discussion you refer to.
> To minimize performance impact, this patch adds one new
> 'tx_prepare_enabled' field, and corresponding control and get API:
> rte_eth_bond_tx_prepare_set() and rte_eth_bond_tx_prepare_get().
> 
> [1]: In bond mode 3 (broadcast), a packet needs to be sent by all slave
> ports. Different slave PMDs process the packets differently in
> tx_prepare. If call tx_prepare before each slave port sending, the sent
> packet may be incorrect.
> 
> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
<...>
> +static inline uint16_t
> +bond_ethdev_tx_wrap(struct bond_tx_queue *bd_tx_q, uint16_t slave_port_id,
> +                   struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
> +{
> +       struct bond_dev_private *internals = bd_tx_q->dev_private;
> +       uint16_t queue_id = bd_tx_q->queue_id;
> +       struct rte_mbuf *fail_pkts[nb_pkts];
> +       uint8_t fail_mark[nb_pkts];
> +       uint16_t nb_pre, index;
> +       uint16_t fail_cnt = 0;
> +       int i;
> +
> +       if (!internals->tx_prepare_enabled)
> +               goto tx_burst;
> +
> +       nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id, tx_pkts, nb_pkts);
> +       if (nb_pre == nb_pkts)
> +               goto tx_burst;
> +
> +       fail_pkts[fail_cnt++] = tx_pkts[nb_pre];
> +       memset(fail_mark, 0, sizeof(fail_mark));
> +       fail_mark[nb_pre] = 1;
> +       for (i = nb_pre + 1; i < nb_pkts; /* update in inner loop */) {
> +               nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id,
> +                                           tx_pkts + i, nb_pkts - i);
I assume the intention is to make this as transparent as possible to the
user, which is why you are using a wrapper that combines the
`rte_eth_tx_prepare()` & `rte_eth_tx_burst()` APIs. But for other PMDs
`rte_eth_tx_burst()` is called explicitly by the application.
The patch also adds two new bonding-specific APIs to enable/disable Tx
prepare.
If you instead leave the decision to call `rte_eth_tx_prepare()` to the
user, there will be no need for the enable/disable Tx prepare APIs or the
wrapper.
The `tx_pkt_prepare()` implementation in bonding can do the mode check,
call Tx prepare for all slaves and apply failure recovery, as done in
this wrapper function. What do you think, will it work?
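For illustration, the "failure recovery" mentioned above could be sketched roughly as follows. This is a simplified, self-contained model and not the patch's code: `struct pkt`, `model_prepare()` and `prepare_and_drop_failed()` are hypothetical names, the `freed` flag stands in for freeing an mbuf, and the only assumption carried over from the thread is that a prepare step returns the number of leading packets it accepted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet model; not a DPDK type. */
struct pkt {
	bool offload_ok; /* would the prepare step accept this packet? */
	bool freed;      /* stands in for the packet having been freed */
};

/* Model of a prepare step: returns how many leading packets were accepted. */
static uint16_t
model_prepare(struct pkt **pkts, uint16_t nb_pkts)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		if (!pkts[i]->offload_ok)
			break;
	return i;
}

/*
 * Prepare a burst, drop every packet that fails, and compact the
 * survivors to the front of the array. Returns the number of survivors.
 */
static uint16_t
prepare_and_drop_failed(struct pkt **pkts, uint16_t nb_pkts)
{
	uint16_t done = 0; /* packets examined so far */
	uint16_t kept = 0; /* packets that passed prepare */

	assert(pkts != NULL || nb_pkts == 0);
	while (done < nb_pkts) {
		uint16_t ok = model_prepare(pkts + done, nb_pkts - done);
		uint16_t j;

		for (j = 0; j < ok; j++)
			pkts[kept++] = pkts[done + j];
		done += ok;
		if (done < nb_pkts) {
			pkts[done]->freed = true; /* first failing packet */
			done++;                   /* resume after it */
		}
	}
	return kept;
}
```

The survivors keep their relative order, so a following burst call can be handed the first `kept` entries directly.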
* Re: [PATCH v2 1/3] net/bonding: support Tx prepare
  2022-09-13 10:22         ` Ferruh Yigit
@ 2022-09-13 15:08           ` Chas Williams
  2022-09-14  0:46           ` fengchengwen
  1 sibling, 0 replies; 61+ messages in thread
From: Chas Williams @ 2022-09-13 15:08 UTC (permalink / raw)
  To: Ferruh Yigit, Chengwen Feng, thomas, andrew.rybchenko,
	konstantin.ananyev
  Cc: dev, chas3, humin29
On 9/13/22 06:22, Ferruh Yigit wrote:
> On 7/25/2022 5:08 AM, Chengwen Feng wrote:
>
>
> I assume intention is to make this as transparent as possible to the
> user, that is why you are using a wrapper that combines
> `rte_eth_tx_prepare()` & `rte_eth_tx_burst()` APIs. But for other PMDs
> `rte_eth_tx_burst()` is called explicitly by the application.
>
> Patch is also adding two new bonding specific APIs to enable/disable Tx
> prepare.
> Instead if you leave calling `rte_eth_tx_prepare()` decision to user,
> there will be no need for the enable/disable Tx prepare APIs and the
> wrapper.
>
> The `tx_pkt_prepare()` implementation in bonding can do the mode check,
> call Tx prepare for all slaves and apply failure recovery, as done in
> this wrapper function, what do you think, will it work?
>
If you leave calling tx_prepare to the user, you need to perform a
hash calculation to determine an output device. Since the output
devices might be different types, you really need to know which
device's tx_prepare needs to be called. You potentially will calculate
the packet hash twice. While you could be clever with storing the hash
in the mbuf somewhere, that's likely more complicated.
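A toy model of the double hash calculation described above, under stated assumptions: `select_slave()`, `model_bond_prepare()` and `model_bond_burst()` are hypothetical helpers, and the multiplicative hash is only a placeholder for whatever transmit policy the bonding mode actually uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packet model: a single flow identifier. */
struct pkt {
	uint32_t flow;
};

static unsigned int hash_calls; /* counts how often the hash runs */

/* Placeholder transmit policy: map a flow to one of the slaves. */
static uint16_t
select_slave(const struct pkt *p, uint16_t slave_count)
{
	assert(slave_count > 0);
	hash_calls++;
	return (uint16_t)((p->flow * 2654435761u) % slave_count);
}

/* A separate prepare step must pick the slave to know whose prepare applies. */
static uint16_t
model_bond_prepare(const struct pkt *p, uint16_t slave_count)
{
	return select_slave(p, slave_count);
}

/* The burst step picks the slave again to know where to send. */
static uint16_t
model_bond_burst(const struct pkt *p, uint16_t slave_count)
{
	return select_slave(p, slave_count);
}
```

Both steps agree on the slave only as long as the slave set does not change between them, and the hash cost is paid twice per packet.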
* Re: [PATCH v2 1/3] net/bonding: support Tx prepare
  2022-09-13 10:22         ` Ferruh Yigit
  2022-09-13 15:08           ` Chas Williams
@ 2022-09-14  0:46           ` fengchengwen
  2022-09-14 16:59             ` Chas Williams
  1 sibling, 1 reply; 61+ messages in thread
From: fengchengwen @ 2022-09-14  0:46 UTC (permalink / raw)
  To: Ferruh Yigit, thomas, andrew.rybchenko, konstantin.ananyev
  Cc: dev, chas3, humin29, 3chas3
Hi, Ferruh
On 2022/9/13 18:22, Ferruh Yigit wrote:
> On 7/25/2022 5:08 AM, Chengwen Feng wrote:
> 
>>
>> Normally, to use the HW offloads capability (e.g. checksum and TSO) in
>> the Tx direction, the application needs to call rte_eth_tx_prepare to
>> do some adjustment with the packets before sending them (e.g. processing
>> pseudo headers when Tx checksum offload enabled). But, the tx_prepare
>> callback of the bonding driver is not implemented. Therefore, the
>> sent packets may have errors (e.g. checksum errors).
>>
>> However, it is difficult to design the tx_prepare callback for bonding
>> driver. Because when a bonded device sends packets, the bonded device
>> allocates the packets to different slave devices based on the real-time
>> link status and bonding mode. That is, it is very difficult for the
>> bonded device to determine which slave device's prepare function should
>> be invoked.
>>
>> So, in this patch, the tx_prepare callback of bonding driver is not
>> implemented. Instead, the rte_eth_tx_prepare() will be called for
>> all the fast path packets in mode 0, 1, 2, 4, 5, 6 (mode 3 is not
>> included, see[1]). In this way, all tx_offloads can be processed
>> correctly for all NIC devices in these modes.
>>
>> As previously discussed (see V1), if the tx_prepare fails, the bonding
>> driver will free the corresponding packets internally, and only the
>> packets that pass tx_prepare are transmitted.
>>
> 
> Please provide link to discussion you refer to.
https://inbox.dpdk.org/dev/1618571071-5927-2-git-send-email-tangchengchang@huawei.com/
Should I push a new version for it?
> 
>> To minimize performance impact, this patch adds one new
>> 'tx_prepare_enabled' field, and corresponding control and get API:
>> rte_eth_bond_tx_prepare_set() and rte_eth_bond_tx_prepare_get().
>>
>> [1]: In bond mode 3 (broadcast), a packet needs to be sent by all slave
>> ports. Different slave PMDs process the packets differently in
>> tx_prepare. If call tx_prepare before each slave port sending, the sent
>> packet may be incorrect.
>>
>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> 
> <...>
> 
>> +static inline uint16_t
>> +bond_ethdev_tx_wrap(struct bond_tx_queue *bd_tx_q, uint16_t slave_port_id,
>> +                   struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
>> +{
>> +       struct bond_dev_private *internals = bd_tx_q->dev_private;
>> +       uint16_t queue_id = bd_tx_q->queue_id;
>> +       struct rte_mbuf *fail_pkts[nb_pkts];
>> +       uint8_t fail_mark[nb_pkts];
>> +       uint16_t nb_pre, index;
>> +       uint16_t fail_cnt = 0;
>> +       int i;
>> +
>> +       if (!internals->tx_prepare_enabled)
>> +               goto tx_burst;
>> +
>> +       nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id, tx_pkts, nb_pkts);
>> +       if (nb_pre == nb_pkts)
>> +               goto tx_burst;
>> +
>> +       fail_pkts[fail_cnt++] = tx_pkts[nb_pre];
>> +       memset(fail_mark, 0, sizeof(fail_mark));
>> +       fail_mark[nb_pre] = 1;
>> +       for (i = nb_pre + 1; i < nb_pkts; /* update in inner loop */) {
>> +               nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id,
>> +                                           tx_pkts + i, nb_pkts - i);
> 
> 
> I assume intention is to make this as transparent as possible to the user, that is why you are using a wrapper that combines `rte_eth_tx_prepare()` & `rte_eth_tx_burst()` APIs. But for other PMDs `rte_eth_tx_burst()` is called explicitly by the application.
> 
> Patch is also adding two new bonding specific APIs to enable/disable Tx prepare.
> Instead if you leave calling `rte_eth_tx_prepare()` decision to user, there will be no need for the enable/disable Tx prepare APIs and the wrapper.
> 
> The `tx_pkt_prepare()` implementation in bonding can do the mode check, call Tx prepare for all slaves and apply failure recovery, as done in this wrapper function, what do you think, will it work?
I see Chas Williams has also replied to this thread, thanks.
The main problem is that it is hard to design a tx_prepare for the bonding
device:
1. As Chas Williams said, the hash may be calculated twice to get the
   target slave devices.
2. More importantly, if the slave devices change (e.g. a slave device's
   link goes down or the device is removed) between bond-tx-prepare and
   bond-tx-burst, the output slave will change, and this may lead to
   checksum failures. (Note: a bond device may have slave devices from
   different vendors, and the slave devices may have different
   requirements, e.g. slave-A calculates the IPv4 pseudo-header
   automatically (no driver pre-calculation needed), but slave-B needs
   the driver to pre-calculate it.)
The current design covers the above two scenarios by using in-place
tx-prepare. In addition, bond devices are not transparent to
applications, so I think it is a practical way to provide tx-prepare
support.
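The second scenario can be modelled in a few lines. The sketch below is only an illustration under assumed behaviour: `struct slave`, `split_path()` and `in_place_path()` are hypothetical, and "needs the pseudo-header pre-calculated" is reduced to a single flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical models; not DPDK types. */
struct pkt {
	bool pseudo_hdr_done; /* driver already pre-calculated the pseudo-header */
};

struct slave {
	bool needs_pseudo_hdr; /* slave-B style: driver must pre-calculate */
};

/* Prepare for one specific slave. */
static void
model_prepare(const struct slave *s, struct pkt *p)
{
	assert(s != NULL && p != NULL);
	if (s->needs_pseudo_hdr)
		p->pseudo_hdr_done = true;
}

/* Returns true when the packet would leave this slave with a valid checksum. */
static bool
model_burst(const struct slave *s, const struct pkt *p)
{
	return !s->needs_pseudo_hdr || p->pseudo_hdr_done;
}

/* Split design: the output slave may differ between prepare and burst. */
static bool
split_path(const struct slave *at_prepare, const struct slave *at_burst,
	   struct pkt *p)
{
	model_prepare(at_prepare, p);
	return model_burst(at_burst, p);
}

/* In-place design: prepare and burst always use the same slave. */
static bool
in_place_path(const struct slave *active, struct pkt *p)
{
	model_prepare(active, p);
	return model_burst(active, p);
}
```

With slave-A selected at prepare time and slave-B at burst time, `split_path()` reports a bad checksum, while `in_place_path()` cannot hit that window.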
> 
> .
* Re: [PATCH v2 1/3] net/bonding: support Tx prepare
  2022-09-14  0:46           ` fengchengwen
@ 2022-09-14 16:59             ` Chas Williams
  2022-09-17  2:35               ` fengchengwen
  0 siblings, 1 reply; 61+ messages in thread
From: Chas Williams @ 2022-09-14 16:59 UTC (permalink / raw)
  To: fengchengwen, Ferruh Yigit, thomas, andrew.rybchenko, konstantin.ananyev
  Cc: dev, chas3, humin29
On 9/13/22 20:46, fengchengwen wrote:
> 
> The main problem is hard to design a tx_prepare for bonding device:
> 1. as Chas Williams said, there maybe twice hash calc to get target slave
>     devices.
> 2. also more important, if the slave devices have changes(e.g. slave device
>     link down or remove), and if the changes happens between bond-tx-prepare and
>     bond-tx-burst, the output slave will changes, and this may lead to checksum
>     failed. (Note: a bond device with slave devices may from different vendors,
>     and slave devices may have different requirements, e.g. slave-A support calc
>     IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
>     pre-calc).
> 
> Current design cover the above two scenarios by using in-place tx-prepare. and
> in addition, bond devices are not transparent to applications, I think it's a
> practical method to provide tx-prepare support in this way.
> 
I don't think you need to export an enable/disable routine for the use of
rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
implemented. You are just trading one branch in DPDK librte_eth_dev for a
branch in drivers/net/bonding.
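To make the "one branch" argument concrete, here is a minimal model, assuming the ethdev layer simply accepts the whole burst when a driver registers no prepare callback (as the paragraph above implies). `struct port`, `model_tx_prepare()` and `prepare_first_two()` are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not DPDK types. */
struct pkt {
	int unused;
};

typedef uint16_t (*prepare_fn)(struct pkt **pkts, uint16_t nb_pkts);

struct port {
	prepare_fn tx_pkt_prepare; /* NULL when the driver has no prepare step */
};

/* Models a prepare entry point: a missing callback accepts every packet. */
static uint16_t
model_tx_prepare(const struct port *p, struct pkt **pkts, uint16_t nb_pkts)
{
	assert(p != NULL);
	if (p->tx_pkt_prepare == NULL)
		return nb_pkts; /* one branch, nothing else to do */
	return p->tx_pkt_prepare(pkts, nb_pkts);
}

/* Example callback that accepts at most the first two packets. */
static uint16_t
prepare_first_two(struct pkt **pkts, uint16_t nb_pkts)
{
	(void)pkts;
	return nb_pkts < 2 ? nb_pkts : 2;
}
```

In this model an extra `tx_prepare_enabled` check in the bonding driver would guard a call whose cheapest path is already a single NULL test.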
I think you missed fixing tx_machine in 802.3ad support. We have been using
the following patch locally which I never got around to submitting.
 From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
From: "Charles (Chas) Williams" <chwillia@ciena.com>
Date: Tue, 3 May 2022 16:52:37 -0400
Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
Some PMDs might require a call to rte_eth_tx_prepare before sending the
packets for transmission. Typically, the prepare step handles the VLAN
headers, but it may need to do other things.
Signed-off-by: Chas Williams <chwillia@ciena.com>
---
  drivers/net/bonding/rte_eth_bond_8023ad.c | 16 +++++++-
  drivers/net/bonding/rte_eth_bond_pmd.c    | 50 +++++++++++++++++++----
  2 files changed, 55 insertions(+), 11 deletions(-)
diff --git a/drivers/net/bonding/rte_eth_bond_8023ad.c b/drivers/net/bonding/rte_eth_bond_8023ad.c
index b3cddd8a20..0680be8c06 100644
--- a/drivers/net/bonding/rte_eth_bond_8023ad.c
+++ b/drivers/net/bonding/rte_eth_bond_8023ad.c
@@ -636,9 +636,15 @@ tx_machine(struct bond_dev_private *internals, uint16_t slave_id)
  			return;
  		}
  	} else {
-		uint16_t pkts_sent = rte_eth_tx_burst(slave_id,
+		uint16_t pkts_sent;
+		uint16_t nb_prep;
+
+		nb_prep = rte_eth_tx_prepare(slave_id,
  				internals->mode4.dedicated_queues.tx_qid,
  				&lacp_pkt, 1);
+		pkts_sent = rte_eth_tx_burst(slave_id,
+				internals->mode4.dedicated_queues.tx_qid,
+				&lacp_pkt, nb_prep);
  		if (pkts_sent != 1) {
  			rte_pktmbuf_free(lacp_pkt);
  			set_warning_flags(port, WRN_TX_QUEUE_FULL);
@@ -1371,9 +1377,15 @@ bond_mode_8023ad_handle_slow_pkt(struct bond_dev_private *internals,
  			}
  		} else {
  			/* Send packet directly to the slow queue */
-			uint16_t tx_count = rte_eth_tx_burst(slave_id,
+			uint16_t tx_count;
+			uint16_t nb_prep;
+
+			nb_prep = rte_eth_tx_prepare(slave_id,
  					internals->mode4.dedicated_queues.tx_qid,
  					&pkt, 1);
+			tx_count = rte_eth_tx_burst(slave_id,
+					internals->mode4.dedicated_queues.tx_qid,
+					&pkt, nb_prep);
  			if (tx_count != 1) {
  				/* reset timer */
  				port->rx_marker_timer = 0;
diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index b305b6a35b..c27073e66f 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -602,8 +602,12 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
  	/* Send packet burst on each slave device */
  	for (i = 0; i < num_of_slaves; i++) {
  		if (slave_nb_pkts[i] > 0) {
-			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+			uint16_t nb_prep;
+
+			nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
  					slave_bufs[i], slave_nb_pkts[i]);
+			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+					slave_bufs[i], nb_prep);
  
  			/* if tx burst fails move packets to end of bufs */
  			if (unlikely(num_tx_slave < slave_nb_pkts[i])) {
@@ -628,6 +632,7 @@ bond_ethdev_tx_burst_active_backup(void *queue,
  {
  	struct bond_dev_private *internals;
  	struct bond_tx_queue *bd_tx_q;
+	uint16_t nb_prep;
  
  	bd_tx_q = (struct bond_tx_queue *)queue;
  	internals = bd_tx_q->dev_private;
@@ -635,8 +640,10 @@ bond_ethdev_tx_burst_active_backup(void *queue,
  	if (internals->active_slave_count < 1)
  		return 0;
  
-	return rte_eth_tx_burst(internals->current_primary_port, bd_tx_q->queue_id,
-			bufs, nb_pkts);
+	nb_prep = rte_eth_tx_prepare(internals->current_primary_port,
+				     bd_tx_q->queue_id, bufs, nb_pkts);
+	return rte_eth_tx_burst(internals->current_primary_port,
+				bd_tx_q->queue_id, bufs, nb_prep);
  }
  
  static inline uint16_t
@@ -935,6 +942,8 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
  	}
  
  	for (i = 0; i < num_of_slaves; i++) {
+		uint16_t nb_prep;
+
  		rte_eth_macaddr_get(slaves[i], &active_slave_addr);
  		for (j = num_tx_total; j < nb_pkts; j++) {
  			if (j + 3 < nb_pkts)
@@ -951,8 +960,10 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
  #endif
  		}
  
-		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+		nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
  				bufs + num_tx_total, nb_pkts - num_tx_total);
+		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+				bufs + num_tx_total, nb_prep);
  
  		if (num_tx_total == nb_pkts)
  			break;
@@ -1064,8 +1075,12 @@ bond_ethdev_tx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
  	/* Send ARP packets on proper slaves */
  	for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
  		if (slave_bufs_pkts[i] > 0) {
-			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id,
+			uint16_t nb_prep;
+
+			nb_prep = rte_eth_tx_prepare(i, bd_tx_q->queue_id,
  					slave_bufs[i], slave_bufs_pkts[i]);
+			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id,
+					slave_bufs[i], nb_prep);
  			for (j = 0; j < slave_bufs_pkts[i] - num_send; j++) {
  				bufs[nb_pkts - 1 - num_not_send - j] =
  						slave_bufs[i][nb_pkts - 1 - j];
@@ -1088,8 +1103,12 @@ bond_ethdev_tx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
  	/* Send update packets on proper slaves */
  	for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
  		if (update_bufs_pkts[i] > 0) {
-			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id, update_bufs[i],
+			uint16_t nb_prep;
+
+			nb_prep = rte_eth_tx_prepare(i, bd_tx_q->queue_id, update_bufs[i],
  					update_bufs_pkts[i]);
+			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id, update_bufs[i],
+					nb_prep);
  			for (j = num_send; j < update_bufs_pkts[i]; j++) {
  				rte_pktmbuf_free(update_bufs[i][j]);
  			}
@@ -1155,12 +1174,17 @@ tx_burst_balance(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
  
  	/* Send packet burst on each slave device */
  	for (i = 0; i < slave_count; i++) {
+		uint16_t nb_prep;
+
  		if (slave_nb_bufs[i] == 0)
  			continue;
  
-		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+		nb_prep = rte_eth_tx_prepare(slave_port_ids[i],
  				bd_tx_q->queue_id, slave_bufs[i],
  				slave_nb_bufs[i]);
+		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+				bd_tx_q->queue_id, slave_bufs[i],
+				nb_prep);
  
  		total_tx_count += slave_tx_count;
  
@@ -1243,8 +1267,12 @@ tx_burst_8023ad(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
  
  		if (rte_ring_dequeue(port->tx_ring,
  				     (void **)&ctrl_pkt) != -ENOENT) {
-			slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+			uint16_t nb_prep;
+
+			nb_prep = rte_eth_tx_prepare(slave_port_ids[i],
  					bd_tx_q->queue_id, &ctrl_pkt, 1);
+			slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+					bd_tx_q->queue_id, &ctrl_pkt, nb_prep);
  			/*
  			 * re-enqueue LAG control plane packets to buffering
  			 * ring if transmission fails so the packet isn't lost.
@@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
  
  	/* Transmit burst on each active slave */
  	for (i = 0; i < num_of_slaves; i++) {
-		slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+		uint16_t nb_prep;
+
+		nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
  					bufs, nb_pkts);
+		slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+					bufs, nb_prep);
  
  		if (unlikely(slave_tx_total[i] < nb_pkts))
  			tx_failed_flag = 1;
-- 
2.30.2
* Re: [PATCH v2 1/3] net/bonding: support Tx prepare
  2022-09-14 16:59             ` Chas Williams
@ 2022-09-17  2:35               ` fengchengwen
  2022-09-17 13:38                 ` Chas Williams
  0 siblings, 1 reply; 61+ messages in thread
From: fengchengwen @ 2022-09-17  2:35 UTC (permalink / raw)
  To: Chas Williams, Ferruh Yigit, thomas, andrew.rybchenko,
	konstantin.ananyev
  Cc: dev, chas3, humin29
Hi Chas,
On 2022/9/15 0:59, Chas Williams wrote:
> On 9/13/22 20:46, fengchengwen wrote:
>>
>> The main problem is hard to design a tx_prepare for bonding device:
>> 1. as Chas Williams said, there maybe twice hash calc to get target slave
>>     devices.
>> 2. also more important, if the slave devices have changes(e.g. slave device
>>     link down or remove), and if the changes happens between bond-tx-prepare and
>>     bond-tx-burst, the output slave will changes, and this may lead to checksum
>>     failed. (Note: a bond device with slave devices may from different vendors,
>>     and slave devices may have different requirements, e.g. slave-A support calc
>>     IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
>>     pre-calc).
>>
>> Current design cover the above two scenarios by using in-place tx-prepare. and
>> in addition, bond devices are not transparent to applications, I think it's a
>> practical method to provide tx-prepare support in this way.
>>
> 
> 
> I don't think you need to export an enable/disable routine for the use of
> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
> implemented. You are just trading one branch in DPDK librte_eth_dev for a
> branch in drivers/net/bonding.
Our first patch was just like yours (it simply called tx-prepare by default), but the
community was concerned about the performance impact.
As a trade-off, I think we can add the enable/disable API.
> 
> I think you missed fixing tx_machine in 802.3ad support. We have been using
> the following patch locally which I never got around to submitting.
You are right, I will send a V3 to fix it.
> 
> 
> From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
> From: "Charles (Chas) Williams" <chwillia@ciena.com>
> Date: Tue, 3 May 2022 16:52:37 -0400
> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
> 
> Some PMDs might require a call to rte_eth_tx_prepare before sending the
> packets for transmission. Typically, the prepare step handles the VLAN
> headers, but it may need to do other things.
> 
> Signed-off-by: Chas Williams <chwillia@ciena.com>
...
>               * ring if transmission fails so the packet isn't lost.
> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>  
>      /* Transmit burst on each active slave */
>      for (i = 0; i < num_of_slaves; i++) {
> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> +        uint16_t nb_prep;
> +
> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>                      bufs, nb_pkts);
> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> +                    bufs, nb_prep);
Tx-prepare may edit the packet data, and broadcast mode sends the same packet to all slaves,
so the packet data may be edited while it is being sent. Is this likely to cause problems?
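A small model of this concern, with loudly stated assumptions: the two prepare functions are invented for illustration (one writes a pseudo-header sum, the other zeroes the field), `PSEUDO_CSUM` is an arbitrary value, and "burst" is reduced to keeping a reference to the shared buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet model: only the L4 checksum field matters here. */
struct pkt {
	uint16_t l4_csum;
};

#define PSEUDO_CSUM 0x1234u /* arbitrary stand-in for a pseudo-header sum */

/* Slave A (assumed): hardware expects the pseudo-header sum in the field. */
static void
prepare_a(struct pkt *p)
{
	p->l4_csum = PSEUDO_CSUM;
}

/* Slave B (assumed): hardware expects the field to be zeroed. */
static void
prepare_b(struct pkt *p)
{
	p->l4_csum = 0;
}

/*
 * Broadcast with a shared buffer: returns the checksum field that slave A's
 * hardware would read if it fetches the data after slave B was prepared.
 */
static uint16_t
broadcast_shared_buffer(struct pkt *p)
{
	const struct pkt *queued_on_a;

	assert(p != NULL);
	prepare_a(p);
	queued_on_a = p; /* burst only enqueues a reference to the buffer */
	prepare_b(p);    /* second slave edits the very same buffer */
	return queued_on_a->l4_csum;
}
```

In this model slave A ends up reading the value written for slave B, which is the kind of corruption the question is about.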
>  
>          if (unlikely(slave_tx_total[i] < nb_pkts))
>              tx_failed_flag = 1;
* [PATCH v3 0/3] add Tx prepare support for bonding driver
  2021-04-23  9:46   ` [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding Chengchang Tang
  2021-06-08  9:49     ` Andrew Rybchenko
  2022-07-25  4:08     ` [PATCH v2 0/3] add Tx prepare support for bonding driver Chengwen Feng
@ 2022-09-17  4:15     ` Chengwen Feng
  2022-09-17  4:15       ` [PATCH v3 1/3] net/bonding: support Tx prepare Chengwen Feng
                         ` (2 more replies)
  2022-10-09  3:36     ` [PATCH v4] net/bonding: call Tx prepare before Tx burst Chengwen Feng
  2022-10-11 13:20     ` [PATCH v5] " Chengwen Feng
  4 siblings, 3 replies; 61+ messages in thread
From: Chengwen Feng @ 2022-09-17  4:15 UTC (permalink / raw)
  To: thomas, ferruh.yigit
  Cc: dev, chas3, humin29, andrew.rybchenko, konstantin.ananyev, 3chas3
This patchset adds Tx prepare for bonding driver.
Chengchang Tang (1):
  net/bonding: add testpmd cmd for Tx prepare
Chengwen Feng (2):
  net/bonding: support Tx prepare
  net/bonding: support Tx prepare fail stats
---
v3: support tx-prepare for mbufs generated internally by the Tx path.
v2: support tx-prepare enable flag and fail stats.
 .../link_bonding_poll_mode_drv_lib.rst        |  14 +++
 doc/guides/rel_notes/release_22_11.rst        |   6 +
 drivers/net/bonding/bonding_testpmd.c         |  73 ++++++++++-
 drivers/net/bonding/eth_bond_private.h        |  13 ++
 drivers/net/bonding/rte_eth_bond.h            |  24 ++++
 drivers/net/bonding/rte_eth_bond_8023ad.c     |   8 +-
 drivers/net/bonding/rte_eth_bond_api.c        |  32 +++++
 drivers/net/bonding/rte_eth_bond_pmd.c        | 117 +++++++++++++++---
 drivers/net/bonding/version.map               |   5 +
 9 files changed, 273 insertions(+), 19 deletions(-)
-- 
2.17.1
* [PATCH v3 1/3] net/bonding: support Tx prepare
  2022-09-17  4:15     ` [PATCH v3 " Chengwen Feng
@ 2022-09-17  4:15       ` Chengwen Feng
  2022-09-17  4:15       ` [PATCH v3 2/3] net/bonding: support Tx prepare fail stats Chengwen Feng
  2022-09-17  4:15       ` [PATCH v3 3/3] net/bonding: add testpmd cmd for Tx prepare Chengwen Feng
  2 siblings, 0 replies; 61+ messages in thread
From: Chengwen Feng @ 2022-09-17  4:15 UTC (permalink / raw)
  To: thomas, ferruh.yigit
  Cc: dev, chas3, humin29, andrew.rybchenko, konstantin.ananyev, 3chas3
Normally, to use HW offload capabilities (e.g. checksum and TSO) in
the Tx direction, the application needs to call rte_eth_tx_prepare() to
adjust the packets before sending them (e.g. processing pseudo headers
when Tx checksum offload is enabled). However, the tx_prepare callback
of the bonding driver is not implemented. Therefore, the sent packets
may have errors (e.g. checksum errors).
It is difficult to design a tx_prepare callback for the bonding driver,
because when a bonded device sends packets, it distributes them to
different slave devices based on the real-time link status and the
bonding mode. That is, it is very difficult for the bonded device to
determine which slave device's prepare function should be invoked.
So, in this patch, the tx_prepare callback of the bonding driver is not
implemented. Instead, rte_eth_tx_prepare() is called for all the fast
path packets in modes 0, 1, 2, 4, 5 and 6 (mode 3 is not included,
see [1]). In this way, all Tx offloads can be processed correctly for
all NIC devices in these modes.
As previously discussed (see [2]), for user input mbufs, if tx_prepare
fails, the bonding driver frees the corresponding packets internally,
and only the packets that pass tx_prepare are transmitted.
To minimize the performance impact, this patch adds one new
'tx_prepare_enabled' field, and the corresponding set and get APIs:
rte_eth_bond_tx_prepare_set() and rte_eth_bond_tx_prepare_get().
[1]: In bond mode 3 (broadcast), a packet needs to be sent by all slave
ports. Different slave PMDs process the packets differently in
tx_prepare. If tx_prepare is called before sending on each slave port,
the sent packet may be incorrect.
[2]: https://inbox.dpdk.org/dev/1618571071-5927-2-git-send-email-tangchengchang@huawei.com/
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Min Hu (Connor) <humin29@huawei.com>
---
 .../link_bonding_poll_mode_drv_lib.rst        |   5 +
 doc/guides/rel_notes/release_22_11.rst        |   6 ++
 drivers/net/bonding/eth_bond_private.h        |   6 ++
 drivers/net/bonding/rte_eth_bond.h            |  24 +++++
 drivers/net/bonding/rte_eth_bond_8023ad.c     |   8 +-
 drivers/net/bonding/rte_eth_bond_api.c        |  32 ++++++
 drivers/net/bonding/rte_eth_bond_pmd.c        | 101 +++++++++++++++---
 drivers/net/bonding/version.map               |   5 +
 8 files changed, 169 insertions(+), 18 deletions(-)
diff --git a/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst b/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
index 9510368103..a3d91b2091 100644
--- a/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
+++ b/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
@@ -359,6 +359,11 @@ The link status of a bonded device is dictated by that of its slaves, if all
 slave device link status are down or if all slaves are removed from the link
 bonding device then the link status of the bonding device will go down.
 
+Unlike other PMDs, Tx prepare for the bonding driver is controlled by
+``rte_eth_bond_tx_prepare_set`` (all bond modes except mode 3 (broadcast)
+are supported). ``rte_eth_bond_tx_prepare_get`` is provided for querying
+whether it is enabled.
+
 It is also possible to configure / query the configuration of the control
 parameters of a bonded device using the provided APIs
 ``rte_eth_bond_mode_set/ get``, ``rte_eth_bond_primary_set/get``,
diff --git a/doc/guides/rel_notes/release_22_11.rst b/doc/guides/rel_notes/release_22_11.rst
index 8c021cf050..6e28a6c0af 100644
--- a/doc/guides/rel_notes/release_22_11.rst
+++ b/doc/guides/rel_notes/release_22_11.rst
@@ -55,6 +55,12 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added Tx prepare for bonding driver.**
+
+  * Added ``rte_eth_bond_tx_prepare_set`` to set whether Tx prepare is enabled for a bonded port.
+    All bond modes except mode 3 (broadcast) are supported.
+  * Added ``rte_eth_bond_tx_prepare_get`` to get whether Tx prepare is enabled for a bonded port.
+
 
 Removed Items
 -------------
diff --git a/drivers/net/bonding/eth_bond_private.h b/drivers/net/bonding/eth_bond_private.h
index 8222e3cd38..976163b06b 100644
--- a/drivers/net/bonding/eth_bond_private.h
+++ b/drivers/net/bonding/eth_bond_private.h
@@ -117,6 +117,7 @@ struct bond_dev_private {
 	uint16_t user_defined_primary_port;
 	/**< Flag for whether primary port is user defined or not */
 
+	uint8_t tx_prepare_enabled;
 	uint8_t balance_xmit_policy;
 	/**< Transmit policy - l2 / l23 / l34 for operation in balance mode */
 	burst_xmit_hash_t burst_xmit_hash;
@@ -258,6 +259,11 @@ void
 slave_add(struct bond_dev_private *internals,
 		struct rte_eth_dev *slave_eth_dev);
 
+uint16_t
+bond_ethdev_tx_ctrl_wrap(struct bond_dev_private *internals,
+			uint16_t slave_port_id, uint16_t queue_id,
+			struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
+
 void
 burst_xmit_l2_hash(struct rte_mbuf **buf, uint16_t nb_pkts,
 		uint16_t slave_count, uint16_t *slaves);
diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
index 874aa91a5f..deae2dd9ad 100644
--- a/drivers/net/bonding/rte_eth_bond.h
+++ b/drivers/net/bonding/rte_eth_bond.h
@@ -343,6 +343,30 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonded_port_id,
 int
 rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id);
 
+/**
+ * Set whether enable Tx-prepare (rte_eth_tx_prepare) for bonded port
+ *
+ * @param bonded_port_id      Bonded device id
+ * @param en                  Enable flag
+ *
+ * @return
+ *   0 on success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_eth_bond_tx_prepare_set(uint16_t bonded_port_id, bool en);
+
+/**
+ * Get whether Tx-prepare (rte_eth_tx_prepare) is enabled for bonded port
+ *
+ * @param bonded_port_id      Bonded device id
+ *
+ * @return
+ *   0-disabled, 1-enabled, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_eth_bond_tx_prepare_get(uint16_t bonded_port_id);
 
 #ifdef __cplusplus
 }
diff --git a/drivers/net/bonding/rte_eth_bond_8023ad.c b/drivers/net/bonding/rte_eth_bond_8023ad.c
index b3cddd8a20..d5951d684e 100644
--- a/drivers/net/bonding/rte_eth_bond_8023ad.c
+++ b/drivers/net/bonding/rte_eth_bond_8023ad.c
@@ -636,8 +636,8 @@ tx_machine(struct bond_dev_private *internals, uint16_t slave_id)
 			return;
 		}
 	} else {
-		uint16_t pkts_sent = rte_eth_tx_burst(slave_id,
-				internals->mode4.dedicated_queues.tx_qid,
+		uint16_t pkts_sent = bond_ethdev_tx_ctrl_wrap(internals,
+				slave_id, internals->mode4.dedicated_queues.tx_qid,
 				&lacp_pkt, 1);
 		if (pkts_sent != 1) {
 			rte_pktmbuf_free(lacp_pkt);
@@ -1371,8 +1371,8 @@ bond_mode_8023ad_handle_slow_pkt(struct bond_dev_private *internals,
 			}
 		} else {
 			/* Send packet directly to the slow queue */
-			uint16_t tx_count = rte_eth_tx_burst(slave_id,
-					internals->mode4.dedicated_queues.tx_qid,
+			uint16_t tx_count = bond_ethdev_tx_ctrl_wrap(internals,
+					slave_id, internals->mode4.dedicated_queues.tx_qid,
 					&pkt, 1);
 			if (tx_count != 1) {
 				/* reset timer */
diff --git a/drivers/net/bonding/rte_eth_bond_api.c b/drivers/net/bonding/rte_eth_bond_api.c
index 4ac191c468..47841289f4 100644
--- a/drivers/net/bonding/rte_eth_bond_api.c
+++ b/drivers/net/bonding/rte_eth_bond_api.c
@@ -1070,3 +1070,35 @@ rte_eth_bond_link_up_prop_delay_get(uint16_t bonded_port_id)
 
 	return internals->link_up_delay_ms;
 }
+
+int
+rte_eth_bond_tx_prepare_set(uint16_t bonded_port_id, bool en)
+{
+	struct bond_dev_private *internals;
+
+	if (valid_bonded_port_id(bonded_port_id) != 0)
+		return -1;
+
+	internals = rte_eth_devices[bonded_port_id].data->dev_private;
+	if (internals->mode == BONDING_MODE_BROADCAST) {
+		RTE_BOND_LOG(ERR, "Mode broadcast don't support to configure Tx-prepare");
+		return -ENOTSUP;
+	}
+
+	internals->tx_prepare_enabled = en ? 1 : 0;
+
+	return 0;
+}
+
+int
+rte_eth_bond_tx_prepare_get(uint16_t bonded_port_id)
+{
+	struct bond_dev_private *internals;
+
+	if (valid_bonded_port_id(bonded_port_id) != 0)
+		return -1;
+
+	internals = rte_eth_devices[bonded_port_id].data->dev_private;
+
+	return internals->tx_prepare_enabled;
+}
diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index 73e6972035..ec9d7d7bab 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -559,6 +559,76 @@ bond_ethdev_rx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	return nb_recv_pkts;
 }
 
+/** This function is used to transmit internally generated mbufs. */
+uint16_t
+bond_ethdev_tx_ctrl_wrap(struct bond_dev_private *internals,
+			 uint16_t slave_port_id, uint16_t queue_id,
+			 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	uint16_t nb_pre = nb_pkts;
+
+	if (internals->tx_prepare_enabled)
+		nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id, tx_pkts,
+					    nb_pkts);
+
+	return rte_eth_tx_burst(slave_port_id, queue_id, tx_pkts, nb_pre);
+}
+
+/**
+ * This function is used to transmit the mbufs input by the user.
+ * If the tx-prepare is enabled, the mbufs that fail with tx-prepare will be
+ * freed internally.
+ */
+static inline uint16_t
+bond_ethdev_tx_user_wrap(struct bond_tx_queue *bd_tx_q, uint16_t slave_port_id,
+		    struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
+{
+	struct bond_dev_private *internals = bd_tx_q->dev_private;
+	uint16_t queue_id = bd_tx_q->queue_id;
+	struct rte_mbuf *fail_pkts[nb_pkts];
+	uint8_t fail_mark[nb_pkts];
+	uint16_t nb_pre, index;
+	uint16_t fail_cnt = 0;
+	int i;
+
+	if (!internals->tx_prepare_enabled)
+		goto tx_burst;
+
+	nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id, tx_pkts, nb_pkts);
+	if (nb_pre == nb_pkts)
+		goto tx_burst;
+
+	fail_pkts[fail_cnt++] = tx_pkts[nb_pre];
+	memset(fail_mark, 0, sizeof(fail_mark));
+	fail_mark[nb_pre] = 1;
+	for (i = nb_pre + 1; i < nb_pkts; /* update in inner loop */) {
+		nb_pre = rte_eth_tx_prepare(slave_port_id, queue_id,
+					    tx_pkts + i, nb_pkts - i);
+		if (nb_pre == nb_pkts - i)
+			break;
+		fail_pkts[fail_cnt++] = tx_pkts[i + nb_pre];
+		fail_mark[i + nb_pre] = 1;
+		i += nb_pre + 1;
+	}
+
+	/* move tx-prepare OK mbufs to the end */
+	for (i = index = nb_pkts - 1; i >= 0; i--) {
+		if (!fail_mark[i])
+			tx_pkts[index--] = tx_pkts[i];
+	}
+	/* move tx-prepare failed mbufs to the beginning, and free them */
+	for (i = 0; i < fail_cnt; i++) {
+		tx_pkts[i] = fail_pkts[i];
+		rte_pktmbuf_free(fail_pkts[i]);
+	}
+
+	if (fail_cnt == nb_pkts)
+		return nb_pkts;
+tx_burst:
+	return fail_cnt + rte_eth_tx_burst(slave_port_id, queue_id,
+				tx_pkts + fail_cnt, nb_pkts - fail_cnt);
+}
+
 static uint16_t
 bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
 		uint16_t nb_pkts)
@@ -602,8 +672,9 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
 	/* Send packet burst on each slave device */
 	for (i = 0; i < num_of_slaves; i++) {
 		if (slave_nb_pkts[i] > 0) {
-			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
-					slave_bufs[i], slave_nb_pkts[i]);
+			num_tx_slave = bond_ethdev_tx_user_wrap(bd_tx_q,
+					slaves[i], slave_bufs[i],
+					slave_nb_pkts[i]);
 
 			/* if tx burst fails move packets to end of bufs */
 			if (unlikely(num_tx_slave < slave_nb_pkts[i])) {
@@ -635,8 +706,8 @@ bond_ethdev_tx_burst_active_backup(void *queue,
 	if (internals->active_slave_count < 1)
 		return 0;
 
-	return rte_eth_tx_burst(internals->current_primary_port, bd_tx_q->queue_id,
-			bufs, nb_pkts);
+	return bond_ethdev_tx_user_wrap(bd_tx_q,
+			internals->current_primary_port, bufs, nb_pkts);
 }
 
 static inline uint16_t
@@ -951,8 +1022,8 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 #endif
 		}
 
-		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
-				bufs + num_tx_total, nb_pkts - num_tx_total);
+		num_tx_total += bond_ethdev_tx_user_wrap(bd_tx_q, slaves[i],
+					bufs + num_tx_total, nb_pkts - num_tx_total);
 
 		if (num_tx_total == nb_pkts)
 			break;
@@ -1064,8 +1135,9 @@ bond_ethdev_tx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	/* Send ARP packets on proper slaves */
 	for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
 		if (slave_bufs_pkts[i] > 0) {
-			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id,
-					slave_bufs[i], slave_bufs_pkts[i]);
+			num_send = bond_ethdev_tx_ctrl_wrap(internals, i,
+					bd_tx_q->queue_id, slave_bufs[i],
+					slave_bufs_pkts[i]);
 			for (j = 0; j < slave_bufs_pkts[i] - num_send; j++) {
 				bufs[nb_pkts - 1 - num_not_send - j] =
 						slave_bufs[i][nb_pkts - 1 - j];
@@ -1088,7 +1160,8 @@ bond_ethdev_tx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	/* Send update packets on proper slaves */
 	for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
 		if (update_bufs_pkts[i] > 0) {
-			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id, update_bufs[i],
+			num_send = bond_ethdev_tx_ctrl_wrap(internals, i,
+					bd_tx_q->queue_id, update_bufs[i],
 					update_bufs_pkts[i]);
 			for (j = num_send; j < update_bufs_pkts[i]; j++) {
 				rte_pktmbuf_free(update_bufs[i][j]);
@@ -1158,9 +1231,8 @@ tx_burst_balance(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
 		if (slave_nb_bufs[i] == 0)
 			continue;
 
-		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
-				bd_tx_q->queue_id, slave_bufs[i],
-				slave_nb_bufs[i]);
+		slave_tx_count = bond_ethdev_tx_user_wrap(bd_tx_q,
+			slave_port_ids[i], slave_bufs[i], slave_nb_bufs[i]);
 
 		total_tx_count += slave_tx_count;
 
@@ -1243,8 +1315,9 @@ tx_burst_8023ad(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
 
 		if (rte_ring_dequeue(port->tx_ring,
 				     (void **)&ctrl_pkt) != -ENOENT) {
-			slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
-					bd_tx_q->queue_id, &ctrl_pkt, 1);
+			slave_tx_count = bond_ethdev_tx_ctrl_wrap(internals,
+					slave_port_ids[i], bd_tx_q->queue_id,
+					&ctrl_pkt, 1);
 			/*
 			 * re-enqueue LAG control plane packets to buffering
 			 * ring if transmission fails so the packet isn't lost.
diff --git a/drivers/net/bonding/version.map b/drivers/net/bonding/version.map
index 9333923b4e..2c121f2559 100644
--- a/drivers/net/bonding/version.map
+++ b/drivers/net/bonding/version.map
@@ -31,3 +31,8 @@ DPDK_23 {
 
 	local: *;
 };
+
+EXPERIMENTAL {
+	rte_eth_bond_tx_prepare_get;
+	rte_eth_bond_tx_prepare_set;
+};
\ No newline at end of file
-- 
2.17.1
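The reordering done by bond_ethdev_tx_user_wrap() above can be sketched
outside DPDK. This is only a model under stated assumptions: packets are
plain ints, a negative value stands for a packet that fails prepare, and
model_tx_prepare(), model_partition_failed() and MODEL_MAX_BURST are
hypothetical names that do not exist in the driver.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MAX_BURST 64 /* hypothetical upper bound for this sketch */

/*
 * Hypothetical stand-in for rte_eth_tx_prepare(): returns the number of
 * leading packets that pass. A negative value marks a failing packet.
 */
static uint16_t
model_tx_prepare(const int *pkts, uint16_t nb_pkts)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		if (pkts[i] < 0)
			break;
	return i;
}

/*
 * Reorder pkts so that failed packets come first and prepared packets
 * last, keeping the relative order inside each group.
 * Returns the number of failed packets.
 */
static uint16_t
model_partition_failed(int *pkts, uint16_t nb_pkts)
{
	int fail_pkts[MODEL_MAX_BURST];
	uint8_t fail_mark[MODEL_MAX_BURST];
	uint16_t fail_cnt = 0;
	uint16_t start = 0;
	int i, index;

	assert(nb_pkts <= MODEL_MAX_BURST);
	memset(fail_mark, 0, sizeof(fail_mark));

	/* Re-run prepare just past each failure until the burst is done. */
	while (start < nb_pkts) {
		uint16_t nb_pre = model_tx_prepare(pkts + start,
						   nb_pkts - start);

		if (nb_pre == nb_pkts - start)
			break;
		fail_pkts[fail_cnt++] = pkts[start + nb_pre];
		fail_mark[start + nb_pre] = 1;
		start += nb_pre + 1;
	}

	/* Move prepared packets to the end, scanning backwards. */
	index = nb_pkts - 1;
	for (i = nb_pkts - 1; i >= 0; i--)
		if (!fail_mark[i])
			pkts[index--] = pkts[i];

	/* Put failed packets at the front (the patch frees them here). */
	for (i = 0; i < fail_cnt; i++)
		pkts[i] = fail_pkts[i];

	return fail_cnt;
}
```

In this model a burst of {1, -2, 3, -4, 5} ends up as {-2, -4, 1, 3, 5}
with a return value of 2, so a caller would hand only the last three
entries to the transmit routine.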
^ permalink raw reply	[flat|nested] 61+ messages in thread
* [PATCH v3 2/3] net/bonding: support Tx prepare fail stats
  2022-09-17  4:15     ` [PATCH v3 " Chengwen Feng
  2022-09-17  4:15       ` [PATCH v3 1/3] net/bonding: support Tx prepare Chengwen Feng
@ 2022-09-17  4:15       ` Chengwen Feng
  2022-09-17  4:15       ` [PATCH v3 3/3] net/bonding: add testpmd cmd for Tx prepare Chengwen Feng
  2 siblings, 0 replies; 61+ messages in thread
From: Chengwen Feng @ 2022-09-17  4:15 UTC (permalink / raw)
  To: thomas, ferruh.yigit
  Cc: dev, chas3, humin29, andrew.rybchenko, konstantin.ananyev, 3chas3
If Tx prepare fails, the bonding driver frees the corresponding
packets internally and transmits only the packets that passed Tx prepare.
This patch counts the Tx prepare failures and adds the count to the
oerrors field of 'struct rte_eth_stats'.
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Min Hu (Connor) <humin29@huawei.com>
---
 drivers/net/bonding/eth_bond_private.h |  7 +++++++
 drivers/net/bonding/rte_eth_bond_pmd.c | 16 ++++++++++++++++
 2 files changed, 23 insertions(+)
diff --git a/drivers/net/bonding/eth_bond_private.h b/drivers/net/bonding/eth_bond_private.h
index 976163b06b..077f180f94 100644
--- a/drivers/net/bonding/eth_bond_private.h
+++ b/drivers/net/bonding/eth_bond_private.h
@@ -72,6 +72,13 @@ struct bond_tx_queue {
 	/**< Number of TX descriptors available for the queue */
 	struct rte_eth_txconf tx_conf;
 	/**< Copy of TX configuration structure for queue */
+
+	/*
+	 * The following fields are statistics that may be updated at
+	 * runtime, so they start on a new cache line.
+	 */
+	uint64_t prepare_fails __rte_cache_aligned;
+	/**< Tx prepare fail cnt */
 };
 
 /** Bonded slave devices structure */
diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index ec9d7d7bab..72d97ab7c8 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -622,6 +622,7 @@ bond_ethdev_tx_user_wrap(struct bond_tx_queue *bd_tx_q, uint16_t slave_port_id,
 		rte_pktmbuf_free(fail_pkts[i]);
 	}
 
+	bd_tx_q->prepare_fails += fail_cnt;
 	if (fail_cnt == nb_pkts)
 		return nb_pkts;
 tx_burst:
@@ -2423,6 +2424,8 @@ bond_ethdev_tx_queue_setup(struct rte_eth_dev *dev, uint16_t tx_queue_id,
 	bd_tx_q->nb_tx_desc = nb_tx_desc;
 	memcpy(&(bd_tx_q->tx_conf), tx_conf, sizeof(bd_tx_q->tx_conf));
 
+	bd_tx_q->prepare_fails = 0;
+
 	dev->data->tx_queues[tx_queue_id] = bd_tx_q;
 
 	return 0;
@@ -2633,6 +2636,7 @@ bond_ethdev_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
 {
 	struct bond_dev_private *internals = dev->data->dev_private;
 	struct rte_eth_stats slave_stats;
+	struct bond_tx_queue *bd_tx_q;
 	int i, j;
 
 	for (i = 0; i < internals->slave_count; i++) {
@@ -2654,7 +2658,12 @@ bond_ethdev_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
 			stats->q_obytes[j] += slave_stats.q_obytes[j];
 			stats->q_errors[j] += slave_stats.q_errors[j];
 		}
+	}
 
+	for (i = 0; i < dev->data->nb_tx_queues; i++) {
+		bd_tx_q = (struct bond_tx_queue *)dev->data->tx_queues[i];
+		if (bd_tx_q)
+			stats->oerrors += bd_tx_q->prepare_fails;
 	}
 
 	return 0;
@@ -2664,6 +2673,7 @@ static int
 bond_ethdev_stats_reset(struct rte_eth_dev *dev)
 {
 	struct bond_dev_private *internals = dev->data->dev_private;
+	struct bond_tx_queue *bd_tx_q;
 	int i;
 	int err;
 	int ret;
@@ -2674,6 +2684,12 @@ bond_ethdev_stats_reset(struct rte_eth_dev *dev)
 			err = ret;
 	}
 
+	for (i = 0; i < dev->data->nb_tx_queues; i++) {
+		bd_tx_q = (struct bond_tx_queue *)dev->data->tx_queues[i];
+		if (bd_tx_q)
+			bd_tx_q->prepare_fails = 0;
+	}
+
 	return err;
 }
 
-- 
2.17.1
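The accounting added by this patch can be sketched without DPDK types.
This is a simplified model, not the driver code: struct model_txq,
struct model_stats and both functions are hypothetical, and the oerrors
sum of the member ports is passed in as a single number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-queue state: only the new counter is modelled. */
struct model_txq {
	uint64_t prepare_fails;
};

/* Hypothetical subset of the port statistics. */
struct model_stats {
	uint64_t oerrors;
};

/*
 * Start from the oerrors already summed over the member ports, then
 * add the Tx prepare failures counted on each bonding Tx queue.
 */
static void
model_stats_get(struct model_txq *const *txqs, uint16_t nb_txq,
		uint64_t member_oerrors, struct model_stats *stats)
{
	uint16_t i;

	assert(stats != NULL);
	stats->oerrors = member_oerrors;
	for (i = 0; i < nb_txq; i++)
		if (txqs[i] != NULL) /* queue may not be set up */
			stats->oerrors += txqs[i]->prepare_fails;
}

/* Clear the per-queue counters, as a stats reset would. */
static void
model_stats_reset(struct model_txq *const *txqs, uint16_t nb_txq)
{
	uint16_t i;

	for (i = 0; i < nb_txq; i++)
		if (txqs[i] != NULL)
			txqs[i]->prepare_fails = 0;
}
```

Keeping the counter per queue means the fast path only touches state
owned by the queue it is already transmitting on; the sum is built when
statistics are read.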
^ permalink raw reply	[flat|nested] 61+ messages in thread
* [PATCH v3 3/3] net/bonding: add testpmd cmd for Tx prepare
  2022-09-17  4:15     ` [PATCH v3 " Chengwen Feng
  2022-09-17  4:15       ` [PATCH v3 1/3] net/bonding: support Tx prepare Chengwen Feng
  2022-09-17  4:15       ` [PATCH v3 2/3] net/bonding: support Tx prepare fail stats Chengwen Feng
@ 2022-09-17  4:15       ` Chengwen Feng
  2 siblings, 0 replies; 61+ messages in thread
From: Chengwen Feng @ 2022-09-17  4:15 UTC (permalink / raw)
  To: thomas, ferruh.yigit
  Cc: dev, chas3, humin29, andrew.rybchenko, konstantin.ananyev, 3chas3
From: Chengchang Tang <tangchengchang@huawei.com>
Add a new command to enable/disable Tx prepare for bonded
devices. This helps to test some Tx HW offloads (e.g. checksum and TSO)
for bonded devices in testpmd. The command is:
set bonding tx_prepare <port_id> (enable|disable)
This patch also supports displaying the Tx prepare enabling status in the
'show bonding config <port_id>' command.
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Min Hu (Connor) <humin29@huawei.com>
---
 .../link_bonding_poll_mode_drv_lib.rst        |  9 +++
 drivers/net/bonding/bonding_testpmd.c         | 73 ++++++++++++++++++-
 2 files changed, 81 insertions(+), 1 deletion(-)
diff --git a/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst b/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
index a3d91b2091..428c7d67c7 100644
--- a/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
+++ b/doc/guides/prog_guide/link_bonding_poll_mode_drv_lib.rst
@@ -623,6 +623,15 @@ Enable one of the specific aggregators mode when in mode 4 (link-aggregation-802
    testpmd> set bonding agg_mode (port_id) (bandwidth|count|stable)
 
 
+set bonding tx_prepare
+~~~~~~~~~~~~~~~~~~~~~~
+
+Enable Tx prepare on bonding devices to help the slave devices prepare the packets for
+some HW offloading (e.g. checksum and TSO)::
+
+   testpmd> set bonding tx_prepare (port_id) (enable|disable)
+
+
 show bonding config
 ~~~~~~~~~~~~~~~~~~~
 
diff --git a/drivers/net/bonding/bonding_testpmd.c b/drivers/net/bonding/bonding_testpmd.c
index 3941f4cf23..da3fe03f7e 100644
--- a/drivers/net/bonding/bonding_testpmd.c
+++ b/drivers/net/bonding/bonding_testpmd.c
@@ -413,7 +413,7 @@ static void cmd_show_bonding_config_parsed(void *parsed_result,
 	__rte_unused struct cmdline *cl, __rte_unused void *data)
 {
 	struct cmd_show_bonding_config_result *res = parsed_result;
-	int bonding_mode, agg_mode;
+	int bonding_mode, agg_mode, tx_prepare_flag;
 	portid_t slaves[RTE_MAX_ETHPORTS];
 	int num_slaves, num_active_slaves;
 	int primary_id;
@@ -429,6 +429,10 @@ static void cmd_show_bonding_config_parsed(void *parsed_result,
 	}
 	printf("\tBonding mode: %d\n", bonding_mode);
 
+	/* Display the Tx-prepare flag. */
+	tx_prepare_flag = rte_eth_bond_tx_prepare_get(port_id);
+	printf("\tTx-prepare state: %s\n", tx_prepare_flag == 1 ? "on" : "off");
+
 	if (bonding_mode == BONDING_MODE_BALANCE ||
 		bonding_mode == BONDING_MODE_8023AD) {
 		int balance_xmit_policy;
@@ -962,6 +966,68 @@ static cmdline_parse_inst_t cmd_set_bonding_agg_mode_policy = {
 	}
 };
 
+struct cmd_set_bonding_tx_prepare_result {
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t bonding;
+	cmdline_fixed_string_t tx_prepare;
+	portid_t port_id;
+	cmdline_fixed_string_t mode;
+};
+
+static void
+cmd_set_bonding_tx_prepare_parsed(void *parsed_result,
+		__rte_unused  struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_set_bonding_tx_prepare_result *res = parsed_result;
+	portid_t port_id = res->port_id;
+
+	if (!strcmp(res->mode, "enable")) {
+		if (rte_eth_bond_tx_prepare_set(port_id, true) == 0)
+			printf("Tx prepare for bonding device enabled\n");
+		else
+			printf("Enabling bonding device Tx prepare "
+					"on port %d failed\n", port_id);
+	} else if (!strcmp(res->mode, "disable")) {
+		if (rte_eth_bond_tx_prepare_set(port_id, false) == 0)
+			printf("Tx prepare for bonding device disabled\n");
+		else
+			printf("Disabling bonding device Tx prepare "
+					"on port %d failed\n", port_id);
+	}
+}
+
+static cmdline_parse_token_string_t cmd_setbonding_tx_prepare_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+			set, "set");
+static cmdline_parse_token_string_t cmd_setbonding_tx_prepare_bonding =
+	TOKEN_STRING_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+			bonding, "bonding");
+static cmdline_parse_token_string_t cmd_setbonding_tx_prepare_tx_prepare =
+	TOKEN_STRING_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+			tx_prepare, "tx_prepare");
+static cmdline_parse_token_num_t cmd_setbonding_tx_prepare_port_id =
+	TOKEN_NUM_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+			port_id, RTE_UINT16);
+static cmdline_parse_token_string_t cmd_setbonding_tx_prepare_mode =
+	TOKEN_STRING_INITIALIZER(struct cmd_set_bonding_tx_prepare_result,
+			mode, "enable#disable");
+
+static cmdline_parse_inst_t cmd_set_bond_tx_prepare = {
+		.f = cmd_set_bonding_tx_prepare_parsed,
+		.help_str = "set bonding tx_prepare <port_id> enable|disable: "
+			"Enable/disable tx_prepare for port_id",
+		.data = NULL,
+		.tokens = {
+			(void *)&cmd_setbonding_tx_prepare_set,
+			(void *)&cmd_setbonding_tx_prepare_bonding,
+			(void *)&cmd_setbonding_tx_prepare_tx_prepare,
+			(void *)&cmd_setbonding_tx_prepare_port_id,
+			(void *)&cmd_setbonding_tx_prepare_mode,
+			NULL
+		}
+};
+
 static struct testpmd_driver_commands bonding_cmds = {
 	.commands = {
 	{
@@ -1024,6 +1090,11 @@ static struct testpmd_driver_commands bonding_cmds = {
 		"set bonding mode IEEE802.3AD aggregator policy (port_id) (agg_name)\n"
 		"	Set Aggregation mode for IEEE802.3AD (mode 4)\n",
 	},
+	{
+		&cmd_set_bond_tx_prepare,
+		"set bonding tx_prepare <port_id> (enable|disable)\n"
+		"	Enable/disable tx_prepare for bonded device\n",
+	},
 	{ NULL, NULL },
 	},
 };
-- 
2.17.1
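How the command keyword maps onto the setter, including the broadcast
mode rejection from patch 1/3, can be sketched as follows. This is a
model only: struct model_bond, enum model_mode and both functions are
hypothetical, and the -EINVAL return for an unknown keyword is an
assumption of the sketch (the testpmd parser only accepts "enable" or
"disable").

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of the bonding modes. */
enum model_mode {
	MODEL_MODE_ROUND_ROBIN,
	MODEL_MODE_BROADCAST,
};

/* Hypothetical subset of the bonding private data. */
struct model_bond {
	enum model_mode mode;
	int tx_prepare_enabled;
};

/* Models the setter: broadcast mode refuses the configuration. */
static int
model_tx_prepare_set(struct model_bond *bond, bool en)
{
	assert(bond != NULL);
	if (bond->mode == MODEL_MODE_BROADCAST)
		return -ENOTSUP;
	bond->tx_prepare_enabled = en ? 1 : 0;
	return 0;
}

/* Models the command handler: keyword to boolean, then the setter. */
static int
model_cmd_tx_prepare(struct model_bond *bond, const char *mode)
{
	if (strcmp(mode, "enable") == 0)
		return model_tx_prepare_set(bond, true);
	if (strcmp(mode, "disable") == 0)
		return model_tx_prepare_set(bond, false);
	return -EINVAL; /* assumption: not reachable via the parser */
}
```

A caller that wants to tell "not supported in this mode" apart from
other failures can compare the return value with -ENOTSUP rather than
only testing for zero.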
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v2 1/3] net/bonding: support Tx prepare
  2022-09-17  2:35               ` fengchengwen
@ 2022-09-17 13:38                 ` Chas Williams
  2022-09-19 14:07                   ` Konstantin Ananyev
  0 siblings, 1 reply; 61+ messages in thread
From: Chas Williams @ 2022-09-17 13:38 UTC (permalink / raw)
  To: fengchengwen, Ferruh Yigit, thomas, andrew.rybchenko, konstantin.ananyev
  Cc: dev, chas3, humin29
On 9/16/22 22:35, fengchengwen wrote:
> Hi Chas,
> 
> On 2022/9/15 0:59, Chas Williams wrote:
>> On 9/13/22 20:46, fengchengwen wrote:
>>>
>>> The main problem is hard to design a tx_prepare for bonding device:
>>> 1. as Chas Williams said, there maybe twice hash calc to get target slave
>>>      devices.
>>> 2. also more important, if the slave devices have changes(e.g. slave device
>>>      link down or remove), and if the changes happens between bond-tx-prepare and
>>>      bond-tx-burst, the output slave will changes, and this may lead to checksum
>>>      failed. (Note: a bond device with slave devices may from different vendors,
>>>      and slave devices may have different requirements, e.g. slave-A support calc
>>>      IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
>>>      pre-calc).
>>>
>>> Current design cover the above two scenarios by using in-place tx-prepare. and
>>> in addition, bond devices are not transparent to applications, I think it's a
>>> practical method to provide tx-prepare support in this way.
>>>
>>
>>
>> I don't think you need to export an enable/disable routine for the use of
>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
>> branch in drivers/net/bonding.
> 
> Our first patch was just like yours (just add tx-prepare default), but community
> is concerned about impacting performance.
> 
> As a trade-off, I think we can add the enable/disable API.
IMHO, that's a bad idea. If the rte_eth_tx_prepare API affects
performance adversely, that is not a bonding problem. All applications
should be calling rte_eth_tx_prepare. There's no defined API
to determine if rte_eth_tx_prepare should be called. Therefore,
applications should always call rte_eth_tx_prepare. Regardless,
as I previously mentioned, you are just trading the location of
the branch, especially in the bonding case.
If rte_eth_tx_prepare is causing a performance drop, then that API
should be improved or rewritten. There are PMDs that require you to use
that API. Locally, we had maintained a patch to eliminate the use of
rte_eth_tx_prepare. However, that has been getting harder and harder
to maintain. The performance lost by just calling rte_eth_tx_prepare
was marginal.
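The "trading the location of the branch" point can be illustrated with
a small model. It is a sketch under assumptions, not the ethdev code:
struct model_port, model_eth_tx_prepare() and model_bond_prepare() are
hypothetical, packets are plain ints, and the generic wrapper is assumed
to report every packet as prepared when the driver has no callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prepare callback type; packets are plain ints here. */
typedef uint16_t (*model_prepare_cb)(int *pkts, uint16_t nb_pkts);

struct model_port {
	model_prepare_cb tx_pkt_prepare; /* NULL if not implemented */
};

/* Example callback: stops at the first negative "packet". */
static uint16_t
model_reject_negative(int *pkts, uint16_t nb_pkts)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		if (pkts[i] < 0)
			break;
	return i;
}

/* Branch 1: the generic wrapper tests for a missing callback. */
static uint16_t
model_eth_tx_prepare(const struct model_port *port, int *pkts,
		     uint16_t nb_pkts)
{
	assert(port != NULL);
	if (port->tx_pkt_prepare == NULL)
		return nb_pkts;
	return port->tx_pkt_prepare(pkts, nb_pkts);
}

/* Branch 2: a bonding-side flag tested before the wrapper is called. */
static uint16_t
model_bond_prepare(const struct model_port *port, int enabled,
		   int *pkts, uint16_t nb_pkts)
{
	if (!enabled)
		return nb_pkts;
	return model_eth_tx_prepare(port, pkts, nb_pkts);
}
```

With the flag off, a port that does implement the callback never gets
to reject a bad packet, which is the case the unconditional call avoids
at the cost of one test per burst.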
> 
>>
>> I think you missed fixing tx_machine in 802.3ad support. We have been using
>> the following patch locally which I never got around to submitting.
> 
> You are right, I will send V3 fix it.
> 
>>
>>
>>  From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
>> From: "Charles (Chas) Williams" <chwillia@ciena.com>
>> Date: Tue, 3 May 2022 16:52:37 -0400
>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
>>
>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
>> packets for transmission. Typically, the prepare step handles the VLAN
>> headers, but it may need to do other things.
>>
>> Signed-off-by: Chas Williams <chwillia@ciena.com>
> 
> ...
> 
>>                * ring if transmission fails so the packet isn't lost.
>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>>   
>>       /* Transmit burst on each active slave */
>>       for (i = 0; i < num_of_slaves; i++) {
>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>> +        uint16_t nb_prep;
>> +
>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>>                       bufs, nb_pkts);
>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>> +                    bufs, nb_prep);
> 
> The tx-prepare may edit packet data, and the broadcast mode will send a packet to all slaves,
> the packet data is sent and edited at the same time. Is this likely to cause problems ?
This routine is already broken. You can't just increment the refcount
and send the packet into a PMD's transmit routine. Nothing guarantees
that a transmit routine will not modify the packet. Many PMDs perform an
rte_vlan_insert. You should at least perform a clone of the packet so
that the mbuf headers aren't mangled by each PMD. Just to be safe you
should perform a partial deep copy of the packet headers in case some
PMD does an rte_vlan_insert and the other PMDs in the bonding group do
not need an rte_vlan_insert.
So doing a blind rte_eth_tx_prepare isn't making anything much
worse.
> 
>>   
>>           if (unlikely(slave_tx_total[i] < nb_pkts))
>>               tx_failed_flag = 1;
^ permalink raw reply	[flat|nested] 61+ messages in thread
* RE: [PATCH v2 1/3] net/bonding: support Tx prepare
  2022-09-17 13:38                 ` Chas Williams
@ 2022-09-19 14:07                   ` Konstantin Ananyev
  2022-09-19 23:02                     ` Chas Williams
  0 siblings, 1 reply; 61+ messages in thread
From: Konstantin Ananyev @ 2022-09-19 14:07 UTC (permalink / raw)
  To: Chas Williams, Fengchengwen, Ferruh Yigit, thomas,
	andrew.rybchenko, konstantin.ananyev
  Cc: dev, chas3, humin (Q)
> 
> On 9/16/22 22:35, fengchengwen wrote:
> > Hi Chas,
> >
> > On 2022/9/15 0:59, Chas Williams wrote:
> >> On 9/13/22 20:46, fengchengwen wrote:
> >>>
> >>> The main problem is hard to design a tx_prepare for bonding device:
> >>> 1. as Chas Williams said, there maybe twice hash calc to get target slave
> >>>      devices.
> >>> 2. also more important, if the slave devices have changes(e.g. slave device
> >>>      link down or remove), and if the changes happens between bond-tx-prepare and
> >>>      bond-tx-burst, the output slave will changes, and this may lead to checksum
> >>>      failed. (Note: a bond device with slave devices may from different vendors,
> >>>      and slave devices may have different requirements, e.g. slave-A support calc
> >>>      IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
> >>>      pre-calc).
> >>>
> >>> Current design cover the above two scenarios by using in-place tx-prepare. and
> >>> in addition, bond devices are not transparent to applications, I think it's a
> >>> practical method to provide tx-prepare support in this way.
> >>>
> >>
> >>
> >> I don't think you need to export an enable/disable routine for the use of
> >> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
> >> implemented. You are just trading one branch in DPDK librte_eth_dev for a
> >> branch in drivers/net/bonding.
> >
> > Our first patch was just like yours (just add tx-prepare default), but community
> > is concerned about impacting performance.
> >
> > As a trade-off, I think we can add the enable/disable API.
> 
> IMHO, that's a bad idea. If the rte_eth_dev_tx_prepare API affects
> performance adversly, that is not a bonding problem. All applications
> should be calling rte_eth_dev_tx_prepare. There's no defined API
> to determine if rte_eth_dev_tx_prepare should be called. Therefore,
> applications should always call rte_eth_dev_tx_prepare. Regardless,
> as I previously mentioned, you are just trading the location of
> the branch, especially in the bonding case.
> 
> If rte_eth_dev_tx_prepare is causing a performance drop, then that API
> should be improved or rewritten. There are PMD that require you to use
> that API. Locally, we had maintained a patch to eliminate the use of
> rte_eth_dev_tx_prepare. However, that has been getting harder and harder
> to maintain. The performance lost by just calling rte_eth_dev_tx_prepare
> was marginal.
> 
> >
> >>
> >> I think you missed fixing tx_machine in 802.3ad support. We have been using
> >> the following patch locally which I never got around to submitting.
> >
> > You are right, I will send V3 fix it.
> >
> >>
> >>
> >>  From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
> >> From: "Charles (Chas) Williams" <chwillia@ciena.com>
> >> Date: Tue, 3 May 2022 16:52:37 -0400
> >> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
> >>
> >> Some PMDs might require a call to rte_eth_tx_prepare before sending the
> >> packets for transmission. Typically, the prepare step handles the VLAN
> >> headers, but it may need to do other things.
> >>
> >> Signed-off-by: Chas Williams <chwillia@ciena.com>
> >
> > ...
> >
> >>                * ring if transmission fails so the packet isn't lost.
> >> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
> >>
> >>       /* Transmit burst on each active slave */
> >>       for (i = 0; i < num_of_slaves; i++) {
> >> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> >> +        uint16_t nb_prep;
> >> +
> >> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
> >>                       bufs, nb_pkts);
> >> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> >> +                    bufs, nb_prep);
> >
> > The tx-prepare may edit packet data, and the broadcast mode will send a packet to all slaves,
> > the packet data is sent and edited at the same time. Is this likely to cause problems ?
> 
> This routine is already broken. You can't just increment the refcount
> and send the packet into a PMD's transmit routine. Nothing guarantees
> that a transmit routine will not modify the packet. Many PMDs perform an
> rte_vlan_insert. 
Hmm, interesting...
My understanding was quite the opposite: tx_burst() can't modify packet data and metadata
(except when refcnt==1 and tx_burst() is going to free the mbuf and put it back into the mempool),
while tx_prepare() can. Actually, as I remember, that was one of the reasons why a separate routine
was introduced.
> You should at least perform a clone of the packet so
> that the mbuf headers aren't mangled by each PMD. Just to be safe you
> should perform a partial deep copy of the packet headers in case some
> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
> not need an rte_vlan_insert.
> 
> So doing a blind rte_eth_dev_tx_preprare isn't making anything much
> worse.
> 
> >
> >>
> >>           if (unlikely(slave_tx_total[i] < nb_pkts))
> >>               tx_failed_flag = 1;
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v2 1/3] net/bonding: support Tx prepare
  2022-09-19 14:07                   ` Konstantin Ananyev
@ 2022-09-19 23:02                     ` Chas Williams
  2022-09-22  2:12                       ` fengchengwen
  2022-09-26 10:18                       ` Konstantin Ananyev
  0 siblings, 2 replies; 61+ messages in thread
From: Chas Williams @ 2022-09-19 23:02 UTC (permalink / raw)
  To: Konstantin Ananyev, Fengchengwen, Ferruh Yigit, thomas,
	andrew.rybchenko, konstantin.ananyev
  Cc: dev, chas3, humin (Q)
On 9/19/22 10:07, Konstantin Ananyev wrote:
> 
>>
>> On 9/16/22 22:35, fengchengwen wrote:
>>> Hi Chas,
>>>
>>> On 2022/9/15 0:59, Chas Williams wrote:
>>>> On 9/13/22 20:46, fengchengwen wrote:
>>>>>
>>>>> The main problem is hard to design a tx_prepare for bonding device:
>>>>> 1. as Chas Williams said, there maybe twice hash calc to get target slave
>>>>>       devices.
>>>>> 2. also more important, if the slave devices have changes(e.g. slave device
>>>>>       link down or remove), and if the changes happens between bond-tx-prepare and
>>>>>       bond-tx-burst, the output slave will changes, and this may lead to checksum
>>>>>       failed. (Note: a bond device with slave devices may from different vendors,
>>>>>       and slave devices may have different requirements, e.g. slave-A support calc
>>>>>       IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
>>>>>       pre-calc).
>>>>>
>>>>> Current design cover the above two scenarios by using in-place tx-prepare. and
>>>>> in addition, bond devices are not transparent to applications, I think it's a
>>>>> practical method to provide tx-prepare support in this way.
>>>>>
>>>>
>>>>
>>>> I don't think you need to export an enable/disable routine for the use of
>>>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
>>>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
>>>> branch in drivers/net/bonding.
>>>
>>> Our first patch was just like yours (just add tx-prepare default), but community
>>> is concerned about impacting performance.
>>>
>>> As a trade-off, I think we can add the enable/disable API.
>>
>> IMHO, that's a bad idea. If the rte_eth_dev_tx_prepare API affects
>> performance adversly, that is not a bonding problem. All applications
>> should be calling rte_eth_dev_tx_prepare. There's no defined API
>> to determine if rte_eth_dev_tx_prepare should be called. Therefore,
>> applications should always call rte_eth_dev_tx_prepare. Regardless,
>> as I previously mentioned, you are just trading the location of
>> the branch, especially in the bonding case.
>>
>> If rte_eth_dev_tx_prepare is causing a performance drop, then that API
>> should be improved or rewritten. There are PMD that require you to use
>> that API. Locally, we had maintained a patch to eliminate the use of
>> rte_eth_dev_tx_prepare. However, that has been getting harder and harder
>> to maintain. The performance lost by just calling rte_eth_dev_tx_prepare
>> was marginal.
>>
>>>
>>>>
>>>> I think you missed fixing tx_machine in 802.3ad support. We have been using
>>>> the following patch locally which I never got around to submitting.
>>>
>>> You are right, I will send V3 fix it.
>>>
>>>>
>>>>
>>>>   From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
>>>> From: "Charles (Chas) Williams" <chwillia@ciena.com>
>>>> Date: Tue, 3 May 2022 16:52:37 -0400
>>>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
>>>>
>>>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
>>>> packets for transmission. Typically, the prepare step handles the VLAN
>>>> headers, but it may need to do other things.
>>>>
>>>> Signed-off-by: Chas Williams <chwillia@ciena.com>
>>>
>>> ...
>>>
>>>>                 * ring if transmission fails so the packet isn't lost.
>>>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>>>>
>>>>        /* Transmit burst on each active slave */
>>>>        for (i = 0; i < num_of_slaves; i++) {
>>>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>> +        uint16_t nb_prep;
>>>> +
>>>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>>>>                        bufs, nb_pkts);
>>>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>> +                    bufs, nb_prep);
>>>
>>> The tx-prepare may edit packet data, and the broadcast mode will send a packet to all slaves,
>>> the packet data is sent and edited at the same time. Is this likely to cause problems ?
>>
>> This routine is already broken. You can't just increment the refcount
>> and send the packet into a PMD's transmit routine. Nothing guarantees
>> that a transmit routine will not modify the packet. Many PMDs perform an
>> rte_vlan_insert.
> 
> Hmm interesting....
> My uderstanding was quite opposite - tx_burst() can't modify packet data and metadata
> (except when refcnt==1 and tx_burst() going to free the mbuf and put it back to the mempool).
> While tx_prepare() can - actually as I remember that was one of the reasons why a separate routine
> was introduced.
Is that documented anywhere? It's been my experience that the device PMD
can do practically anything and you need to protect yourself.  Currently,
the af_packet, dpaa2, and vhost drivers call rte_vlan_insert. Before 2019,
the virtio driver also used to call rte_vlan_insert during its transmit
path. Of course, rte_vlan_insert modifies the packet data and the mbuf
header. Regardless, it looks like rte_eth_tx_prepare should always be
called. Handling that correctly in broadcast mode probably means always
making a deep copy of the packet, or checking whether all the members are
the same PMD type. If so, you can just call prepare once. You could track
the mismatched nature during addition/removal of the members. Or just
assume people aren't going to mismatch bonding members.
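As a rough illustration of the "check whether all the members are the same
PMD type" idea, here is a minimal sketch in plain C. It is a simplified model
and not bonding PMD code: the helper name members_share_driver and the array
of driver-name strings are hypothetical. In a real driver the names would
have to come from each member's device info, and the result would need to be
re-evaluated whenever a member is added or removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a DPDK API): return true when every member
 * reports the same driver name, in which case a single prepare pass could
 * serve all of them. An empty member list is treated as "not uniform".
 */
static bool
members_share_driver(const char *const driver_names[], size_t nb_members)
{
	size_t i;

	if (nb_members == 0)
		return false;

	assert(driver_names[0] != NULL);

	/* Compare every other member against the first one. */
	for (i = 1; i < nb_members; i++) {
		if (strcmp(driver_names[0], driver_names[i]) != 0)
			return false;
	}
	return true;
}
```

The result could be cached as a per-bond flag so that the fast path only
tests one boolean before deciding between "prepare once" and "copy and
prepare per member".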
  
>> You should at least perform a clone of the packet so
>> that the mbuf headers aren't mangled by each PMD. Just to be safe you
>> should perform a partial deep copy of the packet headers in case some
>> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
>> not need an rte_vlan_insert.
>>
>> So doing a blind rte_eth_dev_tx_preprare isn't making anything much
>> worse.
>>
>>>
>>>>
>>>>            if (unlikely(slave_tx_total[i] < nb_pkts))
>>>>                tx_failed_flag = 1;
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v2 1/3] net/bonding: support Tx prepare
  2022-09-19 23:02                     ` Chas Williams
@ 2022-09-22  2:12                       ` fengchengwen
  2022-09-25 10:32                         ` Chas Williams
  2022-09-26 10:18                       ` Konstantin Ananyev
  1 sibling, 1 reply; 61+ messages in thread
From: fengchengwen @ 2022-09-22  2:12 UTC (permalink / raw)
  To: Chas Williams, Konstantin Ananyev, Ferruh Yigit, thomas,
	andrew.rybchenko, Ferruh Yigit
  Cc: dev, chas3, humin (Q)
On 2022/9/20 7:02, Chas Williams wrote:
> 
> 
> On 9/19/22 10:07, Konstantin Ananyev wrote:
>>
>>>
>>> On 9/16/22 22:35, fengchengwen wrote:
>>>> Hi Chas,
>>>>
>>>> On 2022/9/15 0:59, Chas Williams wrote:
>>>>> On 9/13/22 20:46, fengchengwen wrote:
>>>>>>
>>>>>> The main problem is hard to design a tx_prepare for bonding device:
>>>>>> 1. as Chas Williams said, there maybe twice hash calc to get target slave
>>>>>>       devices.
>>>>>> 2. also more important, if the slave devices have changes(e.g. slave device
>>>>>>       link down or remove), and if the changes happens between bond-tx-prepare and
>>>>>>       bond-tx-burst, the output slave will changes, and this may lead to checksum
>>>>>>       failed. (Note: a bond device with slave devices may from different vendors,
>>>>>>       and slave devices may have different requirements, e.g. slave-A support calc
>>>>>>       IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
>>>>>>       pre-calc).
>>>>>>
>>>>>> Current design cover the above two scenarios by using in-place tx-prepare. and
>>>>>> in addition, bond devices are not transparent to applications, I think it's a
>>>>>> practical method to provide tx-prepare support in this way.
>>>>>>
>>>>>
>>>>>
>>>>> I don't think you need to export an enable/disable routine for the use of
>>>>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
>>>>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
>>>>> branch in drivers/net/bonding.
>>>>
>>>> Our first patch was just like yours (just add tx-prepare default), but community
>>>> is concerned about impacting performance.
>>>>
>>>> As a trade-off, I think we can add the enable/disable API.
>>>
>>> IMHO, that's a bad idea. If the rte_eth_dev_tx_prepare API affects
>>> performance adversly, that is not a bonding problem. All applications
>>> should be calling rte_eth_dev_tx_prepare. There's no defined API
>>> to determine if rte_eth_dev_tx_prepare should be called. Therefore,
>>> applications should always call rte_eth_dev_tx_prepare. Regardless,
>>> as I previously mentioned, you are just trading the location of
>>> the branch, especially in the bonding case.
>>>
>>> If rte_eth_dev_tx_prepare is causing a performance drop, then that API
>>> should be improved or rewritten. There are PMD that require you to use
>>> that API. Locally, we had maintained a patch to eliminate the use of
>>> rte_eth_dev_tx_prepare. However, that has been getting harder and harder
>>> to maintain. The performance lost by just calling rte_eth_dev_tx_prepare
>>> was marginal.
>>>
>>>>
>>>>>
>>>>> I think you missed fixing tx_machine in 802.3ad support. We have been using
>>>>> the following patch locally which I never got around to submitting.
>>>>
>>>> You are right, I will send V3 fix it.
>>>>
>>>>>
>>>>>
>>>>>   From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
>>>>> From: "Charles (Chas) Williams" <chwillia@ciena.com>
>>>>> Date: Tue, 3 May 2022 16:52:37 -0400
>>>>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
>>>>>
>>>>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
>>>>> packets for transmission. Typically, the prepare step handles the VLAN
>>>>> headers, but it may need to do other things.
>>>>>
>>>>> Signed-off-by: Chas Williams <chwillia@ciena.com>
>>>>
>>>> ...
>>>>
>>>>>                 * ring if transmission fails so the packet isn't lost.
>>>>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>>>>>
>>>>>        /* Transmit burst on each active slave */
>>>>>        for (i = 0; i < num_of_slaves; i++) {
>>>>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>> +        uint16_t nb_prep;
>>>>> +
>>>>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>>>>>                        bufs, nb_pkts);
>>>>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>> +                    bufs, nb_prep);
>>>>
>>>> The tx-prepare may edit packet data, and the broadcast mode will send a packet to all slaves,
>>>> the packet data is sent and edited at the same time. Is this likely to cause problems ?
>>>
>>> This routine is already broken. You can't just increment the refcount
>>> and send the packet into a PMD's transmit routine. Nothing guarantees
>>> that a transmit routine will not modify the packet. Many PMDs perform an
>>> rte_vlan_insert.
>>
>> Hmm interesting....
>> My uderstanding was quite opposite - tx_burst() can't modify packet data and metadata
>> (except when refcnt==1 and tx_burst() going to free the mbuf and put it back to the mempool).
>> While tx_prepare() can - actually as I remember that was one of the reasons why a separate routine
>> was introduced.
> 
> Is that documented anywhere? It's been my experience that the device PMD
> can do practically anything and you need to protect yourself.  Currently,
> the af_packet, dpaa2, and vhost driver call rte_vlan_insert. Before 2019,
> the virtio driver also used to call rte_vlan_insert during its transmit
> path. Of course, rte_vlan_insert modifies the packet data and the mbuf
> header. Regardless, it looks like rte_eth_dev_tx_prepare should always be
> called. Handling that correctly in broadcast mode probably means always
> make a deep copy of the packet, or check to see if all the members are
> the same PMD type. If so, you can just call prepare once. You could track
> the mismatched nature during additional/removal of the members. Or just
> assume people aren't going to mismatch bonding members.
rte_eth_tx_prepare has this note:
    * Since this function can modify packet data, provided mbufs must be safely
    * writable (e.g. modified data cannot be in shared segment).
but rte_eth_tx_burst has no such requirement.

Besides the rte_vlan_insert examples above, some other PMDs also modify the
mbuf header and data, e.g. hns3/ark/bnxt invoke rte_pktmbuf_append when the
pkt-len is too small.

I would prefer that rte_eth_tx_burst add such a restriction: the PMD should not
modify the mbuf unless refcnt==1, so that applications can rely on an explicit
definition.

As for this bonding scenario, we have three alternatives:
1) As in the patch Chas provided, always do tx-prepare before tx-burst. It is
   simple, but it may modify the mbuf without the application being able to
   detect it (unless this is specially documented).
2) My patch: the application can invoke prepare_enable/disable to control
   whether to do prepare.
3) Implement the bonding PMD's tx-prepare, which does tx-prepare for each
   slave. But there is a problem: if the slave devices change (e.g. a new
   device is added), some packet errors may occur because we have not done
   prepare for the newly added device.

note1: 1 and 2 above both violate rte_eth_tx_burst's requirement, so we should
       document this specially.
note2: we can do some optimization for 3, e.g. if the same driver name is
       detected on multiple slave devices, tx-prepare only needs to be
       performed once. But the problem described above still exists because
       slave devices can change dynamically at runtime.

Hoping for more discussion. @Ferruh @Chas @Humin @Konstantin
> 
>  
>>> You should at least perform a clone of the packet so
>>> that the mbuf headers aren't mangled by each PMD. Just to be safe you
>>> should perform a partial deep copy of the packet headers in case some
>>> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
>>> not need an rte_vlan_insert.
>>>
>>> So doing a blind rte_eth_dev_tx_preprare isn't making anything much
>>> worse.
>>>
>>>>
>>>>>
>>>>>            if (unlikely(slave_tx_total[i] < nb_pkts))
>>>>>                tx_failed_flag = 1;
> .
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v2 1/3] net/bonding: support Tx prepare
  2022-09-22  2:12                       ` fengchengwen
@ 2022-09-25 10:32                         ` Chas Williams
  0 siblings, 0 replies; 61+ messages in thread
From: Chas Williams @ 2022-09-25 10:32 UTC (permalink / raw)
  To: fengchengwen, Konstantin Ananyev, Ferruh Yigit, thomas,
	andrew.rybchenko, Ferruh Yigit
  Cc: dev, chas3, humin (Q)
On 9/21/22 22:12, fengchengwen wrote:
> 
> 
> On 2022/9/20 7:02, Chas Williams wrote:
>>
>>
>> On 9/19/22 10:07, Konstantin Ananyev wrote:
>>>
>>>>
>>>> On 9/16/22 22:35, fengchengwen wrote:
>>>>> Hi Chas,
>>>>>
>>>>> On 2022/9/15 0:59, Chas Williams wrote:
>>>>>> On 9/13/22 20:46, fengchengwen wrote:
>>>>>>>
>>>>>>> The main problem is hard to design a tx_prepare for bonding device:
>>>>>>> 1. as Chas Williams said, there maybe twice hash calc to get target slave
>>>>>>>        devices.
>>>>>>> 2. also more important, if the slave devices have changes(e.g. slave device
>>>>>>>        link down or remove), and if the changes happens between bond-tx-prepare and
>>>>>>>        bond-tx-burst, the output slave will changes, and this may lead to checksum
>>>>>>>        failed. (Note: a bond device with slave devices may from different vendors,
>>>>>>>        and slave devices may have different requirements, e.g. slave-A support calc
>>>>>>>        IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
>>>>>>>        pre-calc).
>>>>>>>
>>>>>>> Current design cover the above two scenarios by using in-place tx-prepare. and
>>>>>>> in addition, bond devices are not transparent to applications, I think it's a
>>>>>>> practical method to provide tx-prepare support in this way.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> I don't think you need to export an enable/disable routine for the use of
>>>>>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
>>>>>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
>>>>>> branch in drivers/net/bonding.
>>>>>
>>>>> Our first patch was just like yours (just add tx-prepare default), but community
>>>>> is concerned about impacting performance.
>>>>>
>>>>> As a trade-off, I think we can add the enable/disable API.
>>>>
>>>> IMHO, that's a bad idea. If the rte_eth_dev_tx_prepare API affects
>>>> performance adversly, that is not a bonding problem. All applications
>>>> should be calling rte_eth_dev_tx_prepare. There's no defined API
>>>> to determine if rte_eth_dev_tx_prepare should be called. Therefore,
>>>> applications should always call rte_eth_dev_tx_prepare. Regardless,
>>>> as I previously mentioned, you are just trading the location of
>>>> the branch, especially in the bonding case.
>>>>
>>>> If rte_eth_dev_tx_prepare is causing a performance drop, then that API
>>>> should be improved or rewritten. There are PMD that require you to use
>>>> that API. Locally, we had maintained a patch to eliminate the use of
>>>> rte_eth_dev_tx_prepare. However, that has been getting harder and harder
>>>> to maintain. The performance lost by just calling rte_eth_dev_tx_prepare
>>>> was marginal.
>>>>
>>>>>
>>>>>>
>>>>>> I think you missed fixing tx_machine in 802.3ad support. We have been using
>>>>>> the following patch locally which I never got around to submitting.
>>>>>
>>>>> You are right, I will send V3 fix it.
>>>>>
>>>>>>
>>>>>>
>>>>>>    From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
>>>>>> From: "Charles (Chas) Williams" <chwillia@ciena.com>
>>>>>> Date: Tue, 3 May 2022 16:52:37 -0400
>>>>>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
>>>>>>
>>>>>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
>>>>>> packets for transmission. Typically, the prepare step handles the VLAN
>>>>>> headers, but it may need to do other things.
>>>>>>
>>>>>> Signed-off-by: Chas Williams <chwillia@ciena.com>
>>>>>
>>>>> ...
>>>>>
>>>>>>                  * ring if transmission fails so the packet isn't lost.
>>>>>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>>>>>>
>>>>>>         /* Transmit burst on each active slave */
>>>>>>         for (i = 0; i < num_of_slaves; i++) {
>>>>>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>>> +        uint16_t nb_prep;
>>>>>> +
>>>>>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>>>>>>                         bufs, nb_pkts);
>>>>>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>>> +                    bufs, nb_prep);
>>>>>
>>>>> The tx-prepare may edit packet data, and the broadcast mode will send a packet to all slaves,
>>>>> the packet data is sent and edited at the same time. Is this likely to cause problems ?
>>>>
>>>> This routine is already broken. You can't just increment the refcount
>>>> and send the packet into a PMD's transmit routine. Nothing guarantees
>>>> that a transmit routine will not modify the packet. Many PMDs perform an
>>>> rte_vlan_insert.
>>>
>>> Hmm interesting....
>>> My uderstanding was quite opposite - tx_burst() can't modify packet data and metadata
>>> (except when refcnt==1 and tx_burst() going to free the mbuf and put it back to the mempool).
>>> While tx_prepare() can - actually as I remember that was one of the reasons why a separate routine
>>> was introduced.
>>
>> Is that documented anywhere? It's been my experience that the device PMD
>> can do practically anything and you need to protect yourself.  Currently,
>> the af_packet, dpaa2, and vhost driver call rte_vlan_insert. Before 2019,
>> the virtio driver also used to call rte_vlan_insert during its transmit
>> path. Of course, rte_vlan_insert modifies the packet data and the mbuf
>> header. Regardless, it looks like rte_eth_dev_tx_prepare should always be
>> called. Handling that correctly in broadcast mode probably means always
>> make a deep copy of the packet, or check to see if all the members are
>> the same PMD type. If so, you can just call prepare once. You could track
>> the mismatched nature during additional/removal of the members. Or just
>> assume people aren't going to mismatch bonding members.
> 
> the rte_eth_tx_prepare has notes:
>      * Since this function can modify packet data, provided mbufs must be safely
>      * writable (e.g. modified data cannot be in shared segment).
> but rte_eth_tx_burst have not such requirement.
> 
> except above examples of rte_vlan_insert, there are also some PMDs modify mbuf's header
> and data, e.g. hns3/ark/bnxt will invoke rte_pktmbuf_append in case of the pkt-len too small.
> 
> I prefer the rte_eth_tx_burst add such restricts: the PMD should not modify the mbuf except refcnt==1.
> so that application could rely on there explicit definition to do business.
> 
> 
> As for this bonding scenario, we have three alternatives:
> 1) as Chas provided patch, always do tx-prepare before tx-burst. it was simple, but have: it
> may modify the mbuf but application could not detect (unless especial documents)
> 2) my patch, application could invoke the prepare_enable/disable to control whether to do prepare.
> 3) implement bonding PMD's tx-prepare, it do tx-preare for each slave, but existing some problem:
> if the slave device changes (e.g. add new device), some packet errors may occur because we have not
> do prepare for the new add device.
> 
> note1: the above 1/2 both violate rte_eth_tx_burst's requirement, so we should especial document.
> note2: we can do some optimization for 3, e.g. if the same driver name is detected on multiple slave
>         devices, here only need to perform tx-prepare once. but the problem above descripe still exist
>         because of dynamic slave devices at runtime.
> 
> hope for more discuess. @Ferruh @Chas @Humin @Konstantin
I don't think adding an additional API due to concerns about performance is
the solution to the performance problem. If the tx_prepare API is slow,
that's what needs to be fixed. I imagine that more drivers will be using
the tx_prepare API over time, not fewer. It would be a good idea to get
used to calling it.

As for broadcast mode, let's just call tx_prepare once for any given
packet. For now, assume that no one would attempt to bond different
PMDs together. In my experience, that would be unusual. I have never
seen anyone do that in a production context. If a bug report comes in
about this failing for someone, we can fix it then.
>>>> You should at least perform a clone of the packet so
>>>> that the mbuf headers aren't mangled by each PMD. Just to be safe you
>>>> should perform a partial deep copy of the packet headers in case some
>>>> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
>>>> not need an rte_vlan_insert.
>>>>
>>>> So doing a blind rte_eth_dev_tx_preprare isn't making anything much
>>>> worse.
>>>>
>>>>>
>>>>>>
>>>>>>             if (unlikely(slave_tx_total[i] < nb_pkts))
>>>>>>                 tx_failed_flag = 1;
>> .
^ permalink raw reply	[flat|nested] 61+ messages in thread
* RE: [PATCH v2 1/3] net/bonding: support Tx prepare
  2022-09-19 23:02                     ` Chas Williams
  2022-09-22  2:12                       ` fengchengwen
@ 2022-09-26 10:18                       ` Konstantin Ananyev
  2022-09-26 16:36                         ` Chas Williams
  1 sibling, 1 reply; 61+ messages in thread
From: Konstantin Ananyev @ 2022-09-26 10:18 UTC (permalink / raw)
  To: Chas Williams, Fengchengwen, Ferruh Yigit, thomas,
	andrew.rybchenko, konstantin.ananyev
  Cc: dev, chas3, humin (Q)
Hi everyone,
Sorry for the late reply.
> >>>>> The main problem is hard to design a tx_prepare for bonding device:
> >>>>> 1. as Chas Williams said, there maybe twice hash calc to get target slave
> >>>>>       devices.
> >>>>> 2. also more important, if the slave devices have changes(e.g. slave device
> >>>>>       link down or remove), and if the changes happens between bond-tx-prepare and
> >>>>>       bond-tx-burst, the output slave will changes, and this may lead to checksum
> >>>>>       failed. (Note: a bond device with slave devices may from different vendors,
> >>>>>       and slave devices may have different requirements, e.g. slave-A support calc
> >>>>>       IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
> >>>>>       pre-calc).
> >>>>>
> >>>>> Current design cover the above two scenarios by using in-place tx-prepare. and
> >>>>> in addition, bond devices are not transparent to applications, I think it's a
> >>>>> practical method to provide tx-prepare support in this way.
> >>>>>
> >>>>
> >>>>
> >>>> I don't think you need to export an enable/disable routine for the use of
> >>>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
> >>>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
> >>>> branch in drivers/net/bonding.
> >>>
> >>> Our first patch was just like yours (just add tx-prepare default), but community
> >>> is concerned about impacting performance.
> >>>
> >>> As a trade-off, I think we can add the enable/disable API.
> >>
> >> IMHO, that's a bad idea. If the rte_eth_dev_tx_prepare API affects
> >> performance adversly, that is not a bonding problem. All applications
> >> should be calling rte_eth_dev_tx_prepare. There's no defined API
> >> to determine if rte_eth_dev_tx_prepare should be called. Therefore,
> >> applications should always call rte_eth_dev_tx_prepare. Regardless,
> >> as I previously mentioned, you are just trading the location of
> >> the branch, especially in the bonding case.
> >>
> >> If rte_eth_dev_tx_prepare is causing a performance drop, then that API
> >> should be improved or rewritten. There are PMD that require you to use
> >> that API. Locally, we had maintained a patch to eliminate the use of
> >> rte_eth_dev_tx_prepare. However, that has been getting harder and harder
> >> to maintain. The performance lost by just calling rte_eth_dev_tx_prepare
> >> was marginal.
> >>
> >>>
> >>>>
> >>>> I think you missed fixing tx_machine in 802.3ad support. We have been using
> >>>> the following patch locally which I never got around to submitting.
> >>>
> >>> You are right, I will send V3 fix it.
> >>>
> >>>>
> >>>>
> >>>>   From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
> >>>> From: "Charles (Chas) Williams" <chwillia@ciena.com>
> >>>> Date: Tue, 3 May 2022 16:52:37 -0400
> >>>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
> >>>>
> >>>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
> >>>> packets for transmission. Typically, the prepare step handles the VLAN
> >>>> headers, but it may need to do other things.
> >>>>
> >>>> Signed-off-by: Chas Williams <chwillia@ciena.com>
> >>>
> >>> ...
> >>>
> >>>>                 * ring if transmission fails so the packet isn't lost.
> >>>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
> >>>>
> >>>>        /* Transmit burst on each active slave */
> >>>>        for (i = 0; i < num_of_slaves; i++) {
> >>>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> >>>> +        uint16_t nb_prep;
> >>>> +
> >>>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
> >>>>                        bufs, nb_pkts);
> >>>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> >>>> +                    bufs, nb_prep);
> >>>
> >>> The tx-prepare may edit packet data, and the broadcast mode will send a packet to all slaves,
> >>> the packet data is sent and edited at the same time. Is this likely to cause problems ?
> >>
> >> This routine is already broken. You can't just increment the refcount
> >> and send the packet into a PMD's transmit routine. Nothing guarantees
> >> that a transmit routine will not modify the packet. Many PMDs perform an
> >> rte_vlan_insert.
> >
> > Hmm interesting....
> > My uderstanding was quite opposite - tx_burst() can't modify packet data and metadata
> > (except when refcnt==1 and tx_burst() going to free the mbuf and put it back to the mempool).
> > While tx_prepare() can - actually as I remember that was one of the reasons why a separate routine
> > was introduced.
> 
> Is that documented anywhere?
I looked through, but couldn't find too much except what was already mentioned by Fengcheng:
rte_eth_tx_prepare() notes:
    * Since this function can modify packet data, provided mbufs must be safely
    * writable (e.g. modified data cannot be in shared segment).
Probably that's not explicit enough, as it doesn't clearly forbid modifying packets in tx_burst.
> It's been my experience that the device PMD
> can do practically anything and you need to protect yourself.  Currently,
> the af_packet, dpaa2, and vhost driver call rte_vlan_insert. Before 2019,
> the virtio driver also used to call rte_vlan_insert during its transmit
> path. Of course, rte_vlan_insert modifies the packet data and the mbuf
> header.
Interesting, usually apps that try to use zero-copy multicast TX have the packet-header portion
in a separate segment, so it might even keep working. But it definitely doesn't look right to me:
if mbuf->refcnt > 1, I think it should be treated as read-only.
> Regardless, it looks like rte_eth_dev_tx_prepare should always be
> called.
Again, as I remember, the initial agreement was: if any TX offload is enabled,
tx_prepare() needs to be called (or the user has to implement similar logic on their own).
If no TX offload flags were specified for the packet, tx_prepare() is not necessary.
> Handling that correctly in broadcast mode probably means always
> make a deep copy of the packet, or check to see if all the members are
> the same PMD type. If so, you can just call prepare once. You could track
> the mismatched nature during additional/removal of the members. Or just
> assume people aren't going to mismatch bonding members.
> 
> 
> >> You should at least perform a clone of the packet so
> >> that the mbuf headers aren't mangled by each PMD. 
Usually you don't need to clone the whole packet. In many cases it is enough to just attach
the L2/L3/L4 header portion of the packet as the first segment.
At least that's how the ip_multicast sample works.
> >> Just to be safe you
> >> should perform a partial deep copy of the packet headers in case some
> >> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
> >> not need an rte_vlan_insert.
> >>
> >> So doing a blind rte_eth_dev_tx_preprare isn't making anything much
> >> worse.
> >>
> >>>
> >>>>
> >>>>            if (unlikely(slave_tx_total[i] < nb_pkts))
> >>>>                tx_failed_flag = 1;
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v2 1/3] net/bonding: support Tx prepare
  2022-09-26 10:18                       ` Konstantin Ananyev
@ 2022-09-26 16:36                         ` Chas Williams
  0 siblings, 0 replies; 61+ messages in thread
From: Chas Williams @ 2022-09-26 16:36 UTC (permalink / raw)
  To: Konstantin Ananyev, Fengchengwen, Ferruh Yigit, thomas,
	andrew.rybchenko, konstantin.ananyev
  Cc: dev, chas3, humin (Q)
On 9/26/22 06:18, Konstantin Ananyev wrote:
> 
> Hi everyone,
> 
> Sorry for late reply.
> 
>>>>>>> The main problem is hard to design a tx_prepare for bonding device:
>>>>>>> 1. as Chas Williams said, there maybe twice hash calc to get target slave
>>>>>>>        devices.
>>>>>>> 2. also more important, if the slave devices have changes(e.g. slave device
>>>>>>>        link down or remove), and if the changes happens between bond-tx-prepare and
>>>>>>>        bond-tx-burst, the output slave will changes, and this may lead to checksum
>>>>>>>        failed. (Note: a bond device with slave devices may from different vendors,
>>>>>>>        and slave devices may have different requirements, e.g. slave-A support calc
>>>>>>>        IPv4 pseudo-head automatic (no need driver pre-calc), but slave-B need driver
>>>>>>>        pre-calc).
>>>>>>>
>>>>>>> Current design cover the above two scenarios by using in-place tx-prepare. and
>>>>>>> in addition, bond devices are not transparent to applications, I think it's a
>>>>>>> practical method to provide tx-prepare support in this way.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> I don't think you need to export an enable/disable routine for the use of
>>>>>> rte_eth_tx_prepare. It's safe to just call that routine, even if it isn't
>>>>>> implemented. You are just trading one branch in DPDK librte_eth_dev for a
>>>>>> branch in drivers/net/bonding.
>>>>>
>>>>> Our first patch was just like yours (just add tx-prepare default), but community
>>>>> is concerned about impacting performance.
>>>>>
>>>>> As a trade-off, I think we can add the enable/disable API.
>>>>
>>>> IMHO, that's a bad idea. If the rte_eth_dev_tx_prepare API affects
>>>> performance adversly, that is not a bonding problem. All applications
>>>> should be calling rte_eth_dev_tx_prepare. There's no defined API
>>>> to determine if rte_eth_dev_tx_prepare should be called. Therefore,
>>>> applications should always call rte_eth_dev_tx_prepare. Regardless,
>>>> as I previously mentioned, you are just trading the location of
>>>> the branch, especially in the bonding case.
>>>>
>>>> If rte_eth_dev_tx_prepare is causing a performance drop, then that API
>>>> should be improved or rewritten. There are PMD that require you to use
>>>> that API. Locally, we had maintained a patch to eliminate the use of
>>>> rte_eth_dev_tx_prepare. However, that has been getting harder and harder
>>>> to maintain. The performance lost by just calling rte_eth_dev_tx_prepare
>>>> was marginal.
>>>>
>>>>>
>>>>>>
>>>>>> I think you missed fixing tx_machine in 802.3ad support. We have been using
>>>>>> the following patch locally which I never got around to submitting.
>>>>>
>>>>> You are right, I will send V3 fix it.
>>>>>
>>>>>>
>>>>>>
>>>>>>    From a458654d68ff5144266807ef136ac3dd2adfcd98 Mon Sep 17 00:00:00 2001
>>>>>> From: "Charles (Chas) Williams" <chwillia@ciena.com>
>>>>>> Date: Tue, 3 May 2022 16:52:37 -0400
>>>>>> Subject: [PATCH] net/bonding: call rte_eth_tx_prepare before rte_eth_tx_burst
>>>>>>
>>>>>> Some PMDs might require a call to rte_eth_tx_prepare before sending the
>>>>>> packets for transmission. Typically, the prepare step handles the VLAN
>>>>>> headers, but it may need to do other things.
>>>>>>
>>>>>> Signed-off-by: Chas Williams <chwillia@ciena.com>
>>>>>
>>>>> ...
>>>>>
>>>>>>                  * ring if transmission fails so the packet isn't lost.
>>>>>> @@ -1322,8 +1350,12 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>>>>>>
>>>>>>         /* Transmit burst on each active slave */
>>>>>>         for (i = 0; i < num_of_slaves; i++) {
>>>>>> -        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>>> +        uint16_t nb_prep;
>>>>>> +
>>>>>> +        nb_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>>>>>>                         bufs, nb_pkts);
>>>>>> +        slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>>>>>> +                    bufs, nb_prep);
>>>>>
>>>>> The tx-prepare may edit packet data, and the broadcast mode will send a packet to all slaves,
>>>>> the packet data is sent and edited at the same time. Is this likely to cause problems ?
>>>>
>>>> This routine is already broken. You can't just increment the refcount
>>>> and send the packet into a PMD's transmit routine. Nothing guarantees
>>>> that a transmit routine will not modify the packet. Many PMDs perform an
>>>> rte_vlan_insert.
>>>
>>> Hmm interesting....
>>> My uderstanding was quite opposite - tx_burst() can't modify packet data and metadata
>>> (except when refcnt==1 and tx_burst() going to free the mbuf and put it back to the mempool).
>>> While tx_prepare() can - actually as I remember that was one of the reasons why a separate routine
>>> was introduced.
>>
>> Is that documented anywhere?
> 
> I looked through, but couldn't find too much except what was already mentioned by Fengcheng:
> rte_eth_tx_prepare() notes:
>      * Since this function can modify packet data, provided mbufs must be safely
>      * writable (e.g. modified data cannot be in shared segment).
> Probably that's not explicit enough, as it doesn't forbid modifying packets in tx_burst clearly.
This certainly seems like one of those gray areas in the DPDK APIs. It
should be made clear what is expected as far as behavior.
> 
>> It's been my experience that the device PMD
>> can do practically anything and you need to protect yourself.  Currently,
>> the af_packet, dpaa2, and vhost driver call rte_vlan_insert. Before 2019,
>> the virtio driver also used to call rte_vlan_insert during its transmit
>> path. Of course, rte_vlan_insert modifies the packet data and the mbuf
>> header.
> Interesting, usually apps that try to use zero-copy multicast TX have the packet-header portion
> in a separate segment, so it might even keep working. But it definitely doesn't look right to me:
> if mbuf->refcnt > 1, I think it should be treated as read-only.
rte_vlan_insert might be a problem with broadcast mode. If the refcnt
is > 1, rte_vlan_insert is going to fail. So, the current broadcast mode
implementation probably doesn't work if any PMD uses rte_vlan_insert.
So again, a solution is to call tx_pkt_prepare once, then increment the
reference count, and send to all the members. That works if your
PMD correctly implements tx_pkt_prepare. If it doesn't and calls rte_vlan_insert
in the transmit routine, that PMD will need to be fixed to work with
bonding.
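
A minimal sketch of that ordering, using a toy model rather than the real DPDK types. Everything prefixed with toy_ is hypothetical and exists only for illustration; the toy prepare step mimics the behaviour described above for rte_vlan_insert by refusing to edit a packet whose refcnt is greater than 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an mbuf: only the fields needed for the illustration. */
struct toy_mbuf {
	uint16_t refcnt;
	int prepared;	/* set once the toy prepare step has edited the packet */
};

/*
 * Toy prepare step. It refuses to edit a shared packet (refcnt > 1) and
 * returns the number of packets prepared before the first failure.
 */
static uint16_t
toy_tx_prepare(struct toy_mbuf **bufs, uint16_t nb_pkts)
{
	uint16_t i;

	assert(bufs != NULL);
	for (i = 0; i < nb_pkts; i++) {
		if (bufs[i]->refcnt > 1)
			break;
		bufs[i]->prepared = 1;
	}
	return i;
}

/* Toy transmit: treats the packet as read-only, then drops one reference. */
static uint16_t
toy_tx_burst(struct toy_mbuf **bufs, uint16_t nb_pkts)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		bufs[i]->refcnt--;
	return nb_pkts;
}

/* Broadcast: prepare once, then share the packets, then send to every member. */
static uint16_t
toy_broadcast(struct toy_mbuf **bufs, uint16_t nb_pkts, uint16_t nb_members)
{
	uint16_t i, nb_prep;

	if (nb_members < 1)
		return 0;

	/* 1. Prepare while every packet still has a single owner. */
	nb_prep = toy_tx_prepare(bufs, nb_pkts);

	/* 2. Only now take one extra reference per additional member. */
	for (i = 0; i < nb_prep; i++)
		bufs[i]->refcnt = (uint16_t)(bufs[i]->refcnt + nb_members - 1);

	/* 3. Hand the same prepared packets to each member. */
	for (i = 0; i < nb_members; i++)
		toy_tx_burst(bufs, nb_prep);

	return nb_prep;
}
```

Swapping steps 1 and 2 makes the toy prepare step reject every packet as soon as there is more than one member, which is the failure mode described for rte_vlan_insert.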
  
>> Regardless, it looks like rte_eth_tx_prepare should always be
>> called.
> 
> Again, as I remember, initial agreement was: if any TX offload is enabled,
> tx_prepare() needs to be called (or the user has to implement similar stuff on their own).
> If no TX offload flags were specified for the packet, tx_prepare() is not necessary.
For the bonding driver, we potentially have a mix of PMDs for the members.
It's difficult to know in advance if your packets will have TX offload flags
or not. If you have a tx_pkt_prepare stub, there's a good chance that your
packets will have some TX offload flags. So, calling tx_pkt_prepare is likely
the "best" intermediate solution.
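
A minimal sketch of the "always prepare, then burst" pattern, assuming a toy member port described by two function pointers. The toy_ names are hypothetical and are not part of the DPDK API. The sketch treats a missing prepare callback as "nothing to prepare", and passes only the successfully prepared count on to the burst step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy packet: offload_ok says whether the prepare step can accept it. */
struct toy_pkt {
	int offload_ok;
	int sent;
};

typedef uint16_t (*toy_burst_fn)(struct toy_pkt **pkts, uint16_t nb_pkts);

/* Hypothetical member port: function pointers stand in for a PMD. */
struct toy_port {
	toy_burst_fn prepare;	/* may be NULL: nothing to prepare */
	toy_burst_fn burst;
};

/* Prepare first, then transmit only what was prepared successfully. */
static uint16_t
toy_prepare_and_burst(const struct toy_port *port, struct toy_pkt **pkts,
		uint16_t nb_pkts)
{
	uint16_t nb_prep = nb_pkts;

	assert(port != NULL && port->burst != NULL);
	if (port->prepare != NULL)
		nb_prep = port->prepare(pkts, nb_pkts);
	return port->burst(pkts, nb_prep);
}

/* Example prepare callback: stops at the first packet it cannot handle. */
static uint16_t
toy_prepare(struct toy_pkt **pkts, uint16_t nb_pkts)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		if (!pkts[i]->offload_ok)
			break;
	return i;
}

/* Example burst callback: marks every packet it was given as sent. */
static uint16_t
toy_burst(struct toy_pkt **pkts, uint16_t nb_pkts)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		pkts[i]->sent = 1;
	return nb_pkts;
}
```

Packets past the first prepare failure are reported as unsent, so the caller sees them the same way as packets a full Tx ring rejected.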
> 
>> Handling that correctly in broadcast mode probably means always
>> making a deep copy of the packet, or checking to see if all the members are
>> the same PMD type. If so, you can just call prepare once. You could track
>> the mismatched nature during addition/removal of the members. Or just
>> assume people aren't going to mismatch bonding members.
>>
>>
>>>> You should at least perform a clone of the packet so
>>>> that the mbuf headers aren't mangled by each PMD.
> 
> Usually you don't need to clone the whole packet. In many cases it is enough to just attach
> the l2/l3/l4 header portion of the packet as the first segment.
> At least that's how ip_multicast sample works.
Yes, that's what I meant by deep copy the packet headers. You just copy
enough to modify what you need and keep the bulk of the packet otherwise.
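
A minimal sketch of such a partial copy, using a toy two-part packet instead of real mbuf chains. All toy_ names are hypothetical. Each member gets a private copy of the header bytes while the payload is shared by reference count, so a VLAN insert done for one member cannot disturb the others.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HDR_MAX 64

/* Shared, read-only payload with a reference count. */
struct toy_payload {
	uint16_t refcnt;
	const uint8_t *data;
	size_t len;
};

/* Packet = private header bytes + pointer to the shared payload. */
struct toy_pkt {
	uint8_t hdr[TOY_HDR_MAX];
	size_t hdr_len;
	struct toy_payload *payload;
};

/* Give one member its own header copy; the payload is only referenced. */
static void
toy_copy_for_member(struct toy_pkt *dst, const struct toy_pkt *src)
{
	assert(src->hdr_len <= TOY_HDR_MAX);
	memcpy(dst->hdr, src->hdr, src->hdr_len);
	dst->hdr_len = src->hdr_len;
	dst->payload = src->payload;
	dst->payload->refcnt++;
}

/* Insert a 4-byte 802.1Q tag after the two MAC addresses (12 bytes). */
static int
toy_vlan_insert(struct toy_pkt *pkt, uint16_t tci)
{
	if (pkt->hdr_len < 12 || pkt->hdr_len + 4 > TOY_HDR_MAX)
		return -1;
	/* Shift the EtherType and everything after it right by 4 bytes. */
	memmove(pkt->hdr + 16, pkt->hdr + 12, pkt->hdr_len - 12);
	pkt->hdr[12] = 0x81;	/* TPID 0x8100 */
	pkt->hdr[13] = 0x00;
	pkt->hdr[14] = (uint8_t)(tci >> 8);
	pkt->hdr[15] = (uint8_t)(tci & 0xff);
	pkt->hdr_len += 4;
	return 0;
}
```

Only the header bytes are duplicated per member; the bulk of the packet stays shared and must be treated as read-only by every member.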
> 
>>>> Just to be safe you
>>>> should perform a partial deep copy of the packet headers in case some
>>>> PMD does an rte_vlan_insert and the other PMDs in the bonding group do
>>>> not need an rte_vlan_insert.
>>>>
>>>> So doing a blind rte_eth_tx_prepare isn't making anything much
>>>> worse.
>>>>
>>>>>
>>>>>>
>>>>>>             if (unlikely(slave_tx_total[i] < nb_pkts))
>>>>>>                 tx_failed_flag = 1;
^ permalink raw reply	[flat|nested] 61+ messages in thread
* [PATCH v4] net/bonding: call Tx prepare before Tx burst
  2021-04-23  9:46   ` [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding Chengchang Tang
                       ` (2 preceding siblings ...)
  2022-09-17  4:15     ` [PATCH v3 " Chengwen Feng
@ 2022-10-09  3:36     ` Chengwen Feng
  2022-10-10 19:42       ` Chas Williams
  2022-10-11 13:20     ` [PATCH v5] " Chengwen Feng
  4 siblings, 1 reply; 61+ messages in thread
From: Chengwen Feng @ 2022-10-09  3:36 UTC (permalink / raw)
  To: thomas, ferruh.yigit
  Cc: dev, chas3, humin29, andrew.rybchenko, konstantin.ananyev, 3chas3
Normally, to use the HW offloads capability (e.g. checksum and TSO) in
the Tx direction, the application needs to call rte_eth_tx_prepare() to
do some adjustment with the packets before sending them. But the
tx_prepare callback of the bonding driver is not implemented. Therefore,
the sent packets may have errors (e.g. checksum errors).
However, it is difficult to design the tx_prepare callback for the bonding
driver, because when a bonded device sends packets, it distributes them
across the slave devices based on the real-time link status and bonding
mode. That is, it is very difficult for the bonded device to determine
which slave device's prepare function should be invoked.
So in this patch, the tx_prepare callback of bonding driver is not
implemented. Instead, the rte_eth_tx_prepare() will be called before
rte_eth_tx_burst(). In this way, all tx_offloads can be processed
correctly for all NIC devices.
Note: because it is rare to bond different PMDs together, tx-prepare is
called only once in broadcast bonding mode.
Also the following description was added to the rte_eth_tx_burst()
function:
"@note This function must not modify mbufs (including packets data)
unless the refcnt is 1. The exception is the bonding PMD, which does not
have tx-prepare function, in this case, mbufs maybe modified."
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Min Hu (Connor) <humin29@huawei.com>
---
v4: address Chas and Konstantin's comments.
v3: support tx-prepare when Tx internal generate mbufs.
v2: support tx-prepare enable flag and fail stats.
---
 drivers/net/bonding/rte_eth_bond_8023ad.c | 10 ++++-
 drivers/net/bonding/rte_eth_bond_pmd.c    | 45 ++++++++++++++++++-----
 lib/ethdev/rte_ethdev.h                   |  4 ++
 3 files changed, 47 insertions(+), 12 deletions(-)
diff --git a/drivers/net/bonding/rte_eth_bond_8023ad.c b/drivers/net/bonding/rte_eth_bond_8023ad.c
index b3cddd8a20..29a71ae0bf 100644
--- a/drivers/net/bonding/rte_eth_bond_8023ad.c
+++ b/drivers/net/bonding/rte_eth_bond_8023ad.c
@@ -636,9 +636,12 @@ tx_machine(struct bond_dev_private *internals, uint16_t slave_id)
 			return;
 		}
 	} else {
-		uint16_t pkts_sent = rte_eth_tx_burst(slave_id,
+		uint16_t pkts_sent = rte_eth_tx_prepare(slave_id,
 				internals->mode4.dedicated_queues.tx_qid,
 				&lacp_pkt, 1);
+		pkts_sent = rte_eth_tx_burst(slave_id,
+				internals->mode4.dedicated_queues.tx_qid,
+				&lacp_pkt, pkts_sent);
 		if (pkts_sent != 1) {
 			rte_pktmbuf_free(lacp_pkt);
 			set_warning_flags(port, WRN_TX_QUEUE_FULL);
@@ -1371,9 +1374,12 @@ bond_mode_8023ad_handle_slow_pkt(struct bond_dev_private *internals,
 			}
 		} else {
 			/* Send packet directly to the slow queue */
-			uint16_t tx_count = rte_eth_tx_burst(slave_id,
+			uint16_t tx_count = rte_eth_tx_prepare(slave_id,
 					internals->mode4.dedicated_queues.tx_qid,
 					&pkt, 1);
+			tx_count = rte_eth_tx_burst(slave_id,
+					internals->mode4.dedicated_queues.tx_qid,
+					&pkt, tx_count);
 			if (tx_count != 1) {
 				/* reset timer */
 				port->rx_marker_timer = 0;
diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index fd2d95a751..fdc004ba46 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -602,8 +602,11 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
 	/* Send packet burst on each slave device */
 	for (i = 0; i < num_of_slaves; i++) {
 		if (slave_nb_pkts[i] > 0) {
+			num_tx_slave = rte_eth_tx_prepare(slaves[i],
+					bd_tx_q->queue_id, slave_bufs[i],
+					slave_nb_pkts[i]);
 			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
-					slave_bufs[i], slave_nb_pkts[i]);
+					slave_bufs[i], num_tx_slave);
 
 			/* if tx burst fails move packets to end of bufs */
 			if (unlikely(num_tx_slave < slave_nb_pkts[i])) {
@@ -628,6 +631,7 @@ bond_ethdev_tx_burst_active_backup(void *queue,
 {
 	struct bond_dev_private *internals;
 	struct bond_tx_queue *bd_tx_q;
+	uint16_t nb_prep_pkts;
 
 	bd_tx_q = (struct bond_tx_queue *)queue;
 	internals = bd_tx_q->dev_private;
@@ -635,8 +639,11 @@ bond_ethdev_tx_burst_active_backup(void *queue,
 	if (internals->active_slave_count < 1)
 		return 0;
 
+	nb_prep_pkts = rte_eth_tx_prepare(internals->current_primary_port,
+				bd_tx_q->queue_id, bufs, nb_pkts);
+
 	return rte_eth_tx_burst(internals->current_primary_port, bd_tx_q->queue_id,
-			bufs, nb_pkts);
+			bufs, nb_prep_pkts);
 }
 
 static inline uint16_t
@@ -910,7 +917,7 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 
 	struct rte_eth_dev *primary_port =
 			&rte_eth_devices[internals->primary_port];
-	uint16_t num_tx_total = 0;
+	uint16_t num_tx_total = 0, num_tx_prep;
 	uint16_t i, j;
 
 	uint16_t num_of_slaves = internals->active_slave_count;
@@ -951,8 +958,10 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 #endif
 		}
 
-		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+		num_tx_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
 				bufs + num_tx_total, nb_pkts - num_tx_total);
+		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+				bufs + num_tx_total, num_tx_prep);
 
 		if (num_tx_total == nb_pkts)
 			break;
@@ -1064,8 +1073,10 @@ bond_ethdev_tx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	/* Send ARP packets on proper slaves */
 	for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
 		if (slave_bufs_pkts[i] > 0) {
-			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id,
+			num_send = rte_eth_tx_prepare(i, bd_tx_q->queue_id,
 					slave_bufs[i], slave_bufs_pkts[i]);
+			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id,
+					slave_bufs[i], num_send);
 			for (j = 0; j < slave_bufs_pkts[i] - num_send; j++) {
 				bufs[nb_pkts - 1 - num_not_send - j] =
 						slave_bufs[i][nb_pkts - 1 - j];
@@ -1088,8 +1099,10 @@ bond_ethdev_tx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	/* Send update packets on proper slaves */
 	for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
 		if (update_bufs_pkts[i] > 0) {
+			num_send = rte_eth_tx_prepare(i, bd_tx_q->queue_id,
+					update_bufs[i], update_bufs_pkts[i]);
 			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id, update_bufs[i],
-					update_bufs_pkts[i]);
+					num_send);
 			for (j = num_send; j < update_bufs_pkts[i]; j++) {
 				rte_pktmbuf_free(update_bufs[i][j]);
 			}
@@ -1158,9 +1171,12 @@ tx_burst_balance(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
 		if (slave_nb_bufs[i] == 0)
 			continue;
 
-		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+		slave_tx_count = rte_eth_tx_prepare(slave_port_ids[i],
 				bd_tx_q->queue_id, slave_bufs[i],
 				slave_nb_bufs[i]);
+		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+				bd_tx_q->queue_id, slave_bufs[i],
+				slave_tx_count);
 
 		total_tx_count += slave_tx_count;
 
@@ -1243,8 +1259,10 @@ tx_burst_8023ad(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
 
 		if (rte_ring_dequeue(port->tx_ring,
 				     (void **)&ctrl_pkt) != -ENOENT) {
-			slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+			slave_tx_count = rte_eth_tx_prepare(slave_port_ids[i],
 					bd_tx_q->queue_id, &ctrl_pkt, 1);
+			slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+					bd_tx_q->queue_id, &ctrl_pkt, slave_tx_count);
 			/*
 			 * re-enqueue LAG control plane packets to buffering
 			 * ring if transmission fails so the packet isn't lost.
@@ -1298,6 +1316,7 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
 	uint16_t slaves[RTE_MAX_ETHPORTS];
 	uint8_t tx_failed_flag = 0;
 	uint16_t num_of_slaves;
+	uint16_t num_tx_prep;
 
 	uint16_t max_nb_of_tx_pkts = 0;
 
@@ -1320,12 +1339,18 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
 	for (i = 0; i < nb_pkts; i++)
 		rte_pktmbuf_refcnt_update(bufs[i], num_of_slaves - 1);
 
+	/* It is rare that bond different PMDs together, so just call tx-prepare once */
+	num_tx_prep = rte_eth_tx_prepare(slaves[0], bd_tx_q->queue_id,
+					bufs, nb_pkts);
+	if (unlikely(num_tx_prep < nb_pkts))
+		tx_failed_flag = 1;
+
 	/* Transmit burst on each active slave */
 	for (i = 0; i < num_of_slaves; i++) {
 		slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
-					bufs, nb_pkts);
+					bufs, num_tx_prep);
 
-		if (unlikely(slave_tx_total[i] < nb_pkts))
+		if (unlikely(slave_tx_total[i] < num_tx_prep))
 			tx_failed_flag = 1;
 
 		/* record the value and slave index for the slave which transmits the
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index e8d1e1c658..b0396bb86e 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -6031,6 +6031,10 @@ uint16_t rte_eth_call_tx_callbacks(uint16_t port_id, uint16_t queue_id,
  * @see rte_eth_tx_prepare to perform some prior checks or adjustments
  * for offloads.
  *
+ * @note This function must not modify mbufs (including packets data) unless
+ * the refcnt is 1. The exception is the bonding PMD, which does not have
+ * tx-prepare function, in this case, mbufs maybe modified.
+ *
  * @param port_id
  *   The port identifier of the Ethernet device.
  * @param queue_id
-- 
2.17.1
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v4] net/bonding: call Tx prepare before Tx burst
  2022-10-09  3:36     ` [PATCH v4] net/bonding: call Tx prepare before Tx burst Chengwen Feng
@ 2022-10-10 19:42       ` Chas Williams
  2022-10-11 13:28         ` fengchengwen
  0 siblings, 1 reply; 61+ messages in thread
From: Chas Williams @ 2022-10-10 19:42 UTC (permalink / raw)
  To: Chengwen Feng, thomas, ferruh.yigit
  Cc: dev, chas3, humin29, andrew.rybchenko, konstantin.ananyev
On 10/8/22 23:36, Chengwen Feng wrote:
>   	uint16_t slaves[RTE_MAX_ETHPORTS];
>   	uint8_t tx_failed_flag = 0;
>   	uint16_t num_of_slaves;
> +	uint16_t num_tx_prep;
>   
>   	uint16_t max_nb_of_tx_pkts = 0;
>   
> @@ -1320,12 +1339,18 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>   	for (i = 0; i < nb_pkts; i++)
>   		rte_pktmbuf_refcnt_update(bufs[i], num_of_slaves - 1);
>   
> +	/* It is rare that bond different PMDs together, so just call tx-prepare once */
> +	num_tx_prep = rte_eth_tx_prepare(slaves[0], bd_tx_q->queue_id,
> +					bufs, nb_pkts);
You probably want to do this before you update the refcnt on the mbufs.
Otherwise, the common rte_eth_tx_prepare operation, rte_vlan_insert, will
fail since the refcnt will not be 1.
> +	if (unlikely(num_tx_prep < nb_pkts))
> +		tx_failed_flag = 1;
> +
>   	/* Transmit burst on each active slave */
>   	for (i = 0; i < num_of_slaves; i++) {
>   		slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> -					bufs, nb_pkts);
> +					bufs, num_tx_prep);
>   
> -		if (unlikely(slave_tx_total[i] < nb_pkts))
> +		if (unlikely(slave_tx_total[i] < num_tx_prep))
>   			tx_failed_flag = 1;
>   
>   		/* record the value and slave index for the slave which transmits the
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index e8d1e1c658..b0396bb86e 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -6031,6 +6031,10 @@ uint16_t rte_eth_call_tx_callbacks(uint16_t port_id, uint16_t queue_id,
>    * @see rte_eth_tx_prepare to perform some prior checks or adjustments
>    * for offloads.
>    *
> + * @note This function must not modify mbufs (including packets data) unless
> + * the refcnt is 1. The exception is the bonding PMD, which does not have
> + * tx-prepare function, in this case, mbufs maybe modified.
Exactly. See my comment about calling prepare before you modify the refcnt.
> + *
>    * @param port_id
>    *   The port identifier of the Ethernet device.
>    * @param queue_id
^ permalink raw reply	[flat|nested] 61+ messages in thread
* [PATCH v5] net/bonding: call Tx prepare before Tx burst
  2021-04-23  9:46   ` [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding Chengchang Tang
                       ` (3 preceding siblings ...)
  2022-10-09  3:36     ` [PATCH v4] net/bonding: call Tx prepare before Tx burst Chengwen Feng
@ 2022-10-11 13:20     ` Chengwen Feng
  2022-10-15 15:26       ` Chas Williams
  4 siblings, 1 reply; 61+ messages in thread
From: Chengwen Feng @ 2022-10-11 13:20 UTC (permalink / raw)
  To: thomas, ferruh.yigit
  Cc: dev, chas3, humin29, andrew.rybchenko, konstantin.ananyev, 3chas3
Normally, to use the HW offloads capability (e.g. checksum and TSO) in
the Tx direction, the application needs to call rte_eth_tx_prepare() to
do some adjustment with the packets before sending them. But the
tx_prepare callback of the bonding driver is not implemented. Therefore,
the sent packets may have errors (e.g. checksum errors).
However, it is difficult to design the tx_prepare callback for the bonding
driver, because when a bonded device sends packets, it distributes them
across the slave devices based on the real-time link status and bonding
mode. That is, it is very difficult for the bonded device to determine
which slave device's prepare function should be invoked.
So in this patch, the tx_prepare callback of bonding driver is not
implemented. Instead, the rte_eth_tx_prepare() will be called before
rte_eth_tx_burst(). In this way, all tx_offloads can be processed
correctly for all NIC devices.
Note: because it is rare to bond different PMDs together, tx-prepare is
called only once in broadcast bonding mode.
Also the following description was added to the rte_eth_tx_burst()
function:
"@note This function must not modify mbufs (including packets data)
unless the refcnt is 1. The exception is the bonding PMD, which does not
have tx-prepare function, in this case, mbufs maybe modified."
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Min Hu (Connor) <humin29@huawei.com>
---
v5: address Chas's comments.
v4: address Chas and Konstantin's comments.
v3: support tx-prepare when Tx internal generate mbufs.
v2: support tx-prepare enable flag and fail stats.
---
 drivers/net/bonding/rte_eth_bond_8023ad.c | 10 ++++--
 drivers/net/bonding/rte_eth_bond_pmd.c    | 37 ++++++++++++++++++-----
 lib/ethdev/rte_ethdev.h                   |  4 +++
 3 files changed, 41 insertions(+), 10 deletions(-)
diff --git a/drivers/net/bonding/rte_eth_bond_8023ad.c b/drivers/net/bonding/rte_eth_bond_8023ad.c
index b3cddd8a20..29a71ae0bf 100644
--- a/drivers/net/bonding/rte_eth_bond_8023ad.c
+++ b/drivers/net/bonding/rte_eth_bond_8023ad.c
@@ -636,9 +636,12 @@ tx_machine(struct bond_dev_private *internals, uint16_t slave_id)
 			return;
 		}
 	} else {
-		uint16_t pkts_sent = rte_eth_tx_burst(slave_id,
+		uint16_t pkts_sent = rte_eth_tx_prepare(slave_id,
 				internals->mode4.dedicated_queues.tx_qid,
 				&lacp_pkt, 1);
+		pkts_sent = rte_eth_tx_burst(slave_id,
+				internals->mode4.dedicated_queues.tx_qid,
+				&lacp_pkt, pkts_sent);
 		if (pkts_sent != 1) {
 			rte_pktmbuf_free(lacp_pkt);
 			set_warning_flags(port, WRN_TX_QUEUE_FULL);
@@ -1371,9 +1374,12 @@ bond_mode_8023ad_handle_slow_pkt(struct bond_dev_private *internals,
 			}
 		} else {
 			/* Send packet directly to the slow queue */
-			uint16_t tx_count = rte_eth_tx_burst(slave_id,
+			uint16_t tx_count = rte_eth_tx_prepare(slave_id,
 					internals->mode4.dedicated_queues.tx_qid,
 					&pkt, 1);
+			tx_count = rte_eth_tx_burst(slave_id,
+					internals->mode4.dedicated_queues.tx_qid,
+					&pkt, tx_count);
 			if (tx_count != 1) {
 				/* reset timer */
 				port->rx_marker_timer = 0;
diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
index 4081b21338..a2c68ec9bc 100644
--- a/drivers/net/bonding/rte_eth_bond_pmd.c
+++ b/drivers/net/bonding/rte_eth_bond_pmd.c
@@ -602,8 +602,11 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
 	/* Send packet burst on each slave device */
 	for (i = 0; i < num_of_slaves; i++) {
 		if (slave_nb_pkts[i] > 0) {
+			num_tx_slave = rte_eth_tx_prepare(slaves[i],
+					bd_tx_q->queue_id, slave_bufs[i],
+					slave_nb_pkts[i]);
 			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
-					slave_bufs[i], slave_nb_pkts[i]);
+					slave_bufs[i], num_tx_slave);
 
 			/* if tx burst fails move packets to end of bufs */
 			if (unlikely(num_tx_slave < slave_nb_pkts[i])) {
@@ -628,6 +631,7 @@ bond_ethdev_tx_burst_active_backup(void *queue,
 {
 	struct bond_dev_private *internals;
 	struct bond_tx_queue *bd_tx_q;
+	uint16_t nb_prep_pkts;
 
 	bd_tx_q = (struct bond_tx_queue *)queue;
 	internals = bd_tx_q->dev_private;
@@ -635,8 +639,11 @@ bond_ethdev_tx_burst_active_backup(void *queue,
 	if (internals->active_slave_count < 1)
 		return 0;
 
+	nb_prep_pkts = rte_eth_tx_prepare(internals->current_primary_port,
+				bd_tx_q->queue_id, bufs, nb_pkts);
+
 	return rte_eth_tx_burst(internals->current_primary_port, bd_tx_q->queue_id,
-			bufs, nb_pkts);
+			bufs, nb_prep_pkts);
 }
 
 static inline uint16_t
@@ -910,7 +917,7 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 
 	struct rte_eth_dev *primary_port =
 			&rte_eth_devices[internals->primary_port];
-	uint16_t num_tx_total = 0;
+	uint16_t num_tx_total = 0, num_tx_prep;
 	uint16_t i, j;
 
 	uint16_t num_of_slaves = internals->active_slave_count;
@@ -951,8 +958,10 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 #endif
 		}
 
-		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+		num_tx_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
 				bufs + num_tx_total, nb_pkts - num_tx_total);
+		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
+				bufs + num_tx_total, num_tx_prep);
 
 		if (num_tx_total == nb_pkts)
 			break;
@@ -1064,8 +1073,10 @@ bond_ethdev_tx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	/* Send ARP packets on proper slaves */
 	for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
 		if (slave_bufs_pkts[i] > 0) {
-			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id,
+			num_send = rte_eth_tx_prepare(i, bd_tx_q->queue_id,
 					slave_bufs[i], slave_bufs_pkts[i]);
+			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id,
+					slave_bufs[i], num_send);
 			for (j = 0; j < slave_bufs_pkts[i] - num_send; j++) {
 				bufs[nb_pkts - 1 - num_not_send - j] =
 						slave_bufs[i][nb_pkts - 1 - j];
@@ -1088,8 +1099,10 @@ bond_ethdev_tx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
 	/* Send update packets on proper slaves */
 	for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
 		if (update_bufs_pkts[i] > 0) {
+			num_send = rte_eth_tx_prepare(i, bd_tx_q->queue_id,
+					update_bufs[i], update_bufs_pkts[i]);
 			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id, update_bufs[i],
-					update_bufs_pkts[i]);
+					num_send);
 			for (j = num_send; j < update_bufs_pkts[i]; j++) {
 				rte_pktmbuf_free(update_bufs[i][j]);
 			}
@@ -1158,9 +1171,12 @@ tx_burst_balance(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
 		if (slave_nb_bufs[i] == 0)
 			continue;
 
-		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+		slave_tx_count = rte_eth_tx_prepare(slave_port_ids[i],
 				bd_tx_q->queue_id, slave_bufs[i],
 				slave_nb_bufs[i]);
+		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+				bd_tx_q->queue_id, slave_bufs[i],
+				slave_tx_count);
 
 		total_tx_count += slave_tx_count;
 
@@ -1243,8 +1259,10 @@ tx_burst_8023ad(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
 
 		if (rte_ring_dequeue(port->tx_ring,
 				     (void **)&ctrl_pkt) != -ENOENT) {
-			slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+			slave_tx_count = rte_eth_tx_prepare(slave_port_ids[i],
 					bd_tx_q->queue_id, &ctrl_pkt, 1);
+			slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
+					bd_tx_q->queue_id, &ctrl_pkt, slave_tx_count);
 			/*
 			 * re-enqueue LAG control plane packets to buffering
 			 * ring if transmission fails so the packet isn't lost.
@@ -1316,6 +1334,9 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
 	if (num_of_slaves < 1)
 		return 0;
 
+	/* It is rare that bond different PMDs together, so just call tx-prepare once */
+	nb_pkts = rte_eth_tx_prepare(slaves[0], bd_tx_q->queue_id, bufs, nb_pkts);
+
 	/* Increment reference count on mbufs */
 	for (i = 0; i < nb_pkts; i++)
 		rte_pktmbuf_refcnt_update(bufs[i], num_of_slaves - 1);
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index d43a638aff..e92139f105 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -6095,6 +6095,10 @@ uint16_t rte_eth_call_tx_callbacks(uint16_t port_id, uint16_t queue_id,
  * @see rte_eth_tx_prepare to perform some prior checks or adjustments
  * for offloads.
  *
+ * @note This function must not modify mbufs (including packets data) unless
+ * the refcnt is 1. The exception is the bonding PMD, which does not have
+ * tx-prepare function, in this case, mbufs maybe modified.
+ *
  * @param port_id
  *   The port identifier of the Ethernet device.
  * @param queue_id
-- 
2.17.1
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v4] net/bonding: call Tx prepare before Tx burst
  2022-10-10 19:42       ` Chas Williams
@ 2022-10-11 13:28         ` fengchengwen
  0 siblings, 0 replies; 61+ messages in thread
From: fengchengwen @ 2022-10-11 13:28 UTC (permalink / raw)
  To: Chas Williams, thomas, ferruh.yigit
  Cc: dev, chas3, humin29, andrew.rybchenko, konstantin.ananyev
Hi Chas,
On 2022/10/11 3:42, Chas Williams wrote:
> 
> 
> On 10/8/22 23:36, Chengwen Feng wrote:
>>       uint16_t slaves[RTE_MAX_ETHPORTS];
>>       uint8_t tx_failed_flag = 0;
>>       uint16_t num_of_slaves;
>> +    uint16_t num_tx_prep;
>>         uint16_t max_nb_of_tx_pkts = 0;
>>   @@ -1320,12 +1339,18 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>>       for (i = 0; i < nb_pkts; i++)
>>           rte_pktmbuf_refcnt_update(bufs[i], num_of_slaves - 1);
>>   +    /* It is rare that bond different PMDs together, so just call tx-prepare once */
>> +    num_tx_prep = rte_eth_tx_prepare(slaves[0], bd_tx_q->queue_id,
>> +                    bufs, nb_pkts);
> 
> You probably want to do this before you update the refcnt on the mbufs.
> Otherwise, the common rte_eth_tx_prepare operation, rte_vlan_insert, will
> fail since the refcnt will not be 1.
Nice catch.
v5 has already been sent to fix it; please review it. Thanks.
> 
>> +    if (unlikely(num_tx_prep < nb_pkts))
>> +        tx_failed_flag = 1;
>> +
>>       /* Transmit burst on each active slave */
>>       for (i = 0; i < num_of_slaves; i++) {
>>           slave_tx_total[i] = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
>> -                    bufs, nb_pkts);
>> +                    bufs, num_tx_prep);
>>   -        if (unlikely(slave_tx_total[i] < nb_pkts))
>> +        if (unlikely(slave_tx_total[i] < num_tx_prep))
>>               tx_failed_flag = 1;
>>             /* record the value and slave index for the slave which transmits the
>> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
>> index e8d1e1c658..b0396bb86e 100644
>> --- a/lib/ethdev/rte_ethdev.h
>> +++ b/lib/ethdev/rte_ethdev.h
>> @@ -6031,6 +6031,10 @@ uint16_t rte_eth_call_tx_callbacks(uint16_t port_id, uint16_t queue_id,
>>    * @see rte_eth_tx_prepare to perform some prior checks or adjustments
>>    * for offloads.
>>    *
>> + * @note This function must not modify mbufs (including packets data) unless
>> + * the refcnt is 1. The exception is the bonding PMD, which does not have
>> + * tx-prepare function, in this case, mbufs maybe modified.
> 
> Exactly. See my comment about calling prepare before you modify the refcnt.
> 
>> + *
>>    * @param port_id
>>    *   The port identifier of the Ethernet device.
>>    * @param queue_id
> .
^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH v5] net/bonding: call Tx prepare before Tx burst
  2022-10-11 13:20     ` [PATCH v5] " Chengwen Feng
@ 2022-10-15 15:26       ` Chas Williams
  2022-10-18 14:25         ` fengchengwen
  2022-10-20  7:07         ` Andrew Rybchenko
  0 siblings, 2 replies; 61+ messages in thread
From: Chas Williams @ 2022-10-15 15:26 UTC (permalink / raw)
  To: Chengwen Feng, thomas, ferruh.yigit
  Cc: dev, chas3, humin29, andrew.rybchenko, konstantin.ananyev
This looks fine. Thanks for making the changes!
Signed-off-by: Chas Williams <3chas3@gmail.com>
On 10/11/22 09:20, Chengwen Feng wrote:
> Normally, to use the HW offloads capability (e.g. checksum and TSO) in
> the Tx direction, the application needs to call rte_eth_tx_prepare() to
> do some adjustment with the packets before sending them. But the
> tx_prepare callback of the bonding driver is not implemented. Therefore,
> the sent packets may have errors (e.g. checksum errors).
> 
> However, it is difficult to design the tx_prepare callback for the bonding
> driver, because when a bonded device sends packets, it distributes them
> across the slave devices based on the real-time link status and bonding
> mode. That is, it is very difficult for the bonded device to determine
> which slave device's prepare function should be invoked.
> 
> So in this patch, the tx_prepare callback of bonding driver is not
> implemented. Instead, the rte_eth_tx_prepare() will be called before
> rte_eth_tx_burst(). In this way, all tx_offloads can be processed
> correctly for all NIC devices.
> 
> Note: because it is rare to bond different PMDs together, tx-prepare is
> called only once in broadcast bonding mode.
> 
> Also the following description was added to the rte_eth_tx_burst()
> function:
> "@note This function must not modify mbufs (including packets data)
> unless the refcnt is 1. The exception is the bonding PMD, which does not
> have tx-prepare function, in this case, mbufs maybe modified."
> 
> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> Reviewed-by: Min Hu (Connor) <humin29@huawei.com>
> 
> ---
> v5: address Chas's comments.
> v4: address Chas and Konstantin's comments.
> v3: support tx-prepare when Tx internal generate mbufs.
> v2: support tx-prepare enable flag and fail stats.
> 
> ---
>   drivers/net/bonding/rte_eth_bond_8023ad.c | 10 ++++--
>   drivers/net/bonding/rte_eth_bond_pmd.c    | 37 ++++++++++++++++++-----
>   lib/ethdev/rte_ethdev.h                   |  4 +++
>   3 files changed, 41 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/bonding/rte_eth_bond_8023ad.c b/drivers/net/bonding/rte_eth_bond_8023ad.c
> index b3cddd8a20..29a71ae0bf 100644
> --- a/drivers/net/bonding/rte_eth_bond_8023ad.c
> +++ b/drivers/net/bonding/rte_eth_bond_8023ad.c
> @@ -636,9 +636,12 @@ tx_machine(struct bond_dev_private *internals, uint16_t slave_id)
>   			return;
>   		}
>   	} else {
> -		uint16_t pkts_sent = rte_eth_tx_burst(slave_id,
> +		uint16_t pkts_sent = rte_eth_tx_prepare(slave_id,
>   				internals->mode4.dedicated_queues.tx_qid,
>   				&lacp_pkt, 1);
> +		pkts_sent = rte_eth_tx_burst(slave_id,
> +				internals->mode4.dedicated_queues.tx_qid,
> +				&lacp_pkt, pkts_sent);
>   		if (pkts_sent != 1) {
>   			rte_pktmbuf_free(lacp_pkt);
>   			set_warning_flags(port, WRN_TX_QUEUE_FULL);
> @@ -1371,9 +1374,12 @@ bond_mode_8023ad_handle_slow_pkt(struct bond_dev_private *internals,
>   			}
>   		} else {
>   			/* Send packet directly to the slow queue */
> -			uint16_t tx_count = rte_eth_tx_burst(slave_id,
> +			uint16_t tx_count = rte_eth_tx_prepare(slave_id,
>   					internals->mode4.dedicated_queues.tx_qid,
>   					&pkt, 1);
> +			tx_count = rte_eth_tx_burst(slave_id,
> +					internals->mode4.dedicated_queues.tx_qid,
> +					&pkt, tx_count);
>   			if (tx_count != 1) {
>   				/* reset timer */
>   				port->rx_marker_timer = 0;
> diff --git a/drivers/net/bonding/rte_eth_bond_pmd.c b/drivers/net/bonding/rte_eth_bond_pmd.c
> index 4081b21338..a2c68ec9bc 100644
> --- a/drivers/net/bonding/rte_eth_bond_pmd.c
> +++ b/drivers/net/bonding/rte_eth_bond_pmd.c
> @@ -602,8 +602,11 @@ bond_ethdev_tx_burst_round_robin(void *queue, struct rte_mbuf **bufs,
>   	/* Send packet burst on each slave device */
>   	for (i = 0; i < num_of_slaves; i++) {
>   		if (slave_nb_pkts[i] > 0) {
> +			num_tx_slave = rte_eth_tx_prepare(slaves[i],
> +					bd_tx_q->queue_id, slave_bufs[i],
> +					slave_nb_pkts[i]);
>   			num_tx_slave = rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> -					slave_bufs[i], slave_nb_pkts[i]);
> +					slave_bufs[i], num_tx_slave);
>   
>   			/* if tx burst fails move packets to end of bufs */
>   			if (unlikely(num_tx_slave < slave_nb_pkts[i])) {
> @@ -628,6 +631,7 @@ bond_ethdev_tx_burst_active_backup(void *queue,
>   {
>   	struct bond_dev_private *internals;
>   	struct bond_tx_queue *bd_tx_q;
> +	uint16_t nb_prep_pkts;
>   
>   	bd_tx_q = (struct bond_tx_queue *)queue;
>   	internals = bd_tx_q->dev_private;
> @@ -635,8 +639,11 @@ bond_ethdev_tx_burst_active_backup(void *queue,
>   	if (internals->active_slave_count < 1)
>   		return 0;
>   
> +	nb_prep_pkts = rte_eth_tx_prepare(internals->current_primary_port,
> +				bd_tx_q->queue_id, bufs, nb_pkts);
> +
>   	return rte_eth_tx_burst(internals->current_primary_port, bd_tx_q->queue_id,
> -			bufs, nb_pkts);
> +			bufs, nb_prep_pkts);
>   }
>   
>   static inline uint16_t
> @@ -910,7 +917,7 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
>   
>   	struct rte_eth_dev *primary_port =
>   			&rte_eth_devices[internals->primary_port];
> -	uint16_t num_tx_total = 0;
> +	uint16_t num_tx_total = 0, num_tx_prep;
>   	uint16_t i, j;
>   
>   	uint16_t num_of_slaves = internals->active_slave_count;
> @@ -951,8 +958,10 @@ bond_ethdev_tx_burst_tlb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
>   #endif
>   		}
>   
> -		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> +		num_tx_prep = rte_eth_tx_prepare(slaves[i], bd_tx_q->queue_id,
>   				bufs + num_tx_total, nb_pkts - num_tx_total);
> +		num_tx_total += rte_eth_tx_burst(slaves[i], bd_tx_q->queue_id,
> +				bufs + num_tx_total, num_tx_prep);
>   
>   		if (num_tx_total == nb_pkts)
>   			break;
> @@ -1064,8 +1073,10 @@ bond_ethdev_tx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
>   	/* Send ARP packets on proper slaves */
>   	for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
>   		if (slave_bufs_pkts[i] > 0) {
> -			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id,
> +			num_send = rte_eth_tx_prepare(i, bd_tx_q->queue_id,
>   					slave_bufs[i], slave_bufs_pkts[i]);
> +			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id,
> +					slave_bufs[i], num_send);
>   			for (j = 0; j < slave_bufs_pkts[i] - num_send; j++) {
>   				bufs[nb_pkts - 1 - num_not_send - j] =
>   						slave_bufs[i][nb_pkts - 1 - j];
> @@ -1088,8 +1099,10 @@ bond_ethdev_tx_burst_alb(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
>   	/* Send update packets on proper slaves */
>   	for (i = 0; i < RTE_MAX_ETHPORTS; i++) {
>   		if (update_bufs_pkts[i] > 0) {
> +			num_send = rte_eth_tx_prepare(i, bd_tx_q->queue_id,
> +					update_bufs[i], update_bufs_pkts[i]);
>   			num_send = rte_eth_tx_burst(i, bd_tx_q->queue_id, update_bufs[i],
> -					update_bufs_pkts[i]);
> +					num_send);
>   			for (j = num_send; j < update_bufs_pkts[i]; j++) {
>   				rte_pktmbuf_free(update_bufs[i][j]);
>   			}
> @@ -1158,9 +1171,12 @@ tx_burst_balance(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
>   		if (slave_nb_bufs[i] == 0)
>   			continue;
>   
> -		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
> +		slave_tx_count = rte_eth_tx_prepare(slave_port_ids[i],
>   				bd_tx_q->queue_id, slave_bufs[i],
>   				slave_nb_bufs[i]);
> +		slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
> +				bd_tx_q->queue_id, slave_bufs[i],
> +				slave_tx_count);
>   
>   		total_tx_count += slave_tx_count;
>   
> @@ -1243,8 +1259,10 @@ tx_burst_8023ad(void *queue, struct rte_mbuf **bufs, uint16_t nb_bufs,
>   
>   		if (rte_ring_dequeue(port->tx_ring,
>   				     (void **)&ctrl_pkt) != -ENOENT) {
> -			slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
> +			slave_tx_count = rte_eth_tx_prepare(slave_port_ids[i],
>   					bd_tx_q->queue_id, &ctrl_pkt, 1);
> +			slave_tx_count = rte_eth_tx_burst(slave_port_ids[i],
> +					bd_tx_q->queue_id, &ctrl_pkt, slave_tx_count);
>   			/*
>   			 * re-enqueue LAG control plane packets to buffering
>   			 * ring if transmission fails so the packet isn't lost.
> @@ -1316,6 +1334,9 @@ bond_ethdev_tx_burst_broadcast(void *queue, struct rte_mbuf **bufs,
>   	if (num_of_slaves < 1)
>   		return 0;
>   
> +	/* It is rare to bond different PMDs together, so just call tx-prepare once */
> +	nb_pkts = rte_eth_tx_prepare(slaves[0], bd_tx_q->queue_id, bufs, nb_pkts);
> +
>   	/* Increment reference count on mbufs */
>   	for (i = 0; i < nb_pkts; i++)
>   		rte_pktmbuf_refcnt_update(bufs[i], num_of_slaves - 1);
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index d43a638aff..e92139f105 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -6095,6 +6095,10 @@ uint16_t rte_eth_call_tx_callbacks(uint16_t port_id, uint16_t queue_id,
>    * @see rte_eth_tx_prepare to perform some prior checks or adjustments
>    * for offloads.
>    *
> + * @note This function must not modify mbufs (including packets data) unless
> + * the refcnt is 1. The exception is the bonding PMD, which does not have
> + * tx-prepare function, in this case, mbufs may be modified.
> + *
>    * @param port_id
>    *   The port identifier of the Ethernet device.
>    * @param queue_id
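
For readers following the hunks above, the repeated change can be reduced to one pattern: run the prepare step on the slave first, then hand only the packets that passed to the burst call. The sketch below is a self-contained mock, not the bonding PMD code. `struct mock_pkt`, `mock_tx_prepare()`, `mock_tx_burst()` and `prepare_then_burst()` are hypothetical stand-ins for `struct rte_mbuf`, `rte_eth_tx_prepare()` and `rte_eth_tx_burst()`, and the assumption that prepare stops at the first packet it rejects is modelled explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf; only what the sketch needs. */
struct mock_pkt {
	int offload_ok; /* 0: the mock prepare step rejects this packet */
	int prepared;   /* set to 1 once mock_tx_prepare() has accepted it */
};

/*
 * Mock of the prepare step (assumption: like rte_eth_tx_prepare(), it
 * returns the number of leading packets that are ready to send and stops
 * at the first packet it cannot fix up).
 */
static uint16_t
mock_tx_prepare(struct mock_pkt **pkts, uint16_t nb_pkts)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++) {
		if (!pkts[i]->offload_ok)
			break;
		pkts[i]->prepared = 1;
	}
	return i;
}

/* Mock of the burst step: the queue accepts at most `room` packets. */
static uint16_t
mock_tx_burst(struct mock_pkt **pkts, uint16_t nb_pkts, uint16_t room)
{
	(void)pkts;
	return nb_pkts < room ? nb_pkts : room;
}

/* The pattern from the patch: prepare first, then send only what passed. */
static uint16_t
prepare_then_burst(struct mock_pkt **pkts, uint16_t nb_pkts, uint16_t room)
{
	uint16_t nb_prep = mock_tx_prepare(pkts, nb_pkts);

	assert(nb_prep <= nb_pkts);
	return mock_tx_burst(pkts, nb_prep, room);
}
```

With this shape the caller cannot tell a prepare rejection from a full queue: both show up as a return value smaller than the number of packets offered, which is why the existing "move unsent packets to the end of bufs" handling in the hunks keeps working unchanged.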
* Re: [PATCH v5] net/bonding: call Tx prepare before Tx burst
  2022-10-15 15:26       ` Chas Williams
@ 2022-10-18 14:25         ` fengchengwen
  2022-10-20  7:07         ` Andrew Rybchenko
  1 sibling, 0 replies; 61+ messages in thread
From: fengchengwen @ 2022-10-18 14:25 UTC (permalink / raw)
  To: Chas Williams, thomas, ferruh.yigit, Andrew Rybchenko
  Cc: dev, chas3, humin29, konstantin.ananyev
Hi Thomas, Ferruh and Andrew
   This patch has already been reviewed by Humin and Chas. Could it be accepted in
22.11?
Thanks
On 2022/10/15 23:26, Chas Williams wrote:
> This looks fine. Thanks for making the changes!
>
> Signed-off-by: Chas Williams <3chas3@gmail.com>
>
> On 10/11/22 09:20, Chengwen Feng wrote:
>> Normally, to use the HW offloads capability (e.g. checksum and TSO) in
>> the Tx direction, the application needs to call rte_eth_tx_prepare() to
>> do some adjustment with the packets before sending them. But the
>> tx_prepare callback of the bonding driver is not implemented. Therefore,
>> the sent packets may have errors (e.g. checksum errors).
>>
>> However, it is difficult to design the tx_prepare callback for bonding
>> driver. Because when a bonded device sends packets, the bonded device
>> allocates the packets to different slave devices based on the real-time
>> link status and bonding mode. That is, it is very difficult for the
>> bonded device to determine which slave device's prepare function should
>> be invoked.
>>
>> So in this patch, the tx_prepare callback of bonding driver is not
>> implemented. Instead, the rte_eth_tx_prepare() will be called before
>> rte_eth_tx_burst(). In this way, all tx_offloads can be processed
>> correctly for all NIC devices.
>>
>> Note: because it is rare to bond different PMDs together, tx-prepare is
>> called only once in broadcast bonding mode.
>>
>> Also the following description was added to the rte_eth_tx_burst()
>> function:
>> "@note This function must not modify mbufs (including packets data)
>> unless the refcnt is 1. The exception is the bonding PMD, which does not
>> have tx-prepare function, in this case, mbufs may be modified."
>>
>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>> Reviewed-by: Min Hu (Connor) <humin29@huawei.com>
>>
>> ---
>> v5: address Chas's comments.
>> v4: address Chas and Konstantin's comments.
>> v3: support tx-prepare when Tx internally generates mbufs.
>> v2: support tx-prepare enable flag and fail stats.
>>
* Re: [PATCH v5] net/bonding: call Tx prepare before Tx burst
  2022-10-15 15:26       ` Chas Williams
  2022-10-18 14:25         ` fengchengwen
@ 2022-10-20  7:07         ` Andrew Rybchenko
  1 sibling, 0 replies; 61+ messages in thread
From: Andrew Rybchenko @ 2022-10-20  7:07 UTC (permalink / raw)
  To: Chas Williams, Chengwen Feng, thomas, ferruh.yigit
  Cc: dev, chas3, humin29, konstantin.ananyev
On 10/15/22 18:26, Chas Williams wrote:
> On 10/11/22 09:20, Chengwen Feng wrote:
>> Normally, to use the HW offloads capability (e.g. checksum and TSO) in
>> the Tx direction, the application needs to call rte_eth_tx_prepare() to
>> do some adjustment with the packets before sending them. But the
>> tx_prepare callback of the bonding driver is not implemented. Therefore,
>> the sent packets may have errors (e.g. checksum errors).
>>
>> However, it is difficult to design the tx_prepare callback for bonding
>> driver. Because when a bonded device sends packets, the bonded device
>> allocates the packets to different slave devices based on the real-time
>> link status and bonding mode. That is, it is very difficult for the
>> bonded device to determine which slave device's prepare function should
>> be invoked.
>>
>> So in this patch, the tx_prepare callback of bonding driver is not
>> implemented. Instead, the rte_eth_tx_prepare() will be called before
>> rte_eth_tx_burst(). In this way, all tx_offloads can be processed
>> correctly for all NIC devices.
>>
>> Note: because it is rare to bond different PMDs together, tx-prepare is
>> called only once in broadcast bonding mode.
>>
>> Also the following description was added to the rte_eth_tx_burst()
>> function:
>> "@note This function must not modify mbufs (including packets data)
>> unless the refcnt is 1. The exception is the bonding PMD, which does not
>> have tx-prepare function, in this case, mbufs may be modified."
>>
>> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>> Reviewed-by: Min Hu (Connor) <humin29@huawei.com>
>>
>> ---
>> v5: address Chas's comments.
>> v4: address Chas and Konstantin's comments.
>> v3: support tx-prepare when Tx internally generates mbufs.
>> v2: support tx-prepare enable flag and fail stats.
>
> This looks fine. Thanks for making the changes!
>
> Signed-off-by: Chas Williams <3chas3@gmail.com>
Treat it as
Acked-by: Chas Williams <3chas3@gmail.com>
Applied to dpdk-next-net/main, thanks.
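
As a footnote to the broadcast-mode note in the commit message, the order of operations in that hunk can be sketched as follows: prepare is called once, against the first slave only, and only afterwards is each mbuf's reference count raised by the number of additional slaves. This is a self-contained mock under stated assumptions. `struct mock_mbuf`, `mock_prepare()` and `broadcast_prepare_once()` are hypothetical names, and the mock prepare step accepts every packet.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf. */
struct mock_mbuf {
	uint16_t refcnt; /* number of owners of this buffer */
	int prepared;    /* set to 1 once the mock prepare step has run */
};

/* Mock prepare step for the first slave; assumed to accept every packet. */
static uint16_t
mock_prepare(struct mock_mbuf **bufs, uint16_t nb_pkts)
{
	uint16_t i;

	for (i = 0; i < nb_pkts; i++)
		bufs[i]->prepared = 1;
	return nb_pkts;
}

/*
 * Order used by the broadcast hunk: bail out without slaves, prepare once,
 * then add one reference per additional slave so every slave can send the
 * same buffers.
 */
static uint16_t
broadcast_prepare_once(struct mock_mbuf **bufs, uint16_t nb_pkts,
		uint16_t num_of_slaves)
{
	uint16_t nb_prep;
	uint16_t i;

	if (num_of_slaves < 1)
		return 0;

	nb_prep = mock_prepare(bufs, nb_pkts);
	assert(nb_prep <= nb_pkts);

	for (i = 0; i < nb_prep; i++)
		bufs[i]->refcnt += num_of_slaves - 1;

	return nb_prep;
}
```

The single prepare call is only equivalent to preparing per slave when all slaves apply the same fix-ups, which is the "same PMD on every slave" assumption the commit message states.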
Thread overview: 61+ messages
2021-04-16 11:04 [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device Chengchang Tang
2021-04-16 11:04 ` [dpdk-dev] [RFC 1/2] net/bonding: add Tx prepare for bonding Chengchang Tang
2021-04-16 11:04 ` [dpdk-dev] [RFC 2/2] app/testpmd: add cmd for bonding Tx prepare Chengchang Tang
2021-04-16 11:12 ` [dpdk-dev] [RFC 0/2] add Tx prepare support for bonding device Min Hu (Connor)
2021-04-20  1:26 ` Ferruh Yigit
2021-04-20  2:44   ` Chengchang Tang
2021-04-20  8:33     ` Ananyev, Konstantin
2021-04-20 12:44       ` Chengchang Tang
2021-04-20 13:18         ` Ananyev, Konstantin
2021-04-20 14:06           ` Chengchang Tang
2021-04-23  9:46 ` [dpdk-dev] [PATCH " Chengchang Tang
2021-04-23  9:46   ` [dpdk-dev] [PATCH 1/2] net/bonding: support Tx prepare for bonding Chengchang Tang
2021-06-08  9:49     ` Andrew Rybchenko
2021-06-09  6:42       ` Chengchang Tang
2021-06-09  9:35         ` Andrew Rybchenko
2021-06-10  7:32           ` Chengchang Tang
2021-06-14 14:16             ` Andrew Rybchenko
2021-06-09 10:25         ` Ananyev, Konstantin
2021-06-10  6:46           ` Chengchang Tang
2021-06-14 11:36             ` Ananyev, Konstantin
2022-05-24 12:11       ` Min Hu (Connor)
2022-07-25  4:08     ` [PATCH v2 0/3] add Tx prepare support for bonding driver Chengwen Feng
2022-07-25  4:08       ` [PATCH v2 1/3] net/bonding: support Tx prepare Chengwen Feng
2022-09-13 10:22         ` Ferruh Yigit
2022-09-13 15:08           ` Chas Williams
2022-09-14  0:46           ` fengchengwen
2022-09-14 16:59             ` Chas Williams
2022-09-17  2:35               ` fengchengwen
2022-09-17 13:38                 ` Chas Williams
2022-09-19 14:07                   ` Konstantin Ananyev
2022-09-19 23:02                     ` Chas Williams
2022-09-22  2:12                       ` fengchengwen
2022-09-25 10:32                         ` Chas Williams
2022-09-26 10:18                       ` Konstantin Ananyev
2022-09-26 16:36                         ` Chas Williams
2022-07-25  4:08       ` [PATCH v2 2/3] net/bonding: support Tx prepare fail stats Chengwen Feng
2022-07-25  4:08       ` [PATCH v2 3/3] net/bonding: add testpmd cmd for Tx prepare Chengwen Feng
2022-07-25  7:04       ` [PATCH v2 0/3] add Tx prepare support for bonding driver humin (Q)
2022-09-13  1:41       ` fengchengwen
2022-09-17  4:15     ` [PATCH v3 " Chengwen Feng
2022-09-17  4:15       ` [PATCH v3 1/3] net/bonding: support Tx prepare Chengwen Feng
2022-09-17  4:15       ` [PATCH v3 2/3] net/bonding: support Tx prepare fail stats Chengwen Feng
2022-09-17  4:15       ` [PATCH v3 3/3] net/bonding: add testpmd cmd for Tx prepare Chengwen Feng
2022-10-09  3:36     ` [PATCH v4] net/bonding: call Tx prepare before Tx burst Chengwen Feng
2022-10-10 19:42       ` Chas Williams
2022-10-11 13:28         ` fengchengwen
2022-10-11 13:20     ` [PATCH v5] " Chengwen Feng
2022-10-15 15:26       ` Chas Williams
2022-10-18 14:25         ` fengchengwen
2022-10-20  7:07         ` Andrew Rybchenko
2021-04-23  9:46   ` [dpdk-dev] [PATCH 2/2] net/bonding: support configuring Tx offloading for bonding Chengchang Tang
2021-06-08  9:49     ` Andrew Rybchenko
2021-06-09  6:57       ` Chengchang Tang
2021-06-09  9:11         ` Ananyev, Konstantin
2021-06-09  9:37           ` Andrew Rybchenko
2021-06-10  6:29             ` Chengchang Tang
2021-06-14 11:05               ` Ananyev, Konstantin
2021-06-14 14:13                 ` Andrew Rybchenko
2021-04-30  6:26   ` [dpdk-dev] [PATCH 0/2] add Tx prepare support for bonding device Chengchang Tang
2021-04-30  6:47     ` Min Hu (Connor)
2021-06-03  1:44   ` Chengchang Tang