* [dpdk-dev] [PATCH 1/2] net/mlx5: fix Rx CQ doorbell synchronization on aarch64
@ 2019-09-05 10:55 Phil Yang
  2019-09-05 10:55 ` [dpdk-dev] [PATCH 2/2] net/mlx5: fix Tx " Phil Yang
  ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Phil Yang @ 2019-09-05 10:55 UTC (permalink / raw)
  To: yskoh, viacheslavo, matan, nelio.laranjeiro, dev
  Cc: thomas, jerinj, Honnappa.Nagarahalli, gavin.hu, nd, stable

The Rx completion queue doorbell field needs to be updated after the
last CQE is decompressed. On processors with a weaker memory model, the
compiler barrier is not sufficient to guarantee the order of these
operations, so use the coherent I/O memory barrier to make sure these
fields are updated in order.

Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM")
Cc: stable@dpdk.org

Suggested-by: Gavin Hu <gavin.hu@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx_vec_neon.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
index 9930286..e914d01 100644
--- a/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
+++ b/drivers/net/mlx5/mlx5_rxtx_vec_neon.h
@@ -727,7 +727,7 @@ rxq_burst_v(struct mlx5_rxq_data *rxq, struct rte_mbuf **pkts, uint16_t pkts_n,
 			rxq->decompressed -= n;
 		}
 	}
-	rte_compiler_barrier();
+	rte_cio_wmb();
 	*rxq->cq_db = rte_cpu_to_be_32(rxq->cq_ci);
 	return rcvd_pkt;
 }
--
2.7.4

^ permalink raw reply [flat|nested] 10+ messages in thread
* [dpdk-dev] [PATCH 2/2] net/mlx5: fix Tx CQ doorbell synchronization on aarch64
  2019-09-05 10:55 [dpdk-dev] [PATCH 1/2] net/mlx5: fix Rx CQ doorbell synchronization on aarch64 Phil Yang
@ 2019-09-05 10:55 ` Phil Yang
  2019-09-05 12:12 ` Slava Ovsiienko
  2019-09-10 7:22 ` [dpdk-dev] [PATCH 1/2] net/mlx5: fix Rx " Matan Azrad
  2019-09-12 8:29 ` Raslan Darawsheh
  2 siblings, 1 reply; 10+ messages in thread
From: Phil Yang @ 2019-09-05 10:55 UTC (permalink / raw)
  To: yskoh, viacheslavo, matan, nelio.laranjeiro, dev
  Cc: thomas, jerinj, Honnappa.Nagarahalli, gavin.hu, nd, stable

On processors with a weaker memory model, the compiler barrier is not
sufficient to guarantee that the coherent memory update is observed by
the I/O device. The coherent I/O memory barrier is needed to enforce
the ordering of the Tx completion queue doorbell operation.

Fixes: da1df1ccabad ("net/mlx5: fix completion queue drain loop")
Cc: stable@dpdk.org

Suggested-by: Gavin Hu <gavin.hu@arm.com>
Signed-off-by: Phil Yang <phil.yang@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 drivers/net/mlx5/mlx5_rxtx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c
index 4c01187..c11148b 100644
--- a/drivers/net/mlx5/mlx5_rxtx.c
+++ b/drivers/net/mlx5/mlx5_rxtx.c
@@ -2042,7 +2042,7 @@ mlx5_tx_comp_flush(struct mlx5_txq_data *restrict txq,
 	} else {
 		return;
 	}
-	rte_compiler_barrier();
+	rte_cio_wmb();
 	*txq->cq_db = rte_cpu_to_be_32(txq->cq_ci);
 	if (likely(tail != txq->elts_tail)) {
 		mlx5_tx_free_elts(txq, tail, olx);
--
2.7.4

^ permalink raw reply [flat|nested] 10+ messages in thread
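[Editorial sketch] The distinction both patches rely on, a compiler-only
barrier versus a hardware store barrier placed before the doorbell
write, can be illustrated with the minimal example below. It is not the
mlx5 code. It assumes C11 atomics as stand-ins: atomic_signal_fence()
plays the role of rte_compiler_barrier(), and
atomic_thread_fence(memory_order_release) only approximates
rte_cio_wmb(), whose real definition is architecture specific. The
names demo_cq and demo_ring_cq_doorbell are hypothetical, and the
big-endian conversion done by rte_cpu_to_be_32() is left out.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a completion queue; not the mlx5 layout. */
struct demo_cq {
	uint32_t cq_ci;           /* consumer index kept by software */
	volatile uint32_t *cq_db; /* doorbell record read by the device */
};

/* Compiler-only barrier: constrains the compiler, emits no fence. */
static inline void demo_compiler_barrier(void)
{
	atomic_signal_fence(memory_order_seq_cst);
}

/* Hardware barrier: earlier accesses are ordered before later stores. */
static inline void demo_store_barrier(void)
{
	atomic_thread_fence(memory_order_release);
}

/*
 * Advance the consumer index and publish it through the doorbell.
 * use_hw_barrier selects which of the two barriers precedes the write.
 * The real driver also converts the index to big-endian (omitted here).
 */
static uint32_t demo_ring_cq_doorbell(struct demo_cq *cq,
				      uint32_t consumed, int use_hw_barrier)
{
	cq->cq_ci += consumed;
	if (use_hw_barrier)
		demo_store_barrier();
	else
		demo_compiler_barrier();
	*cq->cq_db = cq->cq_ci;
	return cq->cq_ci;
}
```

Both variants produce the same result in a single-threaded run. The
difference is only in what the CPU may reorder: with typical compilers
the release fence becomes a barrier instruction on aarch64, while the
signal fence emits no instruction on any architecture.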
* Re: [dpdk-dev] [PATCH 2/2] net/mlx5: fix Tx CQ doorbell synchronization on aarch64
  2019-09-05 10:55 ` [dpdk-dev] [PATCH 2/2] net/mlx5: fix Tx " Phil Yang
@ 2019-09-05 12:12 ` Slava Ovsiienko
  2019-09-06 7:20 ` Phil Yang (Arm Technology China)
  0 siblings, 1 reply; 10+ messages in thread
From: Slava Ovsiienko @ 2019-09-05 12:12 UTC (permalink / raw)
  To: Phil Yang, Yongseok Koh, Matan Azrad, Nélio Laranjeiro, dev
  Cc: Thomas Monjalon, jerinj, Honnappa.Nagarahalli, gavin.hu, nd, stable

Hi, Phil

This point is in the datapath, and performance is very critical.
The rte_cio_wmb() may take a lot of CPU cycles, waiting until all
previous writes become visible to all agents external to the core.
The Tx CQE doorbelling does not need any writes to other locations to
be completed; the only concern is not to reorder/merge the writes to
the same doorbell register of the same sending queue in the tx_burst()
internal sending loop/subsequent calls.

As far as I know, writes to the same location should not be reordered
by any arch (they may be merged if memory settings allow this, which is
not critical for the CQE doorbell). Could you please explain why we
need an explicit hardware fence before the CQE doorbell update? Do you
think the doorbell write might be reordered with previous reads from
the ring buffer?

WBR,
Slava

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] net/mlx5: fix Tx CQ doorbell synchronization on aarch64
  2019-09-05 12:12 ` Slava Ovsiienko
@ 2019-09-06 7:20 ` Phil Yang (Arm Technology China)
  2019-09-06 12:26 ` Slava Ovsiienko
  0 siblings, 1 reply; 10+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-09-06 7:20 UTC (permalink / raw)
  To: Slava Ovsiienko, yskoh, Matan Azrad, Nélio Laranjeiro, dev
  Cc: thomas, jerinj, Honnappa Nagarahalli, Gavin Hu (Arm Technology China), nd, stable, nd

Hi, Slava

Thanks for your comments.

> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@mellanox.com>
> Sent: Thursday, September 5, 2019 8:12 PM
>
> Hi, Phil
>
> This point is in the datapath, and performance is very critical.
> The rte_cio_wmb() may take a lot of CPU cycles, waiting until all
> previous writes become visible to all agents external to the core.
> The Tx CQE doorbelling does not need any writes to other locations to
> be completed;

In my understanding, the PMD needs to wait until all txq field updates
are completed and then ring the doorbell for the HW. Before the Tx CQE
doorbelling, it updates the producer index of the work queue in the Tx
queue descriptor (at line 2037). The compiler barrier cannot guarantee
the ordering of these operations, so an explicit HW fence is used to
achieve that.

This is the same as the HW Tx doorbell in the vectorized Tx burst
routine, which uses a write memory barrier to make the register update
visible to the HW immediately. See Section 32.5.2 in
https://doc.dpdk.org/guides/nics/mlx5.html

> the only concern is not to reorder/merge the writes to the same
> doorbell register of the same sending queue in the tx_burst()
> internal sending loop/subsequent calls.
>
> As far as I know, writes to the same location should not be reordered
> by any arch (they may be merged if memory settings allow this, which
> is not critical for the CQE doorbell). Could you please explain why
> we need an explicit hardware fence before the CQE doorbell update? Do
> you think the doorbell write might be reordered with previous reads
> from the ring buffer?
>
> WBR,
> Slava

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] net/mlx5: fix Tx CQ doorbell synchronization on aarch64
  2019-09-06 7:20 ` Phil Yang (Arm Technology China)
@ 2019-09-06 12:26 ` Slava Ovsiienko
  2019-09-09 10:12 ` Phil Yang (Arm Technology China)
  0 siblings, 1 reply; 10+ messages in thread
From: Slava Ovsiienko @ 2019-09-06 12:26 UTC (permalink / raw)
  To: Phil Yang (Arm Technology China), Yongseok Koh, Matan Azrad, Nélio Laranjeiro, dev
  Cc: Thomas Monjalon, jerinj, Honnappa Nagarahalli, Gavin Hu (Arm Technology China), nd, stable, nd

Hi, Phil

Thanks for the explanations; please see below.

> -----Original Message-----
> From: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>
> Sent: Friday, September 6, 2019 10:20
>
> In my understanding, the PMD needs to wait until all txq field
> updates are completed and then ring the doorbell for the HW.
> Before the Tx CQE doorbelling, it updates the producer index of the
> work queue in the Tx queue descriptor (at line 2037).

txq->wqe_pi is an exclusively software field, not related to HW
directly. We should not wait for write completions to this one
(assuming tx_burst() must be called with strict affinity settings and
the core can't be changed).

There may be some concern about reading from "last_cqe->wqe_counter"
at line 2037. The compiler barrier was implemented to guarantee this
read is issued before the doorbell write.

As for possible reordering of these operations (read index from CQE at
2037 and write to CQ doorbell register at 2046):

a) The read is performed from an already cached area (we touched this
   CQE performing the ownership check very recently), so it is quite
   unlikely to be completed after the doorbell write.

b) The only risk of reading wrong data is the case of the CQE being
   overwritten by HW on CQ buffer overflow. We create the CQ ring
   buffer with some extra space, so completions which are "in-flight"
   can't overwrite the CQE being read.

The new completion request may be issued by setting flags in WQE
descriptors and the following SQ doorbell write, which is already
preceded by a wmb (in mlx5_tx_dbrec_cond_wmb(), line 4733). So, it
seems there is no chance for the CQE to be overwritten.

> The compiler barrier cannot guarantee the ordering of these
> operations, so an explicit HW fence is used to achieve that.
>
> This is the same as the HW Tx doorbell in the vectorized Tx burst
> routine, which uses a write memory barrier to make the register
> update visible to the HW immediately. See Section 32.5.2 in
> https://doc.dpdk.org/guides/nics/mlx5.html

This is quite a different case. The PMD builds descriptors (WQEs) in
memory and must guarantee these data are visible to external agents
before SQ (sending queue, not completion queue) doorbelling. Now there
are no vectorized Tx routines (since 19.08), but, of course, we still
have the "true" write memory barrier (in mlx5_tx_dbrec_cond_wmb) for
this case.

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] net/mlx5: fix Tx CQ doorbell synchronization on aarch64
  2019-09-06 12:26 ` Slava Ovsiienko
@ 2019-09-09 10:12 ` Phil Yang (Arm Technology China)
  2019-09-09 11:29 ` Slava Ovsiienko
  0 siblings, 1 reply; 10+ messages in thread
From: Phil Yang (Arm Technology China) @ 2019-09-09 10:12 UTC (permalink / raw)
  To: Slava Ovsiienko, yskoh, Matan Azrad, Nélio Laranjeiro, dev
  Cc: thomas, jerinj, Honnappa Nagarahalli, Gavin Hu (Arm Technology China), nd, stable, Steve Capper, nd

> -----Original Message-----
> From: Slava Ovsiienko <viacheslavo@mellanox.com>
> Sent: Friday, September 6, 2019 8:27 PM
>
> Hi, Phil
>
> Thanks for the explanations; please see below.
>
> txq->wqe_pi is an exclusively software field, not related to HW
> directly. We should not wait for write completions to this one
> (assuming tx_burst() must be called with strict affinity settings and
> the core can't be changed).

Understood, I really appreciate your explanation.
Could you please review the 1/2 patch in this series? All your comments
are welcome.

> There may be some concern about reading from "last_cqe->wqe_counter"
> at line 2037. The compiler barrier was implemented to guarantee this
> read is issued before the doorbell write.
>
> As for possible reordering of these operations (read index from CQE
> at 2037 and write to CQ doorbell register at 2046):
>
> a) The read is performed from an already cached area (we touched this
>    CQE performing the ownership check very recently), so it is quite
>    unlikely to be completed after the doorbell write.

Yes. The "last_cqe->wqe_counter" is cached. However, that does not
guarantee the cached data is valid when the CPU issues the read
operation. Because mlx5_cqe is in coherent memory and is shared with
the HW, an update of any other field in the CQE will invalidate the
whole cache line. In that case, it is possible for the doorbell write
to complete before the wqe_counter read has completed.
So we might need a rte_cio_rmb() here, right?

> b) The only risk of reading wrong data is the case of the CQE being
>    overwritten by HW on CQ buffer overflow. We create the CQ ring
>    buffer with some extra space, so completions which are "in-flight"
>    can't overwrite the CQE being read.
>
> The new completion request may be issued by setting flags in WQE
> descriptors and the following SQ doorbell write, which is already
> preceded by a wmb (in mlx5_tx_dbrec_cond_wmb(), line 4733). So, it
> seems there is no chance for the CQE to be overwritten.

^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH 2/2] net/mlx5: fix Tx CQ doorbell synchronization on aarch64 2019-09-09 10:12 ` Phil Yang (Arm Technology China) @ 2019-09-09 11:29 ` Slava Ovsiienko 2019-09-10 9:22 ` Phil Yang (Arm Technology China) 0 siblings, 1 reply; 10+ messages in thread From: Slava Ovsiienko @ 2019-09-09 11:29 UTC (permalink / raw) To: Phil Yang (Arm Technology China), Matan Azrad, Nélio Laranjeiro, dev Cc: Thomas Monjalon, jerinj, Honnappa Nagarahalli, Gavin Hu (Arm Technology China), nd, stable, Steve Capper, nd > -----Original Message----- > From: Phil Yang (Arm Technology China) <Phil.Yang@arm.com> > Sent: Monday, September 9, 2019 13:12 > To: Slava Ovsiienko <viacheslavo@mellanox.com>; Yongseok Koh > <yskoh@mellanox.com>; Matan Azrad <matan@mellanox.com>; Nélio > Laranjeiro <nelio.laranjeiro@6wind.com>; dev@dpdk.org > Cc: Thomas Monjalon <thomas@monjalon.net>; jerinj@marvell.com; > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Gavin Hu (Arm > Technology China) <Gavin.Hu@arm.com>; nd <nd@arm.com>; > stable@dpdk.org; Steve Capper <Steve.Capper@arm.com>; nd > <nd@arm.com> > Subject: RE: [PATCH 2/2] net/mlx5: fix Tx CQ doorbell synchronization on > aarch64 > > > -----Original Message----- > > From: Slava Ovsiienko <viacheslavo@mellanox.com> > > Sent: Friday, September 6, 2019 8:27 PM > > To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>; > > yskoh@mellanox.com; Matan Azrad <matan@mellanox.com>; Nélio > Laranjeiro > > <nelio.laranjeiro@6wind.com>; dev@dpdk.org > > Cc: thomas@monjalon.net; jerinj@marvell.com; Honnappa Nagarahalli > > <Honnappa.Nagarahalli@arm.com>; Gavin Hu (Arm Technology China) > > <Gavin.Hu@arm.com>; nd <nd@arm.com>; stable@dpdk.org; nd > <nd@arm.com> > > Subject: RE: [PATCH 2/2] net/mlx5: fix Tx CQ doorbell synchronization > > on > > aarch64 > > > > Hi, Phil > > > > Thanks for explanations, please, see below. 
> > > > > -----Original Message----- > > > From: Phil Yang (Arm Technology China) <Phil.Yang@arm.com> > > > Sent: Friday, September 6, 2019 10:20 > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>; Yongseok Koh > > > <yskoh@mellanox.com>; Matan Azrad <matan@mellanox.com>; Nélio > > > Laranjeiro <nelio.laranjeiro@6wind.com>; dev@dpdk.org > > > Cc: Thomas Monjalon <thomas@monjalon.net>; jerinj@marvell.com; > > > Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Gavin Hu > (Arm > > > Technology China) <Gavin.Hu@arm.com>; nd <nd@arm.com>; > > > stable@dpdk.org; nd <nd@arm.com> > > > Subject: RE: [PATCH 2/2] net/mlx5: fix Tx CQ doorbell > > > synchronization on > > > aarch64 > > > > > > Hi, Slava > > > > > > Thanks for your comments. > > > > > > > -----Original Message----- > > > > From: Slava Ovsiienko <viacheslavo@mellanox.com> > > > > Sent: Thursday, September 5, 2019 8:12 PM > > > > To: Phil Yang (Arm Technology China) <Phil.Yang@arm.com>; > > > > yskoh@mellanox.com; Matan Azrad <matan@mellanox.com>; Nélio > > > Laranjeiro > > > > <nelio.laranjeiro@6wind.com>; dev@dpdk.org > > > > Cc: thomas@monjalon.net; jerinj@marvell.com; Honnappa Nagarahalli > > > > <Honnappa.Nagarahalli@arm.com>; Gavin Hu (Arm Technology China) > > > > <Gavin.Hu@arm.com>; nd <nd@arm.com>; stable@dpdk.org > > > > Subject: RE: [PATCH 2/2] net/mlx5: fix Tx CQ doorbell > > > > synchronization on > > > > aarch64 > > > > > > > > Hi, Phil > > > > > > > > This point is in datapath and performance is very critical. > > > > The rte_cio_wmb() may take a lot of CPU cycles, waiting till all > > > > previous writes become visible for all external (relating to core) agents. > > > > The Tx CQE doorbelling does not need any writes to other locations > > > > to be completed, > > > > > > In my understanding, the PMD needs to wait till all txq fields > > > update is completed then ring the doorbell for HW. 
> > > Before the Tx CQE doorbelling, it will update the producer index of > > > work queue in Tx queue descriptor (at line 2037). > > > > txq->wqe_pi is exclusively software field, not related to HW directly. > > We should not wait for write completions to this one (assuming the > > tx_burst() must be called with strict affinity settings and core can't be > changed). > > Understood, I really appreciate your explanation. > Could you please review the 1/2 patch in this series? All your comments are > welcomed. I could, but there is more reliable way - I asked the more Rx datapath experienced guy to do it. If we find he is too busy, I'll review the rx related part by myself. > > > > There may be some concern about reading from "last_cqe->wqe_counter" > > at the line 2037. The compiler barrier was implemented to guarantee > > this read is issued before doorbell write. > > > > As for possible reordering these operations (read index from CQE at > > 2037 and write to CQ doorbell register at 2046): > > > > a) read is performed from already cached area (we touched this CQE > > performing ownership check very recently) so it is quite unlikely to > > be completed after the doorbell write > > Yes. The "last_cqe->we_counter " is cached. However, it might not > guarantee the cached data is valid when CPU issues the read operation. > Because mlx5_cqe is in the coherent memory and is shared with the HW, so > updating of any other filed in CQE will invalid the whole cache line. > In that case, it is possible to complete the doorbell write before the > wqe_counter read completed. > So we might need a rte_cio_rmb() here, right? Due to b) bullet below in my earlier reply, it does not matter whether read from CQE is reordered with write to CQ doorbell or not. 
CQE being read can't be overwritten (by HW) by issued completion requests (already issued to HW in sending queue, we could say these requests are "in flight" now) - completion queue is large enough to store all of them without overwriting this last CQE being read in handle completion. New completion request (within WQEs) is issued with write to SQ doorbell, which actually is prepended with "true" rte_cio_wmb(). Actually, we could drop the compiler barrier ever, but we do not dare 😊. It is not bad to have some inexpensive (or ever free of charge) insurance from unexpected code optimization done by compiler after some refactoring (porting to new platform, etc). WBR, Slava > > > > > b) The only risk to read wrong data is the case of CQE overwriting by > > HW with CQ buffer overflow. We create the CQ ring buffer with some > > extra space, so completions which are "in-flight" can't overwrite the > > CQE is being read. > > > > The new completion request may be issued by setting flags in WQE > > descriptors and following SQ doorbell write, which is already > > prepended by wmb. > > (in mlx5_tx_dbrec_cond_wmb(), line 4733). So, it seems there is no > > chance for CQE to be overwritten. > > > > > The compiler barrier cannot guarantee the ordering of these > > > operations. So use the explicit HW fence to achieve that. > > > > > > As same as the HW Tx doorbell in vectorized Tx burst routine, it > > > uses a write memory barrier to enforce the register update visible to HW > immediately. > > > Section 32.5.2 in > > > > > https://doc.d > > > > > > pdk.org%2Fguides%2Fnics%2Fmlx5.html&data=02%7C01%7Cviacheslav > > > > o%40mellanox.com%7C76938b08a9f145c4a0dd08d7329ab932%7Ca652971 > > > > c7d2e4d9ba6a4d149256f461b%7C0%7C1%7C637033512428674501&sd > > > > > > ata=8tdVjY0%2FHOUFo1%2BeHiuqkPadSS%2FHLeo4b97gdgEHgME%3D&am > > > p;reserved=0 > > > > This is quite different case. 
PMD build descriptors (WQEs) in the > > memory and must guarantee these data are visible for external agents > > before SQ (sending queue, not completion queue) doorbelling. Now there > > are no vectorized Tx routines (since 19.08), but, of course, we still > > have the "true" write memory barrier (in > > mlx5_tx_dbrec_cond_wmb) > > for this case. > > > > > > > > > the only concern is not to reorder/merge the writes to the same > > > > doorbell register of the same sending queue in the tx_burst() > > > > internal > > > sending loop/subsequent calls. > > > > > > > > As far as I know - the writes to the same location should not be > > > > reordered by any arch (may be merged if memory settings allow > > > > this, it is not critical for CQE doorbell), could you, please, > > > > explain why we need explicit hardware fence before CQE doorbell > > > > update? Do you think doorbell write might be rearranged with > > > > previously reads from the ring > > > buffer? > > > > > > > > WBR, > > > > Slava > > > > > > > > > -----Original Message----- > > > > > From: Phil Yang <phil.yang@arm.com> > > > > > Sent: Thursday, September 5, 2019 13:55 > > > > > To: Yongseok Koh <yskoh@mellanox.com>; Slava Ovsiienko > > > > > <viacheslavo@mellanox.com>; Matan Azrad > <matan@mellanox.com>; > > > > Nélio > > > > > Laranjeiro <nelio.laranjeiro@6wind.com>; dev@dpdk.org > > > > > Cc: Thomas Monjalon <thomas@monjalon.net>; jerinj@marvell.com; > > > > > Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com; > nd@arm.com; > > > > > stable@dpdk.org > > > > > Subject: [PATCH 2/2] net/mlx5: fix Tx CQ doorbell > > > > > synchronization on > > > > > aarch64 > > > > > > > > > > For the weaker memory model processors, the compiler barrier is > > > > > not sufficient to guarantee the coherent memory update be > > > > > observed by I/O device. It needs the coherent I/O memory barrier > > > > > to enforce the ordering > > > > of > > > > > Tx completion queue doorbell operation. 
> > > > > > > > > > Fixes: da1df1ccabad ("net/mlx5: fix completion queue drain > > > > > loop") > > > > > Cc: stable@dpdk.org > > > > > > > > > > Suggested-by: Gavin Hu <gavin.hu@arm.com> > > > > > Signed-off-by: Phil Yang <phil.yang@arm.com> > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com> > > > > > --- > > > > > drivers/net/mlx5/mlx5_rxtx.c | 2 +- > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > diff --git a/drivers/net/mlx5/mlx5_rxtx.c > > > > > b/drivers/net/mlx5/mlx5_rxtx.c index 4c01187..c11148b 100644 > > > > > --- a/drivers/net/mlx5/mlx5_rxtx.c > > > > > +++ b/drivers/net/mlx5/mlx5_rxtx.c > > > > > @@ -2042,7 +2042,7 @@ mlx5_tx_comp_flush(struct mlx5_txq_data > > > > > *restrict txq, > > > > > } else { > > > > > return; > > > > > } > > > > > - rte_compiler_barrier(); > > > > > + rte_cio_wmb(); > > > > > *txq->cq_db = rte_cpu_to_be_32(txq->cq_ci); > > > > > if (likely(tail != txq->elts_tail)) { > > > > > mlx5_tx_free_elts(txq, tail, olx); > > > > > -- > > > > > 2.7.4 ^ permalink raw reply [flat|nested] 10+ messages in thread
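Bullet b) in the reply above rests on simple ring arithmetic, which can be sketched as follows. This is a toy model, not mlx5 code: the function name, the power-of-two ring size and the assumption that hardware writes at most max_inflight further CQEs starting at slot cq_ci are all made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model (hypothetical, not taken from the mlx5 PMD).
 * The completion ring has cq_size slots (power of two). The consumer has
 * advanced to cq_ci and is still reading the previous slot, (cq_ci - 1).
 * Hardware may write at most max_inflight further CQEs, starting at cq_ci.
 * Returns true if one of those writes could land on the slot being read.
 */
static bool
toy_last_cqe_may_be_overwritten(uint32_t cq_size, uint32_t cq_ci,
				uint32_t max_inflight)
{
	assert(cq_size != 0 && (cq_size & (cq_size - 1)) == 0);

	const uint32_t mask = cq_size - 1;
	const uint32_t last = (cq_ci - 1) & mask; /* slot still being read */

	for (uint32_t i = 0; i < max_inflight; i++) {
		if (((cq_ci + i) & mask) == last)
			return true;
	}
	return false;
}
```

In this model the slot being read is safe whenever max_inflight is smaller than cq_size, independent of when the doorbell store becomes visible. That is the sense in which the ordering of the CQE read against the doorbell write does not matter here.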
* Re: [dpdk-dev] [PATCH 2/2] net/mlx5: fix Tx CQ doorbell synchronization on aarch64 2019-09-09 11:29 ` Slava Ovsiienko @ 2019-09-10 9:22 ` Phil Yang (Arm Technology China) 0 siblings, 0 replies; 10+ messages in thread From: Phil Yang (Arm Technology China) @ 2019-09-10 9:22 UTC (permalink / raw) To: Slava Ovsiienko, Matan Azrad, Nélio Laranjeiro, dev Cc: thomas, jerinj, Honnappa Nagarahalli, Gavin Hu (Arm Technology China), nd, stable, Steve Capper, nd, nd Hi, Slava Thank you for taking the time to review this patch. I agree with you. The current design is very comprehensive. I will abandon this patch. Thanks, Phil Yang ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] net/mlx5: fix Rx CQ doorbell synchronization on aarch64 2019-09-05 10:55 [dpdk-dev] [PATCH 1/2] net/mlx5: fix Rx CQ doorbell synchronization on aarch64 Phil Yang 2019-09-05 10:55 ` [dpdk-dev] [PATCH 2/2] net/mlx5: fix Tx " Phil Yang @ 2019-09-10 7:22 ` Matan Azrad 2019-09-12 8:29 ` Raslan Darawsheh 2 siblings, 0 replies; 10+ messages in thread From: Matan Azrad @ 2019-09-10 7:22 UTC (permalink / raw) To: Phil Yang, Yongseok Koh, Slava Ovsiienko, Nélio Laranjeiro, dev Cc: Thomas Monjalon, jerinj, Honnappa.Nagarahalli, gavin.hu, nd, stable From: Phil Yang > Subject: [PATCH 1/2] net/mlx5: fix Rx CQ doorbell synchronization on aarch64 > > The Rx completion queue doorbell field needs to be updated after the last > CQE decompressed. For the weaker memory model processors, the compiler > barrier is not sufficient to guarantee the order of these operations, so use > the coherent I/O memory barrier to make sure these fields are updated in > order. > > Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM") > Cc: stable@dpdk.org > > Suggested-by: Gavin Hu <gavin.hu@arm.com> > Signed-off-by: Phil Yang <phil.yang@arm.com> > Reviewed-by: Gavin Hu <gavin.hu@arm.com> Acked-by: Matan Azrad <matan@mellanox.com> ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [dpdk-dev] [PATCH 1/2] net/mlx5: fix Rx CQ doorbell synchronization on aarch64 2019-09-05 10:55 [dpdk-dev] [PATCH 1/2] net/mlx5: fix Rx CQ doorbell synchronization on aarch64 Phil Yang 2019-09-05 10:55 ` [dpdk-dev] [PATCH 2/2] net/mlx5: fix Tx " Phil Yang 2019-09-10 7:22 ` [dpdk-dev] [PATCH 1/2] net/mlx5: fix Rx " Matan Azrad @ 2019-09-12 8:29 ` Raslan Darawsheh 2 siblings, 0 replies; 10+ messages in thread From: Raslan Darawsheh @ 2019-09-12 8:29 UTC (permalink / raw) To: Phil Yang, Yongseok Koh, Slava Ovsiienko, Matan Azrad, Nélio Laranjeiro, dev Cc: Thomas Monjalon, jerinj, Honnappa.Nagarahalli, gavin.hu, nd, stable Hi, > -----Original Message----- > From: dev <dev-bounces@dpdk.org> On Behalf Of Phil Yang > Sent: Thursday, September 5, 2019 1:55 PM > To: Yongseok Koh <yskoh@mellanox.com>; Slava Ovsiienko > <viacheslavo@mellanox.com>; Matan Azrad <matan@mellanox.com>; Nélio > Laranjeiro <nelio.laranjeiro@6wind.com>; dev@dpdk.org > Cc: Thomas Monjalon <thomas@monjalon.net>; jerinj@marvell.com; > Honnappa.Nagarahalli@arm.com; gavin.hu@arm.com; nd@arm.com; > stable@dpdk.org > Subject: [dpdk-dev] [PATCH 1/2] net/mlx5: fix Rx CQ doorbell > synchronization on aarch64 > > The Rx completion queue doorbell field needs to be updated after > the last CQE decompressed. For the weaker memory model processors, > the compiler barrier is not sufficient to guarantee the order of > these operations, so use the coherent I/O memory barrier to make > sure these fields are updated in order. > > Fixes: 570acdb1da8a ("net/mlx5: add vectorized Rx/Tx burst for ARM") > Cc: stable@dpdk.org > > Suggested-by: Gavin Hu <gavin.hu@arm.com> > Signed-off-by: Phil Yang <phil.yang@arm.com> > Reviewed-by: Gavin Hu <gavin.hu@arm.com> > --- Patch applied to next-net-mlx, Kindest regards, Raslan Darawsheh ^ permalink raw reply [flat|nested] 10+ messages in thread
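For readers without the DPDK tree at hand, the difference between the two barriers in the applied Rx patch can be sketched in portable C11. Everything prefixed toy_ is a hypothetical stand-in: toy_compiler_barrier() plays the role of rte_compiler_barrier(), and toy_cio_wmb() only approximates rte_cio_wmb() with a release fence. A real driver must use the DPDK macros, since a C11 fence is not guaranteed to order stores as seen by a device.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Stops compiler reordering only; emits no CPU instruction. */
#define toy_compiler_barrier() atomic_signal_fence(memory_order_seq_cst)
/* Assumed approximation of a coherent I/O write barrier (see note above). */
#define toy_cio_wmb() atomic_thread_fence(memory_order_release)

/* Hypothetical, heavily reduced Rx queue state. */
struct toy_rxq {
	uint32_t cq_ci;           /* consumer index, CPU byte order */
	volatile uint32_t *cq_db; /* doorbell record, big-endian */
};

/* Portable CPU-to-big-endian conversion for a 32-bit value. */
static uint32_t
toy_cpu_to_be_32(uint32_t v)
{
	const uint8_t b[4] = {
		(uint8_t)(v >> 24), (uint8_t)(v >> 16),
		(uint8_t)(v >> 8), (uint8_t)v
	};
	uint32_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

/* Advance the consumer index, then publish it through the doorbell. */
static void
toy_rx_ring_cq_doorbell(struct toy_rxq *rxq, uint32_t consumed)
{
	assert(rxq != NULL && rxq->cq_db != NULL);

	rxq->cq_ci += consumed;
	/* Order all earlier stores before the doorbell store. */
	toy_cio_wmb();
	*rxq->cq_db = toy_cpu_to_be_32(rxq->cq_ci);
}
```

With toy_compiler_barrier() in place of toy_cio_wmb(), the program order is kept in the generated code, but a weakly ordered CPU would still be free to make the doorbell store visible before the earlier stores.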