* [dpdk-dev] [PATCH] net/mlx5: add Rx checksum offload flag return bad @ 2021-03-22 15:46 Jiawei Zhu 2021-03-22 16:26 ` Jiawei Zhu 2021-03-23 16:33 ` Slava Ovsiienko 0 siblings, 2 replies; 6+ messages in thread From: Jiawei Zhu @ 2021-03-22 15:46 UTC (permalink / raw) To: dev; +Cc: zhujiawei12, matan, shahafs, viacheslavo From: Jiawei Zhu <zhujiawei12@huawei.com> When open the rx checksum offload and receive the wrong checksum, add the ol_flags return bad. And it's not best to use multiplication and division here. Signed-off-by: Jiawei Zhu <zhujiawei12@huawei.com> --- drivers/net/mlx5/mlx5_rxtx.c | 17 ++++++++++------- drivers/net/mlx5/mlx5_utils.h | 6 ------ 2 files changed, 10 insertions(+), 13 deletions(-) diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c index e3ce9fd..9233af8 100644 --- a/drivers/net/mlx5/mlx5_rxtx.c +++ b/drivers/net/mlx5/mlx5_rxtx.c @@ -1325,13 +1325,16 @@ enum mlx5_txcmp_code { uint32_t ol_flags = 0; uint16_t flags = rte_be_to_cpu_16(cqe->hdr_type_etc); - ol_flags = - TRANSPOSE(flags, - MLX5_CQE_RX_L3_HDR_VALID, - PKT_RX_IP_CKSUM_GOOD) | - TRANSPOSE(flags, - MLX5_CQE_RX_L4_HDR_VALID, - PKT_RX_L4_CKSUM_GOOD); + if (flags & MLX5_CQE_RX_L3_HDR_VALID) + ol_flags |= PKT_RX_IP_CKSUM_GOOD; + else + ol_flags |= PKT_RX_IP_CKSUM_BAD; + + if (flags & MLX5_CQE_RX_L4_HDR_VALID) + ol_flags |= PKT_RX_IP_CKSUM_GOOD; + else + ol_flags |= PKT_RX_IP_CKSUM_BAD; + return ol_flags; } diff --git a/drivers/net/mlx5/mlx5_utils.h b/drivers/net/mlx5/mlx5_utils.h index 7a62187..2f71a23 100644 --- a/drivers/net/mlx5/mlx5_utils.h +++ b/drivers/net/mlx5/mlx5_utils.h @@ -44,12 +44,6 @@ #define NB_SEGS(m) ((m)->nb_segs) #define PORT(m) ((m)->port) -/* Transpose flags. Useful to convert IBV to DPDK flags. */ -#define TRANSPOSE(val, from, to) \ - (((from) >= (to)) ? \ - (((val) & (from)) / ((from) / (to))) : \ - (((val) & (from)) * ((to) / (from)))) - /* * For the case which data is linked with sequence increased index, the * array table will be more efficiect than hash table once need to serarch -- 1.8.3.1 ^ permalink raw reply [flat|nested] 6+ messages in thread
* [dpdk-dev] [PATCH] net/mlx5: add Rx checksum offload flag return bad 2021-03-22 15:46 [dpdk-dev] [PATCH] net/mlx5: add Rx checksum offload flag return bad Jiawei Zhu @ 2021-03-22 16:26 ` Jiawei Zhu 2021-03-23 16:33 ` Slava Ovsiienko 1 sibling, 0 replies; 6+ messages in thread From: Jiawei Zhu @ 2021-03-22 16:26 UTC (permalink / raw) To: dev; +Cc: zhujiawei12, matan, shahafs, viacheslavo From: Jiawei Zhu <zhujiawei12@huawei.com> When open the rx checksum offload and receive the wrong checksum, add the ol_flags return bad. And it's not best to use multiplication and division here. Signed-off-by: Jiawei Zhu <zhujiawei12@huawei.com> --- drivers/net/mlx5/mlx5_rxtx.c | 17 ++++++++++------- drivers/net/mlx5/mlx5_utils.h | 6 ------ 2 files changed, 10 insertions(+), 13 deletions(-) diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c index e3ce9fd..dcc0ebc 100644 --- a/drivers/net/mlx5/mlx5_rxtx.c +++ b/drivers/net/mlx5/mlx5_rxtx.c @@ -1325,13 +1325,16 @@ enum mlx5_txcmp_code { uint32_t ol_flags = 0; uint16_t flags = rte_be_to_cpu_16(cqe->hdr_type_etc); - ol_flags = - TRANSPOSE(flags, - MLX5_CQE_RX_L3_HDR_VALID, - PKT_RX_IP_CKSUM_GOOD) | - TRANSPOSE(flags, - MLX5_CQE_RX_L4_HDR_VALID, - PKT_RX_L4_CKSUM_GOOD); + if (flags & MLX5_CQE_RX_L3_HDR_VALID) + ol_flags |= PKT_RX_IP_CKSUM_GOOD; + else + ol_flags |= PKT_RX_IP_CKSUM_BAD; + + if (flags & MLX5_CQE_RX_L4_HDR_VALID) + ol_flags |= PKT_RX_L4_CKSUM_GOOD; + else + ol_flags |= PKT_RX_L4_CKSUM_BAD; + return ol_flags; } diff --git a/drivers/net/mlx5/mlx5_utils.h b/drivers/net/mlx5/mlx5_utils.h index 7a62187..2f71a23 100644 --- a/drivers/net/mlx5/mlx5_utils.h +++ b/drivers/net/mlx5/mlx5_utils.h @@ -44,12 +44,6 @@ #define NB_SEGS(m) ((m)->nb_segs) #define PORT(m) ((m)->port) -/* Transpose flags. Useful to convert IBV to DPDK flags. */ -#define TRANSPOSE(val, from, to) \ - (((from) >= (to)) ? \ - (((val) & (from)) / ((from) / (to))) : \ - (((val) & (from)) * ((to) / (from)))) - /* * For the case which data is linked with sequence increased index, the * array table will be more efficiect than hash table once need to serarch -- 1.8.3.1 ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [dpdk-dev] [PATCH] net/mlx5: add Rx checksum offload flag return bad 2021-03-22 15:46 [dpdk-dev] [PATCH] net/mlx5: add Rx checksum offload flag return bad Jiawei Zhu 2021-03-22 16:26 ` Jiawei Zhu @ 2021-03-23 16:33 ` Slava Ovsiienko 2021-03-24 16:22 ` Jiawei Zhu 1 sibling, 1 reply; 6+ messages in thread From: Slava Ovsiienko @ 2021-03-23 16:33 UTC (permalink / raw) To: Jiawei Zhu, dev; +Cc: zhujiawei12, Matan Azrad, Shahaf Shuler Hi, Jiawei > -----Original Message----- > From: Jiawei Zhu <17826875952@163.com> > Sent: Monday, March 22, 2021 17:46 > To: dev@dpdk.org > Cc: zhujiawei12@huawei.com; Matan Azrad <matan@nvidia.com>; Shahaf > Shuler <shahafs@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com> > Subject: [PATCH] net/mlx5: add Rx checksum offload flag return bad > > From: Jiawei Zhu <zhujiawei12@huawei.com> > > When open the rx checksum offload and receive the wrong checksum, add > the ol_flags return bad. And it's not best to use multiplication and division > here. I'm sorry, there should be no any multiplications and divisions - the arguments are constants (can be determined at compilation time) and ara power of 2, hence compiler engages simple shifts. For example (I did rxq_cq_to_ol_flags not inline to get the listing for x86-64): 29 rxq_cq_to_ol_flags: 21 /* 22 * An architecture-optimized byte swap for a 16-bit value. 23 * 24 * Do not use this function directly. The preferred function is rte_bswap16(). 25 */ 26 static inline uint16_t rte_arch_bswap16(uint16_t _x) 27 { 28 uint16_t x = _x; 29 0034 86D6 asm volatile ("xchgb %b[x1],%h[x2]" 30 : [x1] "=Q" (x) 37 38 0036 89D0 movl %edx,%eax 39 0038 6681E200 andw $512,%dx 39 02 40 003d 66250004 andw $1024,%ax 41 0041 0FB7D2 movzwl %dx,%edx 42 0044 0FB7C0 movzwl %ax,%eax 43 0047 48C1EA02 shrq $2,%rdx 44 004b 48C1E802 shrq $2,%rax 45 004f 09D0 orl %edx,%eax 46 0051 C3 ret As we can see - there is no any mul/div and no any branches, that improves the performance. > > Signed-off-by: Jiawei Zhu <zhujiawei12@huawei.com> > --- > drivers/net/mlx5/mlx5_rxtx.c | 17 ++++++++++------- > drivers/net/mlx5/mlx5_utils.h | 6 ------ > 2 files changed, 10 insertions(+), 13 deletions(-) > > diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c > index e3ce9fd..9233af8 100644 > --- a/drivers/net/mlx5/mlx5_rxtx.c > +++ b/drivers/net/mlx5/mlx5_rxtx.c > @@ -1325,13 +1325,16 @@ enum mlx5_txcmp_code { > uint32_t ol_flags = 0; > uint16_t flags = rte_be_to_cpu_16(cqe->hdr_type_etc); > > - ol_flags = > - TRANSPOSE(flags, > - MLX5_CQE_RX_L3_HDR_VALID, > - PKT_RX_IP_CKSUM_GOOD) | > - TRANSPOSE(flags, > - MLX5_CQE_RX_L4_HDR_VALID, > - PKT_RX_L4_CKSUM_GOOD); > + if (flags & MLX5_CQE_RX_L3_HDR_VALID) > + ol_flags |= PKT_RX_IP_CKSUM_GOOD; > + else > + ol_flags |= PKT_RX_IP_CKSUM_BAD; > + If MLX5_CQE_RX_L3_HDR_VALID is not set - it does not always mean the sum is bad. If might happen if HW just did not recognize the packet format (for example, for some tunnels) With best regards, Slava > + if (flags & MLX5_CQE_RX_L4_HDR_VALID) > + ol_flags |= PKT_RX_IP_CKSUM_GOOD; > + else > + ol_flags |= PKT_RX_IP_CKSUM_BAD; > + > return ol_flags; > } > > diff --git a/drivers/net/mlx5/mlx5_utils.h b/drivers/net/mlx5/mlx5_utils.h > index 7a62187..2f71a23 100644 > --- a/drivers/net/mlx5/mlx5_utils.h > +++ b/drivers/net/mlx5/mlx5_utils.h > @@ -44,12 +44,6 @@ > #define NB_SEGS(m) ((m)->nb_segs) > #define PORT(m) ((m)->port) > > -/* Transpose flags. Useful to convert IBV to DPDK flags. */ -#define > TRANSPOSE(val, from, to) \ > - (((from) >= (to)) ? \ > - (((val) & (from)) / ((from) / (to))) : \ > - (((val) & (from)) * ((to) / (from)))) > - > /* > * For the case which data is linked with sequence increased index, the > * array table will be more efficiect than hash table once need to serarch > -- > 1.8.3.1 > ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [dpdk-dev] [PATCH] net/mlx5: add Rx checksum offload flag return bad 2021-03-23 16:33 ` Slava Ovsiienko @ 2021-03-24 16:22 ` Jiawei Zhu 2021-03-25 11:55 ` Slava Ovsiienko 0 siblings, 1 reply; 6+ messages in thread From: Jiawei Zhu @ 2021-03-24 16:22 UTC (permalink / raw) To: Slava Ovsiienko, dev; +Cc: zhujiawei12, Matan Azrad, Shahaf Shuler Hi,Slava Thanks for your explain,the multiplications and divisions are in the TRANSPOSE,not in the rte_be_to_cpu_16. So I think use if-else directly could improves the performance. And the second point about the bad sum,I agree with you. With best regards, Jiawei On 2021/3/24 12:33 上午, Slava Ovsiienko wrote: > Hi, Jiawei > >> -----Original Message----- >> From: Jiawei Zhu <17826875952@163.com> >> Sent: Monday, March 22, 2021 17:46 >> To: dev@dpdk.org >> Cc: zhujiawei12@huawei.com; Matan Azrad <matan@nvidia.com>; Shahaf >> Shuler <shahafs@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com> >> Subject: [PATCH] net/mlx5: add Rx checksum offload flag return bad >> >> From: Jiawei Zhu <zhujiawei12@huawei.com> >> >> When open the rx checksum offload and receive the wrong checksum, add >> the ol_flags return bad. And it's not best to use multiplication and division >> here. > > I'm sorry, there should be no any multiplications and divisions - the arguments > are constants (can be determined at compilation time) and ara power of 2, > hence compiler engages simple shifts. For example (I did rxq_cq_to_ol_flags not inline to > get the listing for x86-64): > > 29 rxq_cq_to_ol_flags: > 21 /* > 22 * An architecture-optimized byte swap for a 16-bit value. > 23 * > 24 * Do not use this function directly. The preferred function is rte_bswap16(). > 25 */ > 26 static inline uint16_t rte_arch_bswap16(uint16_t _x) > 27 { > 28 uint16_t x = _x; > 29 0034 86D6 asm volatile ("xchgb %b[x1],%h[x2]" > 30 : [x1] "=Q" (x) > 37 > 38 0036 89D0 movl %edx,%eax > 39 0038 6681E200 andw $512,%dx > 39 02 > 40 003d 66250004 andw $1024,%ax > 41 0041 0FB7D2 movzwl %dx,%edx > 42 0044 0FB7C0 movzwl %ax,%eax > 43 0047 48C1EA02 shrq $2,%rdx > 44 004b 48C1E802 shrq $2,%rax > 45 004f 09D0 orl %edx,%eax > 46 0051 C3 ret > > As we can see - there is no any mul/div and no any branches, that > improves the performance. > >> >> Signed-off-by: Jiawei Zhu <zhujiawei12@huawei.com> >> --- >> drivers/net/mlx5/mlx5_rxtx.c | 17 ++++++++++------- >> drivers/net/mlx5/mlx5_utils.h | 6 ------ >> 2 files changed, 10 insertions(+), 13 deletions(-) >> >> diff --git a/drivers/net/mlx5/mlx5_rxtx.c b/drivers/net/mlx5/mlx5_rxtx.c >> index e3ce9fd..9233af8 100644 >> --- a/drivers/net/mlx5/mlx5_rxtx.c >> +++ b/drivers/net/mlx5/mlx5_rxtx.c >> @@ -1325,13 +1325,16 @@ enum mlx5_txcmp_code { >> uint32_t ol_flags = 0; >> uint16_t flags = rte_be_to_cpu_16(cqe->hdr_type_etc); >> >> - ol_flags = >> - TRANSPOSE(flags, >> - MLX5_CQE_RX_L3_HDR_VALID, >> - PKT_RX_IP_CKSUM_GOOD) | >> - TRANSPOSE(flags, >> - MLX5_CQE_RX_L4_HDR_VALID, >> - PKT_RX_L4_CKSUM_GOOD); >> + if (flags & MLX5_CQE_RX_L3_HDR_VALID) >> + ol_flags |= PKT_RX_IP_CKSUM_GOOD; >> + else >> + ol_flags |= PKT_RX_IP_CKSUM_BAD; >> + > > If MLX5_CQE_RX_L3_HDR_VALID is not set - it does not always mean the sum is bad. > If might happen if HW just did not recognize the packet format (for example, for > some tunnels) > > With best regards, > Slava > > >> + if (flags & MLX5_CQE_RX_L4_HDR_VALID) >> + ol_flags |= PKT_RX_IP_CKSUM_GOOD; >> + else >> + ol_flags |= PKT_RX_IP_CKSUM_BAD; >> + >> return ol_flags; >> } >> >> diff --git a/drivers/net/mlx5/mlx5_utils.h b/drivers/net/mlx5/mlx5_utils.h >> index 7a62187..2f71a23 100644 >> --- a/drivers/net/mlx5/mlx5_utils.h >> +++ b/drivers/net/mlx5/mlx5_utils.h >> @@ -44,12 +44,6 @@ >> #define NB_SEGS(m) ((m)->nb_segs) >> #define PORT(m) ((m)->port) >> >> -/* Transpose flags. Useful to convert IBV to DPDK flags. */ -#define >> TRANSPOSE(val, from, to) \ >> - (((from) >= (to)) ? \ >> - (((val) & (from)) / ((from) / (to))) : \ >> - (((val) & (from)) * ((to) / (from)))) >> - >> /* >> * For the case which data is linked with sequence increased index, the >> * array table will be more efficiect than hash table once need to serarch >> -- >> 1.8.3.1 >> ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [dpdk-dev] [PATCH] net/mlx5: add Rx checksum offload flag return bad 2021-03-24 16:22 ` Jiawei Zhu @ 2021-03-25 11:55 ` Slava Ovsiienko 2021-03-28 13:39 ` Jiawei Zhu 0 siblings, 1 reply; 6+ messages in thread From: Slava Ovsiienko @ 2021-03-25 11:55 UTC (permalink / raw) To: Jiawei Zhu, dev; +Cc: zhujiawei12, Matan Azrad, Shahaf Shuler Hi, Jiawei > -----Original Message----- > From: Jiawei Zhu <17826875952@163.com> > Sent: Wednesday, March 24, 2021 18:22 > To: Slava Ovsiienko <viacheslavo@nvidia.com>; dev@dpdk.org > Cc: zhujiawei12@huawei.com; Matan Azrad <matan@nvidia.com>; Shahaf > Shuler <shahafs@nvidia.com> > Subject: Re: [PATCH] net/mlx5: add Rx checksum offload flag return bad > > Hi,Slava > > Thanks for your explain,the multiplications and divisions are in the > TRANSPOSE,not in the rte_be_to_cpu_16. [SO] Yes, TRANSPOSE is the macro with mul and div operators. But, these ones are translated by compiler to the simple shifts (due to operands are power of 2). The only place where TRANSPOSE is used is the rxq_cq_to_ol_flags() routine. I've compiled this one and provided the assembly listing - please see one in my previous reply. It illustrates how TRASPOSE was compiled to and presents the x86 code - we see only shifts: 43 0047 48C1EA02 shrq $2,%rdx 44 004b 48C1E802 shrq $2,%rax No any mul/div, exactly as we expected. > So I think use if-else directly could improves the performance. [SO] The if/else construction is usually compiled to conditional jumps, the branch prediction in CPU over the various ingress traffic patterns (we are analyzing the flags of the received packets) might not work well and we’ll get performance penalty. Hence, it seems the best practice is not to have the conditional jumps at all. The existing code follows this approach as we can see from the assembly listing - there is no conditional jumps. With best regards, Slava PS. Just removed embarrassing details from the listing - this is merely the resulting code of rxq_cq_to_ol_flags(). I removed static and made this one non-inline to see the isolated piece of code: rxq_cq_to_ol_flags: movzwl 28(%rdi),%edx // endianness conversion optimized out at all movl %edx,%eax andw $512,%dx andw $1024,%ax movzwl %dx,%edx movzwl %ax,%eax shrq $2,%rdx shrq $2,%rax orl %edx,%eax ret PPS. As we can see - the shift values are the same for both flags, so there might be some area to optimize (we could have only one shift and only one masking with AND) ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [dpdk-dev] [PATCH] net/mlx5: add Rx checksum offload flag return bad 2021-03-25 11:55 ` Slava Ovsiienko @ 2021-03-28 13:39 ` Jiawei Zhu 0 siblings, 0 replies; 6+ messages in thread From: Jiawei Zhu @ 2021-03-28 13:39 UTC (permalink / raw) To: Slava Ovsiienko, dev; +Cc: zhujiawei12, Matan Azrad, Shahaf Shuler Hi, Slava Thanks for your detailed explanation!You are right,I didn't look carefully! With best regards, Jiawei On 2021/3/25 7:55 下午, Slava Ovsiienko wrote: > Hi, Jiawei > >> -----Original Message----- >> From: Jiawei Zhu <17826875952@163.com> >> Sent: Wednesday, March 24, 2021 18:22 >> To: Slava Ovsiienko <viacheslavo@nvidia.com>; dev@dpdk.org >> Cc: zhujiawei12@huawei.com; Matan Azrad <matan@nvidia.com>; Shahaf >> Shuler <shahafs@nvidia.com> >> Subject: Re: [PATCH] net/mlx5: add Rx checksum offload flag return bad >> >> Hi,Slava >> >> Thanks for your explain,the multiplications and divisions are in the >> TRANSPOSE,not in the rte_be_to_cpu_16. > > [SO] > Yes, TRANSPOSE is the macro with mul and div operators. But, these ones > are translated by compiler to the simple shifts (due to operands are power of 2). > The only place where TRANSPOSE is used is the rxq_cq_to_ol_flags() routine. > I've compiled this one and provided the assembly listing - please see one > in my previous reply. It illustrates how TRASPOSE was compiled to and > presents the x86 code - we see only shifts: > > 43 0047 48C1EA02 shrq $2,%rdx > 44 004b 48C1E802 shrq $2,%rax > > No any mul/div, exactly as we expected. > >> So I think use if-else directly could improves the performance. > > [SO] > The if/else construction is usually compiled to conditional jumps, the branch > prediction in CPU over the various ingress traffic patterns (we are analyzing the > flags of the received packets) might not work well and we’ll get performance penalty. > Hence, it seems the best practice is not to have the conditional jumps at all. > The existing code follows this approach as we can see from the assembly listing - there > is no conditional jumps. > > With best regards, > Slava > > PS. Just removed embarrassing details from the listing - this is merely the resulting code > of rxq_cq_to_ol_flags(). I removed static and made this one non-inline to see the > isolated piece of code: > > rxq_cq_to_ol_flags: > movzwl 28(%rdi),%edx // endianness conversion optimized out at all > movl %edx,%eax > andw $512,%dx > andw $1024,%ax > movzwl %dx,%edx > movzwl %ax,%eax > shrq $2,%rdx > shrq $2,%rax > orl %edx,%eax > ret > > PPS. As we can see - the shift values are the same for both flags, so there might be some area to optimize > (we could have only one shift and only one masking with AND) > ^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2021-03-28 13:39 UTC | newest] Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-03-22 15:46 [dpdk-dev] [PATCH] net/mlx5: add Rx checksum offload flag return bad Jiawei Zhu 2021-03-22 16:26 ` Jiawei Zhu 2021-03-23 16:33 ` Slava Ovsiienko 2021-03-24 16:22 ` Jiawei Zhu 2021-03-25 11:55 ` Slava Ovsiienko 2021-03-28 13:39 ` Jiawei Zhu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).