On Fri, Jul 9, 2021 at 2:00 AM Ruifeng Wang wrote:
>
> > +/**
> > + * Check validity of a completion ring entry. If the entry is valid, include a
> > + * C11 __ATOMIC_ACQUIRE fence to ensure that subsequent loads of fields in the
> > + * completion are not hoisted by the compiler or by the CPU to come before the
> > + * loading of the "valid" field.
> > + *
> > + * Note: the caller must not access any fields in the specified completion
> > + * entry prior to calling this function.
> > + *
> > + * @param cmp
> Nit, cmpl

Thanks, good catch. I'll fix this in v2.

> >
> > /* Check to see if hw has posted a completion for the descriptor. */
> > @@ -3327,7 +3327,7 @@ bnxt_tx_descriptor_status_op(void *tx_queue, uint16_t offset)
> > cons = RING_CMPL(ring_mask, raw_cons);
> > txcmp = (struct tx_cmpl *)&cp_desc_ring[cons];
> >
> > - if (!CMP_VALID(txcmp, raw_cons, cp_ring_struct))
> > + if (!bnxt_cpr_cmp_valid(txcmp, raw_cons, ring_mask + 1))
> cpr->cp_ring_struct->ring_size can be used instead of 'ring_mask + 1'?
>
> > break;
> >
> > if (CMP_TYPE(txcmp) == TX_CMPL_TYPE_TX_L2)
> >
>
> > diff --git a/drivers/net/bnxt/bnxt_rxtx_vec_neon.c b/drivers/net/bnxt/bnxt_rxtx_vec_neon.c
> > index 263e6ec3c..13211060c 100644
> > --- a/drivers/net/bnxt/bnxt_rxtx_vec_neon.c
> > +++ b/drivers/net/bnxt/bnxt_rxtx_vec_neon.c
> > @@ -339,7 +339,7 @@ bnxt_handle_tx_cp_vec(struct bnxt_tx_queue *txq)
> > cons = RING_CMPL(ring_mask, raw_cons);
> > txcmp = (struct tx_cmpl *)&cp_desc_ring[cons];
> >
> > - if (!CMP_VALID(txcmp, raw_cons, cp_ring_struct))
> > + if (!bnxt_cpr_cmp_valid(txcmp, raw_cons, ring_mask + 1))
> Same here. I think cpr->cp_ring_struct->ring_size can be used and it avoids calculation.
> Also some places in other vector files.

It's true that cpr->cp_ring_struct->ring_size and ring_mask + 1 are
equivalent, but there doesn't seem to be a meaningful difference
between the two in the generated code.
Based on disassembly of x86 and Arm code for this function, the compiler correctly determines that the value of ring_mask + 1 doesn't change within the loop, so it is only computed once. The only difference would be in whether an add instruction or a load instruction is used to put the value in the register.
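
For anyone following along without the driver source at hand, here is a
rough standalone sketch of the pattern under discussion. It is not the
bnxt code: the struct layout, the SKETCH_CMPL_V bit, and the sketch_*
names are made up for illustration, and it assumes the expected value of
the valid bit flips each time raw_cons wraps the ring. It uses a relaxed
C11 atomic load followed by atomic_thread_fence(memory_order_acquire) in
place of the driver's own helpers. The polling loop passes ring_mask + 1
on every iteration, which is the loop-invariant expression mentioned
above.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical completion entry; not the real bnxt descriptor layout. */
struct sketch_cmpl {
	uint32_t payload;		/* fields read only after the valid check */
	_Atomic uint32_t info_v;	/* bit 0 is the producer-written valid bit */
};

#define SKETCH_CMPL_V 0x1u

/*
 * Return true if the entry is valid for the current pass over the ring.
 * Assumption: the expected valid-bit value toggles on every wrap, so it
 * is derived from (raw_cons & ring_size), ring_size being a power of two.
 */
static inline bool
sketch_cmp_valid(struct sketch_cmpl *cmpl, uint32_t raw_cons,
		 uint32_t ring_size)
{
	bool expected = !(raw_cons & ring_size);
	uint32_t v = atomic_load_explicit(&cmpl->info_v, memory_order_relaxed);
	bool valid = (v & SKETCH_CMPL_V) != 0;

	if (valid == expected) {
		/* Keep later loads of the entry from moving before the check. */
		atomic_thread_fence(memory_order_acquire);
		return true;
	}
	return false;
}

/* Count consecutive valid entries starting at raw_cons (at most one lap). */
static uint32_t
sketch_count_valid(struct sketch_cmpl *ring, uint32_t ring_mask,
		   uint32_t raw_cons)
{
	uint32_t n = 0;

	/* ring_mask + 1 does not change inside the loop. */
	while (n <= ring_mask &&
	       sketch_cmp_valid(&ring[raw_cons & ring_mask], raw_cons,
				ring_mask + 1)) {
		raw_cons++;
		n++;
	}
	return n;
}
```

With a 4-entry ring (ring_mask = 3), a caller would poll with something
like sketch_count_valid(ring, 3, raw_cons). Whether the size comes from
ring_mask + 1 or from a stored ring_size field only changes how the
value reaches the register before the loop.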