Date: Wed, 25 Oct 2017 18:50:21 +0200
From: Adrien Mazarguil
To: Ophir Munk
Cc: dev@dpdk.org, Thomas Monjalon, Olga Shern, Matan Azrad
Message-ID: <20171025165021.GJ26782@6wind.com>
References: <1508752838-30408-1-git-send-email-ophirmu@mellanox.com>
 <1508768520-4810-1-git-send-email-ophirmu@mellanox.com>
 <1508768520-4810-7-git-send-email-ophirmu@mellanox.com>
In-Reply-To: <1508768520-4810-7-git-send-email-ophirmu@mellanox.com>
Subject: Re: [dpdk-dev] [PATCH v2 6/7] net/mlx4: improve performance of one Tx segment

On Mon, Oct 23, 2017 at 02:21:59PM +0000, Ophir Munk wrote:
> From: Matan Azrad
> 
> Since one segment shouldn't use additional memory to save segments
> byte_count for writing them in different order we can prevent
> additional memory unnecessary usage in this case.
> By the way, prevent loop management.
> 
> All for performance improvement.

...of the single-segment scenario? In my opinion the TX burst function
doesn't know the likeliest use case of the application unless it first
checks some user-provided configuration, e.g. some flag telling it TX
gather is a rare occurrence.

Multiple segment TX is actually quite common even for small packet
sizes. Applications may find it easier to prepend a cloned mbuf segment
to all packets in order to perform some encapsulation than to memcpy()
its contents inside the headroom of each packet to send. It's much more
efficient CPU-wise and a better use of HW capabilities.
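For instance, something along these lines (a rough sketch only;
encap_prepend(), encap_tmpl and hdr_pool are hypothetical names, not
part of this patch):

 #include <rte_mbuf.h>

 /*
  * Prepend a shared encapsulation header as an extra segment instead
  * of memcpy()ing it into the headroom of each packet. encap_tmpl is
  * assumed to be a single-segment template mbuf holding the header.
  */
 static struct rte_mbuf *
 encap_prepend(struct rte_mbuf *pkt, struct rte_mbuf *encap_tmpl,
               struct rte_mempool *hdr_pool)
 {
         /* Indirect clone: references encap_tmpl's data, no copy. */
         struct rte_mbuf *hdr = rte_pktmbuf_clone(encap_tmpl, hdr_pool);

         if (hdr == NULL)
                 return NULL;
         /* Chain the original packet behind the cloned header. */
         hdr->next = pkt;
         hdr->nb_segs = pkt->nb_segs + 1;
         hdr->pkt_len += pkt->pkt_len;
         return hdr; /* Multi-segment packet, ready for tx_burst(). */
 }

Every packet then reaches the TX function with at least two segments,
which is why the multi-segment path can't be treated as a rare event.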
likely() and unlikely() must be very carefully used in order to not
wreck the performance of the non-ideal (real-world, non-benchmarking,
however you want to call it) scenario, so when in doubt, keep them for
exceptions and error checks.

I can't accept this patch without performance results for single and
multiple-segment use cases which show they both benefit from it.

A few more comments below.

> Signed-off-by: Matan Azrad
> 
> ---
>  drivers/net/mlx4/mlx4_rxtx.c | 125 +++++++++++++++++++++++++++++--------------
>  1 file changed, 85 insertions(+), 40 deletions(-)
> 
> diff --git a/drivers/net/mlx4/mlx4_rxtx.c b/drivers/net/mlx4/mlx4_rxtx.c
> index e8d9a35..3236552 100644
> --- a/drivers/net/mlx4/mlx4_rxtx.c
> +++ b/drivers/net/mlx4/mlx4_rxtx.c
> @@ -310,7 +310,6 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
>          uint32_t owner_opcode = MLX4_OPCODE_SEND;
>          struct mlx4_wqe_ctrl_seg *ctrl;
>          struct mlx4_wqe_data_seg *dseg;
> -        struct rte_mbuf *sbuf;
>          union {
>                  uint32_t flags;
>                  uint16_t flags16[2];
> @@ -363,12 +362,12 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
>                  dseg = (struct mlx4_wqe_data_seg *)((uintptr_t)ctrl +
>                         sizeof(struct mlx4_wqe_ctrl_seg));
>                  /* Fill the data segments with buffer information. */
> -                for (sbuf = buf; sbuf != NULL; sbuf = sbuf->next, dseg++) {
> -                        addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
> +                if (likely(buf->nb_segs == 1)) {
> +                        addr = rte_pktmbuf_mtod(buf, uintptr_t);
>                          rte_prefetch0((volatile void *)addr);
>                          /* Handle WQE wraparound. */
> -                        if (unlikely(dseg >=
> -                            (struct mlx4_wqe_data_seg *)sq->eob))
> +                        if (unlikely(dseg >= (struct mlx4_wqe_data_seg *)
> +                            sq->eob))

Besides the fact this coding style change is unrelated to this commit,
this is one example of unlikely() that should not be unlikely(). While
it only occurs every time the index wraps at the end of the ring, it's
still extremely likely and expected given the number of packets
processed per second.
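In other words (an illustrative sketch of the guideline, reusing names
from the patch for clarity, not actual driver code):

         /* Error check: a genuine exception, unlikely() is fine. */
         if (unlikely(dseg->lkey == rte_cpu_to_be_32((uint32_t)-1)))
                 break;
         /* Ring wraparound: taken on every pass, leave it unhinted. */
         if (dseg >= (struct mlx4_wqe_data_seg *)sq->eob)
                 dseg = (struct mlx4_wqe_data_seg *)sq->buf;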
>                                  dseg = (struct mlx4_wqe_data_seg *)sq->buf;
>                          dseg->addr = rte_cpu_to_be_64(addr);
>                          /* Memory region key (big endian). */
> @@ -392,44 +391,90 @@ mlx4_tx_burst(void *dpdk_txq, struct rte_mbuf **pkts, uint16_t pkts_n)
>                                  break;
>                          }
>  #endif /* NDEBUG */
> -                        if (likely(sbuf->data_len)) {
> -                                byte_count = rte_cpu_to_be_32(sbuf->data_len);
> -                        } else {
> -                                /*
> -                                 * Zero length segment is treated as inline
> -                                 * segment with zero data.
> -                                 */
> -                                byte_count = RTE_BE32(0x80000000);
> -                        }
> -                        /*
> -                         * If the data segment is not at the beginning
> -                         * of a Tx basic block (TXBB) then write the
> -                         * byte count, else postpone the writing to
> -                         * just before updating the control segment.
> -                         */
> -                        if ((uintptr_t)dseg & (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> -                                /*
> -                                 * Need a barrier here before writing the
> -                                 * byte_count fields to make sure that all the
> -                                 * data is visible before the byte_count field
> -                                 * is set. otherwise, if the segment begins a
> -                                 * new cacheline, the HCA prefetcher could grab
> -                                 * the 64-byte chunk and get a valid
> -                                 * (!= 0xffffffff) byte count but stale data,
> -                                 * and end up sending the wrong data.
> -                                 */
> -                                rte_io_wmb();
> -                                dseg->byte_count = byte_count;
> -                        } else {
> +                        /* Need a barrier here before writing the byte_count. */
> +                        rte_io_wmb();
> +                        dseg->byte_count = rte_cpu_to_be_32(buf->data_len);
> +                } else {
> +                        /* Fill the data segments with buffer information. */
> +                        struct rte_mbuf *sbuf;
> +
> +                        for (sbuf = buf;
> +                             sbuf != NULL;
> +                             sbuf = sbuf->next, dseg++) {
> +                                addr = rte_pktmbuf_mtod(sbuf, uintptr_t);
> +                                rte_prefetch0((volatile void *)addr);
> +                                /* Handle WQE wraparound. */
> +                                if (unlikely(dseg >=
> +                                    (struct mlx4_wqe_data_seg *)sq->eob))
> +                                        dseg = (struct mlx4_wqe_data_seg *)
> +                                                sq->buf;
> +                                dseg->addr = rte_cpu_to_be_64(addr);
> +                                /* Memory region key (big endian). */
> +                                dseg->lkey = mlx4_txq_mp2mr(txq,
> +                                        mlx4_txq_mb2mp(sbuf));
> +                                #ifndef NDEBUG

I didn't catch this in the original review, coding rules prohibit
indented preprocessor directives. You must remove the extra indent if
you're modifying them.

> +                                if (unlikely(dseg->lkey ==
> +                                    rte_cpu_to_be_32((uint32_t)-1))) {
> +                                        /* MR does not exist. */
> +                                        DEBUG("%p: unable to get MP <-> MR association",
> +                                              (void *)txq);
> +                                        /*
> +                                         * Restamp entry in case of failure.
> +                                         * Make sure that size is written
> +                                         * correctly, note that we give
> +                                         * ownership to the SW, not the HW.
> +                                         */
> +                                        ctrl->fence_size =
> +                                                (wqe_real_size >> 4) & 0x3f;
> +                                        mlx4_txq_stamp_freed_wqe(sq, head_idx,
> +                                            (sq->head & sq->txbb_cnt) ? 0 : 1);
> +                                        elt->buf = NULL;
> +                                        break;
> +                                }
> +                                #endif /* NDEBUG */
> +                                if (likely(sbuf->data_len)) {
> +                                        byte_count =
> +                                                rte_cpu_to_be_32(sbuf->data_len);
> +                                } else {
> +                                        /*
> +                                         * Zero length segment is treated as
> +                                         * inline segment with zero data.
> +                                         */
> +                                        byte_count = RTE_BE32(0x80000000);
> +                                }
>                                  /*
> -                                 * This data segment starts at the beginning of
> -                                 * a new TXBB, so we need to postpone its
> -                                 * byte_count writing for later.
> +                                 * If the data segment is not at the beginning
> +                                 * of a Tx basic block (TXBB) then write the
> +                                 * byte count, else postpone the writing to
> +                                 * just before updating the control segment.
>                                   */
> -                                pv[pv_counter].dseg = dseg;
> -                                pv[pv_counter++].val = byte_count;
> +                                if ((uintptr_t)dseg &
> +                                    (uintptr_t)(MLX4_TXBB_SIZE - 1)) {
> +                                        /*
> +                                         * Need a barrier here before writing
> +                                         * the byte_count fields to make sure
> +                                         * that all the data is visible before
> +                                         * the byte_count field is set.
> +                                         * Otherwise, if the segment begins a
> +                                         * new cacheline, the HCA prefetcher
> +                                         * could grab the 64-byte chunk and get
> +                                         * a valid (!= 0xffffffff) byte count
> +                                         * but stale data, and end up sending
> +                                         * the wrong data.
> +                                         */
> +                                        rte_io_wmb();
> +                                        dseg->byte_count = byte_count;
> +                                } else {
> +                                        /*
> +                                         * This data segment starts at the
> +                                         * beginning of a new TXBB, so we
> +                                         * need to postpone its byte_count
> +                                         * writing for later.
> +                                         */
> +                                        pv[pv_counter].dseg = dseg;
> +                                        pv[pv_counter++].val = byte_count;
> +                                }
> +                        }
>                  }
> -        }

Where did that block go? Isn't there an unnecessary indentation level
here?

>                  /* Write the first DWORD of each TXBB save earlier. */
>                  if (pv_counter) {
>                          /* Need a barrier before writing the byte_count. */
> -- 
> 2.7.4
> 

-- 
Adrien Mazarguil
6WIND