From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wi0-f177.google.com (mail-wi0-f177.google.com [209.85.212.177]) by dpdk.org (Postfix) with ESMTP id CB805C602 for ; Thu, 30 Jul 2015 13:22:33 +0200 (CEST) Received: by wicgb10 with SMTP id gb10so239301379wic.1 for ; Thu, 30 Jul 2015 04:22:33 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:message-id:date:from:user-agent:mime-version:to :cc:subject:references:in-reply-to:content-type :content-transfer-encoding; bh=KlG0j+2BPN5kzW7Dh9ahhBWlaDuzRUw767dC+kgylEM=; b=ACPbHm7n9Z41F+Y91ZydjlRSc08Ek5QJaJbcmaqZR9gKLEF8rkMe4I0irUEdPsimi1 N/xyiarcNiSEstFPZ6T5m0i+g3eZnI+6Gq3vjtgURVaRtuiyj1vpisbTDn2nHTlPOSKn rvwmSIc86CuDgejJfpz0uZD8vf5iJ+GUt+Ksfw7S7TkqJ+QSQSK5TRi50PeM9HMV+WOg ciP9/xGoxlX0t+QOOkft0P11PKGhotssyh0U9sVoLC1FLW8EGp9eVND+uOlXuNZa2y5b VG9KlofaJy0ERLkfZLvAyl+sORhDeJW9lSc6iv4y/xiWP83HS0y+Vbh09JXUpCCBrd6Q 5Qyw== X-Gm-Message-State: ALoCoQmOgHB3RiZ8auIXG0hEIxMnxzSL0kS4+5EkKqdPioN3T4LS2hIWolDWXy0PnP7r+HLQhk5T X-Received: by 10.194.121.163 with SMTP id ll3mr90128438wjb.142.1438255353664; Thu, 30 Jul 2015 04:22:33 -0700 (PDT) Received: from [10.16.0.195] (6wind.net2.nerim.net. [213.41.151.210]) by smtp.gmail.com with ESMTPSA id m4sm1305909wjb.37.2015.07.30.04.22.32 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 30 Jul 2015 04:22:32 -0700 (PDT) Message-ID: <55BA08EE.8090902@6wind.com> Date: Thu, 30 Jul 2015 13:22:22 +0200 From: Olivier MATZ User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.7.0 MIME-Version: 1.0 To: "Ananyev, Konstantin" , "Zhang, Helin" , Martin Weiser References: <55B8EC16.60404@allegro-packets.com> <2601191342CEEE43887BDE71AB97725836A69E3D@irsmsx105.ger.corp.intel.com> <55B9DC69.3080508@6wind.com> <2601191342CEEE43887BDE71AB97725836A6A082@irsmsx105.ger.corp.intel.com> <55B9E9F5.8080102@6wind.com> <2601191342CEEE43887BDE71AB97725836A6A0CE@irsmsx105.ger.corp.intel.com> In-Reply-To: <2601191342CEEE43887BDE71AB97725836A6A0CE@irsmsx105.ger.corp.intel.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit Cc: "dev@dpdk.org" Subject: Re: [dpdk-dev] Issue with non-scattered rx in ixgbe and i40e when mbuf private area size is odd X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 30 Jul 2015 11:22:34 -0000 Hi, On 07/30/2015 11:43 AM, Ananyev, Konstantin wrote: > > >> -----Original Message----- >> From: Olivier MATZ [mailto:olivier.matz@6wind.com] >> Sent: Thursday, July 30, 2015 10:10 AM >> To: Ananyev, Konstantin; Zhang, Helin; Martin Weiser >> Cc: dev@dpdk.org >> Subject: Re: [dpdk-dev] Issue with non-scattered rx in ixgbe and i40e when mbuf private area size is odd >> >> Hi Konstantin, >> >> On 07/30/2015 11:00 AM, Ananyev, Konstantin wrote: >>> Hi Olivier, >>> >>>> -----Original Message----- >>>> From: Olivier MATZ [mailto:olivier.matz@6wind.com] >>>> Sent: Thursday, July 30, 2015 9:12 AM >>>> To: Zhang, Helin; Ananyev, Konstantin; Martin Weiser >>>> Cc: dev@dpdk.org >>>> Subject: Re: [dpdk-dev] Issue with non-scattered rx in ixgbe and i40e when mbuf private area size is odd >>>> >>>> Hi, >>>> >>>> On 07/29/2015 10:24 PM, Zhang, Helin wrote: >>>>> Hi Martin >>>>> >>>>> Thank you very much for the good catch! >>>>> >>>>> The similar situation in i40e, as explained by Konstantin. >>>>> As header split hasn't been supported by DPDK till now. It would be better to put the header address in RX descriptor to 0. >>>>> But in the future, during header split enabling. We may need to pay extra attention to that. As at least x710 datasheet said >>>> specifically as below. >>>>> "The header address should be set by the software to an even number (word aligned address)". We may need to find a way to >>>> ensure that during mempool/mbuf allocation. >>>> >>>> Indeed it would be good to force the priv_size to be aligned. >>>> >>>> The priv_size could be aligned automatically in >>>> rte_pktmbuf_pool_create(). The only possible problem I could see >>>> is that it would break applications that access to the data buffer >>>> by doing (sizeof(mbuf) + sizeof(priv)), which is probably not the >>>> best thing to do (I didn't find any applications like this in dpdk). >>> >>> >>> Might be just make rte_pktmbuf_pool_create() fail if input priv_size % MIN_ALIGN != 0? >> >> Hmm maybe it would break more applications: an odd priv_size is >> probably rare, but a priv_size that is not aligned to 8 bytes is >> maybe more common. > > My thought was that rte_mempool_create() was just introduced in 2.1, > so if we add extra requirement for the input parameter now - > there would be no ABI breakage, and not many people started to use it already. > For me just seems a bit easier and more straightforward then silent alignment - > user would not have wrong assumptions here. > Though if you think that a silent alignment would be more convenient > for most users - I wouldn't insist. Yes, I agree on the principle, but it depends whether this fix is integrated for 2.1 or not. I think it may already be a bit late for that, especially as it is not a very critical bug. Thomas, what do you think? Olivier > Konstantin > >> It's maybe safer to align the size transparently? >> >> >> Regards, >> Olivier >> >> >> >>> >>>> >>>> For applications that directly use rte_mempool_create() instead of >>>> rte_pktmbuf_pool_create(), we could add a check using an assert in >>>> rte_pktmbuf_init() and some words in the documentation. >>>> >>>> The question is: what should be the proper alignment? I would say >>>> at least 8 bytes, but maybe cache_aligned is an option too. >>> >>> 8 bytes seems enough to me. >>> >>> Konstantin >>> >>>> >>>> Regards, >>>> Olivier >>>> >>>> >>>>> >>>>> Regards, >>>>> Helin >>>>> >>>>>> -----Original Message----- >>>>>> From: Ananyev, Konstantin >>>>>> Sent: Wednesday, July 29, 2015 11:12 AM >>>>>> To: Martin Weiser; Zhang, Helin; olivier.matz@6wind.com >>>>>> Cc: dev@dpdk.org >>>>>> Subject: RE: [dpdk-dev] Issue with non-scattered rx in ixgbe and i40e when mbuf >>>>>> private area size is odd >>>>>> >>>>>> Hi Martin, >>>>>> >>>>>>> -----Original Message----- >>>>>>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Martin Weiser >>>>>>> Sent: Wednesday, July 29, 2015 4:07 PM >>>>>>> To: Zhang, Helin; olivier.matz@6wind.com >>>>>>> Cc: dev@dpdk.org >>>>>>> Subject: [dpdk-dev] Issue with non-scattered rx in ixgbe and i40e when >>>>>>> mbuf private area size is odd >>>>>>> >>>>>>> Hi Helin, Hi Olivier, >>>>>>> >>>>>>> we are seeing an issue with the ixgbe and i40e drivers which we could >>>>>>> track down to our setting of the private area size of the mbufs. >>>>>>> The issue can be easily reproduced with the l2fwd example application >>>>>>> when a small modification is done: just set the priv_size parameter in >>>>>>> the call to the rte_pktmbuf_pool_create function to an odd number like >>>>>>> 1. In our tests this causes every call to rte_eth_rx_burst to return >>>>>>> 32 (which is the setting of nb_pkts) nonsense mbufs although no >>>>>>> packets are received on the interface and the hardware counters do not >>>>>>> report any received packets. >>>>>> >>>>>> From Niantic datasheet: >>>>>> >>>>>> "7.1.6.1 Advanced Receive Descriptors — Read Format Table 7-15 lists the >>>>>> advanced receive descriptor programming by the software. The ... >>>>>> Packet Buffer Address (64) >>>>>> This is the physical address of the packet buffer. The lowest bit is A0 (LSB of the >>>>>> address). >>>>>> Header Buffer Address (64) >>>>>> The physical address of the header buffer with the lowest bit being Descriptor >>>>>> Done (DD). >>>>>> When a packet spans in multiple descriptors, the header buffer address is used >>>>>> only on the first descriptor. During the programming phase, software must set >>>>>> the DD bit to zero (see the description of the DD bit in this section). This means >>>>>> that header buffer addresses are always word aligned." >>>>>> >>>>>> Right now, in ixgbe PMD we always setup Packet Buffer Address (PBA)and >>>>>> Header Buffer Address (HBA) to the same value: >>>>>> buf_physaddr + RTE_PKTMBUF_HEADROOM >>>>>> So when pirv_size==1, DD bit in RXD is always set to one by SW itself, and then >>>>>> SW considers that HW already done with it. >>>>>> In other words, right now for ixgbe you can't use RX buffer that is not aligned on >>>>>> word boundary. >>>>>> >>>>>> So the advice would be, right now - don't set priv_size to the odd value. >>>>>> As we don't support split header feature anyway, I think we can fix it just by >>>>>> always setting HBA in the RXD to zero. >>>>>> Could you try the fix for ixgbe below? >>>>>> >>>>>> Same story with FVL, I believe. >>>>>> Konstantin >>>>>> >>>>>> >>>>>>> Interestingly this does not happen if we force the scattered rx path. >>>>>>> >>>>>>> I assume the drivers have some expectations regarding the alignment of >>>>>>> the buf_addr in the mbuf and setting an odd private are size breaks >>>>>>> this alignment in the rte_pktmbuf_init function. If this is the case >>>>>>> then one possible fix might be to enforce an alignment on the private area size. >>>>>>> >>>>>>> Best regards, >>>>>>> Martin >>>>>> >>>>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c >>>>>> index a0c8847..94967c5 100644 >>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c >>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c >>>>>> @@ -1183,7 +1183,7 @@ ixgbe_rx_alloc_bufs(struct ixgbe_rx_queue *rxq, bool >>>>>> reset_mbuf) >>>>>> >>>>>> /* populate the descriptors */ >>>>>> dma_addr = >>>>>> rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb)); >>>>>> - rxdp[i].read.hdr_addr = dma_addr; >>>>>> + rxdp[i].read.hdr_addr = 0; >>>>>> rxdp[i].read.pkt_addr = dma_addr; >>>>>> } >>>>>> >>>>>> @@ -1414,7 +1414,7 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf >>>>>> **rx_pkts, >>>>>> rxe->mbuf = nmb; >>>>>> dma_addr = >>>>>> >>>>>> rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(nmb)); >>>>>> - rxdp->read.hdr_addr = dma_addr; >>>>>> + rxdp->read.hdr_addr = 0; >>>>>> rxdp->read.pkt_addr = dma_addr; >>>>>> >>>>>> /* >>>>>> @@ -1741,7 +1741,7 @@ next_desc: >>>>>> rxe->mbuf = nmb; >>>>>> >>>>>> rxm->data_off = RTE_PKTMBUF_HEADROOM; >>>>>> - rxdp->read.hdr_addr = dma; >>>>>> + rxdp->read.hdr_addr = 0; >>>>>> rxdp->read.pkt_addr = dma; >>>>>> } else >>>>>> rxe->mbuf = NULL; @@ -3633,7 +3633,7 @@ >>>>>> ixgbe_alloc_rx_queue_mbufs(struct ixgbe_rx_queue *rxq) >>>>>> dma_addr = >>>>>> >>>>>> rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mbuf)); >>>>>> rxd = &rxq->rx_ring[i]; >>>>>> - rxd->read.hdr_addr = dma_addr; >>>>>> + rxd->read.hdr_addr = 0; >>>>>> rxd->read.pkt_addr = dma_addr; >>>>>> rxe[i].mbuf = mbuf; >>>>>> } >>>>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c >>>>>> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c >>>>>> index 6c1647e..16a9c64 100644 >>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c >>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c >>>>>> @@ -56,6 +56,8 @@ ixgbe_rxq_rearm(struct ixgbe_rx_queue *rxq) >>>>>> RTE_PKTMBUF_HEADROOM); >>>>>> __m128i dma_addr0, dma_addr1; >>>>>> >>>>>> + const __m128i hba_msk = _mm_set_epi64x(0, UINT64_MAX); >>>>>> + >>>>>> rxdp = rxq->rx_ring + rxq->rxrearm_start; >>>>>> >>>>>> /* Pull 'n' more MBUFs into the software ring */ @@ -108,6 +110,9 @@ >>>>>> ixgbe_rxq_rearm(struct ixgbe_rx_queue *rxq) >>>>>> dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room); >>>>>> dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room); >>>>>> >>>>>> + dma_addr0 = _mm_and_si128(dma_addr0, hba_msk); >>>>>> + dma_addr1 = _mm_and_si128(dma_addr1, hba_msk); >>>>>> + >>>>>> /* flush desc with pa dma_addr */ >>>>>> _mm_store_si128((__m128i *)&rxdp++->read, dma_addr0); >>>>>> _mm_store_si128((__m128i *)&rxdp++->read, dma_addr1); >>>>>> bash-4.2$ cat patch1 diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c >>>>>> b/drivers/net/ixgbe/ixgbe_rxtx.c index a0c8847..94967c5 100644 >>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c >>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c >>>>>> @@ -1183,7 +1183,7 @@ ixgbe_rx_alloc_bufs(struct ixgbe_rx_queue *rxq, bool >>>>>> reset_mbuf) >>>>>> >>>>>> /* populate the descriptors */ >>>>>> dma_addr = >>>>>> rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb)); >>>>>> - rxdp[i].read.hdr_addr = dma_addr; >>>>>> + rxdp[i].read.hdr_addr = 0; >>>>>> rxdp[i].read.pkt_addr = dma_addr; >>>>>> } >>>>>> >>>>>> @@ -1414,7 +1414,7 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf >>>>>> **rx_pkts, >>>>>> rxe->mbuf = nmb; >>>>>> dma_addr = >>>>>> >>>>>> rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(nmb)); >>>>>> - rxdp->read.hdr_addr = dma_addr; >>>>>> + rxdp->read.hdr_addr = 0; >>>>>> rxdp->read.pkt_addr = dma_addr; >>>>>> >>>>>> /* >>>>>> @@ -1741,7 +1741,7 @@ next_desc: >>>>>> rxe->mbuf = nmb; >>>>>> >>>>>> rxm->data_off = RTE_PKTMBUF_HEADROOM; >>>>>> - rxdp->read.hdr_addr = dma; >>>>>> + rxdp->read.hdr_addr = 0; >>>>>> rxdp->read.pkt_addr = dma; >>>>>> } else >>>>>> rxe->mbuf = NULL; @@ -3633,7 +3633,7 @@ >>>>>> ixgbe_alloc_rx_queue_mbufs(struct ixgbe_rx_queue *rxq) >>>>>> dma_addr = >>>>>> >>>>>> rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mbuf)); >>>>>> rxd = &rxq->rx_ring[i]; >>>>>> - rxd->read.hdr_addr = dma_addr; >>>>>> + rxd->read.hdr_addr = 0; >>>>>> rxd->read.pkt_addr = dma_addr; >>>>>> rxe[i].mbuf = mbuf; >>>>>> } >>>>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx_vec.c >>>>>> b/drivers/net/ixgbe/ixgbe_rxtx_vec.c >>>>>> index 6c1647e..16a9c64 100644 >>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx_vec.c >>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx_vec.c >>>>>> @@ -56,6 +56,8 @@ ixgbe_rxq_rearm(struct ixgbe_rx_queue *rxq) >>>>>> RTE_PKTMBUF_HEADROOM); >>>>>> __m128i dma_addr0, dma_addr1; >>>>>> >>>>>> + const __m128i hba_msk = _mm_set_epi64x(0, UINT64_MAX); >>>>>> + >>>>>> rxdp = rxq->rx_ring + rxq->rxrearm_start; >>>>>> >>>>>> /* Pull 'n' more MBUFs into the software ring */ @@ -108,6 +110,9 @@ >>>>>> ixgbe_rxq_rearm(struct ixgbe_rx_queue *rxq) >>>>>> dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room); >>>>>> dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room); >>>>>> >>>>>> + dma_addr0 = _mm_and_si128(dma_addr0, hba_msk); >>>>>> + dma_addr1 = _mm_and_si128(dma_addr1, hba_msk); >>>>>> + >>>>>> /* flush desc with pa dma_addr */ >>>>>> _mm_store_si128((__m128i *)&rxdp++->read, dma_addr0); >>>>>> _mm_store_si128((__m128i *)&rxdp++->read, dma_addr1); >>> >