DPDK patches and discussions
* [dpdk-dev]  [PATCH v6 0/3]: Add LRO support to ixgbe PMD
@ 2015-03-09 19:07 Vlad Zolotarov
  2015-03-09 19:07 ` [dpdk-dev] [PATCH v6 1/3] ixgbe: Cleanups Vlad Zolotarov
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Vlad Zolotarov @ 2015-03-09 19:07 UTC (permalink / raw)
  To: dev

This series adds the missing flow for enabling LRO in the ethdev layer and
adds support for this feature in the ixgbe PMD. The hope is that this
initiative will be picked up by an Intel developer who will add LRO support
to other Intel PMDs.
 
The series starts with some cleanup work in the code that the final patch (the one that
actually adds LRO support) touches, uses or changes. There are still quite a few issues left
in the ixgbe PMD code, but they will have to be addressed in a separate series; I've left a few "TODO"
remarks in the code.
 
The LRO ("RSC" in Intel's context) PMD completion handling code follows the same design as the
corresponding Linux and FreeBSD implementation: pass the aggregation's cluster HEAD buffer to
the NEXTP entry of the software ring till EOP is met.
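
As an illustration, here is a minimal, self-contained sketch of that HEAD-passing scheme. The
struct and function names below are simplified stand-ins for the igb_rsc_entry ring added in
patch 3, not the actual PMD code:

    /* Simplified model of the RSC completion handling (illustration only). */
    struct mbuf { struct mbuf *next; unsigned int data_len, pkt_len, nb_segs; };
    struct rsc_entry { struct mbuf *fbuf; /* HEAD of the aggregation cluster */ };

    /* Called once per completed descriptor; returns the whole cluster on EOP. */
    static struct mbuf *
    rsc_track_segment(struct rsc_entry *sw_rsc_ring, unsigned int cur_id,
                      unsigned int nextp_id, struct mbuf *seg, int eop)
    {
            struct mbuf *head = sw_rsc_ring[cur_id].fbuf; /* HEAD passed to this entry */

            sw_rsc_ring[cur_id].fbuf = NULL;

            if (head == NULL) {                     /* first segment of a cluster */
                    head = seg;
                    head->pkt_len = seg->data_len;
                    head->nb_segs = 1;
            } else {                                /* append to the aggregation */
                    head->pkt_len += seg->data_len;
                    head->nb_segs++;
            }

            if (!eop) {
                    /*
                     * Not done yet: pass the HEAD on to the NEXTP entry, where the
                     * next completion of this cluster will pick it up. (Chaining the
                     * segment mbufs via seg->next is elided here.)
                     */
                    sw_rsc_ring[nextp_id].fbuf = head;
                    return NULL;
            }

            /* EOP: terminate the chain and hand the whole cluster up. */
            seg->next = NULL;
            return head;
    }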
 
HW configuration follows the corresponding specs: this feature is supported only by x540 and
82599 PF devices.
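
For completeness, here is a rough sketch of how an application built on top of this series could
request the feature; the helper name, queue and descriptor counts are arbitrary, and error
handling is trimmed:

    #include <rte_ethdev.h>

    /* Hypothetical helper: one Rx/Tx queue pair with LRO (RSC) requested. */
    static int setup_lro_port(uint8_t port_id, struct rte_mempool *mb_pool)
    {
    #ifdef RTE_ETHDEV_HAS_LRO_SUPPORT
            struct rte_eth_conf conf = {
                    .rxmode = {
                            .enable_lro   = 1, /* request RSC/LRO */
                            .hw_strip_crc = 1, /* required by the 82599/x540 spec
                                                * when RSC is enabled */
                    },
            };
            int socket = rte_eth_dev_socket_id(port_id);

            if (rte_eth_dev_configure(port_id, 1, 1, &conf) < 0)
                    return -1;
            if (rte_eth_rx_queue_setup(port_id, 0, 512, socket, NULL, mb_pool) < 0)
                    return -1;
            if (rte_eth_tx_queue_setup(port_id, 0, 512, socket, NULL) < 0)
                    return -1;

            return rte_eth_dev_start(port_id);
    #else
            return -1; /* this ethdev does not expose the LRO API */
    #endif
    }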
 
The feature has been tested with the Seastar TCP stack, with the following Tx-side configuration:
   - MTU: 400B
   - 100 concurrent TCP connections.
 
The results were:
   - Without LRO: total throughput: 0.12 Gbps, coefficient of variation: 1.41%
   - With LRO:    total throughput: 8.21 Gbps, coefficient of variation: 0.59%

This is roughly a 68x throughput improvement (8.21 / 0.12 ≈ 68).

New in v6:
   - A fix for the typo in the "bug fixes" series that had broken the compilation required a
     minor change in this follow-up series.

New in v5:
   - Split the series into "bug fixes" and "all the rest" so that the former could be
     integrated into a 2.0 release.
   - Put the RTE_ETHDEV_HAS_LRO_SUPPORT definition at the beginning of rte_ethdev.h.
   - Removed the "TODO: Remove me" comment near RTE_ETHDEV_HAS_LRO_SUPPORT.

New in v4:
   - Remove CONFIG_RTE_ETHDEV_LRO_SUPPORT from config/common_linuxapp.
   - Define RTE_ETHDEV_HAS_LRO_SUPPORT in rte_ethdev.h.
   - As a result of "ixgbe: check rxd number to avoid mbuf leak" (352078e8e) Vector Rx
     had to get the same treatment as Rx Bulk Alloc (see PATCH4 for more details).
     
New in v3:
   - ixgbe_rx_alloc_bufs(): Always reset refcnt of the buffers to 1. Otherwise rte_pktmbuf_free()
     won't free them.

New in v2:
   - Removed rte_eth_dev_data.lro_bulk_alloc and added ixgbe_hw.rx_bulk_alloc_allowed
     instead.
   - Unified the rx_pkt_bulk callback setting (a separate new patch).
   - Fixed a few styling and spelling issues.


Vlad Zolotarov (3):
  ixgbe: Cleanups
  ixgbe: Code refactoring
  ixgbe: Add LRO support

 lib/librte_ether/rte_ethdev.h       |   9 +-
 lib/librte_pmd_ixgbe/ixgbe_ethdev.c |   6 +
 lib/librte_pmd_ixgbe/ixgbe_ethdev.h |   5 +
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c   | 705 ++++++++++++++++++++++++++++++++----
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h   |   6 +
 5 files changed, 663 insertions(+), 68 deletions(-)

-- 
2.1.0


* [dpdk-dev]  [PATCH v6 1/3] ixgbe: Cleanups
  2015-03-09 19:07 [dpdk-dev] [PATCH v6 0/3]: Add LRO support to ixgbe PMD Vlad Zolotarov
@ 2015-03-09 19:07 ` Vlad Zolotarov
  2015-03-09 20:15   ` Ananyev, Konstantin
  2015-03-09 19:07 ` [dpdk-dev] [PATCH v6 2/3] ixgbe: Code refactoring Vlad Zolotarov
  2015-03-09 19:07 ` [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support Vlad Zolotarov
  2 siblings, 1 reply; 18+ messages in thread
From: Vlad Zolotarov @ 2015-03-09 19:07 UTC (permalink / raw)
  To: dev

   - Removed unneeded casts.
   - ixgbe_dev_rx_init(): shorten the lines by defining a local alias variable to access
                          &dev->data->dev_conf.rxmode.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
---
New in v6:
   - Fixed a compilation error caused by patch recomposition during the series separation.
---
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 99c4bde..e015981 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -1032,8 +1032,7 @@ ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
 	int diag, i;
 
 	/* allocate buffers in bulk directly into the S/W ring */
-	alloc_idx = (uint16_t)(rxq->rx_free_trigger -
-				(rxq->rx_free_thresh - 1));
+	alloc_idx = rxq->rx_free_trigger - (rxq->rx_free_thresh - 1);
 	rxep = &rxq->sw_ring[alloc_idx];
 	diag = rte_mempool_get_bulk(rxq->mb_pool, (void *)rxep,
 				    rxq->rx_free_thresh);
@@ -1061,10 +1060,9 @@ ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
 	IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, rxq->rx_free_trigger);
 
 	/* update state of internal queue structure */
-	rxq->rx_free_trigger = (uint16_t)(rxq->rx_free_trigger +
-						rxq->rx_free_thresh);
+	rxq->rx_free_trigger = rxq->rx_free_trigger + rxq->rx_free_thresh;
 	if (rxq->rx_free_trigger >= rxq->nb_rx_desc)
-		rxq->rx_free_trigger = (uint16_t)(rxq->rx_free_thresh - 1);
+		rxq->rx_free_trigger = rxq->rx_free_thresh - 1;
 
 	/* no errors */
 	return 0;
@@ -3564,6 +3562,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 	uint32_t rxcsum;
 	uint16_t buf_size;
 	uint16_t i;
+	struct rte_eth_rxmode *rx_conf = &dev->data->dev_conf.rxmode;
 
 	PMD_INIT_FUNC_TRACE();
 	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
@@ -3586,7 +3585,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 	 * Configure CRC stripping, if any.
 	 */
 	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
-	if (dev->data->dev_conf.rxmode.hw_strip_crc)
+	if (rx_conf->hw_strip_crc)
 		hlreg0 |= IXGBE_HLREG0_RXCRCSTRP;
 	else
 		hlreg0 &= ~IXGBE_HLREG0_RXCRCSTRP;
@@ -3594,11 +3593,11 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 	/*
 	 * Configure jumbo frame support, if any.
 	 */
-	if (dev->data->dev_conf.rxmode.jumbo_frame == 1) {
+	if (rx_conf->jumbo_frame == 1) {
 		hlreg0 |= IXGBE_HLREG0_JUMBOEN;
 		maxfrs = IXGBE_READ_REG(hw, IXGBE_MAXFRS);
 		maxfrs &= 0x0000FFFF;
-		maxfrs |= (dev->data->dev_conf.rxmode.max_rx_pkt_len << 16);
+		maxfrs |= (rx_conf->max_rx_pkt_len << 16);
 		IXGBE_WRITE_REG(hw, IXGBE_MAXFRS, maxfrs);
 	} else
 		hlreg0 &= ~IXGBE_HLREG0_JUMBOEN;
@@ -3622,9 +3621,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 		 * Reset crc_len in case it was changed after queue setup by a
 		 * call to configure.
 		 */
-		rxq->crc_len = (uint8_t)
-				((dev->data->dev_conf.rxmode.hw_strip_crc) ? 0 :
-				ETHER_CRC_LEN);
+		rxq->crc_len = rx_conf->hw_strip_crc ? 0 : ETHER_CRC_LEN;
 
 		/* Setup the Base and Length of the Rx Descriptor Rings */
 		bus_addr = rxq->rx_ring_phys_addr;
@@ -3642,7 +3639,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 		/*
 		 * Configure Header Split
 		 */
-		if (dev->data->dev_conf.rxmode.header_split) {
+		if (rx_conf->header_split) {
 			if (hw->mac.type == ixgbe_mac_82599EB) {
 				/* Must setup the PSRTYPE register */
 				uint32_t psrtype;
@@ -3652,7 +3649,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 					IXGBE_PSRTYPE_IPV6HDR;
 				IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx), psrtype);
 			}
-			srrctl = ((dev->data->dev_conf.rxmode.split_hdr_size <<
+			srrctl = ((rx_conf->split_hdr_size <<
 				IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &
 				IXGBE_SRRCTL_BSIZEHDR_MASK);
 			srrctl |= IXGBE_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
@@ -3686,7 +3683,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 			dev->data->scattered_rx = 1;
 	}
 
-	if (dev->data->dev_conf.rxmode.enable_scatter)
+	if (rx_conf->enable_scatter)
 		dev->data->scattered_rx = 1;
 
 	set_rx_function(dev);
@@ -3703,7 +3700,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 	 */
 	rxcsum = IXGBE_READ_REG(hw, IXGBE_RXCSUM);
 	rxcsum |= IXGBE_RXCSUM_PCSD;
-	if (dev->data->dev_conf.rxmode.hw_ip_checksum)
+	if (rx_conf->hw_ip_checksum)
 		rxcsum |= IXGBE_RXCSUM_IPPCSE;
 	else
 		rxcsum &= ~IXGBE_RXCSUM_IPPCSE;
@@ -3713,7 +3710,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 	if (hw->mac.type == ixgbe_mac_82599EB ||
 	    hw->mac.type == ixgbe_mac_X540) {
 		rdrxctl = IXGBE_READ_REG(hw, IXGBE_RDRXCTL);
-		if (dev->data->dev_conf.rxmode.hw_strip_crc)
+		if (rx_conf->hw_strip_crc)
 			rdrxctl |= IXGBE_RDRXCTL_CRCSTRIP;
 		else
 			rdrxctl &= ~IXGBE_RDRXCTL_CRCSTRIP;
-- 
2.1.0


* [dpdk-dev]  [PATCH v6 2/3] ixgbe: Code refactoring
  2015-03-09 19:07 [dpdk-dev] [PATCH v6 0/3]: Add LRO support to ixgbe PMD Vlad Zolotarov
  2015-03-09 19:07 ` [dpdk-dev] [PATCH v6 1/3] ixgbe: Cleanups Vlad Zolotarov
@ 2015-03-09 19:07 ` Vlad Zolotarov
  2015-03-09 19:07 ` [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support Vlad Zolotarov
  2 siblings, 0 replies; 18+ messages in thread
From: Vlad Zolotarov @ 2015-03-09 19:07 UTC (permalink / raw)
  To: dev

    - ixgbe_rx_alloc_bufs():
       - Reset the rte_mbuf fields only when requested.
       - Take the RDT update out of the function.
       - Add the stub when RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC is not defined.
    - ixgbe_recv_scattered_pkts():
       - Move the code that updates the fields of the cluster's HEAD buffer into
         an inline function.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
---
New in v3:
   - ixgbe_rx_alloc_bufs(): Always reset refcnt of the buffers to 1.
---
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 114 +++++++++++++++++++++++---------------
 1 file changed, 69 insertions(+), 45 deletions(-)

diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index e015981..58e619b 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -1022,7 +1022,7 @@ ixgbe_rx_scan_hw_ring(struct igb_rx_queue *rxq)
 }
 
 static inline int
-ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
+ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq, bool reset_mbuf)
 {
 	volatile union ixgbe_adv_rx_desc *rxdp;
 	struct igb_rx_entry *rxep;
@@ -1043,11 +1043,14 @@ ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
 	for (i = 0; i < rxq->rx_free_thresh; ++i) {
 		/* populate the static rte mbuf fields */
 		mb = rxep[i].mbuf;
+		if (reset_mbuf) {
+			mb->next = NULL;
+			mb->nb_segs = 1;
+			mb->port = rxq->port_id;
+		}
+
 		rte_mbuf_refcnt_set(mb, 1);
-		mb->next = NULL;
 		mb->data_off = RTE_PKTMBUF_HEADROOM;
-		mb->nb_segs = 1;
-		mb->port = rxq->port_id;
 
 		/* populate the descriptors */
 		dma_addr = rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb));
@@ -1055,10 +1058,6 @@ ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
 		rxdp[i].read.pkt_addr = dma_addr;
 	}
 
-	/* update tail pointer */
-	rte_wmb();
-	IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, rxq->rx_free_trigger);
-
 	/* update state of internal queue structure */
 	rxq->rx_free_trigger = rxq->rx_free_trigger + rxq->rx_free_thresh;
 	if (rxq->rx_free_trigger >= rxq->nb_rx_desc)
@@ -1110,7 +1109,9 @@ rx_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 
 	/* if required, allocate new buffers to replenish descriptors */
 	if (rxq->rx_tail > rxq->rx_free_trigger) {
-		if (ixgbe_rx_alloc_bufs(rxq) != 0) {
+		uint16_t cur_free_trigger = rxq->rx_free_trigger;
+
+		if (ixgbe_rx_alloc_bufs(rxq, true) != 0) {
 			int i, j;
 			PMD_RX_LOG(DEBUG, "RX mbuf alloc failed port_id=%u "
 				   "queue_id=%u", (unsigned) rxq->port_id,
@@ -1130,6 +1131,10 @@ rx_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 
 			return 0;
 		}
+
+		/* update tail pointer */
+		rte_wmb();
+		IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, cur_free_trigger);
 	}
 
 	if (rxq->rx_tail >= rxq->nb_rx_desc)
@@ -1169,6 +1174,13 @@ ixgbe_recv_pkts_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
 
 	return nb_rx;
 }
+#else
+static inline int
+ixgbe_rx_alloc_bufs(__rte_unused struct igb_rx_queue *rxq,
+		    __rte_unused bool reset_mbuf)
+{
+	return -ENOMEM;
+}
 #endif /* RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC */
 
 uint16_t
@@ -1353,6 +1365,51 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 	return (nb_rx);
 }
 
+/**
+ * Initialize the first mbuf of the returned packet:
+ *    - RX port identifier,
+ *    - hardware offload data, if any:
+ *      - RSS flag & hash,
+ *      - IP checksum flag,
+ *      - VLAN TCI, if any,
+ *      - error flags.
+ * @head HEAD of the packet cluster
+ * @desc HW descriptor to get data from
+ * @port_id Port ID of the Rx queue
+ */
+static inline void ixgbe_fill_cluster_head_buf(
+	struct rte_mbuf *head,
+	union ixgbe_adv_rx_desc *desc,
+	uint8_t port_id,
+	uint32_t staterr)
+{
+	uint32_t hlen_type_rss;
+	uint64_t pkt_flags;
+
+	head->port = port_id;
+
+	/*
+	 * The vlan_tci field is only valid when PKT_RX_VLAN_PKT is
+	 * set in the pkt_flags field.
+	 */
+	head->vlan_tci = rte_le_to_cpu_16(desc->wb.upper.vlan);
+	hlen_type_rss = rte_le_to_cpu_32(desc->wb.lower.lo_dword.data);
+	pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
+	pkt_flags |= rx_desc_status_to_pkt_flags(staterr);
+	pkt_flags |= rx_desc_error_to_pkt_flags(staterr);
+	head->ol_flags = pkt_flags;
+
+	if (likely(pkt_flags & PKT_RX_RSS_HASH))
+		head->hash.rss = rte_le_to_cpu_32(desc->wb.lower.hi_dword.rss);
+	else if (pkt_flags & PKT_RX_FDIR) {
+		head->hash.fdir.hash =
+			rte_le_to_cpu_16(desc->wb.lower.hi_dword.csum_ip.csum)
+							  & IXGBE_ATR_HASH_MASK;
+		head->hash.fdir.id =
+			rte_le_to_cpu_16(desc->wb.lower.hi_dword.csum_ip.ip_id);
+	}
+}
+
 uint16_t
 ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 			  uint16_t nb_pkts)
@@ -1369,12 +1426,10 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 	union ixgbe_adv_rx_desc rxd;
 	uint64_t dma; /* Physical address of mbuf data buffer */
 	uint32_t staterr;
-	uint32_t hlen_type_rss;
 	uint16_t rx_id;
 	uint16_t nb_rx;
 	uint16_t nb_hold;
 	uint16_t data_len;
-	uint64_t pkt_flags;
 
 	nb_rx = 0;
 	nb_hold = 0;
@@ -1532,40 +1587,9 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 					(uint16_t) (data_len - ETHER_CRC_LEN);
 		}
 
-		/*
-		 * Initialize the first mbuf of the returned packet:
-		 *    - RX port identifier,
-		 *    - hardware offload data, if any:
-		 *      - RSS flag & hash,
-		 *      - IP checksum flag,
-		 *      - VLAN TCI, if any,
-		 *      - error flags.
-		 */
-		first_seg->port = rxq->port_id;
-
-		/*
-		 * The vlan_tci field is only valid when PKT_RX_VLAN_PKT is
-		 * set in the pkt_flags field.
-		 */
-		first_seg->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
-		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
-		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
-		pkt_flags = (pkt_flags |
-				rx_desc_status_to_pkt_flags(staterr));
-		pkt_flags = (pkt_flags |
-				rx_desc_error_to_pkt_flags(staterr));
-		first_seg->ol_flags = pkt_flags;
-
-		if (likely(pkt_flags & PKT_RX_RSS_HASH))
-			first_seg->hash.rss =
-				    rte_le_to_cpu_32(rxd.wb.lower.hi_dword.rss);
-		else if (pkt_flags & PKT_RX_FDIR) {
-			first_seg->hash.fdir.hash =
-			    rte_le_to_cpu_16(rxd.wb.lower.hi_dword.csum_ip.csum)
-					   & IXGBE_ATR_HASH_MASK;
-			first_seg->hash.fdir.id =
-			  rte_le_to_cpu_16(rxd.wb.lower.hi_dword.csum_ip.ip_id);
-		}
+		/* Initialize the first mbuf of the returned packet */
+		ixgbe_fill_cluster_head_buf(first_seg, &rxd, rxq->port_id,
+					    staterr);
 
 		/* Prefetch data of first segment, if configured to do so. */
 		rte_packet_prefetch((char *)first_seg->buf_addr +
-- 
2.1.0


* [dpdk-dev]  [PATCH v6 3/3] ixgbe: Add LRO support
  2015-03-09 19:07 [dpdk-dev] [PATCH v6 0/3]: Add LRO support to ixgbe PMD Vlad Zolotarov
  2015-03-09 19:07 ` [dpdk-dev] [PATCH v6 1/3] ixgbe: Cleanups Vlad Zolotarov
  2015-03-09 19:07 ` [dpdk-dev] [PATCH v6 2/3] ixgbe: Code refactoring Vlad Zolotarov
@ 2015-03-09 19:07 ` Vlad Zolotarov
  2015-03-10  0:30   ` Ananyev, Konstantin
  2015-03-16 18:26   ` Vlad Zolotarov
  2 siblings, 2 replies; 18+ messages in thread
From: Vlad Zolotarov @ 2015-03-09 19:07 UTC (permalink / raw)
  To: dev

    - Only x540 and 82599 devices support LRO.
    - Add the appropriate HW configuration.
    - Add RSC aware rx_pkt_burst() handlers:
       - Implemented bulk allocation and non-bulk allocation versions.
       - Add LRO-specific fields to rte_eth_rxmode, to rte_eth_dev_data
         and to igb_rx_queue.
       - Use the appropriate handler when LRO is requested.

Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
---
New in v5:
   - Put the RTE_ETHDEV_HAS_LRO_SUPPORT definition at the beginning of rte_ethdev.h.
   - Removed the "TODO: Remove me" comment near RTE_ETHDEV_HAS_LRO_SUPPORT.

New in v4:
   - Define RTE_ETHDEV_HAS_LRO_SUPPORT in rte_ethdev.h instead of
     RTE_ETHDEV_LRO_SUPPORT defined in config/common_linuxapp.

New in v2:
   - Removed rte_eth_dev_data.lro_bulk_alloc.
   - Fixed a few styling and spelling issues.
---
 lib/librte_ether/rte_ethdev.h       |   9 +-
 lib/librte_pmd_ixgbe/ixgbe_ethdev.c |   6 +
 lib/librte_pmd_ixgbe/ixgbe_ethdev.h |   5 +
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c   | 562 +++++++++++++++++++++++++++++++++++-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h   |   6 +
 5 files changed, 581 insertions(+), 7 deletions(-)

diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 8db3127..44f081f 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -172,6 +172,9 @@ extern "C" {
 
 #include <stdint.h>
 
+/* Use this macro to check if LRO API is supported */
+#define RTE_ETHDEV_HAS_LRO_SUPPORT
+
 #include <rte_log.h>
 #include <rte_interrupts.h>
 #include <rte_pci.h>
@@ -320,14 +323,15 @@ struct rte_eth_rxmode {
 	enum rte_eth_rx_mq_mode mq_mode;
 	uint32_t max_rx_pkt_len;  /**< Only used if jumbo_frame enabled. */
 	uint16_t split_hdr_size;  /**< hdr buf size (header_split enabled).*/
-	uint8_t header_split : 1, /**< Header Split enable. */
+	uint16_t header_split : 1, /**< Header Split enable. */
 		hw_ip_checksum   : 1, /**< IP/UDP/TCP checksum offload enable. */
 		hw_vlan_filter   : 1, /**< VLAN filter enable. */
 		hw_vlan_strip    : 1, /**< VLAN strip enable. */
 		hw_vlan_extend   : 1, /**< Extended VLAN enable. */
 		jumbo_frame      : 1, /**< Jumbo Frame Receipt enable. */
 		hw_strip_crc     : 1, /**< Enable CRC stripping by hardware. */
-		enable_scatter   : 1; /**< Enable scatter packets rx handler */
+		enable_scatter   : 1, /**< Enable scatter packets rx handler */
+		enable_lro       : 1; /**< Enable LRO */
 };
 
 /**
@@ -1515,6 +1519,7 @@ struct rte_eth_dev_data {
 	uint8_t port_id;           /**< Device [external] port identifier. */
 	uint8_t promiscuous   : 1, /**< RX promiscuous mode ON(1) / OFF(0). */
 		scattered_rx : 1,  /**< RX of scattered packets is ON(1) / OFF(0) */
+		lro          : 1,  /**< RX LRO is ON(1) / OFF(0) */
 		all_multicast : 1, /**< RX all multicast mode ON(1) / OFF(0). */
 		dev_started : 1;   /**< Device state: STARTED(1) / STOPPED(0). */
 };
diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
index 9d3de1a..765174d 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
@@ -1648,6 +1648,7 @@ ixgbe_dev_stop(struct rte_eth_dev *dev)
 
 	/* Clear stored conf */
 	dev->data->scattered_rx = 0;
+	dev->data->lro = 0;
 	hw->rx_bulk_alloc_allowed = false;
 	hw->rx_vec_allowed = false;
 
@@ -2018,6 +2019,11 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
 		DEV_RX_OFFLOAD_IPV4_CKSUM |
 		DEV_RX_OFFLOAD_UDP_CKSUM  |
 		DEV_RX_OFFLOAD_TCP_CKSUM;
+
+	if (hw->mac.type == ixgbe_mac_82599EB ||
+	    hw->mac.type == ixgbe_mac_X540)
+		dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_TCP_LRO;
+
 	dev_info->tx_offload_capa =
 		DEV_TX_OFFLOAD_VLAN_INSERT |
 		DEV_TX_OFFLOAD_IPV4_CKSUM  |
diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
index a549f5c..e206584 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
@@ -349,6 +349,11 @@ uint16_t ixgbe_recv_pkts_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
 uint16_t ixgbe_recv_scattered_pkts(void *rx_queue,
 		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 
+uint16_t ixgbe_recv_pkts_lro(void *rx_queue,
+		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
+uint16_t ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue,
+		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
+
 uint16_t ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		uint16_t nb_pkts);
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 58e619b..944c662 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -1366,6 +1366,15 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 }
 
 /**
+ * Detect an RSC descriptor.
+ */
+static inline uint32_t ixgbe_rsc_count(union ixgbe_adv_rx_desc *rx)
+{
+	return (rte_le_to_cpu_32(rx->wb.lower.lo_dword.data) &
+		IXGBE_RXDADV_RSCCNT_MASK) >> IXGBE_RXDADV_RSCCNT_SHIFT;
+}
+
+/**
  * Initialize the first mbuf of the returned packet:
  *    - RX port identifier,
  *    - hardware offload data, if any:
@@ -1410,6 +1419,291 @@ static inline void ixgbe_fill_cluster_head_buf(
 	}
 }
 
+/**
+ * Bulk receive handler for the LRO case.
+ *
+ * @rx_queue Rx queue handle
+ * @rx_pkts table of received packets
+ * @nb_pkts size of rx_pkts table
+ * @bulk_alloc if TRUE bulk allocation is used for a HW ring refilling
+ *
+ * Handles the Rx HW ring completions when RSC feature is configured. Uses an
+ * additional ring of igb_rsc_entry's that will hold the relevant RSC info.
+ *
+ * We use the same logic as in Linux and in FreeBSD ixgbe drivers:
+ * 1) When non-EOP RSC completion arrives:
+ *    a) Update the HEAD of the current RSC aggregation cluster with the new
+ *       segment's data length.
+ *    b) Set the "next" pointer of the current segment to point to the segment
+ *       at the NEXTP index.
+ *    c) Pass the HEAD of RSC aggregation cluster on to the next NEXTP entry
+ *       in the sw_rsc_ring.
+ * 2) When EOP arrives we just update the cluster's total length and offload
+ *    flags and deliver the cluster up to the upper layers. In our case - put it
+ *    in the rx_pkts table.
+ *
+ * Returns the number of received packets/clusters (according to the "bulk
+ * receive" interface).
+ */
+static inline uint16_t
+_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
+	       bool bulk_alloc)
+{
+	struct igb_rx_queue *rxq = rx_queue;
+	volatile union ixgbe_adv_rx_desc *rx_ring = rxq->rx_ring;
+	struct igb_rx_entry *sw_ring = rxq->sw_ring;
+	struct igb_rsc_entry *sw_rsc_ring = rxq->sw_rsc_ring;
+	uint16_t rx_id = rxq->rx_tail;
+	uint16_t nb_rx = 0;
+	uint16_t nb_hold = rxq->nb_rx_hold;
+	uint16_t prev_id = rxq->rx_tail;
+
+	while (nb_rx < nb_pkts) {
+		bool eop;
+		struct igb_rx_entry *rxe;
+		struct igb_rsc_entry *rsc_entry;
+		struct igb_rsc_entry *next_rsc_entry;
+		struct igb_rx_entry *next_rxe;
+		struct rte_mbuf *first_seg;
+		struct rte_mbuf *rxm;
+		struct rte_mbuf *nmb;
+		union ixgbe_adv_rx_desc rxd;
+		uint16_t data_len;
+		uint16_t next_id;
+		volatile union ixgbe_adv_rx_desc *rxdp;
+		uint32_t staterr;
+
+next_desc:
+		/*
+		 * The code in this whole file uses the volatile pointer to
+		 * ensure the read ordering of the status and the rest of the
+		 * descriptor fields (on the compiler level only!!!). This is so
+		 * UGLY - why not to just use the compiler barrier instead? DPDK
+		 * even has the rte_compiler_barrier() for that.
+		 *
+		 * But most importantly this is just wrong because this doesn't
+		 * ensure memory ordering in a general case at all. For
+		 * instance, DPDK is supposed to work on Power CPUs where
+		 * compiler barrier may just not be enough!
+		 *
+		 * I tried to write only this function properly to have a
+		 * starting point (as a part of an LRO/RSC series) but the
+		 * compiler cursed at me when I tried to cast away the
+		 * "volatile" from rx_ring (yes, it's volatile too!!!). So, I'm
+		 * keeping it the way it is for now.
+		 *
+		 * The code in this file is broken in so many other places and
+		 * will just not work on a big endian CPU anyway therefore the
+		 * lines below will have to be revisited together with the rest
+		 * of the ixgbe PMD.
+		 *
+		 * TODO:
+		 *    - Get rid of "volatile" crap and let the compiler do its
+		 *      job.
+		 *    - Use the proper memory barrier (rte_rmb()) to ensure the
+		 *      memory ordering below.
+		 */
+		rxdp = &rx_ring[rx_id];
+		staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
+
+		if (!(staterr & IXGBE_RXDADV_STAT_DD))
+			break;
+
+		rxd = *rxdp;
+
+		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
+				  "staterr=0x%x data_len=%u",
+			   rxq->port_id, rxq->queue_id, rx_id, staterr,
+			   rte_le_to_cpu_16(rxd.wb.upper.length));
+
+		if (!bulk_alloc) {
+			nmb = rte_rxmbuf_alloc(rxq->mb_pool);
+			if (nmb == NULL) {
+				PMD_RX_LOG(DEBUG, "RX mbuf alloc failed "
+						  "port_id=%u queue_id=%u",
+					   rxq->port_id, rxq->queue_id);
+
+				rte_eth_devices[rxq->port_id].data->
+							rx_mbuf_alloc_failed++;
+				break;
+			}
+		} else if (nb_hold > rxq->rx_free_thresh) {
+			uint16_t next_rdt = rxq->rx_free_trigger;
+
+			if (!ixgbe_rx_alloc_bufs(rxq, false)) {
+				rte_wmb();
+				IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr,
+						    next_rdt);
+				nb_hold -= rxq->rx_free_thresh;
+			} else {
+				PMD_RX_LOG(DEBUG, "RX bulk alloc failed "
+						  "port_id=%u queue_id=%u",
+					   rxq->port_id, rxq->queue_id);
+
+				rte_eth_devices[rxq->port_id].data->
+							rx_mbuf_alloc_failed++;
+				break;
+			}
+		}
+
+		nb_hold++;
+		rxe = &sw_ring[rx_id];
+		eop = staterr & IXGBE_RXDADV_STAT_EOP;
+
+		next_id = rx_id + 1;
+		if (next_id == rxq->nb_rx_desc)
+			next_id = 0;
+
+		/* Prefetch next mbuf while processing current one. */
+		rte_ixgbe_prefetch(sw_ring[next_id].mbuf);
+
+		/*
+		 * When next RX descriptor is on a cache-line boundary,
+		 * prefetch the next 4 RX descriptors and the next 4 pointers
+		 * to mbufs.
+		 */
+		if ((next_id & 0x3) == 0) {
+			rte_ixgbe_prefetch(&rx_ring[next_id]);
+			rte_ixgbe_prefetch(&sw_ring[next_id]);
+		}
+
+		rxm = rxe->mbuf;
+
+		if (!bulk_alloc) {
+			__le64 dma =
+			  rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(nmb));
+			/*
+			 * Update RX descriptor with the physical address of the
+			 * new data buffer of the new allocated mbuf.
+			 */
+			rxe->mbuf = nmb;
+
+			rxm->data_off = RTE_PKTMBUF_HEADROOM;
+			rxdp->read.hdr_addr = dma;
+			rxdp->read.pkt_addr = dma;
+		}
+		/*
+		 * Set data length & data buffer address of mbuf.
+		 */
+		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
+		rxm->data_len = data_len;
+
+		if (!eop) {
+			uint16_t nextp_id;
+			/*
+			 * Get next descriptor index:
+			 *  - For RSC it's in the NEXTP field.
+			 *  - For a scattered packet - it's just a following
+			 *    descriptor.
+			 */
+			if (ixgbe_rsc_count(&rxd))
+				nextp_id =
+					(staterr & IXGBE_RXDADV_NEXTP_MASK) >>
+						       IXGBE_RXDADV_NEXTP_SHIFT;
+			else
+				nextp_id = next_id;
+
+			next_rsc_entry = &sw_rsc_ring[nextp_id];
+			next_rxe = &sw_ring[nextp_id];
+			rte_ixgbe_prefetch(next_rxe);
+		}
+
+		rsc_entry = &sw_rsc_ring[rx_id];
+		first_seg = rsc_entry->fbuf;
+		rsc_entry->fbuf = NULL;
+
+		/*
+		 * If this is the first buffer of the received packet,
+		 * set the pointer to the first mbuf of the packet and
+		 * initialize its context.
+		 * Otherwise, update the total length and the number of segments
+		 * of the current scattered packet, and update the pointer to
+		 * the last mbuf of the current packet.
+		 */
+		if (first_seg == NULL) {
+			first_seg = rxm;
+			first_seg->pkt_len = data_len;
+			first_seg->nb_segs = 1;
+		} else {
+			first_seg->pkt_len += data_len;
+			first_seg->nb_segs++;
+		}
+
+		prev_id = rx_id;
+		rx_id = next_id;
+
+		/*
+		 * If this is not the last buffer of the received packet, update
+		 * the pointer to the first mbuf at the NEXTP entry in the
+		 * sw_rsc_ring and continue to parse the RX ring.
+		 */
+		if (!eop) {
+			rxm->next = next_rxe->mbuf;
+			next_rsc_entry->fbuf = first_seg;
+			goto next_desc;
+		}
+
+		/*
+		 * This is the last buffer of the received packet - return
+		 * the current cluster to the user.
+		 */
+		rxm->next = NULL;
+
+		/* Initialize the first mbuf of the returned packet */
+		ixgbe_fill_cluster_head_buf(first_seg, &rxd, rxq->port_id,
+					    staterr);
+
+		/* Prefetch data of first segment, if configured to do so. */
+		rte_packet_prefetch((char *)first_seg->buf_addr +
+			first_seg->data_off);
+
+		/*
+		 * Store the mbuf address into the next entry of the array
+		 * of returned packets.
+		 */
+		rx_pkts[nb_rx++] = first_seg;
+	}
+
+	/*
+	 * Record index of the next RX descriptor to probe.
+	 */
+	rxq->rx_tail = rx_id;
+
+	/*
+	 * If the number of free RX descriptors is greater than the RX free
+	 * threshold of the queue, advance the Receive Descriptor Tail (RDT)
+	 * register.
+	 * Update the RDT with the value of the last processed RX descriptor
+	 * minus 1, to guarantee that the RDT register is never equal to the
+	 * RDH register, which creates a "full" ring situation from the
+	 * hardware point of view...
+	 */
+	if (!bulk_alloc && nb_hold > rxq->rx_free_thresh) {
+		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_tail=%u "
+			   "nb_hold=%u nb_rx=%u",
+			   rxq->port_id, rxq->queue_id, rx_id, nb_hold, nb_rx);
+
+		IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, prev_id);
+		nb_hold = 0;
+	}
+
+	rxq->nb_rx_hold = nb_hold;
+	return nb_rx;
+}
+
+uint16_t
+ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
+{
+	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, false);
+}
+
+uint16_t
+ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
+			       uint16_t nb_pkts)
+{
+	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, true);
+}
+
 uint16_t
 ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 			  uint16_t nb_pkts)
@@ -2024,6 +2318,7 @@ ixgbe_rx_queue_release(struct igb_rx_queue *rxq)
 	if (rxq != NULL) {
 		ixgbe_rx_queue_release_mbufs(rxq);
 		rte_free(rxq->sw_ring);
+		rte_free(rxq->sw_rsc_ring);
 		rte_free(rxq);
 	}
 }
@@ -2146,6 +2441,7 @@ ixgbe_reset_rx_queue(struct ixgbe_hw *hw, struct igb_rx_queue *rxq)
 	rxq->nb_rx_hold = 0;
 	rxq->pkt_first_seg = NULL;
 	rxq->pkt_last_seg = NULL;
+	rxq->rsc_en = 0;
 }
 
 int
@@ -2160,6 +2456,14 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
 	struct igb_rx_queue *rxq;
 	struct ixgbe_hw     *hw;
 	uint16_t len;
+	struct rte_eth_dev_info dev_info = { 0 };
+	struct rte_eth_rxmode *dev_rx_mode = &dev->data->dev_conf.rxmode;
+	bool rsc_requested = false;
+
+	dev->dev_ops->dev_infos_get(dev, &dev_info);
+	if ((dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO) &&
+	    dev_rx_mode->enable_lro)
+		rsc_requested = true;
 
 	PMD_INIT_FUNC_TRACE();
 	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
@@ -2265,12 +2569,28 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
 	rxq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
 					  sizeof(struct igb_rx_entry) * len,
 					  RTE_CACHE_LINE_SIZE, socket_id);
-	if (rxq->sw_ring == NULL) {
+	if (!rxq->sw_ring) {
 		ixgbe_rx_queue_release(rxq);
 		return (-ENOMEM);
 	}
-	PMD_INIT_LOG(DEBUG, "sw_ring=%p hw_ring=%p dma_addr=0x%"PRIx64,
-		     rxq->sw_ring, rxq->rx_ring, rxq->rx_ring_phys_addr);
+
+	if (rsc_requested) {
+		rxq->sw_rsc_ring =
+			rte_zmalloc_socket("rxq->sw_rsc_ring",
+					   sizeof(struct igb_rsc_entry) * len,
+					   RTE_CACHE_LINE_SIZE, socket_id);
+		if (!rxq->sw_rsc_ring) {
+			ixgbe_rx_queue_release(rxq);
+			return (-ENOMEM);
+		}
+	} else {
+		rxq->sw_rsc_ring = NULL;
+	}
+
+	PMD_INIT_LOG(DEBUG, "sw_ring=%p sw_rsc_ring=%p hw_ring=%p "
+			    "dma_addr=0x%"PRIx64,
+		     rxq->sw_ring, rxq->sw_rsc_ring, rxq->rx_ring,
+		     rxq->rx_ring_phys_addr);
 
 	if (!rte_is_power_of_2(nb_desc)) {
 		PMD_INIT_LOG(DEBUG, "queue[%d] doesn't meet Vector Rx "
@@ -3515,6 +3835,84 @@ ixgbe_dev_mq_tx_configure(struct rte_eth_dev *dev)
 	return 0;
 }
 
+/**
+ * get_rscctl_maxdesc - Calculate the RSCCTL[n].MAXDESC for PF
+ *
+ * Return the RSCCTL[n].MAXDESC for 82599 and x540 PF devices according to the
+ * spec rev. 3.0 chapter 8.2.3.8.13.
+ *
+ * @pool Memory pool of the Rx queue
+ */
+static inline uint32_t get_rscctl_maxdesc(struct rte_mempool *pool)
+{
+	struct rte_pktmbuf_pool_private *mp_priv = rte_mempool_get_priv(pool);
+
+	/* MAXDESC * SRRCTL.BSIZEPKT must not exceed 64 KB minus one */
+	uint16_t maxdesc =
+		65535 / (mp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM);
+
+	if (maxdesc >= 16)
+		return IXGBE_RSCCTL_MAXDESC_16;
+	else if (maxdesc >= 8)
+		return IXGBE_RSCCTL_MAXDESC_8;
+	else if (maxdesc >= 4)
+		return IXGBE_RSCCTL_MAXDESC_4;
+	else
+		return IXGBE_RSCCTL_MAXDESC_1;
+}
+
+/* (Taken from FreeBSD tree)
+** Setup the correct IVAR register for a particular MSIX interrupt
+**   (yes this is all very magic and confusing :)
+**  - entry is the register array entry
+**  - vector is the MSIX vector for this queue
+**  - type is RX/TX/MISC
+*/
+static void
+ixgbe_set_ivar(struct rte_eth_dev *dev, u8 entry, u8 vector, s8 type)
+{
+	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
+	u32 ivar, index;
+
+	vector |= IXGBE_IVAR_ALLOC_VAL;
+
+	switch (hw->mac.type) {
+
+	case ixgbe_mac_82598EB:
+		if (type == -1)
+			entry = IXGBE_IVAR_OTHER_CAUSES_INDEX;
+		else
+			entry += (type * 64);
+		index = (entry >> 2) & 0x1F;
+		ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(index));
+		ivar &= ~(0xFF << (8 * (entry & 0x3)));
+		ivar |= (vector << (8 * (entry & 0x3)));
+		IXGBE_WRITE_REG(hw, IXGBE_IVAR(index), ivar);
+		break;
+
+	case ixgbe_mac_82599EB:
+	case ixgbe_mac_X540:
+		if (type == -1) { /* MISC IVAR */
+			index = (entry & 1) * 8;
+			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR_MISC);
+			ivar &= ~(0xFF << index);
+			ivar |= (vector << index);
+			IXGBE_WRITE_REG(hw, IXGBE_IVAR_MISC, ivar);
+		} else {	/* RX/TX IVARS */
+			index = (16 * (entry & 1)) + (8 * type);
+			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(entry >> 1));
+			ivar &= ~(0xFF << index);
+			ivar |= (vector << index);
+			IXGBE_WRITE_REG(hw, IXGBE_IVAR(entry >> 1), ivar);
+		}
+
+		break;
+
+	default:
+		break;
+	}
+}
+
 void set_rx_function(struct rte_eth_dev *dev)
 {
 	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
@@ -3565,6 +3963,25 @@ void set_rx_function(struct rte_eth_dev *dev)
 			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
 		}
 	}
+
+	/*
+	 * Initialize the appropriate LRO callback.
+	 *
+	 * If all queues satisfy the bulk allocation preconditions
+	 * (hw->rx_bulk_alloc_allowed is TRUE) then we may use bulk allocation.
+	 * Otherwise use a single allocation version.
+	 */
+	if (dev->data->lro) {
+		if (hw->rx_bulk_alloc_allowed) {
+			PMD_INIT_LOG(INFO, "LRO is requested. Using a bulk "
+					   "allocation version");
+			dev->rx_pkt_burst = ixgbe_recv_pkts_lro_bulk_alloc;
+		} else {
+			PMD_INIT_LOG(INFO, "LRO is requested. Using a single "
+					   "allocation version");
+			dev->rx_pkt_burst = ixgbe_recv_pkts_lro;
+		}
+	}
 }
 
 /*
@@ -3583,10 +4000,26 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 	uint32_t maxfrs;
 	uint32_t srrctl;
 	uint32_t rdrxctl;
+	uint32_t rscctl;
+	uint32_t psrtype;
+	uint32_t rfctl;
 	uint32_t rxcsum;
 	uint16_t buf_size;
 	uint16_t i;
 	struct rte_eth_rxmode *rx_conf = &dev->data->dev_conf.rxmode;
+	struct rte_eth_dev_info dev_info = { 0 };
+	bool rsc_capable = false;
+
+	/* Sanity check */
+	dev->dev_ops->dev_infos_get(dev, &dev_info);
+	if (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO)
+		rsc_capable = true;
+
+	if (!rsc_capable && rx_conf->enable_lro) {
+		PMD_INIT_LOG(CRIT, "LRO is requested on HW that doesn't "
+				   "support it");
+		return -EINVAL;
+	}
 
 	PMD_INIT_FUNC_TRACE();
 	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
@@ -3606,13 +4039,44 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
 
 	/*
+	 * RFCTL configuration
+	 *
+	 * Since NFS packets coalescing is not supported - clear RFCTL.NFSW_DIS
+	 * and RFCTL.NFSR_DIS when RSC is enabled.
+	 */
+	if (rsc_capable) {
+		rfctl = IXGBE_READ_REG(hw, IXGBE_RFCTL);
+		if (rx_conf->enable_lro) {
+			rfctl &= ~(IXGBE_RFCTL_RSC_DIS | IXGBE_RFCTL_NFSW_DIS |
+				   IXGBE_RFCTL_NFSR_DIS);
+		} else {
+			rfctl |= IXGBE_RFCTL_RSC_DIS;
+		}
+
+		IXGBE_WRITE_REG(hw, IXGBE_RFCTL, rfctl);
+	}
+
+
+	/*
 	 * Configure CRC stripping, if any.
 	 */
 	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
 	if (rx_conf->hw_strip_crc)
 		hlreg0 |= IXGBE_HLREG0_RXCRCSTRP;
-	else
+	else {
 		hlreg0 &= ~IXGBE_HLREG0_RXCRCSTRP;
+		if (rx_conf->enable_lro) {
+			/*
+			 * According to chapter 4.6.7.2.1 of the Spec Rev.
+			 * 3.0 RSC configuration requires HW CRC stripping being
+			 * enabled. If user requested both HW CRC stripping off
+			 * and RSC on - return an error.
+			 */
+			PMD_INIT_LOG(CRIT, "LRO can't be enabled when HW CRC "
+					    "is disabled");
+			return -EINVAL;
+		}
+	}
 
 	/*
 	 * Configure jumbo frame support, if any.
@@ -3664,9 +4128,18 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 		 * Configure Header Split
 		 */
 		if (rx_conf->header_split) {
+			/*
+			 * Print a warning if split_hdr_size is less
+			 * than 128 bytes when RSC is requested.
+			 */
+			if (rx_conf->enable_lro &&
+			    rx_conf->split_hdr_size < 128)
+				PMD_INIT_LOG(INFO, "split_hdr_size less than "
+						   "128 bytes (%d)!",
+					     rx_conf->split_hdr_size);
+
 			if (hw->mac.type == ixgbe_mac_82599EB) {
 				/* Must setup the PSRTYPE register */
-				uint32_t psrtype;
 				psrtype = IXGBE_PSRTYPE_TCPHDR |
 					IXGBE_PSRTYPE_UDPHDR   |
 					IXGBE_PSRTYPE_IPV4HDR  |
@@ -3679,7 +4152,20 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 			srrctl |= IXGBE_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
 		} else
 #endif
+		{
 			srrctl = IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF;
+			/*
+			 * Following the 4.6.7.2.1 chapter of the 82599/x540
+			 * Spec if RSC is enabled the SRRCTL[n].BSIZEHEADER
+			 * should be configured even if header split is not
+			 * enabled. In the latter case we will configure it to 128
+			 * bytes following the recommendation in the spec.
+			 */
+			if (rx_conf->enable_lro)
+				srrctl |=
+				     ((128 << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &
+						    IXGBE_SRRCTL_BSIZEHDR_MASK);
+		}
 
 		/* Set if packets are dropped when no descriptors available */
 		if (rxq->drop_en)
@@ -3696,6 +4182,13 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 				       RTE_PKTMBUF_HEADROOM);
 		srrctl |= ((buf_size >> IXGBE_SRRCTL_BSIZEPKT_SHIFT) &
 			   IXGBE_SRRCTL_BSIZEPKT_MASK);
+
+		/*
+		 * TODO: Consider setting the Receive Descriptor Minimum
+		 * Threshold Size for the RSC case. This is not an obviously
+		 * beneficial option but it is worth considering...
+		 */
+
 		IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(rxq->reg_idx), srrctl);
 
 		buf_size = (uint16_t) ((srrctl & IXGBE_SRRCTL_BSIZEPKT_MASK) <<
@@ -3705,11 +4198,57 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 		if (dev->data->dev_conf.rxmode.max_rx_pkt_len +
 					    2 * IXGBE_VLAN_TAG_SIZE > buf_size)
 			dev->data->scattered_rx = 1;
+
+		/* RSC per-queue configuration */
+		if (rx_conf->enable_lro) {
+			uint32_t eitr;
+
+			rscctl =
+				IXGBE_READ_REG(hw, IXGBE_RSCCTL(rxq->reg_idx));
+			psrtype =
+				IXGBE_READ_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx));
+			eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
+
+			rscctl |= IXGBE_RSCCTL_RSCEN;
+			rscctl |= get_rscctl_maxdesc(rxq->mb_pool);
+			psrtype |= IXGBE_PSRTYPE_TCPHDR;
+
+			/*
+			 * RSC: Set ITR interval corresponding to 2K ints/s.
+			 *
+			 * Full-sized RSC aggregations for a 10Gb/s link will
+			 * arrive at about 20K aggregation/s rate.
+			 *
+			 * 2K ints/s rate will make only 10% of the
+			 * aggregations to be closed due to the interrupt timer
+			 * expiration for a streaming at wire-speed case.
+			 *
+			 * For a sparse streaming case this setting will yield
+			 * at most 500us latency for a single RSC aggregation.
+			 */
+			eitr   |= (2000 | IXGBE_EITR_CNT_WDIS);
+
+			IXGBE_WRITE_REG(hw, IXGBE_RSCCTL(rxq->reg_idx), rscctl);
+			IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx),
+								       psrtype);
+			IXGBE_WRITE_REG(hw, IXGBE_EITR(rxq->reg_idx), eitr);
+
+			/*
+			 * RSC requires the mapping of the queue to the
+			 * interrupt vector.
+			 */
+			ixgbe_set_ivar(dev, rxq->reg_idx, i, 0);
+
+			rxq->rsc_en = 1;
+		}
 	}
 
 	if (rx_conf->enable_scatter)
 		dev->data->scattered_rx = 1;
 
+	if (rx_conf->enable_lro)
+		dev->data->lro = 1;
+
 	set_rx_function(dev);
 
 	/*
@@ -3742,6 +4281,19 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
 	}
 
+	/* Finalize RSC configuration  */
+	if (rx_conf->enable_lro) {
+		/*
+		 * Follow the instructions in the 4.6.7.2.1 of the Spec Rev. 3.0
+		 */
+		rdrxctl = IXGBE_READ_REG(hw, IXGBE_RDRXCTL);
+		rdrxctl |= IXGBE_RDRXCTL_RSCACKC;
+		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
+
+		PMD_INIT_LOG(INFO, "enabling LRO mode");
+	}
+
+
 	return 0;
 }
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index bbe5ff3..389173f 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -79,6 +79,10 @@ struct igb_rx_entry {
 	struct rte_mbuf *mbuf; /**< mbuf associated with RX descriptor. */
 };
 
+struct igb_rsc_entry {
+	struct rte_mbuf *fbuf; /**< First segment of the fragmented packet. */
+};
+
 /**
  * Structure associated with each descriptor of the TX ring of a TX queue.
  */
@@ -105,6 +109,7 @@ struct igb_rx_queue {
 	volatile uint32_t   *rdt_reg_addr; /**< RDT register address. */
 	volatile uint32_t   *rdh_reg_addr; /**< RDH register address. */
 	struct igb_rx_entry *sw_ring; /**< address of RX software ring. */
+	struct igb_rsc_entry *sw_rsc_ring; /**< address of RSC software ring. */
 	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet. */
 	struct rte_mbuf *pkt_last_seg; /**< Last segment of current packet. */
 	uint64_t            mbuf_initializer; /**< value to init mbufs */
@@ -126,6 +131,7 @@ struct igb_rx_queue {
 	uint8_t             port_id;  /**< Device port identifier. */
 	uint8_t             crc_len;  /**< 0 if CRC stripped, 4 otherwise. */
 	uint8_t             drop_en;  /**< If not 0, set SRRCTL.Drop_En. */
+	uint8_t             rsc_en;   /**< If not 0, RSC is enabled. */
 	uint8_t             rx_deferred_start; /**< not in global dev start. */
 #ifdef RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC
 	/** need to alloc dummy mbuf, for wraparound when scanning hw ring */
-- 
2.1.0


* Re: [dpdk-dev] [PATCH v6 1/3] ixgbe: Cleanups
  2015-03-09 19:07 ` [dpdk-dev] [PATCH v6 1/3] ixgbe: Cleanups Vlad Zolotarov
@ 2015-03-09 20:15   ` Ananyev, Konstantin
  0 siblings, 0 replies; 18+ messages in thread
From: Ananyev, Konstantin @ 2015-03-09 20:15 UTC (permalink / raw)
  To: Vlad Zolotarov, dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vlad Zolotarov
> Sent: Monday, March 09, 2015 7:07 PM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH v6 1/3] ixgbe: Cleanups
> 
>    - Removed unneeded casts.
>    - ixgbe_dev_rx_init(): shorten the lines by defining a local alias variable to access
>                           &dev->data->dev_conf.rxmode.
> 
> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> ---
> New in v6:
>    - Fixed a compilation error caused by patch recomposition during the series separation.
> ---
>  lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 29 +++++++++++++----------------
>  1 file changed, 13 insertions(+), 16 deletions(-)
> 
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> index 99c4bde..e015981 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> @@ -1032,8 +1032,7 @@ ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
>  	int diag, i;
> 
>  	/* allocate buffers in bulk directly into the S/W ring */
> -	alloc_idx = (uint16_t)(rxq->rx_free_trigger -
> -				(rxq->rx_free_thresh - 1));
> +	alloc_idx = rxq->rx_free_trigger - (rxq->rx_free_thresh - 1);
>  	rxep = &rxq->sw_ring[alloc_idx];
>  	diag = rte_mempool_get_bulk(rxq->mb_pool, (void *)rxep,
>  				    rxq->rx_free_thresh);
> @@ -1061,10 +1060,9 @@ ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
>  	IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, rxq->rx_free_trigger);
> 
>  	/* update state of internal queue structure */
> -	rxq->rx_free_trigger = (uint16_t)(rxq->rx_free_trigger +
> -						rxq->rx_free_thresh);
> +	rxq->rx_free_trigger = rxq->rx_free_trigger + rxq->rx_free_thresh;
>  	if (rxq->rx_free_trigger >= rxq->nb_rx_desc)
> -		rxq->rx_free_trigger = (uint16_t)(rxq->rx_free_thresh - 1);
> +		rxq->rx_free_trigger = rxq->rx_free_thresh - 1;
> 
>  	/* no errors */
>  	return 0;
> @@ -3564,6 +3562,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  	uint32_t rxcsum;
>  	uint16_t buf_size;
>  	uint16_t i;
> +	struct rte_eth_rxmode *rx_conf = &dev->data->dev_conf.rxmode;
> 
>  	PMD_INIT_FUNC_TRACE();
>  	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> @@ -3586,7 +3585,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  	 * Configure CRC stripping, if any.
>  	 */
>  	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
> -	if (dev->data->dev_conf.rxmode.hw_strip_crc)
> +	if (rx_conf->hw_strip_crc)
>  		hlreg0 |= IXGBE_HLREG0_RXCRCSTRP;
>  	else
>  		hlreg0 &= ~IXGBE_HLREG0_RXCRCSTRP;
> @@ -3594,11 +3593,11 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  	/*
>  	 * Configure jumbo frame support, if any.
>  	 */
> -	if (dev->data->dev_conf.rxmode.jumbo_frame == 1) {
> +	if (rx_conf->jumbo_frame == 1) {
>  		hlreg0 |= IXGBE_HLREG0_JUMBOEN;
>  		maxfrs = IXGBE_READ_REG(hw, IXGBE_MAXFRS);
>  		maxfrs &= 0x0000FFFF;
> -		maxfrs |= (dev->data->dev_conf.rxmode.max_rx_pkt_len << 16);
> +		maxfrs |= (rx_conf->max_rx_pkt_len << 16);
>  		IXGBE_WRITE_REG(hw, IXGBE_MAXFRS, maxfrs);
>  	} else
>  		hlreg0 &= ~IXGBE_HLREG0_JUMBOEN;
> @@ -3622,9 +3621,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  		 * Reset crc_len in case it was changed after queue setup by a
>  		 * call to configure.
>  		 */
> -		rxq->crc_len = (uint8_t)
> -				((dev->data->dev_conf.rxmode.hw_strip_crc) ? 0 :
> -				ETHER_CRC_LEN);
> +		rxq->crc_len = rx_conf->hw_strip_crc ? 0 : ETHER_CRC_LEN;
> 
>  		/* Setup the Base and Length of the Rx Descriptor Rings */
>  		bus_addr = rxq->rx_ring_phys_addr;
> @@ -3642,7 +3639,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  		/*
>  		 * Configure Header Split
>  		 */
> -		if (dev->data->dev_conf.rxmode.header_split) {
> +		if (rx_conf->header_split) {
>  			if (hw->mac.type == ixgbe_mac_82599EB) {
>  				/* Must setup the PSRTYPE register */
>  				uint32_t psrtype;
> @@ -3652,7 +3649,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  					IXGBE_PSRTYPE_IPV6HDR;
>  				IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx), psrtype);
>  			}
> -			srrctl = ((dev->data->dev_conf.rxmode.split_hdr_size <<
> +			srrctl = ((rx_conf->split_hdr_size <<
>  				IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &
>  				IXGBE_SRRCTL_BSIZEHDR_MASK);
>  			srrctl |= IXGBE_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
> @@ -3686,7 +3683,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  			dev->data->scattered_rx = 1;
>  	}
> 
> -	if (dev->data->dev_conf.rxmode.enable_scatter)
> +	if (rx_conf->enable_scatter)
>  		dev->data->scattered_rx = 1;
> 
>  	set_rx_function(dev);
> @@ -3703,7 +3700,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  	 */
>  	rxcsum = IXGBE_READ_REG(hw, IXGBE_RXCSUM);
>  	rxcsum |= IXGBE_RXCSUM_PCSD;
> -	if (dev->data->dev_conf.rxmode.hw_ip_checksum)
> +	if (rx_conf->hw_ip_checksum)
>  		rxcsum |= IXGBE_RXCSUM_IPPCSE;
>  	else
>  		rxcsum &= ~IXGBE_RXCSUM_IPPCSE;
> @@ -3713,7 +3710,7 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  	if (hw->mac.type == ixgbe_mac_82599EB ||
>  	    hw->mac.type == ixgbe_mac_X540) {
>  		rdrxctl = IXGBE_READ_REG(hw, IXGBE_RDRXCTL);
> -		if (dev->data->dev_conf.rxmode.hw_strip_crc)
> +		if (rx_conf->hw_strip_crc)
>  			rdrxctl |= IXGBE_RDRXCTL_CRCSTRIP;
>  		else
>  			rdrxctl &= ~IXGBE_RDRXCTL_CRCSTRIP;
> --
> 2.1.0


* Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
  2015-03-09 19:07 ` [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support Vlad Zolotarov
@ 2015-03-10  0:30   ` Ananyev, Konstantin
  2015-03-10 13:22     ` Vlad Zolotarov
  2015-03-10 17:51     ` Vlad Zolotarov
  2015-03-16 18:26   ` Vlad Zolotarov
  1 sibling, 2 replies; 18+ messages in thread
From: Ananyev, Konstantin @ 2015-03-10  0:30 UTC (permalink / raw)
  To: Vlad Zolotarov, dev

Hi Vlad,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vlad Zolotarov
> Sent: Monday, March 09, 2015 7:07 PM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
> 
>     - Only x540 and 82599 devices support LRO.
>     - Add the appropriate HW configuration.
>     - Add RSC aware rx_pkt_burst() handlers:
>        - Implemented bulk allocation and non-bulk allocation versions.
>        - Add LRO-specific fields to rte_eth_rxmode, to rte_eth_dev_data
>          and to igb_rx_queue.
>        - Use the appropriate handler when LRO is requested.
> 
> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
> ---
> New in v5:
>    - Put the RTE_ETHDEV_HAS_LRO_SUPPORT definition at the beginning of rte_ethdev.h.
>    - Removed the "TODO: Remove me" comment near RTE_ETHDEV_HAS_LRO_SUPPORT.
> 
> New in v4:
>    - Define RTE_ETHDEV_HAS_LRO_SUPPORT in rte_ethdev.h instead of
>      RTE_ETHDEV_LRO_SUPPORT defined in config/common_linuxapp.
> 
> New in v2:
>    - Removed rte_eth_dev_data.lro_bulk_alloc.
>    - Fixed a few styling and spelling issues.
> ---
>  lib/librte_ether/rte_ethdev.h       |   9 +-
>  lib/librte_pmd_ixgbe/ixgbe_ethdev.c |   6 +
>  lib/librte_pmd_ixgbe/ixgbe_ethdev.h |   5 +
>  lib/librte_pmd_ixgbe/ixgbe_rxtx.c   | 562 +++++++++++++++++++++++++++++++++++-
>  lib/librte_pmd_ixgbe/ixgbe_rxtx.h   |   6 +
>  5 files changed, 581 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
> index 8db3127..44f081f 100644
> --- a/lib/librte_ether/rte_ethdev.h
> +++ b/lib/librte_ether/rte_ethdev.h
> @@ -172,6 +172,9 @@ extern "C" {
> 
>  #include <stdint.h>
> 
> +/* Use this macro to check if LRO API is supported */
> +#define RTE_ETHDEV_HAS_LRO_SUPPORT
> +
>  #include <rte_log.h>
>  #include <rte_interrupts.h>
>  #include <rte_pci.h>
> @@ -320,14 +323,15 @@ struct rte_eth_rxmode {
>  	enum rte_eth_rx_mq_mode mq_mode;
>  	uint32_t max_rx_pkt_len;  /**< Only used if jumbo_frame enabled. */
>  	uint16_t split_hdr_size;  /**< hdr buf size (header_split enabled).*/
> -	uint8_t header_split : 1, /**< Header Split enable. */
> +	uint16_t header_split : 1, /**< Header Split enable. */
>  		hw_ip_checksum   : 1, /**< IP/UDP/TCP checksum offload enable. */
>  		hw_vlan_filter   : 1, /**< VLAN filter enable. */
>  		hw_vlan_strip    : 1, /**< VLAN strip enable. */
>  		hw_vlan_extend   : 1, /**< Extended VLAN enable. */
>  		jumbo_frame      : 1, /**< Jumbo Frame Receipt enable. */
>  		hw_strip_crc     : 1, /**< Enable CRC stripping by hardware. */
> -		enable_scatter   : 1; /**< Enable scatter packets rx handler */
> +		enable_scatter   : 1, /**< Enable scatter packets rx handler */
> +		enable_lro       : 1; /**< Enable LRO */
>  };
> 
>  /**
> @@ -1515,6 +1519,7 @@ struct rte_eth_dev_data {
>  	uint8_t port_id;           /**< Device [external] port identifier. */
>  	uint8_t promiscuous   : 1, /**< RX promiscuous mode ON(1) / OFF(0). */
>  		scattered_rx : 1,  /**< RX of scattered packets is ON(1) / OFF(0) */
> +		lro          : 1,  /**< RX LRO is ON(1) / OFF(0) */
>  		all_multicast : 1, /**< RX all multicast mode ON(1) / OFF(0). */
>  		dev_started : 1;   /**< Device state: STARTED(1) / STOPPED(0). */
>  };
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> index 9d3de1a..765174d 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> @@ -1648,6 +1648,7 @@ ixgbe_dev_stop(struct rte_eth_dev *dev)
> 
>  	/* Clear stored conf */
>  	dev->data->scattered_rx = 0;
> +	dev->data->lro = 0;
>  	hw->rx_bulk_alloc_allowed = false;
>  	hw->rx_vec_allowed = false;
> 
> @@ -2018,6 +2019,11 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
>  		DEV_RX_OFFLOAD_IPV4_CKSUM |
>  		DEV_RX_OFFLOAD_UDP_CKSUM  |
>  		DEV_RX_OFFLOAD_TCP_CKSUM;
> +
> +	if (hw->mac.type == ixgbe_mac_82599EB ||
> +	    hw->mac.type == ixgbe_mac_X540)
> +		dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_TCP_LRO;
> +
>  	dev_info->tx_offload_capa =
>  		DEV_TX_OFFLOAD_VLAN_INSERT |
>  		DEV_TX_OFFLOAD_IPV4_CKSUM  |
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> index a549f5c..e206584 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> @@ -349,6 +349,11 @@ uint16_t ixgbe_recv_pkts_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
>  uint16_t ixgbe_recv_scattered_pkts(void *rx_queue,
>  		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> 
> +uint16_t ixgbe_recv_pkts_lro(void *rx_queue,
> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> +uint16_t ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue,
> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> +
>  uint16_t ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>  		uint16_t nb_pkts);
> 
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> index 58e619b..944c662 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> @@ -1366,6 +1366,15 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>  }
> 
>  /**
> + * Detect an RSC descriptor.
> + */
> +static inline uint32_t ixgbe_rsc_count(union ixgbe_adv_rx_desc *rx)
> +{
> +	return (rte_le_to_cpu_32(rx->wb.lower.lo_dword.data) &
> +		IXGBE_RXDADV_RSCCNT_MASK) >> IXGBE_RXDADV_RSCCNT_SHIFT;
> +}
> +
> +/**
>   * Initialize the first mbuf of the returned packet:
>   *    - RX port identifier,
>   *    - hardware offload data, if any:
> @@ -1410,6 +1419,291 @@ static inline void ixgbe_fill_cluster_head_buf(
>  	}
>  }
> 
> +/**
> + * Bulk receive handler for the LRO case.
> + *
> + * @rx_queue Rx queue handle
> + * @rx_pkts table of received packets
> + * @nb_pkts size of rx_pkts table
> + * @bulk_alloc if TRUE bulk allocation is used for a HW ring refilling
> + *
> + * Handles the Rx HW ring completions when RSC feature is configured. Uses an
> + * additional ring of igb_rsc_entry's that will hold the relevant RSC info.
> + *
> + * We use the same logic as in Linux and in FreeBSD ixgbe drivers:
> + * 1) When non-EOP RSC completion arrives:
> + *    a) Update the HEAD of the current RSC aggregation cluster with the new
> + *       segment's data length.
> + *    b) Set the "next" pointer of the current segment to point to the segment
> + *       at the NEXTP index.
> + *    c) Pass the HEAD of RSC aggregation cluster on to the next NEXTP entry
> + *       in the sw_rsc_ring.
> + * 2) When EOP arrives we just update the cluster's total length and offload
> + *    flags and deliver the cluster up to the upper layers. In our case - put it
> + *    in the rx_pkts table.
> + *
> + * Returns the number of received packets/clusters (according to the "bulk
> + * receive" interface).
> + */
> +static inline uint16_t
> +_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
> +	       bool bulk_alloc)
> +{
> +	struct igb_rx_queue *rxq = rx_queue;
> +	volatile union ixgbe_adv_rx_desc *rx_ring = rxq->rx_ring;
> +	struct igb_rx_entry *sw_ring = rxq->sw_ring;
> +	struct igb_rsc_entry *sw_rsc_ring = rxq->sw_rsc_ring;
> +	uint16_t rx_id = rxq->rx_tail;
> +	uint16_t nb_rx = 0;
> +	uint16_t nb_hold = rxq->nb_rx_hold;
> +	uint16_t prev_id = rxq->rx_tail;
> +
> +	while (nb_rx < nb_pkts) {
> +		bool eop;
> +		struct igb_rx_entry *rxe;
> +		struct igb_rsc_entry *rsc_entry;
> +		struct igb_rsc_entry *next_rsc_entry;
> +		struct igb_rx_entry *next_rxe;
> +		struct rte_mbuf *first_seg;
> +		struct rte_mbuf *rxm;
> +		struct rte_mbuf *nmb;
> +		union ixgbe_adv_rx_desc rxd;
> +		uint16_t data_len;
> +		uint16_t next_id;
> +		volatile union ixgbe_adv_rx_desc *rxdp;
> +		uint32_t staterr;
> +
> +next_desc:
> +		/*
> +		 * The code in this whole file uses the volatile pointer to
> +		 * ensure the read ordering of the status and the rest of the
> +		 * descriptor fields (on the compiler level only!!!). This is so
> +		 * UGLY - why not to just use the compiler barrier instead? DPDK
> +		 * even has the rte_compiler_barrier() for that.
> +		 *
> +		 * But most importantly this is just wrong because this doesn't
> +		 * ensure memory ordering in a general case at all. For
> +		 * instance, DPDK is supposed to work on Power CPUs where
> +		 * compiler barrier may just not be enough!
> +		 *
> +		 * I tried to write only this function properly to have a
> +		 * starting point (as a part of an LRO/RSC series) but the
> +		 * compiler cursed at me when I tried to cast away the
> +		 * "volatile" from rx_ring (yes, it's volatile too!!!). So, I'm
> +		 * keeping it the way it is for now.
> +		 *
> +		 * The code in this file is broken in so many other places and
> +		 * will just not work on a big endian CPU anyway therefore the
> +		 * lines below will have to be revisited together with the rest
> +		 * of the ixgbe PMD.
> +		 *
> +		 * TODO:
> +		 *    - Get rid of "volatile" crap and let the compiler do its
> +		 *      job.
> +		 *    - Use the proper memory barrier (rte_rmb()) to ensure the
> +		 *      memory ordering below.

Ok, so you wanted to put rte_rmb(), straight after:
staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
correct?
I agree that for machines with relaxed memory model (PPC) we do need it here.
So why not just put it there, instead of complaining about it in the comments? ;)

About rxdp being a pointer to volatile, why does it bother you that much?
You copy the whole RXD to the local variable anyway, and then reference it only to setup new addresses.

> +		 */
> +		rxdp = &rx_ring[rx_id];
> +		staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> +
> +		if (!(staterr & IXGBE_RXDADV_STAT_DD))
> +			break;
> +
> +		rxd = *rxdp;
> +
> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> +				  "staterr=0x%x data_len=%u",
> +			   rxq->port_id, rxq->queue_id, rx_id, staterr,
> +			   rte_le_to_cpu_16(rxd.wb.upper.length));
> +
> +		if (!bulk_alloc) {
> +			nmb = rte_rxmbuf_alloc(rxq->mb_pool);
> +			if (nmb == NULL) {
> +				PMD_RX_LOG(DEBUG, "RX mbuf alloc failed "
> +						  "port_id=%u queue_id=%u",
> +					   rxq->port_id, rxq->queue_id);
> +
> +				rte_eth_devices[rxq->port_id].data->
> +							rx_mbuf_alloc_failed++;
> +				break;
> +			}
> +		} else if (nb_hold > rxq->rx_free_thresh) {
> +			uint16_t next_rdt = rxq->rx_free_trigger;
> +
> +			if (!ixgbe_rx_alloc_bufs(rxq, false)) {
> +				rte_wmb();
> +				IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr,
> +						    next_rdt);
> +				nb_hold -= rxq->rx_free_thresh;
> +			} else {
> +				PMD_RX_LOG(DEBUG, "RX bulk alloc failed "
> +						  "port_id=%u queue_id=%u",
> +					   rxq->port_id, rxq->queue_id);
> +
> +				rte_eth_devices[rxq->port_id].data->
> +							rx_mbuf_alloc_failed++;
> +				break;
> +			}
> +		}
> +
> +		nb_hold++;
> +		rxe = &sw_ring[rx_id];
> +		eop = staterr & IXGBE_RXDADV_STAT_EOP;
> +
> +		next_id = rx_id + 1;
> +		if (next_id == rxq->nb_rx_desc)
> +			next_id = 0;
> +
> +		/* Prefetch next mbuf while processing current one. */
> +		rte_ixgbe_prefetch(sw_ring[next_id].mbuf);
> +
> +		/*
> +		 * When next RX descriptor is on a cache-line boundary,
> +		 * prefetch the next 4 RX descriptors and the next 4 pointers
> +		 * to mbufs.
> +		 */
> +		if ((next_id & 0x3) == 0) {
> +			rte_ixgbe_prefetch(&rx_ring[next_id]);
> +			rte_ixgbe_prefetch(&sw_ring[next_id]);
> +		}
> +
> +		rxm = rxe->mbuf;
> +
> +		if (!bulk_alloc) {
> +			__le64 dma =
> +			  rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(nmb));
> +			/*
> +			 * Update RX descriptor with the physical address of the
> +			 * new data buffer of the newly allocated mbuf.
> +			 */
> +			rxe->mbuf = nmb;
> +
> +			rxm->data_off = RTE_PKTMBUF_HEADROOM;
> +			rxdp->read.hdr_addr = dma;
> +			rxdp->read.pkt_addr = dma;
> +		}
> +		/*
> +		 * Set data length & data buffer address of mbuf.
> +		 */
> +		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
> +		rxm->data_len = data_len;
> +
> +		if (!eop) {
> +			uint16_t nextp_id;
> +			/*
> +			 * Get next descriptor index:
> +			 *  - For RSC it's in the NEXTP field.
> +			 *  - For a scattered packet - it's just a following
> +			 *    descriptor.
> +			 */
> +			if (ixgbe_rsc_count(&rxd))
> +				nextp_id =
> +					(staterr & IXGBE_RXDADV_NEXTP_MASK) >>
> +						       IXGBE_RXDADV_NEXTP_SHIFT;
> +			else
> +				nextp_id = next_id;
> +
> +			next_rsc_entry = &sw_rsc_ring[nextp_id];
> +			next_rxe = &sw_ring[nextp_id];
> +			rte_ixgbe_prefetch(next_rxe);
> +		}
> +
> +		rsc_entry = &sw_rsc_ring[rx_id];
> +		first_seg = rsc_entry->fbuf;
> +		rsc_entry->fbuf = NULL;
> +
> +		/*
> +		 * If this is the first buffer of the received packet,
> +		 * set the pointer to the first mbuf of the packet and
> +		 * initialize its context.
> +		 * Otherwise, update the total length and the number of segments
> +		 * of the current scattered packet, and update the pointer to
> +		 * the last mbuf of the current packet.
> +		 */
> +		if (first_seg == NULL) {
> +			first_seg = rxm;
> +			first_seg->pkt_len = data_len;
> +			first_seg->nb_segs = 1;
> +		} else {
> +			first_seg->pkt_len += data_len;
> +			first_seg->nb_segs++;
> +		}
> +
> +		prev_id = rx_id;
> +		rx_id = next_id;
> +
> +		/*
> +		 * If this is not the last buffer of the received packet, update
> +		 * the pointer to the first mbuf at the NEXTP entry in the
> +		 * sw_rsc_ring and continue to parse the RX ring.
> +		 */
> +		if (!eop) {
> +			rxm->next = next_rxe->mbuf;
> +			next_rsc_entry->fbuf = first_seg;
> +			goto next_desc;

So _recv_pkts_lro() can return with one of rxq->rsc_entry[i] != NULL, correct?
If so, then I think you need to add code at ixgbe_rx_queue_release_mbufs() that would go through
all rsc_entry[] entries, find those whose fbuf is != NULL, call rte_pktmbuf_free() for them and reset them to NULL.
 To handle the case:
recv_pkts_lro(rxq, ...);
rte_eth_dev_stop();
rte_eth_dev_start();
recv_pkts_lro(rxq, ...);
BTW, that also means that you can't do: 
rxm->next = next_rxe->mbuf;
above, and
rxm->next = NULL;    
should be done before 'goto next_desc;' too

> +		}
> +
> +		/*
> +		 * This is the last buffer of the received packet - return
> +		 * the current cluster to the user.
> +		 */
> +		rxm->next = NULL;
> +
> +		/* Initialize the first mbuf of the returned packet */
> +		ixgbe_fill_cluster_head_buf(first_seg, &rxd, rxq->port_id,
> +					    staterr);
> +
> +		/* Prefetch data of first segment, if configured to do so. */
> +		rte_packet_prefetch((char *)first_seg->buf_addr +
> +			first_seg->data_off);
> +
> +		/*
> +		 * Store the mbuf address into the next entry of the array
> +		 * of returned packets.
> +		 */
> +		rx_pkts[nb_rx++] = first_seg;
> +	}
> +
> +	/*
> +	 * Record index of the next RX descriptor to probe.
> +	 */
> +	rxq->rx_tail = rx_id;
> +
> +	/*
> +	 * If the number of free RX descriptors is greater than the RX free
> +	 * threshold of the queue, advance the Receive Descriptor Tail (RDT)
> +	 * register.
> +	 * Update the RDT with the value of the last processed RX descriptor
> +	 * minus 1, to guarantee that the RDT register is never equal to the
> +	 * RDH register, which creates a "full" ring situtation from the
> +	 * RDH register, which creates a "full" ring situation from the
> +	 */
> +	if (!bulk_alloc && nb_hold > rxq->rx_free_thresh) {
> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_tail=%u "
> +			   "nb_hold=%u nb_rx=%u",
> +			   rxq->port_id, rxq->queue_id, rx_id, nb_hold, nb_rx);
> +

I suppose if you do wmb() after rte_rxmbuf_alloc(), you'd better do it here too.

> +		IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, prev_id);
> +		nb_hold = 0;
> +	}
> +
> +	rxq->nb_rx_hold = nb_hold;
> +	return nb_rx;
> +}
> +
> +uint16_t
> +ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
> +{
> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, false);
> +}
> +
> +uint16_t
> +ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
> +			       uint16_t nb_pkts)
> +{
> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, true);
> +}
> +
>  uint16_t
>  ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>  			  uint16_t nb_pkts)
> @@ -2024,6 +2318,7 @@ ixgbe_rx_queue_release(struct igb_rx_queue *rxq)
>  	if (rxq != NULL) {
>  		ixgbe_rx_queue_release_mbufs(rxq);
>  		rte_free(rxq->sw_ring);
> +		rte_free(rxq->sw_rsc_ring);
>  		rte_free(rxq);
>  	}
>  }
> @@ -2146,6 +2441,7 @@ ixgbe_reset_rx_queue(struct ixgbe_hw *hw, struct igb_rx_queue *rxq)
>  	rxq->nb_rx_hold = 0;
>  	rxq->pkt_first_seg = NULL;
>  	rxq->pkt_last_seg = NULL;
> +	rxq->rsc_en = 0;
>  }
> 
>  int
> @@ -2160,6 +2456,14 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
>  	struct igb_rx_queue *rxq;
>  	struct ixgbe_hw     *hw;
>  	uint16_t len;
> +	struct rte_eth_dev_info dev_info = { 0 };
> +	struct rte_eth_rxmode *dev_rx_mode = &dev->data->dev_conf.rxmode;
> +	bool rsc_requested = false;
> +
> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
> +	if ((dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO) &&
> +	    dev_rx_mode->enable_lro)
> +		rsc_requested = true;
> 
>  	PMD_INIT_FUNC_TRACE();
>  	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> @@ -2265,12 +2569,28 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
>  	rxq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
>  					  sizeof(struct igb_rx_entry) * len,
>  					  RTE_CACHE_LINE_SIZE, socket_id);
> -	if (rxq->sw_ring == NULL) {
> +	if (!rxq->sw_ring) {

Wonder what was wrong with that one? :)

>  		ixgbe_rx_queue_release(rxq);
>  		return (-ENOMEM);
>  	}
> -	PMD_INIT_LOG(DEBUG, "sw_ring=%p hw_ring=%p dma_addr=0x%"PRIx64,
> -		     rxq->sw_ring, rxq->rx_ring, rxq->rx_ring_phys_addr);
> +
> +	if (rsc_requested) {
> +		rxq->sw_rsc_ring =
> +			rte_zmalloc_socket("rxq->sw_rsc_ring",
> +					   sizeof(struct igb_rsc_entry) * len,
> +					   RTE_CACHE_LINE_SIZE, socket_id);
> +		if (!rxq->sw_rsc_ring) {
> +			ixgbe_rx_queue_release(rxq);
> +			return (-ENOMEM);
> +		}
> +	} else {
> +		rxq->sw_rsc_ring = NULL;
> +	}
> +
> +	PMD_INIT_LOG(DEBUG, "sw_ring=%p sw_rsc_ring=%p hw_ring=%p "
> +			    "dma_addr=0x%"PRIx64,
> +		     rxq->sw_ring, rxq->sw_rsc_ring, rxq->rx_ring,
> +		     rxq->rx_ring_phys_addr);
> 
>  	if (!rte_is_power_of_2(nb_desc)) {
>  		PMD_INIT_LOG(DEBUG, "queue[%d] doesn't meet Vector Rx "
> @@ -3515,6 +3835,84 @@ ixgbe_dev_mq_tx_configure(struct rte_eth_dev *dev)
>  	return 0;
>  }
> 
> +/**
> + * get_rscctl_maxdesc - Calculate the RSCCTL[n].MAXDESC for PF
> + *
> + * Return the RSCCTL[n].MAXDESC for 82599 and x540 PF devices according to the
> + * spec rev. 3.0 chapter 8.2.3.8.13.
> + *
> + * @pool Memory pool of the Rx queue
> + */
> +static inline uint32_t get_rscctl_maxdesc(struct rte_mempool *pool)
> +{
> +	struct rte_pktmbuf_pool_private *mp_priv = rte_mempool_get_priv(pool);
> +
> +	/* MAXDESC * SRRCTL.BSIZEPKT must not exceed 64 KB minus one */
> +	uint16_t maxdesc =
> +		65535 / (mp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM);

A nit: use some macro (UINT16_MAX?) instead of a hardcoded constant if possible.

> +
> +	if (maxdesc >= 16)
> +		return IXGBE_RSCCTL_MAXDESC_16;
> +	else if (maxdesc >= 8)
> +		return IXGBE_RSCCTL_MAXDESC_8;
> +	else if (maxdesc >= 4)
> +		return IXGBE_RSCCTL_MAXDESC_4;
> +	else
> +		return IXGBE_RSCCTL_MAXDESC_1;
> +}
> +
> +/* (Taken from FreeBSD tree)
> +** Setup the correct IVAR register for a particular MSIX interrupt
> +**   (yes this is all very magic and confusing :)
> +**  - entry is the register array entry
> +**  - vector is the MSIX vector for this queue
> +**  - type is RX/TX/MISC
> +*/
> +static void
> +ixgbe_set_ivar(struct rte_eth_dev *dev, u8 entry, u8 vector, s8 type)
> +{
> +	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> +	u32 ivar, index;
> +
> +	vector |= IXGBE_IVAR_ALLOC_VAL;
> +
> +	switch (hw->mac.type) {
> +
> +	case ixgbe_mac_82598EB:
> +		if (type == -1)
> +			entry = IXGBE_IVAR_OTHER_CAUSES_INDEX;
> +		else
> +			entry += (type * 64);
> +		index = (entry >> 2) & 0x1F;
> +		ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(index));
> +		ivar &= ~(0xFF << (8 * (entry & 0x3)));
> +		ivar |= (vector << (8 * (entry & 0x3)));
> +		IXGBE_WRITE_REG(hw, IXGBE_IVAR(index), ivar);
> +		break;
> +
> +	case ixgbe_mac_82599EB:
> +	case ixgbe_mac_X540:
> +		if (type == -1) { /* MISC IVAR */
> +			index = (entry & 1) * 8;
> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR_MISC);
> +			ivar &= ~(0xFF << index);
> +			ivar |= (vector << index);
> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR_MISC, ivar);
> +		} else {	/* RX/TX IVARS */
> +			index = (16 * (entry & 1)) + (8 * type);
> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(entry >> 1));
> +			ivar &= ~(0xFF << index);
> +			ivar |= (vector << index);
> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR(entry >> 1), ivar);
> +		}
> +
> +		break;
> +
> +	default:
> +		break;
> +	}
> +}
> +
>  void set_rx_function(struct rte_eth_dev *dev)
>  {
>  	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> @@ -3565,6 +3963,25 @@ void set_rx_function(struct rte_eth_dev *dev)
>  			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
>  		}
>  	}
> +
> +	/*
> +	 * Initialize the appropriate LRO callback.
> +	 *
> +	 * If all queues satisfy the bulk allocation preconditions
> +	 * (hw->rx_bulk_alloc_allowed is TRUE) then we may use bulk allocation.
> +	 * Otherwise use a single allocation version.
> +	 */
> +	if (dev->data->lro) {
> +		if (hw->rx_bulk_alloc_allowed) {
> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a bulk "
> +					   "allocation version");
> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro_bulk_alloc;
> +		} else {
> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a single "
> +					   "allocation version");
> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro;
> +		}
> +	}
>  }

As I understand, ixgbe_recv_pkts_lro() can handle both LRO and normal scattered packets?
If so, then can we remove ixgbe_recv_scattered_pkts() altogether and use ixgbe_recv_pkts_lro() for both cases?

> 
>  /*
> @@ -3583,10 +4000,26 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  	uint32_t maxfrs;
>  	uint32_t srrctl;
>  	uint32_t rdrxctl;
> +	uint32_t rscctl;
> +	uint32_t psrtype;
> +	uint32_t rfctl;
>  	uint32_t rxcsum;
>  	uint16_t buf_size;
>  	uint16_t i;
>  	struct rte_eth_rxmode *rx_conf = &dev->data->dev_conf.rxmode;
> +	struct rte_eth_dev_info dev_info = { 0 };
> +	bool rsc_capable = false;
> +
> +	/* Sanity check */
> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
> +	if (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO)
> +		rsc_capable = true;

@ 7.11.1 82599 spec says:
" Note that in SR-IOV mode the RSC must be disabled globally by setting the RFCTL.RSC_DIS bit."
Add a check?

> +
> +	if (!rsc_capable && rx_conf->enable_lro) {
> +		PMD_INIT_LOG(CRIT, "LRO is requested on HW that doesn't "
> +				   "support it");
> +		return -EINVAL;
> +	}
> 
>  	PMD_INIT_FUNC_TRACE();
>  	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> @@ -3606,13 +4039,44 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
> 
>  	/*
> +	 * RFCTL configuration
> +	 *
> +	 * Since NFS packets coalescing is not supported - clear RFCTL.NFSW_DIS
> +	 * and RFCTL.NFSR_DIS when RSC is enabled.
> +	 */
> +	if (rsc_capable) {
> +		rfctl = IXGBE_READ_REG(hw, IXGBE_RFCTL);
> +		if (rx_conf->enable_lro) {
> +			rfctl &= ~(IXGBE_RFCTL_RSC_DIS | IXGBE_RFCTL_NFSW_DIS |
> +				   IXGBE_RFCTL_NFSR_DIS);
> +		} else {
> +			rfctl |= IXGBE_RFCTL_RSC_DIS;
> +		}
> +
> +		IXGBE_WRITE_REG(hw, IXGBE_RFCTL, rfctl);
> +	}
> +
> +
> +	/*
>  	 * Configure CRC stripping, if any.
>  	 */
>  	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
>  	if (rx_conf->hw_strip_crc)
>  		hlreg0 |= IXGBE_HLREG0_RXCRCSTRP;
> -	else
> +	else {
>  		hlreg0 &= ~IXGBE_HLREG0_RXCRCSTRP;
> +		if (rx_conf->enable_lro) {
> +			/*
> +			 * According to chapter 4.6.7.2.1 of the Spec Rev.
> +			 * 3.0 RSC configuration requires HW CRC stripping being
> +			 * enabled. If user requested both HW CRC stripping off
> +			 * and RSC on - return an error.
> +			 */
> +			PMD_INIT_LOG(CRIT, "LRO can't be enabled when HW CRC "
> +					    "is disabled");
> +			return -EINVAL;
> +		}
> +	}
> 
>  	/*
>  	 * Configure jumbo frame support, if any.
> @@ -3664,9 +4128,18 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  		 * Configure Header Split
>  		 */
>  		if (rx_conf->header_split) {
> +			/*
> +			 * Print a warning if split_hdr_size is less
> +			 * than 128 bytes when RSC is requested.
> +			 */
> +			if (rx_conf->enable_lro &&
> +			    rx_conf->split_hdr_size < 128)
> +				PMD_INIT_LOG(INFO, "split_hdr_size less than "
> +						   "128 bytes (%d)!",
> +					     rx_conf->split_hdr_size);
> +
>  			if (hw->mac.type == ixgbe_mac_82599EB) {
>  				/* Must setup the PSRTYPE register */
> -				uint32_t psrtype;
>  				psrtype = IXGBE_PSRTYPE_TCPHDR |
>  					IXGBE_PSRTYPE_UDPHDR   |
>  					IXGBE_PSRTYPE_IPV4HDR  |
> @@ -3679,7 +4152,20 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  			srrctl |= IXGBE_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
>  		} else
>  #endif
> +		{
>  			srrctl = IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF;
> +			/*
> +			 * Following the 4.6.7.2.1 chapter of the 82599/x540
> +			 * Spec if RSC is enabled the SRRCTL[n].BSIZEHEADER
> +			 * should be configured even if header split is not
> +			 * enabled. In the latter case we will configure it 128
> +			 * bytes following the recommendation in the spec.
> +			 */
> +			if (rx_conf->enable_lro)
> +				srrctl |=
> +				     ((128 << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &
> +						    IXGBE_SRRCTL_BSIZEHDR_MASK);
> +		}
> 
>  		/* Set if packets are dropped when no descriptors available */
>  		if (rxq->drop_en)
> @@ -3696,6 +4182,13 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  				       RTE_PKTMBUF_HEADROOM);
>  		srrctl |= ((buf_size >> IXGBE_SRRCTL_BSIZEPKT_SHIFT) &
>  			   IXGBE_SRRCTL_BSIZEPKT_MASK);
> +
> +		/*
> +		 * TODO: Consider setting the Receive Descriptor Minimum
> +		 * Threshold Size for an RSC case. This is not an obviously
> +		 * beneficial option but one worth considering...
> +		 */
> +
>  		IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(rxq->reg_idx), srrctl);
> 
>  		buf_size = (uint16_t) ((srrctl & IXGBE_SRRCTL_BSIZEPKT_MASK) <<
> @@ -3705,11 +4198,57 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  		if (dev->data->dev_conf.rxmode.max_rx_pkt_len +
>  					    2 * IXGBE_VLAN_TAG_SIZE > buf_size)
>  			dev->data->scattered_rx = 1;
> +
> +		/* RSC per-queue configuration */
> +		if (rx_conf->enable_lro) {
> +			uint32_t eitr;
> +
> +			rscctl =
> +				IXGBE_READ_REG(hw, IXGBE_RSCCTL(rxq->reg_idx));
> +			psrtype =
> +				IXGBE_READ_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx));
> +			eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
> +
> +			rscctl |= IXGBE_RSCCTL_RSCEN;
> +			rscctl |= get_rscctl_maxdesc(rxq->mb_pool);
> +			psrtype |= IXGBE_PSRTYPE_TCPHDR;
> +
> +			/*
> +			 * RSC: Set ITR interval corresponding to 2K ints/s.
> +			 *
> +			 * Full-sized RSC aggregations for a 10Gb/s link will
> +			 * arrive at about 20K aggregation/s rate.
> +			 *
> +			 * 2K ints/s rate will make only 10% of the
> +			 * aggregations to be closed due to the interrupt timer
> +			 * expiration for a streaming at wire-speed case.
> +			 *
> +			 * For a sparse streaming case this setting will yield
> +			 * at most 500us latency for a single RSC aggregation.
> +			 */
> +			eitr   |= (2000 | IXGBE_EITR_CNT_WDIS);

Again, probably create some macro for the default ITR Interval value here.

> +
> +			IXGBE_WRITE_REG(hw, IXGBE_RSCCTL(rxq->reg_idx), rscctl);
> +			IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx),
> +								       psrtype);
> +			IXGBE_WRITE_REG(hw, IXGBE_EITR(rxq->reg_idx), eitr);
> +
> +			/*
> +			 * RSC requires the mapping of the queue to the
> +			 * interrupt vector.
> +			 */
> +			ixgbe_set_ivar(dev, rxq->reg_idx, i, 0);

Hm, I wonder why we need to set up IVAR for RSC?
Wouldn't just setting EITR be enough?

> +
> +			rxq->rsc_en = 1;
> +		}
>  	}
> 
>  	if (rx_conf->enable_scatter)
>  		dev->data->scattered_rx = 1;
> 
> +	if (rx_conf->enable_lro)
> +		dev->data->lro = 1;
> +
>  	set_rx_function(dev);
> 
>  	/*
> @@ -3742,6 +4281,19 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>  		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
>  	}
> 
> +	/* Finalize RSC configuration  */
> +	if (rx_conf->enable_lro) {
> +		/*
> +		 * Follow the instructions in the 4.6.7.2.1 of the Spec Rev. 3.0
> +		 */
> +		rdrxctl = IXGBE_READ_REG(hw, IXGBE_RDRXCTL);
> +		rdrxctl |= IXGBE_RDRXCTL_RSCACKC;
> +		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
> +
> +		PMD_INIT_LOG(INFO, "enabling LRO mode");
> +	}
> +
> +
>  	return 0;
>  }
> 
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> index bbe5ff3..389173f 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> @@ -79,6 +79,10 @@ struct igb_rx_entry {
>  	struct rte_mbuf *mbuf; /**< mbuf associated with RX descriptor. */
>  };
> 
> +struct igb_rsc_entry {
> +	struct rte_mbuf *fbuf; /**< First segment of the fragmented packet. */
> +};
> +
>  /**
>   * Structure associated with each descriptor of the TX ring of a TX queue.
>   */
> @@ -105,6 +109,7 @@ struct igb_rx_queue {
>  	volatile uint32_t   *rdt_reg_addr; /**< RDT register address. */
>  	volatile uint32_t   *rdh_reg_addr; /**< RDH register address. */
>  	struct igb_rx_entry *sw_ring; /**< address of RX software ring. */
> +	struct igb_rsc_entry *sw_rsc_ring; /**< address of RSC software ring. */
>  	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet. */
>  	struct rte_mbuf *pkt_last_seg; /**< Last segment of current packet. */
>  	uint64_t            mbuf_initializer; /**< value to init mbufs */
> @@ -126,6 +131,7 @@ struct igb_rx_queue {
>  	uint8_t             port_id;  /**< Device port identifier. */
>  	uint8_t             crc_len;  /**< 0 if CRC stripped, 4 otherwise. */
>  	uint8_t             drop_en;  /**< If not 0, set SRRCTL.Drop_En. */
> +	uint8_t             rsc_en;   /**< If not 0, RSC is enabled. */
>  	uint8_t             rx_deferred_start; /**< not in global dev start. */
>  #ifdef RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC
>  	/** need to alloc dummy mbuf, for wraparound when scanning hw ring */
> --
> 2.1.0

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
  2015-03-10  0:30   ` Ananyev, Konstantin
@ 2015-03-10 13:22     ` Vlad Zolotarov
  2015-03-10 20:09       ` Ananyev, Konstantin
  2015-03-10 17:51     ` Vlad Zolotarov
  1 sibling, 1 reply; 18+ messages in thread
From: Vlad Zolotarov @ 2015-03-10 13:22 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev



On 03/10/15 02:30, Ananyev, Konstantin wrote:
> Hi Vlad,
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vlad Zolotarov
>> Sent: Monday, March 09, 2015 7:07 PM
>> To: dev@dpdk.org
>> Subject: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
>>
>>      - Only x540 and 82599 devices support LRO.
>>      - Add the appropriate HW configuration.
>>      - Add RSC aware rx_pkt_burst() handlers:
>>         - Implemented bulk allocation and non-bulk allocation versions.
>>         - Add LRO-specific fields to rte_eth_rxmode, to rte_eth_dev_data
>>           and to igb_rx_queue.
>>         - Use the appropriate handler when LRO is requested.
>>
>> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
>> ---
>> New in v5:
>>     - Put the RTE_ETHDEV_HAS_LRO_SUPPORT definition at the beginning of rte_ethdev.h.
>>     - Removed the "TODO: Remove me" comment near RTE_ETHDEV_HAS_LRO_SUPPORT.
>>
>> New in v4:
>>     - Define RTE_ETHDEV_HAS_LRO_SUPPORT in rte_ethdev.h instead of
>>       RTE_ETHDEV_LRO_SUPPORT defined in config/common_linuxapp.
>>
>> New in v2:
>>     - Removed rte_eth_dev_data.lro_bulk_alloc.
>>     - Fixed a few styling and spelling issues.
>> ---
>>   lib/librte_ether/rte_ethdev.h       |   9 +-
>>   lib/librte_pmd_ixgbe/ixgbe_ethdev.c |   6 +
>>   lib/librte_pmd_ixgbe/ixgbe_ethdev.h |   5 +
>>   lib/librte_pmd_ixgbe/ixgbe_rxtx.c   | 562 +++++++++++++++++++++++++++++++++++-
>>   lib/librte_pmd_ixgbe/ixgbe_rxtx.h   |   6 +
>>   5 files changed, 581 insertions(+), 7 deletions(-)
>>
>> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
>> index 8db3127..44f081f 100644
>> --- a/lib/librte_ether/rte_ethdev.h
>> +++ b/lib/librte_ether/rte_ethdev.h
>> @@ -172,6 +172,9 @@ extern "C" {
>>
>>   #include <stdint.h>
>>
>> +/* Use this macro to check if LRO API is supported */
>> +#define RTE_ETHDEV_HAS_LRO_SUPPORT
>> +
>>   #include <rte_log.h>
>>   #include <rte_interrupts.h>
>>   #include <rte_pci.h>
>> @@ -320,14 +323,15 @@ struct rte_eth_rxmode {
>>   	enum rte_eth_rx_mq_mode mq_mode;
>>   	uint32_t max_rx_pkt_len;  /**< Only used if jumbo_frame enabled. */
>>   	uint16_t split_hdr_size;  /**< hdr buf size (header_split enabled).*/
>> -	uint8_t header_split : 1, /**< Header Split enable. */
>> +	uint16_t header_split : 1, /**< Header Split enable. */
>>   		hw_ip_checksum   : 1, /**< IP/UDP/TCP checksum offload enable. */
>>   		hw_vlan_filter   : 1, /**< VLAN filter enable. */
>>   		hw_vlan_strip    : 1, /**< VLAN strip enable. */
>>   		hw_vlan_extend   : 1, /**< Extended VLAN enable. */
>>   		jumbo_frame      : 1, /**< Jumbo Frame Receipt enable. */
>>   		hw_strip_crc     : 1, /**< Enable CRC stripping by hardware. */
>> -		enable_scatter   : 1; /**< Enable scatter packets rx handler */
>> +		enable_scatter   : 1, /**< Enable scatter packets rx handler */
>> +		enable_lro       : 1; /**< Enable LRO */
>>   };
>>
>>   /**
>> @@ -1515,6 +1519,7 @@ struct rte_eth_dev_data {
>>   	uint8_t port_id;           /**< Device [external] port identifier. */
>>   	uint8_t promiscuous   : 1, /**< RX promiscuous mode ON(1) / OFF(0). */
>>   		scattered_rx : 1,  /**< RX of scattered packets is ON(1) / OFF(0) */
>> +		lro          : 1,  /**< RX LRO is ON(1) / OFF(0) */
>>   		all_multicast : 1, /**< RX all multicast mode ON(1) / OFF(0). */
>>   		dev_started : 1;   /**< Device state: STARTED(1) / STOPPED(0). */
>>   };
>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>> index 9d3de1a..765174d 100644
>> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>> @@ -1648,6 +1648,7 @@ ixgbe_dev_stop(struct rte_eth_dev *dev)
>>
>>   	/* Clear stored conf */
>>   	dev->data->scattered_rx = 0;
>> +	dev->data->lro = 0;
>>   	hw->rx_bulk_alloc_allowed = false;
>>   	hw->rx_vec_allowed = false;
>>
>> @@ -2018,6 +2019,11 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
>>   		DEV_RX_OFFLOAD_IPV4_CKSUM |
>>   		DEV_RX_OFFLOAD_UDP_CKSUM  |
>>   		DEV_RX_OFFLOAD_TCP_CKSUM;
>> +
>> +	if (hw->mac.type == ixgbe_mac_82599EB ||
>> +	    hw->mac.type == ixgbe_mac_X540)
>> +		dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_TCP_LRO;
>> +
>>   	dev_info->tx_offload_capa =
>>   		DEV_TX_OFFLOAD_VLAN_INSERT |
>>   		DEV_TX_OFFLOAD_IPV4_CKSUM  |
>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>> index a549f5c..e206584 100644
>> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>> @@ -349,6 +349,11 @@ uint16_t ixgbe_recv_pkts_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
>>   uint16_t ixgbe_recv_scattered_pkts(void *rx_queue,
>>   		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>>
>> +uint16_t ixgbe_recv_pkts_lro(void *rx_queue,
>> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>> +uint16_t ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue,
>> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>> +
>>   uint16_t ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>>   		uint16_t nb_pkts);
>>
>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>> index 58e619b..944c662 100644
>> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>> @@ -1366,6 +1366,15 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>>   }
>>
>>   /**
>> + * Detect an RSC descriptor.
>> + */
>> +static inline uint32_t ixgbe_rsc_count(union ixgbe_adv_rx_desc *rx)
>> +{
>> +	return (rte_le_to_cpu_32(rx->wb.lower.lo_dword.data) &
>> +		IXGBE_RXDADV_RSCCNT_MASK) >> IXGBE_RXDADV_RSCCNT_SHIFT;
>> +}
>> +
>> +/**
>>    * Initialize the first mbuf of the returned packet:
>>    *    - RX port identifier,
>>    *    - hardware offload data, if any:
>> @@ -1410,6 +1419,291 @@ static inline void ixgbe_fill_cluster_head_buf(
>>   	}
>>   }
>>
>> +/**
>> + * Bulk receive handler for an LRO case.
>> + *
>> + * @rx_queue Rx queue handle
>> + * @rx_pkts table of received packets
>> + * @nb_pkts size of rx_pkts table
>> + * @bulk_alloc if TRUE bulk allocation is used for a HW ring refilling
>> + *
>> + * Handles the Rx HW ring completions when RSC feature is configured. Uses an
>> + * additional ring of igb_rsc_entry's that will hold the relevant RSC info.
>> + *
>> + * We use the same logic as in Linux and in FreeBSD ixgbe drivers:
>> + * 1) When non-EOP RSC completion arrives:
>> + *    a) Update the HEAD of the current RSC aggregation cluster with the new
>> + *       segment's data length.
>> + *    b) Set the "next" pointer of the current segment to point to the segment
>> + *       at the NEXTP index.
>> + *    c) Pass the HEAD of RSC aggregation cluster on to the next NEXTP entry
>> + *       in the sw_rsc_ring.
>> + * 2) When EOP arrives we just update the cluster's total length and offload
>> + *    flags and deliver the cluster up to the upper layers. In our case - put it
>> + *    in the rx_pkts table.
>> + *
>> + * Returns the number of received packets/clusters (according to the "bulk
>> + * receive" interface).
>> + */
>> +static inline uint16_t
>> +_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
>> +	       bool bulk_alloc)
>> +{
>> +	struct igb_rx_queue *rxq = rx_queue;
>> +	volatile union ixgbe_adv_rx_desc *rx_ring = rxq->rx_ring;
>> +	struct igb_rx_entry *sw_ring = rxq->sw_ring;
>> +	struct igb_rsc_entry *sw_rsc_ring = rxq->sw_rsc_ring;
>> +	uint16_t rx_id = rxq->rx_tail;
>> +	uint16_t nb_rx = 0;
>> +	uint16_t nb_hold = rxq->nb_rx_hold;
>> +	uint16_t prev_id = rxq->rx_tail;
>> +
>> +	while (nb_rx < nb_pkts) {
>> +		bool eop;
>> +		struct igb_rx_entry *rxe;
>> +		struct igb_rsc_entry *rsc_entry;
>> +		struct igb_rsc_entry *next_rsc_entry;
>> +		struct igb_rx_entry *next_rxe;
>> +		struct rte_mbuf *first_seg;
>> +		struct rte_mbuf *rxm;
>> +		struct rte_mbuf *nmb;
>> +		union ixgbe_adv_rx_desc rxd;
>> +		uint16_t data_len;
>> +		uint16_t next_id;
>> +		volatile union ixgbe_adv_rx_desc *rxdp;
>> +		uint32_t staterr;
>> +
>> +next_desc:
>> +		/*
>> +		 * The code in this whole file uses the volatile pointer to
>> +		 * ensure the read ordering of the status and the rest of the
>> +		 * descriptor fields (on the compiler level only!!!). This is so
>> +		 * UGLY - why not to just use the compiler barrier instead? DPDK
>> +		 * even has the rte_compiler_barrier() for that.
>> +		 *
>> +		 * But most importantly this is just wrong because this doesn't
>> +		 * ensure memory ordering in a general case at all. For
>> +		 * instance, DPDK is supposed to work on Power CPUs where
>> +		 * compiler barrier may just not be enough!
>> +		 *
>> +		 * I tried to write only this function properly to have a
>> +		 * starting point (as a part of an LRO/RSC series) but the
>> +		 * compiler cursed at me when I tried to cast away the
>> +		 * "volatile" from rx_ring (yes, it's volatile too!!!). So, I'm
>> +		 * keeping it the way it is for now.
>> +		 *
>> +		 * The code in this file is broken in so many other places and
>> +		 * will just not work on a big endian CPU anyway therefore the
>> +		 * lines below will have to be revisited together with the rest
>> +		 * of the ixgbe PMD.
>> +		 *
>> +		 * TODO:
>> +		 *    - Get rid of "volatile" crap and let the compiler do its
>> +		 *      job.
>> +		 *    - Use the proper memory barrier (rte_rmb()) to ensure the
>> +		 *      memory ordering below.
> Ok, so you wanted to put rte_rmb(), straight after:
> staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> correct?
> I agree that for machines with relaxed memory model (PPC) we do need it here.
> So why not just put it there, instead of complaining about it in the comments? ;)

Because it's not a proper fix and I don't like workarounds.

>
> About rxdp being a pointer to volatile, why does it bother you that much?

Because using "volatile" prevents the compiler from optimizing every
code piece in which the "volatile" variable participates, and that's a
shame.
Read this 
https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt 
for a more detailed explanation.

> You copy the whole RXD to the local variable anyway, and then reference it only to setup new addresses.

The fact that we have to copy the whole descriptor while we may not need 
all the data from it at the end is one problem.
The proper solution in Rx ring context should go as follows:

 1. Remove the "volatile" qualifier from rx_ring (HW Rx descriptors ring).
 2. Remove "volatile" at all places where rx_ring is accessed.
 3. Adjust the code in (2):
     1. Remove the descriptor copy you've mentioned and access the
        descriptor data directly.
     2. Ensure the proper ordering by using the proper memory barriers,
        which are missing in the DPDK SDK at the moment (see a small
        discussion about this with Stephen and Avi on "[dpdk-dev] 
        [PATCH v1 5/5] ixgbe: Add LRO support" thread).


As it sounds this is going to be a VERY sensitive patchset.
That's why it should go separately from this patchwork (or from any 
other patchwork).
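
Just to illustrate the direction (a rough sketch only, not part of this
series; it assumes rx_ring is no longer declared volatile and that
rte_rmb() from <rte_atomic.h> is the right barrier here):

		rxdp = &rx_ring[rx_id];
		staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);

		if (!(staterr & IXGBE_RXDADV_STAT_DD))
			break;

		/*
		 * Read the rest of the descriptor only after the DD bit
		 * has been observed as set.
		 */
		rte_rmb();

		rxd = *rxdp;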

>
>> +		 */
>> +		rxdp = &rx_ring[rx_id];
>> +		staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>> +
>> +		if (!(staterr & IXGBE_RXDADV_STAT_DD))
>> +			break;
>> +
>> +		rxd = *rxdp;
>> +
>> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
>> +				  "staterr=0x%x data_len=%u",
>> +			   rxq->port_id, rxq->queue_id, rx_id, staterr,
>> +			   rte_le_to_cpu_16(rxd.wb.upper.length));
>> +
>> +		if (!bulk_alloc) {
>> +			nmb = rte_rxmbuf_alloc(rxq->mb_pool);
>> +			if (nmb == NULL) {
>> +				PMD_RX_LOG(DEBUG, "RX mbuf alloc failed "
>> +						  "port_id=%u queue_id=%u",
>> +					   rxq->port_id, rxq->queue_id);
>> +
>> +				rte_eth_devices[rxq->port_id].data->
>> +							rx_mbuf_alloc_failed++;
>> +				break;
>> +			}
>> +		} else if (nb_hold > rxq->rx_free_thresh) {
>> +			uint16_t next_rdt = rxq->rx_free_trigger;
>> +
>> +			if (!ixgbe_rx_alloc_bufs(rxq, false)) {
>> +				rte_wmb();
>> +				IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr,
>> +						    next_rdt);
>> +				nb_hold -= rxq->rx_free_thresh;
>> +			} else {
>> +				PMD_RX_LOG(DEBUG, "RX bulk alloc failed "
>> +						  "port_id=%u queue_id=%u",
>> +					   rxq->port_id, rxq->queue_id);
>> +
>> +				rte_eth_devices[rxq->port_id].data->
>> +							rx_mbuf_alloc_failed++;
>> +				break;
>> +			}
>> +		}
>> +
>> +		nb_hold++;
>> +		rxe = &sw_ring[rx_id];
>> +		eop = staterr & IXGBE_RXDADV_STAT_EOP;
>> +
>> +		next_id = rx_id + 1;
>> +		if (next_id == rxq->nb_rx_desc)
>> +			next_id = 0;
>> +
>> +		/* Prefetch next mbuf while processing current one. */
>> +		rte_ixgbe_prefetch(sw_ring[next_id].mbuf);
>> +
>> +		/*
>> +		 * When next RX descriptor is on a cache-line boundary,
>> +		 * prefetch the next 4 RX descriptors and the next 4 pointers
>> +		 * to mbufs.
>> +		 */
>> +		if ((next_id & 0x3) == 0) {
>> +			rte_ixgbe_prefetch(&rx_ring[next_id]);
>> +			rte_ixgbe_prefetch(&sw_ring[next_id]);
>> +		}
>> +
>> +		rxm = rxe->mbuf;
>> +
>> +		if (!bulk_alloc) {
>> +			__le64 dma =
>> +			  rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(nmb));
>> +			/*
>> +			 * Update RX descriptor with the physical address of the
>> +			 * new data buffer of the newly allocated mbuf.
>> +			 */
>> +			rxe->mbuf = nmb;
>> +
>> +			rxm->data_off = RTE_PKTMBUF_HEADROOM;
>> +			rxdp->read.hdr_addr = dma;
>> +			rxdp->read.pkt_addr = dma;
>> +		}
>> +		/*
>> +		 * Set data length & data buffer address of mbuf.
>> +		 */
>> +		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
>> +		rxm->data_len = data_len;
>> +
>> +		if (!eop) {
>> +			uint16_t nextp_id;
>> +			/*
>> +			 * Get next descriptor index:
>> +			 *  - For RSC it's in the NEXTP field.
>> +			 *  - For a scattered packet - it's just a following
>> +			 *    descriptor.
>> +			 */
>> +			if (ixgbe_rsc_count(&rxd))
>> +				nextp_id =
>> +					(staterr & IXGBE_RXDADV_NEXTP_MASK) >>
>> +						       IXGBE_RXDADV_NEXTP_SHIFT;
>> +			else
>> +				nextp_id = next_id;
>> +
>> +			next_rsc_entry = &sw_rsc_ring[nextp_id];
>> +			next_rxe = &sw_ring[nextp_id];
>> +			rte_ixgbe_prefetch(next_rxe);
>> +		}
>> +
>> +		rsc_entry = &sw_rsc_ring[rx_id];
>> +		first_seg = rsc_entry->fbuf;
>> +		rsc_entry->fbuf = NULL;
>> +
>> +		/*
>> +		 * If this is the first buffer of the received packet,
>> +		 * set the pointer to the first mbuf of the packet and
>> +		 * initialize its context.
>> +		 * Otherwise, update the total length and the number of segments
>> +		 * of the current scattered packet, and update the pointer to
>> +		 * the last mbuf of the current packet.
>> +		 */
>> +		if (first_seg == NULL) {
>> +			first_seg = rxm;
>> +			first_seg->pkt_len = data_len;
>> +			first_seg->nb_segs = 1;
>> +		} else {
>> +			first_seg->pkt_len += data_len;
>> +			first_seg->nb_segs++;
>> +		}
>> +
>> +		prev_id = rx_id;
>> +		rx_id = next_id;
>> +
>> +		/*
>> +		 * If this is not the last buffer of the received packet, update
>> +		 * the pointer to the first mbuf at the NEXTP entry in the
>> +		 * sw_rsc_ring and continue to parse the RX ring.
>> +		 */
>> +		if (!eop) {
>> +			rxm->next = next_rxe->mbuf;
>> +			next_rsc_entry->fbuf = first_seg;
>> +			goto next_desc;
> So _recv_pkts_lro() can return with one of rxq->rsc_entry[i] != NULL, correct?
> If so, then I think you need to add code at ixgbe_rx_queue_release_mbufs() that would go through
> all rsc_entry[] entries, find those whose fbuf is != NULL, call rte_pktmbuf_free() for them and reset them to NULL.
>   To handle the case:
> recv_pkts_lro(rxq, ...);
> rte_eth_dev_stop();
> rte_eth_dev_start();
> recv_pkts_lro(rxq, ...);

Right. I've missed that part.

> BTW, that also means that you can't do:
> rxm->next = next_rxe->mbuf;
> above, and
> rxm->next = NULL;
> should be done before 'goto next_desc;' too

Your proposal would cost cycles in the fast path in order to save
cycles in the slow path: we'd have to add another pointer to
igb_rsc_entry to hold the last mbuf of the current cluster, and read
and update it for every newly completed RSC descriptor.

The easier way would be to just reset the next pointer of the last
mbuf in the RSC cluster to NULL (walking the chain according to
nb_segs) before calling rte_pktmbuf_free() in
ixgbe_rx_queue_release_mbufs().
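
Something along these lines in ixgbe_rx_queue_release_mbufs() (a rough,
untested sketch; the field names are the ones introduced in this series):

	if (rxq->sw_rsc_ring != NULL) {
		unsigned i;

		for (i = 0; i < rxq->nb_rx_desc; i++) {
			struct rte_mbuf *m = rxq->sw_rsc_ring[i].fbuf;
			uint16_t s, nsegs;

			if (m == NULL)
				continue;

			/*
			 * The last linked segment still points to an mbuf
			 * owned by sw_ring - detach it before freeing the
			 * chain.
			 */
			nsegs = m->nb_segs;
			for (s = 1; s < nsegs; s++)
				m = m->next;
			m->next = NULL;

			rte_pktmbuf_free(rxq->sw_rsc_ring[i].fbuf);
			rxq->sw_rsc_ring[i].fbuf = NULL;
		}
	}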

>
>> +		}
>> +
>> +		/*
>> +		 * This is the last buffer of the received packet - return
>> +		 * the current cluster to the user.
>> +		 */
>> +		rxm->next = NULL;
>> +
>> +		/* Initialize the first mbuf of the returned packet */
>> +		ixgbe_fill_cluster_head_buf(first_seg, &rxd, rxq->port_id,
>> +					    staterr);
>> +
>> +		/* Prefetch data of first segment, if configured to do so. */
>> +		rte_packet_prefetch((char *)first_seg->buf_addr +
>> +			first_seg->data_off);
>> +
>> +		/*
>> +		 * Store the mbuf address into the next entry of the array
>> +		 * of returned packets.
>> +		 */
>> +		rx_pkts[nb_rx++] = first_seg;
>> +	}
>> +
>> +	/*
>> +	 * Record index of the next RX descriptor to probe.
>> +	 */
>> +	rxq->rx_tail = rx_id;
>> +
>> +	/*
>> +	 * If the number of free RX descriptors is greater than the RX free
>> +	 * threshold of the queue, advance the Receive Descriptor Tail (RDT)
>> +	 * register.
>> +	 * Update the RDT with the value of the last processed RX descriptor
>> +	 * minus 1, to guarantee that the RDT register is never equal to the
>> +	 * RDH register, which creates a "full" ring situation from the
>> +	 * hardware point of view...
>> +	 */
>> +	if (!bulk_alloc && nb_hold > rxq->rx_free_thresh) {
>> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_tail=%u "
>> +			   "nb_hold=%u nb_rx=%u",
>> +			   rxq->port_id, rxq->queue_id, rx_id, nb_hold, nb_rx);
>> +
> I suppose if you do wmb() after rte_rxmbuf_alloc(), you'd better do it here too.

Right! I missed that when I copied this code from
ixgbe_recv_scattered_pkts()... ;) Note that the barrier is missing
there too...
These are examples of code that works on x86 only because of that
"volatile" thing and will break once it's removed. On PPC it is broken
even with "volatile".
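
I.e. the same two lines as in the bulk allocation path above (just a
sketch of the obvious fix):

		rte_wmb();
		IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, prev_id);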

>
>> +		IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, prev_id);
>> +		nb_hold = 0;
>> +	}
>> +
>> +	rxq->nb_rx_hold = nb_hold;
>> +	return nb_rx;
>> +}
>> +
>> +uint16_t
>> +ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
>> +{
>> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, false);
>> +}
>> +
>> +uint16_t
>> +ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
>> +			       uint16_t nb_pkts)
>> +{
>> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, true);
>> +}
>> +
>>   uint16_t
>>   ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>>   			  uint16_t nb_pkts)
>> @@ -2024,6 +2318,7 @@ ixgbe_rx_queue_release(struct igb_rx_queue *rxq)
>>   	if (rxq != NULL) {
>>   		ixgbe_rx_queue_release_mbufs(rxq);
>>   		rte_free(rxq->sw_ring);
>> +		rte_free(rxq->sw_rsc_ring);
>>   		rte_free(rxq);
>>   	}
>>   }
>> @@ -2146,6 +2441,7 @@ ixgbe_reset_rx_queue(struct ixgbe_hw *hw, struct igb_rx_queue *rxq)
>>   	rxq->nb_rx_hold = 0;
>>   	rxq->pkt_first_seg = NULL;
>>   	rxq->pkt_last_seg = NULL;
>> +	rxq->rsc_en = 0;
>>   }
>>
>>   int
>> @@ -2160,6 +2456,14 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
>>   	struct igb_rx_queue *rxq;
>>   	struct ixgbe_hw     *hw;
>>   	uint16_t len;
>> +	struct rte_eth_dev_info dev_info = { 0 };
>> +	struct rte_eth_rxmode *dev_rx_mode = &dev->data->dev_conf.rxmode;
>> +	bool rsc_requested = false;
>> +
>> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
>> +	if ((dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO) &&
>> +	    dev_rx_mode->enable_lro)
>> +		rsc_requested = true;
>>
>>   	PMD_INIT_FUNC_TRACE();
>>   	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>> @@ -2265,12 +2569,28 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
>>   	rxq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
>>   					  sizeof(struct igb_rx_entry) * len,
>>   					  RTE_CACHE_LINE_SIZE, socket_id);
>> -	if (rxq->sw_ring == NULL) {
>> +	if (!rxq->sw_ring) {
> Wonder what was wrong with that one? :)

Nothing - just aligned it with the lines I've added below. ;)

>
>>   		ixgbe_rx_queue_release(rxq);
>>   		return (-ENOMEM);
>>   	}
>> -	PMD_INIT_LOG(DEBUG, "sw_ring=%p hw_ring=%p dma_addr=0x%"PRIx64,
>> -		     rxq->sw_ring, rxq->rx_ring, rxq->rx_ring_phys_addr);
>> +
>> +	if (rsc_requested) {
>> +		rxq->sw_rsc_ring =
>> +			rte_zmalloc_socket("rxq->sw_rsc_ring",
>> +					   sizeof(struct igb_rsc_entry) * len,
>> +					   RTE_CACHE_LINE_SIZE, socket_id);
>> +		if (!rxq->sw_rsc_ring) {
>> +			ixgbe_rx_queue_release(rxq);
>> +			return (-ENOMEM);
>> +		}
>> +	} else {
>> +		rxq->sw_rsc_ring = NULL;
>> +	}
>> +
>> +	PMD_INIT_LOG(DEBUG, "sw_ring=%p sw_rsc_ring=%p hw_ring=%p "
>> +			    "dma_addr=0x%"PRIx64,
>> +		     rxq->sw_ring, rxq->sw_rsc_ring, rxq->rx_ring,
>> +		     rxq->rx_ring_phys_addr);
>>
>>   	if (!rte_is_power_of_2(nb_desc)) {
>>   		PMD_INIT_LOG(DEBUG, "queue[%d] doesn't meet Vector Rx "
>> @@ -3515,6 +3835,84 @@ ixgbe_dev_mq_tx_configure(struct rte_eth_dev *dev)
>>   	return 0;
>>   }
>>
>> +/**
>> + * get_rscctl_maxdesc - Calculate the RSCCTL[n].MAXDESC for PF
>> + *
>> + * Return the RSCCTL[n].MAXDESC for 82599 and x540 PF devices according to the
>> + * spec rev. 3.0 chapter 8.2.3.8.13.
>> + *
>> + * @pool Memory pool of the Rx queue
>> + */
>> +static inline uint32_t get_rscctl_maxdesc(struct rte_mempool *pool)
>> +{
>> +	struct rte_pktmbuf_pool_private *mp_priv = rte_mempool_get_priv(pool);
>> +
>> +	/* MAXDESC * SRRCTL.BSIZEPKT must not exceed 64 KB minus one */
>> +	uint16_t maxdesc =
>> +		65535 / (mp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM);
> A nit: use some macro (UINT16_MAX?) instead of a hardcoded constant if possible.

Using UINT16_MAX here would be very confusing. The value here, just
like the values below (16, 8, 4), is explicitly stated in the
RSCCTL[n].MAXDESC description in the spec, and this code piece
implements what the spec demands. Therefore, IMHO, using the explicit
values from the spec here is the most readable option for a reader who
will try to compare this code against the spec section mentioned above
and check that it is correct.


>
>> +
>> +	if (maxdesc >= 16)
>> +		return IXGBE_RSCCTL_MAXDESC_16;
>> +	else if (maxdesc >= 8)
>> +		return IXGBE_RSCCTL_MAXDESC_8;
>> +	else if (maxdesc >= 4)
>> +		return IXGBE_RSCCTL_MAXDESC_4;
>> +	else
>> +		return IXGBE_RSCCTL_MAXDESC_1;
>> +}
>> +
>> +/* (Taken from FreeBSD tree)
>> +** Setup the correct IVAR register for a particular MSIX interrupt
>> +**   (yes this is all very magic and confusing :)
>> +**  - entry is the register array entry
>> +**  - vector is the MSIX vector for this queue
>> +**  - type is RX/TX/MISC
>> +*/
>> +static void
>> +ixgbe_set_ivar(struct rte_eth_dev *dev, u8 entry, u8 vector, s8 type)
>> +{
>> +	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>> +	u32 ivar, index;
>> +
>> +	vector |= IXGBE_IVAR_ALLOC_VAL;
>> +
>> +	switch (hw->mac.type) {
>> +
>> +	case ixgbe_mac_82598EB:
>> +		if (type == -1)
>> +			entry = IXGBE_IVAR_OTHER_CAUSES_INDEX;
>> +		else
>> +			entry += (type * 64);
>> +		index = (entry >> 2) & 0x1F;
>> +		ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(index));
>> +		ivar &= ~(0xFF << (8 * (entry & 0x3)));
>> +		ivar |= (vector << (8 * (entry & 0x3)));
>> +		IXGBE_WRITE_REG(hw, IXGBE_IVAR(index), ivar);
>> +		break;
>> +
>> +	case ixgbe_mac_82599EB:
>> +	case ixgbe_mac_X540:
>> +		if (type == -1) { /* MISC IVAR */
>> +			index = (entry & 1) * 8;
>> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR_MISC);
>> +			ivar &= ~(0xFF << index);
>> +			ivar |= (vector << index);
>> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR_MISC, ivar);
>> +		} else {	/* RX/TX IVARS */
>> +			index = (16 * (entry & 1)) + (8 * type);
>> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(entry >> 1));
>> +			ivar &= ~(0xFF << index);
>> +			ivar |= (vector << index);
>> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR(entry >> 1), ivar);
>> +		}
>> +
>> +		break;
>> +
>> +	default:
>> +		break;
>> +	}
>> +}
>> +
>>   void set_rx_function(struct rte_eth_dev *dev)
>>   {
>>   	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>> @@ -3565,6 +3963,25 @@ void set_rx_function(struct rte_eth_dev *dev)
>>   			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
>>   		}
>>   	}
>> +
>> +	/*
>> +	 * Initialize the appropriate LRO callback.
>> +	 *
>> +	 * If all queues satisfy the bulk allocation preconditions
>> +	 * (hw->rx_bulk_alloc_allowed is TRUE) then we may use bulk allocation.
>> +	 * Otherwise use a single allocation version.
>> +	 */
>> +	if (dev->data->lro) {
>> +		if (hw->rx_bulk_alloc_allowed) {
>> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a bulk "
>> +					   "allocation version");
>> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro_bulk_alloc;
>> +		} else {
>> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a single "
>> +					   "allocation version");
>> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro;
>> +		}
>> +	}
>>   }
> As I understand, ixgbe_recv_pkts_lro() can handle both LRO and normal scattered packets?

Not as it is now. It may be easily patched to do so though.

> If so, then can we remove ixgbe_recv_scattered_pkts() altogether and use ixgbe_recv_pkts_lro() for both cases?

This was explicitly requested by Bruce Richardson (see the
"[dpdk-dev] : ixgbe: why bulk allocation is not used for a scattered Rx
flow?" thread) in order to separate the complicated handling from the
simple high performance one. The handling in the RSC routine is more
generic and thus a bit of an overkill for the simple scattered case:
e.g. there is no need for a sw_rsc_ring.
Therefore I preferred to advance in small steps here. If there is a
decision to join these flows later, it can be done with a rather small
patch in the future.

>>   /*
>> @@ -3583,10 +4000,26 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>   	uint32_t maxfrs;
>>   	uint32_t srrctl;
>>   	uint32_t rdrxctl;
>> +	uint32_t rscctl;
>> +	uint32_t psrtype;
>> +	uint32_t rfctl;
>>   	uint32_t rxcsum;
>>   	uint16_t buf_size;
>>   	uint16_t i;
>>   	struct rte_eth_rxmode *rx_conf = &dev->data->dev_conf.rxmode;
>> +	struct rte_eth_dev_info dev_info = { 0 };
>> +	bool rsc_capable = false;
>> +
>> +	/* Sanity check */
>> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
>> +	if (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO)
>> +		rsc_capable = true;
> @ 7.11.1 82599 spec says:
> " Note that in SR-IOV mode the RSC must be disabled globally by setting the RFCTL.RSC_DIS bit."
> Add a check?

Good catch! Will add a check. Thanks.
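
Probably something like this, right next to the rsc_capable check (a
sketch only; I'm assuming the RTE_ETH_DEV_SRIOV() macro from
rte_ethdev.h is the right way to detect SR-IOV mode here):

	/* 82599 spec 7.11.1: RSC must be disabled globally in SR-IOV mode */
	if (rx_conf->enable_lro && RTE_ETH_DEV_SRIOV(dev).active != 0) {
		PMD_INIT_LOG(CRIT, "LRO can't be enabled in SR-IOV mode");
		return -EINVAL;
	}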

>
>> +
>> +	if (!rsc_capable && rx_conf->enable_lro) {
>> +		PMD_INIT_LOG(CRIT, "LRO is requested on HW that doesn't "
>> +				   "support it");
>> +		return -EINVAL;
>> +	}
>>
>>   	PMD_INIT_FUNC_TRACE();
>>   	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>> @@ -3606,13 +4039,44 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>   	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
>>
>>   	/*
>> +	 * RFCTL configuration
>> +	 *
>> +	 * Since NFS packets coalescing is not supported - clear RFCTL.NFSW_DIS
>> +	 * and RFCTL.NFSR_DIS when RSC is enabled.
>> +	 */
>> +	if (rsc_capable) {
>> +		rfctl = IXGBE_READ_REG(hw, IXGBE_RFCTL);
>> +		if (rx_conf->enable_lro) {
>> +			rfctl &= ~(IXGBE_RFCTL_RSC_DIS | IXGBE_RFCTL_NFSW_DIS |
>> +				   IXGBE_RFCTL_NFSR_DIS);
>> +		} else {
>> +			rfctl |= IXGBE_RFCTL_RSC_DIS;
>> +		}
>> +
>> +		IXGBE_WRITE_REG(hw, IXGBE_RFCTL, rfctl);
>> +	}
>> +
>> +
>> +	/*
>>   	 * Configure CRC stripping, if any.
>>   	 */
>>   	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
>>   	if (rx_conf->hw_strip_crc)
>>   		hlreg0 |= IXGBE_HLREG0_RXCRCSTRP;
>> -	else
>> +	else {
>>   		hlreg0 &= ~IXGBE_HLREG0_RXCRCSTRP;
>> +		if (rx_conf->enable_lro) {
>> +			/*
>> +			 * According to chapter 4.6.7.2.1 of the Spec Rev.
>> +			 * 3.0 RSC configuration requires HW CRC stripping being
>> +			 * enabled. If user requested both HW CRC stripping off
>> +			 * and RSC on - return an error.
>> +			 */
>> +			PMD_INIT_LOG(CRIT, "LRO can't be enabled when HW CRC "
>> +					    "is disabled");
>> +			return -EINVAL;
>> +		}
>> +	}
>>
>>   	/*
>>   	 * Configure jumbo frame support, if any.
>> @@ -3664,9 +4128,18 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>   		 * Configure Header Split
>>   		 */
>>   		if (rx_conf->header_split) {
>> +			/*
>> +			 * Print a warning if split_hdr_size is less
>> +			 * than 128 bytes when RSC is requested.
>> +			 */
>> +			if (rx_conf->enable_lro &&
>> +			    rx_conf->split_hdr_size < 128)
>> +				PMD_INIT_LOG(INFO, "split_hdr_size less than "
>> +						   "128 bytes (%d)!",
>> +					     rx_conf->split_hdr_size);
>> +
>>   			if (hw->mac.type == ixgbe_mac_82599EB) {
>>   				/* Must setup the PSRTYPE register */
>> -				uint32_t psrtype;
>>   				psrtype = IXGBE_PSRTYPE_TCPHDR |
>>   					IXGBE_PSRTYPE_UDPHDR   |
>>   					IXGBE_PSRTYPE_IPV4HDR  |
>> @@ -3679,7 +4152,20 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>   			srrctl |= IXGBE_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
>>   		} else
>>   #endif
>> +		{
>>   			srrctl = IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF;
>> +			/*
>> +			 * Following the 4.6.7.2.1 chapter of the 82599/x540
>> +			 * Spec if RSC is enabled the SRRCTL[n].BSIZEHEADER
>> +			 * should be configured even if header split is not
>> +			 * enabled. In the latter case we will configure it 128
>> +			 * bytes following the recommendation in the spec.
>> +			 */
>> +			if (rx_conf->enable_lro)
>> +				srrctl |=
>> +				     ((128 << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &
>> +						    IXGBE_SRRCTL_BSIZEHDR_MASK);
>> +		}
>>
>>   		/* Set if packets are dropped when no descriptors available */
>>   		if (rxq->drop_en)
>> @@ -3696,6 +4182,13 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>   				       RTE_PKTMBUF_HEADROOM);
>>   		srrctl |= ((buf_size >> IXGBE_SRRCTL_BSIZEPKT_SHIFT) &
>>   			   IXGBE_SRRCTL_BSIZEPKT_MASK);
>> +
>> +		/*
>> +		 * TODO: Consider setting the Receive Descriptor Minimum
>> +		 * Threshold Size for an RSC case. This is not an obviously
>> +		 * beneficial option but one worth considering...
>> +		 */
>> +
>>   		IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(rxq->reg_idx), srrctl);
>>
>>   		buf_size = (uint16_t) ((srrctl & IXGBE_SRRCTL_BSIZEPKT_MASK) <<
>> @@ -3705,11 +4198,57 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>   		if (dev->data->dev_conf.rxmode.max_rx_pkt_len +
>>   					    2 * IXGBE_VLAN_TAG_SIZE > buf_size)
>>   			dev->data->scattered_rx = 1;
>> +
>> +		/* RSC per-queue configuration */
>> +		if (rx_conf->enable_lro) {
>> +			uint32_t eitr;
>> +
>> +			rscctl =
>> +				IXGBE_READ_REG(hw, IXGBE_RSCCTL(rxq->reg_idx));
>> +			psrtype =
>> +				IXGBE_READ_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx));
>> +			eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
>> +
>> +			rscctl |= IXGBE_RSCCTL_RSCEN;
>> +			rscctl |= get_rscctl_maxdesc(rxq->mb_pool);
>> +			psrtype |= IXGBE_PSRTYPE_TCPHDR;
>> +
>> +			/*
>> +			 * RSC: Set ITR interval corresponding to 2K ints/s.
>> +			 *
>> +			 * Full-sized RSC aggregations for a 10Gb/s link will
>> +			 * arrive at about 20K aggregation/s rate.
>> +			 *
>> +			 * 2K ints/s rate will make only 10% of the
>> +			 * aggregations to be closed due to the interrupt timer
>> +			 * expiration for a streaming at wire-speed case.
>> +			 *
>> +			 * For a sparse streaming case this setting will yield
>> +			 * at most 500us latency for a single RSC aggregation.
>> +			 */
>> +			eitr   |= (2000 | IXGBE_EITR_CNT_WDIS);
> Again, probably create some macro for the default ITR Interval value here.

Well, again - it's the only place where it's used and I've explained
it extensively in the comments in the code. Therefore I think this is
the most readable way to write it.
If it were used in at least two places, I would have put it in a
macro...

>
>> +
>> +			IXGBE_WRITE_REG(hw, IXGBE_RSCCTL(rxq->reg_idx), rscctl);
>> +			IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx),
>> +								       psrtype);
>> +			IXGBE_WRITE_REG(hw, IXGBE_EITR(rxq->reg_idx), eitr);
>> +
>> +			/*
>> +			 * RSC requires the mapping of the queue to the
>> +			 * interrupt vector.
>> +			 */
>> +			ixgbe_set_ivar(dev, rxq->reg_idx, i, 0);
> Hm, I wonder why we need to set up IVAR for RSC?
> Wouldn't just setting EITR be enough?

Nope. See 82599 spec chapter 4.6.7.2.2. I think I even tried not to map 
the queues to IVAR and it didn't work... ;)

>
>> +
>> +			rxq->rsc_en = 1;
>> +		}
>>   	}
>>
>>   	if (rx_conf->enable_scatter)
>>   		dev->data->scattered_rx = 1;
>>
>> +	if (rx_conf->enable_lro)
>> +		dev->data->lro = 1;
>> +
>>   	set_rx_function(dev);
>>
>>   	/*
>> @@ -3742,6 +4281,19 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>   		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
>>   	}
>>
>> +	/* Finalize RSC configuration  */
>> +	if (rx_conf->enable_lro) {
>> +		/*
>> +		 * Follow the instructions in the 4.6.7.2.1 of the Spec Rev. 3.0
>> +		 */
>> +		rdrxctl = IXGBE_READ_REG(hw, IXGBE_RDRXCTL);
>> +		rdrxctl |= IXGBE_RDRXCTL_RSCACKC;
>> +		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
>> +
>> +		PMD_INIT_LOG(INFO, "enabling LRO mode");
>> +	}
>> +
>> +
>>   	return 0;
>>   }
>>
>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>> index bbe5ff3..389173f 100644
>> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>> @@ -79,6 +79,10 @@ struct igb_rx_entry {
>>   	struct rte_mbuf *mbuf; /**< mbuf associated with RX descriptor. */
>>   };
>>
>> +struct igb_rsc_entry {
>> +	struct rte_mbuf *fbuf; /**< First segment of the fragmented packet. */
>> +};
>> +
>>   /**
>>    * Structure associated with each descriptor of the TX ring of a TX queue.
>>    */
>> @@ -105,6 +109,7 @@ struct igb_rx_queue {
>>   	volatile uint32_t   *rdt_reg_addr; /**< RDT register address. */
>>   	volatile uint32_t   *rdh_reg_addr; /**< RDH register address. */
>>   	struct igb_rx_entry *sw_ring; /**< address of RX software ring. */
>> +	struct igb_rsc_entry *sw_rsc_ring; /**< address of RSC software ring. */
>>   	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet. */
>>   	struct rte_mbuf *pkt_last_seg; /**< Last segment of current packet. */
>>   	uint64_t            mbuf_initializer; /**< value to init mbufs */
>> @@ -126,6 +131,7 @@ struct igb_rx_queue {
>>   	uint8_t             port_id;  /**< Device port identifier. */
>>   	uint8_t             crc_len;  /**< 0 if CRC stripped, 4 otherwise. */
>>   	uint8_t             drop_en;  /**< If not 0, set SRRCTL.Drop_En. */
>> +	uint8_t             rsc_en;   /**< If not 0, RSC is enabled. */
>>   	uint8_t             rx_deferred_start; /**< not in global dev start. */
>>   #ifdef RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC
>>   	/** need to alloc dummy mbuf, for wraparound when scanning hw ring */
>> --
>> 2.1.0

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
  2015-03-10  0:30   ` Ananyev, Konstantin
  2015-03-10 13:22     ` Vlad Zolotarov
@ 2015-03-10 17:51     ` Vlad Zolotarov
  1 sibling, 0 replies; 18+ messages in thread
From: Vlad Zolotarov @ 2015-03-10 17:51 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev



On 03/10/15 02:30, Ananyev, Konstantin wrote:
> Hi Vlad,
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vlad Zolotarov
>> Sent: Monday, March 09, 2015 7:07 PM
>> To: dev@dpdk.org
>> Subject: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
>>
>>      - Only x540 and 82599 devices support LRO.
>>      - Add the appropriate HW configuration.
>>      - Add RSC aware rx_pkt_burst() handlers:
>>         - Implemented bulk allocation and non-bulk allocation versions.
>>         - Add LRO-specific fields to rte_eth_rxmode, to rte_eth_dev_data
>>           and to igb_rx_queue.
>>         - Use the appropriate handler when LRO is requested.
>>
>> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
>> ---
>> New in v5:
>>     - Put the RTE_ETHDEV_HAS_LRO_SUPPORT definition at the beginning of rte_ethdev.h.
>>     - Removed the "TODO: Remove me" comment near RTE_ETHDEV_HAS_LRO_SUPPORT.
>>
>> New in v4:
>>     - Define RTE_ETHDEV_HAS_LRO_SUPPORT in rte_ethdev.h instead of
>>       RTE_ETHDEV_LRO_SUPPORT defined in config/common_linuxapp.
>>
>> New in v2:
>>     - Removed rte_eth_dev_data.lro_bulk_alloc.
>>     - Fixed a few styling and spelling issues.
>> ---
>>   lib/librte_ether/rte_ethdev.h       |   9 +-
>>   lib/librte_pmd_ixgbe/ixgbe_ethdev.c |   6 +
>>   lib/librte_pmd_ixgbe/ixgbe_ethdev.h |   5 +
>>   lib/librte_pmd_ixgbe/ixgbe_rxtx.c   | 562 +++++++++++++++++++++++++++++++++++-
>>   lib/librte_pmd_ixgbe/ixgbe_rxtx.h   |   6 +
>>   5 files changed, 581 insertions(+), 7 deletions(-)
>>
>> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
>> index 8db3127..44f081f 100644
>> --- a/lib/librte_ether/rte_ethdev.h
>> +++ b/lib/librte_ether/rte_ethdev.h
>> @@ -172,6 +172,9 @@ extern "C" {
>>
>>   #include <stdint.h>
>>
>> +/* Use this macro to check if LRO API is supported */
>> +#define RTE_ETHDEV_HAS_LRO_SUPPORT
>> +
>>   #include <rte_log.h>
>>   #include <rte_interrupts.h>
>>   #include <rte_pci.h>
>> @@ -320,14 +323,15 @@ struct rte_eth_rxmode {
>>   	enum rte_eth_rx_mq_mode mq_mode;
>>   	uint32_t max_rx_pkt_len;  /**< Only used if jumbo_frame enabled. */
>>   	uint16_t split_hdr_size;  /**< hdr buf size (header_split enabled).*/
>> -	uint8_t header_split : 1, /**< Header Split enable. */
>> +	uint16_t header_split : 1, /**< Header Split enable. */
>>   		hw_ip_checksum   : 1, /**< IP/UDP/TCP checksum offload enable. */
>>   		hw_vlan_filter   : 1, /**< VLAN filter enable. */
>>   		hw_vlan_strip    : 1, /**< VLAN strip enable. */
>>   		hw_vlan_extend   : 1, /**< Extended VLAN enable. */
>>   		jumbo_frame      : 1, /**< Jumbo Frame Receipt enable. */
>>   		hw_strip_crc     : 1, /**< Enable CRC stripping by hardware. */
>> -		enable_scatter   : 1; /**< Enable scatter packets rx handler */
>> +		enable_scatter   : 1, /**< Enable scatter packets rx handler */
>> +		enable_lro       : 1; /**< Enable LRO */
>>   };
>>
>>   /**
>> @@ -1515,6 +1519,7 @@ struct rte_eth_dev_data {
>>   	uint8_t port_id;           /**< Device [external] port identifier. */
>>   	uint8_t promiscuous   : 1, /**< RX promiscuous mode ON(1) / OFF(0). */
>>   		scattered_rx : 1,  /**< RX of scattered packets is ON(1) / OFF(0) */
>> +		lro          : 1,  /**< RX LRO is ON(1) / OFF(0) */
>>   		all_multicast : 1, /**< RX all multicast mode ON(1) / OFF(0). */
>>   		dev_started : 1;   /**< Device state: STARTED(1) / STOPPED(0). */
>>   };
>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>> index 9d3de1a..765174d 100644
>> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>> @@ -1648,6 +1648,7 @@ ixgbe_dev_stop(struct rte_eth_dev *dev)
>>
>>   	/* Clear stored conf */
>>   	dev->data->scattered_rx = 0;
>> +	dev->data->lro = 0;
>>   	hw->rx_bulk_alloc_allowed = false;
>>   	hw->rx_vec_allowed = false;
>>
>> @@ -2018,6 +2019,11 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
>>   		DEV_RX_OFFLOAD_IPV4_CKSUM |
>>   		DEV_RX_OFFLOAD_UDP_CKSUM  |
>>   		DEV_RX_OFFLOAD_TCP_CKSUM;
>> +
>> +	if (hw->mac.type == ixgbe_mac_82599EB ||
>> +	    hw->mac.type == ixgbe_mac_X540)
>> +		dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_TCP_LRO;
>> +
>>   	dev_info->tx_offload_capa =
>>   		DEV_TX_OFFLOAD_VLAN_INSERT |
>>   		DEV_TX_OFFLOAD_IPV4_CKSUM  |
>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>> index a549f5c..e206584 100644
>> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>> @@ -349,6 +349,11 @@ uint16_t ixgbe_recv_pkts_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
>>   uint16_t ixgbe_recv_scattered_pkts(void *rx_queue,
>>   		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>>
>> +uint16_t ixgbe_recv_pkts_lro(void *rx_queue,
>> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>> +uint16_t ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue,
>> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>> +
>>   uint16_t ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>>   		uint16_t nb_pkts);
>>
>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>> index 58e619b..944c662 100644
>> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>> @@ -1366,6 +1366,15 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>>   }
>>
>>   /**
>> + * Detect an RSC descriptor.
>> + */
>> +static inline uint32_t ixgbe_rsc_count(union ixgbe_adv_rx_desc *rx)
>> +{
>> +	return (rte_le_to_cpu_32(rx->wb.lower.lo_dword.data) &
>> +		IXGBE_RXDADV_RSCCNT_MASK) >> IXGBE_RXDADV_RSCCNT_SHIFT;
>> +}
>> +
>> +/**
>>    * Initialize the first mbuf of the returned packet:
>>    *    - RX port identifier,
>>    *    - hardware offload data, if any:
>> @@ -1410,6 +1419,291 @@ static inline void ixgbe_fill_cluster_head_buf(
>>   	}
>>   }
>>
>> +/**
>> + * Bulk receive handler for the LRO case.
>> + *
>> + * @rx_queue Rx queue handle
>> + * @rx_pkts table of received packets
>> + * @nb_pkts size of rx_pkts table
>> + * @bulk_alloc if TRUE bulk allocation is used for a HW ring refilling
>> + *
>> + * Handles the Rx HW ring completions when RSC feature is configured. Uses an
>> + * additional ring of igb_rsc_entry's that will hold the relevant RSC info.
>> + *
>> + * We use the same logic as in Linux and in FreeBSD ixgbe drivers:
>> + * 1) When non-EOP RSC completion arrives:
>> + *    a) Update the HEAD of the current RSC aggregation cluster with the new
>> + *       segment's data length.
>> + *    b) Set the "next" pointer of the current segment to point to the segment
>> + *       at the NEXTP index.
>> + *    c) Pass the HEAD of RSC aggregation cluster on to the next NEXTP entry
>> + *       in the sw_rsc_ring.
>> + * 2) When EOP arrives we just update the cluster's total length and offload
>> + *    flags and deliver the cluster up to the upper layers. In our case - put it
>> + *    in the rx_pkts table.
>> + *
>> + * Returns the number of received packets/clusters (according to the "bulk
>> + * receive" interface).
>> + */
>> +static inline uint16_t
>> +_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
>> +	       bool bulk_alloc)
>> +{
>> +	struct igb_rx_queue *rxq = rx_queue;
>> +	volatile union ixgbe_adv_rx_desc *rx_ring = rxq->rx_ring;
>> +	struct igb_rx_entry *sw_ring = rxq->sw_ring;
>> +	struct igb_rsc_entry *sw_rsc_ring = rxq->sw_rsc_ring;
>> +	uint16_t rx_id = rxq->rx_tail;
>> +	uint16_t nb_rx = 0;
>> +	uint16_t nb_hold = rxq->nb_rx_hold;
>> +	uint16_t prev_id = rxq->rx_tail;
>> +
>> +	while (nb_rx < nb_pkts) {
>> +		bool eop;
>> +		struct igb_rx_entry *rxe;
>> +		struct igb_rsc_entry *rsc_entry;
>> +		struct igb_rsc_entry *next_rsc_entry;
>> +		struct igb_rx_entry *next_rxe;
>> +		struct rte_mbuf *first_seg;
>> +		struct rte_mbuf *rxm;
>> +		struct rte_mbuf *nmb;
>> +		union ixgbe_adv_rx_desc rxd;
>> +		uint16_t data_len;
>> +		uint16_t next_id;
>> +		volatile union ixgbe_adv_rx_desc *rxdp;
>> +		uint32_t staterr;
>> +
>> +next_desc:
>> +		/*
>> +		 * The code in this whole file uses the volatile pointer to
>> +		 * ensure the read ordering of the status and the rest of the
>> +		 * descriptor fields (on the compiler level only!!!). This is so
>> +		 * UGLY - why not to just use the compiler barrier instead? DPDK
>> +		 * even has the rte_compiler_barrier() for that.
>> +		 *
>> +		 * But most importantly this is just wrong because this doesn't
>> +		 * ensure memory ordering in a general case at all. For
>> +		 * instance, DPDK is supposed to work on Power CPUs where
>> +		 * compiler barrier may just not be enough!
>> +		 *
>> +		 * I tried to write only this function properly to have a
>> +		 * starting point (as a part of an LRO/RSC series) but the
>> +		 * compiler cursed at me when I tried to cast away the
>> +		 * "volatile" from rx_ring (yes, it's volatile too!!!). So, I'm
>> +		 * keeping it the way it is for now.
>> +		 *
>> +		 * The code in this file is broken in so many other places and
>> +		 * will just not work on a big endian CPU anyway therefore the
>> +		 * lines below will have to be revisited together with the rest
>> +		 * of the ixgbe PMD.
>> +		 *
>> +		 * TODO:
>> +		 *    - Get rid of "volatile" crap and let the compiler do its
>> +		 *      job.
>> +		 *    - Use the proper memory barrier (rte_rmb()) to ensure the
>> +		 *      memory ordering below.
> Ok, so you wanted to put rte_rmb(), straight after:
> staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> correct?
> I agree that for machines with relaxed memory model (PPC) we do need it here.
> So why not just put it there, instead of complaining about it in comments? ;)
>
> About rxdp being a pointer to volatile, why does it bother you that much?
> You copy the whole RXD to the local variable anyway, and then reference it only to setup new addresses.
>
>> +		 */
>> +		rxdp = &rx_ring[rx_id];
>> +		staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>> +
>> +		if (!(staterr & IXGBE_RXDADV_STAT_DD))
>> +			break;
>> +
>> +		rxd = *rxdp;
>> +
>> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
>> +				  "staterr=0x%x data_len=%u",
>> +			   rxq->port_id, rxq->queue_id, rx_id, staterr,
>> +			   rte_le_to_cpu_16(rxd.wb.upper.length));
>> +
>> +		if (!bulk_alloc) {
>> +			nmb = rte_rxmbuf_alloc(rxq->mb_pool);
>> +			if (nmb == NULL) {
>> +				PMD_RX_LOG(DEBUG, "RX mbuf alloc failed "
>> +						  "port_id=%u queue_id=%u",
>> +					   rxq->port_id, rxq->queue_id);
>> +
>> +				rte_eth_devices[rxq->port_id].data->
>> +							rx_mbuf_alloc_failed++;
>> +				break;
>> +			}
>> +		} else if (nb_hold > rxq->rx_free_thresh) {
>> +			uint16_t next_rdt = rxq->rx_free_trigger;
>> +
>> +			if (!ixgbe_rx_alloc_bufs(rxq, false)) {
>> +				rte_wmb();
>> +				IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr,
>> +						    next_rdt);
>> +				nb_hold -= rxq->rx_free_thresh;
>> +			} else {
>> +				PMD_RX_LOG(DEBUG, "RX bulk alloc failed "
>> +						  "port_id=%u queue_id=%u",
>> +					   rxq->port_id, rxq->queue_id);
>> +
>> +				rte_eth_devices[rxq->port_id].data->
>> +							rx_mbuf_alloc_failed++;
>> +				break;
>> +			}
>> +		}
>> +
>> +		nb_hold++;
>> +		rxe = &sw_ring[rx_id];
>> +		eop = staterr & IXGBE_RXDADV_STAT_EOP;
>> +
>> +		next_id = rx_id + 1;
>> +		if (next_id == rxq->nb_rx_desc)
>> +			next_id = 0;
>> +
>> +		/* Prefetch next mbuf while processing current one. */
>> +		rte_ixgbe_prefetch(sw_ring[next_id].mbuf);
>> +
>> +		/*
>> +		 * When next RX descriptor is on a cache-line boundary,
>> +		 * prefetch the next 4 RX descriptors and the next 4 pointers
>> +		 * to mbufs.
>> +		 */
>> +		if ((next_id & 0x3) == 0) {
>> +			rte_ixgbe_prefetch(&rx_ring[next_id]);
>> +			rte_ixgbe_prefetch(&sw_ring[next_id]);
>> +		}
>> +
>> +		rxm = rxe->mbuf;
>> +
>> +		if (!bulk_alloc) {
>> +			__le64 dma =
>> +			  rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(nmb));
>> +			/*
>> +			 * Update RX descriptor with the physical address of the
>> +			 * new data buffer of the new allocated mbuf.
>> +			 */
>> +			rxe->mbuf = nmb;
>> +
>> +			rxm->data_off = RTE_PKTMBUF_HEADROOM;
>> +			rxdp->read.hdr_addr = dma;
>> +			rxdp->read.pkt_addr = dma;
>> +		}
>> +		/*
>> +		 * Set data length & data buffer address of mbuf.
>> +		 */
>> +		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
>> +		rxm->data_len = data_len;
>> +
>> +		if (!eop) {
>> +			uint16_t nextp_id;
>> +			/*
>> +			 * Get next descriptor index:
>> +			 *  - For RSC it's in the NEXTP field.
>> +			 *  - For a scattered packet - it's just a following
>> +			 *    descriptor.
>> +			 */
>> +			if (ixgbe_rsc_count(&rxd))
>> +				nextp_id =
>> +					(staterr & IXGBE_RXDADV_NEXTP_MASK) >>
>> +						       IXGBE_RXDADV_NEXTP_SHIFT;
>> +			else
>> +				nextp_id = next_id;
>> +
>> +			next_rsc_entry = &sw_rsc_ring[nextp_id];
>> +			next_rxe = &sw_ring[nextp_id];
>> +			rte_ixgbe_prefetch(next_rxe);
>> +		}
>> +
>> +		rsc_entry = &sw_rsc_ring[rx_id];
>> +		first_seg = rsc_entry->fbuf;
>> +		rsc_entry->fbuf = NULL;
>> +
>> +		/*
>> +		 * If this is the first buffer of the received packet,
>> +		 * set the pointer to the first mbuf of the packet and
>> +		 * initialize its context.
>> +		 * Otherwise, update the total length and the number of segments
>> +		 * of the current scattered packet, and update the pointer to
>> +		 * the last mbuf of the current packet.
>> +		 */
>> +		if (first_seg == NULL) {
>> +			first_seg = rxm;
>> +			first_seg->pkt_len = data_len;
>> +			first_seg->nb_segs = 1;
>> +		} else {
>> +			first_seg->pkt_len += data_len;
>> +			first_seg->nb_segs++;
>> +		}
>> +
>> +		prev_id = rx_id;
>> +		rx_id = next_id;
>> +
>> +		/*
>> +		 * If this is not the last buffer of the received packet, update
>> +		 * the pointer to the first mbuf at the NEXTP entry in the
>> +		 * sw_rsc_ring and continue to parse the RX ring.
>> +		 */
>> +		if (!eop) {
>> +			rxm->next = next_rxe->mbuf;
>> +			next_rsc_entry->fbuf = first_seg;
>> +			goto next_desc;
> So _recv_pkts_lro() can return with one of rxq->rsc_entry[i] != NULL, correct?
> If so, then I think you need to add code to ixgbe_rx_queue_release_mbufs() that would go through
> all rsc_entry[] to find entries whose fbuf is != NULL, call rte_pktmbuf_free() for them and reset them to NULL.
>   To handle the case:
> recv_pkts_lro(rxq, ...);
> rte_eth_dev_stop();
> rte_eth_dev_start();
> recv_pkts_lro(rxq, ...);

While working on the issue above I've suddenly noticed something that 
looks like a very nasty bug in the ixgbe_reset_rx_queue():

Notice how zeroed_desc is initialized:

static const union ixgbe_adv_rx_desc zeroed_desc = { .read = {
                        .pkt_addr = 0}};

In general, static variables should be automatically zero-initialized, 
but the relevant commit says that ICC got angry about it not being 
explicitly initialized and thus the initialization was added. If for 
any reason ICC doesn't follow the C standard here (again!) and doesn't 
zero-initialize the static variable, then we have a serious problem, 
since very important descriptor fields like "status" and "length" are 
located in the second part of the descriptor, which won't be 
initialized. I suppose everybody reading this can come up with the 
use case where the PMD will deliver trash to the caller if that's 
true... ;)

Could any of the ICC specialists here please clarify whether the issue 
I've described above is real, and what is going on with the 
initialization of these static variables?... ;)
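
For reference, the rule in question (C99 6.7.8: members not covered by a 
brace-enclosed initializer are initialized as for static storage duration, 
i.e. zeroed) can be checked with a tiny standalone program - a sketch with 
a made-up descriptor layout, not the real ixgbe one:

	#include <assert.h>
	#include <stdint.h>
	#include <string.h>

	union fake_rx_desc {
		struct { uint64_t pkt_addr; uint64_t hdr_addr; } read;
		struct { uint64_t data; uint32_t status; uint32_t length; } wb;
	};

	int main(void)
	{
		/* Only read.pkt_addr is named; a conforming compiler must
		 * still zero the rest, including the write-back fields. */
		static const union fake_rx_desc zeroed_desc =
			{ .read = { .pkt_addr = 0 } };
		union fake_rx_desc ref;

		memset(&ref, 0, sizeof(ref));
		assert(memcmp(&zeroed_desc, &ref, sizeof(zeroed_desc)) == 0);
		return 0;
	}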

thanks,
vlad

> BTW, that also means that you can't do:
> rxm->next = next_rxe->mbuf;
> above, and
> rxm->next = NULL;
> should be done before 'goto next_desc;' too
>
>> +		}
>> +
>> +		/*
>> +		 * This is the last buffer of the received packet - return
>> +		 * the current cluster to the user.
>> +		 */
>> +		rxm->next = NULL;
>> +
>> +		/* Initialize the first mbuf of the returned packet */
>> +		ixgbe_fill_cluster_head_buf(first_seg, &rxd, rxq->port_id,
>> +					    staterr);
>> +
>> +		/* Prefetch data of first segment, if configured to do so. */
>> +		rte_packet_prefetch((char *)first_seg->buf_addr +
>> +			first_seg->data_off);
>> +
>> +		/*
>> +		 * Store the mbuf address into the next entry of the array
>> +		 * of returned packets.
>> +		 */
>> +		rx_pkts[nb_rx++] = first_seg;
>> +	}
>> +
>> +	/*
>> +	 * Record index of the next RX descriptor to probe.
>> +	 */
>> +	rxq->rx_tail = rx_id;
>> +
>> +	/*
>> +	 * If the number of free RX descriptors is greater than the RX free
>> +	 * threshold of the queue, advance the Receive Descriptor Tail (RDT)
>> +	 * register.
>> +	 * Update the RDT with the value of the last processed RX descriptor
>> +	 * minus 1, to guarantee that the RDT register is never equal to the
>> +	 * RDH register, which creates a "full" ring situation from the
>> +	 * hardware point of view...
>> +	 */
>> +	if (!bulk_alloc && nb_hold > rxq->rx_free_thresh) {
>> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_tail=%u "
>> +			   "nb_hold=%u nb_rx=%u",
>> +			   rxq->port_id, rxq->queue_id, rx_id, nb_hold, nb_rx);
>> +
> I suppose if you do wmb() after rte_rxmbuf_alloc(), you'd better do it here too.
>
>> +		IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, prev_id);
>> +		nb_hold = 0;
>> +	}
>> +
>> +	rxq->nb_rx_hold = nb_hold;
>> +	return nb_rx;
>> +}
>> +
>> +uint16_t
>> +ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
>> +{
>> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, false);
>> +}
>> +
>> +uint16_t
>> +ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
>> +			       uint16_t nb_pkts)
>> +{
>> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, true);
>> +}
>> +
>>   uint16_t
>>   ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>>   			  uint16_t nb_pkts)
>> @@ -2024,6 +2318,7 @@ ixgbe_rx_queue_release(struct igb_rx_queue *rxq)
>>   	if (rxq != NULL) {
>>   		ixgbe_rx_queue_release_mbufs(rxq);
>>   		rte_free(rxq->sw_ring);
>> +		rte_free(rxq->sw_rsc_ring);
>>   		rte_free(rxq);
>>   	}
>>   }
>> @@ -2146,6 +2441,7 @@ ixgbe_reset_rx_queue(struct ixgbe_hw *hw, struct igb_rx_queue *rxq)
>>   	rxq->nb_rx_hold = 0;
>>   	rxq->pkt_first_seg = NULL;
>>   	rxq->pkt_last_seg = NULL;
>> +	rxq->rsc_en = 0;
>>   }
>>
>>   int
>> @@ -2160,6 +2456,14 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
>>   	struct igb_rx_queue *rxq;
>>   	struct ixgbe_hw     *hw;
>>   	uint16_t len;
>> +	struct rte_eth_dev_info dev_info = { 0 };
>> +	struct rte_eth_rxmode *dev_rx_mode = &dev->data->dev_conf.rxmode;
>> +	bool rsc_requested = false;
>> +
>> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
>> +	if ((dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO) &&
>> +	    dev_rx_mode->enable_lro)
>> +		rsc_requested = true;
>>
>>   	PMD_INIT_FUNC_TRACE();
>>   	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>> @@ -2265,12 +2569,28 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
>>   	rxq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
>>   					  sizeof(struct igb_rx_entry) * len,
>>   					  RTE_CACHE_LINE_SIZE, socket_id);
>> -	if (rxq->sw_ring == NULL) {
>> +	if (!rxq->sw_ring) {
> Wonder what was wrong with that one? :)
>
>>   		ixgbe_rx_queue_release(rxq);
>>   		return (-ENOMEM);
>>   	}
>> -	PMD_INIT_LOG(DEBUG, "sw_ring=%p hw_ring=%p dma_addr=0x%"PRIx64,
>> -		     rxq->sw_ring, rxq->rx_ring, rxq->rx_ring_phys_addr);
>> +
>> +	if (rsc_requested) {
>> +		rxq->sw_rsc_ring =
>> +			rte_zmalloc_socket("rxq->sw_rsc_ring",
>> +					   sizeof(struct igb_rsc_entry) * len,
>> +					   RTE_CACHE_LINE_SIZE, socket_id);
>> +		if (!rxq->sw_rsc_ring) {
>> +			ixgbe_rx_queue_release(rxq);
>> +			return (-ENOMEM);
>> +		}
>> +	} else {
>> +		rxq->sw_rsc_ring = NULL;
>> +	}
>> +
>> +	PMD_INIT_LOG(DEBUG, "sw_ring=%p sw_rsc_ring=%p hw_ring=%p "
>> +			    "dma_addr=0x%"PRIx64,
>> +		     rxq->sw_ring, rxq->sw_rsc_ring, rxq->rx_ring,
>> +		     rxq->rx_ring_phys_addr);
>>
>>   	if (!rte_is_power_of_2(nb_desc)) {
>>   		PMD_INIT_LOG(DEBUG, "queue[%d] doesn't meet Vector Rx "
>> @@ -3515,6 +3835,84 @@ ixgbe_dev_mq_tx_configure(struct rte_eth_dev *dev)
>>   	return 0;
>>   }
>>
>> +/**
>> + * get_rscctl_maxdesc - Calculate the RSCCTL[n].MAXDESC for PF
>> + *
>> + * Return the RSCCTL[n].MAXDESC for 82599 and x540 PF devices according to the
>> + * spec rev. 3.0 chapter 8.2.3.8.13.
>> + *
>> + * @pool Memory pool of the Rx queue
>> + */
>> +static inline uint32_t get_rscctl_maxdesc(struct rte_mempool *pool)
>> +{
>> +	struct rte_pktmbuf_pool_private *mp_priv = rte_mempool_get_priv(pool);
>> +
>> +	/* MAXDESC * SRRCTL.BSIZEPKT must not exceed 64 KB minus one */
>> +	uint16_t maxdesc =
>> +		65535 / (mp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM);
> A nit: use some macro (UINT16_MAX?) instead of a hardcoded constant if possible.
>
>> +
>> +	if (maxdesc >= 16)
>> +		return IXGBE_RSCCTL_MAXDESC_16;
>> +	else if (maxdesc >= 8)
>> +		return IXGBE_RSCCTL_MAXDESC_8;
>> +	else if (maxdesc >= 4)
>> +		return IXGBE_RSCCTL_MAXDESC_4;
>> +	else
>> +		return IXGBE_RSCCTL_MAXDESC_1;
>> +}
>> +
>> +/* (Taken from FreeBSD tree)
>> +** Setup the correct IVAR register for a particular MSIX interrupt
>> +**   (yes this is all very magic and confusing :)
>> +**  - entry is the register array entry
>> +**  - vector is the MSIX vector for this queue
>> +**  - type is RX/TX/MISC
>> +*/
>> +static void
>> +ixgbe_set_ivar(struct rte_eth_dev *dev, u8 entry, u8 vector, s8 type)
>> +{
>> +	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>> +	u32 ivar, index;
>> +
>> +	vector |= IXGBE_IVAR_ALLOC_VAL;
>> +
>> +	switch (hw->mac.type) {
>> +
>> +	case ixgbe_mac_82598EB:
>> +		if (type == -1)
>> +			entry = IXGBE_IVAR_OTHER_CAUSES_INDEX;
>> +		else
>> +			entry += (type * 64);
>> +		index = (entry >> 2) & 0x1F;
>> +		ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(index));
>> +		ivar &= ~(0xFF << (8 * (entry & 0x3)));
>> +		ivar |= (vector << (8 * (entry & 0x3)));
>> +		IXGBE_WRITE_REG(hw, IXGBE_IVAR(index), ivar);
>> +		break;
>> +
>> +	case ixgbe_mac_82599EB:
>> +	case ixgbe_mac_X540:
>> +		if (type == -1) { /* MISC IVAR */
>> +			index = (entry & 1) * 8;
>> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR_MISC);
>> +			ivar &= ~(0xFF << index);
>> +			ivar |= (vector << index);
>> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR_MISC, ivar);
>> +		} else {	/* RX/TX IVARS */
>> +			index = (16 * (entry & 1)) + (8 * type);
>> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(entry >> 1));
>> +			ivar &= ~(0xFF << index);
>> +			ivar |= (vector << index);
>> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR(entry >> 1), ivar);
>> +		}
>> +
>> +		break;
>> +
>> +	default:
>> +		break;
>> +	}
>> +}
>> +
>>   void set_rx_function(struct rte_eth_dev *dev)
>>   {
>>   	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>> @@ -3565,6 +3963,25 @@ void set_rx_function(struct rte_eth_dev *dev)
>>   			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
>>   		}
>>   	}
>> +
>> +	/*
>> +	 * Initialize the appropriate LRO callback.
>> +	 *
>> +	 * If all queues satisfy the bulk allocation preconditions
>> +	 * (hw->rx_bulk_alloc_allowed is TRUE) then we may use bulk allocation.
>> +	 * Otherwise use a single allocation version.
>> +	 */
>> +	if (dev->data->lro) {
>> +		if (hw->rx_bulk_alloc_allowed) {
>> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a bulk "
>> +					   "allocation version");
>> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro_bulk_alloc;
>> +		} else {
>> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a single "
>> +					   "allocation version");
>> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro;
>> +		}
>> +	}
>>   }
> As I understand, ixgbe_recv_pkts_lro() can handle both LRO and normal scattered packets?
> If so, then can we remove ixgbe_recv_scattered_pkts() altogether and use ixgbe_recv_pkts_lro() for both cases?
>
>>   /*
>> @@ -3583,10 +4000,26 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>   	uint32_t maxfrs;
>>   	uint32_t srrctl;
>>   	uint32_t rdrxctl;
>> +	uint32_t rscctl;
>> +	uint32_t psrtype;
>> +	uint32_t rfctl;
>>   	uint32_t rxcsum;
>>   	uint16_t buf_size;
>>   	uint16_t i;
>>   	struct rte_eth_rxmode *rx_conf = &dev->data->dev_conf.rxmode;
>> +	struct rte_eth_dev_info dev_info = { 0 };
>> +	bool rsc_capable = false;
>> +
>> +	/* Sanity check */
>> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
>> +	if (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO)
>> +		rsc_capable = true;
> @ 7.11.1 82599 spec says:
> " Note that in SR-IOV mode the RSC must be disabled globally by setting the RFCTL.RSC_DIS bit."
> Add a check?
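
A rough sketch of such a check near the top of ixgbe_dev_rx_init() (assuming 
the SR-IOV state can be queried via RTE_ETH_DEV_SRIOV(dev).active; a sketch 
only, not part of the patch):

	if (rx_conf->enable_lro && RTE_ETH_DEV_SRIOV(dev).active != 0) {
		PMD_INIT_LOG(CRIT, "LRO can't be enabled in SR-IOV mode");
		return -EINVAL;
	}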
>
>> +
>> +	if (!rsc_capable && rx_conf->enable_lro) {
>> +		PMD_INIT_LOG(CRIT, "LRO is requested on HW that doesn't "
>> +				   "support it");
>> +		return -EINVAL;
>> +	}
>>
>>   	PMD_INIT_FUNC_TRACE();
>>   	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>> @@ -3606,13 +4039,44 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>   	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
>>
>>   	/*
>> +	 * RFCTL configuration
>> +	 *
>> +	 * Since NFS packet coalescing is not supported - clear RFCTL.NFSW_DIS
>> +	 * and RFCTL.NFSR_DIS when RSC is enabled.
>> +	 */
>> +	if (rsc_capable) {
>> +		rfctl = IXGBE_READ_REG(hw, IXGBE_RFCTL);
>> +		if (rx_conf->enable_lro) {
>> +			rfctl &= ~(IXGBE_RFCTL_RSC_DIS | IXGBE_RFCTL_NFSW_DIS |
>> +				   IXGBE_RFCTL_NFSR_DIS);
>> +		} else {
>> +			rfctl |= IXGBE_RFCTL_RSC_DIS;
>> +		}
>> +
>> +		IXGBE_WRITE_REG(hw, IXGBE_RFCTL, rfctl);
>> +	}
>> +
>> +
>> +	/*
>>   	 * Configure CRC stripping, if any.
>>   	 */
>>   	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
>>   	if (rx_conf->hw_strip_crc)
>>   		hlreg0 |= IXGBE_HLREG0_RXCRCSTRP;
>> -	else
>> +	else {
>>   		hlreg0 &= ~IXGBE_HLREG0_RXCRCSTRP;
>> +		if (rx_conf->enable_lro) {
>> +			/*
>> +			 * According to chapter 4.6.7.2.1 of the Spec Rev.
>> +			 * 3.0, RSC configuration requires HW CRC stripping to
>> +			 * be enabled. If the user requested both HW CRC
>> +			 * stripping off and RSC on - return an error.
>> +			 */
>> +			PMD_INIT_LOG(CRIT, "LRO can't be enabled when HW CRC "
>> +					    "is disabled");
>> +			return -EINVAL;
>> +		}
>> +	}
>>
>>   	/*
>>   	 * Configure jumbo frame support, if any.
>> @@ -3664,9 +4128,18 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>   		 * Configure Header Split
>>   		 */
>>   		if (rx_conf->header_split) {
>> +			/*
>> +			 * Print a warning if split_hdr_size is less
>> +			 * than 128 bytes when RSC is requested.
>> +			 */
>> +			if (rx_conf->enable_lro &&
>> +			    rx_conf->split_hdr_size < 128)
>> +				PMD_INIT_LOG(INFO, "split_hdr_size less than "
>> +						   "128 bytes (%d)!",
>> +					     rx_conf->split_hdr_size);
>> +
>>   			if (hw->mac.type == ixgbe_mac_82599EB) {
>>   				/* Must setup the PSRTYPE register */
>> -				uint32_t psrtype;
>>   				psrtype = IXGBE_PSRTYPE_TCPHDR |
>>   					IXGBE_PSRTYPE_UDPHDR   |
>>   					IXGBE_PSRTYPE_IPV4HDR  |
>> @@ -3679,7 +4152,20 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>   			srrctl |= IXGBE_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
>>   		} else
>>   #endif
>> +		{
>>   			srrctl = IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF;
>> +			/*
>> +			 * Following chapter 4.6.7.2.1 of the 82599/x540
>> +			 * Spec, if RSC is enabled the SRRCTL[n].BSIZEHEADER
>> +			 * should be configured even if header split is not
>> +			 * enabled. In the latter case we will configure it to
>> +			 * 128 bytes, following the recommendation in the spec.
>> +			 */
>> +			if (rx_conf->enable_lro)
>> +				srrctl |=
>> +				     ((128 << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &
>> +						    IXGBE_SRRCTL_BSIZEHDR_MASK);
>> +		}
>>
>>   		/* Set if packets are dropped when no descriptors available */
>>   		if (rxq->drop_en)
>> @@ -3696,6 +4182,13 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>   				       RTE_PKTMBUF_HEADROOM);
>>   		srrctl |= ((buf_size >> IXGBE_SRRCTL_BSIZEPKT_SHIFT) &
>>   			   IXGBE_SRRCTL_BSIZEPKT_MASK);
>> +
>> +		/*
>> +		 * TODO: Consider setting the Receive Descriptor Minimum
>> +		 * Threshold Size for the RSC case. This is not an obviously
>> +		 * beneficial option but one worth considering...
>> +		 */
>> +
>>   		IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(rxq->reg_idx), srrctl);
>>
>>   		buf_size = (uint16_t) ((srrctl & IXGBE_SRRCTL_BSIZEPKT_MASK) <<
>> @@ -3705,11 +4198,57 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>   		if (dev->data->dev_conf.rxmode.max_rx_pkt_len +
>>   					    2 * IXGBE_VLAN_TAG_SIZE > buf_size)
>>   			dev->data->scattered_rx = 1;
>> +
>> +		/* RSC per-queue configuration */
>> +		if (rx_conf->enable_lro) {
>> +			uint32_t eitr;
>> +
>> +			rscctl =
>> +				IXGBE_READ_REG(hw, IXGBE_RSCCTL(rxq->reg_idx));
>> +			psrtype =
>> +				IXGBE_READ_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx));
>> +			eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
>> +
>> +			rscctl |= IXGBE_RSCCTL_RSCEN;
>> +			rscctl |= get_rscctl_maxdesc(rxq->mb_pool);
>> +			psrtype |= IXGBE_PSRTYPE_TCPHDR;
>> +
>> +			/*
>> +			 * RSC: Set ITR interval corresponding to 2K ints/s.
>> +			 *
>> +			 * Full-sized RSC aggregations for a 10Gb/s link will
>> +			 * arrive at about 20K aggregation/s rate.
>> +			 *
>> +			 * 2K ints/s rate will cause only 10% of the
>> +			 * aggregations to be closed due to the interrupt timer
>> +			 * expiration for a streaming at wire-speed case.
>> +			 *
>> +			 * For a sparse streaming case this setting will yield
>> +			 * at most 500us latency for a single RSC aggregation.
>> +			 */
>> +			eitr   |= (2000 | IXGBE_EITR_CNT_WDIS);
> Again probably create some macro for ITR Interval default value here.
>
>> +
>> +			IXGBE_WRITE_REG(hw, IXGBE_RSCCTL(rxq->reg_idx), rscctl);
>> +			IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx),
>> +								       psrtype);
>> +			IXGBE_WRITE_REG(hw, IXGBE_EITR(rxq->reg_idx), eitr);
>> +
>> +			/*
>> +			 * RSC requires the mapping of the queue to the
>> +			 * interrupt vector.
>> +			 */
>> +			ixgbe_set_ivar(dev, rxq->reg_idx, i, 0);
> Hm, wonder why do we need to setup IVAR for RSC?
> Wouldn't just setting EITR be enough?
>
>> +
>> +			rxq->rsc_en = 1;
>> +		}
>>   	}
>>
>>   	if (rx_conf->enable_scatter)
>>   		dev->data->scattered_rx = 1;
>>
>> +	if (rx_conf->enable_lro)
>> +		dev->data->lro = 1;
>> +
>>   	set_rx_function(dev);
>>
>>   	/*
>> @@ -3742,6 +4281,19 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>   		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
>>   	}
>>
>> +	/* Finalize RSC configuration  */
>> +	if (rx_conf->enable_lro) {
>> +		/*
>> +		 * Follow the instructions in the 4.6.7.2.1 of the Spec Rev. 3.0
>> +		 */
>> +		rdrxctl = IXGBE_READ_REG(hw, IXGBE_RDRXCTL);
>> +		rdrxctl |= IXGBE_RDRXCTL_RSCACKC;
>> +		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
>> +
>> +		PMD_INIT_LOG(INFO, "enabling LRO mode");
>> +	}
>> +
>> +
>>   	return 0;
>>   }
>>
>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>> index bbe5ff3..389173f 100644
>> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>> @@ -79,6 +79,10 @@ struct igb_rx_entry {
>>   	struct rte_mbuf *mbuf; /**< mbuf associated with RX descriptor. */
>>   };
>>
>> +struct igb_rsc_entry {
>> +	struct rte_mbuf *fbuf; /**< First segment of the fragmented packet. */
>> +};
>> +
>>   /**
>>    * Structure associated with each descriptor of the TX ring of a TX queue.
>>    */
>> @@ -105,6 +109,7 @@ struct igb_rx_queue {
>>   	volatile uint32_t   *rdt_reg_addr; /**< RDT register address. */
>>   	volatile uint32_t   *rdh_reg_addr; /**< RDH register address. */
>>   	struct igb_rx_entry *sw_ring; /**< address of RX software ring. */
>> +	struct igb_rsc_entry *sw_rsc_ring; /**< address of RSC software ring. */
>>   	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet. */
>>   	struct rte_mbuf *pkt_last_seg; /**< Last segment of current packet. */
>>   	uint64_t            mbuf_initializer; /**< value to init mbufs */
>> @@ -126,6 +131,7 @@ struct igb_rx_queue {
>>   	uint8_t             port_id;  /**< Device port identifier. */
>>   	uint8_t             crc_len;  /**< 0 if CRC stripped, 4 otherwise. */
>>   	uint8_t             drop_en;  /**< If not 0, set SRRCTL.Drop_En. */
>> +	uint8_t             rsc_en;   /**< If not 0, RSC is enabled. */
>>   	uint8_t             rx_deferred_start; /**< not in global dev start. */
>>   #ifdef RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC
>>   	/** need to alloc dummy mbuf, for wraparound when scanning hw ring */
>> --
>> 2.1.0

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
  2015-03-10 13:22     ` Vlad Zolotarov
@ 2015-03-10 20:09       ` Ananyev, Konstantin
  2015-03-10 21:36         ` Vlad Zolotarov
  0 siblings, 1 reply; 18+ messages in thread
From: Ananyev, Konstantin @ 2015-03-10 20:09 UTC (permalink / raw)
  To: Vlad Zolotarov, dev

> 
> > Hi Vlad,
> >
> >> -----Original Message-----
> >> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Vlad Zolotarov
> >> Sent: Monday, March 09, 2015 7:07 PM
> >> To: dev at dpdk.org
> >> Subject: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
> >>
> >>      - Only x540 and 82599 devices support LRO.
> >>      - Add the appropriate HW configuration.
> >>      - Add RSC aware rx_pkt_burst() handlers:
> >>         - Implemented bulk allocation and non-bulk allocation versions.
> >>         - Add LRO-specific fields to rte_eth_rxmode, to rte_eth_dev_data
> >>           and to igb_rx_queue.
> >>         - Use the appropriate handler when LRO is requested.
> >>
> >> Signed-off-by: Vlad Zolotarov <vladz at cloudius-systems.com>
> >> ---
> >> New in v5:
> >>     - Put the RTE_ETHDEV_HAS_LRO_SUPPORT definition at the beginning of rte_ethdev.h.
> >>     - Removed the "TODO: Remove me" comment near RTE_ETHDEV_HAS_LRO_SUPPORT.
> >>
> >> New in v4:
> >>     - Define RTE_ETHDEV_HAS_LRO_SUPPORT in rte_ethdev.h instead of
> >>       RTE_ETHDEV_LRO_SUPPORT defined in config/common_linuxapp.
> >>
> >> New in v2:
> >>     - Removed rte_eth_dev_data.lro_bulk_alloc.
> >>     - Fixed a few styling and spelling issues.
> >> ---
> >>   lib/librte_ether/rte_ethdev.h       |   9 +-
> >>   lib/librte_pmd_ixgbe/ixgbe_ethdev.c |   6 +
> >>   lib/librte_pmd_ixgbe/ixgbe_ethdev.h |   5 +
> >>   lib/librte_pmd_ixgbe/ixgbe_rxtx.c   | 562 +++++++++++++++++++++++++++++++++++-
> >>   lib/librte_pmd_ixgbe/ixgbe_rxtx.h   |   6 +
> >>   5 files changed, 581 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
> >> index 8db3127..44f081f 100644
> >> --- a/lib/librte_ether/rte_ethdev.h
> >> +++ b/lib/librte_ether/rte_ethdev.h
> >> @@ -172,6 +172,9 @@ extern "C" {
> >>
> >>   #include <stdint.h>
> >>
> >> +/* Use this macro to check if LRO API is supported */
> >> +#define RTE_ETHDEV_HAS_LRO_SUPPORT
> >> +
> >>   #include <rte_log.h>
> >>   #include <rte_interrupts.h>
> >>   #include <rte_pci.h>
> >> @@ -320,14 +323,15 @@ struct rte_eth_rxmode {
> >>   	enum rte_eth_rx_mq_mode mq_mode;
> >>   	uint32_t max_rx_pkt_len;  /**< Only used if jumbo_frame enabled. */
> >>   	uint16_t split_hdr_size;  /**< hdr buf size (header_split enabled).*/
> >> -	uint8_t header_split : 1, /**< Header Split enable. */
> >> +	uint16_t header_split : 1, /**< Header Split enable. */
> >>   		hw_ip_checksum   : 1, /**< IP/UDP/TCP checksum offload enable. */
> >>   		hw_vlan_filter   : 1, /**< VLAN filter enable. */
> >>   		hw_vlan_strip    : 1, /**< VLAN strip enable. */
> >>   		hw_vlan_extend   : 1, /**< Extended VLAN enable. */
> >>   		jumbo_frame      : 1, /**< Jumbo Frame Receipt enable. */
> >>   		hw_strip_crc     : 1, /**< Enable CRC stripping by hardware. */
> >> -		enable_scatter   : 1; /**< Enable scatter packets rx handler */
> >> +		enable_scatter   : 1, /**< Enable scatter packets rx handler */
> >> +		enable_lro       : 1; /**< Enable LRO */
> >>   };
> >>
> >>   /**
> >> @@ -1515,6 +1519,7 @@ struct rte_eth_dev_data {
> >>   	uint8_t port_id;           /**< Device [external] port identifier. */
> >>   	uint8_t promiscuous   : 1, /**< RX promiscuous mode ON(1) / OFF(0). */
> >>   		scattered_rx : 1,  /**< RX of scattered packets is ON(1) / OFF(0) */
> >> +		lro          : 1,  /**< RX LRO is ON(1) / OFF(0) */
> >>   		all_multicast : 1, /**< RX all multicast mode ON(1) / OFF(0). */
> >>   		dev_started : 1;   /**< Device state: STARTED(1) / STOPPED(0). */
> >>   };
> >> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> >> index 9d3de1a..765174d 100644
> >> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> >> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> >> @@ -1648,6 +1648,7 @@ ixgbe_dev_stop(struct rte_eth_dev *dev)
> >>
> >>   	/* Clear stored conf */
> >>   	dev->data->scattered_rx = 0;
> >> +	dev->data->lro = 0;
> >>   	hw->rx_bulk_alloc_allowed = false;
> >>   	hw->rx_vec_allowed = false;
> >>
> >> @@ -2018,6 +2019,11 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> >>   		DEV_RX_OFFLOAD_IPV4_CKSUM |
> >>   		DEV_RX_OFFLOAD_UDP_CKSUM  |
> >>   		DEV_RX_OFFLOAD_TCP_CKSUM;
> >> +
> >> +	if (hw->mac.type == ixgbe_mac_82599EB ||
> >> +	    hw->mac.type == ixgbe_mac_X540)
> >> +		dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_TCP_LRO;
> >> +
> >>   	dev_info->tx_offload_capa =
> >>   		DEV_TX_OFFLOAD_VLAN_INSERT |
> >>   		DEV_TX_OFFLOAD_IPV4_CKSUM  |
> >> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> >> index a549f5c..e206584 100644
> >> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> >> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> >> @@ -349,6 +349,11 @@ uint16_t ixgbe_recv_pkts_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
> >>   uint16_t ixgbe_recv_scattered_pkts(void *rx_queue,
> >>   		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> >>
> >> +uint16_t ixgbe_recv_pkts_lro(void *rx_queue,
> >> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> >> +uint16_t ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue,
> >> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> >> +
> >>   uint16_t ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
> >>   		uint16_t nb_pkts);
> >>
> >> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> >> index 58e619b..944c662 100644
> >> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> >> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> >> @@ -1366,6 +1366,15 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
> >>   }
> >>
> >>   /**
> >> + * Detect an RSC descriptor.
> >> + */
> >> +static inline uint32_t ixgbe_rsc_count(union ixgbe_adv_rx_desc *rx)
> >> +{
> >> +	return (rte_le_to_cpu_32(rx->wb.lower.lo_dword.data) &
> >> +		IXGBE_RXDADV_RSCCNT_MASK) >> IXGBE_RXDADV_RSCCNT_SHIFT;
> >> +}
> >> +
> >> +/**
> >>    * Initialize the first mbuf of the returned packet:
> >>    *    - RX port identifier,
> >>    *    - hardware offload data, if any:
> >> @@ -1410,6 +1419,291 @@ static inline void ixgbe_fill_cluster_head_buf(
> >>   	}
> >>   }
> >>
> >> +/**
> >> + * Bulk receive handler for the LRO case.
> >> + *
> >> + * @rx_queue Rx queue handle
> >> + * @rx_pkts table of received packets
> >> + * @nb_pkts size of rx_pkts table
> >> + * @bulk_alloc if TRUE bulk allocation is used for a HW ring refilling
> >> + *
> >> + * Handles the Rx HW ring completions when RSC feature is configured. Uses an
> >> + * additional ring of igb_rsc_entry's that will hold the relevant RSC info.
> >> + *
> >> + * We use the same logic as in Linux and in FreeBSD ixgbe drivers:
> >> + * 1) When non-EOP RSC completion arrives:
> >> + *    a) Update the HEAD of the current RSC aggregation cluster with the new
> >> + *       segment's data length.
> >> + *    b) Set the "next" pointer of the current segment to point to the segment
> >> + *       at the NEXTP index.
> >> + *    c) Pass the HEAD of RSC aggregation cluster on to the next NEXTP entry
> >> + *       in the sw_rsc_ring.
> >> + * 2) When EOP arrives we just update the cluster's total length and offload
> >> + *    flags and deliver the cluster up to the upper layers. In our case - put it
> >> + *    in the rx_pkts table.
> >> + *
> >> + * Returns the number of received packets/clusters (according to the "bulk
> >> + * receive" interface).
> >> + */
> >> +static inline uint16_t
> >> +_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
> >> +	       bool bulk_alloc)
> >> +{
> >> +	struct igb_rx_queue *rxq = rx_queue;
> >> +	volatile union ixgbe_adv_rx_desc *rx_ring = rxq->rx_ring;
> >> +	struct igb_rx_entry *sw_ring = rxq->sw_ring;
> >> +	struct igb_rsc_entry *sw_rsc_ring = rxq->sw_rsc_ring;
> >> +	uint16_t rx_id = rxq->rx_tail;
> >> +	uint16_t nb_rx = 0;
> >> +	uint16_t nb_hold = rxq->nb_rx_hold;
> >> +	uint16_t prev_id = rxq->rx_tail;
> >> +
> >> +	while (nb_rx < nb_pkts) {
> >> +		bool eop;
> >> +		struct igb_rx_entry *rxe;
> >> +		struct igb_rsc_entry *rsc_entry;
> >> +		struct igb_rsc_entry *next_rsc_entry;
> >> +		struct igb_rx_entry *next_rxe;
> >> +		struct rte_mbuf *first_seg;
> >> +		struct rte_mbuf *rxm;
> >> +		struct rte_mbuf *nmb;
> >> +		union ixgbe_adv_rx_desc rxd;
> >> +		uint16_t data_len;
> >> +		uint16_t next_id;
> >> +		volatile union ixgbe_adv_rx_desc *rxdp;
> >> +		uint32_t staterr;
> >> +
> >> +next_desc:
> >> +		/*
> >> +		 * The code in this whole file uses the volatile pointer to
> >> +		 * ensure the read ordering of the status and the rest of the
> >> +		 * descriptor fields (on the compiler level only!!!). This is so
> >> +		 * UGLY - why not to just use the compiler barrier instead? DPDK
> >> +		 * even has the rte_compiler_barrier() for that.
> >> +		 *
> >> +		 * But most importantly this is just wrong because this doesn't
> >> +		 * ensure memory ordering in a general case at all. For
> >> +		 * instance, DPDK is supposed to work on Power CPUs where
> >> +		 * compiler barrier may just not be enough!
> >> +		 *
> >> +		 * I tried to write only this function properly to have a
> >> +		 * starting point (as a part of an LRO/RSC series) but the
> >> +		 * compiler cursed at me when I tried to cast away the
> >> +		 * "volatile" from rx_ring (yes, it's volatile too!!!). So, I'm
> >> +		 * keeping it the way it is for now.
> >> +		 *
> >> +		 * The code in this file is broken in so many other places and
> >> +		 * will just not work on a big endian CPU anyway therefore the
> >> +		 * lines below will have to be revisited together with the rest
> >> +		 * of the ixgbe PMD.
> >> +		 *
> >> +		 * TODO:
> >> +		 *    - Get rid of "volatile" crap and let the compiler do its
> >> +		 *      job.
> >> +		 *    - Use the proper memory barrier (rte_rmb()) to ensure the
> >> +		 *      memory ordering below.
> > Ok, so you wanted to put rte_rmb(), straight after:
> > staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> > correct?
> > I agree that for machines with relaxed memory model (PPC) we do need it here.
> > So why not just put it there, instead of complaining about it in comments? ;)
> 
> Because it's not a proper fix and I don't like workarounds.

Why not? For machines with a relaxed memory model you would need rmb() here no matter whether rxdp points to volatile memory or not.

> 
> >
> > About rxdp being a pointer to volatile, why does it bother you that much?
> 
> Because using "volatile" prevents the compiler from optimizing every
> piece of code where the "volatile" variable participates, and that's a
> shame.
> Read this
> https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
> for a more detailed explanation.
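
As a tiny illustration of the cost (a sketch, not taken from the driver): 
with the qualifier in place the compiler has to emit two separate loads 
below, even though nothing intervenes between them:

	#include <stdint.h>

	struct desc {
		volatile uint32_t status;
		uint32_t len;
	};

	uint32_t read_status_twice(const struct desc *d)
	{
		/* Two loads of d->status are emitted; without "volatile"
		 * they could be merged into one. */
		return d->status + d->status;
	}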
> 
> > You copy the whole RXD to the local variable anyway, and then reference it only to setup new addresses.
> 
> The fact that we have to copy the whole descriptor while we may not need
> all the data from it at the end is one problem.

I understand that, but I don't think the difference would be that critical.
Though I don't have any data in hand to compare.

> The proper solution in Rx ring context should go as follows:
> 
>  1. Remove the "volatile" qualifier from rx_ring (HW Rx descriptors ring).
>  2. Remove "volatile" at all places where rx_ring is accessed.
>  3. Adjust the code in (2):
>      1. Remove the descriptor copy u've mentioned and access the
>         descriptor data directly.
>      2. Ensure the proper ordering by using the proper memory barriers,
>         which are missing in the DPDK SDK at the moment (see a small
>         discussion about this with Stephen and Avi on "[dpdk-dev]
>         [PATCH v1 5/5] ixgbe: Add LRO support" thread).

I think you are mixing 2 different issues here:

1.  For architectures with relaxed memory model we do need rmb() after that line:
staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
We do need it *always*, regardless of whether rx_ring is volatile or not.
If we really plan to support PPC and other architectures that allow read reordering  - 
not having an 'rmb()' or similar sync primitive here is a bug.
Same thing applies to 'wmb()' before updating RDT. 

2. volatile rx_ring vs non-volatile with explicit memory ordering intrinsics.
Actually I think that using volatile rx_ring is not a real bug in itself.
Code with volatile rx_ring and fix for #1 in place would work correctly on all architectures.
It might be slower than non-volatile approach, but nothing would be broken.

About the existing RX/TX functions and PPC support:
Note that all of them were created before PPC support for DPDK was introduced.
At that moment only IA was supported.
That's why in some places where you would expect to see 'mb()' there are 'volatile' and/or ' rte_compiler_barrier' instead. 
Why all those places weren't updated when PPC support was added - that's another question.
From my understanding - with the current implementation some of the DPDK PMDs' RX/TX functions and rte_ring wouldn't work correctly on PPC.
So, I suppose we need to decide for ourselves - do we really want to support PPC and other architectures with non-IA memory model or not?
If not, then I think we don't need any mb()s inside recv_pkts_lro() - just rte_compiler_barrier seems enough, and there's no point
complaining about it in comments.
If yes - then why introduce a new function with a known potential bug?
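
Referring back to point 1, a minimal sketch of the two barrier placements 
being discussed (only the barriers are new, everything else is as in the 
patch; needed only if weakly ordered architectures are to be supported):

	rxdp = &rx_ring[rx_id];
	staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);

	if (!(staterr & IXGBE_RXDADV_STAT_DD))
		break;

	/* Do not read the rest of the descriptor before DD is seen set;
	 * a compiler barrier is enough on IA, a real rmb() is needed on
	 * weakly ordered CPUs such as PPC. */
	rte_rmb();

	rxd = *rxdp;

and, before advancing the tail register:

	/* Make the descriptor/refill writes globally visible before the
	 * RDT update. */
	rte_wmb();
	IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, prev_id);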

> 
> As it sounds this is going to be a VERY sensitive patchset.
> That's why it should go separately from this patchwork (or from any
> other patchwork).

For that patch, I am not suggesting you change any other functions, just the one you are introducing.

> 
> >
> >> +		 */
> >> +		rxdp = &rx_ring[rx_id];
> >> +		staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> >> +
> >> +		if (!(staterr & IXGBE_RXDADV_STAT_DD))
> >> +			break;
> >> +
> >> +		rxd = *rxdp;
> >> +
> >> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> >> +				  "staterr=0x%x data_len=%u",
> >> +			   rxq->port_id, rxq->queue_id, rx_id, staterr,
> >> +			   rte_le_to_cpu_16(rxd.wb.upper.length));
> >> +
> >> +		if (!bulk_alloc) {
> >> +			nmb = rte_rxmbuf_alloc(rxq->mb_pool);
> >> +			if (nmb == NULL) {
> >> +				PMD_RX_LOG(DEBUG, "RX mbuf alloc failed "
> >> +						  "port_id=%u queue_id=%u",
> >> +					   rxq->port_id, rxq->queue_id);
> >> +
> >> +				rte_eth_devices[rxq->port_id].data->
> >> +							rx_mbuf_alloc_failed++;
> >> +				break;
> >> +			}
> >> +		} else if (nb_hold > rxq->rx_free_thresh) {
> >> +			uint16_t next_rdt = rxq->rx_free_trigger;
> >> +
> >> +			if (!ixgbe_rx_alloc_bufs(rxq, false)) {
> >> +				rte_wmb();
> >> +				IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr,
> >> +						    next_rdt);
> >> +				nb_hold -= rxq->rx_free_thresh;
> >> +			} else {
> >> +				PMD_RX_LOG(DEBUG, "RX bulk alloc failed "
> >> +						  "port_id=%u queue_id=%u",
> >> +					   rxq->port_id, rxq->queue_id);
> >> +
> >> +				rte_eth_devices[rxq->port_id].data->
> >> +							rx_mbuf_alloc_failed++;
> >> +				break;
> >> +			}
> >> +		}
> >> +
> >> +		nb_hold++;
> >> +		rxe = &sw_ring[rx_id];
> >> +		eop = staterr & IXGBE_RXDADV_STAT_EOP;
> >> +
> >> +		next_id = rx_id + 1;
> >> +		if (next_id == rxq->nb_rx_desc)
> >> +			next_id = 0;
> >> +
> >> +		/* Prefetch next mbuf while processing current one. */
> >> +		rte_ixgbe_prefetch(sw_ring[next_id].mbuf);
> >> +
> >> +		/*
> >> +		 * When next RX descriptor is on a cache-line boundary,
> >> +		 * prefetch the next 4 RX descriptors and the next 4 pointers
> >> +		 * to mbufs.
> >> +		 */
> >> +		if ((next_id & 0x3) == 0) {
> >> +			rte_ixgbe_prefetch(&rx_ring[next_id]);
> >> +			rte_ixgbe_prefetch(&sw_ring[next_id]);
> >> +		}
> >> +
> >> +		rxm = rxe->mbuf;
> >> +
> >> +		if (!bulk_alloc) {
> >> +			__le64 dma =
> >> +			  rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(nmb));
> >> +			/*
> >> +			 * Update RX descriptor with the physical address of the
> >> +			 * new data buffer of the new allocated mbuf.
> >> +			 */
> >> +			rxe->mbuf = nmb;
> >> +
> >> +			rxm->data_off = RTE_PKTMBUF_HEADROOM;
> >> +			rxdp->read.hdr_addr = dma;
> >> +			rxdp->read.pkt_addr = dma;
> >> +		}
> >> +		/*
> >> +		 * Set data length & data buffer address of mbuf.
> >> +		 */
> >> +		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
> >> +		rxm->data_len = data_len;
> >> +
> >> +		if (!eop) {
> >> +			uint16_t nextp_id;
> >> +			/*
> >> +			 * Get next descriptor index:
> >> +			 *  - For RSC it's in the NEXTP field.
> >> +			 *  - For a scattered packet - it's just a following
> >> +			 *    descriptor.
> >> +			 */
> >> +			if (ixgbe_rsc_count(&rxd))
> >> +				nextp_id =
> >> +					(staterr & IXGBE_RXDADV_NEXTP_MASK) >>
> >> +						       IXGBE_RXDADV_NEXTP_SHIFT;
> >> +			else
> >> +				nextp_id = next_id;
> >> +
> >> +			next_rsc_entry = &sw_rsc_ring[nextp_id];
> >> +			next_rxe = &sw_ring[nextp_id];
> >> +			rte_ixgbe_prefetch(next_rxe);
> >> +		}
> >> +
> >> +		rsc_entry = &sw_rsc_ring[rx_id];
> >> +		first_seg = rsc_entry->fbuf;
> >> +		rsc_entry->fbuf = NULL;
> >> +
> >> +		/*
> >> +		 * If this is the first buffer of the received packet,
> >> +		 * set the pointer to the first mbuf of the packet and
> >> +		 * initialize its context.
> >> +		 * Otherwise, update the total length and the number of segments
> >> +		 * of the current scattered packet, and update the pointer to
> >> +		 * the last mbuf of the current packet.
> >> +		 */
> >> +		if (first_seg == NULL) {
> >> +			first_seg = rxm;
> >> +			first_seg->pkt_len = data_len;
> >> +			first_seg->nb_segs = 1;
> >> +		} else {
> >> +			first_seg->pkt_len += data_len;
> >> +			first_seg->nb_segs++;
> >> +		}
> >> +
> >> +		prev_id = rx_id;
> >> +		rx_id = next_id;
> >> +
> >> +		/*
> >> +		 * If this is not the last buffer of the received packet, update
> >> +		 * the pointer to the first mbuf at the NEXTP entry in the
> >> +		 * sw_rsc_ring and continue to parse the RX ring.
> >> +		 */
> >> +		if (!eop) {
> >> +			rxm->next = next_rxe->mbuf;
> >> +			next_rsc_entry->fbuf = first_seg;
> >> +			goto next_desc;
> > So _recv_pkts_lro() can return with one of rxq->rsc_entry[i] != NULL, correct?
> > If so, then I think you need to add code to ixgbe_rx_queue_release_mbufs() that would go through
> > all rsc_entry[] to find entries whose fbuf is != NULL, call rte_pktmbuf_free() for them and reset them to NULL.
> >   To handle the case:
> > recv_pkts_lro(rxq, ...);
> > rte_eth_dev_stop();
> > rte_eth_dev_start();
> > recv_pkts_lro(rxq, ...);
> 
> Right. I've missed that part.
> 
> > BTW, that also means that you can't do:
> > rxm->next = next_rxe->mbuf;
> > above, and
> > rxm->next = NULL;
> > should be done before 'goto next_desc;' too
> 
> Your proposal will cost cycles in the fast path on account of saving
> cycles in the slow path: we'll have to add another pointer to the
> igb_rsc_entry to hold the last mbuf in the current cluster that we'll
> have to read and update for every new completed RSC descriptor.
> 
> The easier way would be to just reset the next-pointer of the last
> mbuf in the RSC cluster to NULL (according to nb_segs) before
> calling rte_pktmbuf_free() in ixgbe_rx_queue_release_mbufs().

Should work too, I think. 
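
For illustration, a rough sketch of that cleanup (field names follow the 
patch; the helper itself and the chain-cutting loop are assumptions based 
on the discussion above):

	static void
	ixgbe_free_rsc_clusters(struct igb_rx_queue *rxq)
	{
		unsigned i;

		if (rxq->sw_rsc_ring == NULL)
			return;

		for (i = 0; i < rxq->nb_rx_desc; i++) {
			struct rte_mbuf *m = rxq->sw_rsc_ring[i].fbuf;
			uint16_t seg, nseg;

			if (m == NULL)
				continue;

			/* Cut the chain after nb_segs segments so that the
			 * trailing "next" pointer, which still aims at an
			 * sw_ring mbuf, is not freed twice. */
			nseg = m->nb_segs;
			for (seg = 1; seg < nseg; seg++)
				m = m->next;
			m->next = NULL;

			rte_pktmbuf_free(rxq->sw_rsc_ring[i].fbuf);
			rxq->sw_rsc_ring[i].fbuf = NULL;
		}
	}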

> 
> >
> >> +		}
> >> +
> >> +		/*
> >> +		 * This is the last buffer of the received packet - return
> >> +		 * the current cluster to the user.
> >> +		 */
> >> +		rxm->next = NULL;
> >> +
> >> +		/* Initialize the first mbuf of the returned packet */
> >> +		ixgbe_fill_cluster_head_buf(first_seg, &rxd, rxq->port_id,
> >> +					    staterr);
> >> +
> >> +		/* Prefetch data of first segment, if configured to do so. */
> >> +		rte_packet_prefetch((char *)first_seg->buf_addr +
> >> +			first_seg->data_off);
> >> +
> >> +		/*
> >> +		 * Store the mbuf address into the next entry of the array
> >> +		 * of returned packets.
> >> +		 */
> >> +		rx_pkts[nb_rx++] = first_seg;
> >> +	}
> >> +
> >> +	/*
> >> +	 * Record index of the next RX descriptor to probe.
> >> +	 */
> >> +	rxq->rx_tail = rx_id;
> >> +
> >> +	/*
> >> +	 * If the number of free RX descriptors is greater than the RX free
> >> +	 * threshold of the queue, advance the Receive Descriptor Tail (RDT)
> >> +	 * register.
> >> +	 * Update the RDT with the value of the last processed RX descriptor
> >> +	 * minus 1, to guarantee that the RDT register is never equal to the
> >> +	 * RDH register, which creates a "full" ring situation from the
> >> +	 * hardware point of view...
> >> +	 */
> >> +	if (!bulk_alloc && nb_hold > rxq->rx_free_thresh) {
> >> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_tail=%u "
> >> +			   "nb_hold=%u nb_rx=%u",
> >> +			   rxq->port_id, rxq->queue_id, rx_id, nb_hold, nb_rx);
> >> +
> > I suppose if you do wmb() after rte_rxmbuf_alloc(), you'd better do it here too.
> 
> Right! Missed that when I copied this code from
> ixgbe_recv_scattered_pkts()... ;) Note that the barrier is missing there
> too...
> These are examples of code that works on x86 only because of
> that "volatile" thing and will break once it's removed. On PPC it is
> broken even with "volatile".

Yep, as I said above - for IA we don't need mb() here - using 'volatile' or a compiler barrier seems enough to me.
For PPC - I think we do.

> 
> >
> >> +		IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, prev_id);
> >> +		nb_hold = 0;
> >> +	}
> >> +
> >> +	rxq->nb_rx_hold = nb_hold;
> >> +	return nb_rx;
> >> +}
> >> +
> >> +uint16_t
> >> +ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
> >> +{
> >> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, false);
> >> +}
> >> +
> >> +uint16_t
> >> +ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
> >> +			       uint16_t nb_pkts)
> >> +{
> >> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, true);
> >> +}
> >> +
> >>   uint16_t
> >>   ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
> >>   			  uint16_t nb_pkts)
> >> @@ -2024,6 +2318,7 @@ ixgbe_rx_queue_release(struct igb_rx_queue *rxq)
> >>   	if (rxq != NULL) {
> >>   		ixgbe_rx_queue_release_mbufs(rxq);
> >>   		rte_free(rxq->sw_ring);
> >> +		rte_free(rxq->sw_rsc_ring);
> >>   		rte_free(rxq);
> >>   	}
> >>   }
> >> @@ -2146,6 +2441,7 @@ ixgbe_reset_rx_queue(struct ixgbe_hw *hw, struct igb_rx_queue *rxq)
> >>   	rxq->nb_rx_hold = 0;
> >>   	rxq->pkt_first_seg = NULL;
> >>   	rxq->pkt_last_seg = NULL;
> >> +	rxq->rsc_en = 0;
> >>   }
> >>
> >>   int
> >> @@ -2160,6 +2456,14 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
> >>   	struct igb_rx_queue *rxq;
> >>   	struct ixgbe_hw     *hw;
> >>   	uint16_t len;
> >> +	struct rte_eth_dev_info dev_info = { 0 };
> >> +	struct rte_eth_rxmode *dev_rx_mode = &dev->data->dev_conf.rxmode;
> >> +	bool rsc_requested = false;
> >> +
> >> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
> >> +	if ((dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO) &&
> >> +	    dev_rx_mode->enable_lro)
> >> +		rsc_requested = true;
> >>
> >>   	PMD_INIT_FUNC_TRACE();
> >>   	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> >> @@ -2265,12 +2569,28 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
> >>   	rxq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
> >>   					  sizeof(struct igb_rx_entry) * len,
> >>   					  RTE_CACHE_LINE_SIZE, socket_id);
> >> -	if (rxq->sw_ring == NULL) {
> >> +	if (!rxq->sw_ring) {
> > Wonder what was wrong with that one? :)
> 
> Nothing - just aligned it with the lines I've added below. ;)
> 
> >
> >>   		ixgbe_rx_queue_release(rxq);
> >>   		return (-ENOMEM);
> >>   	}
> >> -	PMD_INIT_LOG(DEBUG, "sw_ring=%p hw_ring=%p dma_addr=0x%"PRIx64,
> >> -		     rxq->sw_ring, rxq->rx_ring, rxq->rx_ring_phys_addr);
> >> +
> >> +	if (rsc_requested) {
> >> +		rxq->sw_rsc_ring =
> >> +			rte_zmalloc_socket("rxq->sw_rsc_ring",
> >> +					   sizeof(struct igb_rsc_entry) * len,
> >> +					   RTE_CACHE_LINE_SIZE, socket_id);
> >> +		if (!rxq->sw_rsc_ring) {
> >> +			ixgbe_rx_queue_release(rxq);
> >> +			return (-ENOMEM);
> >> +		}
> >> +	} else {
> >> +		rxq->sw_rsc_ring = NULL;
> >> +	}
> >> +
> >> +	PMD_INIT_LOG(DEBUG, "sw_ring=%p sw_rsc_ring=%p hw_ring=%p "
> >> +			    "dma_addr=0x%"PRIx64,
> >> +		     rxq->sw_ring, rxq->sw_rsc_ring, rxq->rx_ring,
> >> +		     rxq->rx_ring_phys_addr);
> >>
> >>   	if (!rte_is_power_of_2(nb_desc)) {
> >>   		PMD_INIT_LOG(DEBUG, "queue[%d] doesn't meet Vector Rx "
> >> @@ -3515,6 +3835,84 @@ ixgbe_dev_mq_tx_configure(struct rte_eth_dev *dev)
> >>   	return 0;
> >>   }
> >>
> >> +/**
> >> + * get_rscctl_maxdesc - Calculate the RSCCTL[n].MAXDESC for PF
> >> + *
> >> + * Return the RSCCTL[n].MAXDESC for 82599 and x540 PF devices according to the
> >> + * spec rev. 3.0 chapter 8.2.3.8.13.
> >> + *
> >> + * @pool Memory pool of the Rx queue
> >> + */
> >> +static inline uint32_t get_rscctl_maxdesc(struct rte_mempool *pool)
> >> +{
> >> +	struct rte_pktmbuf_pool_private *mp_priv = rte_mempool_get_priv(pool);
> >> +
> >> +	/* MAXDESC * SRRCTL.BSIZEPKT must not exceed 64 KB minus one */
> >> +	uint16_t maxdesc =
> >> +		65535 / (mp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM);
> > A  nit: use some macro (UINt16_MAX?) instead of hardcoded constant if possible.
> 
> Using UINT16_MAX here would be very confusing. The value here just like
> values below (16, 8, 4) are values that are explicitly stated in the
> RSCCTL[n].MAXDESC description in the spec and this code piece is
> implementing what spec is demanding. Therefore IMHO using the
> explicit values from the spec here is the most readable way considering
> the reader that will try to compare this code to the spec section
> mentioned above and check that the code is correct.

Ok, if you think UINT16_MAX is confusing, then just add a new one: IXGBE_RSC_MAX_PACKET_SIZE or something.
As I understand, that's sort of upper limit for the RSC packet size supported, right?

> 
> 
> >
> >> +
> >> +	if (maxdesc >= 16)
> >> +		return IXGBE_RSCCTL_MAXDESC_16;
> >> +	else if (maxdesc >= 8)
> >> +		return IXGBE_RSCCTL_MAXDESC_8;
> >> +	else if (maxdesc >= 4)
> >> +		return IXGBE_RSCCTL_MAXDESC_4;
> >> +	else
> >> +		return IXGBE_RSCCTL_MAXDESC_1;
> >> +}
> >> +
> >> +/* (Taken from FreeBSD tree)
> >> +** Setup the correct IVAR register for a particular MSIX interrupt
> >> +**   (yes this is all very magic and confusing :)
> >> +**  - entry is the register array entry
> >> +**  - vector is the MSIX vector for this queue
> >> +**  - type is RX/TX/MISC
> >> +*/
> >> +static void
> >> +ixgbe_set_ivar(struct rte_eth_dev *dev, u8 entry, u8 vector, s8 type)
> >> +{
> >> +	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> >> +	u32 ivar, index;
> >> +
> >> +	vector |= IXGBE_IVAR_ALLOC_VAL;
> >> +
> >> +	switch (hw->mac.type) {
> >> +
> >> +	case ixgbe_mac_82598EB:
> >> +		if (type == -1)
> >> +			entry = IXGBE_IVAR_OTHER_CAUSES_INDEX;
> >> +		else
> >> +			entry += (type * 64);
> >> +		index = (entry >> 2) & 0x1F;
> >> +		ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(index));
> >> +		ivar &= ~(0xFF << (8 * (entry & 0x3)));
> >> +		ivar |= (vector << (8 * (entry & 0x3)));
> >> +		IXGBE_WRITE_REG(hw, IXGBE_IVAR(index), ivar);
> >> +		break;
> >> +
> >> +	case ixgbe_mac_82599EB:
> >> +	case ixgbe_mac_X540:
> >> +		if (type == -1) { /* MISC IVAR */
> >> +			index = (entry & 1) * 8;
> >> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR_MISC);
> >> +			ivar &= ~(0xFF << index);
> >> +			ivar |= (vector << index);
> >> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR_MISC, ivar);
> >> +		} else {	/* RX/TX IVARS */
> >> +			index = (16 * (entry & 1)) + (8 * type);
> >> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(entry >> 1));
> >> +			ivar &= ~(0xFF << index);
> >> +			ivar |= (vector << index);
> >> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR(entry >> 1), ivar);
> >> +		}
> >> +
> >> +		break;
> >> +
> >> +	default:
> >> +		break;
> >> +	}
> >> +}
> >> +
> >>   void set_rx_function(struct rte_eth_dev *dev)
> >>   {
> >>   	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> >> @@ -3565,6 +3963,25 @@ void set_rx_function(struct rte_eth_dev *dev)
> >>   			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
> >>   		}
> >>   	}
> >> +
> >> +	/*
> >> +	 * Initialize the appropriate LRO callback.
> >> +	 *
> >> +	 * If all queues satisfy the bulk allocation preconditions
> >> +	 * (hw->rx_bulk_alloc_allowed is TRUE) then we may use bulk allocation.
> >> +	 * Otherwise use a single allocation version.
> >> +	 */
> >> +	if (dev->data->lro) {
> >> +		if (hw->rx_bulk_alloc_allowed) {
> >> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a bulk "
> >> +					   "allocation version");
> >> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro_bulk_alloc;
> >> +		} else {
> >> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a single "
> >> +					   "allocation version");
> >> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro;
> >> +		}
> >> +	}
> >>   }
> > As I understand, ixgbe_recv_pkts_lro() can handle both LRO and normal scattered packets?
> 
> Not as it is now. It may be easily patched to do so though.
> 
> > If that's so, then can we remove ixgbe_recv_scattered_pkts() at all and use ixgbe_recv_pkts_lro() for both cases?
> 
> This was explicitly requested from me by Bruce Richardson (see
> "[dpdk-dev] : ixgbe: why bulk allocation is not used for a scattered Rx
> flow?" thread) to separate the complicated handling from the simple high
> performance one. The handling in the RSC routine is more generic and
> thus is a bit of overkill for the simple scattered case: e.g. there is
> no need for a sw_rsc_ring.

I think Bruce meant ixgbe_recv_pkts_bulk_alloc() not ixgbe_recv_scattered_pkts()
when he told about simple and high performance RX path.

> Therefore I preferred to advance with small steps here. And if there
> will be a decision to join these flows - it may be done with a rather
> small patch in the future.

Ok, that's understandable and I wouldn't insist to do that in the same patch.
It just worries me that the number of our ixgbe RX functions keeps increasing.  

> 
> >>   /*
> >> @@ -3583,10 +4000,26 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >>   	uint32_t maxfrs;
> >>   	uint32_t srrctl;
> >>   	uint32_t rdrxctl;
> >> +	uint32_t rscctl;
> >> +	uint32_t psrtype;
> >> +	uint32_t rfctl;
> >>   	uint32_t rxcsum;
> >>   	uint16_t buf_size;
> >>   	uint16_t i;
> >>   	struct rte_eth_rxmode *rx_conf = &dev->data->dev_conf.rxmode;
> >> +	struct rte_eth_dev_info dev_info = { 0 };
> >> +	bool rsc_capable = false;
> >> +
> >> +	/* Sanity check */
> >> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
> >> +	if (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO)
> >> +		rsc_capable = true;
> > @ 7.11.1 82599 spec says:
> > " Note that in SR-IOV mode the RSC must be disabled globally by setting the RFCTL.RSC_DIS bit."
> > Add a check?
> 
> Good catch! Will add a check. Thanks.
> 
> >
> >> +
> >> +	if (!rsc_capable && rx_conf->enable_lro) {
> >> +		PMD_INIT_LOG(CRIT, "LRO is requested on HW that doesn't "
> >> +				   "support it");
> >> +		return -EINVAL;
> >> +	}
> >>
> >>   	PMD_INIT_FUNC_TRACE();
> >>   	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> >> @@ -3606,13 +4039,44 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >>   	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
> >>
> >>   	/*
> >> +	 * RFCTL configuration
> >> +	 *
> >> +	 * Since NFS packets coalescing is not supported - clear RFCTL.NFSW_DIS
> >> +	 * and RFCTL.NFSR_DIS when RSC is enabled.
> >> +	 */
> >> +	if (rsc_capable) {
> >> +		rfctl = IXGBE_READ_REG(hw, IXGBE_RFCTL);
> >> +		if (rx_conf->enable_lro) {
> >> +			rfctl &= ~(IXGBE_RFCTL_RSC_DIS | IXGBE_RFCTL_NFSW_DIS |
> >> +				   IXGBE_RFCTL_NFSR_DIS);
> >> +		} else {
> >> +			rfctl |= IXGBE_RFCTL_RSC_DIS;
> >> +		}
> >> +
> >> +		IXGBE_WRITE_REG(hw, IXGBE_RFCTL, rfctl);
> >> +	}
> >> +
> >> +
> >> +	/*
> >>   	 * Configure CRC stripping, if any.
> >>   	 */
> >>   	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
> >>   	if (rx_conf->hw_strip_crc)
> >>   		hlreg0 |= IXGBE_HLREG0_RXCRCSTRP;
> >> -	else
> >> +	else {
> >>   		hlreg0 &= ~IXGBE_HLREG0_RXCRCSTRP;
> >> +		if (rx_conf->enable_lro) {
> >> +			/*
> >> +			 * According to chapter 4.6.7.2.1 of the Spec Rev.
> >> +			 * 3.0 RSC configuration requires HW CRC stripping being
> >> +			 * enabled. If user requested both HW CRC stripping off
> >> +			 * and RSC on - return an error.
> >> +			 */
> >> +			PMD_INIT_LOG(CRIT, "LRO can't be enabled when HW CRC "
> >> +					    "is disabled");
> >> +			return -EINVAL;
> >> +		}
> >> +	}
> >>
> >>   	/*
> >>   	 * Configure jumbo frame support, if any.
> >> @@ -3664,9 +4128,18 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >>   		 * Configure Header Split
> >>   		 */
> >>   		if (rx_conf->header_split) {
> >> +			/*
> >> +			 * Print a warning if split_hdr_size is less
> >> +			 * than 128 bytes when RSC is requested.
> >> +			 */
> >> +			if (rx_conf->enable_lro &&
> >> +			    rx_conf->split_hdr_size < 128)
> >> +				PMD_INIT_LOG(INFO, "split_hdr_size less than "
> >> +						   "128 bytes (%d)!",
> >> +					     rx_conf->split_hdr_size);
> >> +
> >>   			if (hw->mac.type == ixgbe_mac_82599EB) {
> >>   				/* Must setup the PSRTYPE register */
> >> -				uint32_t psrtype;
> >>   				psrtype = IXGBE_PSRTYPE_TCPHDR |
> >>   					IXGBE_PSRTYPE_UDPHDR   |
> >>   					IXGBE_PSRTYPE_IPV4HDR  |
> >> @@ -3679,7 +4152,20 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >>   			srrctl |= IXGBE_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
> >>   		} else
> >>   #endif
> >> +		{
> >>   			srrctl = IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF;
> >> +			/*
> >> +			 * Following the 4.6.7.2.1 chapter of the 82599/x540
> >> +			 * Spec if RSC is enabled the SRRCTL[n].BSIZEHEADER
> >> +			 * should be configured even if header split is not
> >> +			 * enabled. In the latter case we will configure it 128
> >> +			 * bytes following the recommendation in the spec.
> >> +			 */
> >> +			if (rx_conf->enable_lro)
> >> +				srrctl |=
> >> +				     ((128 << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &
> >> +						    IXGBE_SRRCTL_BSIZEHDR_MASK);
> >> +		}
> >>
> >>   		/* Set if packets are dropped when no descriptors available */
> >>   		if (rxq->drop_en)
> >> @@ -3696,6 +4182,13 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >>   				       RTE_PKTMBUF_HEADROOM);
> >>   		srrctl |= ((buf_size >> IXGBE_SRRCTL_BSIZEPKT_SHIFT) &
> >>   			   IXGBE_SRRCTL_BSIZEPKT_MASK);
> >> +
> >> +		/*
> >> +		 * TODO: Consider setting the Receive Descriptor Minimum
> >> +		 * Threshold Size for an RSC case. This is not an obviously
> >> +		 * beneficial option but one worth considering...
> >> +		 */
> >> +
> >>   		IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(rxq->reg_idx), srrctl);
> >>
> >>   		buf_size = (uint16_t) ((srrctl & IXGBE_SRRCTL_BSIZEPKT_MASK) <<
> >> @@ -3705,11 +4198,57 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >>   		if (dev->data->dev_conf.rxmode.max_rx_pkt_len +
> >>   					    2 * IXGBE_VLAN_TAG_SIZE > buf_size)
> >>   			dev->data->scattered_rx = 1;
> >> +
> >> +		/* RSC per-queue configuration */
> >> +		if (rx_conf->enable_lro) {
> >> +			uint32_t eitr;
> >> +
> >> +			rscctl =
> >> +				IXGBE_READ_REG(hw, IXGBE_RSCCTL(rxq->reg_idx));
> >> +			psrtype =
> >> +				IXGBE_READ_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx));
> >> +			eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
> >> +
> >> +			rscctl |= IXGBE_RSCCTL_RSCEN;
> >> +			rscctl |= get_rscctl_maxdesc(rxq->mb_pool);
> >> +			psrtype |= IXGBE_PSRTYPE_TCPHDR;
> >> +
> >> +			/*
> >> +			 * RSC: Set ITR interval corresponding to 2K ints/s.
> >> +			 *
> >> +			 * Full-sized RSC aggregations for a 10Gb/s link will
> >> +			 * arrive at about 20K aggregation/s rate.
> >> +			 *
> >> +			 * 2K ints/s rate will make only 10% of the
> >> +			 * aggregations to be closed due to the interrupt timer
> >> +			 * expiration for a streaming at wire-speed case.
> >> +			 *
> >> +			 * For a sparse streaming case this setting will yield
> >> +			 * at most 500us latency for a single RSC aggregation.
> >> +			 */
> >> +			eitr   |= (2000 | IXGBE_EITR_CNT_WDIS);
> > Again probably create some macro for ITR Interval default value here.
> 
> Well, again - it's the only place where it's used and I've extensively
> explained it in the comments in the code. Therefore I think it's the
> most readable way to write this.
> If it would be used in at least two places - then I would have put it in
> a macro...

I think it is a good practice to use macros instead of raw numbers in such places.
You probably can make these macros self-explanatory:
/* EITR Interval in 2us units for 1G and 10G. */
#define IXGBE_EITR_INTERVAL_US	2 

#define IXGBE_EITR_INTERVAL_SHIFT	3

#define IXGBE_EITR_INTERVAL(us)	((us) / IXGBE_EITR_INTERVAL_US << IXGBE_EITR_INTERVAL_SHIFT)

/* at most 500us latency for a single RSC aggregation */
#define IXGBE_EITR_INTERVAL_DEFAULT  IXGBE_EITR_INTERVAL(500)
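
With those in place the call site could shrink to something like this
(just a sketch; IXGBE_EITR_CNT_WDIS is the existing define your patch
already uses, the INTERVAL macros are the ones proposed above):

static inline uint32_t ixgbe_rsc_eitr(uint32_t eitr)
{
	/* (500 / 2) << 3 == 2000 - the same raw value the patch writes
	 * today, so the behaviour wouldn't change, only the readability. */
	return eitr | IXGBE_EITR_INTERVAL_DEFAULT | IXGBE_EITR_CNT_WDIS;
}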

> 
> >
> >> +
> >> +			IXGBE_WRITE_REG(hw, IXGBE_RSCCTL(rxq->reg_idx), rscctl);
> >> +			IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx),
> >> +								       psrtype);
> >> +			IXGBE_WRITE_REG(hw, IXGBE_EITR(rxq->reg_idx), eitr);
> >> +
> >> +			/*
> >> +			 * RSC requires the mapping of the queue to the
> >> +			 * interrupt vector.
> >> +			 */
> >> +			ixgbe_set_ivar(dev, rxq->reg_idx, i, 0);
> > Hm, wonder why do we need to setup IVAR for RSC?
> > Wouldn't just setting EITR be enough?
> 
> Nope. See 82599 spec chapter 4.6.7.2.2. 

I read it, though it doesn't say 'IVAR must be set up' like it does for EITR.Interval.
That made me think that it might be optional.

> I think I even tried not to map
> the queues to IVAR and it didn't work... ;)

Pity, but not much we can do in that case, I suppose. 

> 
> >
> >> +
> >> +			rxq->rsc_en = 1;
> >> +		}
> >>   	}
> >>
> >>   	if (rx_conf->enable_scatter)
> >>   		dev->data->scattered_rx = 1;
> >>
> >> +	if (rx_conf->enable_lro)
> >> +		dev->data->lro = 1;
> >> +
> >>   	set_rx_function(dev);
> >>
> >>   	/*
> >> @@ -3742,6 +4281,19 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >>   		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
> >>   	}
> >>
> >> +	/* Finalize RSC configuration  */
> >> +	if (rx_conf->enable_lro) {
> >> +		/*
> >> +		 * Follow the instructions in the 4.6.7.2.1 of the Spec Rev. 3.0
> >> +		 */
> >> +		rdrxctl = IXGBE_READ_REG(hw, IXGBE_RDRXCTL);
> >> +		rdrxctl |= IXGBE_RDRXCTL_RSCACKC;
> >> +		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
> >> +
> >> +		PMD_INIT_LOG(INFO, "enabling LRO mode");
> >> +	}
> >> +
> >> +
> >>   	return 0;
> >>   }
> >>
> >> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> >> index bbe5ff3..389173f 100644
> >> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> >> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> >> @@ -79,6 +79,10 @@ struct igb_rx_entry {
> >>   	struct rte_mbuf *mbuf; /**< mbuf associated with RX descriptor. */
> >>   };
> >>
> >> +struct igb_rsc_entry {
> >> +	struct rte_mbuf *fbuf; /**< First segment of the fragmented packet. */
> >> +};
> >> +
> >>   /**
> >>    * Structure associated with each descriptor of the TX ring of a TX queue.
> >>    */
> >> @@ -105,6 +109,7 @@ struct igb_rx_queue {
> >>   	volatile uint32_t   *rdt_reg_addr; /**< RDT register address. */
> >>   	volatile uint32_t   *rdh_reg_addr; /**< RDH register address. */
> >>   	struct igb_rx_entry *sw_ring; /**< address of RX software ring. */
> >> +	struct igb_rsc_entry *sw_rsc_ring; /**< address of RSC software ring. */
> >>   	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet. */
> >>   	struct rte_mbuf *pkt_last_seg; /**< Last segment of current packet. */
> >>   	uint64_t            mbuf_initializer; /**< value to init mbufs */
> >> @@ -126,6 +131,7 @@ struct igb_rx_queue {
> >>   	uint8_t             port_id;  /**< Device port identifier. */
> >>   	uint8_t             crc_len;  /**< 0 if CRC stripped, 4 otherwise. */
> >>   	uint8_t             drop_en;  /**< If not 0, set SRRCTL.Drop_En. */
> >> +	uint8_t             rsc_en;   /**< If not 0, RSC is enabled. */
> >>   	uint8_t             rx_deferred_start; /**< not in global dev start. */
> >>   #ifdef RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC
> >>   	/** need to alloc dummy mbuf, for wraparound when scanning hw ring */
> >> --
> >> 2.1.0

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
  2015-03-10 20:09       ` Ananyev, Konstantin
@ 2015-03-10 21:36         ` Vlad Zolotarov
  2015-03-11 16:32           ` Ananyev, Konstantin
  0 siblings, 1 reply; 18+ messages in thread
From: Vlad Zolotarov @ 2015-03-10 21:36 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev



On 03/10/15 22:09, Ananyev, Konstantin wrote:
>>> Hi Vlad,
>>>
>>>> -----Original Message-----
>>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Vlad Zolotarov
>>>> Sent: Monday, March 09, 2015 7:07 PM
>>>> To: dev at dpdk.org
>>>> Subject: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
>>>>
>>>>       - Only x540 and 82599 devices support LRO.
>>>>       - Add the appropriate HW configuration.
>>>>       - Add RSC aware rx_pkt_burst() handlers:
>>>>          - Implemented bulk allocation and non-bulk allocation versions.
>>>>          - Add LRO-specific fields to rte_eth_rxmode, to rte_eth_dev_data
>>>>            and to igb_rx_queue.
>>>>          - Use the appropriate handler when LRO is requested.
>>>>
>>>> Signed-off-by: Vlad Zolotarov <vladz at cloudius-systems.com>
>>>> ---
>>>> New in v5:
>>>>      - Put the RTE_ETHDEV_HAS_LRO_SUPPORT definition at the beginning of rte_ethdev.h.
>>>>      - Removed the "TODO: Remove me" comment near RTE_ETHDEV_HAS_LRO_SUPPORT.
>>>>
>>>> New in v4:
>>>>      - Define RTE_ETHDEV_HAS_LRO_SUPPORT in rte_ethdev.h instead of
>>>>        RTE_ETHDEV_LRO_SUPPORT defined in config/common_linuxapp.
>>>>
>>>> New in v2:
>>>>      - Removed rte_eth_dev_data.lro_bulk_alloc.
>>>>      - Fixed a few styling and spelling issues.
>>>> ---
>>>>    lib/librte_ether/rte_ethdev.h       |   9 +-
>>>>    lib/librte_pmd_ixgbe/ixgbe_ethdev.c |   6 +
>>>>    lib/librte_pmd_ixgbe/ixgbe_ethdev.h |   5 +
>>>>    lib/librte_pmd_ixgbe/ixgbe_rxtx.c   | 562 +++++++++++++++++++++++++++++++++++-
>>>>    lib/librte_pmd_ixgbe/ixgbe_rxtx.h   |   6 +
>>>>    5 files changed, 581 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
>>>> index 8db3127..44f081f 100644
>>>> --- a/lib/librte_ether/rte_ethdev.h
>>>> +++ b/lib/librte_ether/rte_ethdev.h
>>>> @@ -172,6 +172,9 @@ extern "C" {
>>>>
>>>>    #include <stdint.h>
>>>>
>>>> +/* Use this macro to check if LRO API is supported */
>>>> +#define RTE_ETHDEV_HAS_LRO_SUPPORT
>>>> +
>>>>    #include <rte_log.h>
>>>>    #include <rte_interrupts.h>
>>>>    #include <rte_pci.h>
>>>> @@ -320,14 +323,15 @@ struct rte_eth_rxmode {
>>>>    	enum rte_eth_rx_mq_mode mq_mode;
>>>>    	uint32_t max_rx_pkt_len;  /**< Only used if jumbo_frame enabled. */
>>>>    	uint16_t split_hdr_size;  /**< hdr buf size (header_split enabled).*/
>>>> -	uint8_t header_split : 1, /**< Header Split enable. */
>>>> +	uint16_t header_split : 1, /**< Header Split enable. */
>>>>    		hw_ip_checksum   : 1, /**< IP/UDP/TCP checksum offload enable. */
>>>>    		hw_vlan_filter   : 1, /**< VLAN filter enable. */
>>>>    		hw_vlan_strip    : 1, /**< VLAN strip enable. */
>>>>    		hw_vlan_extend   : 1, /**< Extended VLAN enable. */
>>>>    		jumbo_frame      : 1, /**< Jumbo Frame Receipt enable. */
>>>>    		hw_strip_crc     : 1, /**< Enable CRC stripping by hardware. */
>>>> -		enable_scatter   : 1; /**< Enable scatter packets rx handler */
>>>> +		enable_scatter   : 1, /**< Enable scatter packets rx handler */
>>>> +		enable_lro       : 1; /**< Enable LRO */
>>>>    };
>>>>
>>>>    /**
>>>> @@ -1515,6 +1519,7 @@ struct rte_eth_dev_data {
>>>>    	uint8_t port_id;           /**< Device [external] port identifier. */
>>>>    	uint8_t promiscuous   : 1, /**< RX promiscuous mode ON(1) / OFF(0). */
>>>>    		scattered_rx : 1,  /**< RX of scattered packets is ON(1) / OFF(0) */
>>>> +		lro          : 1,  /**< RX LRO is ON(1) / OFF(0) */
>>>>    		all_multicast : 1, /**< RX all multicast mode ON(1) / OFF(0). */
>>>>    		dev_started : 1;   /**< Device state: STARTED(1) / STOPPED(0). */
>>>>    };
>>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>>>> index 9d3de1a..765174d 100644
>>>> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>>>> @@ -1648,6 +1648,7 @@ ixgbe_dev_stop(struct rte_eth_dev *dev)
>>>>
>>>>    	/* Clear stored conf */
>>>>    	dev->data->scattered_rx = 0;
>>>> +	dev->data->lro = 0;
>>>>    	hw->rx_bulk_alloc_allowed = false;
>>>>    	hw->rx_vec_allowed = false;
>>>>
>>>> @@ -2018,6 +2019,11 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
>>>>    		DEV_RX_OFFLOAD_IPV4_CKSUM |
>>>>    		DEV_RX_OFFLOAD_UDP_CKSUM  |
>>>>    		DEV_RX_OFFLOAD_TCP_CKSUM;
>>>> +
>>>> +	if (hw->mac.type == ixgbe_mac_82599EB ||
>>>> +	    hw->mac.type == ixgbe_mac_X540)
>>>> +		dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_TCP_LRO;
>>>> +
>>>>    	dev_info->tx_offload_capa =
>>>>    		DEV_TX_OFFLOAD_VLAN_INSERT |
>>>>    		DEV_TX_OFFLOAD_IPV4_CKSUM  |
>>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>>>> index a549f5c..e206584 100644
>>>> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>>>> @@ -349,6 +349,11 @@ uint16_t ixgbe_recv_pkts_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
>>>>    uint16_t ixgbe_recv_scattered_pkts(void *rx_queue,
>>>>    		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>>>>
>>>> +uint16_t ixgbe_recv_pkts_lro(void *rx_queue,
>>>> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>>>> +uint16_t ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue,
>>>> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>>>> +
>>>>    uint16_t ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>>>>    		uint16_t nb_pkts);
>>>>
>>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>>>> index 58e619b..944c662 100644
>>>> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>>>> @@ -1366,6 +1366,15 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>>>>    }
>>>>
>>>>    /**
>>>> + * Detect an RSC descriptor.
>>>> + */
>>>> +static inline uint32_t ixgbe_rsc_count(union ixgbe_adv_rx_desc *rx)
>>>> +{
>>>> +	return (rte_le_to_cpu_32(rx->wb.lower.lo_dword.data) &
>>>> +		IXGBE_RXDADV_RSCCNT_MASK) >> IXGBE_RXDADV_RSCCNT_SHIFT;
>>>> +}
>>>> +
>>>> +/**
>>>>     * Initialize the first mbuf of the returned packet:
>>>>     *    - RX port identifier,
>>>>     *    - hardware offload data, if any:
>>>> @@ -1410,6 +1419,291 @@ static inline void ixgbe_fill_cluster_head_buf(
>>>>    	}
>>>>    }
>>>>
>>>> +/**
>>>> + * Bulk receive handler for the LRO case.
>>>> + *
>>>> + * @rx_queue Rx queue handle
>>>> + * @rx_pkts table of received packets
>>>> + * @nb_pkts size of rx_pkts table
>>>> + * @bulk_alloc if TRUE bulk allocation is used for a HW ring refilling
>>>> + *
>>>> + * Handles the Rx HW ring completions when RSC feature is configured. Uses an
>>>> + * additional ring of igb_rsc_entry's that will hold the relevant RSC info.
>>>> + *
>>>> + * We use the same logic as in Linux and in FreeBSD ixgbe drivers:
>>>> + * 1) When non-EOP RSC completion arrives:
>>>> + *    a) Update the HEAD of the current RSC aggregation cluster with the new
>>>> + *       segment's data length.
>>>> + *    b) Set the "next" pointer of the current segment to point to the segment
>>>> + *       at the NEXTP index.
>>>> + *    c) Pass the HEAD of RSC aggregation cluster on to the next NEXTP entry
>>>> + *       in the sw_rsc_ring.
>>>> + * 2) When EOP arrives we just update the cluster's total length and offload
>>>> + *    flags and deliver the cluster up to the upper layers. In our case - put it
>>>> + *    in the rx_pkts table.
>>>> + *
>>>> + * Returns the number of received packets/clusters (according to the "bulk
>>>> + * receive" interface).
>>>> + */
>>>> +static inline uint16_t
>>>> +_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
>>>> +	       bool bulk_alloc)
>>>> +{
>>>> +	struct igb_rx_queue *rxq = rx_queue;
>>>> +	volatile union ixgbe_adv_rx_desc *rx_ring = rxq->rx_ring;
>>>> +	struct igb_rx_entry *sw_ring = rxq->sw_ring;
>>>> +	struct igb_rsc_entry *sw_rsc_ring = rxq->sw_rsc_ring;
>>>> +	uint16_t rx_id = rxq->rx_tail;
>>>> +	uint16_t nb_rx = 0;
>>>> +	uint16_t nb_hold = rxq->nb_rx_hold;
>>>> +	uint16_t prev_id = rxq->rx_tail;
>>>> +
>>>> +	while (nb_rx < nb_pkts) {
>>>> +		bool eop;
>>>> +		struct igb_rx_entry *rxe;
>>>> +		struct igb_rsc_entry *rsc_entry;
>>>> +		struct igb_rsc_entry *next_rsc_entry;
>>>> +		struct igb_rx_entry *next_rxe;
>>>> +		struct rte_mbuf *first_seg;
>>>> +		struct rte_mbuf *rxm;
>>>> +		struct rte_mbuf *nmb;
>>>> +		union ixgbe_adv_rx_desc rxd;
>>>> +		uint16_t data_len;
>>>> +		uint16_t next_id;
>>>> +		volatile union ixgbe_adv_rx_desc *rxdp;
>>>> +		uint32_t staterr;
>>>> +
>>>> +next_desc:
>>>> +		/*
>>>> +		 * The code in this whole file uses the volatile pointer to
>>>> +		 * ensure the read ordering of the status and the rest of the
>>>> +		 * descriptor fields (on the compiler level only!!!). This is so
>>>> +		 * UGLY - why not to just use the compiler barrier instead? DPDK
>>>> +		 * UGLY - why not just use the compiler barrier instead? DPDK
>>>> +		 *
>>>> +		 * But most importantly this is just wrong because this doesn't
>>>> +		 * ensure memory ordering in a general case at all. For
>>>> +		 * instance, DPDK is supposed to work on Power CPUs where
>>>> +		 * compiler barrier may just not be enough!
>>>> +		 *
>>>> +		 * I tried to write only this function properly to have a
>>>> +		 * starting point (as a part of an LRO/RSC series) but the
>>>> +		 * compiler cursed at me when I tried to cast away the
>>>> +		 * "volatile" from rx_ring (yes, it's volatile too!!!). So, I'm
>>>> +		 * keeping it the way it is for now.
>>>> +		 *
>>>> +		 * The code in this file is broken in so many other places and
>>>> +		 * will just not work on a big endian CPU anyway therefore the
>>>> +		 * lines below will have to be revisited together with the rest
>>>> +		 * of the ixgbe PMD.
>>>> +		 *
>>>> +		 * TODO:
>>>> +		 *    - Get rid of "volatile" crap and let the compiler do its
>>>> +		 *      job.
>>>> +		 *    - Use the proper memory barrier (rte_rmb()) to ensure the
>>>> +		 *      memory ordering below.
>>> Ok, so you wanted to put rte_rmb(), straight after:
>>> staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>> correct?
>>> I agree that for machines with relaxed memory model (PPC) we do need it here.
>>> So why not just put it there, instead of complaining about it in comments? ;)
>> Because it's not a proper fix and I don't like workarounds.
> Why not? For machines with a relaxed memory model you would need rmb() here no matter whether rxdp points to volatile memory or not.
>
>>> About rxdp being pointer to volatile, why it bothers you that much?
>> Because using "volatile" prevents the compiler from optimizing every
>> code piece where the "volatile" variable participates, and that's a
>> shame.
>> Read this
>> https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
>> for a more detailed explanation.
>>
>>> You copy the whole RXD to the local variable anyway, and then reference it only to setup new addresses.
>> The fact that we have to copy the whole descriptor while we may not need
>> all the data from it at the end is one problem.
> I understand that, but I don't think that the difference would be that critical.
> Though I don't have any data in hand to compare.
>
>> The proper solution in Rx ring context should go as follows:
>>
>>   1. Remove the "volatile" qualifier from rx_ring (HW Rx descriptors ring).
>>   2. Remove "volatile" at all places where rx_ring is accessed.
>>   3. Adjust the code in (2):
>>       1. Remove the descriptor copy u've mentioned and access the
>>          descriptor data directly.
>>       2. Ensure the proper ordering by using the proper memory barriers,
>>          which are missing in the DPDK SDK at the moment (see a small
>>          discussion about this with Stephen and Avi on "[dpdk-dev]
>>          [PATCH v1 5/5] ixgbe: Add LRO support" thread).
> I think you are mixing 2 different issues here:
>
> 1.  For architectures with relaxed memory model we do need rmb() after that line:
> staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> We do need it *always*, regardless of whether rx_ring is volatile or not.
> If we really plan to support PPC and other architectures that allow read reordering  -
> not having an 'rmb()' or similar sync primitive here is a bug.
> Same thing applies to 'wmb()' before updating RDT.
>
> 2. volatile rx_ring vs non-volatile with explicit memory ordering intrinsics.
> Actually I think that using a volatile rx_ring is not a real bug in itself.
> Code with volatile rx_ring and fix for #1 in place would work correctly on all architectures.
> It might be slower than non-volatile approach, but nothing would be broken.
>
> About the existing RX/TX functions and PPC support:
> Note that all of them were created before PPC support for DPDK was introduced.
> At that moment only IA was supported.
> That's why in some places where you would expect to see 'mb()' there are 'volatile' and/or 'rte_compiler_barrier' instead.
> Why all those places weren't updated when PPC support was added - that's another question.
> From my understanding - with the current implementation some of the DPDK PMD RX/TX functions and rte_ring wouldn't work correctly on PPC.
> So, I suppose we need to decide for ourselves - do we really want to support PPC and other architectures with a non-IA memory model or not?
> If not, then I think we don't need any mb()s inside recv_pkts_lro() - a rte_compiler_barrier seems enough, and there is no point complaining about
> it in comments.
> If yes - then why introduce a new function with a known potential bug?

In order to introduce a new function with the proper implementation, or 
to fix any other place with a similar weakness, I would need the proper 
tools: platform-dependent barrier macros similar to the Linux smp_Xmb() 
macros, which reduce to a compiler barrier where that is enough and to a 
real memory fence where it is not.

Unfortunately DPDK doesn't have such macros at the moment. That's why I 
put a big fat comment at the place that has to be fixed once they are 
introduced.
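
Just to make it concrete, something along the lines of the following is 
what I mean (purely an illustration - these names don't exist in DPDK 
today, only rte_rmb()/rte_wmb()/rte_compiler_barrier() do):

#include <rte_atomic.h>

/*
 * Hypothetical per-arch "SMP" barriers: on IA a compiler barrier is
 * enough to keep the descriptor reads/writes ordered, while on weakly
 * ordered CPUs (e.g. PPC) a real fence is required.
 */
#ifdef RTE_ARCH_X86_64
#define SMP_RMB() rte_compiler_barrier()
#define SMP_WMB() rte_compiler_barrier()
#else
#define SMP_RMB() rte_rmb()
#define SMP_WMB() rte_wmb()
#endif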

You are right, though, that the "volatile" thing is not a bug in itself, 
but it would be strange to keep it after barriers are properly placed. 
That's why I 
think these 2 changes should go together.

About the "decision" we have to make - I think it has been decided 
already, since PPC is one of the official DPDK targets. Therefore the only 
thing to decide here is when and who gets to fix these things. One thing 
is obvious - this patch is not the right place to do it. ;)

>
>> As it sounds this is going to be a VERY sensitive patchset.
>> That's why it should go separately from this patchwork (or from any
>> other patchwork).
> For that patch, I am not suggesting you change any other functions, just the one you are introducing.

I don't think that putting an lfence on x86 there is a good idea. As 
I've just explained above - once DPDK has proper platform-dependent 
rmb() macros I'll gladly revisit these lines. Frankly, the same could be 
said about the rte_wmb() before the RDT update, but it is much less 
harmful than an lfence so I didn't raise it... ;)
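
For the record, the placement I have in mind is roughly the following 
(sketched on a simplified stand-in descriptor rather than the real 
union ixgbe_adv_rx_desc, just to show where the barriers would go):

#include <stdint.h>
#include <rte_atomic.h>

struct rx_desc {
	uint32_t status;	/* carries the DD bit */
	uint64_t pkt_addr;	/* buffer address we write back */
};

#define DD_BIT 0x1

static inline int
poll_and_return_one(struct rx_desc *d, struct rx_desc *out,
		    volatile uint32_t *rdt_reg, uint32_t new_tail)
{
	uint32_t status = d->status;

	if (!(status & DD_BIT))
		return 0;

	rte_rmb();		/* don't read the payload fields before DD */
	*out = *d;

	d->pkt_addr = 0;	/* stand-in for the descriptor refill */

	rte_wmb();		/* buffer writes visible before the RDT bump */
	*rdt_reg = new_tail;
	return 1;
}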

>
>>>> +		 */
>>>> +		rxdp = &rx_ring[rx_id];
>>>> +		staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>>> +
>>>> +		if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>>> +			break;
>>>> +
>>>> +		rxd = *rxdp;
>>>> +
>>>> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
>>>> +				  "staterr=0x%x data_len=%u",
>>>> +			   rxq->port_id, rxq->queue_id, rx_id, staterr,
>>>> +			   rte_le_to_cpu_16(rxd.wb.upper.length));
>>>> +
>>>> +		if (!bulk_alloc) {
>>>> +			nmb = rte_rxmbuf_alloc(rxq->mb_pool);
>>>> +			if (nmb == NULL) {
>>>> +				PMD_RX_LOG(DEBUG, "RX mbuf alloc failed "
>>>> +						  "port_id=%u queue_id=%u",
>>>> +					   rxq->port_id, rxq->queue_id);
>>>> +
>>>> +				rte_eth_devices[rxq->port_id].data->
>>>> +							rx_mbuf_alloc_failed++;
>>>> +				break;
>>>> +			}
>>>> +		} else if (nb_hold > rxq->rx_free_thresh) {
>>>> +			uint16_t next_rdt = rxq->rx_free_trigger;
>>>> +
>>>> +			if (!ixgbe_rx_alloc_bufs(rxq, false)) {
>>>> +				rte_wmb();
>>>> +				IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr,
>>>> +						    next_rdt);
>>>> +				nb_hold -= rxq->rx_free_thresh;
>>>> +			} else {
>>>> +				PMD_RX_LOG(DEBUG, "RX bulk alloc failed "
>>>> +						  "port_id=%u queue_id=%u",
>>>> +					   rxq->port_id, rxq->queue_id);
>>>> +
>>>> +				rte_eth_devices[rxq->port_id].data->
>>>> +							rx_mbuf_alloc_failed++;
>>>> +				break;
>>>> +			}
>>>> +		}
>>>> +
>>>> +		nb_hold++;
>>>> +		rxe = &sw_ring[rx_id];
>>>> +		eop = staterr & IXGBE_RXDADV_STAT_EOP;
>>>> +
>>>> +		next_id = rx_id + 1;
>>>> +		if (next_id == rxq->nb_rx_desc)
>>>> +			next_id = 0;
>>>> +
>>>> +		/* Prefetch next mbuf while processing current one. */
>>>> +		rte_ixgbe_prefetch(sw_ring[next_id].mbuf);
>>>> +
>>>> +		/*
>>>> +		 * When next RX descriptor is on a cache-line boundary,
>>>> +		 * prefetch the next 4 RX descriptors and the next 4 pointers
>>>> +		 * to mbufs.
>>>> +		 */
>>>> +		if ((next_id & 0x3) == 0) {
>>>> +			rte_ixgbe_prefetch(&rx_ring[next_id]);
>>>> +			rte_ixgbe_prefetch(&sw_ring[next_id]);
>>>> +		}
>>>> +
>>>> +		rxm = rxe->mbuf;
>>>> +
>>>> +		if (!bulk_alloc) {
>>>> +			__le64 dma =
>>>> +			  rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(nmb));
>>>> +			/*
>>>> +			 * Update RX descriptor with the physical address of the
>>>> +			 * new data buffer of the new allocated mbuf.
>>>> +			 */
>>>> +			rxe->mbuf = nmb;
>>>> +
>>>> +			rxm->data_off = RTE_PKTMBUF_HEADROOM;
>>>> +			rxdp->read.hdr_addr = dma;
>>>> +			rxdp->read.pkt_addr = dma;
>>>> +		}
>>>> +		/*
>>>> +		 * Set data length & data buffer address of mbuf.
>>>> +		 */
>>>> +		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
>>>> +		rxm->data_len = data_len;
>>>> +
>>>> +		if (!eop) {
>>>> +			uint16_t nextp_id;
>>>> +			/*
>>>> +			 * Get next descriptor index:
>>>> +			 *  - For RSC it's in the NEXTP field.
>>>> +			 *  - For a scattered packet - it's just a following
>>>> +			 *    descriptor.
>>>> +			 */
>>>> +			if (ixgbe_rsc_count(&rxd))
>>>> +				nextp_id =
>>>> +					(staterr & IXGBE_RXDADV_NEXTP_MASK) >>
>>>> +						       IXGBE_RXDADV_NEXTP_SHIFT;
>>>> +			else
>>>> +				nextp_id = next_id;
>>>> +
>>>> +			next_rsc_entry = &sw_rsc_ring[nextp_id];
>>>> +			next_rxe = &sw_ring[nextp_id];
>>>> +			rte_ixgbe_prefetch(next_rxe);
>>>> +		}
>>>> +
>>>> +		rsc_entry = &sw_rsc_ring[rx_id];
>>>> +		first_seg = rsc_entry->fbuf;
>>>> +		rsc_entry->fbuf = NULL;
>>>> +
>>>> +		/*
>>>> +		 * If this is the first buffer of the received packet,
>>>> +		 * set the pointer to the first mbuf of the packet and
>>>> +		 * initialize its context.
>>>> +		 * Otherwise, update the total length and the number of segments
>>>> +		 * of the current scattered packet, and update the pointer to
>>>> +		 * the last mbuf of the current packet.
>>>> +		 */
>>>> +		if (first_seg == NULL) {
>>>> +			first_seg = rxm;
>>>> +			first_seg->pkt_len = data_len;
>>>> +			first_seg->nb_segs = 1;
>>>> +		} else {
>>>> +			first_seg->pkt_len += data_len;
>>>> +			first_seg->nb_segs++;
>>>> +		}
>>>> +
>>>> +		prev_id = rx_id;
>>>> +		rx_id = next_id;
>>>> +
>>>> +		/*
>>>> +		 * If this is not the last buffer of the received packet, update
>>>> +		 * the pointer to the first mbuf at the NEXTP entry in the
>>>> +		 * sw_rsc_ring and continue to parse the RX ring.
>>>> +		 */
>>>> +		if (!eop) {
>>>> +			rxm->next = next_rxe->mbuf;
>>>> +			next_rsc_entry->fbuf = first_seg;
>>>> +			goto next_desc;
>>> So _recv_pkts_lro() can return with one of rxq->rsc_entry[i] != NULL, correct?
>>> If so, then I think you need to add code at ixgbe_rx_queue_release_mbufs() that would go through
>>> all rsc_entry[] to find the ones whose fbuf is != NULL, call rte_pktmbuf_free() for them and reset them to NULL.
>>>    To handle the case:
>>> recv_pkts_lro(rxq, ...);
>>> rte_eth_dev_stop();
>>> rte_eth_dev_start();
>>> recv_pkts_lro(rxq, ...);
>> Right. I've missed that part.
>>
>>> BTW, that also means that you can't do:
>>> rxm->next = next_rxe->mbuf;
>>> above, and
>>> rxm->next = NULL;
>>> should be done before 'goto next_desc;' too
>> Your proposal will cost cycles in the fast path on account of saving
>> cycles in the slow path: we'll have to add another pointer to the
>> igb_rsc_entry to hold the last mbuf in the current cluster that we'll
>> have to read and update for every new completed RSC descriptor.
>>
>> The easier way would be to just reset the next-pointer of the last
>> descriptor in the RSC cluster to NULL (according to nb_segs) before
>> calling rte_pktmbuf_free() in ixgbe_rx_queue_release_mbufs().
> Should work too, I think.

The final solution is even nicer - see v7. And it works like a charm 
too... ;)
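
For reference, the kind of cleanup we were discussing looks roughly like 
this (a sketch of the approach from this thread - the code that actually 
went into v7 is shaped a bit differently):

#include <rte_mbuf.h>
#include "ixgbe_rxtx.h"	/* struct igb_rsc_entry from this series */

static void
free_pending_rsc_clusters(struct igb_rsc_entry *sw_rsc_ring,
			  uint16_t nb_rx_desc)
{
	uint16_t i, s;

	if (sw_rsc_ring == NULL)
		return;

	for (i = 0; i < nb_rx_desc; i++) {
		struct rte_mbuf *first = sw_rsc_ring[i].fbuf;
		struct rte_mbuf *seg = first;

		if (first == NULL)
			continue;

		/*
		 * Cut the chain after nb_segs segments: the last segment's
		 * "next" was linked ahead of time and still points at an
		 * mbuf owned by sw_ring, so it must not be freed here.
		 */
		for (s = 1; s < first->nb_segs; s++)
			seg = seg->next;
		seg->next = NULL;

		rte_pktmbuf_free(first);
		sw_rsc_ring[i].fbuf = NULL;
	}
}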

>
>>>> +		}
>>>> +
>>>> +		/*
>>>> +		 * This is the last buffer of the received packet - return
>>>> +		 * the current cluster to the user.
>>>> +		 */
>>>> +		rxm->next = NULL;
>>>> +
>>>> +		/* Initialize the first mbuf of the returned packet */
>>>> +		ixgbe_fill_cluster_head_buf(first_seg, &rxd, rxq->port_id,
>>>> +					    staterr);
>>>> +
>>>> +		/* Prefetch data of first segment, if configured to do so. */
>>>> +		rte_packet_prefetch((char *)first_seg->buf_addr +
>>>> +			first_seg->data_off);
>>>> +
>>>> +		/*
>>>> +		 * Store the mbuf address into the next entry of the array
>>>> +		 * of returned packets.
>>>> +		 */
>>>> +		rx_pkts[nb_rx++] = first_seg;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Record index of the next RX descriptor to probe.
>>>> +	 */
>>>> +	rxq->rx_tail = rx_id;
>>>> +
>>>> +	/*
>>>> +	 * If the number of free RX descriptors is greater than the RX free
>>>> +	 * threshold of the queue, advance the Receive Descriptor Tail (RDT)
>>>> +	 * register.
>>>> +	 * Update the RDT with the value of the last processed RX descriptor
>>>> +	 * minus 1, to guarantee that the RDT register is never equal to the
>>>> +	 * RDH register, which creates a "full" ring situation from the
>>>> +	 * hardware point of view...
>>>> +	 */
>>>> +	if (!bulk_alloc && nb_hold > rxq->rx_free_thresh) {
>>>> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_tail=%u "
>>>> +			   "nb_hold=%u nb_rx=%u",
>>>> +			   rxq->port_id, rxq->queue_id, rx_id, nb_hold, nb_rx);
>>>> +
>>> I suppose if you do wmb() after rte_rxmbuf_alloc(), you'd better do it here too.
>> Right! Missed that when I copied this code from
>> ixgbe_recv_scattered_pkts()... ;) Note that the barrier is missing there
>> too...
>> These are the examples of the code that works on x86 only because of
>> that "volatile" thing and will break once it's removed. On PPC it is
>> broken even with "volatile".
> Yep, as I said above - for IA we don't need mb() here - using 'volatile' or a compiler barrier seems enough to me.
> For PPC - I think we do.
>
>>>> +		IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, prev_id);
>>>> +		nb_hold = 0;
>>>> +	}
>>>> +
>>>> +	rxq->nb_rx_hold = nb_hold;
>>>> +	return nb_rx;
>>>> +}
>>>> +
>>>> +uint16_t
>>>> +ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
>>>> +{
>>>> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, false);
>>>> +}
>>>> +
>>>> +uint16_t
>>>> +ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
>>>> +			       uint16_t nb_pkts)
>>>> +{
>>>> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, true);
>>>> +}
>>>> +
>>>>    uint16_t
>>>>    ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>>>>    			  uint16_t nb_pkts)
>>>> @@ -2024,6 +2318,7 @@ ixgbe_rx_queue_release(struct igb_rx_queue *rxq)
>>>>    	if (rxq != NULL) {
>>>>    		ixgbe_rx_queue_release_mbufs(rxq);
>>>>    		rte_free(rxq->sw_ring);
>>>> +		rte_free(rxq->sw_rsc_ring);
>>>>    		rte_free(rxq);
>>>>    	}
>>>>    }
>>>> @@ -2146,6 +2441,7 @@ ixgbe_reset_rx_queue(struct ixgbe_hw *hw, struct igb_rx_queue *rxq)
>>>>    	rxq->nb_rx_hold = 0;
>>>>    	rxq->pkt_first_seg = NULL;
>>>>    	rxq->pkt_last_seg = NULL;
>>>> +	rxq->rsc_en = 0;
>>>>    }
>>>>
>>>>    int
>>>> @@ -2160,6 +2456,14 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
>>>>    	struct igb_rx_queue *rxq;
>>>>    	struct ixgbe_hw     *hw;
>>>>    	uint16_t len;
>>>> +	struct rte_eth_dev_info dev_info = { 0 };
>>>> +	struct rte_eth_rxmode *dev_rx_mode = &dev->data->dev_conf.rxmode;
>>>> +	bool rsc_requested = false;
>>>> +
>>>> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
>>>> +	if ((dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO) &&
>>>> +	    dev_rx_mode->enable_lro)
>>>> +		rsc_requested = true;
>>>>
>>>>    	PMD_INIT_FUNC_TRACE();
>>>>    	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>> @@ -2265,12 +2569,28 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
>>>>    	rxq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
>>>>    					  sizeof(struct igb_rx_entry) * len,
>>>>    					  RTE_CACHE_LINE_SIZE, socket_id);
>>>> -	if (rxq->sw_ring == NULL) {
>>>> +	if (!rxq->sw_ring) {
>>> Wonder what was wrong with that one? :)
>> Nothing - just aligned it with the lines I've added below. ;)
>>
>>>>    		ixgbe_rx_queue_release(rxq);
>>>>    		return (-ENOMEM);
>>>>    	}
>>>> -	PMD_INIT_LOG(DEBUG, "sw_ring=%p hw_ring=%p dma_addr=0x%"PRIx64,
>>>> -		     rxq->sw_ring, rxq->rx_ring, rxq->rx_ring_phys_addr);
>>>> +
>>>> +	if (rsc_requested) {
>>>> +		rxq->sw_rsc_ring =
>>>> +			rte_zmalloc_socket("rxq->sw_rsc_ring",
>>>> +					   sizeof(struct igb_rsc_entry) * len,
>>>> +					   RTE_CACHE_LINE_SIZE, socket_id);
>>>> +		if (!rxq->sw_rsc_ring) {
>>>> +			ixgbe_rx_queue_release(rxq);
>>>> +			return (-ENOMEM);
>>>> +		}
>>>> +	} else {
>>>> +		rxq->sw_rsc_ring = NULL;
>>>> +	}
>>>> +
>>>> +	PMD_INIT_LOG(DEBUG, "sw_ring=%p sw_rsc_ring=%p hw_ring=%p "
>>>> +			    "dma_addr=0x%"PRIx64,
>>>> +		     rxq->sw_ring, rxq->sw_rsc_ring, rxq->rx_ring,
>>>> +		     rxq->rx_ring_phys_addr);
>>>>
>>>>    	if (!rte_is_power_of_2(nb_desc)) {
>>>>    		PMD_INIT_LOG(DEBUG, "queue[%d] doesn't meet Vector Rx "
>>>> @@ -3515,6 +3835,84 @@ ixgbe_dev_mq_tx_configure(struct rte_eth_dev *dev)
>>>>    	return 0;
>>>>    }
>>>>
>>>> +/**
>>>> + * get_rscctl_maxdesc - Calculate the RSCCTL[n].MAXDESC for PF
>>>> + *
>>>> + * Return the RSCCTL[n].MAXDESC for 82599 and x540 PF devices according to the
>>>> + * spec rev. 3.0 chapter 8.2.3.8.13.
>>>> + *
>>>> + * @pool Memory pool of the Rx queue
>>>> + */
>>>> +static inline uint32_t get_rscctl_maxdesc(struct rte_mempool *pool)
>>>> +{
>>>> +	struct rte_pktmbuf_pool_private *mp_priv = rte_mempool_get_priv(pool);
>>>> +
>>>> +	/* MAXDESC * SRRCTL.BSIZEPKT must not exceed 64 KB minus one */
>>>> +	uint16_t maxdesc =
>>>> +		65535 / (mp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM);
>>> A  nit: use some macro (UINt16_MAX?) instead of hardcoded constant if possible.
>> Using UINT16_MAX here would be very confusing. The value here just like
>> values below (16, 8, 4) are values that are explicitly stated in the
>> RSCCTL[n].MAXDESC description in the spec and this code piece is
>> implementing what spec is demanding. Therefore IMHO using the
>> explicit values from the spec here is the most readable way considering
>> the reader that will try to compare this code to the spec section
>> mentioned above and check that the code is correct.
> Ok, if you think UINT16_MAX is confusing, then just add a new one: IXGBE_RSC_MAX_PACKET_SIZE or something.
> As I understand, that's sort of upper limit for the RSC packet size supported, right?

Why define a macro for a value that is not used anywhere else but 
here and that is never going to change? How does it make the code 
more readable or robust?
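
For example, with the common 2KB data buffer (2048 bytes of room left 
after the 128 bytes of headroom) the line above gives 65535 / 2048 = 31, 
so MAXDESC_16 is selected and a single aggregation is capped at 
16 * 2KB = 32KB - comfortably below the "64KB minus one" limit from 
8.2.3.8.13. Anyone reading the code next to the spec can verify that 
directly from the values as they are written.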

>
>>
>>>> +
>>>> +	if (maxdesc >= 16)
>>>> +		return IXGBE_RSCCTL_MAXDESC_16;
>>>> +	else if (maxdesc >= 8)
>>>> +		return IXGBE_RSCCTL_MAXDESC_8;
>>>> +	else if (maxdesc >= 4)
>>>> +		return IXGBE_RSCCTL_MAXDESC_4;
>>>> +	else
>>>> +		return IXGBE_RSCCTL_MAXDESC_1;
>>>> +}
>>>> +
>>>> +/* (Taken from FreeBSD tree)
>>>> +** Setup the correct IVAR register for a particular MSIX interrupt
>>>> +**   (yes this is all very magic and confusing :)
>>>> +**  - entry is the register array entry
>>>> +**  - vector is the MSIX vector for this queue
>>>> +**  - type is RX/TX/MISC
>>>> +*/
>>>> +static void
>>>> +ixgbe_set_ivar(struct rte_eth_dev *dev, u8 entry, u8 vector, s8 type)
>>>> +{
>>>> +	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>> +	u32 ivar, index;
>>>> +
>>>> +	vector |= IXGBE_IVAR_ALLOC_VAL;
>>>> +
>>>> +	switch (hw->mac.type) {
>>>> +
>>>> +	case ixgbe_mac_82598EB:
>>>> +		if (type == -1)
>>>> +			entry = IXGBE_IVAR_OTHER_CAUSES_INDEX;
>>>> +		else
>>>> +			entry += (type * 64);
>>>> +		index = (entry >> 2) & 0x1F;
>>>> +		ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(index));
>>>> +		ivar &= ~(0xFF << (8 * (entry & 0x3)));
>>>> +		ivar |= (vector << (8 * (entry & 0x3)));
>>>> +		IXGBE_WRITE_REG(hw, IXGBE_IVAR(index), ivar);
>>>> +		break;
>>>> +
>>>> +	case ixgbe_mac_82599EB:
>>>> +	case ixgbe_mac_X540:
>>>> +		if (type == -1) { /* MISC IVAR */
>>>> +			index = (entry & 1) * 8;
>>>> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR_MISC);
>>>> +			ivar &= ~(0xFF << index);
>>>> +			ivar |= (vector << index);
>>>> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR_MISC, ivar);
>>>> +		} else {	/* RX/TX IVARS */
>>>> +			index = (16 * (entry & 1)) + (8 * type);
>>>> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(entry >> 1));
>>>> +			ivar &= ~(0xFF << index);
>>>> +			ivar |= (vector << index);
>>>> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR(entry >> 1), ivar);
>>>> +		}
>>>> +
>>>> +		break;
>>>> +
>>>> +	default:
>>>> +		break;
>>>> +	}
>>>> +}
>>>> +
>>>>    void set_rx_function(struct rte_eth_dev *dev)
>>>>    {
>>>>    	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>> @@ -3565,6 +3963,25 @@ void set_rx_function(struct rte_eth_dev *dev)
>>>>    			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
>>>>    		}
>>>>    	}
>>>> +
>>>> +	/*
>>>> +	 * Initialize the appropriate LRO callback.
>>>> +	 *
>>>> +	 * If all queues satisfy the bulk allocation preconditions
>>>> +	 * (hw->rx_bulk_alloc_allowed is TRUE) then we may use bulk allocation.
>>>> +	 * Otherwise use a single allocation version.
>>>> +	 */
>>>> +	if (dev->data->lro) {
>>>> +		if (hw->rx_bulk_alloc_allowed) {
>>>> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a bulk "
>>>> +					   "allocation version");
>>>> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro_bulk_alloc;
>>>> +		} else {
>>>> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a single "
>>>> +					   "allocation version");
>>>> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro;
>>>> +		}
>>>> +	}
>>>>    }
>>> As I understand, ixgbe_recv_pkts_lro() can handle both LRO and normal scattered packets?
>> Not as it is now. It may be easily patched to do so though.
>>
>>> If that's so, then can we remove ixgbe_recv_scattered_pkts() at all and use ixgbe_recv_pkts_lro() for both cases?
>> This was explicitly requested from me by Bruce Richardson (see
>> "[dpdk-dev] : ixgbe: why bulk allocation is not used for a scattered Rx
>> flow?" thread) to separate the complicated handling from the simple high
>> performance one. The handling in the RSC routine is more generic and
>> thus is a bit of overkill for the simple scattered case: e.g. there is
>> no need for a sw_rsc_ring.
> I think Bruce meant ixgbe_recv_pkts_bulk_alloc() not ixgbe_recv_scattered_pkts()
> when he told about simple and high performance RX path.
>
>> Therefore I preferred to advance with small steps here. And if there
>> will be a decision to join these flows - it may be done with a rather
>> small patch in the future.
> Ok, that's understandable and I wouldn't insist to do that in the same patch.
> It just worries me that the number of our ixgbe RX functions keeps increasing.

Let's have this series get to the master and I'll send a follow-up 
series that kills the non-vector scatter callback. Agreed? ;)

>   
>
>>>>    /*
>>>> @@ -3583,10 +4000,26 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>>    	uint32_t maxfrs;
>>>>    	uint32_t srrctl;
>>>>    	uint32_t rdrxctl;
>>>> +	uint32_t rscctl;
>>>> +	uint32_t psrtype;
>>>> +	uint32_t rfctl;
>>>>    	uint32_t rxcsum;
>>>>    	uint16_t buf_size;
>>>>    	uint16_t i;
>>>>    	struct rte_eth_rxmode *rx_conf = &dev->data->dev_conf.rxmode;
>>>> +	struct rte_eth_dev_info dev_info = { 0 };
>>>> +	bool rsc_capable = false;
>>>> +
>>>> +	/* Sanity check */
>>>> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
>>>> +	if (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO)
>>>> +		rsc_capable = true;
>>> @ 7.11.1 82599 spec says:
>>> " Note that in SR-IOV mode the RSC must be disabled globally by setting the RFCTL.RSC_DIS bit."
>>> Add a check?
>> Good catch! Will add a check. Thanks.
>>
>>>> +
>>>> +	if (!rsc_capable && rx_conf->enable_lro) {
>>>> +		PMD_INIT_LOG(CRIT, "LRO is requested on HW that doesn't "
>>>> +				   "support it");
>>>> +		return -EINVAL;
>>>> +	}
>>>>
>>>>    	PMD_INIT_FUNC_TRACE();
>>>>    	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>> @@ -3606,13 +4039,44 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>>    	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
>>>>
>>>>    	/*
>>>> +	 * RFCTL configuration
>>>> +	 *
>>>> +	 * Since NFS packets coalescing is not supported - clear RFCTL.NFSW_DIS
>>>> +	 * and RFCTL.NFSR_DIS when RSC is enabled.
>>>> +	 */
>>>> +	if (rsc_capable) {
>>>> +		rfctl = IXGBE_READ_REG(hw, IXGBE_RFCTL);
>>>> +		if (rx_conf->enable_lro) {
>>>> +			rfctl &= ~(IXGBE_RFCTL_RSC_DIS | IXGBE_RFCTL_NFSW_DIS |
>>>> +				   IXGBE_RFCTL_NFSR_DIS);
>>>> +		} else {
>>>> +			rfctl |= IXGBE_RFCTL_RSC_DIS;
>>>> +		}
>>>> +
>>>> +		IXGBE_WRITE_REG(hw, IXGBE_RFCTL, rfctl);
>>>> +	}
>>>> +
>>>> +
>>>> +	/*
>>>>    	 * Configure CRC stripping, if any.
>>>>    	 */
>>>>    	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
>>>>    	if (rx_conf->hw_strip_crc)
>>>>    		hlreg0 |= IXGBE_HLREG0_RXCRCSTRP;
>>>> -	else
>>>> +	else {
>>>>    		hlreg0 &= ~IXGBE_HLREG0_RXCRCSTRP;
>>>> +		if (rx_conf->enable_lro) {
>>>> +			/*
>>>> +			 * According to chapter 4.6.7.2.1 of the Spec Rev.
>>>> +			 * 3.0 RSC configuration requires HW CRC stripping being
>>>> +			 * enabled. If user requested both HW CRC stripping off
>>>> +			 * and RSC on - return an error.
>>>> +			 */
>>>> +			PMD_INIT_LOG(CRIT, "LRO can't be enabled when HW CRC "
>>>> +					    "is disabled");
>>>> +			return -EINVAL;
>>>> +		}
>>>> +	}
>>>>
>>>>    	/*
>>>>    	 * Configure jumbo frame support, if any.
>>>> @@ -3664,9 +4128,18 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>>    		 * Configure Header Split
>>>>    		 */
>>>>    		if (rx_conf->header_split) {
>>>> +			/*
>>>> +			 * Print a warning if split_hdr_size is less
>>>> +			 * than 128 bytes when RSC is requested.
>>>> +			 */
>>>> +			if (rx_conf->enable_lro &&
>>>> +			    rx_conf->split_hdr_size < 128)
>>>> +				PMD_INIT_LOG(INFO, "split_hdr_size less than "
>>>> +						   "128 bytes (%d)!",
>>>> +					     rx_conf->split_hdr_size);
>>>> +
>>>>    			if (hw->mac.type == ixgbe_mac_82599EB) {
>>>>    				/* Must setup the PSRTYPE register */
>>>> -				uint32_t psrtype;
>>>>    				psrtype = IXGBE_PSRTYPE_TCPHDR |
>>>>    					IXGBE_PSRTYPE_UDPHDR   |
>>>>    					IXGBE_PSRTYPE_IPV4HDR  |
>>>> @@ -3679,7 +4152,20 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>>    			srrctl |= IXGBE_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
>>>>    		} else
>>>>    #endif
>>>> +		{
>>>>    			srrctl = IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF;
>>>> +			/*
>>>> +			 * Following the 4.6.7.2.1 chapter of the 82599/x540
>>>> +			 * Spec if RSC is enabled the SRRCTL[n].BSIZEHEADER
>>>> +			 * should be configured even if header split is not
>>>> +			 * enabled. In the latter case we will configure it 128
>>>> +			 * bytes following the recommendation in the spec.
>>>> +			 */
>>>> +			if (rx_conf->enable_lro)
>>>> +				srrctl |=
>>>> +				     ((128 << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &
>>>> +						    IXGBE_SRRCTL_BSIZEHDR_MASK);
>>>> +		}
>>>>
>>>>    		/* Set if packets are dropped when no descriptors available */
>>>>    		if (rxq->drop_en)
>>>> @@ -3696,6 +4182,13 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>>    				       RTE_PKTMBUF_HEADROOM);
>>>>    		srrctl |= ((buf_size >> IXGBE_SRRCTL_BSIZEPKT_SHIFT) &
>>>>    			   IXGBE_SRRCTL_BSIZEPKT_MASK);
>>>> +
>>>> +		/*
>>>> +		 * TODO: Consider setting the Receive Descriptor Minimum
>>>> +		 * Threshold Size for an RSC case. This is not an obviously
>>>> +		 * beneficial option but one worth considering...
>>>> +		 */
>>>> +
>>>>    		IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(rxq->reg_idx), srrctl);
>>>>
>>>>    		buf_size = (uint16_t) ((srrctl & IXGBE_SRRCTL_BSIZEPKT_MASK) <<
>>>> @@ -3705,11 +4198,57 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>>    		if (dev->data->dev_conf.rxmode.max_rx_pkt_len +
>>>>    					    2 * IXGBE_VLAN_TAG_SIZE > buf_size)
>>>>    			dev->data->scattered_rx = 1;
>>>> +
>>>> +		/* RSC per-queue configuration */
>>>> +		if (rx_conf->enable_lro) {
>>>> +			uint32_t eitr;
>>>> +
>>>> +			rscctl =
>>>> +				IXGBE_READ_REG(hw, IXGBE_RSCCTL(rxq->reg_idx));
>>>> +			psrtype =
>>>> +				IXGBE_READ_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx));
>>>> +			eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
>>>> +
>>>> +			rscctl |= IXGBE_RSCCTL_RSCEN;
>>>> +			rscctl |= get_rscctl_maxdesc(rxq->mb_pool);
>>>> +			psrtype |= IXGBE_PSRTYPE_TCPHDR;
>>>> +
>>>> +			/*
>>>> +			 * RSC: Set ITR interval corresponding to 2K ints/s.
>>>> +			 *
>>>> +			 * Full-sized RSC aggregations for a 10Gb/s link will
>>>> +			 * arrive at about 20K aggregation/s rate.
>>>> +			 *
>>>> +			 * 2K ints/s rate will make only 10% of the
>>>> +			 * aggregations to be closed due to the interrupt timer
>>>> +			 * expiration for a streaming at wire-speed case.
>>>> +			 *
>>>> +			 * For a sparse streaming case this setting will yield
>>>> +			 * at most 500us latency for a single RSC aggregation.
>>>> +			 */
>>>> +			eitr   |= (2000 | IXGBE_EITR_CNT_WDIS);
>>> Again probably create some macro for ITR Interval default value here.
>> Well, again - it's the only place where it's used and I've extensively
>> explained it in the comments in the code. Therefore I think it's the
>> most readable way to write this.
>> If it would be used in at least two places - then I would have put it in
>> a macro...
> I think it is a good practice to use macros instead of raw numbers in such places.
> You probably can make these macros self-explanatory:
> /* EITR Interval in 2us units for 1G and 10G. */
> #define IXGBE_EITR_INTERVAL_US	2
>
> #define IXGBE_EITR_INTERVAL_SHIFT	3
>
> #define IXGBE_EITR_INTERVAL(us)	((us) / IXGBE_EITR_INTERVAL_US << IXGBE_EITR_INTERVAL_SHIFT)
>
> /* at most 500us latency for a single RSC aggregation */
> #define IXGBE_EITR_INTERVAL_DEFAULT  IXGBE_EITR_INTERVAL(500)

If this value had any potential to be changed one day, or if it were
going to be used somewhere else in the code, I would immediately agree,
but here you've added 9 long lines of something that nobody would ever
care about. The only thing everybody would care about is the actual
implication of this value on the RSC functionality. To understand that,
having macros like you propose instead of a proper comment like I propose
doesn't help much. This is because the thing is not just about the EITR
interval and the maximum latency. But if we keep my comment then we
don't need any additional self-explanatory macros because everything has
been explained in the comment already.

If one day this parameter is going to be configured from the outside -
then I agree there would be a place for macros like the above. For the
current API state I think it would just pad the code with useless
lines.
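
For reference, if the macro route were taken, a minimal sketch (using the
names suggested above, which are illustrative and not existing ixgbe
defines) would be:

   /* EITR interval is expressed in 2us units; the field starts at bit 3. */
   #define IXGBE_EITR_INTERVAL_US     2
   #define IXGBE_EITR_INTERVAL_SHIFT  3
   #define IXGBE_EITR_INTERVAL(us) \
           (((us) / IXGBE_EITR_INTERVAL_US) << IXGBE_EITR_INTERVAL_SHIFT)

   /* At most 500us per RSC aggregation, i.e. ~2K ints/s. */
   #define IXGBE_EITR_INTERVAL_DEFAULT  IXGBE_EITR_INTERVAL(500)

   eitr |= (IXGBE_EITR_INTERVAL_DEFAULT | IXGBE_EITR_CNT_WDIS);

IXGBE_EITR_INTERVAL(500) expands to (500 / 2) << 3 == 2000, i.e. exactly the
raw value used in the patch, so the two variants are functionally identical.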

>
>>>> +
>>>> +			IXGBE_WRITE_REG(hw, IXGBE_RSCCTL(rxq->reg_idx), rscctl);
>>>> +			IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx),
>>>> +								       psrtype);
>>>> +			IXGBE_WRITE_REG(hw, IXGBE_EITR(rxq->reg_idx), eitr);
>>>> +
>>>> +			/*
>>>> +			 * RSC requires the mapping of the queue to the
>>>> +			 * interrupt vector.
>>>> +			 */
>>>> +			ixgbe_set_ivar(dev, rxq->reg_idx, i, 0);
>>> Hm, wonder why do we need to setup IVAR for RSC?
>>> Wouldn't just setting EITR be enough?
>> Nope. See 82599 spec chapter 4.6.7.2.2.
> I read it, though it doesn't say 'IVAR must be set up' like it does for EITR.Interval.

82599 Spec, Chapter 4.6.7.2.2 ("RSC Enablement" -> "Per Queue Setting"), 
the last bullet:

"— Map the relevant Rx queues to an interrupt by setting the relevant IVAR
registers."

> That made me think that it might be optional.
>
>> I think I even tried not to map
>> the queues to IVAR and it didn't work... ;)
> Pity, but not much we can do in that case, I suppose.
>
>>>> +
>>>> +			rxq->rsc_en = 1;
>>>> +		}
>>>>    	}
>>>>
>>>>    	if (rx_conf->enable_scatter)
>>>>    		dev->data->scattered_rx = 1;
>>>>
>>>> +	if (rx_conf->enable_lro)
>>>> +		dev->data->lro = 1;
>>>> +
>>>>    	set_rx_function(dev);
>>>>
>>>>    	/*
>>>> @@ -3742,6 +4281,19 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>>    		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
>>>>    	}
>>>>
>>>> +	/* Finalize RSC configuration  */
>>>> +	if (rx_conf->enable_lro) {
>>>> +		/*
>>>> +		 * Follow the instructions in the 4.6.7.2.1 of the Spec Rev. 3.0
>>>> +		 */
>>>> +		rdrxctl = IXGBE_READ_REG(hw, IXGBE_RDRXCTL);
>>>> +		rdrxctl |= IXGBE_RDRXCTL_RSCACKC;
>>>> +		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
>>>> +
>>>> +		PMD_INIT_LOG(INFO, "enabling LRO mode");
>>>> +	}
>>>> +
>>>> +
>>>>    	return 0;
>>>>    }
>>>>
>>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>>>> index bbe5ff3..389173f 100644
>>>> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>>>> @@ -79,6 +79,10 @@ struct igb_rx_entry {
>>>>    	struct rte_mbuf *mbuf; /**< mbuf associated with RX descriptor. */
>>>>    };
>>>>
>>>> +struct igb_rsc_entry {
>>>> +	struct rte_mbuf *fbuf; /**< First segment of the fragmented packet. */
>>>> +};
>>>> +
>>>>    /**
>>>>     * Structure associated with each descriptor of the TX ring of a TX queue.
>>>>     */
>>>> @@ -105,6 +109,7 @@ struct igb_rx_queue {
>>>>    	volatile uint32_t   *rdt_reg_addr; /**< RDT register address. */
>>>>    	volatile uint32_t   *rdh_reg_addr; /**< RDH register address. */
>>>>    	struct igb_rx_entry *sw_ring; /**< address of RX software ring. */
>>>> +	struct igb_rsc_entry *sw_rsc_ring; /**< address of RSC software ring. */
>>>>    	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet. */
>>>>    	struct rte_mbuf *pkt_last_seg; /**< Last segment of current packet. */
>>>>    	uint64_t            mbuf_initializer; /**< value to init mbufs */
>>>> @@ -126,6 +131,7 @@ struct igb_rx_queue {
>>>>    	uint8_t             port_id;  /**< Device port identifier. */
>>>>    	uint8_t             crc_len;  /**< 0 if CRC stripped, 4 otherwise. */
>>>>    	uint8_t             drop_en;  /**< If not 0, set SRRCTL.Drop_En. */
>>>> +	uint8_t             rsc_en;   /**< If not 0, RSC is enabled. */
>>>>    	uint8_t             rx_deferred_start; /**< not in global dev start. */
>>>>    #ifdef RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC
>>>>    	/** need to alloc dummy mbuf, for wraparound when scanning hw ring */
>>>> --
>>>> 2.1.0

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
  2015-03-10 21:36         ` Vlad Zolotarov
@ 2015-03-11 16:32           ` Ananyev, Konstantin
  2015-03-11 16:54             ` Vlad Zolotarov
  0 siblings, 1 reply; 18+ messages in thread
From: Ananyev, Konstantin @ 2015-03-11 16:32 UTC (permalink / raw)
  To: Vlad Zolotarov, dev



> -----Original Message-----
> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
> Sent: Tuesday, March 10, 2015 9:36 PM
> To: Ananyev, Konstantin; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
> 
> 
> 
> On 03/10/15 22:09, Ananyev, Konstantin wrote:
> >>> Hi Vlad,
> >>>
> >>>> -----Original Message-----
> >>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Vlad Zolotarov
> >>>> Sent: Monday, March 09, 2015 7:07 PM
> >>>> To: dev at dpdk.org
> >>>> Subject: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
> >>>>
> >>>>       - Only x540 and 82599 devices support LRO.
> >>>>       - Add the appropriate HW configuration.
> >>>>       - Add RSC aware rx_pkt_burst() handlers:
> >>>>          - Implemented bulk allocation and non-bulk allocation versions.
> >>>>          - Add LRO-specific fields to rte_eth_rxmode, to rte_eth_dev_data
> >>>>            and to igb_rx_queue.
> >>>>          - Use the appropriate handler when LRO is requested.
> >>>>
> >>>> Signed-off-by: Vlad Zolotarov <vladz at cloudius-systems.com>
> >>>> ---
> >>>> New in v5:
> >>>>      - Put the RTE_ETHDEV_HAS_LRO_SUPPORT definition at the beginning of rte_ethdev.h.
> >>>>      - Removed the "TODO: Remove me" comment near RTE_ETHDEV_HAS_LRO_SUPPORT.
> >>>>
> >>>> New in v4:
> >>>>      - Define RTE_ETHDEV_HAS_LRO_SUPPORT in rte_ethdev.h instead of
> >>>>        RTE_ETHDEV_LRO_SUPPORT defined in config/common_linuxapp.
> >>>>
> >>>> New in v2:
> >>>>      - Removed rte_eth_dev_data.lro_bulk_alloc.
> >>>>      - Fixed a few styling and spelling issues.
> >>>> ---
> >>>>    lib/librte_ether/rte_ethdev.h       |   9 +-
> >>>>    lib/librte_pmd_ixgbe/ixgbe_ethdev.c |   6 +
> >>>>    lib/librte_pmd_ixgbe/ixgbe_ethdev.h |   5 +
> >>>>    lib/librte_pmd_ixgbe/ixgbe_rxtx.c   | 562 +++++++++++++++++++++++++++++++++++-
> >>>>    lib/librte_pmd_ixgbe/ixgbe_rxtx.h   |   6 +
> >>>>    5 files changed, 581 insertions(+), 7 deletions(-)
> >>>>
> >>>> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
> >>>> index 8db3127..44f081f 100644
> >>>> --- a/lib/librte_ether/rte_ethdev.h
> >>>> +++ b/lib/librte_ether/rte_ethdev.h
> >>>> @@ -172,6 +172,9 @@ extern "C" {
> >>>>
> >>>>    #include <stdint.h>
> >>>>
> >>>> +/* Use this macro to check if LRO API is supported */
> >>>> +#define RTE_ETHDEV_HAS_LRO_SUPPORT
> >>>> +
> >>>>    #include <rte_log.h>
> >>>>    #include <rte_interrupts.h>
> >>>>    #include <rte_pci.h>
> >>>> @@ -320,14 +323,15 @@ struct rte_eth_rxmode {
> >>>>    	enum rte_eth_rx_mq_mode mq_mode;
> >>>>    	uint32_t max_rx_pkt_len;  /**< Only used if jumbo_frame enabled. */
> >>>>    	uint16_t split_hdr_size;  /**< hdr buf size (header_split enabled).*/
> >>>> -	uint8_t header_split : 1, /**< Header Split enable. */
> >>>> +	uint16_t header_split : 1, /**< Header Split enable. */
> >>>>    		hw_ip_checksum   : 1, /**< IP/UDP/TCP checksum offload enable. */
> >>>>    		hw_vlan_filter   : 1, /**< VLAN filter enable. */
> >>>>    		hw_vlan_strip    : 1, /**< VLAN strip enable. */
> >>>>    		hw_vlan_extend   : 1, /**< Extended VLAN enable. */
> >>>>    		jumbo_frame      : 1, /**< Jumbo Frame Receipt enable. */
> >>>>    		hw_strip_crc     : 1, /**< Enable CRC stripping by hardware. */
> >>>> -		enable_scatter   : 1; /**< Enable scatter packets rx handler */
> >>>> +		enable_scatter   : 1, /**< Enable scatter packets rx handler */
> >>>> +		enable_lro       : 1; /**< Enable LRO */
> >>>>    };
> >>>>
> >>>>    /**
> >>>> @@ -1515,6 +1519,7 @@ struct rte_eth_dev_data {
> >>>>    	uint8_t port_id;           /**< Device [external] port identifier. */
> >>>>    	uint8_t promiscuous   : 1, /**< RX promiscuous mode ON(1) / OFF(0). */
> >>>>    		scattered_rx : 1,  /**< RX of scattered packets is ON(1) / OFF(0) */
> >>>> +		lro          : 1,  /**< RX LRO is ON(1) / OFF(0) */
> >>>>    		all_multicast : 1, /**< RX all multicast mode ON(1) / OFF(0). */
> >>>>    		dev_started : 1;   /**< Device state: STARTED(1) / STOPPED(0). */
> >>>>    };
> >>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> >>>> index 9d3de1a..765174d 100644
> >>>> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> >>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> >>>> @@ -1648,6 +1648,7 @@ ixgbe_dev_stop(struct rte_eth_dev *dev)
> >>>>
> >>>>    	/* Clear stored conf */
> >>>>    	dev->data->scattered_rx = 0;
> >>>> +	dev->data->lro = 0;
> >>>>    	hw->rx_bulk_alloc_allowed = false;
> >>>>    	hw->rx_vec_allowed = false;
> >>>>
> >>>> @@ -2018,6 +2019,11 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> >>>>    		DEV_RX_OFFLOAD_IPV4_CKSUM |
> >>>>    		DEV_RX_OFFLOAD_UDP_CKSUM  |
> >>>>    		DEV_RX_OFFLOAD_TCP_CKSUM;
> >>>> +
> >>>> +	if (hw->mac.type == ixgbe_mac_82599EB ||
> >>>> +	    hw->mac.type == ixgbe_mac_X540)
> >>>> +		dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_TCP_LRO;
> >>>> +
> >>>>    	dev_info->tx_offload_capa =
> >>>>    		DEV_TX_OFFLOAD_VLAN_INSERT |
> >>>>    		DEV_TX_OFFLOAD_IPV4_CKSUM  |
> >>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> >>>> index a549f5c..e206584 100644
> >>>> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> >>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> >>>> @@ -349,6 +349,11 @@ uint16_t ixgbe_recv_pkts_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
> >>>>    uint16_t ixgbe_recv_scattered_pkts(void *rx_queue,
> >>>>    		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> >>>>
> >>>> +uint16_t ixgbe_recv_pkts_lro(void *rx_queue,
> >>>> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> >>>> +uint16_t ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue,
> >>>> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> >>>> +
> >>>>    uint16_t ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
> >>>>    		uint16_t nb_pkts);
> >>>>
> >>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> >>>> index 58e619b..944c662 100644
> >>>> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> >>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> >>>> @@ -1366,6 +1366,15 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
> >>>>    }
> >>>>
> >>>>    /**
> >>>> + * Detect an RSC descriptor.
> >>>> + */
> >>>> +static inline uint32_t ixgbe_rsc_count(union ixgbe_adv_rx_desc *rx)
> >>>> +{
> >>>> +	return (rte_le_to_cpu_32(rx->wb.lower.lo_dword.data) &
> >>>> +		IXGBE_RXDADV_RSCCNT_MASK) >> IXGBE_RXDADV_RSCCNT_SHIFT;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>>     * Initialize the first mbuf of the returned packet:
> >>>>     *    - RX port identifier,
> >>>>     *    - hardware offload data, if any:
> >>>> @@ -1410,6 +1419,291 @@ static inline void ixgbe_fill_cluster_head_buf(
> >>>>    	}
> >>>>    }
> >>>>
> >>>> +/**
>>>> + * Bulk receive handler for an LRO case.
> >>>> + *
> >>>> + * @rx_queue Rx queue handle
> >>>> + * @rx_pkts table of received packets
> >>>> + * @nb_pkts size of rx_pkts table
> >>>> + * @bulk_alloc if TRUE bulk allocation is used for a HW ring refilling
> >>>> + *
> >>>> + * Handles the Rx HW ring completions when RSC feature is configured. Uses an
> >>>> + * additional ring of igb_rsc_entry's that will hold the relevant RSC info.
> >>>> + *
>>>> + * We use the same logic as in the Linux and FreeBSD ixgbe drivers:
> >>>> + * 1) When non-EOP RSC completion arrives:
> >>>> + *    a) Update the HEAD of the current RSC aggregation cluster with the new
> >>>> + *       segment's data length.
> >>>> + *    b) Set the "next" pointer of the current segment to point to the segment
> >>>> + *       at the NEXTP index.
> >>>> + *    c) Pass the HEAD of RSC aggregation cluster on to the next NEXTP entry
> >>>> + *       in the sw_rsc_ring.
> >>>> + * 2) When EOP arrives we just update the cluster's total length and offload
> >>>> + *    flags and deliver the cluster up to the upper layers. In our case - put it
> >>>> + *    in the rx_pkts table.
> >>>> + *
> >>>> + * Returns the number of received packets/clusters (according to the "bulk
> >>>> + * receive" interface).
> >>>> + */
> >>>> +static inline uint16_t
> >>>> +_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
> >>>> +	       bool bulk_alloc)
> >>>> +{
> >>>> +	struct igb_rx_queue *rxq = rx_queue;
> >>>> +	volatile union ixgbe_adv_rx_desc *rx_ring = rxq->rx_ring;
> >>>> +	struct igb_rx_entry *sw_ring = rxq->sw_ring;
> >>>> +	struct igb_rsc_entry *sw_rsc_ring = rxq->sw_rsc_ring;
> >>>> +	uint16_t rx_id = rxq->rx_tail;
> >>>> +	uint16_t nb_rx = 0;
> >>>> +	uint16_t nb_hold = rxq->nb_rx_hold;
> >>>> +	uint16_t prev_id = rxq->rx_tail;
> >>>> +
> >>>> +	while (nb_rx < nb_pkts) {
> >>>> +		bool eop;
> >>>> +		struct igb_rx_entry *rxe;
> >>>> +		struct igb_rsc_entry *rsc_entry;
> >>>> +		struct igb_rsc_entry *next_rsc_entry;
> >>>> +		struct igb_rx_entry *next_rxe;
> >>>> +		struct rte_mbuf *first_seg;
> >>>> +		struct rte_mbuf *rxm;
> >>>> +		struct rte_mbuf *nmb;
> >>>> +		union ixgbe_adv_rx_desc rxd;
> >>>> +		uint16_t data_len;
> >>>> +		uint16_t next_id;
> >>>> +		volatile union ixgbe_adv_rx_desc *rxdp;
> >>>> +		uint32_t staterr;
> >>>> +
> >>>> +next_desc:
> >>>> +		/*
> >>>> +		 * The code in this whole file uses the volatile pointer to
> >>>> +		 * ensure the read ordering of the status and the rest of the
> >>>> +		 * descriptor fields (on the compiler level only!!!). This is so
> >>>> +		 * UGLY - why not to just use the compiler barrier instead? DPDK
> >>>> +		 * even has the rte_compiler_barrier() for that.
> >>>> +		 *
> >>>> +		 * But most importantly this is just wrong because this doesn't
> >>>> +		 * ensure memory ordering in a general case at all. For
> >>>> +		 * instance, DPDK is supposed to work on Power CPUs where
> >>>> +		 * compiler barrier may just not be enough!
> >>>> +		 *
> >>>> +		 * I tried to write only this function properly to have a
> >>>> +		 * starting point (as a part of an LRO/RSC series) but the
> >>>> +		 * compiler cursed at me when I tried to cast away the
> >>>> +		 * "volatile" from rx_ring (yes, it's volatile too!!!). So, I'm
> >>>> +		 * keeping it the way it is for now.
> >>>> +		 *
> >>>> +		 * The code in this file is broken in so many other places and
> >>>> +		 * will just not work on a big endian CPU anyway therefore the
> >>>> +		 * lines below will have to be revisited together with the rest
> >>>> +		 * of the ixgbe PMD.
> >>>> +		 *
> >>>> +		 * TODO:
> >>>> +		 *    - Get rid of "volatile" crap and let the compiler do its
> >>>> +		 *      job.
> >>>> +		 *    - Use the proper memory barrier (rte_rmb()) to ensure the
> >>>> +		 *      memory ordering below.
> >>> Ok, so you wanted to put rte_rmb(), straight after:
> >>> staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> >>> correct?
> >>> I agree that for machines with relaxed memory model (PPC) we do need it here.
> >>> So why not just put it there, instead of complaining about it in comments? ;)
> >> Because it's not a proper fix and I don't like workarounds.
> > Why not? For machines with relaxed memory model you would need  rmb() here no matter does rxdp points to volatile or not.
> >
> >>> About rxdp being pointer to volatile, why it bothers you that much?
> >> Because using of "volatile" prevent the compiler from optimizing every
> >> code piece where the "volatile" variable is participating and that's a
> >> shame.
> >> Read this
> >> https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
> >> for a more detailed explanation.
> >>
> >>> You copy the whole RXD to the local variable anyway, and then reference it only to setup new addresses.
> >> The fact that we have to copy the whole descriptor while we may not need
> >> all the data from it at the end is one problem.
> > I understand that, but I don't think that the difference would be that critical.
> > Though I don't have any data in hand to compare.
> >
> >> The proper solution in Rx ring context should go as follows:
> >>
> >>   1. Remove the "volatile" qualifier from rx_ring (HW Rx descriptors ring).
> >>   2. Remove "volatile" at all places where rx_ring is accessed.
> >>   3. Adjust the code in (2):
> >>       1. Remove the descriptor copy u've mentioned and access the
> >>          descriptor data directly.
> >>       2. Ensure the proper ordering by using the proper memory barriers,
> >>          which are missing in the DPDK SDK at the moment (see a small
> >>          discussion about this with Stephen and Avi on "[dpdk-dev]
> >>          [PATCH v1 5/5] ixgbe: Add LRO support" thread).
> > I think you are mixing 2 different issues here:
> >
> > 1.  For architectures with relaxed memory model we do need rmb() after that line:
> > staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> > We do need it *always*, not depending on is rx_ring a volatile or not.
> > If we really plan to support PPC and other architectures that allow read reordering  -
> > not having an 'rmb()' or similar sync primitive here is a bug.
> > Same thing applies to 'wmb()' before updating RDT.
> >
> > 2. volatile rx_ring vs non-volatile with explicit memory ordering instrincts.
> > Actually I think that using volatile rx_ring is not a real bug on itself.
> > Code with volatile rx_ring and fix for #1 in place would work correctly on all architectures.
> > It might be slower than non-volatile approach, but nothing would be broken.
> >
> > About the existing RX/TX functions and PPC support:
> > Note that all of them were created before PPC support for DPDK was introduced.
> > At that moment only IA was supported.
> > That's why in some places where you would expect to see 'mb()' there are 'volatile' and/or ' rte_compiler_barrier' instead.
> > Why all that places wasn't updated when PPC support was added - that's another question.
> >  From my understanding - with current implementation some of DPDK PMDs RX/TX functions and  rte_ring wouldn't work correctly
> on PPC.
> > So, I suppose we need to decide for ourselves - do we really want to support PPC and other architectures with non-IA memory
> model or not?
> > If not, then I think we don't need any mb()s inside recv_pkts_lro() - just rte_compiler_barrier seems enough, and no point to
> complain about
> > it in comments.
> > If yes - then why to introduce a new function with a known potential bug?
> 
> In order to introduce a new function with the proper implementation, or
> to fix any other places with a similar weakness, I would need proper
> tools, namely platform-dependent barrier macros similar to the Linux
> smp_Xmb() macros that reduce to a compiler barrier where appropriate
> or to a proper memory fence where needed.

I understand that.
Let's add a new macro for that: rte_smp_Xmb() or something,
so it would be just rte_compiler_barrier() for x86 and a proper mb() for PPC.    
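
A minimal sketch of what such a macro could look like (the rte_smp_*mb()
names are assumed here; at this point they do not exist in the tree):

   /* Sketch only: SMP-scope barriers that cost a compiler barrier on IA
    * (loads/stores are already ordered strongly enough there) and a real
    * fence on weakly ordered CPUs such as PPC. */
   #ifdef RTE_ARCH_PPC_64
   #define rte_smp_rmb()  rte_rmb()
   #define rte_smp_wmb()  rte_wmb()
   #else
   #define rte_smp_rmb()  rte_compiler_barrier()
   #define rte_smp_wmb()  rte_compiler_barrier()
   #endif

With that in place, the status_error read in _recv_pkts_lro() could be
followed by rte_smp_rmb() and the RDT update preceded by rte_smp_wmb()
without penalizing x86.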

> 
> Unfortunately DPDK doesn't have such macros at the moment. That's why I put a
> big fat comment at the place that has to be fixed once they are introduced.
> 
> You are right though about the "volatile" thing not being a bug, but it would
> be strange to keep it after barriers are properly placed. That's why I
> think these 2 changes should go together.

Yep, with explicit memory ordering volatile will become redundant and could be removed.
Though, I don't see why it should be applied separately.
From my point of view: the first is a bug fix, the second is an enhancement.

> 
> About the "decision" we have to make - I think it has been decided
> already since PPC is one of the official DPDK targets. Therefore the only
> thing to decide here is when and who gets to fix these things. One thing
> is obvious - this patch is not the right place to do it. ;)

My thought was to introduce such macro(s) and start using them with that patch :)
But ok, if you feel like it's too much for that patch, let's leave it as it is right now.

> 
> >
> >> As it sounds this is going to be a VERY sensitive patchset.
> >> That's why it should go separately from this patchwork (or from any
> >> other patchwork).
> > For that patch, I am not suggesting you to change any other functions, just one that you introducing.
> 
> I don't think that putting an lfence on x86 there is a good idea. As
> I've just explained above - once DPDK has proper platform-dependent
> rmb() macros I'll gladly revisit these lines.

Sure, plain rte_rmb() would slowdown things a lot here.
Totally agree with you here - it should be platform dependent macro, see above.

> Frankly, the same could be
> said about the rte_wmb() before the RDT update but it is much less
> harmful than lfence so I didn't raise it... ;)
> 
> >
> >>>> +		 */
> >>>> +		rxdp = &rx_ring[rx_id];
> >>>> +		staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> >>>> +
> >>>> +		if (!(staterr & IXGBE_RXDADV_STAT_DD))
> >>>> +			break;
> >>>> +
> >>>> +		rxd = *rxdp;
> >>>> +
> >>>> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> >>>> +				  "staterr=0x%x data_len=%u",
> >>>> +			   rxq->port_id, rxq->queue_id, rx_id, staterr,
> >>>> +			   rte_le_to_cpu_16(rxd.wb.upper.length));
> >>>> +
> >>>> +		if (!bulk_alloc) {
> >>>> +			nmb = rte_rxmbuf_alloc(rxq->mb_pool);
> >>>> +			if (nmb == NULL) {
> >>>> +				PMD_RX_LOG(DEBUG, "RX mbuf alloc failed "
> >>>> +						  "port_id=%u queue_id=%u",
> >>>> +					   rxq->port_id, rxq->queue_id);
> >>>> +
> >>>> +				rte_eth_devices[rxq->port_id].data->
> >>>> +							rx_mbuf_alloc_failed++;
> >>>> +				break;
> >>>> +			}
> >>>> +		} else if (nb_hold > rxq->rx_free_thresh) {
> >>>> +			uint16_t next_rdt = rxq->rx_free_trigger;
> >>>> +
> >>>> +			if (!ixgbe_rx_alloc_bufs(rxq, false)) {
> >>>> +				rte_wmb();
> >>>> +				IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr,
> >>>> +						    next_rdt);
> >>>> +				nb_hold -= rxq->rx_free_thresh;
> >>>> +			} else {
> >>>> +				PMD_RX_LOG(DEBUG, "RX bulk alloc failed "
> >>>> +						  "port_id=%u queue_id=%u",
> >>>> +					   rxq->port_id, rxq->queue_id);
> >>>> +
> >>>> +				rte_eth_devices[rxq->port_id].data->
> >>>> +							rx_mbuf_alloc_failed++;
> >>>> +				break;
> >>>> +			}
> >>>> +		}
> >>>> +
> >>>> +		nb_hold++;
> >>>> +		rxe = &sw_ring[rx_id];
> >>>> +		eop = staterr & IXGBE_RXDADV_STAT_EOP;
> >>>> +
> >>>> +		next_id = rx_id + 1;
> >>>> +		if (next_id == rxq->nb_rx_desc)
> >>>> +			next_id = 0;
> >>>> +
> >>>> +		/* Prefetch next mbuf while processing current one. */
> >>>> +		rte_ixgbe_prefetch(sw_ring[next_id].mbuf);
> >>>> +
> >>>> +		/*
> >>>> +		 * When next RX descriptor is on a cache-line boundary,
> >>>> +		 * prefetch the next 4 RX descriptors and the next 4 pointers
> >>>> +		 * to mbufs.
> >>>> +		 */
> >>>> +		if ((next_id & 0x3) == 0) {
> >>>> +			rte_ixgbe_prefetch(&rx_ring[next_id]);
> >>>> +			rte_ixgbe_prefetch(&sw_ring[next_id]);
> >>>> +		}
> >>>> +
> >>>> +		rxm = rxe->mbuf;
> >>>> +
> >>>> +		if (!bulk_alloc) {
> >>>> +			__le64 dma =
> >>>> +			  rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(nmb));
> >>>> +			/*
> >>>> +			 * Update RX descriptor with the physical address of the
> >>>> +			 * new data buffer of the new allocated mbuf.
> >>>> +			 */
> >>>> +			rxe->mbuf = nmb;
> >>>> +
> >>>> +			rxm->data_off = RTE_PKTMBUF_HEADROOM;
> >>>> +			rxdp->read.hdr_addr = dma;
> >>>> +			rxdp->read.pkt_addr = dma;
> >>>> +		}
> >>>> +		/*
> >>>> +		 * Set data length & data buffer address of mbuf.
> >>>> +		 */
> >>>> +		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
> >>>> +		rxm->data_len = data_len;
> >>>> +
> >>>> +		if (!eop) {
> >>>> +			uint16_t nextp_id;
> >>>> +			/*
> >>>> +			 * Get next descriptor index:
> >>>> +			 *  - For RSC it's in the NEXTP field.
> >>>> +			 *  - For a scattered packet - it's just a following
> >>>> +			 *    descriptor.
> >>>> +			 */
> >>>> +			if (ixgbe_rsc_count(&rxd))
> >>>> +				nextp_id =
> >>>> +					(staterr & IXGBE_RXDADV_NEXTP_MASK) >>
> >>>> +						       IXGBE_RXDADV_NEXTP_SHIFT;
> >>>> +			else
> >>>> +				nextp_id = next_id;
> >>>> +
> >>>> +			next_rsc_entry = &sw_rsc_ring[nextp_id];
> >>>> +			next_rxe = &sw_ring[nextp_id];
> >>>> +			rte_ixgbe_prefetch(next_rxe);
> >>>> +		}
> >>>> +
> >>>> +		rsc_entry = &sw_rsc_ring[rx_id];
> >>>> +		first_seg = rsc_entry->fbuf;
> >>>> +		rsc_entry->fbuf = NULL;
> >>>> +
> >>>> +		/*
> >>>> +		 * If this is the first buffer of the received packet,
> >>>> +		 * set the pointer to the first mbuf of the packet and
> >>>> +		 * initialize its context.
> >>>> +		 * Otherwise, update the total length and the number of segments
> >>>> +		 * of the current scattered packet, and update the pointer to
> >>>> +		 * the last mbuf of the current packet.
> >>>> +		 */
> >>>> +		if (first_seg == NULL) {
> >>>> +			first_seg = rxm;
> >>>> +			first_seg->pkt_len = data_len;
> >>>> +			first_seg->nb_segs = 1;
> >>>> +		} else {
> >>>> +			first_seg->pkt_len += data_len;
> >>>> +			first_seg->nb_segs++;
> >>>> +		}
> >>>> +
> >>>> +		prev_id = rx_id;
> >>>> +		rx_id = next_id;
> >>>> +
> >>>> +		/*
> >>>> +		 * If this is not the last buffer of the received packet, update
> >>>> +		 * the pointer to the first mbuf at the NEXTP entry in the
> >>>> +		 * sw_rsc_ring and continue to parse the RX ring.
> >>>> +		 */
> >>>> +		if (!eop) {
> >>>> +			rxm->next = next_rxe->mbuf;
> >>>> +			next_rsc_entry->fbuf = first_seg;
> >>>> +			goto next_desc;
> >>> So _recv_pkts_lro() can return with one of rxq->rsc_entry[i] != NULL, correct?
> >>> If so, then I think you need to add code in ixgbe_rx_queue_release_mbufs() that would go through
> >>> all rsc_entry[] to find any whose fbuf is != NULL, call rte_pktmbuf_free() for it and reset it to NULL.
> >>>    To handle the case:
> >>> recv_pkts_lro(rxq, ...);
> >>> rte_eth_dev_stop();
> >>> rte_eth_dev_start();
> >>> recv_pkts_lro(rxq, ...);
> >> Right. I've missed that part.
> >>
> >>> BTW, that also means that you can't do:
> >>> rxm->next = next_rxe->mbuf;
> >>> above, and
> >>> rxm->next = NULL;
> >>> should be done before 'goto next_desc;' too
> >> Your proposal will cost cycles in the fast path on account of saving
> >> cycles in the slow path: we'll have to add another pointer to the
> >> igb_rsc_entry to hold the last mbuf in the current cluster that we'll
> >> have to read and update for every new completed RSC descriptor.
> >>
> >> The easier way would be to just reset the next-pointer of the last
> >> descriptor in the RSC cluster to NULL (according to nb_segs) before
> >> calling rte_pktmbuf_free() in ixgbe_rx_queue_release_mbufs().
> > Should work too, I think.
> 
> The final solution is even nicer - see v7. And it works like a charm
> too... ;)

Good to hear :)
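
For illustration, the queue-release cleanup discussed above could be a short
loop in ixgbe_rx_queue_release_mbufs() along these lines (a sketch only,
following the "terminate the chain at nb_segs" idea; the actual v7 code may
differ):

   if (rxq->sw_rsc_ring != NULL) {
           unsigned i;

           for (i = 0; i < rxq->nb_rx_desc; i++) {
                   struct rte_mbuf *m = rxq->sw_rsc_ring[i].fbuf;
                   struct rte_mbuf *seg = m;
                   uint16_t s;

                   if (m == NULL)
                           continue;

                   /*
                    * The last accumulated segment still points to an mbuf
                    * owned by the HW ring, so cut the chain at nb_segs
                    * before freeing it.
                    */
                   for (s = 1; s < m->nb_segs; s++)
                           seg = seg->next;
                   seg->next = NULL;

                   rte_pktmbuf_free(m);
                   rxq->sw_rsc_ring[i].fbuf = NULL;
           }
   }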


> 
> >
> >>>> +		}
> >>>> +
> >>>> +		/*
> >>>> +		 * This is the last buffer of the received packet - return
> >>>> +		 * the current cluster to the user.
> >>>> +		 */
> >>>> +		rxm->next = NULL;
> >>>> +
> >>>> +		/* Initialize the first mbuf of the returned packet */
> >>>> +		ixgbe_fill_cluster_head_buf(first_seg, &rxd, rxq->port_id,
> >>>> +					    staterr);
> >>>> +
> >>>> +		/* Prefetch data of first segment, if configured to do so. */
> >>>> +		rte_packet_prefetch((char *)first_seg->buf_addr +
> >>>> +			first_seg->data_off);
> >>>> +
> >>>> +		/*
> >>>> +		 * Store the mbuf address into the next entry of the array
> >>>> +		 * of returned packets.
> >>>> +		 */
> >>>> +		rx_pkts[nb_rx++] = first_seg;
> >>>> +	}
> >>>> +
> >>>> +	/*
> >>>> +	 * Record index of the next RX descriptor to probe.
> >>>> +	 */
> >>>> +	rxq->rx_tail = rx_id;
> >>>> +
> >>>> +	/*
> >>>> +	 * If the number of free RX descriptors is greater than the RX free
> >>>> +	 * threshold of the queue, advance the Receive Descriptor Tail (RDT)
> >>>> +	 * register.
> >>>> +	 * Update the RDT with the value of the last processed RX descriptor
> >>>> +	 * minus 1, to guarantee that the RDT register is never equal to the
> >>>> +	 * RDH register, which creates a "full" ring situtation from the
> >>>> +	 * hardware point of view...
> >>>> +	 */
> >>>> +	if (!bulk_alloc && nb_hold > rxq->rx_free_thresh) {
> >>>> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_tail=%u "
> >>>> +			   "nb_hold=%u nb_rx=%u",
> >>>> +			   rxq->port_id, rxq->queue_id, rx_id, nb_hold, nb_rx);
> >>>> +
> >>> I suppose if you do wmb() after rte_rxmbuf_alloc(), you'd better do it here too.
> >> Right! Missed that when I copied this code from
> >> ixgbe_recv_scattered_pkts()... ;) Note that the barrier is missing there
> >> too...
> >> These are examples of code that works on x86 only because of
> >> that "volatile" thing and will break once it's removed. On PPC it is
> >> broken even with "volatile".
> > Yep, as I said above -for IA we don't need mb() here - using 'volatile' or compiler barrier seems enough to me.
> > For PPC - I think we do.
> >
> >>>> +		IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, prev_id);
> >>>> +		nb_hold = 0;
> >>>> +	}
> >>>> +
> >>>> +	rxq->nb_rx_hold = nb_hold;
> >>>> +	return nb_rx;
> >>>> +}
> >>>> +
> >>>> +uint16_t
> >>>> +ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
> >>>> +{
> >>>> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, false);
> >>>> +}
> >>>> +
> >>>> +uint16_t
> >>>> +ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
> >>>> +			       uint16_t nb_pkts)
> >>>> +{
> >>>> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, true);
> >>>> +}
> >>>> +
> >>>>    uint16_t
> >>>>    ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
> >>>>    			  uint16_t nb_pkts)
> >>>> @@ -2024,6 +2318,7 @@ ixgbe_rx_queue_release(struct igb_rx_queue *rxq)
> >>>>    	if (rxq != NULL) {
> >>>>    		ixgbe_rx_queue_release_mbufs(rxq);
> >>>>    		rte_free(rxq->sw_ring);
> >>>> +		rte_free(rxq->sw_rsc_ring);
> >>>>    		rte_free(rxq);
> >>>>    	}
> >>>>    }
> >>>> @@ -2146,6 +2441,7 @@ ixgbe_reset_rx_queue(struct ixgbe_hw *hw, struct igb_rx_queue *rxq)
> >>>>    	rxq->nb_rx_hold = 0;
> >>>>    	rxq->pkt_first_seg = NULL;
> >>>>    	rxq->pkt_last_seg = NULL;
> >>>> +	rxq->rsc_en = 0;
> >>>>    }
> >>>>
> >>>>    int
> >>>> @@ -2160,6 +2456,14 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
> >>>>    	struct igb_rx_queue *rxq;
> >>>>    	struct ixgbe_hw     *hw;
> >>>>    	uint16_t len;
> >>>> +	struct rte_eth_dev_info dev_info = { 0 };
> >>>> +	struct rte_eth_rxmode *dev_rx_mode = &dev->data->dev_conf.rxmode;
> >>>> +	bool rsc_requested = false;
> >>>> +
> >>>> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
> >>>> +	if ((dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO) &&
> >>>> +	    dev_rx_mode->enable_lro)
> >>>> +		rsc_requested = true;
> >>>>
> >>>>    	PMD_INIT_FUNC_TRACE();
> >>>>    	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> >>>> @@ -2265,12 +2569,28 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
> >>>>    	rxq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
> >>>>    					  sizeof(struct igb_rx_entry) * len,
> >>>>    					  RTE_CACHE_LINE_SIZE, socket_id);
> >>>> -	if (rxq->sw_ring == NULL) {
> >>>> +	if (!rxq->sw_ring) {
> >>> Wonder what was wrong with that one? :)
> >> Nothing - just aligned it with the lines I've added below. ;)
> >>
> >>>>    		ixgbe_rx_queue_release(rxq);
> >>>>    		return (-ENOMEM);
> >>>>    	}
> >>>> -	PMD_INIT_LOG(DEBUG, "sw_ring=%p hw_ring=%p dma_addr=0x%"PRIx64,
> >>>> -		     rxq->sw_ring, rxq->rx_ring, rxq->rx_ring_phys_addr);
> >>>> +
> >>>> +	if (rsc_requested) {
> >>>> +		rxq->sw_rsc_ring =
> >>>> +			rte_zmalloc_socket("rxq->sw_rsc_ring",
> >>>> +					   sizeof(struct igb_rsc_entry) * len,
> >>>> +					   RTE_CACHE_LINE_SIZE, socket_id);
> >>>> +		if (!rxq->sw_rsc_ring) {
> >>>> +			ixgbe_rx_queue_release(rxq);
> >>>> +			return (-ENOMEM);
> >>>> +		}
> >>>> +	} else {
> >>>> +		rxq->sw_rsc_ring = NULL;
> >>>> +	}
> >>>> +
> >>>> +	PMD_INIT_LOG(DEBUG, "sw_ring=%p sw_rsc_ring=%p hw_ring=%p "
> >>>> +			    "dma_addr=0x%"PRIx64,
> >>>> +		     rxq->sw_ring, rxq->sw_rsc_ring, rxq->rx_ring,
> >>>> +		     rxq->rx_ring_phys_addr);
> >>>>
> >>>>    	if (!rte_is_power_of_2(nb_desc)) {
> >>>>    		PMD_INIT_LOG(DEBUG, "queue[%d] doesn't meet Vector Rx "
> >>>> @@ -3515,6 +3835,84 @@ ixgbe_dev_mq_tx_configure(struct rte_eth_dev *dev)
> >>>>    	return 0;
> >>>>    }
> >>>>
> >>>> +/**
> >>>> + * get_rscctl_maxdesc - Calculate the RSCCTL[n].MAXDESC for PF
> >>>> + *
> >>>> + * Return the RSCCTL[n].MAXDESC for 82599 and x540 PF devices according to the
> >>>> + * spec rev. 3.0 chapter 8.2.3.8.13.
> >>>> + *
> >>>> + * @pool Memory pool of the Rx queue
> >>>> + */
> >>>> +static inline uint32_t get_rscctl_maxdesc(struct rte_mempool *pool)
> >>>> +{
> >>>> +	struct rte_pktmbuf_pool_private *mp_priv = rte_mempool_get_priv(pool);
> >>>> +
> >>>> +	/* MAXDESC * SRRCTL.BSIZEPKT must not exceed 64 KB minus one */
> >>>> +	uint16_t maxdesc =
> >>>> +		65535 / (mp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM);
> >>> A  nit: use some macro (UINt16_MAX?) instead of hardcoded constant if possible.
> >> Using UINT16_MAX here would be very confusing. The value here, just like
> >> the values below (16, 8, 4), is explicitly stated in the
> >> RSCCTL[n].MAXDESC description in the spec, and this code piece is
> >> implementing what the spec demands. Therefore IMHO using the
> >> explicit values from the spec here is the most readable way for a
> >> reader who will try to compare this code to the spec section
> >> mentioned above and check that the code is correct.
> > Ok, if you think UINT16_MAX is confusing, then just add a new one: IXGBE_RSC_MAX_PACKET_SIZE or something.
> > As I understand, that's sort of upper limit for the RSC packet size supported, right?
> 
> Why define a macro for a value that is not used anywhere else but
> here and that is never going to be changed? How does it make the code
> more readable or robust?
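
For comparison, the macro-named variant under discussion would look roughly
like this (IXGBE_RSC_MAX_PACKET_SIZE is a name suggested in this thread, not
an existing define):

   /* MAXDESC * SRRCTL.BSIZEPKT must not exceed 64KB - 1. */
   #define IXGBE_RSC_MAX_PACKET_SIZE  65535

   uint16_t maxdesc = IXGBE_RSC_MAX_PACKET_SIZE /
           (mp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM);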
> 
> >
> >>
> >>>> +
> >>>> +	if (maxdesc >= 16)
> >>>> +		return IXGBE_RSCCTL_MAXDESC_16;
> >>>> +	else if (maxdesc >= 8)
> >>>> +		return IXGBE_RSCCTL_MAXDESC_8;
> >>>> +	else if (maxdesc >= 4)
> >>>> +		return IXGBE_RSCCTL_MAXDESC_4;
> >>>> +	else
> >>>> +		return IXGBE_RSCCTL_MAXDESC_1;
> >>>> +}
> >>>> +
> >>>> +/* (Taken from FreeBSD tree)
> >>>> +** Setup the correct IVAR register for a particular MSIX interrupt
> >>>> +**   (yes this is all very magic and confusing :)
> >>>> +**  - entry is the register array entry
> >>>> +**  - vector is the MSIX vector for this queue
> >>>> +**  - type is RX/TX/MISC
> >>>> +*/
> >>>> +static void
> >>>> +ixgbe_set_ivar(struct rte_eth_dev *dev, u8 entry, u8 vector, s8 type)
> >>>> +{
> >>>> +	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> >>>> +	u32 ivar, index;
> >>>> +
> >>>> +	vector |= IXGBE_IVAR_ALLOC_VAL;
> >>>> +
> >>>> +	switch (hw->mac.type) {
> >>>> +
> >>>> +	case ixgbe_mac_82598EB:
> >>>> +		if (type == -1)
> >>>> +			entry = IXGBE_IVAR_OTHER_CAUSES_INDEX;
> >>>> +		else
> >>>> +			entry += (type * 64);
> >>>> +		index = (entry >> 2) & 0x1F;
> >>>> +		ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(index));
> >>>> +		ivar &= ~(0xFF << (8 * (entry & 0x3)));
> >>>> +		ivar |= (vector << (8 * (entry & 0x3)));
> >>>> +		IXGBE_WRITE_REG(hw, IXGBE_IVAR(index), ivar);
> >>>> +		break;
> >>>> +
> >>>> +	case ixgbe_mac_82599EB:
> >>>> +	case ixgbe_mac_X540:
> >>>> +		if (type == -1) { /* MISC IVAR */
> >>>> +			index = (entry & 1) * 8;
> >>>> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR_MISC);
> >>>> +			ivar &= ~(0xFF << index);
> >>>> +			ivar |= (vector << index);
> >>>> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR_MISC, ivar);
> >>>> +		} else {	/* RX/TX IVARS */
> >>>> +			index = (16 * (entry & 1)) + (8 * type);
> >>>> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(entry >> 1));
> >>>> +			ivar &= ~(0xFF << index);
> >>>> +			ivar |= (vector << index);
> >>>> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR(entry >> 1), ivar);
> >>>> +		}
> >>>> +
> >>>> +		break;
> >>>> +
> >>>> +	default:
> >>>> +		break;
> >>>> +	}
> >>>> +}
> >>>> +
> >>>>    void set_rx_function(struct rte_eth_dev *dev)
> >>>>    {
> >>>>    	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> >>>> @@ -3565,6 +3963,25 @@ void set_rx_function(struct rte_eth_dev *dev)
> >>>>    			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
> >>>>    		}
> >>>>    	}
> >>>> +
> >>>> +	/*
> >>>> +	 * Initialize the appropriate LRO callback.
> >>>> +	 *
> >>>> +	 * If all queues satisfy the bulk allocation preconditions
> >>>> +	 * (hw->rx_bulk_alloc_allowed is TRUE) then we may use bulk allocation.
> >>>> +	 * Otherwise use a single allocation version.
> >>>> +	 */
> >>>> +	if (dev->data->lro) {
> >>>> +		if (hw->rx_bulk_alloc_allowed) {
> >>>> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a bulk "
> >>>> +					   "allocation version");
> >>>> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro_bulk_alloc;
> >>>> +		} else {
> >>>> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a single "
> >>>> +					   "allocation version");
> >>>> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro;
> >>>> +		}
> >>>> +	}
> >>>>    }
> >>> As I understand, ixgbe_recv_pkts_lro() can handle both LRO and normal scattered packets?
> >> Not as it is now. It may be easily patched to do so though.
> >>
> >>> If that so, then can we remove ixgbe_recv_scattered_pkts() at all and use ixgbe_recv_scattered_pkts() for both cases?
> >> This was explicitly requested from me by Bruce Richardson (see
> >> "[dpdk-dev] : ixgbe: why bulk allocation is not used for a scattered Rx
> >> flow?" thread) to separate the complicated handling from the simple high
> >> performance one. The handling in the RSC routine is more generic and
> >> thus is a bit of overkill for the simple scattered case: e.g. there is
> no need for a sw_rsc_ring.
> > I think Bruce meant ixgbe_recv_pkts_bulk_alloc() not ixgbe_recv_scattered_pkts()
> > when he told about simple and high performance RX path.
> >
> >> Therefore I preferred to advance with small steps here. And if there
> >> will be a decision to join these flows - it may be done with a rather
> >> small patch in the future.
> > Ok, that's understandable and I wouldn't insist to do that in the same patch.
> > It just worries me that number of our ixgbe RX functions keeps increasing.
> 
> Let's have this series get into master and I'll send a follow-up
> series that kills the non-vector scatter callback. Agreed? ;)

Yes, as I said above, am ok with that.

> 
> >
> >
> >>>>    /*
> >>>> @@ -3583,10 +4000,26 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >>>>    	uint32_t maxfrs;
> >>>>    	uint32_t srrctl;
> >>>>    	uint32_t rdrxctl;
> >>>> +	uint32_t rscctl;
> >>>> +	uint32_t psrtype;
> >>>> +	uint32_t rfctl;
> >>>>    	uint32_t rxcsum;
> >>>>    	uint16_t buf_size;
> >>>>    	uint16_t i;
> >>>>    	struct rte_eth_rxmode *rx_conf = &dev->data->dev_conf.rxmode;
> >>>> +	struct rte_eth_dev_info dev_info = { 0 };
> >>>> +	bool rsc_capable = false;
> >>>> +
> >>>> +	/* Sanity check */
> >>>> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
> >>>> +	if (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO)
> >>>> +		rsc_capable = true;
> >>> @ 7.11.1 82599 spec says:
> >>> " Note that in SR-IOV mode the RSC must be disabled globally by setting the RFCTL.RSC_DIS bit."
> >>> Add a check?
> >> Good catch! Will add a check. Thanks.
> >>
> >>>> +
> >>>> +	if (!rsc_capable && rx_conf->enable_lro) {
> >>>> +		PMD_INIT_LOG(CRIT, "LRO is requested on HW that doesn't "
> >>>> +				   "support it");
> >>>> +		return -EINVAL;
> >>>> +	}
> >>>>
> >>>>    	PMD_INIT_FUNC_TRACE();
> >>>>    	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> >>>> @@ -3606,13 +4039,44 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >>>>    	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
> >>>>
> >>>>    	/*
> >>>> +	 * RFCTL configuration
> >>>> +	 *
> >>>> +	 * Since NFS packets coalescing is not supported - clear RFCTL.NFSW_DIS
> >>>> +	 * and RFCTL.NFSR_DIS when RSC is enabled.
> >>>> +	 */
> >>>> +	if (rsc_capable) {
> >>>> +		rfctl = IXGBE_READ_REG(hw, IXGBE_RFCTL);
> >>>> +		if (rx_conf->enable_lro) {
> >>>> +			rfctl &= ~(IXGBE_RFCTL_RSC_DIS | IXGBE_RFCTL_NFSW_DIS |
> >>>> +				   IXGBE_RFCTL_NFSR_DIS);
> >>>> +		} else {
> >>>> +			rfctl |= IXGBE_RFCTL_RSC_DIS;
> >>>> +		}
> >>>> +
> >>>> +		IXGBE_WRITE_REG(hw, IXGBE_RFCTL, rfctl);
> >>>> +	}
> >>>> +
> >>>> +
> >>>> +	/*
> >>>>    	 * Configure CRC stripping, if any.
> >>>>    	 */
> >>>>    	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
> >>>>    	if (rx_conf->hw_strip_crc)
> >>>>    		hlreg0 |= IXGBE_HLREG0_RXCRCSTRP;
> >>>> -	else
> >>>> +	else {
> >>>>    		hlreg0 &= ~IXGBE_HLREG0_RXCRCSTRP;
> >>>> +		if (rx_conf->enable_lro) {
> >>>> +			/*
> >>>> +			 * According to chapter of 4.6.7.2.1 of the Spec Rev.
> >>>> +			 * 3.0 RSC configuration requires HW CRC stripping being
> >>>> +			 * enabled. If user requested both HW CRC stripping off
> >>>> +			 * and RSC on - return an error.
> >>>> +			 */
> >>>> +			PMD_INIT_LOG(CRIT, "LRO can't be enabled when HW CRC "
> >>>> +					    "is disabled");
> >>>> +			return -EINVAL;
> >>>> +		}
> >>>> +	}
> >>>>
> >>>>    	/*
> >>>>    	 * Configure jumbo frame support, if any.
> >>>> @@ -3664,9 +4128,18 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >>>>    		 * Configure Header Split
> >>>>    		 */
> >>>>    		if (rx_conf->header_split) {
> >>>> +			/*
> >>>> +			 * Print a warning if split_hdr_size is less
> >>>> +			 * than 128 bytes when RSC is requested.
> >>>> +			 */
> >>>> +			if (rx_conf->enable_lro &&
> >>>> +			    rx_conf->split_hdr_size < 128)
> >>>> +				PMD_INIT_LOG(INFO, "split_hdr_size less than "
> >>>> +						   "128 bytes (%d)!",
> >>>> +					     rx_conf->split_hdr_size);
> >>>> +
> >>>>    			if (hw->mac.type == ixgbe_mac_82599EB) {
> >>>>    				/* Must setup the PSRTYPE register */
> >>>> -				uint32_t psrtype;
> >>>>    				psrtype = IXGBE_PSRTYPE_TCPHDR |
> >>>>    					IXGBE_PSRTYPE_UDPHDR   |
> >>>>    					IXGBE_PSRTYPE_IPV4HDR  |
> >>>> @@ -3679,7 +4152,20 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >>>>    			srrctl |= IXGBE_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
> >>>>    		} else
> >>>>    #endif
> >>>> +		{
> >>>>    			srrctl = IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF;
> >>>> +			/*
> >>>> +			 * Following the 4.6.7.2.1 chapter of the 82599/x540
> >>>> +			 * Spec if RSC is enabled the SRRCTL[n].BSIZEHEADER
> >>>> +			 * should be configured even if header split is not
> >>>> +			 * enabled. In the later case we will configure it 128
> >>>> +			 * bytes following the recommendation in the spec.
> >>>> +			 */
> >>>> +			if (rx_conf->enable_lro)
> >>>> +				srrctl |=
> >>>> +				     ((128 << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &
> >>>> +						    IXGBE_SRRCTL_BSIZEHDR_MASK);
> >>>> +		}
> >>>>
> >>>>    		/* Set if packets are dropped when no descriptors available */
> >>>>    		if (rxq->drop_en)
> >>>> @@ -3696,6 +4182,13 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >>>>    				       RTE_PKTMBUF_HEADROOM);
> >>>>    		srrctl |= ((buf_size >> IXGBE_SRRCTL_BSIZEPKT_SHIFT) &
> >>>>    			   IXGBE_SRRCTL_BSIZEPKT_MASK);
> >>>> +
> >>>> +		/*
> >>>> +		 * TODO: Consider setting the Receive Descriptor Minimum
> >>>> +		 * Threshold Size for an RSC case. This is not an obviously
> >>>> +		 * beneficial option but one worth considering...
> >>>> +		 */
> >>>> +
> >>>>    		IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(rxq->reg_idx), srrctl);
> >>>>
> >>>>    		buf_size = (uint16_t) ((srrctl & IXGBE_SRRCTL_BSIZEPKT_MASK) <<
> >>>> @@ -3705,11 +4198,57 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >>>>    		if (dev->data->dev_conf.rxmode.max_rx_pkt_len +
> >>>>    					    2 * IXGBE_VLAN_TAG_SIZE > buf_size)
> >>>>    			dev->data->scattered_rx = 1;
> >>>> +
> >>>> +		/* RSC per-queue configuration */
> >>>> +		if (rx_conf->enable_lro) {
> >>>> +			uint32_t eitr;
> >>>> +
> >>>> +			rscctl =
> >>>> +				IXGBE_READ_REG(hw, IXGBE_RSCCTL(rxq->reg_idx));
> >>>> +			psrtype =
> >>>> +				IXGBE_READ_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx));
> >>>> +			eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
> >>>> +
> >>>> +			rscctl |= IXGBE_RSCCTL_RSCEN;
> >>>> +			rscctl |= get_rscctl_maxdesc(rxq->mb_pool);
> >>>> +			psrtype |= IXGBE_PSRTYPE_TCPHDR;
> >>>> +
> >>>> +			/*
> >>>> +			 * RSC: Set ITR interval corresponding to 2K ints/s.
> >>>> +			 *
> >>>> +			 * Full-sized RSC aggregations for a 10Gb/s link will
> >>>> +			 * arrive at about 20K aggregation/s rate.
> >>>> +			 *
> >>>> +			 * A 2K ints/s rate will cause only 10% of the
> >>>> +			 * aggregations to be closed due to the interrupt timer
> >>>> +			 * expiration when streaming at wire speed.
> >>>> +			 *
> >>>> +			 * For a sparse streaming case this setting will yield
> >>>> +			 * at most 500us latency for a single RSC aggregation.
> >>>> +			 */
> >>>> +			eitr   |= (2000 | IXGBE_EITR_CNT_WDIS);
> >>> Again probably create some macro for ITR Interval default value here.
> >> Well, again - it's the only place where it's used and I've extensively
> >> explained it in the comments in the code. Therefore I think it's the
> >> most readable way to write this.
> >> If it would be used in at least two places - then I would have put it in
> >> a macro...
> > I think it is a good practise to use macros instead of raw numbers in such places.
> > You probably can make these macros self-explanatory:
> > /* EITR Interval in 2us units for 1G and 10G. */
> > #define IXGBE_EITR_INTERVAL_US	2
> >
> > #define IXGBE_EITR_INTERVAL_SHIFT	3
> >
> > #define IXGBE_EITR_INTERVAL(us)	((us) / IXGBE_EITR_INTERVAL_US << IXGBE_EITR_INTERVAL_SHIFT)
> >
> > /* at most 500us latency for a single RSC aggregation */
> > #define IXGBE_EITR_INTERVAL_DEFAULT  IXGBE_EITR_INTERVAL(500)
> 
> If this value had any potential to be changed one day, or if it were
> going to be used somewhere else in the code, I would immediately agree,
> but here you've added 9 long lines of something that nobody would ever
> care about. The only thing everybody would care about is the actual
> implication of this value on the RSC functionality. To understand that,
> having macros like you propose instead of a proper comment like I propose
> doesn't help much. This is because the thing is not just about the EITR
> interval and the maximum latency. But if we keep my comment then we
> don't need any additional self-explanatory macros because everything has
> been explained in the comment already.
> 
> If one day this parameter is going to be configured from the outside -
> then I agree there would be a place for macros like the above. For the
> current API state I think it would just pad the code with useless
> lines.

It is a good approach to do things in a proper way from the start.
Here you define the macros and use them inside your code.
Then, when someone else needs to manipulate the EITR interval, he can use the macros
you defined and wouldn't need to touch your code.
The same applies to the MAXDESC calculation above.
 
BTW:
eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
...
eitr   |= (2000 | IXGBE_EITR_CNT_WDIS);

Could EITR already contain some previous interval value?
If yes, then we probably either need to clear the previous interval value first,
or just write a new EITR value without reading it.
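
A sketch of the first option, clearing the stale interval bits before
programming the new value (IXGBE_EITR_ITR_INT_MASK, 0x00000FF8, is assumed
here to be available from the base code headers):

   eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
   /* Drop any previously programmed interval before setting the new one. */
   eitr &= ~IXGBE_EITR_ITR_INT_MASK;
   eitr |= (2000 | IXGBE_EITR_CNT_WDIS);
   IXGBE_WRITE_REG(hw, IXGBE_EITR(rxq->reg_idx), eitr);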

> 
> >
> >>>> +
> >>>> +			IXGBE_WRITE_REG(hw, IXGBE_RSCCTL(rxq->reg_idx), rscctl);
> >>>> +			IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx),
> >>>> +								       psrtype);
> >>>> +			IXGBE_WRITE_REG(hw, IXGBE_EITR(rxq->reg_idx), eitr);
> >>>> +
> >>>> +			/*
> >>>> +			 * RSC requires the mapping of the queue to the
> >>>> +			 * interrupt vector.
> >>>> +			 */
> >>>> +			ixgbe_set_ivar(dev, rxq->reg_idx, i, 0);
> >>> Hm, wonder why do we need to setup IVAR for RSC?
> >>> Wouldn't just setting EITR be enough?
> >> Nope. See 82599 spec chapter 4.6.7.2.2.
> I read it, though it doesn't say 'IVAR must be set up' like it does for EITR.Interval.
> 
> 82599 Spec, Chapter 4.6.7.2.2 ("RSC Enablement" -> "Per Queue Setting"),
> the last bullet:
> 
> "- Map the relevant Rx queues to an interrupt by setting the relevant IVAR
> registers."
> 
> > That made me think that it might be optional.
> >
> >> I think I even tried not to map
> >> the queues to IVAR and it didn't work... ;)
> > Pity, but not much we can do in that case, I suppose.
> >
> >>>> +
> >>>> +			rxq->rsc_en = 1;
> >>>> +		}
> >>>>    	}
> >>>>
> >>>>    	if (rx_conf->enable_scatter)
> >>>>    		dev->data->scattered_rx = 1;
> >>>>
> >>>> +	if (rx_conf->enable_lro)
> >>>> +		dev->data->lro = 1;
> >>>> +
> >>>>    	set_rx_function(dev);
> >>>>
> >>>>    	/*
> >>>> @@ -3742,6 +4281,19 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >>>>    		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
> >>>>    	}
> >>>>
> >>>> +	/* Finalize RSC configuration  */
> >>>> +	if (rx_conf->enable_lro) {
> >>>> +		/*
> >>>> +		 * Follow the instructions in the 4.6.7.2.1 of the Spec Rev. 3.0
> >>>> +		 */
> >>>> +		rdrxctl = IXGBE_READ_REG(hw, IXGBE_RDRXCTL);
> >>>> +		rdrxctl |= IXGBE_RDRXCTL_RSCACKC;
> >>>> +		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
> >>>> +
> >>>> +		PMD_INIT_LOG(INFO, "enabling LRO mode");
> >>>> +	}
> >>>> +
> >>>> +
> >>>>    	return 0;
> >>>>    }
> >>>>
> >>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> >>>> index bbe5ff3..389173f 100644
> >>>> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> >>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> >>>> @@ -79,6 +79,10 @@ struct igb_rx_entry {
> >>>>    	struct rte_mbuf *mbuf; /**< mbuf associated with RX descriptor. */
> >>>>    };
> >>>>
> >>>> +struct igb_rsc_entry {
> >>>> +	struct rte_mbuf *fbuf; /**< First segment of the fragmented packet. */
> >>>> +};
> >>>> +
> >>>>    /**
> >>>>     * Structure associated with each descriptor of the TX ring of a TX queue.
> >>>>     */
> >>>> @@ -105,6 +109,7 @@ struct igb_rx_queue {
> >>>>    	volatile uint32_t   *rdt_reg_addr; /**< RDT register address. */
> >>>>    	volatile uint32_t   *rdh_reg_addr; /**< RDH register address. */
> >>>>    	struct igb_rx_entry *sw_ring; /**< address of RX software ring. */
> >>>> +	struct igb_rsc_entry *sw_rsc_ring; /**< address of RSC software ring. */
> >>>>    	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet. */
> >>>>    	struct rte_mbuf *pkt_last_seg; /**< Last segment of current packet. */
> >>>>    	uint64_t            mbuf_initializer; /**< value to init mbufs */
> >>>> @@ -126,6 +131,7 @@ struct igb_rx_queue {
> >>>>    	uint8_t             port_id;  /**< Device port identifier. */
> >>>>    	uint8_t             crc_len;  /**< 0 if CRC stripped, 4 otherwise. */
> >>>>    	uint8_t             drop_en;  /**< If not 0, set SRRCTL.Drop_En. */
> >>>> +	uint8_t             rsc_en;   /**< If not 0, RSC is enabled. */
> >>>>    	uint8_t             rx_deferred_start; /**< not in global dev start. */
> >>>>    #ifdef RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC
> >>>>    	/** need to alloc dummy mbuf, for wraparound when scanning hw ring */
> >>>> --
> >>>> 2.1.0

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
  2015-03-11 16:32           ` Ananyev, Konstantin
@ 2015-03-11 16:54             ` Vlad Zolotarov
  2015-03-13  9:07               ` Olivier MATZ
  0 siblings, 1 reply; 18+ messages in thread
From: Vlad Zolotarov @ 2015-03-11 16:54 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev



On 03/11/15 18:32, Ananyev, Konstantin wrote:
>
>> -----Original Message-----
>> From: Vlad Zolotarov [mailto:vladz@cloudius-systems.com]
>> Sent: Tuesday, March 10, 2015 9:36 PM
>> To: Ananyev, Konstantin; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
>>
>>
>>
>> On 03/10/15 22:09, Ananyev, Konstantin wrote:
>>>>> Hi Vlad,
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: dev [mailto:dev-bounces at dpdk.org] On Behalf Of Vlad Zolotarov
>>>>>> Sent: Monday, March 09, 2015 7:07 PM
>>>>>> To: dev at dpdk.org
>>>>>> Subject: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
>>>>>>
>>>>>>        - Only x540 and 82599 devices support LRO.
>>>>>>        - Add the appropriate HW configuration.
>>>>>>        - Add RSC aware rx_pkt_burst() handlers:
>>>>>>           - Implemented bulk allocation and non-bulk allocation versions.
>>>>>>           - Add LRO-specific fields to rte_eth_rxmode, to rte_eth_dev_data
>>>>>>             and to igb_rx_queue.
>>>>>>           - Use the appropriate handler when LRO is requested.
>>>>>>
>>>>>> Signed-off-by: Vlad Zolotarov <vladz at cloudius-systems.com>
>>>>>> ---
>>>>>> New in v5:
>>>>>>       - Put the RTE_ETHDEV_HAS_LRO_SUPPORT definition at the beginning of rte_ethdev.h.
>>>>>>       - Removed the "TODO: Remove me" comment near RTE_ETHDEV_HAS_LRO_SUPPORT.
>>>>>>
>>>>>> New in v4:
>>>>>>       - Define RTE_ETHDEV_HAS_LRO_SUPPORT in rte_ethdev.h instead of
>>>>>>         RTE_ETHDEV_LRO_SUPPORT defined in config/common_linuxapp.
>>>>>>
>>>>>> New in v2:
>>>>>>       - Removed rte_eth_dev_data.lro_bulk_alloc.
>>>>>>       - Fixed a few styling and spelling issues.
>>>>>> ---
>>>>>>     lib/librte_ether/rte_ethdev.h       |   9 +-
>>>>>>     lib/librte_pmd_ixgbe/ixgbe_ethdev.c |   6 +
>>>>>>     lib/librte_pmd_ixgbe/ixgbe_ethdev.h |   5 +
>>>>>>     lib/librte_pmd_ixgbe/ixgbe_rxtx.c   | 562 +++++++++++++++++++++++++++++++++++-
>>>>>>     lib/librte_pmd_ixgbe/ixgbe_rxtx.h   |   6 +
>>>>>>     5 files changed, 581 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
>>>>>> index 8db3127..44f081f 100644
>>>>>> --- a/lib/librte_ether/rte_ethdev.h
>>>>>> +++ b/lib/librte_ether/rte_ethdev.h
>>>>>> @@ -172,6 +172,9 @@ extern "C" {
>>>>>>
>>>>>>     #include <stdint.h>
>>>>>>
>>>>>> +/* Use this macro to check if LRO API is supported */
>>>>>> +#define RTE_ETHDEV_HAS_LRO_SUPPORT
>>>>>> +
>>>>>>     #include <rte_log.h>
>>>>>>     #include <rte_interrupts.h>
>>>>>>     #include <rte_pci.h>
>>>>>> @@ -320,14 +323,15 @@ struct rte_eth_rxmode {
>>>>>>     	enum rte_eth_rx_mq_mode mq_mode;
>>>>>>     	uint32_t max_rx_pkt_len;  /**< Only used if jumbo_frame enabled. */
>>>>>>     	uint16_t split_hdr_size;  /**< hdr buf size (header_split enabled).*/
>>>>>> -	uint8_t header_split : 1, /**< Header Split enable. */
>>>>>> +	uint16_t header_split : 1, /**< Header Split enable. */
>>>>>>     		hw_ip_checksum   : 1, /**< IP/UDP/TCP checksum offload enable. */
>>>>>>     		hw_vlan_filter   : 1, /**< VLAN filter enable. */
>>>>>>     		hw_vlan_strip    : 1, /**< VLAN strip enable. */
>>>>>>     		hw_vlan_extend   : 1, /**< Extended VLAN enable. */
>>>>>>     		jumbo_frame      : 1, /**< Jumbo Frame Receipt enable. */
>>>>>>     		hw_strip_crc     : 1, /**< Enable CRC stripping by hardware. */
>>>>>> -		enable_scatter   : 1; /**< Enable scatter packets rx handler */
>>>>>> +		enable_scatter   : 1, /**< Enable scatter packets rx handler */
>>>>>> +		enable_lro       : 1; /**< Enable LRO */
>>>>>>     };
>>>>>>
>>>>>>     /**
>>>>>> @@ -1515,6 +1519,7 @@ struct rte_eth_dev_data {
>>>>>>     	uint8_t port_id;           /**< Device [external] port identifier. */
>>>>>>     	uint8_t promiscuous   : 1, /**< RX promiscuous mode ON(1) / OFF(0). */
>>>>>>     		scattered_rx : 1,  /**< RX of scattered packets is ON(1) / OFF(0) */
>>>>>> +		lro          : 1,  /**< RX LRO is ON(1) / OFF(0) */
>>>>>>     		all_multicast : 1, /**< RX all multicast mode ON(1) / OFF(0). */
>>>>>>     		dev_started : 1;   /**< Device state: STARTED(1) / STOPPED(0). */
>>>>>>     };
>>>>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>>>>>> index 9d3de1a..765174d 100644
>>>>>> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>>>>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>>>>>> @@ -1648,6 +1648,7 @@ ixgbe_dev_stop(struct rte_eth_dev *dev)
>>>>>>
>>>>>>     	/* Clear stored conf */
>>>>>>     	dev->data->scattered_rx = 0;
>>>>>> +	dev->data->lro = 0;
>>>>>>     	hw->rx_bulk_alloc_allowed = false;
>>>>>>     	hw->rx_vec_allowed = false;
>>>>>>
>>>>>> @@ -2018,6 +2019,11 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
>>>>>>     		DEV_RX_OFFLOAD_IPV4_CKSUM |
>>>>>>     		DEV_RX_OFFLOAD_UDP_CKSUM  |
>>>>>>     		DEV_RX_OFFLOAD_TCP_CKSUM;
>>>>>> +
>>>>>> +	if (hw->mac.type == ixgbe_mac_82599EB ||
>>>>>> +	    hw->mac.type == ixgbe_mac_X540)
>>>>>> +		dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_TCP_LRO;
>>>>>> +
>>>>>>     	dev_info->tx_offload_capa =
>>>>>>     		DEV_TX_OFFLOAD_VLAN_INSERT |
>>>>>>     		DEV_TX_OFFLOAD_IPV4_CKSUM  |
>>>>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>>>>>> index a549f5c..e206584 100644
>>>>>> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>>>>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>>>>>> @@ -349,6 +349,11 @@ uint16_t ixgbe_recv_pkts_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
>>>>>>     uint16_t ixgbe_recv_scattered_pkts(void *rx_queue,
>>>>>>     		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>>>>>>
>>>>>> +uint16_t ixgbe_recv_pkts_lro(void *rx_queue,
>>>>>> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>>>>>> +uint16_t ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue,
>>>>>> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>>>>>> +
>>>>>>     uint16_t ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>>>>>>     		uint16_t nb_pkts);
>>>>>>
>>>>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>>>>>> index 58e619b..944c662 100644
>>>>>> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>>>>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>>>>>> @@ -1366,6 +1366,15 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>>>>>>     }
>>>>>>
>>>>>>     /**
>>>>>> + * Detect an RSC descriptor.
>>>>>> + */
>>>>>> +static inline uint32_t ixgbe_rsc_count(union ixgbe_adv_rx_desc *rx)
>>>>>> +{
>>>>>> +	return (rte_le_to_cpu_32(rx->wb.lower.lo_dword.data) &
>>>>>> +		IXGBE_RXDADV_RSCCNT_MASK) >> IXGBE_RXDADV_RSCCNT_SHIFT;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>>      * Initialize the first mbuf of the returned packet:
>>>>>>      *    - RX port identifier,
>>>>>>      *    - hardware offload data, if any:
>>>>>> @@ -1410,6 +1419,291 @@ static inline void ixgbe_fill_cluster_head_buf(
>>>>>>     	}
>>>>>>     }
>>>>>>
>>>>>> +/**
>>>>>> + * Bulk receive handler for the LRO case.
>>>>>> + *
>>>>>> + * @rx_queue Rx queue handle
>>>>>> + * @rx_pkts table of received packets
>>>>>> + * @nb_pkts size of rx_pkts table
>>>>>> + * @bulk_alloc if TRUE bulk allocation is used for a HW ring refilling
>>>>>> + *
>>>>>> + * Handles the Rx HW ring completions when RSC feature is configured. Uses an
>>>>>> + * additional ring of igb_rsc_entry's that will hold the relevant RSC info.
>>>>>> + *
>>>>>> + * We use the same logic as in Linux and in FreeBSD ixgbe drivers:
>>>>>> + * 1) When non-EOP RSC completion arrives:
>>>>>> + *    a) Update the HEAD of the current RSC aggregation cluster with the new
>>>>>> + *       segment's data length.
>>>>>> + *    b) Set the "next" pointer of the current segment to point to the segment
>>>>>> + *       at the NEXTP index.
>>>>>> + *    c) Pass the HEAD of RSC aggregation cluster on to the next NEXTP entry
>>>>>> + *       in the sw_rsc_ring.
>>>>>> + * 2) When EOP arrives we just update the cluster's total length and offload
>>>>>> + *    flags and deliver the cluster up to the upper layers. In our case - put it
>>>>>> + *    in the rx_pkts table.
>>>>>> + *
>>>>>> + * Returns the number of received packets/clusters (according to the "bulk
>>>>>> + * receive" interface).
>>>>>> + */
>>>>>> +static inline uint16_t
>>>>>> +_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
>>>>>> +	       bool bulk_alloc)
>>>>>> +{
>>>>>> +	struct igb_rx_queue *rxq = rx_queue;
>>>>>> +	volatile union ixgbe_adv_rx_desc *rx_ring = rxq->rx_ring;
>>>>>> +	struct igb_rx_entry *sw_ring = rxq->sw_ring;
>>>>>> +	struct igb_rsc_entry *sw_rsc_ring = rxq->sw_rsc_ring;
>>>>>> +	uint16_t rx_id = rxq->rx_tail;
>>>>>> +	uint16_t nb_rx = 0;
>>>>>> +	uint16_t nb_hold = rxq->nb_rx_hold;
>>>>>> +	uint16_t prev_id = rxq->rx_tail;
>>>>>> +
>>>>>> +	while (nb_rx < nb_pkts) {
>>>>>> +		bool eop;
>>>>>> +		struct igb_rx_entry *rxe;
>>>>>> +		struct igb_rsc_entry *rsc_entry;
>>>>>> +		struct igb_rsc_entry *next_rsc_entry;
>>>>>> +		struct igb_rx_entry *next_rxe;
>>>>>> +		struct rte_mbuf *first_seg;
>>>>>> +		struct rte_mbuf *rxm;
>>>>>> +		struct rte_mbuf *nmb;
>>>>>> +		union ixgbe_adv_rx_desc rxd;
>>>>>> +		uint16_t data_len;
>>>>>> +		uint16_t next_id;
>>>>>> +		volatile union ixgbe_adv_rx_desc *rxdp;
>>>>>> +		uint32_t staterr;
>>>>>> +
>>>>>> +next_desc:
>>>>>> +		/*
>>>>>> +		 * The code in this whole file uses the volatile pointer to
>>>>>> +		 * ensure the read ordering of the status and the rest of the
>>>>>> +		 * descriptor fields (on the compiler level only!!!). This is so
>>>>>> +		 * UGLY - why not to just use the compiler barrier instead? DPDK
>>>>>> +		 * even has the rte_compiler_barrier() for that.
>>>>>> +		 *
>>>>>> +		 * But most importantly this is just wrong because this doesn't
>>>>>> +		 * ensure memory ordering in a general case at all. For
>>>>>> +		 * instance, DPDK is supposed to work on Power CPUs where
>>>>>> +		 * compiler barrier may just not be enough!
>>>>>> +		 *
>>>>>> +		 * I tried to write only this function properly to have a
>>>>>> +		 * starting point (as a part of an LRO/RSC series) but the
>>>>>> +		 * compiler cursed at me when I tried to cast away the
>>>>>> +		 * "volatile" from rx_ring (yes, it's volatile too!!!). So, I'm
>>>>>> +		 * keeping it the way it is for now.
>>>>>> +		 *
>>>>>> +		 * The code in this file is broken in so many other places and
>>>>>> +		 * will just not work on a big endian CPU anyway therefore the
>>>>>> +		 * lines below will have to be revisited together with the rest
>>>>>> +		 * of the ixgbe PMD.
>>>>>> +		 *
>>>>>> +		 * TODO:
>>>>>> +		 *    - Get rid of "volatile" crap and let the compiler do its
>>>>>> +		 *      job.
>>>>>> +		 *    - Use the proper memory barrier (rte_rmb()) to ensure the
>>>>>> +		 *      memory ordering below.
>>>>> Ok, so you wanted to put rte_rmb(), straight after:
>>>>> staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>>>> correct?
>>>>> I agree that for machines with relaxed memory model (PPC) we do need it here.
>>>>> So why not just put it there, instead of complaining about it in comments? ;)
>>>> Because it's not a proper fix and I don't like workarounds.
>>> Why not? For machines with relaxed memory model you would need rmb() here no matter whether rxdp points to volatile or not.
>>>
>>>>> About rxdp being pointer to volatile, why it bothers you that much?
>>>> Because using "volatile" prevents the compiler from optimizing every
>>>> code piece where the "volatile" variable participates, and that's a
>>>> shame.
>>>> Read this
>>>> https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
>>>> for a more detailed explanation.
>>>>
>>>>> You copy the whole RXD to the local variable anyway, and then reference it only to setup new addresses.
>>>> The fact that we have to copy the whole descriptor while we may not need
>>>> all the data from it at the end is one problem.
>>> I understand that, but I don't think that the difference would be that critical.
>>> Though I don't have any data in hand to compare.
>>>
>>>> The proper solution in Rx ring context should go as follows:
>>>>
>>>>    1. Remove the "volatile" qualifier from rx_ring (HW Rx descriptors ring).
>>>>    2. Remove "volatile" at all places where rx_ring is accessed.
>>>>    3. Adjust the code in (2):
>>>>        1. Remove the descriptor copy u've mentioned and access the
>>>>           descriptor data directly.
>>>>        2. Ensure the proper ordering by using the proper memory barriers,
>>>>           which are missing in the DPDK SDK at the moment (see a small
>>>>           discussion about this with Stephen and Avi on "[dpdk-dev]
>>>>           [PATCH v1 5/5] ixgbe: Add LRO support" thread).
>>> I think you are mixing 2 different issues here:
>>>
>>> 1.  For architectures with relaxed memory model we do need rmb() after that line:
>>> staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>> We do need it *always*, not depending on is rx_ring a volatile or not.
>>> If we really plan to support PPC and other architectures that allow read reordering  -
>>> not having an 'rmb()' or similar sync primitive here is a bug.
>>> Same thing applies to 'wmb()' before updating RDT.
>>>
>>> 2. volatile rx_ring vs non-volatile with explicit memory ordering intrinsics.
>>> Actually I think that using volatile rx_ring is not a real bug in itself.
>>> Code with volatile rx_ring and fix for #1 in place would work correctly on all architectures.
>>> It might be slower than non-volatile approach, but nothing would be broken.
>>>
>>> About the existing RX/TX functions and PPC support:
>>> Note that all of them were created before PPC support for DPDK was introduced.
>>> At that moment only IA was supported.
>>> That's why in some places where you would expect to see 'mb()' there are 'volatile' and/or ' rte_compiler_barrier' instead.
>>> Why all that places wasn't updated when PPC support was added - that's another question.
>>>   From my understanding - with current implementation some of DPDK PMDs RX/TX functions and  rte_ring wouldn't work correctly
>> on PPC.
>>> So, I suppose we need to decide for ourselves - do we really want to support PPC and other architectures with non-IA memory
>> model or not?
>>> If not, then I think we don't need any mb()s inside recv_pkts_lro() - just rte_compiler_barrier seems enough, and no point to
>> complain about
>>> it in comments.
>>> If yes - then why to introduce a new function with a known potential bug?
>> In order to introduce a new function with the proper implementation or
>> to fix any other places with the similar weakness I would need a proper
>> tools like a proper platform-dependent barrier-macros similar to
>> smp_Xmb() Linux macros that reduce to a compiler barrier where
>> appropriate or to a proper memory fence where needed.
> I understand that.
> Let's add new macro for that: rte_smp_Xmb() or something,
> so it would be just rte_compiler_barrier() for x86 and a proper mb() for PPC.

There was an idea to use the C11 built-in memory barriers. I suggest we 
open a separate discussion about that and add these and the appropriate 
fixes in a separate series. There are quite a few places to fix anyway, 
which are currently broken on PPC so this patch doesn't make things any 
worse. However adding a new memory barrier doesn't belong to an LRO 
functionality and thus to this series.

>
>> Unfortunately DPDK doesn't have such at the moment. That's why I put a
>> big fat comment at the place that has to be fixed once they are introduced.
>>
>> U are right though about "volatile" thing not being a bug but it would
>> be strange to keep it after barriers are properly placed. That's why I
>> think these 2 changes should go together.
> Yep, with explicit memory ordering volatile will become redundant and could be removed.
> Though, I don't see why it should be applied separately.
>  From my point: first is a bug fix, second is an enhancement.

Exactly! And this series is about the LRO feature and not about the bug 
fix u've mentioned above. There should be a separate series that fixes 
this bug in all places in the code.

>
>> About the "decision" we have to make - I think it has been decided
>> already since PPC is one of the official DPDK targets. Therefore the only
>> thing to decide here is when and who gets to fix these things. One thing
>> is obvious - this patch is not the right place to do it. ;)
> My thought was to introduce such macro(s) and start using it with that patch :)
> But ok, if you feel like it's too much for that patch, let's leave it as it is right now.

I've got my point! ;) I think we have to properly discuss it first since 
there are a few ways to do this.
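
Just to sketch one of the options on the table (purely illustrative - the
names are made up and the existing IA definitions are assumed to stay as
they are):

/* Hypothetical per-arch "SMP" barriers: a compiler barrier is enough on
 * IA's strong memory model, while PPC needs a real fence. */
#ifdef RTE_ARCH_PPC_64
#define rte_smp_rmb()	rte_rmb()
#define rte_smp_wmb()	rte_wmb()
#else
#define rte_smp_rmb()	rte_compiler_barrier()
#define rte_smp_wmb()	rte_compiler_barrier()
#endif

With something like that, _recv_pkts_lro() could use rte_smp_rmb() right
after reading status_error and rte_smp_wmb() before the RDT update,
without paying the fence cost on x86.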

>
>>>> As it sounds this is going to be a VERY sensitive patchset.
>>>> That's why it should go separately from this patchwork (or from any
>>>> other patchwork).
>>> For that patch, I am not suggesting you change any other functions, just the one you are introducing.
>> I don't think that putting an lfence on x86 there is a good idea. As
>> I've just explained above - once DPDK has proper platform-dependent
>> rmb() macros I'll gladly revisit these lines.
> Sure, plain rte_rmb() would slowdown things a lot here.
> Totally agree with you here - it should be platform dependent macro, see above.
>
>> Frankly, the same could be
>> said about the rte_wmb() before the RDT update but it is much less
>> harmful than lfence so I didn't raise it... ;)
>>
>>>>>> +		 */
>>>>>> +		rxdp = &rx_ring[rx_id];
>>>>>> +		staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>>>>> +
>>>>>> +		if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>>>>> +			break;
>>>>>> +
>>>>>> +		rxd = *rxdp;
>>>>>> +
>>>>>> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
>>>>>> +				  "staterr=0x%x data_len=%u",
>>>>>> +			   rxq->port_id, rxq->queue_id, rx_id, staterr,
>>>>>> +			   rte_le_to_cpu_16(rxd.wb.upper.length));
>>>>>> +
>>>>>> +		if (!bulk_alloc) {
>>>>>> +			nmb = rte_rxmbuf_alloc(rxq->mb_pool);
>>>>>> +			if (nmb == NULL) {
>>>>>> +				PMD_RX_LOG(DEBUG, "RX mbuf alloc failed "
>>>>>> +						  "port_id=%u queue_id=%u",
>>>>>> +					   rxq->port_id, rxq->queue_id);
>>>>>> +
>>>>>> +				rte_eth_devices[rxq->port_id].data->
>>>>>> +							rx_mbuf_alloc_failed++;
>>>>>> +				break;
>>>>>> +			}
>>>>>> +		} else if (nb_hold > rxq->rx_free_thresh) {
>>>>>> +			uint16_t next_rdt = rxq->rx_free_trigger;
>>>>>> +
>>>>>> +			if (!ixgbe_rx_alloc_bufs(rxq, false)) {
>>>>>> +				rte_wmb();
>>>>>> +				IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr,
>>>>>> +						    next_rdt);
>>>>>> +				nb_hold -= rxq->rx_free_thresh;
>>>>>> +			} else {
>>>>>> +				PMD_RX_LOG(DEBUG, "RX bulk alloc failed "
>>>>>> +						  "port_id=%u queue_id=%u",
>>>>>> +					   rxq->port_id, rxq->queue_id);
>>>>>> +
>>>>>> +				rte_eth_devices[rxq->port_id].data->
>>>>>> +							rx_mbuf_alloc_failed++;
>>>>>> +				break;
>>>>>> +			}
>>>>>> +		}
>>>>>> +
>>>>>> +		nb_hold++;
>>>>>> +		rxe = &sw_ring[rx_id];
>>>>>> +		eop = staterr & IXGBE_RXDADV_STAT_EOP;
>>>>>> +
>>>>>> +		next_id = rx_id + 1;
>>>>>> +		if (next_id == rxq->nb_rx_desc)
>>>>>> +			next_id = 0;
>>>>>> +
>>>>>> +		/* Prefetch next mbuf while processing current one. */
>>>>>> +		rte_ixgbe_prefetch(sw_ring[next_id].mbuf);
>>>>>> +
>>>>>> +		/*
>>>>>> +		 * When next RX descriptor is on a cache-line boundary,
>>>>>> +		 * prefetch the next 4 RX descriptors and the next 4 pointers
>>>>>> +		 * to mbufs.
>>>>>> +		 */
>>>>>> +		if ((next_id & 0x3) == 0) {
>>>>>> +			rte_ixgbe_prefetch(&rx_ring[next_id]);
>>>>>> +			rte_ixgbe_prefetch(&sw_ring[next_id]);
>>>>>> +		}
>>>>>> +
>>>>>> +		rxm = rxe->mbuf;
>>>>>> +
>>>>>> +		if (!bulk_alloc) {
>>>>>> +			__le64 dma =
>>>>>> +			  rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(nmb));
>>>>>> +			/*
>>>>>> +			 * Update RX descriptor with the physical address of the
>>>>>> +			 * new data buffer of the new allocated mbuf.
>>>>>> +			 */
>>>>>> +			rxe->mbuf = nmb;
>>>>>> +
>>>>>> +			rxm->data_off = RTE_PKTMBUF_HEADROOM;
>>>>>> +			rxdp->read.hdr_addr = dma;
>>>>>> +			rxdp->read.pkt_addr = dma;
>>>>>> +		}
>>>>>> +		/*
>>>>>> +		 * Set data length & data buffer address of mbuf.
>>>>>> +		 */
>>>>>> +		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
>>>>>> +		rxm->data_len = data_len;
>>>>>> +
>>>>>> +		if (!eop) {
>>>>>> +			uint16_t nextp_id;
>>>>>> +			/*
>>>>>> +			 * Get next descriptor index:
>>>>>> +			 *  - For RSC it's in the NEXTP field.
>>>>>> +			 *  - For a scattered packet - it's just a following
>>>>>> +			 *    descriptor.
>>>>>> +			 */
>>>>>> +			if (ixgbe_rsc_count(&rxd))
>>>>>> +				nextp_id =
>>>>>> +					(staterr & IXGBE_RXDADV_NEXTP_MASK) >>
>>>>>> +						       IXGBE_RXDADV_NEXTP_SHIFT;
>>>>>> +			else
>>>>>> +				nextp_id = next_id;
>>>>>> +
>>>>>> +			next_rsc_entry = &sw_rsc_ring[nextp_id];
>>>>>> +			next_rxe = &sw_ring[nextp_id];
>>>>>> +			rte_ixgbe_prefetch(next_rxe);
>>>>>> +		}
>>>>>> +
>>>>>> +		rsc_entry = &sw_rsc_ring[rx_id];
>>>>>> +		first_seg = rsc_entry->fbuf;
>>>>>> +		rsc_entry->fbuf = NULL;
>>>>>> +
>>>>>> +		/*
>>>>>> +		 * If this is the first buffer of the received packet,
>>>>>> +		 * set the pointer to the first mbuf of the packet and
>>>>>> +		 * initialize its context.
>>>>>> +		 * Otherwise, update the total length and the number of segments
>>>>>> +		 * of the current scattered packet, and update the pointer to
>>>>>> +		 * the last mbuf of the current packet.
>>>>>> +		 */
>>>>>> +		if (first_seg == NULL) {
>>>>>> +			first_seg = rxm;
>>>>>> +			first_seg->pkt_len = data_len;
>>>>>> +			first_seg->nb_segs = 1;
>>>>>> +		} else {
>>>>>> +			first_seg->pkt_len += data_len;
>>>>>> +			first_seg->nb_segs++;
>>>>>> +		}
>>>>>> +
>>>>>> +		prev_id = rx_id;
>>>>>> +		rx_id = next_id;
>>>>>> +
>>>>>> +		/*
>>>>>> +		 * If this is not the last buffer of the received packet, update
>>>>>> +		 * the pointer to the first mbuf at the NEXTP entry in the
>>>>>> +		 * sw_rsc_ring and continue to parse the RX ring.
>>>>>> +		 */
>>>>>> +		if (!eop) {
>>>>>> +			rxm->next = next_rxe->mbuf;
>>>>>> +			next_rsc_entry->fbuf = first_seg;
>>>>>> +			goto next_desc;
>>>>> So _recv_pkts_lro() can return with one of rxq->rsc_entry[i] != NULL, correct?
>>>>> If so, then I think you need to add code at ixgbe_rx_queue_release_mbufs() that would go through
>>>>> all rsc_entry[] entries, find the ones whose fbuf is != NULL, call rte_pktmbuf_free() for them and reset them to NULL.
>>>>>     To handle the case:
>>>>> recv_pkts_lro(rxq, ...);
>>>>> rte_eth_dev_stop();
>>>>> rte_eth_dev_start();
>>>>> recv_pkts_lro(rxq, ...);
>>>> Right. I've missed that part.
>>>>
>>>>> BTW, that also means that you can't do:
>>>>> rxm->next = next_rxe->mbuf;
>>>>> above, and
>>>>> rxm->next = NULL;
>>>>> should be done before 'goto next_desc;' too
>>>> Your proposal will cost cycles in the fast path on account of saving
>>>> cycles in the slow path: we'll have to add another pointer to the
>>>> igb_rsc_entry to hold the last mbuf in the current cluster that we'll
>>>> have to read and update for every new completed RSC descriptor.
>>>>
>>>> The easier way would be to just reset the next-pointer of the last
>>>> descriptor in the RSC cluster to NULL (according to nb_segs) before
>>>> calling rte_pktmbuf_free() in ixgbe_rx_queue_release_mbufs().
>>> Should work too, I think.
>> The final solution is even nicer - see v7. And it works like a charm
>> too... ;)
> Good to hear :)
>
>
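
For the record, the idea is roughly the following (just a sketch of what
was discussed above - the actual v7 code may differ):

/* Sketch: drop a pending (not yet completed) RSC cluster on queue release.
 * The last segment's "next" pointer may still reference an mbuf owned by
 * sw_ring, so cut the chain according to nb_segs before freeing. */
static void
ixgbe_free_rsc_cluster(struct igb_rsc_entry *rsc_entry)
{
	struct rte_mbuf *first = rsc_entry->fbuf;

	if (first != NULL) {
		struct rte_mbuf *seg = first;
		uint16_t i;

		/* Walk to the last segment accumulated so far... */
		for (i = 1; i < first->nb_segs; i++)
			seg = seg->next;

		/* ...detach it from the sw_ring mbuf it points to... */
		seg->next = NULL;

		/* ...and free the whole chain. */
		rte_pktmbuf_free(first);
		rsc_entry->fbuf = NULL;
	}
}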
>>>>>> +		}
>>>>>> +
>>>>>> +		/*
>>>>>> +		 * This is the last buffer of the received packet - return
>>>>>> +		 * the current cluster to the user.
>>>>>> +		 */
>>>>>> +		rxm->next = NULL;
>>>>>> +
>>>>>> +		/* Initialize the first mbuf of the returned packet */
>>>>>> +		ixgbe_fill_cluster_head_buf(first_seg, &rxd, rxq->port_id,
>>>>>> +					    staterr);
>>>>>> +
>>>>>> +		/* Prefetch data of first segment, if configured to do so. */
>>>>>> +		rte_packet_prefetch((char *)first_seg->buf_addr +
>>>>>> +			first_seg->data_off);
>>>>>> +
>>>>>> +		/*
>>>>>> +		 * Store the mbuf address into the next entry of the array
>>>>>> +		 * of returned packets.
>>>>>> +		 */
>>>>>> +		rx_pkts[nb_rx++] = first_seg;
>>>>>> +	}
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * Record index of the next RX descriptor to probe.
>>>>>> +	 */
>>>>>> +	rxq->rx_tail = rx_id;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * If the number of free RX descriptors is greater than the RX free
>>>>>> +	 * threshold of the queue, advance the Receive Descriptor Tail (RDT)
>>>>>> +	 * register.
>>>>>> +	 * Update the RDT with the value of the last processed RX descriptor
>>>>>> +	 * minus 1, to guarantee that the RDT register is never equal to the
>>>>>> +	 * RDH register, which creates a "full" ring situation from the
>>>>>> +	 * hardware point of view...
>>>>>> +	 */
>>>>>> +	if (!bulk_alloc && nb_hold > rxq->rx_free_thresh) {
>>>>>> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_tail=%u "
>>>>>> +			   "nb_hold=%u nb_rx=%u",
>>>>>> +			   rxq->port_id, rxq->queue_id, rx_id, nb_hold, nb_rx);
>>>>>> +
>>>>> I suppose if you do wmb() after rte_rxmbuf_alloc(), you'd better do it here too.
>>>> Right! Missed that when copied this code from
>>>> ixgbe_recv_scattered_pkts()... ;) Note that the barrier is missing there
>>>> too...
>>>> These are the examples of the code that works on x86 only because of
>>>> that "volatile" thing and will break once it's removed. On PPC it is
>>>> broken even with "volatile".
>>> Yep, as I said above -for IA we don't need mb() here - using 'volatile' or compiler barrier seems enough to me.
>>> For PPC - I think we do.
>>>
>>>>>> +		IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, prev_id);
>>>>>> +		nb_hold = 0;
>>>>>> +	}
>>>>>> +
>>>>>> +	rxq->nb_rx_hold = nb_hold;
>>>>>> +	return nb_rx;
>>>>>> +}
>>>>>> +
>>>>>> +uint16_t
>>>>>> +ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
>>>>>> +{
>>>>>> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, false);
>>>>>> +}
>>>>>> +
>>>>>> +uint16_t
>>>>>> +ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
>>>>>> +			       uint16_t nb_pkts)
>>>>>> +{
>>>>>> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, true);
>>>>>> +}
>>>>>> +
>>>>>>     uint16_t
>>>>>>     ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>>>>>>     			  uint16_t nb_pkts)
>>>>>> @@ -2024,6 +2318,7 @@ ixgbe_rx_queue_release(struct igb_rx_queue *rxq)
>>>>>>     	if (rxq != NULL) {
>>>>>>     		ixgbe_rx_queue_release_mbufs(rxq);
>>>>>>     		rte_free(rxq->sw_ring);
>>>>>> +		rte_free(rxq->sw_rsc_ring);
>>>>>>     		rte_free(rxq);
>>>>>>     	}
>>>>>>     }
>>>>>> @@ -2146,6 +2441,7 @@ ixgbe_reset_rx_queue(struct ixgbe_hw *hw, struct igb_rx_queue *rxq)
>>>>>>     	rxq->nb_rx_hold = 0;
>>>>>>     	rxq->pkt_first_seg = NULL;
>>>>>>     	rxq->pkt_last_seg = NULL;
>>>>>> +	rxq->rsc_en = 0;
>>>>>>     }
>>>>>>
>>>>>>     int
>>>>>> @@ -2160,6 +2456,14 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
>>>>>>     	struct igb_rx_queue *rxq;
>>>>>>     	struct ixgbe_hw     *hw;
>>>>>>     	uint16_t len;
>>>>>> +	struct rte_eth_dev_info dev_info = { 0 };
>>>>>> +	struct rte_eth_rxmode *dev_rx_mode = &dev->data->dev_conf.rxmode;
>>>>>> +	bool rsc_requested = false;
>>>>>> +
>>>>>> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
>>>>>> +	if ((dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO) &&
>>>>>> +	    dev_rx_mode->enable_lro)
>>>>>> +		rsc_requested = true;
>>>>>>
>>>>>>     	PMD_INIT_FUNC_TRACE();
>>>>>>     	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>>>> @@ -2265,12 +2569,28 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
>>>>>>     	rxq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
>>>>>>     					  sizeof(struct igb_rx_entry) * len,
>>>>>>     					  RTE_CACHE_LINE_SIZE, socket_id);
>>>>>> -	if (rxq->sw_ring == NULL) {
>>>>>> +	if (!rxq->sw_ring) {
>>>>> Wonder what was wrong with that one? :)
>>>> Nothing - just aligned it with the lines I've added below. ;)
>>>>
>>>>>>     		ixgbe_rx_queue_release(rxq);
>>>>>>     		return (-ENOMEM);
>>>>>>     	}
>>>>>> -	PMD_INIT_LOG(DEBUG, "sw_ring=%p hw_ring=%p dma_addr=0x%"PRIx64,
>>>>>> -		     rxq->sw_ring, rxq->rx_ring, rxq->rx_ring_phys_addr);
>>>>>> +
>>>>>> +	if (rsc_requested) {
>>>>>> +		rxq->sw_rsc_ring =
>>>>>> +			rte_zmalloc_socket("rxq->sw_rsc_ring",
>>>>>> +					   sizeof(struct igb_rsc_entry) * len,
>>>>>> +					   RTE_CACHE_LINE_SIZE, socket_id);
>>>>>> +		if (!rxq->sw_rsc_ring) {
>>>>>> +			ixgbe_rx_queue_release(rxq);
>>>>>> +			return (-ENOMEM);
>>>>>> +		}
>>>>>> +	} else {
>>>>>> +		rxq->sw_rsc_ring = NULL;
>>>>>> +	}
>>>>>> +
>>>>>> +	PMD_INIT_LOG(DEBUG, "sw_ring=%p sw_rsc_ring=%p hw_ring=%p "
>>>>>> +			    "dma_addr=0x%"PRIx64,
>>>>>> +		     rxq->sw_ring, rxq->sw_rsc_ring, rxq->rx_ring,
>>>>>> +		     rxq->rx_ring_phys_addr);
>>>>>>
>>>>>>     	if (!rte_is_power_of_2(nb_desc)) {
>>>>>>     		PMD_INIT_LOG(DEBUG, "queue[%d] doesn't meet Vector Rx "
>>>>>> @@ -3515,6 +3835,84 @@ ixgbe_dev_mq_tx_configure(struct rte_eth_dev *dev)
>>>>>>     	return 0;
>>>>>>     }
>>>>>>
>>>>>> +/**
>>>>>> + * get_rscctl_maxdesc - Calculate the RSCCTL[n].MAXDESC for PF
>>>>>> + *
>>>>>> + * Return the RSCCTL[n].MAXDESC for 82599 and x540 PF devices according to the
>>>>>> + * spec rev. 3.0 chapter 8.2.3.8.13.
>>>>>> + *
>>>>>> + * @pool Memory pool of the Rx queue
>>>>>> + */
>>>>>> +static inline uint32_t get_rscctl_maxdesc(struct rte_mempool *pool)
>>>>>> +{
>>>>>> +	struct rte_pktmbuf_pool_private *mp_priv = rte_mempool_get_priv(pool);
>>>>>> +
>>>>>> +	/* MAXDESC * SRRCTL.BSIZEPKT must not exceed 64 KB minus one */
>>>>>> +	uint16_t maxdesc =
>>>>>> +		65535 / (mp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM);
>>>>> A  nit: use some macro (UINt16_MAX?) instead of hardcoded constant if possible.
>>>> Using UINT16_MAX here would be very confusing. The value here, just like
>>>> the values below (16, 8, 4), is explicitly stated in the
>>>> RSCCTL[n].MAXDESC description in the spec, and this code piece is
>>>> implementing what the spec demands. Therefore IMHO using the
>>>> explicit values from the spec here is the most readable way for a
>>>> reader who will try to compare this code to the spec section
>>>> mentioned above and check that the code is correct.
>>> Ok, if you think UINT16_MAX is confusing, then just add a new one: IXGBE_RSC_MAX_PACKET_SIZE or something.
>>> As I understand, that's sort of upper limit for the RSC packet size supported, right?
>> Why define a macro for a value that is not used anywhere else but
>> here and that is never going to be changed? How does it make the code
>> more readable or robust?
>>
>>>>>> +
>>>>>> +	if (maxdesc >= 16)
>>>>>> +		return IXGBE_RSCCTL_MAXDESC_16;
>>>>>> +	else if (maxdesc >= 8)
>>>>>> +		return IXGBE_RSCCTL_MAXDESC_8;
>>>>>> +	else if (maxdesc >= 4)
>>>>>> +		return IXGBE_RSCCTL_MAXDESC_4;
>>>>>> +	else
>>>>>> +		return IXGBE_RSCCTL_MAXDESC_1;
>>>>>> +}
>>>>>> +
>>>>>> +/* (Taken from FreeBSD tree)
>>>>>> +** Setup the correct IVAR register for a particular MSIX interrupt
>>>>>> +**   (yes this is all very magic and confusing :)
>>>>>> +**  - entry is the register array entry
>>>>>> +**  - vector is the MSIX vector for this queue
>>>>>> +**  - type is RX/TX/MISC
>>>>>> +*/
>>>>>> +static void
>>>>>> +ixgbe_set_ivar(struct rte_eth_dev *dev, u8 entry, u8 vector, s8 type)
>>>>>> +{
>>>>>> +	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>>>> +	u32 ivar, index;
>>>>>> +
>>>>>> +	vector |= IXGBE_IVAR_ALLOC_VAL;
>>>>>> +
>>>>>> +	switch (hw->mac.type) {
>>>>>> +
>>>>>> +	case ixgbe_mac_82598EB:
>>>>>> +		if (type == -1)
>>>>>> +			entry = IXGBE_IVAR_OTHER_CAUSES_INDEX;
>>>>>> +		else
>>>>>> +			entry += (type * 64);
>>>>>> +		index = (entry >> 2) & 0x1F;
>>>>>> +		ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(index));
>>>>>> +		ivar &= ~(0xFF << (8 * (entry & 0x3)));
>>>>>> +		ivar |= (vector << (8 * (entry & 0x3)));
>>>>>> +		IXGBE_WRITE_REG(hw, IXGBE_IVAR(index), ivar);
>>>>>> +		break;
>>>>>> +
>>>>>> +	case ixgbe_mac_82599EB:
>>>>>> +	case ixgbe_mac_X540:
>>>>>> +		if (type == -1) { /* MISC IVAR */
>>>>>> +			index = (entry & 1) * 8;
>>>>>> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR_MISC);
>>>>>> +			ivar &= ~(0xFF << index);
>>>>>> +			ivar |= (vector << index);
>>>>>> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR_MISC, ivar);
>>>>>> +		} else {	/* RX/TX IVARS */
>>>>>> +			index = (16 * (entry & 1)) + (8 * type);
>>>>>> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(entry >> 1));
>>>>>> +			ivar &= ~(0xFF << index);
>>>>>> +			ivar |= (vector << index);
>>>>>> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR(entry >> 1), ivar);
>>>>>> +		}
>>>>>> +
>>>>>> +		break;
>>>>>> +
>>>>>> +	default:
>>>>>> +		break;
>>>>>> +	}
>>>>>> +}
>>>>>> +
>>>>>>     void set_rx_function(struct rte_eth_dev *dev)
>>>>>>     {
>>>>>>     	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>>>> @@ -3565,6 +3963,25 @@ void set_rx_function(struct rte_eth_dev *dev)
>>>>>>     			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
>>>>>>     		}
>>>>>>     	}
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * Initialize the appropriate LRO callback.
>>>>>> +	 *
>>>>>> +	 * If all queues satisfy the bulk allocation preconditions
>>>>>> +	 * (hw->rx_bulk_alloc_allowed is TRUE) then we may use bulk allocation.
>>>>>> +	 * Otherwise use a single allocation version.
>>>>>> +	 */
>>>>>> +	if (dev->data->lro) {
>>>>>> +		if (hw->rx_bulk_alloc_allowed) {
>>>>>> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a bulk "
>>>>>> +					   "allocation version");
>>>>>> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro_bulk_alloc;
>>>>>> +		} else {
>>>>>> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a single "
>>>>>> +					   "allocation version");
>>>>>> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro;
>>>>>> +		}
>>>>>> +	}
>>>>>>     }
>>>>> As I understand, ixgbe_recv_pkts_lro() can handle both LRO and normal scattered packets?
>>>> Not as it is now. It may be easily patched to do so though.
>>>>
>>>>> If that is so, then can we remove ixgbe_recv_scattered_pkts() altogether and use ixgbe_recv_pkts_lro() for both cases?
>>>> This was explicitly requested from me by Bruce Richardson (see
>>>> "[dpdk-dev] : ixgbe: why bulk allocation is not used for a scattered Rx
>>>> flow?" thread) to separate the complicated handling from the simple high
>>>> performance one. The handling in the RSC routine is more generic and
>>>> thus is a bit of overkill for the simple scattered case: e.g. there is
>>>> no need for a sw_rsc_ring.
>>> I think Bruce meant ixgbe_recv_pkts_bulk_alloc() not ixgbe_recv_scattered_pkts()
>>> when he talked about the simple and high performance RX path.
>>>
>>>> Therefore I preferred to advance with small steps here. And if there
>>>> will be a decision to join these flows - it may be done with a rather
>>>> small patch in the future.
>>> Ok, that's understandable and I wouldn't insist to do that in the same patch.
>>> It just worries me that the number of our ixgbe RX functions keeps increasing.
>> Let's have this series get to the master and I'll send a follow-up
>> series that kills non-vector scatter callback. Agreed? ;)
> Yes, as I said above, am ok with that.
>
>>>
>>>>>>     /*
>>>>>> @@ -3583,10 +4000,26 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>>>>     	uint32_t maxfrs;
>>>>>>     	uint32_t srrctl;
>>>>>>     	uint32_t rdrxctl;
>>>>>> +	uint32_t rscctl;
>>>>>> +	uint32_t psrtype;
>>>>>> +	uint32_t rfctl;
>>>>>>     	uint32_t rxcsum;
>>>>>>     	uint16_t buf_size;
>>>>>>     	uint16_t i;
>>>>>>     	struct rte_eth_rxmode *rx_conf = &dev->data->dev_conf.rxmode;
>>>>>> +	struct rte_eth_dev_info dev_info = { 0 };
>>>>>> +	bool rsc_capable = false;
>>>>>> +
>>>>>> +	/* Sanity check */
>>>>>> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
>>>>>> +	if (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO)
>>>>>> +		rsc_capable = true;
>>>>> @ 7.11.1 82599 spec says:
>>>>> " Note that in SR-IOV mode the RSC must be disabled globally by setting the RFCTL.RSC_DIS bit."
>>>>> Add a check?
>>>> Good catch! Will add a check. Thanks.
>>>>
>>>>>> +
>>>>>> +	if (!rsc_capable && rx_conf->enable_lro) {
>>>>>> +		PMD_INIT_LOG(CRIT, "LRO is requested on HW that doesn't "
>>>>>> +				   "support it");
>>>>>> +		return -EINVAL;
>>>>>> +	}
>>>>>>
>>>>>>     	PMD_INIT_FUNC_TRACE();
>>>>>>     	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>>>> @@ -3606,13 +4039,44 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>>>>     	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
>>>>>>
>>>>>>     	/*
>>>>>> +	 * RFCTL configuration
>>>>>> +	 *
>>>>>> +	 * Since NFS packet coalescing is not supported - clear RFCTL.NFSW_DIS
>>>>>> +	 * and RFCTL.NFSR_DIS when RSC is enabled.
>>>>>> +	 */
>>>>>> +	if (rsc_capable) {
>>>>>> +		rfctl = IXGBE_READ_REG(hw, IXGBE_RFCTL);
>>>>>> +		if (rx_conf->enable_lro) {
>>>>>> +			rfctl &= ~(IXGBE_RFCTL_RSC_DIS | IXGBE_RFCTL_NFSW_DIS |
>>>>>> +				   IXGBE_RFCTL_NFSR_DIS);
>>>>>> +		} else {
>>>>>> +			rfctl |= IXGBE_RFCTL_RSC_DIS;
>>>>>> +		}
>>>>>> +
>>>>>> +		IXGBE_WRITE_REG(hw, IXGBE_RFCTL, rfctl);
>>>>>> +	}
>>>>>> +
>>>>>> +
>>>>>> +	/*
>>>>>>     	 * Configure CRC stripping, if any.
>>>>>>     	 */
>>>>>>     	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
>>>>>>     	if (rx_conf->hw_strip_crc)
>>>>>>     		hlreg0 |= IXGBE_HLREG0_RXCRCSTRP;
>>>>>> -	else
>>>>>> +	else {
>>>>>>     		hlreg0 &= ~IXGBE_HLREG0_RXCRCSTRP;
>>>>>> +		if (rx_conf->enable_lro) {
>>>>>> +			/*
>>>>>> +			 * According to chapter 4.6.7.2.1 of the Spec Rev.
>>>>>> +			 * 3.0, RSC configuration requires HW CRC stripping to be
>>>>>> +			 * enabled. If the user requested both HW CRC stripping off
>>>>>> +			 * and RSC on - return an error.
>>>>>> +			 */
>>>>>> +			PMD_INIT_LOG(CRIT, "LRO can't be enabled when HW CRC "
>>>>>> +					    "is disabled");
>>>>>> +			return -EINVAL;
>>>>>> +		}
>>>>>> +	}
>>>>>>
>>>>>>     	/*
>>>>>>     	 * Configure jumbo frame support, if any.
>>>>>> @@ -3664,9 +4128,18 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>>>>     		 * Configure Header Split
>>>>>>     		 */
>>>>>>     		if (rx_conf->header_split) {
>>>>>> +			/*
>>>>>> +			 * Print a warning if split_hdr_size is less
>>>>>> +			 * than 128 bytes when RSC is requested.
>>>>>> +			 */
>>>>>> +			if (rx_conf->enable_lro &&
>>>>>> +			    rx_conf->split_hdr_size < 128)
>>>>>> +				PMD_INIT_LOG(INFO, "split_hdr_size less than "
>>>>>> +						   "128 bytes (%d)!",
>>>>>> +					     rx_conf->split_hdr_size);
>>>>>> +
>>>>>>     			if (hw->mac.type == ixgbe_mac_82599EB) {
>>>>>>     				/* Must setup the PSRTYPE register */
>>>>>> -				uint32_t psrtype;
>>>>>>     				psrtype = IXGBE_PSRTYPE_TCPHDR |
>>>>>>     					IXGBE_PSRTYPE_UDPHDR   |
>>>>>>     					IXGBE_PSRTYPE_IPV4HDR  |
>>>>>> @@ -3679,7 +4152,20 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>>>>     			srrctl |= IXGBE_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
>>>>>>     		} else
>>>>>>     #endif
>>>>>> +		{
>>>>>>     			srrctl = IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF;
>>>>>> +			/*
>>>>>> +			 * Following the 4.6.7.2.1 chapter of the 82599/x540
>>>>>> +			 * Spec if RSC is enabled the SRRCTL[n].BSIZEHEADER
>>>>>> +			 * should be configured even if header split is not
>>>>>> +			 * enabled. In the latter case we will configure it to 128
>>>>>> +			 * bytes, following the recommendation in the spec.
>>>>>> +			 */
>>>>>> +			if (rx_conf->enable_lro)
>>>>>> +				srrctl |=
>>>>>> +				     ((128 << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &
>>>>>> +						    IXGBE_SRRCTL_BSIZEHDR_MASK);
>>>>>> +		}
>>>>>>
>>>>>>     		/* Set if packets are dropped when no descriptors available */
>>>>>>     		if (rxq->drop_en)
>>>>>> @@ -3696,6 +4182,13 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>>>>     				       RTE_PKTMBUF_HEADROOM);
>>>>>>     		srrctl |= ((buf_size >> IXGBE_SRRCTL_BSIZEPKT_SHIFT) &
>>>>>>     			   IXGBE_SRRCTL_BSIZEPKT_MASK);
>>>>>> +
>>>>>> +		/*
>>>>>> +		 * TODO: Consider setting the Receive Descriptor Minimum
>>>>>> +		 * Threshold Size for the RSC case. This is not an obviously
>>>>>> +		 * beneficial option but one worth considering...
>>>>>> +		 */
>>>>>> +
>>>>>>     		IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(rxq->reg_idx), srrctl);
>>>>>>
>>>>>>     		buf_size = (uint16_t) ((srrctl & IXGBE_SRRCTL_BSIZEPKT_MASK) <<
>>>>>> @@ -3705,11 +4198,57 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>>>>     		if (dev->data->dev_conf.rxmode.max_rx_pkt_len +
>>>>>>     					    2 * IXGBE_VLAN_TAG_SIZE > buf_size)
>>>>>>     			dev->data->scattered_rx = 1;
>>>>>> +
>>>>>> +		/* RSC per-queue configuration */
>>>>>> +		if (rx_conf->enable_lro) {
>>>>>> +			uint32_t eitr;
>>>>>> +
>>>>>> +			rscctl =
>>>>>> +				IXGBE_READ_REG(hw, IXGBE_RSCCTL(rxq->reg_idx));
>>>>>> +			psrtype =
>>>>>> +				IXGBE_READ_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx));
>>>>>> +			eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
>>>>>> +
>>>>>> +			rscctl |= IXGBE_RSCCTL_RSCEN;
>>>>>> +			rscctl |= get_rscctl_maxdesc(rxq->mb_pool);
>>>>>> +			psrtype |= IXGBE_PSRTYPE_TCPHDR;
>>>>>> +
>>>>>> +			/*
>>>>>> +			 * RSC: Set ITR interval corresponding to 2K ints/s.
>>>>>> +			 *
>>>>>> +			 * Full-sized RSC aggregations for a 10Gb/s link will
>>>>>> +			 * arrive at about 20K aggregation/s rate.
>>>>>> +			 *
>>>>>> +			 * A 2K ints/s rate will make only 10% of the
>>>>>> +			 * aggregations to be closed due to the interrupt timer
>>>>>> +			 * expiration for a streaming at wire-speed case.
>>>>>> +			 *
>>>>>> +			 * For a sparse streaming case this setting will yield
>>>>>> +			 * at most 500us latency for a single RSC aggregation.
>>>>>> +			 */
>>>>>> +			eitr   |= (2000 | IXGBE_EITR_CNT_WDIS);
>>>>> Again probably create some macro for ITR Interval default value here.
>>>> Well, again - it's the only place where it's used and I've extensively
>>>> explained it in the comments in the code. Therefore I think it's the
>>>> most readable way to write this.
>>>> If it would be used in at least two places - then I would have put it in
>>>> a macro...
>>> I think it is a good practice to use macros instead of raw numbers in such places.
>>> You probably can make these macros self-explanatory:
>>> /* EITR Interval in 2us units for 1G and 10G. */
>>> #define IXGBE_EITR_INTERVAL_US	2
>>>
>>> #define IXGBE_EITR_INTERVAL_SHIFT	3
>>>
>>> #define IXGBE_EITR_INTERVAL(us)	((us) / IXGBE_EITR_INTERVAL_US << IXGBE_EITR_INTERVAL_SHIFT)
>>>
>>> /* at most 500us latency for a single RSC aggregation */
>>> #define IXGBE_EITR_INTERVAL_DEFAULT  IXGBE_EITR_INTERVAL(500)
>> If this value had the potential to be changed one day, or if it were
>> going to be used somewhere else in the code, I would immediately agree,
>> but here u've added 9 long lines of something that nobody would ever
>> care about. The only thing everybody would care about is the actual
>> implication of this value on the RSC functionality. For understanding
>> that, having macros like u propose instead of a proper comment like I
>> propose doesn't help much. This is because the thing is not just about
>> the EITR interval and the maximum latency. But if we keep my comment
>> then we don't need any additional self-explanatory macros, because
>> everything has been explained in the comment already.
>>
>> If one day this parameter is going to be configured from the outside -
>> then I agree that there would be a place for macros like above. For the
>> current API state I think it would just bloat the code with useless
>> lines.
> It is a good approach to do things in a proper way from the start.
> Here you define macros and use them inside your code.
> Then, when someone else needs to manipulate the EITR interval - he can use the macros
> you defined and wouldn't need to touch your code.
> Same applies to the MAXDESC calculation above.

Ok. I'll add the macros like u ask. We spend too much time discussing 
a matter of such little importance... ;)
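
Something along these lines, I assume (a sketch based on the macros u
proposed; the final names/values may still change in v7):

/* EITR interval is expressed in 2us units and starts at bit 3. */
#define IXGBE_EITR_INTERVAL_US		2
#define IXGBE_EITR_INTERVAL_SHIFT	3
#define IXGBE_EITR_INTERVAL(us) \
	(((us) / IXGBE_EITR_INTERVAL_US) << IXGBE_EITR_INTERVAL_SHIFT)

/* At most 500us latency for a single RSC aggregation (~2K ints/s). */
#define IXGBE_EITR_INTERVAL_DEFAULT	IXGBE_EITR_INTERVAL(500)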

>   
> BTW:
> eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
> ...
> eitr   |= (2000 | IXGBE_EITR_CNT_WDIS);
>
> Could EITR already contain some previous interval value?
> If yes, then we probably either need to clear the previous interval value first,
> or just write the new EITR value without reading it.

I think u've caught an issue here! I think the best would be the first 
option - I'll clear the previous interval value and set the new one.
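
I.e. roughly (sketch only; IXGBE_EITR_ITR_INT_MASK stands for whatever
mask the base code defines for the interval field):

	eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
	/* Drop whatever interval was programmed before... */
	eitr &= ~IXGBE_EITR_ITR_INT_MASK;
	/* ...and set the new one (the 500us/2K ints/s default from above). */
	eitr |= (IXGBE_EITR_INTERVAL_DEFAULT | IXGBE_EITR_CNT_WDIS);
	IXGBE_WRITE_REG(hw, IXGBE_EITR(rxq->reg_idx), eitr);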

>
>>>>>> +
>>>>>> +			IXGBE_WRITE_REG(hw, IXGBE_RSCCTL(rxq->reg_idx), rscctl);
>>>>>> +			IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx),
>>>>>> +								       psrtype);
>>>>>> +			IXGBE_WRITE_REG(hw, IXGBE_EITR(rxq->reg_idx), eitr);
>>>>>> +
>>>>>> +			/*
>>>>>> +			 * RSC requires the mapping of the queue to the
>>>>>> +			 * interrupt vector.
>>>>>> +			 */
>>>>>> +			ixgbe_set_ivar(dev, rxq->reg_idx, i, 0);
>>>>> Hm, wonder why we need to set up IVAR for RSC?
>>>>> Wouldn't just setting EITR be enough?
>>>> Nope. See 82599 spec chapter 4.6.7.2.2.
>>> I read it, though it doesn't say 'IVAR must be set up' like it does for EITR.Interval.
>> 82599 Spec, Chapter 4.6.7.2.2 ("RSC Enablement" -> "Per Queue Setting"),
>> the last bullet:
>>
>> "- Map the relevant Rx queues to an interrupt by setting the relevant IVAR
>> registers."
>>
>>> That made me think that it might be optional.
>>>
>>>> I think I even tried not to map
>>>> the queues to IVAR and it didn't work... ;)
>>> Pity, but not much we can do in that case, I suppose.
>>>
>>>>>> +
>>>>>> +			rxq->rsc_en = 1;
>>>>>> +		}
>>>>>>     	}
>>>>>>
>>>>>>     	if (rx_conf->enable_scatter)
>>>>>>     		dev->data->scattered_rx = 1;
>>>>>>
>>>>>> +	if (rx_conf->enable_lro)
>>>>>> +		dev->data->lro = 1;
>>>>>> +
>>>>>>     	set_rx_function(dev);
>>>>>>
>>>>>>     	/*
>>>>>> @@ -3742,6 +4281,19 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>>>>     		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
>>>>>>     	}
>>>>>>
>>>>>> +	/* Finalize RSC configuration  */
>>>>>> +	if (rx_conf->enable_lro) {
>>>>>> +		/*
>>>>>> +		 * Follow the instructions in the 4.6.7.2.1 of the Spec Rev. 3.0
>>>>>> +		 */
>>>>>> +		rdrxctl = IXGBE_READ_REG(hw, IXGBE_RDRXCTL);
>>>>>> +		rdrxctl |= IXGBE_RDRXCTL_RSCACKC;
>>>>>> +		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
>>>>>> +
>>>>>> +		PMD_INIT_LOG(INFO, "enabling LRO mode");
>>>>>> +	}
>>>>>> +
>>>>>> +
>>>>>>     	return 0;
>>>>>>     }
>>>>>>
>>>>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>>>>>> index bbe5ff3..389173f 100644
>>>>>> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>>>>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>>>>>> @@ -79,6 +79,10 @@ struct igb_rx_entry {
>>>>>>     	struct rte_mbuf *mbuf; /**< mbuf associated with RX descriptor. */
>>>>>>     };
>>>>>>
>>>>>> +struct igb_rsc_entry {
>>>>>> +	struct rte_mbuf *fbuf; /**< First segment of the fragmented packet. */
>>>>>> +};
>>>>>> +
>>>>>>     /**
>>>>>>      * Structure associated with each descriptor of the TX ring of a TX queue.
>>>>>>      */
>>>>>> @@ -105,6 +109,7 @@ struct igb_rx_queue {
>>>>>>     	volatile uint32_t   *rdt_reg_addr; /**< RDT register address. */
>>>>>>     	volatile uint32_t   *rdh_reg_addr; /**< RDH register address. */
>>>>>>     	struct igb_rx_entry *sw_ring; /**< address of RX software ring. */
>>>>>> +	struct igb_rsc_entry *sw_rsc_ring; /**< address of RSC software ring. */
>>>>>>     	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet. */
>>>>>>     	struct rte_mbuf *pkt_last_seg; /**< Last segment of current packet. */
>>>>>>     	uint64_t            mbuf_initializer; /**< value to init mbufs */
>>>>>> @@ -126,6 +131,7 @@ struct igb_rx_queue {
>>>>>>     	uint8_t             port_id;  /**< Device port identifier. */
>>>>>>     	uint8_t             crc_len;  /**< 0 if CRC stripped, 4 otherwise. */
>>>>>>     	uint8_t             drop_en;  /**< If not 0, set SRRCTL.Drop_En. */
>>>>>> +	uint8_t             rsc_en;   /**< If not 0, RSC is enabled. */
>>>>>>     	uint8_t             rx_deferred_start; /**< not in global dev start. */
>>>>>>     #ifdef RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC
>>>>>>     	/** need to alloc dummy mbuf, for wraparound when scanning hw ring */
>>>>>> --
>>>>>> 2.1.0

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
  2015-03-11 16:54             ` Vlad Zolotarov
@ 2015-03-13  9:07               ` Olivier MATZ
  2015-03-13 11:28                 ` Ananyev, Konstantin
  0 siblings, 1 reply; 18+ messages in thread
From: Olivier MATZ @ 2015-03-13  9:07 UTC (permalink / raw)
  To: Vlad Zolotarov, Ananyev, Konstantin, dev

Hi Vlad,

On 03/11/2015 05:54 PM, Vlad Zolotarov wrote:
>>>> About the existing RX/TX functions and PPC support:
>>>> Note that all of them were created before PPC support for DPDK was
>>>> introduced.
>>>> At that moment only IA was supported.
>>>> That's why in some places where you would expect to see 'mb()' there
>>>> are 'volatile' and/or ' rte_compiler_barrier' instead.
>>>> Why all that places wasn't updated when PPC support was added -
>>>> that's another question.
>>>>   From my understanding - with current implementation some of DPDK
>>>> PMDs RX/TX functions and  rte_ring wouldn't work correctly
>>> on PPC.
>>>> So, I suppose we need to decide for ourselves - do we really want to
>>>> support PPC and other architectures with non-IA memory
>>> model or not?
>>>> If not, then I think we don't need any mb()s inside recv_pkts_lro()
>>>> - just rte_compiler_barrier seems enough, and no point to
>>> complain about
>>>> it in comments.
>>>> If yes - then why to introduce a new function with a known potential
>>>> bug?
>>> In order to introduce a new function with the proper implementation or
>>> to fix any other places with the similar weakness I would need a proper
>>> tools like a proper platform-dependent barrier-macros similar to
>>> smp_Xmb() Linux macros that reduce to a compiler barrier where
>>> appropriate or to a proper memory fence where needed.
>> I understand that.
>> Let's add new macro for that: rte_smp_Xmb() or something,
>> so it would be just rte_compiler_barrier() for x86 and a proper mb()
>> for PPC.
>
> There was an idea to use the C11 built-in memory barriers. I suggest we
> open a separate discussion about that and add these and the appropriate
> fixes in a separate series. There are quite a few places to fix anyway,
> which are currently broken on PPC so this patch doesn't make things any
> worse. However adding a new memory barrier doesn't belong to an LRO
> functionality and thus to this series.

This is an interesting discussion. Just for reference, I submitted a
patch on this topic but it was probably too early as only Intel
architecture was supported at that time.

See http://dpdk.org/ml/archives/dev/2014-May/002597.html

Regards,
Olivier

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
  2015-03-13  9:07               ` Olivier MATZ
@ 2015-03-13 11:28                 ` Ananyev, Konstantin
  2015-03-13 12:12                   ` Vlad Zolotarov
  0 siblings, 1 reply; 18+ messages in thread
From: Ananyev, Konstantin @ 2015-03-13 11:28 UTC (permalink / raw)
  To: Olivier MATZ, Vlad Zolotarov, dev

Hi Olivier,

> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Friday, March 13, 2015 9:08 AM
> To: Vlad Zolotarov; Ananyev, Konstantin; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
> 
> Hi Vlad,
> 
> On 03/11/2015 05:54 PM, Vlad Zolotarov wrote:
> >>>> About the existing RX/TX functions and PPC support:
> >>>> Note that all of them were created before PPC support for DPDK was
> >>>> introduced.
> >>>> At that moment only IA was supported.
> >>>> That's why in some places where you would expect to see 'mb()' there
> >>>> are 'volatile' and/or ' rte_compiler_barrier' instead.
> >>>> Why all that places wasn't updated when PPC support was added -
> >>>> that's another question.
> >>>>   From my understanding - with current implementation some of DPDK
> >>>> PMDs RX/TX functions and  rte_ring wouldn't work correctly
> >>> on PPC.
> >>>> So, I suppose we need to decide for ourselves - do we really want to
> >>>> support PPC and other architectures with non-IA memory
> >>> model or not?
> >>>> If not, then I think we don't need any mb()s inside recv_pkts_lro()
> >>>> - just rte_compiler_barrier seems enough, and no point to
> >>> complain about
> >>>> it in comments.
> >>>> If yes - then why to introduce a new function with a known potential
> >>>> bug?
> >>> In order to introduce a new function with the proper implementation or
> >>> to fix any other places with the similar weakness I would need a proper
> >>> tools like a proper platform-dependent barrier-macros similar to
> >>> smp_Xmb() Linux macros that reduce to a compiler barrier where
> >>> appropriate or to a proper memory fence where needed.
> >> I understand that.
> >> Let's add new macro for that: rte_smp_Xmb() or something,
> >> so it would be just rte_compiler_barrier() for x86 and a proper mb()
> >> for PPC.
> >
> > There was an idea to use the C11 built-in memory barriers. I suggest we
> > open a separate discussion about that and add these and the appropriate
> > fixes in a separate series. There are quite a few places to fix anyway,
> > which are currently broken on PPC so this patch doesn't make things any
> > worse. However adding a new memory barrier doesn't belong to an LRO
> > functionality and thus to this series.
> 
> This is an interesting discussion. Just for reference, I submitted a
> patch on this topic but it was probably too early as only Intel
> architecture was supported at that time.
> 
> See http://dpdk.org/ml/archives/dev/2014-May/002597.html

I do remember that conversation :)
At that moment, as nothing except IA was supported, I felt it was not needed.
Though now, if we do want to support PPC and other architectures with weak memory model,
I think we do need to introduce some platform dependent set of Xmb() macros.
See http://dpdk.org/ml/archives/dev/2014-October/006729.html

Actually while thinking about it once again:
Is there any good use for rte_compiler_barrier() for PPC memory model?
I can't think about any.
So I wonder can't we just make for PPC:
 #define rte_compiler_barrier    rte_mb
While keeping it as it is for IA.
Would save us from searching/replacing through all the code.
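
I.e. something like (just a sketch, keeping the current IA definition
untouched):

#ifdef RTE_ARCH_PPC_64
#define rte_compiler_barrier()	rte_mb()
#else
#define rte_compiler_barrier()	do {		\
	asm volatile ("" : : : "memory");	\
} while (0)
#endif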

 Konstantin



> 
> Regards,
> Olivier

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
  2015-03-13 11:28                 ` Ananyev, Konstantin
@ 2015-03-13 12:12                   ` Vlad Zolotarov
  0 siblings, 0 replies; 18+ messages in thread
From: Vlad Zolotarov @ 2015-03-13 12:12 UTC (permalink / raw)
  To: Ananyev, Konstantin, Olivier MATZ, dev



On 03/13/15 13:28, Ananyev, Konstantin wrote:
> Hi Olivier,
>
>> -----Original Message-----
>> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
>> Sent: Friday, March 13, 2015 9:08 AM
>> To: Vlad Zolotarov; Ananyev, Konstantin; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
>>
>> Hi Vlad,
>>
>> On 03/11/2015 05:54 PM, Vlad Zolotarov wrote:
>>>>>> About the existing RX/TX functions and PPC support:
>>>>>> Note that all of them were created before PPC support for DPDK was
>>>>>> introduced.
>>>>>> At that moment only IA was supported.
>>>>>> That's why in some places where you would expect to see 'mb()' there
>>>>>> are 'volatile' and/or ' rte_compiler_barrier' instead.
>>>>>> Why all that places wasn't updated when PPC support was added -
>>>>>> that's another question.
>>>>>>    From my understanding - with current implementation some of DPDK
>>>>>> PMDs RX/TX functions and  rte_ring wouldn't work correctly
>>>>> on PPC.
>>>>>> So, I suppose we need to decide for ourselves - do we really want to
>>>>>> support PPC and other architectures with non-IA memory
>>>>> model or not?
>>>>>> If not, then I think we don't need any mb()s inside recv_pkts_lro()
>>>>>> - just rte_compiler_barrier seems enough, and no point to
>>>>> complain about
>>>>>> it in comments.
>>>>>> If yes - then why to introduce a new function with a known potential
>>>>>> bug?
>>>>> In order to introduce a new function with the proper implementation or
>>>>> to fix any other places with the similar weakness I would need a proper
>>>>> tools like a proper platform-dependent barrier-macros similar to
>>>>> smp_Xmb() Linux macros that reduce to a compiler barrier where
>>>>> appropriate or to a proper memory fence where needed.
>>>> I understand that.
>>>> Let's add new macro for that: rte_smp_Xmb() or something,
>>>> so it would be just rte_compiler_barrier() for x86 and a proper mb()
>>>> for PPC.
>>> There was an idea to use the C11 built-in memory barriers. I suggest we
>>> open a separate discussion about that and add these and the appropriate
>>> fixes in a separate series. There are quite a few places to fix anyway,
>>> which are currently broken on PPC so this patch doesn't make things any
>>> worse. However adding a new memory barrier doesn't belong to an LRO
>>> functionality and thus to this series.
>> This is an interesting discussion. Just for reference, I submitted a
>> patch on this topic but it was probably too early as only Intel
>> architecture was supported at that time.
>>
>> See http://dpdk.org/ml/archives/dev/2014-May/002597.html
> I do remember that conversation :)
> At that moment, as nothing except IA was supported, I felt it was not needed.
> Though now, if we do want to support PPC and other architectures with weak memory model,
> I think we do need to introduce some platform dependent set of Xmb() macros.
> See http://dpdk.org/ml/archives/dev/2014-October/006729.html
>
> Actually while thinking about it once again:
> Is there any good use for rte_compiler_barrier() for PPC memory model?
> I can't think about any.
> So I wonder can't we just make for PPC:
>   #define rte_compiler_barrier    rte_mb
> While keeping it as it is for IA.
> Would save us from searching/replacing through all the code.

I wonder why we should reinvent the wheel. As Avi has proposed, we may use 
the existing standard C library primitives for that. See 
http://en.cppreference.com/w/c/atomic. I don't know what the state of 
icc is in this area though... ;)

Pros:

  * Zero maintenance.
  * Multi-platform support.
  * It seems that this is the direction the industry is going in (as
    opposed to the mb(), rmb(), wmb() model discussed above).

Cons:

  * The model is a bit different from what most kernel programmers are
    used to.
  * The current code adaptation would be a bit more painful (due to the
    first "con").


I think this could be a very nice move. For user space - for sure. The 
open question is the KNI component. I don't know how much code is shared 
between kernel and user space DPDK code, but if there isn't much - then 
we may still go for the built-in C atomics primitives in user space and 
do whatever we choose in the KNI...
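
For illustration only, a toy sketch of what the C11 flavour looks like
(struct/field names are made up; any C11-capable compiler should do):

#include <stdatomic.h>
#include <stdint.h>

/* Poll a descriptor's DD bit and only then read its payload, using C11
 * atomics instead of volatile + rmb(). */
struct desc {
	_Atomic uint32_t status;
	uint32_t data;
};

static inline int
read_desc(struct desc *d, uint32_t *out)
{
	uint32_t status = atomic_load_explicit(&d->status,
					       memory_order_acquire);

	if (!(status & 0x1))	/* DD bit not set yet */
		return 0;

	*out = d->data;		/* ordered after the DD check by the acquire */
	return 1;
}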

>
>   Konstantin
>
>
>
>> Regards,
>> Olivier

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
  2015-03-09 19:07 ` [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support Vlad Zolotarov
  2015-03-10  0:30   ` Ananyev, Konstantin
@ 2015-03-16 18:26   ` Vlad Zolotarov
  2015-03-18  0:31     ` Ananyev, Konstantin
  1 sibling, 1 reply; 18+ messages in thread
From: Vlad Zolotarov @ 2015-03-16 18:26 UTC (permalink / raw)
  To: dev



On 03/09/15 21:07, Vlad Zolotarov wrote:
>      - Only x540 and 82599 devices support LRO.
>      - Add the appropriate HW configuration.
>      - Add RSC aware rx_pkt_burst() handlers:
>         - Implemented bulk allocation and non-bulk allocation versions.
>         - Add LRO-specific fields to rte_eth_rxmode, to rte_eth_dev_data
>           and to igb_rx_queue.
>         - Use the appropriate handler when LRO is requested.
>
> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
> ---
> New in v5:
>     - Put the RTE_ETHDEV_HAS_LRO_SUPPORT definition at the beginning of rte_ethdev.h.
>     - Removed the "TODO: Remove me" comment near RTE_ETHDEV_HAS_LRO_SUPPORT.
>
> New in v4:
>     - Define RTE_ETHDEV_HAS_LRO_SUPPORT in rte_ethdev.h instead of
>       RTE_ETHDEV_LRO_SUPPORT defined in config/common_linuxapp.
>
> New in v2:
>     - Removed rte_eth_dev_data.lro_bulk_alloc.
>     - Fixed a few styling and spelling issues.
> ---
>   lib/librte_ether/rte_ethdev.h       |   9 +-
>   lib/librte_pmd_ixgbe/ixgbe_ethdev.c |   6 +
>   lib/librte_pmd_ixgbe/ixgbe_ethdev.h |   5 +
>   lib/librte_pmd_ixgbe/ixgbe_rxtx.c   | 562 +++++++++++++++++++++++++++++++++++-
>   lib/librte_pmd_ixgbe/ixgbe_rxtx.h   |   6 +
>   5 files changed, 581 insertions(+), 7 deletions(-)
>
> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
> index 8db3127..44f081f 100644
> --- a/lib/librte_ether/rte_ethdev.h
> +++ b/lib/librte_ether/rte_ethdev.h
> @@ -172,6 +172,9 @@ extern "C" {
>   
>   #include <stdint.h>
>   
> +/* Use this macro to check if LRO API is supported */
> +#define RTE_ETHDEV_HAS_LRO_SUPPORT
> +
>   #include <rte_log.h>
>   #include <rte_interrupts.h>
>   #include <rte_pci.h>
> @@ -320,14 +323,15 @@ struct rte_eth_rxmode {
>   	enum rte_eth_rx_mq_mode mq_mode;
>   	uint32_t max_rx_pkt_len;  /**< Only used if jumbo_frame enabled. */
>   	uint16_t split_hdr_size;  /**< hdr buf size (header_split enabled).*/
> -	uint8_t header_split : 1, /**< Header Split enable. */
> +	uint16_t header_split : 1, /**< Header Split enable. */
>   		hw_ip_checksum   : 1, /**< IP/UDP/TCP checksum offload enable. */
>   		hw_vlan_filter   : 1, /**< VLAN filter enable. */
>   		hw_vlan_strip    : 1, /**< VLAN strip enable. */
>   		hw_vlan_extend   : 1, /**< Extended VLAN enable. */
>   		jumbo_frame      : 1, /**< Jumbo Frame Receipt enable. */
>   		hw_strip_crc     : 1, /**< Enable CRC stripping by hardware. */
> -		enable_scatter   : 1; /**< Enable scatter packets rx handler */
> +		enable_scatter   : 1, /**< Enable scatter packets rx handler */
> +		enable_lro       : 1; /**< Enable LRO */
>   };
>   
>   /**
> @@ -1515,6 +1519,7 @@ struct rte_eth_dev_data {
>   	uint8_t port_id;           /**< Device [external] port identifier. */
>   	uint8_t promiscuous   : 1, /**< RX promiscuous mode ON(1) / OFF(0). */
>   		scattered_rx : 1,  /**< RX of scattered packets is ON(1) / OFF(0) */
> +		lro          : 1,  /**< RX LRO is ON(1) / OFF(0) */
>   		all_multicast : 1, /**< RX all multicast mode ON(1) / OFF(0). */
>   		dev_started : 1;   /**< Device state: STARTED(1) / STOPPED(0). */
>   };
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> index 9d3de1a..765174d 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> @@ -1648,6 +1648,7 @@ ixgbe_dev_stop(struct rte_eth_dev *dev)
>   
>   	/* Clear stored conf */
>   	dev->data->scattered_rx = 0;
> +	dev->data->lro = 0;
>   	hw->rx_bulk_alloc_allowed = false;
>   	hw->rx_vec_allowed = false;
>   
> @@ -2018,6 +2019,11 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
>   		DEV_RX_OFFLOAD_IPV4_CKSUM |
>   		DEV_RX_OFFLOAD_UDP_CKSUM  |
>   		DEV_RX_OFFLOAD_TCP_CKSUM;
> +
> +	if (hw->mac.type == ixgbe_mac_82599EB ||
> +	    hw->mac.type == ixgbe_mac_X540)
> +		dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_TCP_LRO;
> +
>   	dev_info->tx_offload_capa =
>   		DEV_TX_OFFLOAD_VLAN_INSERT |
>   		DEV_TX_OFFLOAD_IPV4_CKSUM  |
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> index a549f5c..e206584 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> @@ -349,6 +349,11 @@ uint16_t ixgbe_recv_pkts_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
>   uint16_t ixgbe_recv_scattered_pkts(void *rx_queue,
>   		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>   
> +uint16_t ixgbe_recv_pkts_lro(void *rx_queue,
> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> +uint16_t ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue,
> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> +
>   uint16_t ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>   		uint16_t nb_pkts);
>   
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> index 58e619b..944c662 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> @@ -1366,6 +1366,15 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>   }
>   
>   /**
> + * Detect an RSC descriptor.
> + */
> +static inline uint32_t ixgbe_rsc_count(union ixgbe_adv_rx_desc *rx)
> +{
> +	return (rte_le_to_cpu_32(rx->wb.lower.lo_dword.data) &
> +		IXGBE_RXDADV_RSCCNT_MASK) >> IXGBE_RXDADV_RSCCNT_SHIFT;
> +}
> +
> +/**
>    * Initialize the first mbuf of the returned packet:
>    *    - RX port identifier,
>    *    - hardware offload data, if any:
> @@ -1410,6 +1419,291 @@ static inline void ixgbe_fill_cluster_head_buf(
>   	}
>   }
>   
> +/**
> + * Bulk receive handler for the LRO case.
> + *
> + * @rx_queue Rx queue handle
> + * @rx_pkts table of received packets
> + * @nb_pkts size of rx_pkts table
> + * @bulk_alloc if TRUE bulk allocation is used for a HW ring refilling
> + *
> + * Handles the Rx HW ring completions when RSC feature is configured. Uses an
> + * additional ring of igb_rsc_entry's that will hold the relevant RSC info.
> + *
> + * We use the same logic as in Linux and in FreeBSD ixgbe drivers:
> + * 1) When non-EOP RSC completion arrives:
> + *    a) Update the HEAD of the current RSC aggregation cluster with the new
> + *       segment's data length.
> + *    b) Set the "next" pointer of the current segment to point to the segment
> + *       at the NEXTP index.
> + *    c) Pass the HEAD of RSC aggregation cluster on to the next NEXTP entry
> + *       in the sw_rsc_ring.
> + * 2) When EOP arrives we just update the cluster's total length and offload
> + *    flags and deliver the cluster up to the upper layers. In our case - put it
> + *    in the rx_pkts table.
> + *
> + * Returns the number of received packets/clusters (according to the "bulk
> + * receive" interface).
> + */
> +static inline uint16_t
> +_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
> +	       bool bulk_alloc)
> +{
> +	struct igb_rx_queue *rxq = rx_queue;
> +	volatile union ixgbe_adv_rx_desc *rx_ring = rxq->rx_ring;
> +	struct igb_rx_entry *sw_ring = rxq->sw_ring;
> +	struct igb_rsc_entry *sw_rsc_ring = rxq->sw_rsc_ring;
> +	uint16_t rx_id = rxq->rx_tail;
> +	uint16_t nb_rx = 0;
> +	uint16_t nb_hold = rxq->nb_rx_hold;
> +	uint16_t prev_id = rxq->rx_tail;
> +
> +	while (nb_rx < nb_pkts) {
> +		bool eop;
> +		struct igb_rx_entry *rxe;
> +		struct igb_rsc_entry *rsc_entry;
> +		struct igb_rsc_entry *next_rsc_entry;
> +		struct igb_rx_entry *next_rxe;
> +		struct rte_mbuf *first_seg;
> +		struct rte_mbuf *rxm;
> +		struct rte_mbuf *nmb;
> +		union ixgbe_adv_rx_desc rxd;
> +		uint16_t data_len;
> +		uint16_t next_id;
> +		volatile union ixgbe_adv_rx_desc *rxdp;
> +		uint32_t staterr;
> +
> +next_desc:
> +		/*
> +		 * The code in this whole file uses the volatile pointer to
> +		 * ensure the read ordering of the status and the rest of the
> +		 * descriptor fields (on the compiler level only!!!). This is so
> +		 * UGLY - why not to just use the compiler barrier instead? DPDK
> +		 * even has the rte_compiler_barrier() for that.
> +		 *
> +		 * But most importantly this is just wrong because this doesn't
> +		 * ensure memory ordering in a general case at all. For
> +		 * instance, DPDK is supposed to work on Power CPUs where
> +		 * compiler barrier may just not be enough!
> +		 *
> +		 * I tried to write only this function properly to have a
> +		 * starting point (as a part of an LRO/RSC series) but the
> +		 * compiler cursed at me when I tried to cast away the
> +		 * "volatile" from rx_ring (yes, it's volatile too!!!). So, I'm
> +		 * keeping it the way it is for now.
> +		 *
> +		 * The code in this file is broken in so many other places and
> +		 * will just not work on a big endian CPU anyway therefore the
> +		 * lines below will have to be revisited together with the rest
> +		 * of the ixgbe PMD.
> +		 *
> +		 * TODO:
> +		 *    - Get rid of "volatile" crap and let the compiler do its
> +		 *      job.
> +		 *    - Use the proper memory barrier (rte_rmb()) to ensure the
> +		 *      memory ordering below.
> +		 */
> +		rxdp = &rx_ring[rx_id];
> +		staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> +
> +		if (!(staterr & IXGBE_RXDADV_STAT_DD))
> +			break;
> +
> +		rxd = *rxdp;
> +
> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> +				  "staterr=0x%x data_len=%u",
> +			   rxq->port_id, rxq->queue_id, rx_id, staterr,
> +			   rte_le_to_cpu_16(rxd.wb.upper.length));
> +
> +		if (!bulk_alloc) {
> +			nmb = rte_rxmbuf_alloc(rxq->mb_pool);
> +			if (nmb == NULL) {
> +				PMD_RX_LOG(DEBUG, "RX mbuf alloc failed "
> +						  "port_id=%u queue_id=%u",
> +					   rxq->port_id, rxq->queue_id);
> +
> +				rte_eth_devices[rxq->port_id].data->
> +							rx_mbuf_alloc_failed++;
> +				break;
> +			}
> +		} else if (nb_hold > rxq->rx_free_thresh) {
> +			uint16_t next_rdt = rxq->rx_free_trigger;
> +
> +			if (!ixgbe_rx_alloc_bufs(rxq, false)) {
> +				rte_wmb();
> +				IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr,
> +						    next_rdt);
> +				nb_hold -= rxq->rx_free_thresh;
> +			} else {
> +				PMD_RX_LOG(DEBUG, "RX bulk alloc failed "
> +						  "port_id=%u queue_id=%u",
> +					   rxq->port_id, rxq->queue_id);
> +
> +				rte_eth_devices[rxq->port_id].data->
> +							rx_mbuf_alloc_failed++;
> +				break;
> +			}
> +		}
> +
> +		nb_hold++;
> +		rxe = &sw_ring[rx_id];
> +		eop = staterr & IXGBE_RXDADV_STAT_EOP;
> +
> +		next_id = rx_id + 1;
> +		if (next_id == rxq->nb_rx_desc)
> +			next_id = 0;
> +
> +		/* Prefetch next mbuf while processing current one. */
> +		rte_ixgbe_prefetch(sw_ring[next_id].mbuf);
> +
> +		/*
> +		 * When next RX descriptor is on a cache-line boundary,
> +		 * prefetch the next 4 RX descriptors and the next 4 pointers
> +		 * to mbufs.
> +		 */
> +		if ((next_id & 0x3) == 0) {
> +			rte_ixgbe_prefetch(&rx_ring[next_id]);
> +			rte_ixgbe_prefetch(&sw_ring[next_id]);
> +		}
> +
> +		rxm = rxe->mbuf;
> +
> +		if (!bulk_alloc) {
> +			__le64 dma =
> +			  rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(nmb));
> +			/*
> +			 * Update RX descriptor with the physical address of the
> +			 * new data buffer of the new allocated mbuf.
> +			 */
> +			rxe->mbuf = nmb;
> +
> +			rxm->data_off = RTE_PKTMBUF_HEADROOM;
> +			rxdp->read.hdr_addr = dma;
> +			rxdp->read.pkt_addr = dma;
> +		}
> +		/*
> +		 * Set data length & data buffer address of mbuf.
> +		 */
> +		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
> +		rxm->data_len = data_len;
> +
> +		if (!eop) {
> +			uint16_t nextp_id;
> +			/*
> +			 * Get next descriptor index:
> +			 *  - For RSC it's in the NEXTP field.
> +			 *  - For a scattered packet - it's just a following
> +			 *    descriptor.
> +			 */
> +			if (ixgbe_rsc_count(&rxd))
> +				nextp_id =
> +					(staterr & IXGBE_RXDADV_NEXTP_MASK) >>
> +						       IXGBE_RXDADV_NEXTP_SHIFT;
> +			else
> +				nextp_id = next_id;
> +
> +			next_rsc_entry = &sw_rsc_ring[nextp_id];
> +			next_rxe = &sw_ring[nextp_id];
> +			rte_ixgbe_prefetch(next_rxe);
> +		}
> +
> +		rsc_entry = &sw_rsc_ring[rx_id];
> +		first_seg = rsc_entry->fbuf;
> +		rsc_entry->fbuf = NULL;
> +
> +		/*
> +		 * If this is the first buffer of the received packet,
> +		 * set the pointer to the first mbuf of the packet and
> +		 * initialize its context.
> +		 * Otherwise, update the total length and the number of segments
> +		 * of the current scattered packet, and update the pointer to
> +		 * the last mbuf of the current packet.
> +		 */
> +		if (first_seg == NULL) {
> +			first_seg = rxm;
> +			first_seg->pkt_len = data_len;
> +			first_seg->nb_segs = 1;
> +		} else {
> +			first_seg->pkt_len += data_len;
> +			first_seg->nb_segs++;
> +		}
> +
> +		prev_id = rx_id;
> +		rx_id = next_id;
> +
> +		/*
> +		 * If this is not the last buffer of the received packet, update
> +		 * the pointer to the first mbuf at the NEXTP entry in the
> +		 * sw_rsc_ring and continue to parse the RX ring.
> +		 */
> +		if (!eop) {
> +			rxm->next = next_rxe->mbuf;
> +			next_rsc_entry->fbuf = first_seg;
> +			goto next_desc;
> +		}
> +
> +		/*
> +		 * This is the last buffer of the received packet - return
> +		 * the current cluster to the user.
> +		 */
> +		rxm->next = NULL;
> +
> +		/* Initialize the first mbuf of the returned packet */
> +		ixgbe_fill_cluster_head_buf(first_seg, &rxd, rxq->port_id,
> +					    staterr);
> +
> +		/* Prefetch data of first segment, if configured to do so. */
> +		rte_packet_prefetch((char *)first_seg->buf_addr +
> +			first_seg->data_off);
> +
> +		/*
> +		 * Store the mbuf address into the next entry of the array
> +		 * of returned packets.
> +		 */
> +		rx_pkts[nb_rx++] = first_seg;
> +	}
> +
> +	/*
> +	 * Record index of the next RX descriptor to probe.
> +	 */
> +	rxq->rx_tail = rx_id;
> +
> +	/*
> +	 * If the number of free RX descriptors is greater than the RX free
> +	 * threshold of the queue, advance the Receive Descriptor Tail (RDT)
> +	 * register.
> +	 * Update the RDT with the value of the last processed RX descriptor
> +	 * minus 1, to guarantee that the RDT register is never equal to the
> +	 * RDH register, which creates a "full" ring situation from the
> +	 * hardware point of view...
> +	 */
> +	if (!bulk_alloc && nb_hold > rxq->rx_free_thresh) {
> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_tail=%u "
> +			   "nb_hold=%u nb_rx=%u",
> +			   rxq->port_id, rxq->queue_id, rx_id, nb_hold, nb_rx);
> +
> +		IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, prev_id);
> +		nb_hold = 0;
> +	}
> +
> +	rxq->nb_rx_hold = nb_hold;
> +	return nb_rx;
> +}
> +
> +uint16_t
> +ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
> +{
> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, false);
> +}
> +
> +uint16_t
> +ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
> +			       uint16_t nb_pkts)
> +{
> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, true);
> +}
> +
>   uint16_t
>   ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>   			  uint16_t nb_pkts)
> @@ -2024,6 +2318,7 @@ ixgbe_rx_queue_release(struct igb_rx_queue *rxq)
>   	if (rxq != NULL) {
>   		ixgbe_rx_queue_release_mbufs(rxq);
>   		rte_free(rxq->sw_ring);
> +		rte_free(rxq->sw_rsc_ring);
>   		rte_free(rxq);
>   	}
>   }
> @@ -2146,6 +2441,7 @@ ixgbe_reset_rx_queue(struct ixgbe_hw *hw, struct igb_rx_queue *rxq)
>   	rxq->nb_rx_hold = 0;
>   	rxq->pkt_first_seg = NULL;
>   	rxq->pkt_last_seg = NULL;
> +	rxq->rsc_en = 0;
>   }
>   
>   int
> @@ -2160,6 +2456,14 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
>   	struct igb_rx_queue *rxq;
>   	struct ixgbe_hw     *hw;
>   	uint16_t len;
> +	struct rte_eth_dev_info dev_info = { 0 };
> +	struct rte_eth_rxmode *dev_rx_mode = &dev->data->dev_conf.rxmode;
> +	bool rsc_requested = false;
> +
> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
> +	if ((dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO) &&
> +	    dev_rx_mode->enable_lro)
> +		rsc_requested = true;
>   
>   	PMD_INIT_FUNC_TRACE();
>   	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> @@ -2265,12 +2569,28 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
>   	rxq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
>   					  sizeof(struct igb_rx_entry) * len,
>   					  RTE_CACHE_LINE_SIZE, socket_id);
> -	if (rxq->sw_ring == NULL) {
> +	if (!rxq->sw_ring) {
>   		ixgbe_rx_queue_release(rxq);
>   		return (-ENOMEM);
>   	}
> -	PMD_INIT_LOG(DEBUG, "sw_ring=%p hw_ring=%p dma_addr=0x%"PRIx64,
> -		     rxq->sw_ring, rxq->rx_ring, rxq->rx_ring_phys_addr);
> +
> +	if (rsc_requested) {
> +		rxq->sw_rsc_ring =
> +			rte_zmalloc_socket("rxq->sw_rsc_ring",
> +					   sizeof(struct igb_rsc_entry) * len,
> +					   RTE_CACHE_LINE_SIZE, socket_id);
> +		if (!rxq->sw_rsc_ring) {
> +			ixgbe_rx_queue_release(rxq);
> +			return (-ENOMEM);
> +		}
> +	} else {
> +		rxq->sw_rsc_ring = NULL;
> +	}
> +
> +	PMD_INIT_LOG(DEBUG, "sw_ring=%p sw_rsc_ring=%p hw_ring=%p "
> +			    "dma_addr=0x%"PRIx64,
> +		     rxq->sw_ring, rxq->sw_rsc_ring, rxq->rx_ring,
> +		     rxq->rx_ring_phys_addr);
>   
>   	if (!rte_is_power_of_2(nb_desc)) {
>   		PMD_INIT_LOG(DEBUG, "queue[%d] doesn't meet Vector Rx "
> @@ -3515,6 +3835,84 @@ ixgbe_dev_mq_tx_configure(struct rte_eth_dev *dev)
>   	return 0;
>   }
>   
> +/**
> + * get_rscctl_maxdesc - Calculate the RSCCTL[n].MAXDESC for PF
> + *
> + * Return the RSCCTL[n].MAXDESC for 82599 and x540 PF devices according to the
> + * spec rev. 3.0 chapter 8.2.3.8.13.
> + *
> + * @pool Memory pool of the Rx queue
> + */
> +static inline uint32_t get_rscctl_maxdesc(struct rte_mempool *pool)
> +{
> +	struct rte_pktmbuf_pool_private *mp_priv = rte_mempool_get_priv(pool);
> +
> +	/* MAXDESC * SRRCTL.BSIZEPKT must not exceed 64 KB minus one */
> +	uint16_t maxdesc =
> +		65535 / (mp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM);
> +
> +	if (maxdesc >= 16)
> +		return IXGBE_RSCCTL_MAXDESC_16;
> +	else if (maxdesc >= 8)
> +		return IXGBE_RSCCTL_MAXDESC_8;
> +	else if (maxdesc >= 4)
> +		return IXGBE_RSCCTL_MAXDESC_4;
> +	else
> +		return IXGBE_RSCCTL_MAXDESC_1;
> +}
> +
> +/* (Taken from FreeBSD tree)
> +** Setup the correct IVAR register for a particular MSIX interrupt
> +**   (yes this is all very magic and confusing :)
> +**  - entry is the register array entry
> +**  - vector is the MSIX vector for this queue
> +**  - type is RX/TX/MISC
> +*/
> +static void
> +ixgbe_set_ivar(struct rte_eth_dev *dev, u8 entry, u8 vector, s8 type)
> +{
> +	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> +	u32 ivar, index;
> +
> +	vector |= IXGBE_IVAR_ALLOC_VAL;
> +
> +	switch (hw->mac.type) {
> +
> +	case ixgbe_mac_82598EB:
> +		if (type == -1)
> +			entry = IXGBE_IVAR_OTHER_CAUSES_INDEX;
> +		else
> +			entry += (type * 64);
> +		index = (entry >> 2) & 0x1F;
> +		ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(index));
> +		ivar &= ~(0xFF << (8 * (entry & 0x3)));
> +		ivar |= (vector << (8 * (entry & 0x3)));
> +		IXGBE_WRITE_REG(hw, IXGBE_IVAR(index), ivar);
> +		break;
> +
> +	case ixgbe_mac_82599EB:
> +	case ixgbe_mac_X540:
> +		if (type == -1) { /* MISC IVAR */
> +			index = (entry & 1) * 8;
> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR_MISC);
> +			ivar &= ~(0xFF << index);
> +			ivar |= (vector << index);
> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR_MISC, ivar);
> +		} else {	/* RX/TX IVARS */
> +			index = (16 * (entry & 1)) + (8 * type);
> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(entry >> 1));
> +			ivar &= ~(0xFF << index);
> +			ivar |= (vector << index);
> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR(entry >> 1), ivar);
> +		}
> +
> +		break;
> +
> +	default:
> +		break;
> +	}
> +}
> +
>   void set_rx_function(struct rte_eth_dev *dev)
>   {
>   	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> @@ -3565,6 +3963,25 @@ void set_rx_function(struct rte_eth_dev *dev)
>   			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
>   		}
>   	}
> +
> +	/*
> +	 * Initialize the appropriate LRO callback.
> +	 *
> +	 * If all queues satisfy the bulk allocation preconditions
> +	 * (hw->rx_bulk_alloc_allowed is TRUE) then we may use bulk allocation.
> +	 * Otherwise use a single allocation version.
> +	 */
> +	if (dev->data->lro) {
> +		if (hw->rx_bulk_alloc_allowed) {
> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a bulk "
> +					   "allocation version");
> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro_bulk_alloc;
> +		} else {
> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a single "
> +					   "allocation version");
> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro;
> +		}
> +	}
>   }
>   
>   /*
> @@ -3583,10 +4000,26 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>   	uint32_t maxfrs;
>   	uint32_t srrctl;
>   	uint32_t rdrxctl;
> +	uint32_t rscctl;
> +	uint32_t psrtype;
> +	uint32_t rfctl;
>   	uint32_t rxcsum;
>   	uint16_t buf_size;
>   	uint16_t i;
>   	struct rte_eth_rxmode *rx_conf = &dev->data->dev_conf.rxmode;
> +	struct rte_eth_dev_info dev_info = { 0 };
> +	bool rsc_capable = false;
> +
> +	/* Sanity check */
> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
> +	if (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO)
> +		rsc_capable = true;
> +
> +	if (!rsc_capable && rx_conf->enable_lro) {
> +		PMD_INIT_LOG(CRIT, "LRO is requested on HW that doesn't "
> +				   "support it");
> +		return -EINVAL;
> +	}
>   
>   	PMD_INIT_FUNC_TRACE();
>   	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> @@ -3606,13 +4039,44 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>   	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
>   
>   	/*
> +	 * RFCTL configuration
> +	 *
> +	 * Since NFS packets coalescing is not supported - clear RFCTL.NFSW_DIS
> +	 * and RFCTL.NFSR_DIS when RSC is enabled.
> +	 */
> +	if (rsc_capable) {
> +		rfctl = IXGBE_READ_REG(hw, IXGBE_RFCTL);
> +		if (rx_conf->enable_lro) {
> +			rfctl &= ~(IXGBE_RFCTL_RSC_DIS | IXGBE_RFCTL_NFSW_DIS |
> +				   IXGBE_RFCTL_NFSR_DIS);
> +		} else {
> +			rfctl |= IXGBE_RFCTL_RSC_DIS;
> +		}
> +
> +		IXGBE_WRITE_REG(hw, IXGBE_RFCTL, rfctl);
> +	}
> +
> +
> +	/*
>   	 * Configure CRC stripping, if any.
>   	 */
>   	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
>   	if (rx_conf->hw_strip_crc)
>   		hlreg0 |= IXGBE_HLREG0_RXCRCSTRP;
> -	else
> +	else {
>   		hlreg0 &= ~IXGBE_HLREG0_RXCRCSTRP;
> +		if (rx_conf->enable_lro) {
> +			/*
> +			 * According to chapter 4.6.7.2.1 of the Spec Rev.
> +			 * 3.0, RSC configuration requires HW CRC stripping to be
> +			 * enabled. If user requested both HW CRC stripping off
> +			 * and RSC on - return an error.
> +			 */
> +			PMD_INIT_LOG(CRIT, "LRO can't be enabled when HW CRC "
> +					    "is disabled");
> +			return -EINVAL;
> +		}
> +	}
>   
>   	/*
>   	 * Configure jumbo frame support, if any.
> @@ -3664,9 +4128,18 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>   		 * Configure Header Split
>   		 */
>   		if (rx_conf->header_split) {
> +			/*
> +			 * Print a warning if split_hdr_size is less
> +			 * than 128 bytes when RSC is requested.
> +			 */
> +			if (rx_conf->enable_lro &&
> +			    rx_conf->split_hdr_size < 128)
> +				PMD_INIT_LOG(INFO, "split_hdr_size less than "
> +						   "128 bytes (%d)!",
> +					     rx_conf->split_hdr_size);
> +
>   			if (hw->mac.type == ixgbe_mac_82599EB) {
>   				/* Must setup the PSRTYPE register */
> -				uint32_t psrtype;
>   				psrtype = IXGBE_PSRTYPE_TCPHDR |
>   					IXGBE_PSRTYPE_UDPHDR   |
>   					IXGBE_PSRTYPE_IPV4HDR  |
> @@ -3679,7 +4152,20 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>   			srrctl |= IXGBE_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
>   		} else
>   #endif
> +		{
>   			srrctl = IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF;
> +			/*
> +			 * Following the 4.6.7.2.1 chapter of the 82599/x540
> +			 * Spec if RSC is enabled the SRRCTL[n].BSIZEHEADER
> +			 * should be configured even if header split is not
> +			 * enabled. In the latter case we will configure it to 128
> +			 * bytes following the recommendation in the spec.
> +			 */
> +			if (rx_conf->enable_lro)
> +				srrctl |=
> +				     ((128 << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &
> +						    IXGBE_SRRCTL_BSIZEHDR_MASK);
> +		}
>   
>   		/* Set if packets are dropped when no descriptors available */
>   		if (rxq->drop_en)
> @@ -3696,6 +4182,13 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>   				       RTE_PKTMBUF_HEADROOM);
>   		srrctl |= ((buf_size >> IXGBE_SRRCTL_BSIZEPKT_SHIFT) &
>   			   IXGBE_SRRCTL_BSIZEPKT_MASK);
> +
> +		/*
> +		 * TODO: Consider setting the Receive Descriptor Minimum
> +		 * Threshold Size for the RSC case. This is not an obviously
> +		 * beneficial option but one worth considering...
> +		 */
> +
>   		IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(rxq->reg_idx), srrctl);
>   
>   		buf_size = (uint16_t) ((srrctl & IXGBE_SRRCTL_BSIZEPKT_MASK) <<
> @@ -3705,11 +4198,57 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>   		if (dev->data->dev_conf.rxmode.max_rx_pkt_len +
>   					    2 * IXGBE_VLAN_TAG_SIZE > buf_size)
>   			dev->data->scattered_rx = 1;
> +
> +		/* RSC per-queue configuration */
> +		if (rx_conf->enable_lro) {
> +			uint32_t eitr;
> +
> +			rscctl =
> +				IXGBE_READ_REG(hw, IXGBE_RSCCTL(rxq->reg_idx));
> +			psrtype =
> +				IXGBE_READ_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx));
> +			eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
> +
> +			rscctl |= IXGBE_RSCCTL_RSCEN;
> +			rscctl |= get_rscctl_maxdesc(rxq->mb_pool);
> +			psrtype |= IXGBE_PSRTYPE_TCPHDR;
> +
> +			/*
> +			 * RSC: Set ITR interval corresponding to 2K ints/s.
> +			 *
> +			 * Full-sized RSC aggregations for a 10Gb/s link will
> +			 * arrive at about 20K aggregation/s rate.
> +			 *
> +			 * 2K ints/s rate will cause only 10% of the
> +			 * aggregations to be closed due to the interrupt timer
> +			 * expiration for a streaming at wire-speed case.
> +			 *
> +			 * For a sparse streaming case this setting will yield
> +			 * at most 500us latency for a single RSC aggregation.
> +			 */
> +			eitr   |= (2000 | IXGBE_EITR_CNT_WDIS);
> +
> +			IXGBE_WRITE_REG(hw, IXGBE_RSCCTL(rxq->reg_idx), rscctl);
> +			IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx),
> +								       psrtype);
> +			IXGBE_WRITE_REG(hw, IXGBE_EITR(rxq->reg_idx), eitr);
> +
> +			/*
> +			 * RSC requires the mapping of the queue to the
> +			 * interrupt vector.
> +			 */
> +			ixgbe_set_ivar(dev, rxq->reg_idx, i, 0);
> +
> +			rxq->rsc_en = 1;
> +		}
>   	}
>   
>   	if (rx_conf->enable_scatter)
>   		dev->data->scattered_rx = 1;
>   
> +	if (rx_conf->enable_lro)
> +		dev->data->lro = 1;
> +
>   	set_rx_function(dev);
>   
>   	/*
> @@ -3742,6 +4281,19 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>   		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
>   	}
>   
> +	/* Finalize RSC configuration  */
> +	if (rx_conf->enable_lro) {
> +		/*
> +		 * Follow the instructions in the 4.6.7.2.1 of the Spec Rev. 3.0
> +		 */
> +		rdrxctl = IXGBE_READ_REG(hw, IXGBE_RDRXCTL);
> +		rdrxctl |= IXGBE_RDRXCTL_RSCACKC;
> +		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
> +
> +		PMD_INIT_LOG(INFO, "enabling LRO mode");
> +	}
> +
> +
>   	return 0;
>   }

I've just noticed that RTE_HEADER_SPLIT_ENABLE used in 
ixgbe_dev_rx_init() is not enabled anywhere: neither in config_XXX files 
nor anywhere else.
It looks like this macro is used only in ixgbe_rxtx.c. Seems like this 
is some sort of legacy leftover, isn't it?

Konstantin, could you please comment?

>   
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> index bbe5ff3..389173f 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> @@ -79,6 +79,10 @@ struct igb_rx_entry {
>   	struct rte_mbuf *mbuf; /**< mbuf associated with RX descriptor. */
>   };
>   
> +struct igb_rsc_entry {
> +	struct rte_mbuf *fbuf; /**< First segment of the fragmented packet. */
> +};
> +
>   /**
>    * Structure associated with each descriptor of the TX ring of a TX queue.
>    */
> @@ -105,6 +109,7 @@ struct igb_rx_queue {
>   	volatile uint32_t   *rdt_reg_addr; /**< RDT register address. */
>   	volatile uint32_t   *rdh_reg_addr; /**< RDH register address. */
>   	struct igb_rx_entry *sw_ring; /**< address of RX software ring. */
> +	struct igb_rsc_entry *sw_rsc_ring; /**< address of RSC software ring. */
>   	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet. */
>   	struct rte_mbuf *pkt_last_seg; /**< Last segment of current packet. */
>   	uint64_t            mbuf_initializer; /**< value to init mbufs */
> @@ -126,6 +131,7 @@ struct igb_rx_queue {
>   	uint8_t             port_id;  /**< Device port identifier. */
>   	uint8_t             crc_len;  /**< 0 if CRC stripped, 4 otherwise. */
>   	uint8_t             drop_en;  /**< If not 0, set SRRCTL.Drop_En. */
> +	uint8_t             rsc_en;   /**< If not 0, RSC is enabled. */
>   	uint8_t             rx_deferred_start; /**< not in global dev start. */
>   #ifdef RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC
>   	/** need to alloc dummy mbuf, for wraparound when scanning hw ring */

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
  2015-03-16 18:26   ` Vlad Zolotarov
@ 2015-03-18  0:31     ` Ananyev, Konstantin
  2015-03-18 10:29       ` Vlad Zolotarov
  0 siblings, 1 reply; 18+ messages in thread
From: Ananyev, Konstantin @ 2015-03-18  0:31 UTC (permalink / raw)
  To: Vlad Zolotarov, dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vlad Zolotarov
> Sent: Monday, March 16, 2015 6:27 PM
> To: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
> 
> 
> 
> On 03/09/15 21:07, Vlad Zolotarov wrote:
> >      - Only x540 and 82599 devices support LRO.
> >      - Add the appropriate HW configuration.
> >      - Add RSC aware rx_pkt_burst() handlers:
> >         - Implemented bulk allocation and non-bulk allocation versions.
> >         - Add LRO-specific fields to rte_eth_rxmode, to rte_eth_dev_data
> >           and to igb_rx_queue.
> >         - Use the appropriate handler when LRO is requested.
> >
> > Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
> > ---
> > New in v5:
> >     - Put the RTE_ETHDEV_HAS_LRO_SUPPORT definition at the beginning of rte_ethdev.h.
> >     - Removed the "TODO: Remove me" comment near RTE_ETHDEV_HAS_LRO_SUPPORT.
> >
> > New in v4:
> >     - Define RTE_ETHDEV_HAS_LRO_SUPPORT in rte_ethdev.h instead of
> >       RTE_ETHDEV_LRO_SUPPORT defined in config/common_linuxapp.
> >
> > New in v2:
> >     - Removed rte_eth_dev_data.lro_bulk_alloc.
> >     - Fixed a few styling and spelling issues.
> > ---
> >   lib/librte_ether/rte_ethdev.h       |   9 +-
> >   lib/librte_pmd_ixgbe/ixgbe_ethdev.c |   6 +
> >   lib/librte_pmd_ixgbe/ixgbe_ethdev.h |   5 +
> >   lib/librte_pmd_ixgbe/ixgbe_rxtx.c   | 562 +++++++++++++++++++++++++++++++++++-
> >   lib/librte_pmd_ixgbe/ixgbe_rxtx.h   |   6 +
> >   5 files changed, 581 insertions(+), 7 deletions(-)
> >
> > diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
> > index 8db3127..44f081f 100644
> > --- a/lib/librte_ether/rte_ethdev.h
> > +++ b/lib/librte_ether/rte_ethdev.h
> > @@ -172,6 +172,9 @@ extern "C" {
> >
> >   #include <stdint.h>
> >
> > +/* Use this macro to check if LRO API is supported */
> > +#define RTE_ETHDEV_HAS_LRO_SUPPORT
> > +
> >   #include <rte_log.h>
> >   #include <rte_interrupts.h>
> >   #include <rte_pci.h>
> > @@ -320,14 +323,15 @@ struct rte_eth_rxmode {
> >   	enum rte_eth_rx_mq_mode mq_mode;
> >   	uint32_t max_rx_pkt_len;  /**< Only used if jumbo_frame enabled. */
> >   	uint16_t split_hdr_size;  /**< hdr buf size (header_split enabled).*/
> > -	uint8_t header_split : 1, /**< Header Split enable. */
> > +	uint16_t header_split : 1, /**< Header Split enable. */
> >   		hw_ip_checksum   : 1, /**< IP/UDP/TCP checksum offload enable. */
> >   		hw_vlan_filter   : 1, /**< VLAN filter enable. */
> >   		hw_vlan_strip    : 1, /**< VLAN strip enable. */
> >   		hw_vlan_extend   : 1, /**< Extended VLAN enable. */
> >   		jumbo_frame      : 1, /**< Jumbo Frame Receipt enable. */
> >   		hw_strip_crc     : 1, /**< Enable CRC stripping by hardware. */
> > -		enable_scatter   : 1; /**< Enable scatter packets rx handler */
> > +		enable_scatter   : 1, /**< Enable scatter packets rx handler */
> > +		enable_lro       : 1; /**< Enable LRO */
> >   };
> >
> >   /**
> > @@ -1515,6 +1519,7 @@ struct rte_eth_dev_data {
> >   	uint8_t port_id;           /**< Device [external] port identifier. */
> >   	uint8_t promiscuous   : 1, /**< RX promiscuous mode ON(1) / OFF(0). */
> >   		scattered_rx : 1,  /**< RX of scattered packets is ON(1) / OFF(0) */
> > +		lro          : 1,  /**< RX LRO is ON(1) / OFF(0) */
> >   		all_multicast : 1, /**< RX all multicast mode ON(1) / OFF(0). */
> >   		dev_started : 1;   /**< Device state: STARTED(1) / STOPPED(0). */
> >   };
> > diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> > index 9d3de1a..765174d 100644
> > --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> > +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
> > @@ -1648,6 +1648,7 @@ ixgbe_dev_stop(struct rte_eth_dev *dev)
> >
> >   	/* Clear stored conf */
> >   	dev->data->scattered_rx = 0;
> > +	dev->data->lro = 0;
> >   	hw->rx_bulk_alloc_allowed = false;
> >   	hw->rx_vec_allowed = false;
> >
> > @@ -2018,6 +2019,11 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
> >   		DEV_RX_OFFLOAD_IPV4_CKSUM |
> >   		DEV_RX_OFFLOAD_UDP_CKSUM  |
> >   		DEV_RX_OFFLOAD_TCP_CKSUM;
> > +
> > +	if (hw->mac.type == ixgbe_mac_82599EB ||
> > +	    hw->mac.type == ixgbe_mac_X540)
> > +		dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_TCP_LRO;
> > +
> >   	dev_info->tx_offload_capa =
> >   		DEV_TX_OFFLOAD_VLAN_INSERT |
> >   		DEV_TX_OFFLOAD_IPV4_CKSUM  |
> > diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> > index a549f5c..e206584 100644
> > --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> > +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
> > @@ -349,6 +349,11 @@ uint16_t ixgbe_recv_pkts_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
> >   uint16_t ixgbe_recv_scattered_pkts(void *rx_queue,
> >   		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> >
> > +uint16_t ixgbe_recv_pkts_lro(void *rx_queue,
> > +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> > +uint16_t ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue,
> > +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
> > +
> >   uint16_t ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
> >   		uint16_t nb_pkts);
> >
> > diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > index 58e619b..944c662 100644
> > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> > @@ -1366,6 +1366,15 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
> >   }
> >
> >   /**
> > + * Detect an RSC descriptor.
> > + */
> > +static inline uint32_t ixgbe_rsc_count(union ixgbe_adv_rx_desc *rx)
> > +{
> > +	return (rte_le_to_cpu_32(rx->wb.lower.lo_dword.data) &
> > +		IXGBE_RXDADV_RSCCNT_MASK) >> IXGBE_RXDADV_RSCCNT_SHIFT;
> > +}
> > +
> > +/**
> >    * Initialize the first mbuf of the returned packet:
> >    *    - RX port identifier,
> >    *    - hardware offload data, if any:
> > @@ -1410,6 +1419,291 @@ static inline void ixgbe_fill_cluster_head_buf(
> >   	}
> >   }
> >
> > +/**
> > + * Bulk receive handler for the LRO case.
> > + *
> > + * @rx_queue Rx queue handle
> > + * @rx_pkts table of received packets
> > + * @nb_pkts size of rx_pkts table
> > + * @bulk_alloc if TRUE bulk allocation is used for a HW ring refilling
> > + *
> > + * Handles the Rx HW ring completions when RSC feature is configured. Uses an
> > + * additional ring of igb_rsc_entry's that will hold the relevant RSC info.
> > + *
> > + * We use the same logic as in Linux and in FreeBSD ixgbe drivers:
> > + * 1) When non-EOP RSC completion arrives:
> > + *    a) Update the HEAD of the current RSC aggregation cluster with the new
> > + *       segment's data length.
> > + *    b) Set the "next" pointer of the current segment to point to the segment
> > + *       at the NEXTP index.
> > + *    c) Pass the HEAD of RSC aggregation cluster on to the next NEXTP entry
> > + *       in the sw_rsc_ring.
> > + * 2) When EOP arrives we just update the cluster's total length and offload
> > + *    flags and deliver the cluster up to the upper layers. In our case - put it
> > + *    in the rx_pkts table.
> > + *
> > + * Returns the number of received packets/clusters (according to the "bulk
> > + * receive" interface).
> > + */
> > +static inline uint16_t
> > +_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
> > +	       bool bulk_alloc)
> > +{
> > +	struct igb_rx_queue *rxq = rx_queue;
> > +	volatile union ixgbe_adv_rx_desc *rx_ring = rxq->rx_ring;
> > +	struct igb_rx_entry *sw_ring = rxq->sw_ring;
> > +	struct igb_rsc_entry *sw_rsc_ring = rxq->sw_rsc_ring;
> > +	uint16_t rx_id = rxq->rx_tail;
> > +	uint16_t nb_rx = 0;
> > +	uint16_t nb_hold = rxq->nb_rx_hold;
> > +	uint16_t prev_id = rxq->rx_tail;
> > +
> > +	while (nb_rx < nb_pkts) {
> > +		bool eop;
> > +		struct igb_rx_entry *rxe;
> > +		struct igb_rsc_entry *rsc_entry;
> > +		struct igb_rsc_entry *next_rsc_entry;
> > +		struct igb_rx_entry *next_rxe;
> > +		struct rte_mbuf *first_seg;
> > +		struct rte_mbuf *rxm;
> > +		struct rte_mbuf *nmb;
> > +		union ixgbe_adv_rx_desc rxd;
> > +		uint16_t data_len;
> > +		uint16_t next_id;
> > +		volatile union ixgbe_adv_rx_desc *rxdp;
> > +		uint32_t staterr;
> > +
> > +next_desc:
> > +		/*
> > +		 * The code in this whole file uses the volatile pointer to
> > +		 * ensure the read ordering of the status and the rest of the
> > +		 * descriptor fields (on the compiler level only!!!). This is so
> > +		 * UGLY - why not to just use the compiler barrier instead? DPDK
> > +		 * even has the rte_compiler_barrier() for that.
> > +		 *
> > +		 * But most importantly this is just wrong because this doesn't
> > +		 * ensure memory ordering in a general case at all. For
> > +		 * instance, DPDK is supposed to work on Power CPUs where
> > +		 * compiler barrier may just not be enough!
> > +		 *
> > +		 * I tried to write only this function properly to have a
> > +		 * starting point (as a part of an LRO/RSC series) but the
> > +		 * compiler cursed at me when I tried to cast away the
> > +		 * "volatile" from rx_ring (yes, it's volatile too!!!). So, I'm
> > +		 * keeping it the way it is for now.
> > +		 *
> > +		 * The code in this file is broken in so many other places and
> > +		 * will just not work on a big endian CPU anyway therefore the
> > +		 * lines below will have to be revisited together with the rest
> > +		 * of the ixgbe PMD.
> > +		 *
> > +		 * TODO:
> > +		 *    - Get rid of "volatile" crap and let the compiler do its
> > +		 *      job.
> > +		 *    - Use the proper memory barrier (rte_rmb()) to ensure the
> > +		 *      memory ordering below.
> > +		 */
> > +		rxdp = &rx_ring[rx_id];
> > +		staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
> > +
> > +		if (!(staterr & IXGBE_RXDADV_STAT_DD))
> > +			break;
> > +
> > +		rxd = *rxdp;
> > +
> > +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
> > +				  "staterr=0x%x data_len=%u",
> > +			   rxq->port_id, rxq->queue_id, rx_id, staterr,
> > +			   rte_le_to_cpu_16(rxd.wb.upper.length));
> > +
> > +		if (!bulk_alloc) {
> > +			nmb = rte_rxmbuf_alloc(rxq->mb_pool);
> > +			if (nmb == NULL) {
> > +				PMD_RX_LOG(DEBUG, "RX mbuf alloc failed "
> > +						  "port_id=%u queue_id=%u",
> > +					   rxq->port_id, rxq->queue_id);
> > +
> > +				rte_eth_devices[rxq->port_id].data->
> > +							rx_mbuf_alloc_failed++;
> > +				break;
> > +			}
> > +		} else if (nb_hold > rxq->rx_free_thresh) {
> > +			uint16_t next_rdt = rxq->rx_free_trigger;
> > +
> > +			if (!ixgbe_rx_alloc_bufs(rxq, false)) {
> > +				rte_wmb();
> > +				IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr,
> > +						    next_rdt);
> > +				nb_hold -= rxq->rx_free_thresh;
> > +			} else {
> > +				PMD_RX_LOG(DEBUG, "RX bulk alloc failed "
> > +						  "port_id=%u queue_id=%u",
> > +					   rxq->port_id, rxq->queue_id);
> > +
> > +				rte_eth_devices[rxq->port_id].data->
> > +							rx_mbuf_alloc_failed++;
> > +				break;
> > +			}
> > +		}
> > +
> > +		nb_hold++;
> > +		rxe = &sw_ring[rx_id];
> > +		eop = staterr & IXGBE_RXDADV_STAT_EOP;
> > +
> > +		next_id = rx_id + 1;
> > +		if (next_id == rxq->nb_rx_desc)
> > +			next_id = 0;
> > +
> > +		/* Prefetch next mbuf while processing current one. */
> > +		rte_ixgbe_prefetch(sw_ring[next_id].mbuf);
> > +
> > +		/*
> > +		 * When next RX descriptor is on a cache-line boundary,
> > +		 * prefetch the next 4 RX descriptors and the next 4 pointers
> > +		 * to mbufs.
> > +		 */
> > +		if ((next_id & 0x3) == 0) {
> > +			rte_ixgbe_prefetch(&rx_ring[next_id]);
> > +			rte_ixgbe_prefetch(&sw_ring[next_id]);
> > +		}
> > +
> > +		rxm = rxe->mbuf;
> > +
> > +		if (!bulk_alloc) {
> > +			__le64 dma =
> > +			  rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(nmb));
> > +			/*
> > +			 * Update RX descriptor with the physical address of the
> > +			 * new data buffer of the new allocated mbuf.
> > +			 */
> > +			rxe->mbuf = nmb;
> > +
> > +			rxm->data_off = RTE_PKTMBUF_HEADROOM;
> > +			rxdp->read.hdr_addr = dma;
> > +			rxdp->read.pkt_addr = dma;
> > +		}
> > +		/*
> > +		 * Set data length & data buffer address of mbuf.
> > +		 */
> > +		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
> > +		rxm->data_len = data_len;
> > +
> > +		if (!eop) {
> > +			uint16_t nextp_id;
> > +			/*
> > +			 * Get next descriptor index:
> > +			 *  - For RSC it's in the NEXTP field.
> > +			 *  - For a scattered packet - it's just a following
> > +			 *    descriptor.
> > +			 */
> > +			if (ixgbe_rsc_count(&rxd))
> > +				nextp_id =
> > +					(staterr & IXGBE_RXDADV_NEXTP_MASK) >>
> > +						       IXGBE_RXDADV_NEXTP_SHIFT;
> > +			else
> > +				nextp_id = next_id;
> > +
> > +			next_rsc_entry = &sw_rsc_ring[nextp_id];
> > +			next_rxe = &sw_ring[nextp_id];
> > +			rte_ixgbe_prefetch(next_rxe);
> > +		}
> > +
> > +		rsc_entry = &sw_rsc_ring[rx_id];
> > +		first_seg = rsc_entry->fbuf;
> > +		rsc_entry->fbuf = NULL;
> > +
> > +		/*
> > +		 * If this is the first buffer of the received packet,
> > +		 * set the pointer to the first mbuf of the packet and
> > +		 * initialize its context.
> > +		 * Otherwise, update the total length and the number of segments
> > +		 * of the current scattered packet, and update the pointer to
> > +		 * the last mbuf of the current packet.
> > +		 */
> > +		if (first_seg == NULL) {
> > +			first_seg = rxm;
> > +			first_seg->pkt_len = data_len;
> > +			first_seg->nb_segs = 1;
> > +		} else {
> > +			first_seg->pkt_len += data_len;
> > +			first_seg->nb_segs++;
> > +		}
> > +
> > +		prev_id = rx_id;
> > +		rx_id = next_id;
> > +
> > +		/*
> > +		 * If this is not the last buffer of the received packet, update
> > +		 * the pointer to the first mbuf at the NEXTP entry in the
> > +		 * sw_rsc_ring and continue to parse the RX ring.
> > +		 */
> > +		if (!eop) {
> > +			rxm->next = next_rxe->mbuf;
> > +			next_rsc_entry->fbuf = first_seg;
> > +			goto next_desc;
> > +		}
> > +
> > +		/*
> > +		 * This is the last buffer of the received packet - return
> > +		 * the current cluster to the user.
> > +		 */
> > +		rxm->next = NULL;
> > +
> > +		/* Initialize the first mbuf of the returned packet */
> > +		ixgbe_fill_cluster_head_buf(first_seg, &rxd, rxq->port_id,
> > +					    staterr);
> > +
> > +		/* Prefetch data of first segment, if configured to do so. */
> > +		rte_packet_prefetch((char *)first_seg->buf_addr +
> > +			first_seg->data_off);
> > +
> > +		/*
> > +		 * Store the mbuf address into the next entry of the array
> > +		 * of returned packets.
> > +		 */
> > +		rx_pkts[nb_rx++] = first_seg;
> > +	}
> > +
> > +	/*
> > +	 * Record index of the next RX descriptor to probe.
> > +	 */
> > +	rxq->rx_tail = rx_id;
> > +
> > +	/*
> > +	 * If the number of free RX descriptors is greater than the RX free
> > +	 * threshold of the queue, advance the Receive Descriptor Tail (RDT)
> > +	 * register.
> > +	 * Update the RDT with the value of the last processed RX descriptor
> > +	 * minus 1, to guarantee that the RDT register is never equal to the
> > +	 * RDH register, which creates a "full" ring situation from the
> > +	 * hardware point of view...
> > +	 */
> > +	if (!bulk_alloc && nb_hold > rxq->rx_free_thresh) {
> > +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_tail=%u "
> > +			   "nb_hold=%u nb_rx=%u",
> > +			   rxq->port_id, rxq->queue_id, rx_id, nb_hold, nb_rx);
> > +
> > +		IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, prev_id);
> > +		nb_hold = 0;
> > +	}
> > +
> > +	rxq->nb_rx_hold = nb_hold;
> > +	return nb_rx;
> > +}
> > +
> > +uint16_t
> > +ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
> > +{
> > +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, false);
> > +}
> > +
> > +uint16_t
> > +ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
> > +			       uint16_t nb_pkts)
> > +{
> > +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, true);
> > +}
> > +
> >   uint16_t
> >   ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
> >   			  uint16_t nb_pkts)
> > @@ -2024,6 +2318,7 @@ ixgbe_rx_queue_release(struct igb_rx_queue *rxq)
> >   	if (rxq != NULL) {
> >   		ixgbe_rx_queue_release_mbufs(rxq);
> >   		rte_free(rxq->sw_ring);
> > +		rte_free(rxq->sw_rsc_ring);
> >   		rte_free(rxq);
> >   	}
> >   }
> > @@ -2146,6 +2441,7 @@ ixgbe_reset_rx_queue(struct ixgbe_hw *hw, struct igb_rx_queue *rxq)
> >   	rxq->nb_rx_hold = 0;
> >   	rxq->pkt_first_seg = NULL;
> >   	rxq->pkt_last_seg = NULL;
> > +	rxq->rsc_en = 0;
> >   }
> >
> >   int
> > @@ -2160,6 +2456,14 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
> >   	struct igb_rx_queue *rxq;
> >   	struct ixgbe_hw     *hw;
> >   	uint16_t len;
> > +	struct rte_eth_dev_info dev_info = { 0 };
> > +	struct rte_eth_rxmode *dev_rx_mode = &dev->data->dev_conf.rxmode;
> > +	bool rsc_requested = false;
> > +
> > +	dev->dev_ops->dev_infos_get(dev, &dev_info);
> > +	if ((dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO) &&
> > +	    dev_rx_mode->enable_lro)
> > +		rsc_requested = true;
> >
> >   	PMD_INIT_FUNC_TRACE();
> >   	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> > @@ -2265,12 +2569,28 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
> >   	rxq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
> >   					  sizeof(struct igb_rx_entry) * len,
> >   					  RTE_CACHE_LINE_SIZE, socket_id);
> > -	if (rxq->sw_ring == NULL) {
> > +	if (!rxq->sw_ring) {
> >   		ixgbe_rx_queue_release(rxq);
> >   		return (-ENOMEM);
> >   	}
> > -	PMD_INIT_LOG(DEBUG, "sw_ring=%p hw_ring=%p dma_addr=0x%"PRIx64,
> > -		     rxq->sw_ring, rxq->rx_ring, rxq->rx_ring_phys_addr);
> > +
> > +	if (rsc_requested) {
> > +		rxq->sw_rsc_ring =
> > +			rte_zmalloc_socket("rxq->sw_rsc_ring",
> > +					   sizeof(struct igb_rsc_entry) * len,
> > +					   RTE_CACHE_LINE_SIZE, socket_id);
> > +		if (!rxq->sw_rsc_ring) {
> > +			ixgbe_rx_queue_release(rxq);
> > +			return (-ENOMEM);
> > +		}
> > +	} else {
> > +		rxq->sw_rsc_ring = NULL;
> > +	}
> > +
> > +	PMD_INIT_LOG(DEBUG, "sw_ring=%p sw_rsc_ring=%p hw_ring=%p "
> > +			    "dma_addr=0x%"PRIx64,
> > +		     rxq->sw_ring, rxq->sw_rsc_ring, rxq->rx_ring,
> > +		     rxq->rx_ring_phys_addr);
> >
> >   	if (!rte_is_power_of_2(nb_desc)) {
> >   		PMD_INIT_LOG(DEBUG, "queue[%d] doesn't meet Vector Rx "
> > @@ -3515,6 +3835,84 @@ ixgbe_dev_mq_tx_configure(struct rte_eth_dev *dev)
> >   	return 0;
> >   }
> >
> > +/**
> > + * get_rscctl_maxdesc - Calculate the RSCCTL[n].MAXDESC for PF
> > + *
> > + * Return the RSCCTL[n].MAXDESC for 82599 and x540 PF devices according to the
> > + * spec rev. 3.0 chapter 8.2.3.8.13.
> > + *
> > + * @pool Memory pool of the Rx queue
> > + */
> > +static inline uint32_t get_rscctl_maxdesc(struct rte_mempool *pool)
> > +{
> > +	struct rte_pktmbuf_pool_private *mp_priv = rte_mempool_get_priv(pool);
> > +
> > +	/* MAXDESC * SRRCTL.BSIZEPKT must not exceed 64 KB minus one */
> > +	uint16_t maxdesc =
> > +		65535 / (mp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM);
> > +
> > +	if (maxdesc >= 16)
> > +		return IXGBE_RSCCTL_MAXDESC_16;
> > +	else if (maxdesc >= 8)
> > +		return IXGBE_RSCCTL_MAXDESC_8;
> > +	else if (maxdesc >= 4)
> > +		return IXGBE_RSCCTL_MAXDESC_4;
> > +	else
> > +		return IXGBE_RSCCTL_MAXDESC_1;
> > +}
> > +
> > +/* (Taken from FreeBSD tree)
> > +** Setup the correct IVAR register for a particular MSIX interrupt
> > +**   (yes this is all very magic and confusing :)
> > +**  - entry is the register array entry
> > +**  - vector is the MSIX vector for this queue
> > +**  - type is RX/TX/MISC
> > +*/
> > +static void
> > +ixgbe_set_ivar(struct rte_eth_dev *dev, u8 entry, u8 vector, s8 type)
> > +{
> > +	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> > +	u32 ivar, index;
> > +
> > +	vector |= IXGBE_IVAR_ALLOC_VAL;
> > +
> > +	switch (hw->mac.type) {
> > +
> > +	case ixgbe_mac_82598EB:
> > +		if (type == -1)
> > +			entry = IXGBE_IVAR_OTHER_CAUSES_INDEX;
> > +		else
> > +			entry += (type * 64);
> > +		index = (entry >> 2) & 0x1F;
> > +		ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(index));
> > +		ivar &= ~(0xFF << (8 * (entry & 0x3)));
> > +		ivar |= (vector << (8 * (entry & 0x3)));
> > +		IXGBE_WRITE_REG(hw, IXGBE_IVAR(index), ivar);
> > +		break;
> > +
> > +	case ixgbe_mac_82599EB:
> > +	case ixgbe_mac_X540:
> > +		if (type == -1) { /* MISC IVAR */
> > +			index = (entry & 1) * 8;
> > +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR_MISC);
> > +			ivar &= ~(0xFF << index);
> > +			ivar |= (vector << index);
> > +			IXGBE_WRITE_REG(hw, IXGBE_IVAR_MISC, ivar);
> > +		} else {	/* RX/TX IVARS */
> > +			index = (16 * (entry & 1)) + (8 * type);
> > +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(entry >> 1));
> > +			ivar &= ~(0xFF << index);
> > +			ivar |= (vector << index);
> > +			IXGBE_WRITE_REG(hw, IXGBE_IVAR(entry >> 1), ivar);
> > +		}
> > +
> > +		break;
> > +
> > +	default:
> > +		break;
> > +	}
> > +}
> > +
> >   void set_rx_function(struct rte_eth_dev *dev)
> >   {
> >   	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> > @@ -3565,6 +3963,25 @@ void set_rx_function(struct rte_eth_dev *dev)
> >   			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
> >   		}
> >   	}
> > +
> > +	/*
> > +	 * Initialize the appropriate LRO callback.
> > +	 *
> > +	 * If all queues satisfy the bulk allocation preconditions
> > +	 * (hw->rx_bulk_alloc_allowed is TRUE) then we may use bulk allocation.
> > +	 * Otherwise use a single allocation version.
> > +	 */
> > +	if (dev->data->lro) {
> > +		if (hw->rx_bulk_alloc_allowed) {
> > +			PMD_INIT_LOG(INFO, "LRO is requested. Using a bulk "
> > +					   "allocation version");
> > +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro_bulk_alloc;
> > +		} else {
> > +			PMD_INIT_LOG(INFO, "LRO is requested. Using a single "
> > +					   "allocation version");
> > +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro;
> > +		}
> > +	}
> >   }
> >
> >   /*
> > @@ -3583,10 +4000,26 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >   	uint32_t maxfrs;
> >   	uint32_t srrctl;
> >   	uint32_t rdrxctl;
> > +	uint32_t rscctl;
> > +	uint32_t psrtype;
> > +	uint32_t rfctl;
> >   	uint32_t rxcsum;
> >   	uint16_t buf_size;
> >   	uint16_t i;
> >   	struct rte_eth_rxmode *rx_conf = &dev->data->dev_conf.rxmode;
> > +	struct rte_eth_dev_info dev_info = { 0 };
> > +	bool rsc_capable = false;
> > +
> > +	/* Sanity check */
> > +	dev->dev_ops->dev_infos_get(dev, &dev_info);
> > +	if (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO)
> > +		rsc_capable = true;
> > +
> > +	if (!rsc_capable && rx_conf->enable_lro) {
> > +		PMD_INIT_LOG(CRIT, "LRO is requested on HW that doesn't "
> > +				   "support it");
> > +		return -EINVAL;
> > +	}
> >
> >   	PMD_INIT_FUNC_TRACE();
> >   	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> > @@ -3606,13 +4039,44 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >   	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
> >
> >   	/*
> > +	 * RFCTL configuration
> > +	 *
> > +	 * Since NFS packets coalescing is not supported - clear RFCTL.NFSW_DIS
> > +	 * and RFCTL.NFSR_DIS when RSC is enabled.
> > +	 */
> > +	if (rsc_capable) {
> > +		rfctl = IXGBE_READ_REG(hw, IXGBE_RFCTL);
> > +		if (rx_conf->enable_lro) {
> > +			rfctl &= ~(IXGBE_RFCTL_RSC_DIS | IXGBE_RFCTL_NFSW_DIS |
> > +				   IXGBE_RFCTL_NFSR_DIS);
> > +		} else {
> > +			rfctl |= IXGBE_RFCTL_RSC_DIS;
> > +		}
> > +
> > +		IXGBE_WRITE_REG(hw, IXGBE_RFCTL, rfctl);
> > +	}
> > +
> > +
> > +	/*
> >   	 * Configure CRC stripping, if any.
> >   	 */
> >   	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
> >   	if (rx_conf->hw_strip_crc)
> >   		hlreg0 |= IXGBE_HLREG0_RXCRCSTRP;
> > -	else
> > +	else {
> >   		hlreg0 &= ~IXGBE_HLREG0_RXCRCSTRP;
> > +		if (rx_conf->enable_lro) {
> > +			/*
> > +			 * According to chapter 4.6.7.2.1 of the Spec Rev.
> > +			 * 3.0, RSC configuration requires HW CRC stripping to be
> > +			 * enabled. If user requested both HW CRC stripping off
> > +			 * and RSC on - return an error.
> > +			 */
> > +			PMD_INIT_LOG(CRIT, "LRO can't be enabled when HW CRC "
> > +					    "is disabled");
> > +			return -EINVAL;
> > +		}
> > +	}
> >
> >   	/*
> >   	 * Configure jumbo frame support, if any.
> > @@ -3664,9 +4128,18 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >   		 * Configure Header Split
> >   		 */
> >   		if (rx_conf->header_split) {
> > +			/*
> > +			 * Print a warning if split_hdr_size is less
> > +			 * than 128 bytes when RSC is requested.
> > +			 */
> > +			if (rx_conf->enable_lro &&
> > +			    rx_conf->split_hdr_size < 128)
> > +				PMD_INIT_LOG(INFO, "split_hdr_size less than "
> > +						   "128 bytes (%d)!",
> > +					     rx_conf->split_hdr_size);
> > +
> >   			if (hw->mac.type == ixgbe_mac_82599EB) {
> >   				/* Must setup the PSRTYPE register */
> > -				uint32_t psrtype;
> >   				psrtype = IXGBE_PSRTYPE_TCPHDR |
> >   					IXGBE_PSRTYPE_UDPHDR   |
> >   					IXGBE_PSRTYPE_IPV4HDR  |
> > @@ -3679,7 +4152,20 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >   			srrctl |= IXGBE_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
> >   		} else
> >   #endif
> > +		{
> >   			srrctl = IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF;
> > +			/*
> > +			 * Following the 4.6.7.2.1 chapter of the 82599/x540
> > +			 * Spec if RSC is enabled the SRRCTL[n].BSIZEHEADER
> > +			 * should be configured even if header split is not
> > +			 * enabled. In the latter case we will configure it to 128
> > +			 * bytes following the recommendation in the spec.
> > +			 */
> > +			if (rx_conf->enable_lro)
> > +				srrctl |=
> > +				     ((128 << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &
> > +						    IXGBE_SRRCTL_BSIZEHDR_MASK);
> > +		}
> >
> >   		/* Set if packets are dropped when no descriptors available */
> >   		if (rxq->drop_en)
> > @@ -3696,6 +4182,13 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >   				       RTE_PKTMBUF_HEADROOM);
> >   		srrctl |= ((buf_size >> IXGBE_SRRCTL_BSIZEPKT_SHIFT) &
> >   			   IXGBE_SRRCTL_BSIZEPKT_MASK);
> > +
> > +		/*
> > +		 * TODO: Consider setting the Receive Descriptor Minimum
> > +		 * Threshold Size for the RSC case. This is not an obviously
> > +		 * beneficial option but one worth considering...
> > +		 */
> > +
> >   		IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(rxq->reg_idx), srrctl);
> >
> >   		buf_size = (uint16_t) ((srrctl & IXGBE_SRRCTL_BSIZEPKT_MASK) <<
> > @@ -3705,11 +4198,57 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >   		if (dev->data->dev_conf.rxmode.max_rx_pkt_len +
> >   					    2 * IXGBE_VLAN_TAG_SIZE > buf_size)
> >   			dev->data->scattered_rx = 1;
> > +
> > +		/* RSC per-queue configuration */
> > +		if (rx_conf->enable_lro) {
> > +			uint32_t eitr;
> > +
> > +			rscctl =
> > +				IXGBE_READ_REG(hw, IXGBE_RSCCTL(rxq->reg_idx));
> > +			psrtype =
> > +				IXGBE_READ_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx));
> > +			eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
> > +
> > +			rscctl |= IXGBE_RSCCTL_RSCEN;
> > +			rscctl |= get_rscctl_maxdesc(rxq->mb_pool);
> > +			psrtype |= IXGBE_PSRTYPE_TCPHDR;
> > +
> > +			/*
> > +			 * RSC: Set ITR interval corresponding to 2K ints/s.
> > +			 *
> > +			 * Full-sized RSC aggregations for a 10Gb/s link will
> > +			 * arrive at about 20K aggregation/s rate.
> > +			 *
> > +			 * 2K ints/s rate will cause only 10% of the
> > +			 * aggregations to be closed due to the interrupt timer
> > +			 * expiration for a streaming at wire-speed case.
> > +			 *
> > +			 * For a sparse streaming case this setting will yield
> > +			 * at most 500us latency for a single RSC aggregation.
> > +			 */
> > +			eitr   |= (2000 | IXGBE_EITR_CNT_WDIS);
> > +
> > +			IXGBE_WRITE_REG(hw, IXGBE_RSCCTL(rxq->reg_idx), rscctl);
> > +			IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx),
> > +								       psrtype);
> > +			IXGBE_WRITE_REG(hw, IXGBE_EITR(rxq->reg_idx), eitr);
> > +
> > +			/*
> > +			 * RSC requires the mapping of the queue to the
> > +			 * interrupt vector.
> > +			 */
> > +			ixgbe_set_ivar(dev, rxq->reg_idx, i, 0);
> > +
> > +			rxq->rsc_en = 1;
> > +		}
> >   	}
> >
> >   	if (rx_conf->enable_scatter)
> >   		dev->data->scattered_rx = 1;
> >
> > +	if (rx_conf->enable_lro)
> > +		dev->data->lro = 1;
> > +
> >   	set_rx_function(dev);
> >
> >   	/*
> > @@ -3742,6 +4281,19 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
> >   		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
> >   	}
> >
> > +	/* Finalize RSC configuration  */
> > +	if (rx_conf->enable_lro) {
> > +		/*
> > +		 * Follow the instructions in the 4.6.7.2.1 of the Spec Rev. 3.0
> > +		 */
> > +		rdrxctl = IXGBE_READ_REG(hw, IXGBE_RDRXCTL);
> > +		rdrxctl |= IXGBE_RDRXCTL_RSCACKC;
> > +		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
> > +
> > +		PMD_INIT_LOG(INFO, "enabling LRO mode");
> > +	}
> > +
> > +
> >   	return 0;
> >   }
> 
> I've just noticed that RTE_HEADER_SPLIT_ENABLE used in
> ixgbe_dev_rx_init() is not enabled anywhere: neither in config_XXX files
> nor anywhere else.
> It looks like this macro is used only in ixgbe_rxtx.c. Seems like this
> is some sort of legacy leftover, isn't it?

Yes, I believe so.
I presume there was an attempt a while ago to support Header Split RX functionality.
Though, as far as I am aware, right now our ixgbe PMD doesn't support it.
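
To illustrate the point (a hedged sketch only, not the actual ixgbe sources;
the constants and the helper name are made up): since RTE_HEADER_SPLIT_ENABLE
is referenced in ixgbe_rxtx.c but, as noted above, never defined by any config
file, the preprocessor drops the header-split branch and only the one-buffer
descriptor type can ever be selected:

#include <stdint.h>

#define DESCTYPE_ADV_ONEBUF		1	/* placeholder value */
#define DESCTYPE_HDR_SPLIT_ALWAYS	2	/* placeholder value */

static uint32_t
pick_desctype(int header_split_requested)
{
#ifdef RTE_HEADER_SPLIT_ENABLE
	/* header-split RX path -- never built today */
	if (header_split_requested)
		return DESCTYPE_HDR_SPLIT_ALWAYS;
#else
	(void)header_split_requested;	/* the request is silently ignored */
#endif
	return DESCTYPE_ADV_ONEBUF;
}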

Konstantin

> 
> Konstantin, could you please comment?
> 
> >
> > diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> > index bbe5ff3..389173f 100644
> > --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> > +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
> > @@ -79,6 +79,10 @@ struct igb_rx_entry {
> >   	struct rte_mbuf *mbuf; /**< mbuf associated with RX descriptor. */
> >   };
> >
> > +struct igb_rsc_entry {
> > +	struct rte_mbuf *fbuf; /**< First segment of the fragmented packet. */
> > +};
> > +
> >   /**
> >    * Structure associated with each descriptor of the TX ring of a TX queue.
> >    */
> > @@ -105,6 +109,7 @@ struct igb_rx_queue {
> >   	volatile uint32_t   *rdt_reg_addr; /**< RDT register address. */
> >   	volatile uint32_t   *rdh_reg_addr; /**< RDH register address. */
> >   	struct igb_rx_entry *sw_ring; /**< address of RX software ring. */
> > +	struct igb_rsc_entry *sw_rsc_ring; /**< address of RSC software ring. */
> >   	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet. */
> >   	struct rte_mbuf *pkt_last_seg; /**< Last segment of current packet. */
> >   	uint64_t            mbuf_initializer; /**< value to init mbufs */
> > @@ -126,6 +131,7 @@ struct igb_rx_queue {
> >   	uint8_t             port_id;  /**< Device port identifier. */
> >   	uint8_t             crc_len;  /**< 0 if CRC stripped, 4 otherwise. */
> >   	uint8_t             drop_en;  /**< If not 0, set SRRCTL.Drop_En. */
> > +	uint8_t             rsc_en;   /**< If not 0, RSC is enabled. */
> >   	uint8_t             rx_deferred_start; /**< not in global dev start. */
> >   #ifdef RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC
> >   	/** need to alloc dummy mbuf, for wraparound when scanning hw ring */

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
  2015-03-18  0:31     ` Ananyev, Konstantin
@ 2015-03-18 10:29       ` Vlad Zolotarov
  0 siblings, 0 replies; 18+ messages in thread
From: Vlad Zolotarov @ 2015-03-18 10:29 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev



On 03/18/15 02:31, Ananyev, Konstantin wrote:
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Vlad Zolotarov
>> Sent: Monday, March 16, 2015 6:27 PM
>> To: dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support
>>
>>
>>
>> On 03/09/15 21:07, Vlad Zolotarov wrote:
>>>       - Only x540 and 82599 devices support LRO.
>>>       - Add the appropriate HW configuration.
>>>       - Add RSC aware rx_pkt_burst() handlers:
>>>          - Implemented bulk allocation and non-bulk allocation versions.
>>>          - Add LRO-specific fields to rte_eth_rxmode, to rte_eth_dev_data
>>>            and to igb_rx_queue.
>>>          - Use the appropriate handler when LRO is requested.
>>>
>>> Signed-off-by: Vlad Zolotarov <vladz@cloudius-systems.com>
>>> ---
>>> New in v5:
>>>      - Put the RTE_ETHDEV_HAS_LRO_SUPPORT definition at the beginning of rte_ethdev.h.
>>>      - Removed the "TODO: Remove me" comment near RTE_ETHDEV_HAS_LRO_SUPPORT.
>>>
>>> New in v4:
>>>      - Define RTE_ETHDEV_HAS_LRO_SUPPORT in rte_ethdev.h instead of
>>>        RTE_ETHDEV_LRO_SUPPORT defined in config/common_linuxapp.
>>>
>>> New in v2:
>>>      - Removed rte_eth_dev_data.lro_bulk_alloc.
>>>      - Fixed a few styling and spelling issues.
>>> ---
>>>    lib/librte_ether/rte_ethdev.h       |   9 +-
>>>    lib/librte_pmd_ixgbe/ixgbe_ethdev.c |   6 +
>>>    lib/librte_pmd_ixgbe/ixgbe_ethdev.h |   5 +
>>>    lib/librte_pmd_ixgbe/ixgbe_rxtx.c   | 562 +++++++++++++++++++++++++++++++++++-
>>>    lib/librte_pmd_ixgbe/ixgbe_rxtx.h   |   6 +
>>>    5 files changed, 581 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
>>> index 8db3127..44f081f 100644
>>> --- a/lib/librte_ether/rte_ethdev.h
>>> +++ b/lib/librte_ether/rte_ethdev.h
>>> @@ -172,6 +172,9 @@ extern "C" {
>>>
>>>    #include <stdint.h>
>>>
>>> +/* Use this macro to check if LRO API is supported */
>>> +#define RTE_ETHDEV_HAS_LRO_SUPPORT
>>> +
>>>    #include <rte_log.h>
>>>    #include <rte_interrupts.h>
>>>    #include <rte_pci.h>
>>> @@ -320,14 +323,15 @@ struct rte_eth_rxmode {
>>>    	enum rte_eth_rx_mq_mode mq_mode;
>>>    	uint32_t max_rx_pkt_len;  /**< Only used if jumbo_frame enabled. */
>>>    	uint16_t split_hdr_size;  /**< hdr buf size (header_split enabled).*/
>>> -	uint8_t header_split : 1, /**< Header Split enable. */
>>> +	uint16_t header_split : 1, /**< Header Split enable. */
>>>    		hw_ip_checksum   : 1, /**< IP/UDP/TCP checksum offload enable. */
>>>    		hw_vlan_filter   : 1, /**< VLAN filter enable. */
>>>    		hw_vlan_strip    : 1, /**< VLAN strip enable. */
>>>    		hw_vlan_extend   : 1, /**< Extended VLAN enable. */
>>>    		jumbo_frame      : 1, /**< Jumbo Frame Receipt enable. */
>>>    		hw_strip_crc     : 1, /**< Enable CRC stripping by hardware. */
>>> -		enable_scatter   : 1; /**< Enable scatter packets rx handler */
>>> +		enable_scatter   : 1, /**< Enable scatter packets rx handler */
>>> +		enable_lro       : 1; /**< Enable LRO */
>>>    };
>>>
>>>    /**
>>> @@ -1515,6 +1519,7 @@ struct rte_eth_dev_data {
>>>    	uint8_t port_id;           /**< Device [external] port identifier. */
>>>    	uint8_t promiscuous   : 1, /**< RX promiscuous mode ON(1) / OFF(0). */
>>>    		scattered_rx : 1,  /**< RX of scattered packets is ON(1) / OFF(0) */
>>> +		lro          : 1,  /**< RX LRO is ON(1) / OFF(0) */
>>>    		all_multicast : 1, /**< RX all multicast mode ON(1) / OFF(0). */
>>>    		dev_started : 1;   /**< Device state: STARTED(1) / STOPPED(0). */
>>>    };
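
A minimal sketch of how an application would consume the new flag (the
single-queue configure call and the port_id/port_conf names are illustrative,
not taken from the patch):

	struct rte_eth_dev_info dev_info;
	struct rte_eth_conf port_conf;

	memset(&port_conf, 0, sizeof(port_conf));
	rte_eth_dev_info_get(port_id, &dev_info);

	if (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO)
		port_conf.rxmode.enable_lro = 1;

	/* RSC requires HW CRC stripping - see the ixgbe_dev_rx_init() check */
	port_conf.rxmode.hw_strip_crc = 1;
	rte_eth_dev_configure(port_id, 1, 1, &port_conf);
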
>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>>> index 9d3de1a..765174d 100644
>>> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.c
>>> @@ -1648,6 +1648,7 @@ ixgbe_dev_stop(struct rte_eth_dev *dev)
>>>
>>>    	/* Clear stored conf */
>>>    	dev->data->scattered_rx = 0;
>>> +	dev->data->lro = 0;
>>>    	hw->rx_bulk_alloc_allowed = false;
>>>    	hw->rx_vec_allowed = false;
>>>
>>> @@ -2018,6 +2019,11 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
>>>    		DEV_RX_OFFLOAD_IPV4_CKSUM |
>>>    		DEV_RX_OFFLOAD_UDP_CKSUM  |
>>>    		DEV_RX_OFFLOAD_TCP_CKSUM;
>>> +
>>> +	if (hw->mac.type == ixgbe_mac_82599EB ||
>>> +	    hw->mac.type == ixgbe_mac_X540)
>>> +		dev_info->rx_offload_capa |= DEV_RX_OFFLOAD_TCP_LRO;
>>> +
>>>    	dev_info->tx_offload_capa =
>>>    		DEV_TX_OFFLOAD_VLAN_INSERT |
>>>    		DEV_TX_OFFLOAD_IPV4_CKSUM  |
>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>>> index a549f5c..e206584 100644
>>> --- a/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_ethdev.h
>>> @@ -349,6 +349,11 @@ uint16_t ixgbe_recv_pkts_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
>>>    uint16_t ixgbe_recv_scattered_pkts(void *rx_queue,
>>>    		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>>>
>>> +uint16_t ixgbe_recv_pkts_lro(void *rx_queue,
>>> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>>> +uint16_t ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue,
>>> +		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
>>> +
>>>    uint16_t ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
>>>    		uint16_t nb_pkts);
>>>
>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>>> index 58e619b..944c662 100644
>>> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
>>> @@ -1366,6 +1366,15 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>>>    }
>>>
>>>    /**
>>> + * Detect an RSC descriptor.
>>> + */
>>> +static inline uint32_t ixgbe_rsc_count(union ixgbe_adv_rx_desc *rx)
>>> +{
>>> +	return (rte_le_to_cpu_32(rx->wb.lower.lo_dword.data) &
>>> +		IXGBE_RXDADV_RSCCNT_MASK) >> IXGBE_RXDADV_RSCCNT_SHIFT;
>>> +}
>>> +
>>> +/**
>>>     * Initialize the first mbuf of the returned packet:
>>>     *    - RX port identifier,
>>>     *    - hardware offload data, if any:
>>> @@ -1410,6 +1419,291 @@ static inline void ixgbe_fill_cluster_head_buf(
>>>    	}
>>>    }
>>>
>>> +/**
>>> + * Bulk receive handler for the LRO case.
>>> + *
>>> + * @rx_queue Rx queue handle
>>> + * @rx_pkts table of received packets
>>> + * @nb_pkts size of rx_pkts table
>>> + * @bulk_alloc if TRUE, bulk allocation is used for HW ring refilling
>>> + *
>>> + * Handles the Rx HW ring completions when the RSC feature is configured. Uses an
>>> + * additional ring of igb_rsc_entry's that will hold the relevant RSC info.
>>> + *
>>> + * We use the same logic as in the Linux and FreeBSD ixgbe drivers:
>>> + * 1) When non-EOP RSC completion arrives:
>>> + *    a) Update the HEAD of the current RSC aggregation cluster with the new
>>> + *       segment's data length.
>>> + *    b) Set the "next" pointer of the current segment to point to the segment
>>> + *       at the NEXTP index.
>>> + *    c) Pass the HEAD of RSC aggregation cluster on to the next NEXTP entry
>>> + *       in the sw_rsc_ring.
>>> + * 2) When EOP arrives we just update the cluster's total length and offload
>>> + *    flags and deliver the cluster up to the upper layers. In our case - put it
>>> + *    in the rx_pkts table.
>>> + *
>>> + * Returns the number of received packets/clusters (according to the "bulk
>>> + * receive" interface).
>>> + */
>>> +static inline uint16_t
>>> +_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts,
>>> +	       bool bulk_alloc)
>>> +{
>>> +	struct igb_rx_queue *rxq = rx_queue;
>>> +	volatile union ixgbe_adv_rx_desc *rx_ring = rxq->rx_ring;
>>> +	struct igb_rx_entry *sw_ring = rxq->sw_ring;
>>> +	struct igb_rsc_entry *sw_rsc_ring = rxq->sw_rsc_ring;
>>> +	uint16_t rx_id = rxq->rx_tail;
>>> +	uint16_t nb_rx = 0;
>>> +	uint16_t nb_hold = rxq->nb_rx_hold;
>>> +	uint16_t prev_id = rxq->rx_tail;
>>> +
>>> +	while (nb_rx < nb_pkts) {
>>> +		bool eop;
>>> +		struct igb_rx_entry *rxe;
>>> +		struct igb_rsc_entry *rsc_entry;
>>> +		struct igb_rsc_entry *next_rsc_entry;
>>> +		struct igb_rx_entry *next_rxe;
>>> +		struct rte_mbuf *first_seg;
>>> +		struct rte_mbuf *rxm;
>>> +		struct rte_mbuf *nmb;
>>> +		union ixgbe_adv_rx_desc rxd;
>>> +		uint16_t data_len;
>>> +		uint16_t next_id;
>>> +		volatile union ixgbe_adv_rx_desc *rxdp;
>>> +		uint32_t staterr;
>>> +
>>> +next_desc:
>>> +		/*
>>> +		 * The code in this whole file uses the volatile pointer to
>>> +		 * ensure the read ordering of the status and the rest of the
>>> +		 * descriptor fields (on the compiler level only!!!). This is so
>>> +		 * UGLY - why not just use the compiler barrier instead? DPDK
>>> +		 * even has the rte_compiler_barrier() for that.
>>> +		 *
>>> +		 * But most importantly this is just wrong because this doesn't
>>> +		 * ensure memory ordering in a general case at all. For
>>> +		 * instance, DPDK is supposed to work on Power CPUs where
>>> +		 * compiler barrier may just not be enough!
>>> +		 *
>>> +		 * I tried to write only this function properly to have a
>>> +		 * starting point (as a part of an LRO/RSC series) but the
>>> +		 * compiler cursed at me when I tried to cast away the
>>> +		 * "volatile" from rx_ring (yes, it's volatile too!!!). So, I'm
>>> +		 * keeping it the way it is for now.
>>> +		 *
>>> +		 * The code in this file is broken in so many other places and
>>> +		 * will just not work on a big endian CPU anyway; therefore, the
>>> +		 * lines below will have to be revisited together with the rest
>>> +		 * of the ixgbe PMD.
>>> +		 *
>>> +		 * TODO:
>>> +		 *    - Get rid of "volatile" crap and let the compiler do its
>>> +		 *      job.
>>> +		 *    - Use the proper memory barrier (rte_rmb()) to ensure the
>>> +		 *      memory ordering below.
>>> +		 */
>>> +		rxdp = &rx_ring[rx_id];
>>> +		staterr = rte_le_to_cpu_32(rxdp->wb.upper.status_error);
>>> +
>>> +		if (!(staterr & IXGBE_RXDADV_STAT_DD))
>>> +			break;
>>> +
>>> +		rxd = *rxdp;
>>> +
>>> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_id=%u "
>>> +				  "staterr=0x%x data_len=%u",
>>> +			   rxq->port_id, rxq->queue_id, rx_id, staterr,
>>> +			   rte_le_to_cpu_16(rxd.wb.upper.length));
>>> +
>>> +		if (!bulk_alloc) {
>>> +			nmb = rte_rxmbuf_alloc(rxq->mb_pool);
>>> +			if (nmb == NULL) {
>>> +				PMD_RX_LOG(DEBUG, "RX mbuf alloc failed "
>>> +						  "port_id=%u queue_id=%u",
>>> +					   rxq->port_id, rxq->queue_id);
>>> +
>>> +				rte_eth_devices[rxq->port_id].data->
>>> +							rx_mbuf_alloc_failed++;
>>> +				break;
>>> +			}
>>> +		} else if (nb_hold > rxq->rx_free_thresh) {
>>> +			uint16_t next_rdt = rxq->rx_free_trigger;
>>> +
>>> +			if (!ixgbe_rx_alloc_bufs(rxq, false)) {
>>> +				rte_wmb();
>>> +				IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr,
>>> +						    next_rdt);
>>> +				nb_hold -= rxq->rx_free_thresh;
>>> +			} else {
>>> +				PMD_RX_LOG(DEBUG, "RX bulk alloc failed "
>>> +						  "port_id=%u queue_id=%u",
>>> +					   rxq->port_id, rxq->queue_id);
>>> +
>>> +				rte_eth_devices[rxq->port_id].data->
>>> +							rx_mbuf_alloc_failed++;
>>> +				break;
>>> +			}
>>> +		}
>>> +
>>> +		nb_hold++;
>>> +		rxe = &sw_ring[rx_id];
>>> +		eop = staterr & IXGBE_RXDADV_STAT_EOP;
>>> +
>>> +		next_id = rx_id + 1;
>>> +		if (next_id == rxq->nb_rx_desc)
>>> +			next_id = 0;
>>> +
>>> +		/* Prefetch next mbuf while processing current one. */
>>> +		rte_ixgbe_prefetch(sw_ring[next_id].mbuf);
>>> +
>>> +		/*
>>> +		 * When next RX descriptor is on a cache-line boundary,
>>> +		 * prefetch the next 4 RX descriptors and the next 4 pointers
>>> +		 * to mbufs.
>>> +		 */
>>> +		if ((next_id & 0x3) == 0) {
>>> +			rte_ixgbe_prefetch(&rx_ring[next_id]);
>>> +			rte_ixgbe_prefetch(&sw_ring[next_id]);
>>> +		}
>>> +
>>> +		rxm = rxe->mbuf;
>>> +
>>> +		if (!bulk_alloc) {
>>> +			__le64 dma =
>>> +			  rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(nmb));
>>> +			/*
>>> +			 * Update RX descriptor with the physical address of the
>>> +			 * new data buffer of the new allocated mbuf.
>>> +			 */
>>> +			rxe->mbuf = nmb;
>>> +
>>> +			rxm->data_off = RTE_PKTMBUF_HEADROOM;
>>> +			rxdp->read.hdr_addr = dma;
>>> +			rxdp->read.pkt_addr = dma;
>>> +		}
>>> +		/*
>>> +		 * Set data length & data buffer address of mbuf.
>>> +		 */
>>> +		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
>>> +		rxm->data_len = data_len;
>>> +
>>> +		if (!eop) {
>>> +			uint16_t nextp_id;
>>> +			/*
>>> +			 * Get next descriptor index:
>>> +			 *  - For RSC it's in the NEXTP field.
>>> +			 *  - For a scattered packet - it's just a following
>>> +			 *    descriptor.
>>> +			 */
>>> +			if (ixgbe_rsc_count(&rxd))
>>> +				nextp_id =
>>> +					(staterr & IXGBE_RXDADV_NEXTP_MASK) >>
>>> +						       IXGBE_RXDADV_NEXTP_SHIFT;
>>> +			else
>>> +				nextp_id = next_id;
>>> +
>>> +			next_rsc_entry = &sw_rsc_ring[nextp_id];
>>> +			next_rxe = &sw_ring[nextp_id];
>>> +			rte_ixgbe_prefetch(next_rxe);
>>> +		}
>>> +
>>> +		rsc_entry = &sw_rsc_ring[rx_id];
>>> +		first_seg = rsc_entry->fbuf;
>>> +		rsc_entry->fbuf = NULL;
>>> +
>>> +		/*
>>> +		 * If this is the first buffer of the received packet,
>>> +		 * set the pointer to the first mbuf of the packet and
>>> +		 * initialize its context.
>>> +		 * Otherwise, update the total length and the number of segments
>>> +		 * of the current scattered packet, and update the pointer to
>>> +		 * the last mbuf of the current packet.
>>> +		 */
>>> +		if (first_seg == NULL) {
>>> +			first_seg = rxm;
>>> +			first_seg->pkt_len = data_len;
>>> +			first_seg->nb_segs = 1;
>>> +		} else {
>>> +			first_seg->pkt_len += data_len;
>>> +			first_seg->nb_segs++;
>>> +		}
>>> +
>>> +		prev_id = rx_id;
>>> +		rx_id = next_id;
>>> +
>>> +		/*
>>> +		 * If this is not the last buffer of the received packet, update
>>> +		 * the pointer to the first mbuf at the NEXTP entry in the
>>> +		 * sw_rsc_ring and continue to parse the RX ring.
>>> +		 */
>>> +		if (!eop) {
>>> +			rxm->next = next_rxe->mbuf;
>>> +			next_rsc_entry->fbuf = first_seg;
>>> +			goto next_desc;
>>> +		}
>>> +
>>> +		/*
>>> +		 * This is the last buffer of the received packet - return
>>> +		 * the current cluster to the user.
>>> +		 */
>>> +		rxm->next = NULL;
>>> +
>>> +		/* Initialize the first mbuf of the returned packet */
>>> +		ixgbe_fill_cluster_head_buf(first_seg, &rxd, rxq->port_id,
>>> +					    staterr);
>>> +
>>> +		/* Prefetch data of first segment, if configured to do so. */
>>> +		rte_packet_prefetch((char *)first_seg->buf_addr +
>>> +			first_seg->data_off);
>>> +
>>> +		/*
>>> +		 * Store the mbuf address into the next entry of the array
>>> +		 * of returned packets.
>>> +		 */
>>> +		rx_pkts[nb_rx++] = first_seg;
>>> +	}
>>> +
>>> +	/*
>>> +	 * Record index of the next RX descriptor to probe.
>>> +	 */
>>> +	rxq->rx_tail = rx_id;
>>> +
>>> +	/*
>>> +	 * If the number of free RX descriptors is greater than the RX free
>>> +	 * threshold of the queue, advance the Receive Descriptor Tail (RDT)
>>> +	 * register.
>>> +	 * Update the RDT with the value of the last processed RX descriptor
>>> +	 * minus 1, to guarantee that the RDT register is never equal to the
>>> +	 * RDH register, which creates a "full" ring situation from the
>>> +	 * hardware point of view...
>>> +	 */
>>> +	if (!bulk_alloc && nb_hold > rxq->rx_free_thresh) {
>>> +		PMD_RX_LOG(DEBUG, "port_id=%u queue_id=%u rx_tail=%u "
>>> +			   "nb_hold=%u nb_rx=%u",
>>> +			   rxq->port_id, rxq->queue_id, rx_id, nb_hold, nb_rx);
>>> +
>>> +		IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, prev_id);
>>> +		nb_hold = 0;
>>> +	}
>>> +
>>> +	rxq->nb_rx_hold = nb_hold;
>>> +	return nb_rx;
>>> +}
>>> +
>>> +uint16_t
>>> +ixgbe_recv_pkts_lro(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
>>> +{
>>> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, false);
>>> +}
>>> +
>>> +uint16_t
>>> +ixgbe_recv_pkts_lro_bulk_alloc(void *rx_queue, struct rte_mbuf **rx_pkts,
>>> +			       uint16_t nb_pkts)
>>> +{
>>> +	return _recv_pkts_lro(rx_queue, rx_pkts, nb_pkts, true);
>>> +}
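
As a usage sketch (BURST_SIZE, port_id and queue_id are assumed to be defined
by the application; handle_rx_packet() is a hypothetical application callback),
the handlers above plug into the usual burst API and each returned mbuf is the
head of a possibly multi-segment RSC cluster:

	struct rte_mbuf *pkts[BURST_SIZE];
	uint16_t i, nb;

	nb = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SIZE);
	for (i = 0; i < nb; i++) {
		/* pkt_len covers the whole cluster; data_len only the head */
		handle_rx_packet(pkts[i]);
		rte_pktmbuf_free(pkts[i]);	/* frees the whole segment chain */
	}
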
>>> +
>>>    uint16_t
>>>    ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
>>>    			  uint16_t nb_pkts)
>>> @@ -2024,6 +2318,7 @@ ixgbe_rx_queue_release(struct igb_rx_queue *rxq)
>>>    	if (rxq != NULL) {
>>>    		ixgbe_rx_queue_release_mbufs(rxq);
>>>    		rte_free(rxq->sw_ring);
>>> +		rte_free(rxq->sw_rsc_ring);
>>>    		rte_free(rxq);
>>>    	}
>>>    }
>>> @@ -2146,6 +2441,7 @@ ixgbe_reset_rx_queue(struct ixgbe_hw *hw, struct igb_rx_queue *rxq)
>>>    	rxq->nb_rx_hold = 0;
>>>    	rxq->pkt_first_seg = NULL;
>>>    	rxq->pkt_last_seg = NULL;
>>> +	rxq->rsc_en = 0;
>>>    }
>>>
>>>    int
>>> @@ -2160,6 +2456,14 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
>>>    	struct igb_rx_queue *rxq;
>>>    	struct ixgbe_hw     *hw;
>>>    	uint16_t len;
>>> +	struct rte_eth_dev_info dev_info = { 0 };
>>> +	struct rte_eth_rxmode *dev_rx_mode = &dev->data->dev_conf.rxmode;
>>> +	bool rsc_requested = false;
>>> +
>>> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
>>> +	if ((dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO) &&
>>> +	    dev_rx_mode->enable_lro)
>>> +		rsc_requested = true;
>>>
>>>    	PMD_INIT_FUNC_TRACE();
>>>    	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>> @@ -2265,12 +2569,28 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
>>>    	rxq->sw_ring = rte_zmalloc_socket("rxq->sw_ring",
>>>    					  sizeof(struct igb_rx_entry) * len,
>>>    					  RTE_CACHE_LINE_SIZE, socket_id);
>>> -	if (rxq->sw_ring == NULL) {
>>> +	if (!rxq->sw_ring) {
>>>    		ixgbe_rx_queue_release(rxq);
>>>    		return (-ENOMEM);
>>>    	}
>>> -	PMD_INIT_LOG(DEBUG, "sw_ring=%p hw_ring=%p dma_addr=0x%"PRIx64,
>>> -		     rxq->sw_ring, rxq->rx_ring, rxq->rx_ring_phys_addr);
>>> +
>>> +	if (rsc_requested) {
>>> +		rxq->sw_rsc_ring =
>>> +			rte_zmalloc_socket("rxq->sw_rsc_ring",
>>> +					   sizeof(struct igb_rsc_entry) * len,
>>> +					   RTE_CACHE_LINE_SIZE, socket_id);
>>> +		if (!rxq->sw_rsc_ring) {
>>> +			ixgbe_rx_queue_release(rxq);
>>> +			return (-ENOMEM);
>>> +		}
>>> +	} else {
>>> +		rxq->sw_rsc_ring = NULL;
>>> +	}
>>> +
>>> +	PMD_INIT_LOG(DEBUG, "sw_ring=%p sw_rsc_ring=%p hw_ring=%p "
>>> +			    "dma_addr=0x%"PRIx64,
>>> +		     rxq->sw_ring, rxq->sw_rsc_ring, rxq->rx_ring,
>>> +		     rxq->rx_ring_phys_addr);
>>>
>>>    	if (!rte_is_power_of_2(nb_desc)) {
>>>    		PMD_INIT_LOG(DEBUG, "queue[%d] doesn't meet Vector Rx "
>>> @@ -3515,6 +3835,84 @@ ixgbe_dev_mq_tx_configure(struct rte_eth_dev *dev)
>>>    	return 0;
>>>    }
>>>
>>> +/**
>>> + * get_rscctl_maxdesc - Calculate the RSCCTL[n].MAXDESC for PF
>>> + *
>>> + * Return the RSCCTL[n].MAXDESC for 82599 and x540 PF devices according to the
>>> + * spec rev. 3.0 chapter 8.2.3.8.13.
>>> + *
>>> + * @pool Memory pool of the Rx queue
>>> + */
>>> +static inline uint32_t get_rscctl_maxdesc(struct rte_mempool *pool)
>>> +{
>>> +	struct rte_pktmbuf_pool_private *mp_priv = rte_mempool_get_priv(pool);
>>> +
>>> +	/* MAXDESC * SRRCTL.BSIZEPKT must not exceed 64 KB minus one */
>>> +	uint16_t maxdesc =
>>> +		65535 / (mp_priv->mbuf_data_room_size - RTE_PKTMBUF_HEADROOM);
>>> +
>>> +	if (maxdesc >= 16)
>>> +		return IXGBE_RSCCTL_MAXDESC_16;
>>> +	else if (maxdesc >= 8)
>>> +		return IXGBE_RSCCTL_MAXDESC_8;
>>> +	else if (maxdesc >= 4)
>>> +		return IXGBE_RSCCTL_MAXDESC_4;
>>> +	else
>>> +		return IXGBE_RSCCTL_MAXDESC_1;
>>> +}
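
To put concrete numbers on the arithmetic above (the pool sizes are
illustrative, assuming the usual pool layout of data room plus headroom):

	/* mbuf_data_room_size = 2048 + RTE_PKTMBUF_HEADROOM (128):
	 *     maxdesc = 65535 / 2048 = 31  ->  IXGBE_RSCCTL_MAXDESC_16,
	 *     i.e. a cluster of at most 16 * 2048 = 32 KB (< 64 KB - 1).
	 * 16 KB data buffers:
	 *     maxdesc = 65535 / 16384 = 3   ->  IXGBE_RSCCTL_MAXDESC_1. */
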
>>> +
>>> +/* (Taken from FreeBSD tree)
>>> +** Setup the correct IVAR register for a particular MSIX interrupt
>>> +**   (yes this is all very magic and confusing :)
>>> +**  - entry is the register array entry
>>> +**  - vector is the MSIX vector for this queue
>>> +**  - type is RX/TX/MISC
>>> +*/
>>> +static void
>>> +ixgbe_set_ivar(struct rte_eth_dev *dev, u8 entry, u8 vector, s8 type)
>>> +{
>>> +	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>> +	u32 ivar, index;
>>> +
>>> +	vector |= IXGBE_IVAR_ALLOC_VAL;
>>> +
>>> +	switch (hw->mac.type) {
>>> +
>>> +	case ixgbe_mac_82598EB:
>>> +		if (type == -1)
>>> +			entry = IXGBE_IVAR_OTHER_CAUSES_INDEX;
>>> +		else
>>> +			entry += (type * 64);
>>> +		index = (entry >> 2) & 0x1F;
>>> +		ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(index));
>>> +		ivar &= ~(0xFF << (8 * (entry & 0x3)));
>>> +		ivar |= (vector << (8 * (entry & 0x3)));
>>> +		IXGBE_WRITE_REG(hw, IXGBE_IVAR(index), ivar);
>>> +		break;
>>> +
>>> +	case ixgbe_mac_82599EB:
>>> +	case ixgbe_mac_X540:
>>> +		if (type == -1) { /* MISC IVAR */
>>> +			index = (entry & 1) * 8;
>>> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR_MISC);
>>> +			ivar &= ~(0xFF << index);
>>> +			ivar |= (vector << index);
>>> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR_MISC, ivar);
>>> +		} else {	/* RX/TX IVARS */
>>> +			index = (16 * (entry & 1)) + (8 * type);
>>> +			ivar = IXGBE_READ_REG(hw, IXGBE_IVAR(entry >> 1));
>>> +			ivar &= ~(0xFF << index);
>>> +			ivar |= (vector << index);
>>> +			IXGBE_WRITE_REG(hw, IXGBE_IVAR(entry >> 1), ivar);
>>> +		}
>>> +
>>> +		break;
>>> +
>>> +	default:
>>> +		break;
>>> +	}
>>> +}
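
A worked example of the IVAR indexing above for the 82599/x540 branch (the
queue number is illustrative):

	/* RX queue with reg_idx 5, type == 0:
	 *     index = 16 * (5 & 1) + 8 * 0 = 16,
	 * so the vector (with IXGBE_IVAR_ALLOC_VAL set) is written into bits
	 * 23:16 of IVAR(5 >> 1) == IVAR(2), i.e. the RX slot of the odd
	 * queue of that register pair. */
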
>>> +
>>>    void set_rx_function(struct rte_eth_dev *dev)
>>>    {
>>>    	struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>> @@ -3565,6 +3963,25 @@ void set_rx_function(struct rte_eth_dev *dev)
>>>    			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
>>>    		}
>>>    	}
>>> +
>>> +	/*
>>> +	 * Initialize the appropriate LRO callback.
>>> +	 *
>>> +	 * If all queues satisfy the bulk allocation preconditions
>>> +	 * (hw->rx_bulk_alloc_allowed is TRUE) then we may use bulk allocation.
>>> +	 * Otherwise use a single allocation version.
>>> +	 */
>>> +	if (dev->data->lro) {
>>> +		if (hw->rx_bulk_alloc_allowed) {
>>> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a bulk "
>>> +					   "allocation version");
>>> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro_bulk_alloc;
>>> +		} else {
>>> +			PMD_INIT_LOG(INFO, "LRO is requested. Using a single "
>>> +					   "allocation version");
>>> +			dev->rx_pkt_burst = ixgbe_recv_pkts_lro;
>>> +		}
>>> +	}
>>>    }
>>>
>>>    /*
>>> @@ -3583,10 +4000,26 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>    	uint32_t maxfrs;
>>>    	uint32_t srrctl;
>>>    	uint32_t rdrxctl;
>>> +	uint32_t rscctl;
>>> +	uint32_t psrtype;
>>> +	uint32_t rfctl;
>>>    	uint32_t rxcsum;
>>>    	uint16_t buf_size;
>>>    	uint16_t i;
>>>    	struct rte_eth_rxmode *rx_conf = &dev->data->dev_conf.rxmode;
>>> +	struct rte_eth_dev_info dev_info = { 0 };
>>> +	bool rsc_capable = false;
>>> +
>>> +	/* Sanity check */
>>> +	dev->dev_ops->dev_infos_get(dev, &dev_info);
>>> +	if (dev_info.rx_offload_capa & DEV_RX_OFFLOAD_TCP_LRO)
>>> +		rsc_capable = true;
>>> +
>>> +	if (!rsc_capable && rx_conf->enable_lro) {
>>> +		PMD_INIT_LOG(CRIT, "LRO is requested on HW that doesn't "
>>> +				   "support it");
>>> +		return -EINVAL;
>>> +	}
>>>
>>>    	PMD_INIT_FUNC_TRACE();
>>>    	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>> @@ -3606,13 +4039,44 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>    	IXGBE_WRITE_REG(hw, IXGBE_FCTRL, fctrl);
>>>
>>>    	/*
>>> +	 * RFCTL configuration
>>> +	 *
>>> +	 * Since NFS packet coalescing is not supported, clear RFCTL.NFSW_DIS
>>> +	 * and RFCTL.NFSR_DIS when RSC is enabled.
>>> +	 */
>>> +	if (rsc_capable) {
>>> +		rfctl = IXGBE_READ_REG(hw, IXGBE_RFCTL);
>>> +		if (rx_conf->enable_lro) {
>>> +			rfctl &= ~(IXGBE_RFCTL_RSC_DIS | IXGBE_RFCTL_NFSW_DIS |
>>> +				   IXGBE_RFCTL_NFSR_DIS);
>>> +		} else {
>>> +			rfctl |= IXGBE_RFCTL_RSC_DIS;
>>> +		}
>>> +
>>> +		IXGBE_WRITE_REG(hw, IXGBE_RFCTL, rfctl);
>>> +	}
>>> +
>>> +
>>> +	/*
>>>    	 * Configure CRC stripping, if any.
>>>    	 */
>>>    	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
>>>    	if (rx_conf->hw_strip_crc)
>>>    		hlreg0 |= IXGBE_HLREG0_RXCRCSTRP;
>>> -	else
>>> +	else {
>>>    		hlreg0 &= ~IXGBE_HLREG0_RXCRCSTRP;
>>> +		if (rx_conf->enable_lro) {
>>> +			/*
>>> +			 * According to chapter 4.6.7.2.1 of the Spec Rev.
>>> +			 * 3.0, RSC configuration requires HW CRC stripping to
>>> +			 * be enabled. If the user requested both HW CRC
>>> +			 * stripping off and RSC on - return an error.
>>> +			 */
>>> +			PMD_INIT_LOG(CRIT, "LRO can't be enabled when HW CRC "
>>> +					    "is disabled");
>>> +			return -EINVAL;
>>> +		}
>>> +	}
>>>
>>>    	/*
>>>    	 * Configure jumbo frame support, if any.
>>> @@ -3664,9 +4128,18 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>    		 * Configure Header Split
>>>    		 */
>>>    		if (rx_conf->header_split) {
>>> +			/*
>>> +			 * Print a warning if split_hdr_size is less
>>> +			 * than 128 bytes when RSC is requested.
>>> +			 */
>>> +			if (rx_conf->enable_lro &&
>>> +			    rx_conf->split_hdr_size < 128)
>>> +				PMD_INIT_LOG(INFO, "split_hdr_size less than "
>>> +						   "128 bytes (%d)!",
>>> +					     rx_conf->split_hdr_size);
>>> +
>>>    			if (hw->mac.type == ixgbe_mac_82599EB) {
>>>    				/* Must setup the PSRTYPE register */
>>> -				uint32_t psrtype;
>>>    				psrtype = IXGBE_PSRTYPE_TCPHDR |
>>>    					IXGBE_PSRTYPE_UDPHDR   |
>>>    					IXGBE_PSRTYPE_IPV4HDR  |
>>> @@ -3679,7 +4152,20 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>    			srrctl |= IXGBE_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS;
>>>    		} else
>>>    #endif
>>> +		{
>>>    			srrctl = IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF;
>>> +			/*
>>> +			 * Following chapter 4.6.7.2.1 of the 82599/x540
>>> +			 * Spec, if RSC is enabled the SRRCTL[n].BSIZEHEADER
>>> +			 * should be configured even if header split is not
>>> +			 * enabled. In the latter case we configure it to 128
>>> +			 * bytes, following the recommendation in the spec.
>>> +			 */
>>> +			if (rx_conf->enable_lro)
>>> +				srrctl |=
>>> +				     ((128 << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &
>>> +						    IXGBE_SRRCTL_BSIZEHDR_MASK);
>>> +		}
>>>
>>>    		/* Set if packets are dropped when no descriptors available */
>>>    		if (rxq->drop_en)
>>> @@ -3696,6 +4182,13 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>    				       RTE_PKTMBUF_HEADROOM);
>>>    		srrctl |= ((buf_size >> IXGBE_SRRCTL_BSIZEPKT_SHIFT) &
>>>    			   IXGBE_SRRCTL_BSIZEPKT_MASK);
>>> +
>>> +		/*
>>> +		 * TODO: Consider setting the Receive Descriptor Minimum
>>> +		 * Threshold Size for an RSC case. This is not an obviously
>>> +		 * beneficial option but one worth considering...
>>> +		 */
>>> +
>>>    		IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(rxq->reg_idx), srrctl);
>>>
>>>    		buf_size = (uint16_t) ((srrctl & IXGBE_SRRCTL_BSIZEPKT_MASK) <<
>>> @@ -3705,11 +4198,57 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>    		if (dev->data->dev_conf.rxmode.max_rx_pkt_len +
>>>    					    2 * IXGBE_VLAN_TAG_SIZE > buf_size)
>>>    			dev->data->scattered_rx = 1;
>>> +
>>> +		/* RSC per-queue configuration */
>>> +		if (rx_conf->enable_lro) {
>>> +			uint32_t eitr;
>>> +
>>> +			rscctl =
>>> +				IXGBE_READ_REG(hw, IXGBE_RSCCTL(rxq->reg_idx));
>>> +			psrtype =
>>> +				IXGBE_READ_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx));
>>> +			eitr = IXGBE_READ_REG(hw, IXGBE_EITR(rxq->reg_idx));
>>> +
>>> +			rscctl |= IXGBE_RSCCTL_RSCEN;
>>> +			rscctl |= get_rscctl_maxdesc(rxq->mb_pool);
>>> +			psrtype |= IXGBE_PSRTYPE_TCPHDR;
>>> +
>>> +			/*
>>> +			 * RSC: Set ITR interval corresponding to 2K ints/s.
>>> +			 *
>>> +			 * Full-sized RSC aggregations for a 10Gb/s link will
>>> +			 * arrive at about 20K aggregation/s rate.
>>> +			 *
>>> +			 * A 2K ints/s rate will cause only 10% of the
>>> +			 * aggregations to be closed due to the interrupt
>>> +			 * timer expiration in the wire-speed streaming case.
>>> +			 *
>>> +			 * For a sparse streaming case this setting will yield
>>> +			 * at most 500us latency for a single RSC aggregation.
>>> +			 */
>>> +			eitr   |= (2000 | IXGBE_EITR_CNT_WDIS);
>>> +
>>> +			IXGBE_WRITE_REG(hw, IXGBE_RSCCTL(rxq->reg_idx), rscctl);
>>> +			IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(rxq->reg_idx),
>>> +								       psrtype);
>>> +			IXGBE_WRITE_REG(hw, IXGBE_EITR(rxq->reg_idx), eitr);
>>> +
>>> +			/*
>>> +			 * RSC requires the mapping of the queue to the
>>> +			 * interrupt vector.
>>> +			 */
>>> +			ixgbe_set_ivar(dev, rxq->reg_idx, i, 0);
>>> +
>>> +			rxq->rsc_en = 1;
>>> +		}
>>>    	}
>>>
>>>    	if (rx_conf->enable_scatter)
>>>    		dev->data->scattered_rx = 1;
>>>
>>> +	if (rx_conf->enable_lro)
>>> +		dev->data->lro = 1;
>>> +
>>>    	set_rx_function(dev);
>>>
>>>    	/*
>>> @@ -3742,6 +4281,19 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
>>>    		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
>>>    	}
>>>
>>> +	/* Finalize RSC configuration  */
>>> +	if (rx_conf->enable_lro) {
>>> +		/*
>>> +		 * Follow the instructions in chapter 4.6.7.2.1 of the Spec Rev. 3.0
>>> +		 */
>>> +		rdrxctl = IXGBE_READ_REG(hw, IXGBE_RDRXCTL);
>>> +		rdrxctl |= IXGBE_RDRXCTL_RSCACKC;
>>> +		IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl);
>>> +
>>> +		PMD_INIT_LOG(INFO, "enabling LRO mode");
>>> +	}
>>> +
>>> +
>>>    	return 0;
>>>    }
>> I've just noticed that RTE_HEADER_SPLIT_ENABLE used in
>> ixgbe_dev_rx_init() is not enabled anywhere: neither in config_XXX files
>> nor anywhere else.
>> It looks like this macro is used only in ixgbe_rxtx.c. It seems like this
>> is some sort of legacy leftover, isn't it?
> Yes, I believe so.
> I presume there was an attempt a while ago to support Header Split RX functionality.
> Though, as far as I am aware, right now our ixgbe PMD doesn't support it.

I see. Good to know. In this case I won't add the dead code in the LRO
patch either.

Thanks for clarification.

>
> Konstantin
>
>> Konstantin, could you please comment?
>>
>>> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>>> index bbe5ff3..389173f 100644
>>> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>>> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
>>> @@ -79,6 +79,10 @@ struct igb_rx_entry {
>>>    	struct rte_mbuf *mbuf; /**< mbuf associated with RX descriptor. */
>>>    };
>>>
>>> +struct igb_rsc_entry {
>>> +	struct rte_mbuf *fbuf; /**< First segment of the fragmented packet. */
>>> +};
>>> +
>>>    /**
>>>     * Structure associated with each descriptor of the TX ring of a TX queue.
>>>     */
>>> @@ -105,6 +109,7 @@ struct igb_rx_queue {
>>>    	volatile uint32_t   *rdt_reg_addr; /**< RDT register address. */
>>>    	volatile uint32_t   *rdh_reg_addr; /**< RDH register address. */
>>>    	struct igb_rx_entry *sw_ring; /**< address of RX software ring. */
>>> +	struct igb_rsc_entry *sw_rsc_ring; /**< address of RSC software ring. */
>>>    	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet. */
>>>    	struct rte_mbuf *pkt_last_seg; /**< Last segment of current packet. */
>>>    	uint64_t            mbuf_initializer; /**< value to init mbufs */
>>> @@ -126,6 +131,7 @@ struct igb_rx_queue {
>>>    	uint8_t             port_id;  /**< Device port identifier. */
>>>    	uint8_t             crc_len;  /**< 0 if CRC stripped, 4 otherwise. */
>>>    	uint8_t             drop_en;  /**< If not 0, set SRRCTL.Drop_En. */
>>> +	uint8_t             rsc_en;   /**< If not 0, RSC is enabled. */
>>>    	uint8_t             rx_deferred_start; /**< not in global dev start. */
>>>    #ifdef RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC
>>>    	/** need to alloc dummy mbuf, for wraparound when scanning hw ring */

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2015-03-18 10:29 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-09 19:07 [dpdk-dev] [PATCH v6 0/3]: Add LRO support to ixgbe PMD Vlad Zolotarov
2015-03-09 19:07 ` [dpdk-dev] [PATCH v6 1/3] ixgbe: Cleanups Vlad Zolotarov
2015-03-09 20:15   ` Ananyev, Konstantin
2015-03-09 19:07 ` [dpdk-dev] [PATCH v6 2/3] ixgbe: Code refactoring Vlad Zolotarov
2015-03-09 19:07 ` [dpdk-dev] [PATCH v6 3/3] ixgbe: Add LRO support Vlad Zolotarov
2015-03-10  0:30   ` Ananyev, Konstantin
2015-03-10 13:22     ` Vlad Zolotarov
2015-03-10 20:09       ` Ananyev, Konstantin
2015-03-10 21:36         ` Vlad Zolotarov
2015-03-11 16:32           ` Ananyev, Konstantin
2015-03-11 16:54             ` Vlad Zolotarov
2015-03-13  9:07               ` Olivier MATZ
2015-03-13 11:28                 ` Ananyev, Konstantin
2015-03-13 12:12                   ` Vlad Zolotarov
2015-03-10 17:51     ` Vlad Zolotarov
2015-03-16 18:26   ` Vlad Zolotarov
2015-03-18  0:31     ` Ananyev, Konstantin
2015-03-18 10:29       ` Vlad Zolotarov
