DPDK patches and discussions
* [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2
@ 2014-09-03 15:49 Bruce Richardson
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 01/13] mbuf: replace data pointer by an offset Bruce Richardson
                   ` (13 more replies)
  0 siblings, 14 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-03 15:49 UTC (permalink / raw)
  To: dev

This patch set continues on from the changes in part 1, and depends
upon that patch set.

This patch set reorders the fields in the mbuf structure and splits
the structure across two cache lines, giving plenty of new space for
new fields to be added. This set uses some of that space by expanding
the ol_flags field and by adding a packet_type field. A part 3 patch
set is planned to introduce further new fields into the new mbuf
structure.

With the mbuf split across multiple cache lines, performance
degradations are seen inside the drivers, on both the fast path and the
slow path. For the fast path, this patch set reworks the way in which
the pool pointer is used to free packets post-TX, which removes the
perf regression. For the slow path, an alternative approach is taken -
a new scattered-packets RX function is introduced into the vector PMD.
Using this function, throughput for the slow-path RX-TX using testpmd
is increased by up to 20% over the original baseline.
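
As a rough illustration of the size budget this split targets - a
sketch only, assuming a 64-byte cache line and a C11 compiler, and not
part of the series itself:

    #include <rte_mbuf.h>

    /* fail the build if the reworked mbuf ever grows beyond the two
     * cache lines that this series budgets for it */
    _Static_assert(sizeof(struct rte_mbuf) <= 2 * 64,
            "struct rte_mbuf must fit in two 64-byte cache lines");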

Bruce Richardson (12):
  mbuf: reorder fields by time of use
  mbuf: add packet_type field
  mbuf: expand ol_flags field to 64-bits
  mbuf: introduce a flag to indicate a control mbuf
  mbuf: minor changes for readability
  mbuf: use macros only to access the mbuf metadata
  mbuf: add named points inside the mbuf structure
  ixgbe: rework vector pmd following mbuf changes
  mbuf: split mbuf across two cache lines.
  mbuf: move l2_len and l3_len to second cache line
  ixgbe: Fix perf regression due to moved pool ptr
  ixgbe: Improve slow-path perf: vector scattered RX

Olivier Matz (1):
  mbuf: replace data pointer by an offset

 app/test-pmd/csumonly.c                     |   2 +-
 app/test-pmd/flowgen.c                      |   2 +-
 app/test-pmd/icmpecho.c                     |   2 +-
 app/test-pmd/ieee1588fwd.c                  |   4 +-
 app/test-pmd/macfwd-retry.c                 |   2 +-
 app/test-pmd/macfwd.c                       |   2 +-
 app/test-pmd/macswap.c                      |   2 +-
 app/test-pmd/rxonly.c                       |   2 +-
 app/test-pmd/testpmd.c                      |   2 +-
 app/test-pmd/txonly.c                       |   7 +-
 app/test/packet_burst_generator.c           |   7 +-
 app/test/test_mbuf.c                        |   8 +-
 app/test/test_table_acl.c                   |   7 +-
 app/test/test_table_pipeline.c              |   8 +-
 examples/exception_path/main.c              |   3 +-
 examples/vhost/main.c                       |  37 +--
 examples/vhost_xen/main.c                   |  14 +-
 lib/librte_ip_frag/rte_ipv4_fragmentation.c |   6 +-
 lib/librte_ip_frag/rte_ipv6_fragmentation.c |   6 +-
 lib/librte_mbuf/rte_mbuf.c                  |  10 +-
 lib/librte_mbuf/rte_mbuf.h                  | 127 +++++----
 lib/librte_pmd_bond/rte_eth_bond_pmd.c      |   4 +-
 lib/librte_pmd_e1000/em_rxtx.c              |  12 +-
 lib/librte_pmd_e1000/igb_rxtx.c             |  13 +-
 lib/librte_pmd_i40e/i40e_rxtx.c             |  17 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c           |  56 ++--
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h           |  22 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c       | 390 +++++++++++++++++-----------
 lib/librte_pmd_pcap/rte_eth_pcap.c          |   9 +-
 lib/librte_pmd_virtio/virtio_rxtx.c         |  10 +-
 lib/librte_pmd_virtio/virtqueue.h           |   3 +-
 lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c       |   5 +-
 lib/librte_pmd_xenvirt/rte_eth_xenvirt.c    |   2 +-
 33 files changed, 466 insertions(+), 337 deletions(-)

-- 
1.9.3


* [dpdk-dev] [PATCH 01/13] mbuf: replace data pointer by an offset
  2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
@ 2014-09-03 15:49 ` Bruce Richardson
  2014-09-08  9:52   ` Olivier MATZ
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 02/13] mbuf: reorder fields by time of use Bruce Richardson
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 62+ messages in thread
From: Bruce Richardson @ 2014-09-03 15:49 UTC (permalink / raw)
  To: dev

From: Olivier Matz <olivier.matz@6wind.com>

Original patch:
 The mbuf structure already contains a pointer to the beginning of the
 buffer (m->buf_addr), so there is no need to spend another 8 bytes
 storing a second pointer to the beginning of the data.

 Using a 16-bit unsigned integer is enough, as we know that an mbuf is
 never longer than 64KB. We gain 6 bytes in the structure thanks to
 this modification.

 Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
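
For illustration: the data address is then recovered by adding the
offset to the buffer address, which is exactly what the reworked
rte_pktmbuf_mtod() macro in the diff below does. A minimal standalone
sketch (the helper name is invented here):

    #include <rte_mbuf.h>

    static inline void *
    mbuf_data_sketch(const struct rte_mbuf *m)
    {
            return (char *)m->buf_addr + m->data_off;
    }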

This version:
* Updated the original patch to apply to the latest mainline.
* Disabled the vector PMD in the config, as it relies heavily on the
  mbuf layout. It will be re-enabled in a subsequent commit once the
  vPMD has been reworked to take account of the mbuf changes.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 app/test-pmd/csumonly.c                     |  2 +-
 app/test-pmd/flowgen.c                      |  2 +-
 app/test-pmd/icmpecho.c                     |  2 +-
 app/test-pmd/ieee1588fwd.c                  |  4 +--
 app/test-pmd/macfwd-retry.c                 |  2 +-
 app/test-pmd/macfwd.c                       |  2 +-
 app/test-pmd/macswap.c                      |  2 +-
 app/test-pmd/rxonly.c                       |  2 +-
 app/test-pmd/testpmd.c                      |  2 +-
 app/test-pmd/txonly.c                       |  7 ++---
 app/test/packet_burst_generator.c           |  7 ++---
 app/test/test_mbuf.c                        |  6 ++---
 app/test/test_table_acl.c                   |  7 ++---
 app/test/test_table_pipeline.c              |  8 ++----
 config/common_linuxapp                      |  2 +-
 examples/exception_path/main.c              |  3 ++-
 examples/vhost/main.c                       | 37 +++++++++++++------------
 examples/vhost_xen/main.c                   | 14 +++++-----
 lib/librte_ip_frag/rte_ipv4_fragmentation.c |  6 ++---
 lib/librte_ip_frag/rte_ipv6_fragmentation.c |  6 ++---
 lib/librte_mbuf/rte_mbuf.c                  |  6 ++---
 lib/librte_mbuf/rte_mbuf.h                  | 42 +++++++++++++----------------
 lib/librte_pmd_bond/rte_eth_bond_pmd.c      |  4 +--
 lib/librte_pmd_e1000/em_rxtx.c              | 12 ++++-----
 lib/librte_pmd_e1000/igb_rxtx.c             | 13 +++++----
 lib/librte_pmd_i40e/i40e_rxtx.c             | 17 ++++++------
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c           | 13 ++++-----
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h           |  3 +--
 lib/librte_pmd_pcap/rte_eth_pcap.c          |  9 ++++---
 lib/librte_pmd_virtio/virtio_rxtx.c         | 10 +++----
 lib/librte_pmd_virtio/virtqueue.h           |  3 +--
 lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c       |  5 ++--
 lib/librte_pmd_xenvirt/rte_eth_xenvirt.c    |  2 +-
 33 files changed, 130 insertions(+), 132 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 28b66f5..2ce4c42 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -263,7 +263,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 		pkt_ol_flags = mb->ol_flags;
 		ol_flags = (uint16_t) (pkt_ol_flags & (~PKT_TX_L4_MASK));
 
-		eth_hdr = (struct ether_hdr *) mb->data;
+		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		eth_type = rte_be_to_cpu_16(eth_hdr->ether_type);
 		if (eth_type == ETHER_TYPE_VLAN) {
 			/* Only allow single VLAN label here */
diff --git a/app/test-pmd/flowgen.c b/app/test-pmd/flowgen.c
index b091b6d..8b4ed9a 100644
--- a/app/test-pmd/flowgen.c
+++ b/app/test-pmd/flowgen.c
@@ -175,7 +175,7 @@ pkt_burst_flow_gen(struct fwd_stream *fs)
 		pkt->next = NULL;
 
 		/* Initialize Ethernet header. */
-		eth_hdr = (struct ether_hdr *)pkt->data;
+		eth_hdr = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
 		ether_addr_copy(&cfg_ether_dst, &eth_hdr->d_addr);
 		ether_addr_copy(&cfg_ether_src, &eth_hdr->s_addr);
 		eth_hdr->ether_type = rte_cpu_to_be_16(ETHER_TYPE_IPv4);
diff --git a/app/test-pmd/icmpecho.c b/app/test-pmd/icmpecho.c
index 4a277b8..7fd4b6d 100644
--- a/app/test-pmd/icmpecho.c
+++ b/app/test-pmd/icmpecho.c
@@ -330,7 +330,7 @@ reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 	nb_replies = 0;
 	for (i = 0; i < nb_rx; i++) {
 		pkt = pkts_burst[i];
-		eth_h = (struct ether_hdr *) pkt->data;
+		eth_h = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
 		eth_type = RTE_BE_TO_CPU_16(eth_h->ether_type);
 		l2_len = sizeof(struct ether_hdr);
 		if (verbose_level > 0) {
diff --git a/app/test-pmd/ieee1588fwd.c b/app/test-pmd/ieee1588fwd.c
index ab5e06e..976aa28 100644
--- a/app/test-pmd/ieee1588fwd.c
+++ b/app/test-pmd/ieee1588fwd.c
@@ -546,7 +546,7 @@ ieee1588_packet_fwd(struct fwd_stream *fs)
 	 * Check that the received packet is a PTP packet that was detected
 	 * by the hardware.
 	 */
-	eth_hdr = (struct ether_hdr *)mb->data;
+	eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 	eth_type = rte_be_to_cpu_16(eth_hdr->ether_type);
 	if (! (mb->ol_flags & PKT_RX_IEEE1588_PTP)) {
 		if (eth_type == ETHER_TYPE_1588) {
@@ -574,7 +574,7 @@ ieee1588_packet_fwd(struct fwd_stream *fs)
 	 * Check that the received PTP packet is a PTP V2 packet of type
 	 * PTP_SYNC_MESSAGE.
 	 */
-	ptp_hdr = (struct ptpv2_msg *) ((char *) mb->data +
+	ptp_hdr = (struct ptpv2_msg *) (rte_pktmbuf_mtod(mb, char *) +
 					sizeof(struct ether_hdr));
 	if (ptp_hdr->version != 0x02) {
 		printf("Port %u Received PTP V2 Ethernet frame with wrong PTP"
diff --git a/app/test-pmd/macfwd-retry.c b/app/test-pmd/macfwd-retry.c
index 5122983..83da26f 100644
--- a/app/test-pmd/macfwd-retry.c
+++ b/app/test-pmd/macfwd-retry.c
@@ -119,7 +119,7 @@ pkt_burst_mac_retry_forward(struct fwd_stream *fs)
 	fs->rx_packets += nb_rx;
 	for (i = 0; i < nb_rx; i++) {
 		mb = pkts_burst[i];
-		eth_hdr = (struct ether_hdr *) mb->data;
+		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		ether_addr_copy(&peer_eth_addrs[fs->peer_addr],
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 4b905cd..38bae23 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -110,7 +110,7 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 	txp = &ports[fs->tx_port];
 	for (i = 0; i < nb_rx; i++) {
 		mb = pkts_burst[i];
-		eth_hdr = (struct ether_hdr *) mb->data;
+		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		ether_addr_copy(&peer_eth_addrs[fs->peer_addr],
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index c5b3a0c..1786095 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -110,7 +110,7 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 	txp = &ports[fs->tx_port];
 	for (i = 0; i < nb_rx; i++) {
 		mb = pkts_burst[i];
-		eth_hdr = (struct ether_hdr *) mb->data;
+		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 
 		/* Swap dest and src mac addresses. */
 		ether_addr_copy(&eth_hdr->d_addr, &addr);
diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index 5bc74da..7ba36a1 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -149,7 +149,7 @@ pkt_burst_receive(struct fwd_stream *fs)
 			rte_pktmbuf_free(mb);
 			continue;
 		}
-		eth_hdr = (struct ether_hdr *) mb->data;
+		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		ol_flags = mb->ol_flags;
 		print_ether_addr("  src=", &eth_hdr->s_addr);
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index b6596f8..32b705e 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -404,7 +404,7 @@ testpmd_mbuf_ctor(struct rte_mempool *mp,
 			mb_ctor_arg->seg_buf_offset);
 	mb->buf_len      = mb_ctor_arg->seg_buf_size;
 	mb->ol_flags     = 0;
-	mb->data         = (char *) mb->buf_addr + RTE_PKTMBUF_HEADROOM;
+	mb->data_off     = RTE_PKTMBUF_HEADROOM;
 	mb->nb_segs      = 1;
 	mb->l2_len       = 0;
 	mb->l3_len       = 0;
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index 8135264..2b3f0b9 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -111,13 +111,13 @@ copy_buf_to_pkt_segs(void* buf, unsigned len, struct rte_mbuf *pkt,
 		seg = seg->next;
 	}
 	copy_len = seg->data_len - offset;
-	seg_buf = ((char *) seg->data + offset);
+	seg_buf = (rte_pktmbuf_mtod(seg, char *) + offset);
 	while (len > copy_len) {
 		rte_memcpy(seg_buf, buf, (size_t) copy_len);
 		len -= copy_len;
 		buf = ((char*) buf + copy_len);
 		seg = seg->next;
-		seg_buf = seg->data;
+		seg_buf = rte_pktmbuf_mtod(seg, char *);
 	}
 	rte_memcpy(seg_buf, buf, (size_t) len);
 }
@@ -126,7 +126,8 @@ static inline void
 copy_buf_to_pkt(void* buf, unsigned len, struct rte_mbuf *pkt, unsigned offset)
 {
 	if (offset + len <= pkt->data_len) {
-		rte_memcpy(((char *) pkt->data + offset), buf, (size_t) len);
+		rte_memcpy((rte_pktmbuf_mtod(pkt, char *) + offset),
+			buf, (size_t) len);
 		return;
 	}
 	copy_buf_to_pkt_segs(buf, len, pkt, offset);
diff --git a/app/test/packet_burst_generator.c b/app/test/packet_burst_generator.c
index db7f023..9e747a4 100644
--- a/app/test/packet_burst_generator.c
+++ b/app/test/packet_burst_generator.c
@@ -59,13 +59,13 @@ copy_buf_to_pkt_segs(void *buf, unsigned len, struct rte_mbuf *pkt,
 		seg = seg->next;
 	}
 	copy_len = seg->data_len - offset;
-	seg_buf = ((char *) seg->data + offset);
+	seg_buf = rte_pktmbuf_mtod(seg, char *) + offset;
 	while (len > copy_len) {
 		rte_memcpy(seg_buf, buf, (size_t) copy_len);
 		len -= copy_len;
 		buf = ((char *) buf + copy_len);
 		seg = seg->next;
-		seg_buf = seg->data;
+		seg_buf = rte_pktmbuf_mtod(seg, void *);
 	}
 	rte_memcpy(seg_buf, buf, (size_t) len);
 }
@@ -74,7 +74,8 @@ static inline void
 copy_buf_to_pkt(void *buf, unsigned len, struct rte_mbuf *pkt, unsigned offset)
 {
 	if (offset + len <= pkt->data_len) {
-		rte_memcpy(((char *) pkt->data + offset), buf, (size_t) len);
+		rte_memcpy(rte_pktmbuf_mtod(pkt, char *) + offset,
+				buf, (size_t) len);
 		return;
 	}
 	copy_buf_to_pkt_segs(buf, len, pkt, offset);
diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index b81e622..e3d896b 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -432,7 +432,7 @@ test_pktmbuf_pool_ptr(void)
 			printf("rte_pktmbuf_alloc() failed (%u)\n", i);
 			ret = -1;
 		}
-		m[i]->data = RTE_PTR_ADD(m[i]->data, 64);
+		m[i]->data_off += 64;
 	}
 
 	/* free them */
@@ -451,8 +451,8 @@ test_pktmbuf_pool_ptr(void)
 			printf("rte_pktmbuf_alloc() failed (%u)\n", i);
 			ret = -1;
 		}
-		if (m[i]->data != RTE_PTR_ADD(m[i]->buf_addr, RTE_PKTMBUF_HEADROOM)) {
-			printf ("data pointer not set properly\n");
+		if (m[i]->data_off != RTE_PKTMBUF_HEADROOM) {
+			printf ("invalid data_off\n");
 			ret = -1;
 		}
 	}
diff --git a/app/test/test_table_acl.c b/app/test/test_table_acl.c
index 4db680a..0f2b57e 100644
--- a/app/test/test_table_acl.c
+++ b/app/test/test_table_acl.c
@@ -513,7 +513,7 @@ test_pipeline_single_filter(int expected_count)
 			struct rte_mbuf *mbuf;
 
 			mbuf = rte_pktmbuf_alloc(pool);
-			memset(mbuf->data, 0x00,
+			memset(rte_pktmbuf_mtod(mbuf, char *), 0x00,
 				sizeof(struct ipv4_5tuple));
 
 			five_tuple.proto = j;
@@ -522,7 +522,7 @@ test_pipeline_single_filter(int expected_count)
 			five_tuple.port_src = rte_bswap16(100 + j);
 			five_tuple.port_dst = rte_bswap16(200 + j);
 
-			memcpy(mbuf->data, &five_tuple,
+			memcpy(rte_pktmbuf_mtod(mbuf, char *), &five_tuple,
 				sizeof(struct ipv4_5tuple));
 			RTE_LOG(INFO, PIPELINE, "%s: Enqueue onto ring %d\n",
 				__func__, i);
@@ -549,7 +549,8 @@ test_pipeline_single_filter(int expected_count)
 			printf("Got %d object(s) from ring %d!\n", ret, i);
 			for (j = 0; j < ret; j++) {
 				mbuf = (struct rte_mbuf *)objs[j];
-				rte_hexdump(stdout, "mbuf", mbuf->data, 64);
+				rte_hexdump(stdout, "mbuf",
+					rte_pktmbuf_mtod(mbuf, char *), 64);
 				rte_pktmbuf_free(mbuf);
 			}
 			tx_count += ret;
diff --git a/app/test/test_table_pipeline.c b/app/test/test_table_pipeline.c
index 15a038b..a0a9e04 100644
--- a/app/test/test_table_pipeline.c
+++ b/app/test/test_table_pipeline.c
@@ -39,11 +39,6 @@
 #include "test_table.h"
 #include "test_table_pipeline.h"
 
-#define RTE_CBUF_UINT8_PTR(cbuf, offset)			\
-	(&cbuf->data[offset])
-#define RTE_CBUF_UINT32_PTR(cbuf, offset)			\
-	(&cbuf->data32[offset/sizeof(uint32_t)])
-
 #if 0
 
 static rte_pipeline_port_out_action_handler port_action_0x00
@@ -498,7 +493,8 @@ test_pipeline_single_filter(int test_type, int expected_count)
 			printf("Got %d object(s) from ring %d!\n", ret, i);
 			for (j = 0; j < ret; j++) {
 				mbuf = (struct rte_mbuf *)objs[j];
-				rte_hexdump(stdout, "Object:", mbuf->data,
+				rte_hexdump(stdout, "Object:",
+					rte_pktmbuf_mtod(mbuf, char *),
 					mbuf->data_len);
 				rte_pktmbuf_free(mbuf);
 			}
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 5bee910..b140af7 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -191,7 +191,7 @@ CONFIG_RTE_LIBRTE_IXGBE_DEBUG_DRIVER=n
 CONFIG_RTE_LIBRTE_IXGBE_PF_DISABLE_STRIP_CRC=n
 CONFIG_RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC=y
 CONFIG_RTE_LIBRTE_IXGBE_ALLOW_UNSUPPORTED_SFP=n
-CONFIG_RTE_IXGBE_INC_VECTOR=y
+CONFIG_RTE_IXGBE_INC_VECTOR=n
 CONFIG_RTE_IXGBE_RX_OLFLAGS_ENABLE=y
 
 #
diff --git a/examples/exception_path/main.c b/examples/exception_path/main.c
index 5045ef8..f286bf2 100644
--- a/examples/exception_path/main.c
+++ b/examples/exception_path/main.c
@@ -302,7 +302,8 @@ main_loop(__attribute__((unused)) void *arg)
 			if (m == NULL)
 				continue;
 
-			ret = read(tap_fd, m->data, MAX_PACKET_SZ);
+			ret = read(tap_fd, rte_pktmbuf_mtod(m, void *),
+				MAX_PACKET_SZ);
 			lcore_stats[lcore_id].rx++;
 			if (unlikely(ret < 0)) {
 				FATAL_ERROR("Reading from %s interface failed",
diff --git a/examples/vhost/main.c b/examples/vhost/main.c
index 4e1c103..85ee8b8 100644
--- a/examples/vhost/main.c
+++ b/examples/vhost/main.c
@@ -1033,7 +1033,7 @@ virtio_dev_rx(struct virtio_net *dev, struct rte_mbuf **pkts, uint32_t count)
 
 		/* Copy mbuf data to buffer */
 		rte_memcpy((void *)(uintptr_t)buff_addr,
-			(const void *)buff->data,
+			rte_pktmbuf_mtod(buff, const void *),
 			rte_pktmbuf_data_len(buff));
 		PRINT_PACKET(dev, (uintptr_t)buff_addr,
 			rte_pktmbuf_data_len(buff), 0);
@@ -1438,7 +1438,7 @@ link_vmdq(struct virtio_net *dev, struct rte_mbuf *m)
 	int i, ret;
 
 	/* Learn MAC address of guest device from packet */
-	pkt_hdr = (struct ether_hdr *)m->data;
+	pkt_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
 
 	dev_ll = ll_root_used;
 
@@ -1525,7 +1525,7 @@ virtio_tx_local(struct virtio_net *dev, struct rte_mbuf *m)
 	struct ether_hdr *pkt_hdr;
 	uint64_t ret = 0;
 
-	pkt_hdr = (struct ether_hdr *)m->data;
+	pkt_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
 
 	/*get the used devices list*/
 	dev_ll = ll_root_used;
@@ -1593,7 +1593,7 @@ virtio_tx_route(struct virtio_net* dev, struct rte_mbuf *m, struct rte_mempool *
 	unsigned len, ret, offset = 0;
 	const uint16_t lcore_id = rte_lcore_id();
 	struct virtio_net_data_ll *dev_ll = ll_root_used;
-	struct ether_hdr *pkt_hdr = (struct ether_hdr *)m->data;
+	struct ether_hdr *pkt_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
 
 	/*check if destination is local VM*/
 	if ((vm2vm_mode == VM2VM_SOFTWARE) && (virtio_tx_local(dev, m) == 0))
@@ -1652,18 +1652,21 @@ virtio_tx_route(struct virtio_net* dev, struct rte_mbuf *m, struct rte_mempool *
 	mbuf->nb_segs = m->nb_segs;
 
 	/* Copy ethernet header to mbuf. */
-	rte_memcpy((void*)mbuf->data, (const void*)m->data, ETH_HLEN);
+	rte_memcpy(rte_pktmbuf_mtod(mbuf, void *),
+		rte_pktmbuf_mtod(m, const void *),
+		ETH_HLEN);
 
 
 	/* Setup vlan header. Bytes need to be re-ordered for network with htons()*/
-	vlan_hdr = (struct vlan_ethhdr *) mbuf->data;
+	vlan_hdr = rte_pktmbuf_mtod(mbuf, struct vlan_ethhdr *);
 	vlan_hdr->h_vlan_encapsulated_proto = vlan_hdr->h_vlan_proto;
 	vlan_hdr->h_vlan_proto = htons(ETH_P_8021Q);
 	vlan_hdr->h_vlan_TCI = htons(vlan_tag);
 
 	/* Copy the remaining packet contents to the mbuf. */
-	rte_memcpy((void*) ((uint8_t*)mbuf->data + VLAN_ETH_HLEN),
-		(const void*) ((uint8_t*)m->data + ETH_HLEN), (m->data_len - ETH_HLEN));
+	rte_memcpy((void *)(rte_pktmbuf_mtod(mbuf, uint8_t *) + VLAN_ETH_HLEN),
+		(const void *)(rte_pktmbuf_mtod(m, uint8_t *) + ETH_HLEN),
+		(m->data_len - ETH_HLEN));
 
 	/* Copy the remaining segments for the whole packet. */
 	prev = mbuf;
@@ -1778,7 +1781,7 @@ virtio_dev_tx(struct virtio_net* dev, struct rte_mempool *mbuf_pool)
 		/* Setup dummy mbuf. This is copied to a real mbuf if transmitted out the physical port. */
 		m.data_len = desc->len;
 		m.pkt_len = desc->len;
-		m.data = (void*)(uintptr_t)buff_addr;
+		m.data_off = 0;
 
 		PRINT_PACKET(dev, (uintptr_t)buff_addr, desc->len, 0);
 
@@ -2333,7 +2336,7 @@ attach_rxmbuf_zcp(struct virtio_net *dev)
 	}
 
 	mbuf->buf_addr = (void *)(uintptr_t)(buff_addr - RTE_PKTMBUF_HEADROOM);
-	mbuf->data = (void *)(uintptr_t)(buff_addr);
+	mbuf->data_off = RTE_PKTMBUF_HEADROOM;
 	mbuf->buf_physaddr = phys_addr - RTE_PKTMBUF_HEADROOM;
 	mbuf->data_len = desc->len;
 	MBUF_HEADROOM_UINT32(mbuf) = (uint32_t)desc_idx;
@@ -2370,7 +2373,7 @@ static inline void pktmbuf_detach_zcp(struct rte_mbuf *m)
 
 	buf_ofs = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
 			RTE_PKTMBUF_HEADROOM : m->buf_len;
-	m->data = (char *) m->buf_addr + buf_ofs;
+	m->data_off = buf_ofs;
 
 	m->data_len = 0;
 }
@@ -2604,7 +2607,7 @@ virtio_tx_route_zcp(struct virtio_net *dev, struct rte_mbuf *m,
 	unsigned len, ret, offset = 0;
 	struct vpool *vpool;
 	struct virtio_net_data_ll *dev_ll = ll_root_used;
-	struct ether_hdr *pkt_hdr = (struct ether_hdr *)m->data;
+	struct ether_hdr *pkt_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
 	uint16_t vlan_tag = (uint16_t)vlan_tags[(uint16_t)dev->device_fh];
 
 	/*Add packet to the port tx queue*/
@@ -2681,11 +2684,11 @@ virtio_tx_route_zcp(struct virtio_net *dev, struct rte_mbuf *m,
 	mbuf->pkt_len = mbuf->data_len;
 	if (unlikely(need_copy)) {
 		/* Copy the packet contents to the mbuf. */
-		rte_memcpy((void *)((uint8_t *)mbuf->data),
-			(const void *) ((uint8_t *)m->data),
+		rte_memcpy(rte_pktmbuf_mtod(mbuf, void *),
+			rte_pktmbuf_mtod(m, void *),
 			m->data_len);
 	} else {
-		mbuf->data = m->data;
+		mbuf->data_off = m->data_off;
 		mbuf->buf_physaddr = m->buf_physaddr;
 		mbuf->buf_addr = m->buf_addr;
 	}
@@ -2819,8 +2822,8 @@ virtio_dev_tx_zcp(struct virtio_net *dev)
 		m.data_len = desc->len;
 		m.nb_segs = 1;
 		m.next = NULL;
-		m.data = (void *)(uintptr_t)buff_addr;
-		m.buf_addr = m.data;
+		m.data_off = 0;
+		m.buf_addr = (void *)(uintptr_t)buff_addr;
 		m.buf_physaddr = phys_addr;
 
 		/*
diff --git a/examples/vhost_xen/main.c b/examples/vhost_xen/main.c
index 8162cd8..56ffec8 100644
--- a/examples/vhost_xen/main.c
+++ b/examples/vhost_xen/main.c
@@ -808,7 +808,7 @@ virtio_tx_local(struct virtio_net *dev, struct rte_mbuf *m)
 	struct ether_hdr *pkt_hdr;
 	uint64_t ret = 0;
 
-	pkt_hdr = (struct ether_hdr *)m->data;
+	pkt_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
 
 	/*get the used devices list*/
 	dev_ll = ll_root_used;
@@ -883,18 +883,20 @@ virtio_tx_route(struct virtio_net* dev, struct rte_mbuf *m, struct rte_mempool *
 	mbuf->pkt_len = mbuf->data_len;
 
 	/* Copy ethernet header to mbuf. */
-	rte_memcpy((void*)mbuf->data, (const void*)m->data, ETH_HLEN);
+	rte_memcpy(rte_pktmbuf_mtod(mbuf, void*),
+			rte_pktmbuf_mtod(m, const void*), ETH_HLEN);
 
 
 	/* Setup vlan header. Bytes need to be re-ordered for network with htons()*/
-	vlan_hdr = (struct vlan_ethhdr *) mbuf->data;
+	vlan_hdr = rte_pktmbuf_mtod(mbuf, struct vlan_ethhdr *);
 	vlan_hdr->h_vlan_encapsulated_proto = vlan_hdr->h_vlan_proto;
 	vlan_hdr->h_vlan_proto = htons(ETH_P_8021Q);
 	vlan_hdr->h_vlan_TCI = htons(vlan_tag);
 
 	/* Copy the remaining packet contents to the mbuf. */
-	rte_memcpy((void*) ((uint8_t*)mbuf->data + VLAN_ETH_HLEN),
-		(const void*) ((uint8_t*)m->data + ETH_HLEN), (m->data_len - ETH_HLEN));
+	rte_memcpy((void *)(rte_pktmbuf_mtod(mbuf, uint8_t *) + VLAN_ETH_HLEN),
+		(const void *)(rte_pktmbuf_mtod(m, uint8_t *) + ETH_HLEN),
+		(m->data_len - ETH_HLEN));
 	tx_q->m_table[len] = mbuf;
 	len++;
 	if (enable_stats) {
@@ -981,7 +983,7 @@ virtio_dev_tx(struct virtio_net* dev, struct rte_mempool *mbuf_pool)
 
 		/* Setup dummy mbuf. This is copied to a real mbuf if transmitted out the physical port. */
 		m.data_len = desc->len;
-		m.data = (void*)(uintptr_t)buff_addr;
+		m.data_off = 0;
 		m.nb_segs = 1;
 
 		virtio_tx_route(dev, &m, mbuf_pool, 0);
diff --git a/lib/librte_ip_frag/rte_ipv4_fragmentation.c b/lib/librte_ip_frag/rte_ipv4_fragmentation.c
index 6b9f07d..a4ed923 100644
--- a/lib/librte_ip_frag/rte_ipv4_fragmentation.c
+++ b/lib/librte_ip_frag/rte_ipv4_fragmentation.c
@@ -109,7 +109,7 @@ rte_ipv4_fragment_packet(struct rte_mbuf *pkt_in,
 	/* Fragment size should be a multiply of 8. */
 	IP_FRAG_ASSERT((frag_size & IPV4_HDR_FO_MASK) == 0);
 
-	in_hdr = (struct ipv4_hdr *) pkt_in->data;
+	in_hdr = rte_pktmbuf_mtod(pkt_in, struct ipv4_hdr *);
 	flag_offset = rte_cpu_to_be_16(in_hdr->fragment_offset);
 
 	/* If Don't Fragment flag is set */
@@ -165,7 +165,7 @@ rte_ipv4_fragment_packet(struct rte_mbuf *pkt_in,
 			if (len > (in_seg->data_len - in_seg_data_pos)) {
 				len = in_seg->data_len - in_seg_data_pos;
 			}
-			out_seg->data = (char*) in_seg->data + (uint16_t)in_seg_data_pos;
+			out_seg->data_off = in_seg->data_off + in_seg_data_pos;
 			out_seg->data_len = (uint16_t)len;
 			out_pkt->pkt_len = (uint16_t)(len +
 			    out_pkt->pkt_len);
@@ -188,7 +188,7 @@ rte_ipv4_fragment_packet(struct rte_mbuf *pkt_in,
 
 		/* Build the IP header */
 
-		out_hdr = (struct ipv4_hdr*) out_pkt->data;
+		out_hdr = rte_pktmbuf_mtod(out_pkt, struct ipv4_hdr *);
 
 		__fill_ipv4hdr_frag(out_hdr, in_hdr,
 		    (uint16_t)out_pkt->pkt_len,
diff --git a/lib/librte_ip_frag/rte_ipv6_fragmentation.c b/lib/librte_ip_frag/rte_ipv6_fragmentation.c
index e007662..4ffcc7c 100644
--- a/lib/librte_ip_frag/rte_ipv6_fragmentation.c
+++ b/lib/librte_ip_frag/rte_ipv6_fragmentation.c
@@ -125,7 +125,7 @@ rte_ipv6_fragment_packet(struct rte_mbuf *pkt_in,
 	    (uint16_t)(pkt_in->pkt_len - sizeof (struct ipv6_hdr))))
 		return (-EINVAL);
 
-	in_hdr = (struct ipv6_hdr *) pkt_in->data;
+	in_hdr = rte_pktmbuf_mtod(pkt_in, struct ipv6_hdr *);
 
 	in_seg = pkt_in;
 	in_seg_data_pos = sizeof(struct ipv6_hdr);
@@ -171,7 +171,7 @@ rte_ipv6_fragment_packet(struct rte_mbuf *pkt_in,
 			if (len > (in_seg->data_len - in_seg_data_pos)) {
 				len = in_seg->data_len - in_seg_data_pos;
 			}
-			out_seg->data = (char *) in_seg->data + (uint16_t) in_seg_data_pos;
+			out_seg->data_off = in_seg->data_off + in_seg_data_pos;
 			out_seg->data_len = (uint16_t)len;
 			out_pkt->pkt_len = (uint16_t)(len +
 			    out_pkt->pkt_len);
@@ -196,7 +196,7 @@ rte_ipv6_fragment_packet(struct rte_mbuf *pkt_in,
 
 		/* Build the IP header */
 
-		out_hdr = (struct ipv6_hdr *) out_pkt->data;
+		out_hdr = rte_pktmbuf_mtod(out_pkt, struct ipv6_hdr *);
 
 		__fill_ipv6hdr_frag(out_hdr, in_hdr,
 		    (uint16_t) out_pkt->pkt_len - sizeof(struct ipv6_hdr),
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index c1b2176..26e36eb 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -117,7 +117,7 @@ rte_pktmbuf_init(struct rte_mempool *mp,
 	m->buf_len = (uint16_t)buf_len;
 
 	/* keep some headroom between start of buffer and data */
-	m->data = (char*) m->buf_addr + RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
+	m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
 
 	/* init some constant fields */
 	m->pool = mp;
@@ -183,12 +183,12 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 		__rte_mbuf_sanity_check(m, 0);
 
 		fprintf(f, "  segment at 0x%p, data=0x%p, data_len=%u\n",
-		       m, m->data, (unsigned)m->data_len);
+			m, rte_pktmbuf_mtod(m, void *), (unsigned)m->data_len);
 		len = dump_len;
 		if (len > m->data_len)
 			len = m->data_len;
 		if (len != 0)
-			rte_hexdump(f, NULL, m->data, len);
+			rte_hexdump(f, NULL, rte_pktmbuf_mtod(m, void *), len);
 		dump_len -= len;
 		m = m->next;
 		nb_segs --;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 32e8474..669e7f5 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -119,6 +119,13 @@ struct rte_mbuf {
 	void *buf_addr;           /**< Virtual address of segment buffer. */
 	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
 	uint16_t buf_len;         /**< Length of segment buffer. */
+
+	/* valid for any segment */
+	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
+	uint16_t data_off;
+	uint16_t data_len;        /**< Amount of data in segment buffer. */
+	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
+
 #ifdef RTE_MBUF_REFCNT
 	/**
 	 * 16-bit Reference counter.
@@ -138,15 +145,9 @@ struct rte_mbuf {
 	uint16_t reserved;            /**< Unused field. Required for padding */
 	uint16_t ol_flags;            /**< Offload features. */
 
-	/* valid for any segment */
-	struct rte_mbuf *next;  /**< Next segment of scattered packet. */
-	void* data;             /**< Start address of data in segment buffer. */
-	uint16_t data_len;      /**< Amount of data in segment buffer. */
-
 	/* these fields are valid for first segment only */
 	uint8_t nb_segs;        /**< Number of segments. */
 	uint8_t port;           /**< Input port. */
-	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
 
 	/* offload features, valid for first segment only */
 	uint16_t l3_len:9;      /**< L3 (IP) Header Length. */
@@ -166,7 +167,7 @@ struct rte_mbuf {
 		uint16_t metadata16[0];
 		uint32_t metadata32[0];
 		uint64_t metadata64[0];
-	};
+	} __rte_cache_aligned;
 } __rte_cache_aligned;
 
 #define RTE_MBUF_METADATA_UINT8(mbuf, offset)              \
@@ -452,7 +453,7 @@ void rte_ctrlmbuf_init(struct rte_mempool *mp, void *opaque_arg,
  * @param m
  *   The control mbuf.
  */
-#define rte_ctrlmbuf_data(m) ((m)->data)
+#define rte_ctrlmbuf_data(m) ((char *)((m)->buf_addr) + (m)->data_off)
 
 /**
  * A macro that returns the length of the carried data.
@@ -517,8 +518,6 @@ void rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg);
  */
 static inline void rte_pktmbuf_reset(struct rte_mbuf *m)
 {
-	uint32_t buf_ofs;
-
 	m->next = NULL;
 	m->pkt_len = 0;
 	m->l2_len = 0;
@@ -528,9 +527,8 @@ static inline void rte_pktmbuf_reset(struct rte_mbuf *m)
 	m->port = 0xff;
 
 	m->ol_flags = 0;
-	buf_ofs = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
+	m->data_off = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
 			RTE_PKTMBUF_HEADROOM : m->buf_len;
-	m->data = (char*) m->buf_addr + buf_ofs;
 
 	m->data_len = 0;
 	__rte_mbuf_sanity_check(m, 1);
@@ -587,7 +585,7 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *md)
 	mi->buf_len = md->buf_len;
 
 	mi->next = md->next;
-	mi->data = md->data;
+	mi->data_off = md->data_off;
 	mi->data_len = md->data_len;
 	mi->port = md->port;
 	mi->vlan_tci = md->vlan_tci;
@@ -618,16 +616,14 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 {
 	const struct rte_mempool *mp = m->pool;
 	void *buf = RTE_MBUF_TO_BADDR(m);
-	uint32_t buf_ofs;
 	uint32_t buf_len = mp->elt_size - sizeof(*m);
 	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + sizeof (*m);
 
 	m->buf_addr = buf;
 	m->buf_len = (uint16_t)buf_len;
 
-	buf_ofs = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
+	m->data_off = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
 			RTE_PKTMBUF_HEADROOM : m->buf_len;
-	m->data = (char*) m->buf_addr + buf_ofs;
 
 	m->data_len = 0;
 }
@@ -791,7 +787,7 @@ static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 static inline uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
-	return (uint16_t) ((char*) m->data - (char*) m->buf_addr);
+	return m->data_off;
 }
 
 /**
@@ -839,7 +835,7 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
  * @param t
  *   The type to cast the result into.
  */
-#define rte_pktmbuf_mtod(m, t) ((t)((m)->data))
+#define rte_pktmbuf_mtod(m, t) ((t)((char *)(m)->buf_addr + (m)->data_off))
 
 /**
  * A macro that returns the length of the packet.
@@ -884,11 +880,11 @@ static inline char *rte_pktmbuf_prepend(struct rte_mbuf *m,
 	if (unlikely(len > rte_pktmbuf_headroom(m)))
 		return NULL;
 
-	m->data = (char*) m->data - len;
+	m->data_off -= len;
 	m->data_len = (uint16_t)(m->data_len + len);
 	m->pkt_len  = (m->pkt_len + len);
 
-	return (char*) m->data;
+	return (char*) m->buf_addr + m->data_off;
 }
 
 /**
@@ -917,7 +913,7 @@ static inline char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
 	if (unlikely(len > rte_pktmbuf_tailroom(m_last)))
 		return NULL;
 
-	tail = (char*) m_last->data + m_last->data_len;
+	tail = (char*) m_last->buf_addr + m_last->data_off + m_last->data_len;
 	m_last->data_len = (uint16_t)(m_last->data_len + len);
 	m->pkt_len  = (m->pkt_len + len);
 	return (char*) tail;
@@ -945,9 +941,9 @@ static inline char *rte_pktmbuf_adj(struct rte_mbuf *m, uint16_t len)
 		return NULL;
 
 	m->data_len = (uint16_t)(m->data_len - len);
-	m->data = ((char*) m->data + len);
+	m->data_off += len;
 	m->pkt_len  = (m->pkt_len - len);
-	return (char*) m->data;
+	return (char *)m->buf_addr + m->data_off;
 }
 
 /**
diff --git a/lib/librte_pmd_bond/rte_eth_bond_pmd.c b/lib/librte_pmd_bond/rte_eth_bond_pmd.c
index 5979ce5..506a448 100644
--- a/lib/librte_pmd_bond/rte_eth_bond_pmd.c
+++ b/lib/librte_pmd_bond/rte_eth_bond_pmd.c
@@ -198,14 +198,14 @@ xmit_slave_hash(const struct rte_mbuf *buf, uint8_t slave_count, uint8_t policy)
 
 	switch (policy) {
 	case BALANCE_XMIT_POLICY_LAYER2:
-		eth_hdr = (struct ether_hdr *)buf->data;
+		eth_hdr = rte_pktmbuf_mtod(buf, struct ether_hdr *);
 
 		hash = ether_hash(eth_hdr);
 		hash ^= hash >> 8;
 		return hash % slave_count;
 
 	case BALANCE_XMIT_POLICY_LAYER23:
-		eth_hdr = (struct ether_hdr *)buf->data;
+		eth_hdr = rte_pktmbuf_mtod(buf, struct ether_hdr *);
 
 		if (buf->ol_flags & PKT_RX_VLAN_PKT)
 			eth_offset = sizeof(struct ether_hdr) + sizeof(struct vlan_hdr);
diff --git a/lib/librte_pmd_e1000/em_rxtx.c b/lib/librte_pmd_e1000/em_rxtx.c
index 4f46bdf..67736d6 100644
--- a/lib/librte_pmd_e1000/em_rxtx.c
+++ b/lib/librte_pmd_e1000/em_rxtx.c
@@ -90,8 +90,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 }
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb)             \
-	(uint64_t) ((mb)->buf_physaddr +       \
-	(uint64_t) ((char *)((mb)->data) - (char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
@@ -793,8 +792,8 @@ eth_em_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		pkt_len = (uint16_t) (rte_le_to_cpu_16(rxd.length) -
 				rxq->crc_len);
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rte_packet_prefetch(rxm->data);
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
+		rte_packet_prefetch((char *)rxm->buf_addr + rxm->data_off);
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		rxm->pkt_len = pkt_len;
@@ -963,7 +962,7 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		data_len = rte_le_to_cpu_16(rxd.length);
 		rxm->data_len = data_len;
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 		/*
 		 * If this is the first buffer of the received packet,
@@ -1035,7 +1034,8 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rxm->vlan_tci = rte_le_to_cpu_16(rxd.special);
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_packet_prefetch(first_seg->data);
+		rte_packet_prefetch((char *)first_seg->buf_addr +
+			first_seg->data_off);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
index 4e9f42e..e364dda 100644
--- a/lib/librte_pmd_e1000/igb_rxtx.c
+++ b/lib/librte_pmd_e1000/igb_rxtx.c
@@ -95,9 +95,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 }
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr +		   \
-			(uint64_t) ((char *)((mb)->data) -     \
-				(char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
@@ -776,8 +774,8 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		pkt_len = (uint16_t) (rte_le_to_cpu_16(rxd.wb.upper.length) -
 				      rxq->crc_len);
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rte_packet_prefetch(rxm->data);
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
+		rte_packet_prefetch((char *)rxm->buf_addr + rxm->data_off);
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		rxm->pkt_len = pkt_len;
@@ -952,7 +950,7 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
 		rxm->data_len = data_len;
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 		/*
 		 * If this is the first buffer of the received packet,
@@ -1033,7 +1031,8 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		first_seg->ol_flags = pkt_flags;
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_packet_prefetch(first_seg->data);
+		rte_packet_prefetch((char *)first_seg->buf_addr +
+			first_seg->data_off);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
diff --git a/lib/librte_pmd_i40e/i40e_rxtx.c b/lib/librte_pmd_i40e/i40e_rxtx.c
index e41e8d0..25a5f6f 100644
--- a/lib/librte_pmd_i40e/i40e_rxtx.c
+++ b/lib/librte_pmd_i40e/i40e_rxtx.c
@@ -78,9 +78,7 @@
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	((uint64_t)((mb)->buf_physaddr + \
-	(uint64_t)((char *)((mb)->data) - \
-	(char *)(mb)->buf_addr)))
+	((uint64_t)((mb)->buf_physaddr + (mb)->data_off))
 
 static const struct rte_memzone *
 i40e_ring_dma_zone_reserve(struct rte_eth_dev *dev,
@@ -685,7 +683,7 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
 		mb = rxep[i].mbuf;
 		rte_mbuf_refcnt_set(mb, 1);
 		mb->next = NULL;
-		mb->data = (char *)mb->buf_addr + RTE_PKTMBUF_HEADROOM;
+		mb->data_off = RTE_PKTMBUF_HEADROOM;
 		mb->nb_segs = 1;
 		mb->port = rxq->port_id;
 		dma_addr = rte_cpu_to_le_64(\
@@ -842,8 +840,8 @@ i40e_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 		rx_packet_len = ((qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
 				I40E_RXD_QW1_LENGTH_PBUF_SHIFT) - rxq->crc_len;
 
-		rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rte_prefetch0(rxm->data);
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
+		rte_prefetch0(RTE_PTR_ADD(rxm->buf_addr, RTE_PKTMBUF_HEADROOM));
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		rxm->pkt_len = rx_packet_len;
@@ -946,7 +944,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 		rx_packet_len = (qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
 					I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
 		rxm->data_len = rx_packet_len;
-		rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 		/**
 		 * If this is the first buffer of the received packet, set the
@@ -1015,7 +1013,8 @@ i40e_recv_scattered_pkts(void *rx_queue,
 				rte_le_to_cpu_32(rxd.wb.qword0.hi_dword.rss);
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_prefetch0(first_seg->data);
+		rte_prefetch0(RTE_PTR_ADD(first_seg->buf_addr,
+			first_seg->data_off));
 		rx_pkts[nb_rx++] = first_seg;
 		first_seg = NULL;
 	}
@@ -2131,7 +2130,7 @@ i40e_alloc_rx_queue_mbufs(struct i40e_rx_queue *rxq)
 
 		rte_mbuf_refcnt_set(mbuf, 1);
 		mbuf->next = NULL;
-		mbuf->data = (char *)mbuf->buf_addr + RTE_PKTMBUF_HEADROOM;
+		mbuf->data_off = RTE_PKTMBUF_HEADROOM;
 		mbuf->nb_segs = 1;
 		mbuf->port = rxq->port_id;
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index b897b1e..964ae06 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -998,7 +998,7 @@ ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
 		mb = rxep[i].mbuf;
 		rte_mbuf_refcnt_set(mb, 1);
 		mb->next = NULL;
-		mb->data = (char *)mb->buf_addr + RTE_PKTMBUF_HEADROOM;
+		mb->data_off = RTE_PKTMBUF_HEADROOM;
 		mb->nb_segs = 1;
 		mb->port = rxq->port_id;
 
@@ -1249,8 +1249,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		pkt_len = (uint16_t) (rte_le_to_cpu_16(rxd.wb.upper.length) -
 				      rxq->crc_len);
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rte_packet_prefetch(rxm->data);
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
+		rte_packet_prefetch((char *)rxm->buf_addr + rxm->data_off);
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		rxm->pkt_len = pkt_len;
@@ -1432,7 +1432,7 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
 		rxm->data_len = data_len;
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 		/*
 		 * If this is the first buffer of the received packet,
@@ -1523,7 +1523,8 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		}
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_packet_prefetch(first_seg->data);
+		rte_packet_prefetch((char *)first_seg->buf_addr +
+			first_seg->data_off);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
@@ -3213,7 +3214,7 @@ ixgbe_alloc_rx_queue_mbufs(struct igb_rx_queue *rxq)
 
 		rte_mbuf_refcnt_set(mbuf, 1);
 		mbuf->next = NULL;
-		mbuf->data = (char *)mbuf->buf_addr + RTE_PKTMBUF_HEADROOM;
+		mbuf->data_off = RTE_PKTMBUF_HEADROOM;
 		mbuf->nb_segs = 1;
 		mbuf->port = rxq->port_id;
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index 2f1f40c..38a3a03 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -47,8 +47,7 @@
 #endif
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->data) - \
-	(char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
diff --git a/lib/librte_pmd_pcap/rte_eth_pcap.c b/lib/librte_pmd_pcap/rte_eth_pcap.c
index 121de65..e60982e 100644
--- a/lib/librte_pmd_pcap/rte_eth_pcap.c
+++ b/lib/librte_pmd_pcap/rte_eth_pcap.c
@@ -151,7 +151,8 @@ eth_pcap_rx(void *queue,
 
 		if (header.len <= buf_size) {
 			/* pcap packet will fit in the mbuf, go ahead and copy */
-			rte_memcpy(mbuf->data, packet, header.len);
+			rte_memcpy(rte_pktmbuf_mtod(mbuf, void *), packet,
+					header.len);
 			mbuf->data_len = (uint16_t)header.len;
 			mbuf->pkt_len = mbuf->data_len;
 			bufs[num_rx] = mbuf;
@@ -202,7 +203,8 @@ eth_pcap_tx_dumper(void *queue,
 		calculate_timestamp(&header.ts);
 		header.len = mbuf->data_len;
 		header.caplen = header.len;
-		pcap_dump((u_char*) dumper_q->dumper, &header, mbuf->data);
+		pcap_dump((u_char*) dumper_q->dumper, &header,
+				rte_pktmbuf_mtod(mbuf, void*));
 		rte_pktmbuf_free(mbuf);
 		num_tx++;
 	}
@@ -237,7 +239,8 @@ eth_pcap_tx(void *queue,
 
 	for (i = 0; i < nb_pkts; i++) {
 		mbuf = bufs[i];
-		ret = pcap_sendpacket(tx_queue->pcap, (u_char*) mbuf->data,
+		ret = pcap_sendpacket(tx_queue->pcap,
+				rte_pktmbuf_mtod(mbuf, u_char *),
 				mbuf->data_len);
 		if (unlikely(ret != 0))
 			break;
diff --git a/lib/librte_pmd_virtio/virtio_rxtx.c b/lib/librte_pmd_virtio/virtio_rxtx.c
index 132ee45..29c9cea 100644
--- a/lib/librte_pmd_virtio/virtio_rxtx.c
+++ b/lib/librte_pmd_virtio/virtio_rxtx.c
@@ -118,7 +118,7 @@ virtqueue_dequeue_burst_rx(struct virtqueue *vq, struct rte_mbuf **rx_pkts,
 		}
 
 		rte_prefetch0(cookie);
-		rte_packet_prefetch(cookie->data);
+		rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
 		rx_pkts[i]  = cookie;
 		vq->vq_used_cons_idx++;
 		vq_ring_free_chain(vq, desc_idx);
@@ -480,7 +480,7 @@ virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 		}
 
 		rxm->port = rxvq->port_id;
-		rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
@@ -584,7 +584,7 @@ virtio_recv_mergeable_pkts(void *rx_queue,
 		if (seg_num == 0)
 			seg_num = 1;
 
-		rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 		rxm->nb_segs = seg_num;
 		rxm->next = NULL;
 		rxm->pkt_len = (uint32_t)(len[0] - hdr_size);
@@ -622,9 +622,7 @@ virtio_recv_mergeable_pkts(void *rx_queue,
 			while (extra_idx < rcv_cnt) {
 				rxm = rcv_pkts[extra_idx];
 
-				rxm->data =
-					(char *)rxm->buf_addr +
-					RTE_PKTMBUF_HEADROOM - hdr_size;
+				rxm->data_off = RTE_PKTMBUF_HEADROOM - hdr_size;
 				rxm->next = NULL;
 				rxm->pkt_len = (uint32_t)(len[extra_idx]);
 				rxm->data_len = (uint16_t)(len[extra_idx]);
diff --git a/lib/librte_pmd_virtio/virtqueue.h b/lib/librte_pmd_virtio/virtqueue.h
index d777feb..fdee054 100644
--- a/lib/librte_pmd_virtio/virtqueue.h
+++ b/lib/librte_pmd_virtio/virtqueue.h
@@ -59,8 +59,7 @@
 #define VIRTQUEUE_MAX_NAME_SZ 32
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->data) - \
-	(char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define VTNET_SQ_RQ_QUEUE_IDX 0
 #define VTNET_SQ_TQ_QUEUE_IDX 1
diff --git a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
index e74b6fd..263f9ce 100644
--- a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
+++ b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
@@ -79,8 +79,7 @@
 
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->data) - \
-	(char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
@@ -565,7 +564,7 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 			rxm->data_len = (uint16_t)rcd->len;
 			rxm->port = rxq->port_id;
 			rxm->vlan_tci = 0;
-			rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+			rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 			rx_pkts[nb_rx++] = rxm;
 
diff --git a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
index 22215ed..891cb58 100644
--- a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
+++ b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
@@ -110,7 +110,7 @@ eth_xenvirt_rx(void *q, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 		rxm = rx_pkts[i];
 		PMD_RX_LOG(DEBUG, "packet len:%d\n", len[i]);
 		rxm->next = NULL;
-		rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 		rxm->data_len = (uint16_t)(len[i] - sizeof(struct virtio_net_hdr));
 		rxm->nb_segs = 1;
 		rxm->port = pi->port_id;
-- 
1.9.3


* [dpdk-dev] [PATCH 02/13] mbuf: reorder fields by time of use
  2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 01/13] mbuf: replace data pointer by an offset Bruce Richardson
@ 2014-09-03 15:49 ` Bruce Richardson
  2014-09-08 10:17   ` Olivier MATZ
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field Bruce Richardson
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 62+ messages in thread
From: Bruce Richardson @ 2014-09-03 15:49 UTC (permalink / raw)
  To: dev

* Reorder the fields in the mbuf so that fields which are used together
sit side-by-side in the structure. This means that we have a contiguous
block of 8 bytes in the mbuf which is used to reset an mbuf on
descriptor rearm, and a block of 16 bytes of data (excluding flags)
which is set on RX from the received packet descriptor.
* Use dummy fields as appropriate to ensure alignment or to reserve gaps
for later field additions.
* Place most items which are not used by fast-path RX separately at the
end of the structure so they can later be moved to a separate cache
line. [The l2/l3 length fields are not moved at this stage, as doing so
would cause overflow onto the next cache line.]
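
As a sketch of why the contiguous 8-byte block matters (the helper name
is invented here; the patch itself only reorders fields): with buf_len,
data_off, the refcnt word, nb_segs and port adjacent in one aligned
8-byte run, a PMD can re-initialise all of them from a template mbuf
with a single 8-byte copy:

    #include <string.h>
    #include <rte_mbuf.h>

    static inline void
    mbuf_rearm_sketch(struct rte_mbuf *mb, const struct rte_mbuf *tmpl)
    {
            /* copies buf_len, data_off, refcnt, nb_segs and port */
            memcpy(&mb->buf_len, &tmpl->buf_len, 8);
    }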

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 669e7f5..f136d37 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -115,16 +115,12 @@ extern "C" {
  * The generic rte_mbuf, containing a packet mbuf.
  */
 struct rte_mbuf {
-	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
 	void *buf_addr;           /**< Virtual address of segment buffer. */
 	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
-	uint16_t buf_len;         /**< Length of segment buffer. */
 
-	/* valid for any segment */
-	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
+	/* next 8 bytes are initialised on RX descriptor rearm */
+	uint16_t buf_len;         /**< Length of segment buffer. */
 	uint16_t data_off;
-	uint16_t data_len;        /**< Amount of data in segment buffer. */
-	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
 
 #ifdef RTE_MBUF_REFCNT
 	/**
@@ -142,14 +138,17 @@ struct rte_mbuf {
 #else
 	uint16_t refcnt_reserved;     /**< Do not use this field */
 #endif
-	uint16_t reserved;            /**< Unused field. Required for padding */
-	uint16_t ol_flags;            /**< Offload features. */
-
-	/* these fields are valid for first segment only */
 	uint8_t nb_segs;        /**< Number of segments. */
 	uint8_t port;           /**< Input port. */
 
-	/* offload features, valid for first segment only */
+	uint16_t ol_flags;      /**< Offload features. */
+	uint16_t reserved0;     /**< Unused field. Required for padding */
+	uint32_t reserved1;     /**< Unused field. Required for padding */
+
+	/* remaining bytes are set on RX when pulling packet from descriptor */
+	uint16_t reserved2;     /**< Unused field. Required for padding */
+	uint16_t data_len;      /**< Amount of data in segment buffer. */
+	uint32_t pkt_len;       /**< Total pkt len: sum of all segments. */
 	uint16_t l3_len:9;      /**< L3 (IP) Header Length. */
 	uint16_t l2_len:7;      /**< L2 (MAC) Header Length. */
 	uint16_t vlan_tci;      /**< VLAN Tag Control Identifier (CPU order). */
@@ -162,6 +161,10 @@ struct rte_mbuf {
 		uint32_t sched;     /**< Hierarchical scheduler */
 	} hash;                 /**< hash information */
 
+	/* fields only used in slow path or on TX */
+	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
+	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
+
 	union {
 		uint8_t metadata[0];
 		uint16_t metadata16[0];
-- 
1.9.3


* [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field
  2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 01/13] mbuf: replace data pointer by an offset Bruce Richardson
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 02/13] mbuf: reorder fields by time of use Bruce Richardson
@ 2014-09-03 15:49 ` Bruce Richardson
  2014-09-08 10:17   ` Olivier MATZ
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 04/13] mbuf: expand ol_flags field to 64-bits Bruce Richardson
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 62+ messages in thread
From: Bruce Richardson @ 2014-09-03 15:49 UTC (permalink / raw)
  To: dev

Replace a reserved slot with the new packet_type field, used to
identify the type of the packet, i.e. which protocols are used.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index f136d37..8d0c6fb 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -146,7 +146,7 @@ struct rte_mbuf {
 	uint32_t reserved1;     /**< Unused field. Required for padding */
 
 	/* remaining bytes are set on RX when pulling packet from descriptor */
-	uint16_t reserved2;     /**< Unused field. Required for padding */
+	uint16_t packet_type;   /**< Type of packet, e.g. protocols used */
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segments. */
 	uint16_t l3_len:9;      /**< L3 (IP) Header Length. */
-- 
1.9.3


* [dpdk-dev] [PATCH 04/13] mbuf: expand ol_flags field to 64-bits
  2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                   ` (2 preceding siblings ...)
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field Bruce Richardson
@ 2014-09-03 15:49 ` Bruce Richardson
  2014-09-08 10:25   ` Olivier MATZ
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 05/13] mbuf: introduce a flag to indicate a control mbuf Bruce Richardson
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 62+ messages in thread
From: Bruce Richardson @ 2014-09-03 15:49 UTC (permalink / raw)
  To: dev

The offload flags field (ol_flags) was 16 bits wide and had no further
room for expansion. This patch increases the field size to 64 bits,
using up the remaining reserved space in the single-cache-line mbuf.

NOTE: none of the values for the existing flags have been changed, i.e.
no new numbers have been explicitly reserved between the existing flag
definitions.
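
As a short illustration (the flag and handler below are invented for
the example; this patch defines no new flags), the wider field lets new
flags be defined beyond bit 15 and tested with 64-bit constants:

    #define PKT_RX_EXAMPLE (1ULL << 40)    /* hypothetical flag */

    uint64_t ol_flags = mb->ol_flags;
    if (ol_flags & PKT_RX_EXAMPLE)
            handle_example_case(mb);       /* hypothetical handler */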

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 app/test-pmd/config.c             |  8 ++---
 app/test-pmd/csumonly.c           |  8 ++---
 app/test-pmd/rxonly.c             |  2 +-
 app/test-pmd/testpmd.h            |  4 +--
 app/test-pmd/txonly.c             |  2 +-
 lib/librte_mbuf/rte_mbuf.c        |  2 +-
 lib/librte_mbuf/rte_mbuf.h        |  4 +--
 lib/librte_pmd_e1000/em_rxtx.c    | 34 ++++++++++-----------
 lib/librte_pmd_e1000/igb_rxtx.c   | 62 ++++++++++++++++++---------------------
 lib/librte_pmd_i40e/i40e_rxtx.c   | 31 ++++++++++----------
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 61 ++++++++++++++++++--------------------
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h |  2 +-
 12 files changed, 104 insertions(+), 116 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 606e34a..adfa9a8 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -1714,14 +1714,14 @@ set_qmap(portid_t port_id, uint8_t is_rx, uint16_t queue_id, uint8_t map_value)
 }
 
 void
-tx_cksum_set(portid_t port_id, uint8_t cksum_mask)
+tx_cksum_set(portid_t port_id, uint64_t ol_flags)
 {
-	uint16_t tx_ol_flags;
+	uint64_t tx_ol_flags;
 	if (port_id_is_invalid(port_id))
 		return;
 	/* Clear last 4 bits and then set L3/4 checksum mask again */
-	tx_ol_flags = (uint16_t) (ports[port_id].tx_ol_flags & 0xFFF0);
-	ports[port_id].tx_ol_flags = (uint16_t) ((cksum_mask & 0xf) | tx_ol_flags);
+	tx_ol_flags = ports[port_id].tx_ol_flags & (~0x0Full);
+	ports[port_id].tx_ol_flags = ((ol_flags & 0xf) | tx_ol_flags);
 }
 
 void
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 2ce4c42..fcc4876 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -217,9 +217,9 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t i;
-	uint16_t ol_flags;
-	uint16_t pkt_ol_flags;
-	uint16_t tx_ol_flags;
+	uint64_t ol_flags;
+	uint64_t pkt_ol_flags;
+	uint64_t tx_ol_flags;
 	uint16_t l4_proto;
 	uint16_t eth_type;
 	uint8_t  l2_len;
@@ -261,7 +261,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 		mb = pkts_burst[i];
 		l2_len  = sizeof(struct ether_hdr);
 		pkt_ol_flags = mb->ol_flags;
-		ol_flags = (uint16_t) (pkt_ol_flags & (~PKT_TX_L4_MASK));
+		ol_flags = (pkt_ol_flags & (~PKT_TX_L4_MASK));
 
 		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		eth_type = rte_be_to_cpu_16(eth_hdr->ether_type);
diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index 7ba36a1..98c788b 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -109,7 +109,7 @@ pkt_burst_receive(struct fwd_stream *fs)
 	struct rte_mbuf  *mb;
 	struct ether_hdr *eth_hdr;
 	uint16_t eth_type;
-	uint16_t ol_flags;
+	uint64_t ol_flags;
 	uint16_t nb_rx;
 	uint16_t i;
 #ifdef RTE_TEST_PMD_RECORD_CORE_CYCLES
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 09923a8..142091d 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -141,7 +141,7 @@ struct rte_port {
 	struct fwd_stream       *rx_stream; /**< Port RX stream, if unique */
 	struct fwd_stream       *tx_stream; /**< Port TX stream, if unique */
 	unsigned int            socket_id;  /**< For NUMA support */
-	uint16_t                tx_ol_flags;/**< Offload Flags of TX packets. */
+	uint64_t                tx_ol_flags;/**< Offload Flags of TX packets. */
 	uint16_t                tx_vlan_id; /**< Tag Id. in TX VLAN packets. */
 	void                    *fwd_ctx;   /**< Forwarding mode context */
 	uint64_t                rx_bad_ip_csum; /**< rx pkts with bad ip checksum  */
@@ -494,7 +494,7 @@ void tx_vlan_pvid_set(portid_t port_id, uint16_t vlan_id, int on);
 
 void set_qmap(portid_t port_id, uint8_t is_rx, uint16_t queue_id, uint8_t map_value);
 
-void tx_cksum_set(portid_t port_id, uint8_t cksum_mask);
+void tx_cksum_set(portid_t port_id, uint64_t ol_flags);
 
 void set_verbose_level(uint16_t vb_level);
 void set_tx_pkt_segments(unsigned *seg_lengths, unsigned nb_segs);
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index 2b3f0b9..3d08005 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -203,7 +203,7 @@ pkt_burst_transmit(struct fwd_stream *fs)
 	uint16_t nb_tx;
 	uint16_t nb_pkt;
 	uint16_t vlan_tci;
-	uint16_t ol_flags;
+	uint64_t ol_flags;
 	uint8_t  i;
 #ifdef RTE_TEST_PMD_RECORD_CORE_CYCLES
 	uint64_t start_tsc;
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 26e36eb..9dfcac3 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -174,7 +174,7 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 
 	fprintf(f, "dump mbuf at 0x%p, phys=%"PRIx64", buf_len=%u\n",
 	       m, (uint64_t)m->buf_physaddr, (unsigned)m->buf_len);
-	fprintf(f, "  pkt_len=%"PRIu32", ol_flags=%"PRIx16", nb_segs=%u, "
+	fprintf(f, "  pkt_len=%"PRIu32", ol_flags=%"PRIx64", nb_segs=%u, "
 	       "in_port=%u\n", m->pkt_len, m->ol_flags,
 	       (unsigned)m->nb_segs, (unsigned)m->port);
 	nb_segs = m->nb_segs;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 8d0c6fb..6af5b2f 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -141,9 +141,7 @@ struct rte_mbuf {
 	uint8_t nb_segs;        /**< Number of segments. */
 	uint8_t port;           /**< Input port. */
 
-	uint16_t ol_flags;      /**< Offload features. */
-	uint16_t reserved0;     /**< Unused field. Required for padding */
-	uint32_t reserved1;     /**< Unused field. Required for padding */
+	uint64_t ol_flags;      /**< Offload features. */
 
 	/* remaining bytes are set on RX when pulling packet from descriptor */
 	uint16_t packet_type;   /**< Type of packet, e.g. protocols used */
diff --git a/lib/librte_pmd_e1000/em_rxtx.c b/lib/librte_pmd_e1000/em_rxtx.c
index 67736d6..eb76660 100644
--- a/lib/librte_pmd_e1000/em_rxtx.c
+++ b/lib/librte_pmd_e1000/em_rxtx.c
@@ -168,7 +168,7 @@ union em_vlan_macip {
  * Structure to check if a new context needs to be built
  */
 struct em_ctx_info {
-	uint16_t flags;               /**< ol_flags related to context build. */
+	uint64_t flags;               /**< ol_flags related to context build. */
 	uint32_t cmp_mask;            /**< compare mask */
 	union em_vlan_macip hdrlen;  /**< L2 and L3 header lengths */
 };
@@ -238,7 +238,7 @@ struct em_tx_queue {
 static inline void
 em_set_xmit_ctx(struct em_tx_queue* txq,
 		volatile struct e1000_context_desc *ctx_txd,
-		uint16_t flags,
+		uint64_t flags,
 		union em_vlan_macip hdrlen)
 {
 	uint32_t cmp_mask, cmd_len;
@@ -304,7 +304,7 @@ em_set_xmit_ctx(struct em_tx_queue* txq,
  * or create a new context descriptor.
  */
 static inline uint32_t
-what_ctx_update(struct em_tx_queue *txq, uint16_t flags,
+what_ctx_update(struct em_tx_queue *txq, uint64_t flags,
 		union em_vlan_macip hdrlen)
 {
 	/* If match with the current context */
@@ -377,7 +377,7 @@ em_xmit_cleanup(struct em_tx_queue *txq)
 }
 
 static inline uint32_t
-tx_desc_cksum_flags_to_upper(uint16_t ol_flags)
+tx_desc_cksum_flags_to_upper(uint64_t ol_flags)
 {
 	static const uint32_t l4_olinfo[2] = {0, E1000_TXD_POPTS_TXSM << 8};
 	static const uint32_t l3_olinfo[2] = {0, E1000_TXD_POPTS_IXSM << 8};
@@ -403,12 +403,12 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint32_t popts_spec;
 	uint32_t cmd_type_len;
 	uint16_t slen;
-	uint16_t ol_flags;
+	uint64_t ol_flags;
 	uint16_t tx_id;
 	uint16_t tx_last;
 	uint16_t nb_tx;
 	uint16_t nb_used;
-	uint16_t tx_ol_req;
+	uint64_t tx_ol_req;
 	uint32_t ctx;
 	uint32_t new_ctx;
 	union em_vlan_macip hdrlen;
@@ -438,8 +438,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		ol_flags = tx_pkt->ol_flags;
 
 		/* If hardware offload required */
-		tx_ol_req = (uint16_t)(ol_flags & (PKT_TX_IP_CKSUM |
-							PKT_TX_L4_MASK));
+		tx_ol_req = (ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_L4_MASK));
 		if (tx_ol_req) {
 			hdrlen.f.vlan_tci = tx_pkt->vlan_tci;
 			hdrlen.f.l2_len = tx_pkt->l2_len;
@@ -642,22 +641,21 @@ end_of_tx:
  *
  **********************************************************************/
 
-static inline uint16_t
+static inline uint64_t
 rx_desc_status_to_pkt_flags(uint32_t rx_status)
 {
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	/* Check if VLAN present */
-	pkt_flags = (uint16_t)((rx_status & E1000_RXD_STAT_VP) ?
-						PKT_RX_VLAN_PKT : 0);
+	pkt_flags = ((rx_status & E1000_RXD_STAT_VP) ?  PKT_RX_VLAN_PKT : 0);
 
 	return pkt_flags;
 }
 
-static inline uint16_t
+static inline uint64_t
 rx_desc_error_to_pkt_flags(uint32_t rx_error)
 {
-	uint16_t pkt_flags = 0;
+	uint64_t pkt_flags = 0;
 
 	if (rx_error & E1000_RXD_ERR_IPE)
 		pkt_flags |= PKT_RX_IP_CKSUM_BAD;
@@ -801,8 +799,8 @@ eth_em_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rxm->port = rxq->port_id;
 
 		rxm->ol_flags = rx_desc_status_to_pkt_flags(status);
-		rxm->ol_flags = (uint16_t)(rxm->ol_flags |
-				rx_desc_error_to_pkt_flags(rxd.errors));
+		rxm->ol_flags = rxm->ol_flags |
+				rx_desc_error_to_pkt_flags(rxd.errors);
 
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
 		rxm->vlan_tci = rte_le_to_cpu_16(rxd.special);
@@ -1027,8 +1025,8 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		first_seg->port = rxq->port_id;
 
 		first_seg->ol_flags = rx_desc_status_to_pkt_flags(status);
-		first_seg->ol_flags = (uint16_t)(first_seg->ol_flags |
-					rx_desc_error_to_pkt_flags(rxd.errors));
+		first_seg->ol_flags = first_seg->ol_flags |
+					rx_desc_error_to_pkt_flags(rxd.errors);
 
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
 		rxm->vlan_tci = rte_le_to_cpu_16(rxd.special);
diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
index e364dda..fab888d 100644
--- a/lib/librte_pmd_e1000/igb_rxtx.c
+++ b/lib/librte_pmd_e1000/igb_rxtx.c
@@ -176,7 +176,7 @@ union igb_vlan_macip {
  * Structure to check if a new context needs to be built
  */
 struct igb_advctx_info {
-	uint16_t flags;           /**< ol_flags related to context build. */
+	uint64_t flags;           /**< ol_flags related to context build. */
 	uint32_t cmp_mask;        /**< compare mask for vlan_macip_lens */
 	union igb_vlan_macip vlan_macip_lens; /**< vlan, mac & ip length. */
 };
@@ -244,7 +244,7 @@ struct igb_tx_queue {
 static inline void
 igbe_set_xmit_ctx(struct igb_tx_queue* txq,
 		volatile struct e1000_adv_tx_context_desc *ctx_txd,
-		uint16_t ol_flags, uint32_t vlan_macip_lens)
+		uint64_t ol_flags, uint32_t vlan_macip_lens)
 {
 	uint32_t type_tucmd_mlhl;
 	uint32_t mss_l4len_idx;
@@ -309,7 +309,7 @@ igbe_set_xmit_ctx(struct igb_tx_queue* txq,
  * or create a new context descriptor.
  */
 static inline uint32_t
-what_advctx_update(struct igb_tx_queue *txq, uint16_t flags,
+what_advctx_update(struct igb_tx_queue *txq, uint64_t flags,
 		uint32_t vlan_macip_lens)
 {
 	/* If match with the current context */
@@ -332,7 +332,7 @@ what_advctx_update(struct igb_tx_queue *txq, uint16_t flags,
 }
 
 static inline uint32_t
-tx_desc_cksum_flags_to_olinfo(uint16_t ol_flags)
+tx_desc_cksum_flags_to_olinfo(uint64_t ol_flags)
 {
 	static const uint32_t l4_olinfo[2] = {0, E1000_ADVTXD_POPTS_TXSM};
 	static const uint32_t l3_olinfo[2] = {0, E1000_ADVTXD_POPTS_IXSM};
@@ -344,7 +344,7 @@ tx_desc_cksum_flags_to_olinfo(uint16_t ol_flags)
 }
 
 static inline uint32_t
-tx_desc_vlan_flags_to_cmdtype(uint16_t ol_flags)
+tx_desc_vlan_flags_to_cmdtype(uint64_t ol_flags)
 {
 	static uint32_t vlan_cmd[2] = {0, E1000_ADVTXD_DCMD_VLE};
 	return vlan_cmd[(ol_flags & PKT_TX_VLAN_PKT) != 0];
@@ -367,12 +367,12 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint32_t cmd_type_len;
 	uint32_t pkt_len;
 	uint16_t slen;
-	uint16_t ol_flags;
+	uint64_t ol_flags;
 	uint16_t tx_end;
 	uint16_t tx_id;
 	uint16_t tx_last;
 	uint16_t nb_tx;
-	uint16_t tx_ol_req;
+	uint64_t tx_ol_req;
 	uint32_t new_ctx = 0;
 	uint32_t ctx = 0;
 
@@ -402,7 +402,7 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
 		vlan_macip_lens.f.l2_len = tx_pkt->l2_len;
 		vlan_macip_lens.f.l3_len = tx_pkt->l3_len;
-		tx_ol_req = (uint16_t)(ol_flags & PKT_TX_OFFLOAD_MASK);
+		tx_ol_req = ol_flags & PKT_TX_OFFLOAD_MASK;
 
 		/* If a context descriptor needs to be built. */
 		if (tx_ol_req) {
@@ -589,12 +589,12 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
  *  RX functions
  *
  **********************************************************************/
-static inline uint16_t
+static inline uint64_t
 rx_desc_hlen_type_rss_to_pkt_flags(uint32_t hl_tp_rs)
 {
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
-	static uint16_t ip_pkt_types_map[16] = {
+	static uint64_t ip_pkt_types_map[16] = {
 		0, PKT_RX_IPV4_HDR, PKT_RX_IPV4_HDR_EXT, PKT_RX_IPV4_HDR_EXT,
 		PKT_RX_IPV6_HDR, 0, 0, 0,
 		PKT_RX_IPV6_HDR_EXT, 0, 0, 0,
@@ -607,34 +607,32 @@ rx_desc_hlen_type_rss_to_pkt_flags(uint32_t hl_tp_rs)
 		0, 0, 0, 0,
 	};
 
-	pkt_flags = (uint16_t)((hl_tp_rs & E1000_RXDADV_PKTTYPE_ETQF) ?
+	pkt_flags = (hl_tp_rs & E1000_RXDADV_PKTTYPE_ETQF) ?
 				ip_pkt_etqf_map[(hl_tp_rs >> 4) & 0x07] :
-				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
+				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F];
 #else
-	pkt_flags = (uint16_t)((hl_tp_rs & E1000_RXDADV_PKTTYPE_ETQF) ? 0 :
-				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
+	pkt_flags = (hl_tp_rs & E1000_RXDADV_PKTTYPE_ETQF) ? 0 :
+				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F];
 #endif
-	return (uint16_t)(pkt_flags | (((hl_tp_rs & 0x0F) == 0) ?
-						0 : PKT_RX_RSS_HASH));
+	return pkt_flags | (((hl_tp_rs & 0x0F) == 0) ?  0 : PKT_RX_RSS_HASH);
 }
 
-static inline uint16_t
+static inline uint64_t
 rx_desc_status_to_pkt_flags(uint32_t rx_status)
 {
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	/* Check if VLAN present */
-	pkt_flags = (uint16_t)((rx_status & E1000_RXD_STAT_VP) ?
-						PKT_RX_VLAN_PKT : 0);
+	pkt_flags = (rx_status & E1000_RXD_STAT_VP) ?  PKT_RX_VLAN_PKT : 0;
 
 #if defined(RTE_LIBRTE_IEEE1588)
 	if (rx_status & E1000_RXD_STAT_TMST)
-		pkt_flags = (uint16_t)(pkt_flags | PKT_RX_IEEE1588_TMST);
+		pkt_flags = pkt_flags | PKT_RX_IEEE1588_TMST;
 #endif
 	return pkt_flags;
 }
 
-static inline uint16_t
+static inline uint64_t
 rx_desc_error_to_pkt_flags(uint32_t rx_status)
 {
 	/*
@@ -642,7 +640,7 @@ rx_desc_error_to_pkt_flags(uint32_t rx_status)
 	 * Bit 29: L4I, L4I integrity error
 	 */
 
-	static uint16_t error_to_pkt_flags_map[4] = {
+	static uint64_t error_to_pkt_flags_map[4] = {
 		0,  PKT_RX_L4_CKSUM_BAD, PKT_RX_IP_CKSUM_BAD,
 		PKT_RX_IP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD
 	};
@@ -669,7 +667,7 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 	uint16_t rx_id;
 	uint16_t nb_rx;
 	uint16_t nb_hold;
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	nb_rx = 0;
 	nb_hold = 0;
@@ -788,10 +786,8 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rxm->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
 
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_status_to_pkt_flags(staterr));
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_error_to_pkt_flags(staterr));
+		pkt_flags = pkt_flags | rx_desc_status_to_pkt_flags(staterr);
+		pkt_flags = pkt_flags | rx_desc_error_to_pkt_flags(staterr);
 		rxm->ol_flags = pkt_flags;
 
 		/*
@@ -848,7 +844,7 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 	uint16_t nb_rx;
 	uint16_t nb_hold;
 	uint16_t data_len;
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	nb_rx = 0;
 	nb_hold = 0;
@@ -1024,10 +1020,8 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		first_seg->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_status_to_pkt_flags(staterr));
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_error_to_pkt_flags(staterr));
+		pkt_flags = pkt_flags | rx_desc_status_to_pkt_flags(staterr);
+		pkt_flags = pkt_flags | rx_desc_error_to_pkt_flags(staterr);
 		first_seg->ol_flags = pkt_flags;
 
 		/* Prefetch data of first segment, if configured to do so. */
diff --git a/lib/librte_pmd_i40e/i40e_rxtx.c b/lib/librte_pmd_i40e/i40e_rxtx.c
index 25a5f6f..cd05654 100644
--- a/lib/librte_pmd_i40e/i40e_rxtx.c
+++ b/lib/librte_pmd_i40e/i40e_rxtx.c
@@ -91,27 +91,27 @@ static uint16_t i40e_xmit_pkts_simple(void *tx_queue,
 				      uint16_t nb_pkts);
 
 /* Translate the rx descriptor status to pkt flags */
-static inline uint16_t
+static inline uint64_t
 i40e_rxd_status_to_pkt_flags(uint64_t qword)
 {
-	uint16_t flags;
+	uint64_t flags;
 
 	/* Check if VLAN packet */
-	flags = (uint16_t)(qword & (1 << I40E_RX_DESC_STATUS_L2TAG1P_SHIFT) ?
-							PKT_RX_VLAN_PKT : 0);
+	flags = qword & (1 << I40E_RX_DESC_STATUS_L2TAG1P_SHIFT) ?
+							PKT_RX_VLAN_PKT : 0;
 
 	/* Check if RSS_HASH */
-	flags |= (uint16_t)((((qword >> I40E_RX_DESC_STATUS_FLTSTAT_SHIFT) &
+	flags |= (((qword >> I40E_RX_DESC_STATUS_FLTSTAT_SHIFT) &
 					I40E_RX_DESC_FLTSTAT_RSS_HASH) ==
-			I40E_RX_DESC_FLTSTAT_RSS_HASH) ? PKT_RX_RSS_HASH : 0);
+			I40E_RX_DESC_FLTSTAT_RSS_HASH) ? PKT_RX_RSS_HASH : 0;
 
 	return flags;
 }
 
-static inline uint16_t
+static inline uint64_t
 i40e_rxd_error_to_pkt_flags(uint64_t qword)
 {
-	uint16_t flags = 0;
+	uint64_t flags = 0;
 	uint64_t error_bits = (qword >> I40E_RXD_QW1_ERROR_SHIFT);
 
 #define I40E_RX_ERR_BITS 0x3f
@@ -143,12 +143,12 @@ i40e_rxd_error_to_pkt_flags(uint64_t qword)
 }
 
 /* Translate pkt types to pkt flags */
-static inline uint16_t
+static inline uint64_t
 i40e_rxd_ptype_to_pkt_flags(uint64_t qword)
 {
 	uint8_t ptype = (uint8_t)((qword & I40E_RXD_QW1_PTYPE_MASK) >>
 					I40E_RXD_QW1_PTYPE_SHIFT);
-	static const uint16_t ip_ptype_map[I40E_MAX_PKT_TYPE] = {
+	static const uint64_t ip_ptype_map[I40E_MAX_PKT_TYPE] = {
 		0, /* PTYPE 0 */
 		0, /* PTYPE 1 */
 		0, /* PTYPE 2 */
@@ -567,7 +567,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 	uint32_t rx_status;
 	int32_t s[I40E_LOOK_AHEAD], nb_dd;
 	int32_t i, j, nb_rx = 0;
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	rxdp = &rxq->rx_ring[rxq->rx_tail];
 	rxep = &rxq->sw_ring[rxq->rx_tail];
@@ -789,7 +789,7 @@ i40e_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 	uint16_t rx_packet_len;
 	uint16_t rx_id, nb_hold;
 	uint64_t dma_addr;
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	nb_rx = 0;
 	nb_hold = 0;
@@ -896,10 +896,11 @@ i40e_recv_scattered_pkts(void *rx_queue,
 	struct rte_mbuf *last_seg = rxq->pkt_last_seg;
 	struct rte_mbuf *nmb, *rxm;
 	uint16_t rx_id = rxq->rx_tail;
-	uint16_t nb_rx = 0, nb_hold = 0, rx_packet_len, pkt_flags;
+	uint16_t nb_rx = 0, nb_hold = 0, rx_packet_len;
 	uint32_t rx_status;
 	uint64_t qword1;
 	uint64_t dma_addr;
+	uint64_t pkt_flags;
 
 	while (nb_rx < nb_pkts) {
 		rxdp = &rx_ring[rx_id];
@@ -1046,7 +1047,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 
 /* Check if the context descriptor is needed for TX offloading */
 static inline uint16_t
-i40e_calc_context_desc(uint16_t flags)
+i40e_calc_context_desc(uint64_t flags)
 {
 	uint16_t mask = 0;
 
@@ -1075,7 +1076,7 @@ i40e_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 	uint32_t td_offset;
 	uint32_t tx_flags;
 	uint32_t td_tag;
-	uint16_t ol_flags;
+	uint64_t ol_flags;
 	uint8_t l2_len;
 	uint8_t l3_len;
 	uint16_t nb_used;
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 964ae06..c7f1642 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -358,7 +358,7 @@ ixgbe_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
 static inline void
 ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
 		volatile struct ixgbe_adv_tx_context_desc *ctx_txd,
-		uint16_t ol_flags, uint32_t vlan_macip_lens)
+		uint64_t ol_flags, uint32_t vlan_macip_lens)
 {
 	uint32_t type_tucmd_mlhl;
 	uint32_t mss_l4len_idx;
@@ -421,7 +421,7 @@ ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
  * or create a new context descriptor.
  */
 static inline uint32_t
-what_advctx_update(struct igb_tx_queue *txq, uint16_t flags,
+what_advctx_update(struct igb_tx_queue *txq, uint64_t flags,
 		uint32_t vlan_macip_lens)
 {
 	/* If match with the current used context */
@@ -444,7 +444,7 @@ what_advctx_update(struct igb_tx_queue *txq, uint16_t flags,
 }
 
 static inline uint32_t
-tx_desc_cksum_flags_to_olinfo(uint16_t ol_flags)
+tx_desc_cksum_flags_to_olinfo(uint64_t ol_flags)
 {
 	static const uint32_t l4_olinfo[2] = {0, IXGBE_ADVTXD_POPTS_TXSM};
 	static const uint32_t l3_olinfo[2] = {0, IXGBE_ADVTXD_POPTS_IXSM};
@@ -456,7 +456,7 @@ tx_desc_cksum_flags_to_olinfo(uint16_t ol_flags)
 }
 
 static inline uint32_t
-tx_desc_vlan_flags_to_cmdtype(uint16_t ol_flags)
+tx_desc_vlan_flags_to_cmdtype(uint64_t ol_flags)
 {
 	static const uint32_t vlan_cmd[2] = {0, IXGBE_ADVTXD_DCMD_VLE};
 	return vlan_cmd[(ol_flags & PKT_TX_VLAN_PKT) != 0];
@@ -546,12 +546,12 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint32_t cmd_type_len;
 	uint32_t pkt_len;
 	uint16_t slen;
-	uint16_t ol_flags;
+	uint64_t ol_flags;
 	uint16_t tx_id;
 	uint16_t tx_last;
 	uint16_t nb_tx;
 	uint16_t nb_used;
-	uint16_t tx_ol_req;
+	uint64_t tx_ol_req;
 	uint32_t ctx = 0;
 	uint32_t new_ctx;
 
@@ -584,7 +584,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		vlan_macip_lens.f.l3_len = tx_pkt->l3_len;
 
 		/* If hardware offload required */
-		tx_ol_req = (uint16_t)(ol_flags & PKT_TX_OFFLOAD_MASK);
+		tx_ol_req = ol_flags & PKT_TX_OFFLOAD_MASK;
 		if (tx_ol_req) {
 			/* If a new context needs to be built or the existing ctx reused. */
 			ctx = what_advctx_update(txq, tx_ol_req,
@@ -814,19 +814,19 @@ end_of_tx:
  *  RX functions
  *
  **********************************************************************/
-static inline uint16_t
+static inline uint64_t
 rx_desc_hlen_type_rss_to_pkt_flags(uint32_t hl_tp_rs)
 {
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
-	static uint16_t ip_pkt_types_map[16] = {
+	static uint64_t ip_pkt_types_map[16] = {
 		0, PKT_RX_IPV4_HDR, PKT_RX_IPV4_HDR_EXT, PKT_RX_IPV4_HDR_EXT,
 		PKT_RX_IPV6_HDR, 0, 0, 0,
 		PKT_RX_IPV6_HDR_EXT, 0, 0, 0,
 		PKT_RX_IPV6_HDR_EXT, 0, 0, 0,
 	};
 
-	static uint16_t ip_rss_types_map[16] = {
+	static uint64_t ip_rss_types_map[16] = {
 		0, PKT_RX_RSS_HASH, PKT_RX_RSS_HASH, PKT_RX_RSS_HASH,
 		0, PKT_RX_RSS_HASH, 0, PKT_RX_RSS_HASH,
 		PKT_RX_RSS_HASH, 0, 0, 0,
@@ -839,45 +839,44 @@ rx_desc_hlen_type_rss_to_pkt_flags(uint32_t hl_tp_rs)
 		0, 0, 0, 0,
 	};
 
-	pkt_flags = (uint16_t) ((hl_tp_rs & IXGBE_RXDADV_PKTTYPE_ETQF) ?
-				ip_pkt_etqf_map[(hl_tp_rs >> 4) & 0x07] :
-				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
+	pkt_flags = (hl_tp_rs & IXGBE_RXDADV_PKTTYPE_ETQF) ?
+			ip_pkt_etqf_map[(hl_tp_rs >> 4) & 0x07] :
+			ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F];
 #else
-	pkt_flags = (uint16_t) ((hl_tp_rs & IXGBE_RXDADV_PKTTYPE_ETQF) ? 0 :
-				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
+	pkt_flags = (hl_tp_rs & IXGBE_RXDADV_PKTTYPE_ETQF) ? 0 :
+			ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F];
 
 #endif
-	return (uint16_t)(pkt_flags | ip_rss_types_map[hl_tp_rs & 0xF]);
+	return pkt_flags | ip_rss_types_map[hl_tp_rs & 0xF];
 }
 
-static inline uint16_t
+static inline uint64_t
 rx_desc_status_to_pkt_flags(uint32_t rx_status)
 {
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	/*
 	 * Check if VLAN present only.
 	 * Do not check whether L3/L4 rx checksum was done by the NIC or not;
 	 * that can be found from the rte_eth_rxmode.hw_ip_checksum flag.
 	 */
-	pkt_flags = (uint16_t)((rx_status & IXGBE_RXD_STAT_VP) ?
-						PKT_RX_VLAN_PKT : 0);
+	pkt_flags = (rx_status & IXGBE_RXD_STAT_VP) ?  PKT_RX_VLAN_PKT : 0;
 
 #ifdef RTE_LIBRTE_IEEE1588
 	if (rx_status & IXGBE_RXD_STAT_TMST)
-		pkt_flags = (uint16_t)(pkt_flags | PKT_RX_IEEE1588_TMST);
+		pkt_flags = pkt_flags | PKT_RX_IEEE1588_TMST;
 #endif
 	return pkt_flags;
 }
 
-static inline uint16_t
+static inline uint64_t
 rx_desc_error_to_pkt_flags(uint32_t rx_status)
 {
 	/*
 	 * Bit 31: IPE, IPv4 checksum error
 	 * Bit 30: L4I, L4I integrity error
 	 */
-	static uint16_t error_to_pkt_flags_map[4] = {
+	static uint64_t error_to_pkt_flags_map[4] = {
 		0,  PKT_RX_L4_CKSUM_BAD, PKT_RX_IP_CKSUM_BAD,
 		PKT_RX_IP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD
 	};
@@ -948,10 +947,10 @@ ixgbe_rx_scan_hw_ring(struct igb_rx_queue *rxq)
 			mb->ol_flags  = rx_desc_hlen_type_rss_to_pkt_flags(
 					rxdp[j].wb.lower.lo_dword.data);
 			/* reuse status field from scan list */
-			mb->ol_flags = (uint16_t)(mb->ol_flags |
-					rx_desc_status_to_pkt_flags(s[j]));
-			mb->ol_flags = (uint16_t)(mb->ol_flags |
-					rx_desc_error_to_pkt_flags(s[j]));
+			mb->ol_flags = mb->ol_flags |
+					rx_desc_status_to_pkt_flags(s[j]);
+			mb->ol_flags = mb->ol_flags |
+					rx_desc_error_to_pkt_flags(s[j]);
 		}
 
 		/* Move mbuf pointers from the S/W ring to the stage */
@@ -1144,7 +1143,7 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 	uint16_t rx_id;
 	uint16_t nb_rx;
 	uint16_t nb_hold;
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	nb_rx = 0;
 	nb_hold = 0;
@@ -1262,10 +1261,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rxm->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
 
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_status_to_pkt_flags(staterr));
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_error_to_pkt_flags(staterr));
+		pkt_flags = pkt_flags | rx_desc_status_to_pkt_flags(staterr);
+		pkt_flags = pkt_flags | rx_desc_error_to_pkt_flags(staterr);
 		rxm->ol_flags = pkt_flags;
 
 		if (likely(pkt_flags & PKT_RX_RSS_HASH))
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index 38a3a03..70dbcb0 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -178,7 +178,7 @@ union ixgbe_vlan_macip {
  */
 
 struct ixgbe_advctx_info {
-	uint16_t flags;           /**< ol_flags for context build. */
+	uint64_t flags;           /**< ol_flags for context build. */
 	uint32_t cmp_mask;        /**< compare mask for vlan_macip_lens */
 	union ixgbe_vlan_macip vlan_macip_lens; /**< vlan, mac ip length. */
 };
-- 
1.9.3


* [dpdk-dev] [PATCH 05/13] mbuf: introduce a flag to indicate a control mbuf
  2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                   ` (3 preceding siblings ...)
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 04/13] mbuf: expand ol_flags field to 64-bits Bruce Richardson
@ 2014-09-03 15:49 ` Bruce Richardson
  2014-09-08 11:53   ` Olivier MATZ
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 06/13] mbuf: minor changes for readability Bruce Richardson
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 62+ messages in thread
From: Bruce Richardson @ 2014-09-03 15:49 UTC (permalink / raw)
  To: dev

Since the flags field is now 64 bits wide, we can allow one bit to be
used to indicate a control, i.e. non-packet, mbuf. Dedicate the high
bit (bit 63) for this purpose and add a utility function to test if a
given mbuf has the bit set or not.
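
A rough usage sketch, assuming an application that mixes packet and
control mbufs on one ring; the two handler functions are hypothetical,
while rte_ctrlmbuf_alloc() and rte_is_ctrlmbuf() are the real entry
points:

    struct rte_mbuf *m = rte_ctrlmbuf_alloc(ctrl_pool); /* ctrl_pool: assumed mempool */
    /* ... later, when consuming mbufs of either kind ... */
    if (rte_is_ctrlmbuf(m))
        handle_control_message(m);  /* hypothetical */
    else
        forward_packet(m);          /* hypothetical */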

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.c |  2 ++
 lib/librte_mbuf/rte_mbuf.h | 19 +++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 9dfcac3..52e7574 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -69,7 +69,9 @@ rte_ctrlmbuf_init(struct rte_mempool *mp,
 		void *_m,
 		__attribute__((unused)) unsigned i)
 {
+	struct rte_mbuf *m = _m;
 	rte_pktmbuf_init(mp, opaque_arg, _m, i);
+	m->ol_flags |= CTRL_MBUF_FLAG;
 }
 
 /*
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 6af5b2f..7b0b4f2 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -91,6 +91,7 @@ extern "C" {
 #define PKT_TX_IPV4_CSUM     0x1000 /**< Alias of PKT_TX_IP_CKSUM. */
 #define PKT_TX_IPV4          PKT_RX_IPV4_HDR /**< IPv4 with no IP checksum offload. */
 #define PKT_TX_IPV6          PKT_RX_IPV6_HDR /**< IPv6 packet */
+
 /*
  * Bit 14~13 used for L4 packet type with checksum enabled.
  *     00: Reserved
@@ -106,6 +107,9 @@ extern "C" {
 /* Bit 15 */
 #define PKT_TX_IEEE1588_TMST 0x8000 /**< TX IEEE1588 packet to timestamp. */
 
+/* Use final bit of flags to indicate a control mbuf */
+#define CTRL_MBUF_FLAG       (1ULL << 63)
+
 /**
  * Bit Mask to indicate what bits required for building TX context
  */
@@ -466,6 +470,21 @@ void rte_ctrlmbuf_init(struct rte_mempool *mp, void *opaque_arg,
  */
 #define rte_ctrlmbuf_len(m) rte_pktmbuf_data_len(m)
 
+/**
+ * Tests if an mbuf is a control mbuf
+ *
+ * @param m
+ *   The mbuf.to be tested
+ * @return
+ *   - True (1) if the mbuf is a control mbuf
+ *   - False(0) otherwise
+ */
+static inline int
+rte_is_ctrlmbuf(struct rte_mbuf *m)
+{
+	return (!!(m->ol_flags & CTRL_MBUF_FLAG));
+}
+
 /* Operations on pkt mbuf */
 
 /**
-- 
1.9.3


* [dpdk-dev] [PATCH 06/13] mbuf: minor changes for readability
  2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                   ` (4 preceding siblings ...)
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 05/13] mbuf: introduce a flag to indicate a control mbuf Bruce Richardson
@ 2014-09-03 15:49 ` Bruce Richardson
  2014-09-08 12:03   ` Olivier MATZ
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata Bruce Richardson
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 62+ messages in thread
From: Bruce Richardson @ 2014-09-03 15:49 UTC (permalink / raw)
  To: dev

* Ensure comments line up correctly
* Simplify the #ifdefs around the refcnt fields to make them clearer

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 7b0b4f2..5260001 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -126,7 +126,6 @@ struct rte_mbuf {
 	uint16_t buf_len;         /**< Length of segment buffer. */
 	uint16_t data_off;
 
-#ifdef RTE_MBUF_REFCNT
 	/**
 	 * 16-bit Reference counter.
 	 * It should only be accessed using the following functions:
@@ -136,12 +135,12 @@ struct rte_mbuf {
 	 * config option.
 	 */
 	union {
-		rte_atomic16_t refcnt_atomic;   /**< Atomically accessed refcnt */
-		uint16_t refcnt;                /**< Non-atomically accessed refcnt */
-	};
-#else
-	uint16_t refcnt_reserved;     /**< Do not use this field */
+#ifdef RTE_MBUF_REFCNT
+		rte_atomic16_t refcnt_atomic; /**< Atomically accessed refcnt */
+		uint16_t refcnt;              /**< Non-atomically accessed refcnt */
 #endif
+		uint16_t refcnt_reserved;     /**< Do not use this field */
+	};
 	uint8_t nb_segs;        /**< Number of segments. */
 	uint8_t port;           /**< Input port. */
 
@@ -155,12 +154,12 @@ struct rte_mbuf {
 	uint16_t l2_len:7;      /**< L2 (MAC) Header Length. */
 	uint16_t vlan_tci;      /**< VLAN Tag Control Identifier (CPU order). */
 	union {
-		uint32_t rss;       /**< RSS hash result if RSS enabled */
+		uint32_t rss;   /**< RSS hash result if RSS enabled */
 		struct {
 			uint16_t hash;
 			uint16_t id;
-		} fdir;             /**< Filter identifier if FDIR enabled */
-		uint32_t sched;     /**< Hierarchical scheduler */
+		} fdir;         /**< Filter identifier if FDIR enabled */
+		uint32_t sched; /**< Hierarchical scheduler */
 	} hash;                 /**< hash information */
 
 	/* fields only used in slow path or on TX */
-- 
1.9.3


* [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata
  2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                   ` (5 preceding siblings ...)
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 06/13] mbuf: minor changes for readability Bruce Richardson
@ 2014-09-03 15:49 ` Bruce Richardson
  2014-09-08 12:05   ` Olivier MATZ
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 08/13] mbuf: add named points inside the mbuf structure Bruce Richardson
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 62+ messages in thread
From: Bruce Richardson @ 2014-09-03 15:49 UTC (permalink / raw)
  To: dev

Removed the explicit zero-sized metadata definition at the end of the
mbuf data structure. Updated the metadata macros to take account of this
change so that all existing code which uses those macros still works.
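
For clarity, a small sketch of how the reworked macros are used: the
metadata area is now simply the bytes immediately following the struct.
Here mb is an assumed mbuf pointer and the offset and value are
arbitrary:

    /* &mb[1] is the first byte past struct rte_mbuf */
    uint32_t *slot = RTE_MBUF_METADATA_UINT32_PTR(mb, 0);
    *slot = 42;                                   /* arbitrary value */
    uint32_t v = RTE_MBUF_METADATA_UINT32(mb, 0); /* reads back 42 */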

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 5260001..ca66d9a 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -166,31 +166,25 @@ struct rte_mbuf {
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
 	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
 
-	union {
-		uint8_t metadata[0];
-		uint16_t metadata16[0];
-		uint32_t metadata32[0];
-		uint64_t metadata64[0];
-	} __rte_cache_aligned;
 } __rte_cache_aligned;
 
 #define RTE_MBUF_METADATA_UINT8(mbuf, offset)              \
-	(mbuf->metadata[offset])
+	(((uint8_t *)&(mbuf)[1])[offset])
 #define RTE_MBUF_METADATA_UINT16(mbuf, offset)             \
-	(mbuf->metadata16[offset/sizeof(uint16_t)])
+	(((uint16_t *)&(mbuf)[1])[offset/sizeof(uint16_t)])
 #define RTE_MBUF_METADATA_UINT32(mbuf, offset)             \
-	(mbuf->metadata32[offset/sizeof(uint32_t)])
+	(((uint32_t *)&(mbuf)[1])[offset/sizeof(uint32_t)])
 #define RTE_MBUF_METADATA_UINT64(mbuf, offset)             \
-	(mbuf->metadata64[offset/sizeof(uint64_t)])
+	(((uint64_t *)&(mbuf)[1])[offset/sizeof(uint64_t)])
 
 #define RTE_MBUF_METADATA_UINT8_PTR(mbuf, offset)          \
-	(&mbuf->metadata[offset])
+	(&RTE_MBUF_METADATA_UINT8(mbuf, offset))
 #define RTE_MBUF_METADATA_UINT16_PTR(mbuf, offset)         \
-	(&mbuf->metadata16[offset/sizeof(uint16_t)])
+	(&RTE_MBUF_METADATA_UINT16(mbuf, offset))
 #define RTE_MBUF_METADATA_UINT32_PTR(mbuf, offset)         \
-	(&mbuf->metadata32[offset/sizeof(uint32_t)])
+	(&RTE_MBUF_METADATA_UINT32(mbuf, offset))
 #define RTE_MBUF_METADATA_UINT64_PTR(mbuf, offset)         \
-	(&mbuf->metadata64[offset/sizeof(uint64_t)])
+	(&RTE_MBUF_METADATA_UINT64(mbuf, offset))
 
 /**
  * Given the buf_addr returns the pointer to corresponding mbuf.
-- 
1.9.3


* [dpdk-dev] [PATCH 08/13] mbuf: add named points inside the mbuf structure
  2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                   ` (6 preceding siblings ...)
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata Bruce Richardson
@ 2014-09-03 15:49 ` Bruce Richardson
  2014-09-08 12:08   ` Olivier MATZ
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 09/13] ixgbe: rework vector pmd following mbuf changes Bruce Richardson
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 62+ messages in thread
From: Bruce Richardson @ 2014-09-03 15:49 UTC (permalink / raw)
  To: dev

Add markers or "labels" at given points inside the mbuf structure which
can be used, instead of naming individual fields, to identify the start
of its logical sections.

The use of typedefs and dummy fields was chosen over using unions for a
couple of reasons:
* unions cause an extra level of indentation (more likely two levels as
  a union containing a struct for multiple fields would be needed). This
  makes the lines longer than they need to be and increases the need for
  wrapping. [This was the main reason]
* with markers, you can apply multiple markers at the same point if
  wanted (see the sketch after this list).
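
As a sketch of the mechanism on a simplified struct (not the real mbuf
layout): a zero-length-array typedef occupies no bytes, but its address
names a position, and a MARKER64 makes the following 8 bytes writable
with one store:

    typedef void    *MARKER[0];
    typedef uint64_t MARKER64[0];

    struct example {
        MARKER   cacheline0;  /* names the start; adds no bytes */
        MARKER64 rearm_data;  /* next 8 bytes of fields */
        uint16_t buf_len;
        uint16_t data_off;
        uint16_t refcnt;
        uint8_t  nb_segs;
        uint8_t  port;
    };

    static void
    reset_rearm(struct example *ex, uint64_t tmpl)
    {
        ex->rearm_data[0] = tmpl; /* one store resets all five fields */
    }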

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index ca66d9a..591be95 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -115,14 +115,22 @@ extern "C" {
  */
 #define PKT_TX_OFFLOAD_MASK (PKT_TX_VLAN_PKT | PKT_TX_IP_CKSUM | PKT_TX_L4_MASK)
 
+/* define a set of marker types that can be used to refer to set points in the
+ * mbuf */
+typedef void    *MARKER[0];   /**< generic marker for a point in a structure */
+typedef uint64_t MARKER64[0]; /**< marker that allows us to overwrite 8 bytes
+                               * with a single assignment */
 /**
  * The generic rte_mbuf, containing a packet mbuf.
  */
 struct rte_mbuf {
+	MARKER cacheline0;
+
 	void *buf_addr;           /**< Virtual address of segment buffer. */
 	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
 
 	/* next 8 bytes are initialised on RX descriptor rearm */
+	MARKER64 rearm_data;
 	uint16_t buf_len;         /**< Length of segment buffer. */
 	uint16_t data_off;
 
@@ -147,6 +155,7 @@ struct rte_mbuf {
 	uint64_t ol_flags;      /**< Offload features. */
 
 	/* remaining bytes are set on RX when pulling packet from descriptor */
+	MARKER rx_descriptor_fields1;
 	uint16_t packet_type;   /**< Type of packet, e.g. protocols used */
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segments. */
-- 
1.9.3


* [dpdk-dev] [PATCH 09/13] ixgbe: rework vector pmd following mbuf changes
  2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                   ` (7 preceding siblings ...)
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 08/13] mbuf: add named points inside the mbuf structure Bruce Richardson
@ 2014-09-03 15:49 ` Bruce Richardson
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 10/13] mbuf: split mbuf across two cache lines Bruce Richardson
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-03 15:49 UTC (permalink / raw)
  To: dev

The vector PMD expects fields to be in a specific order so that it can
do vector operations on multiple fields at a time. Following the mbuf
rework, adjust the driver to take account of the new layout and
re-enable it in the config.
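
The central trick in the rework below is replacing the per-queue XMM of
miscellaneous constants with a single 64-bit "mbuf template", captured
once at setup time through the new rearm_data marker and replayed per
mbuf with one scalar store. A condensed sketch of that idea, where
port_id is an assumed local variable:

    /* at queue setup: snapshot the 8 bytes starting at rearm_data */
    struct rte_mbuf mb_def = {
        .nb_segs  = 1,
        .data_off = RTE_PKTMBUF_HEADROOM,
    };
    mb_def.port = port_id;                        /* assumed variable */
    uint64_t tmpl = *(uint64_t *)&mb_def.rearm_data;

    /* per rearmed mbuf: one store re-initialises the whole group */
    mb->rearm_data[0] = tmpl;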

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 config/common_linuxapp                |   2 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c     |   2 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h     |   4 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 136 +++++++++++++---------------------
 4 files changed, 54 insertions(+), 90 deletions(-)

diff --git a/config/common_linuxapp b/config/common_linuxapp
index b140af7..5bee910 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -191,7 +191,7 @@ CONFIG_RTE_LIBRTE_IXGBE_DEBUG_DRIVER=n
 CONFIG_RTE_LIBRTE_IXGBE_PF_DISABLE_STRIP_CRC=n
 CONFIG_RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC=y
 CONFIG_RTE_LIBRTE_IXGBE_ALLOW_UNSUPPORTED_SFP=n
-CONFIG_RTE_IXGBE_INC_VECTOR=n
+CONFIG_RTE_IXGBE_INC_VECTOR=y
 CONFIG_RTE_IXGBE_RX_OLFLAGS_ENABLE=y
 
 #
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index c7f1642..f48c62a 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -2167,7 +2167,7 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		if (!ixgbe_rx_vec_condition_check(dev)) {
 			PMD_INIT_LOG(INFO, "Vector rx enabled, please make "
 				     "sure RX burst size no less than 32.\n");
-			ixgbe_rxq_vec_setup(rxq, socket_id);
+			ixgbe_rxq_vec_setup(rxq);
 			dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
 		}
 #endif
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index 70dbcb0..5e98b21 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -115,6 +115,7 @@ struct igb_rx_queue {
 	struct igb_rx_entry *sw_ring; /**< address of RX software ring. */
 	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet. */
 	struct rte_mbuf *pkt_last_seg; /**< Last segment of current packet. */
+	uint64_t            mbuf_initializer; /**< value to init mbufs */
 	uint16_t            nb_rx_desc; /**< number of RX descriptors. */
 	uint16_t            rx_tail;  /**< current value of RDT register. */
 	uint16_t            nb_rx_hold; /**< number of held free RX desc. */
@@ -126,7 +127,6 @@ struct igb_rx_queue {
 #ifdef RTE_IXGBE_INC_VECTOR
 	uint16_t            rxrearm_nb; /**< the idx we start the re-arming from */
 	uint16_t            rxrearm_start;  /**< number of remaining to be re-armed */
-	__m128i             misc_info;  /**< cache XMM combine port_id/crc/nb_segs */
 #endif
 	uint16_t            rx_free_thresh; /**< max free RX desc to hold. */
 	uint16_t            queue_id; /**< RX queue index. */
@@ -260,7 +260,7 @@ struct ixgbe_txq_ops {
 uint16_t ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
 int ixgbe_txq_vec_setup(struct igb_tx_queue *txq, unsigned int socket_id);
-int ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq, unsigned int socket_id);
+int ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq);
 int ixgbe_rx_vec_condition_check(struct rte_eth_dev *dev);
 #endif
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
index bafb215..3332a92 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
@@ -47,15 +47,11 @@
 static inline void
 ixgbe_rxq_rearm(struct igb_rx_queue *rxq)
 {
-	static const struct rte_mbuf mb_def = {
-		.nb_segs = 1,
-	};
 	int i;
 	uint16_t rx_id;
 	volatile union ixgbe_adv_rx_desc *rxdp;
 	struct igb_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
 	struct rte_mbuf *mb0, *mb1;
-	__m128i def_low;
 	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
 			RTE_PKTMBUF_HEADROOM);
 
@@ -66,8 +62,6 @@ ixgbe_rxq_rearm(struct igb_rx_queue *rxq)
 
 	rxdp = rxq->rx_ring + rxq->rxrearm_start;
 
-	def_low = _mm_load_si128((__m128i *)&(mb_def.next));
-
 	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
 	for (i = 0; i < RTE_IXGBE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
 		__m128i dma_addr0, dma_addr1;
@@ -76,33 +70,25 @@ ixgbe_rxq_rearm(struct igb_rx_queue *rxq)
 		mb0 = rxep[0].mbuf;
 		mb1 = rxep[1].mbuf;
 
+		/* flush mbuf with pkt template */
+		mb0->rearm_data[0] = rxq->mbuf_initializer;
+		mb1->rearm_data[0] = rxq->mbuf_initializer;
+
 		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
 		vaddr0 = _mm_loadu_si128((__m128i *)&(mb0->buf_addr));
 		vaddr1 = _mm_loadu_si128((__m128i *)&(mb1->buf_addr));
 
-		/* calc va/pa of pkt data point */
-		vaddr0 = _mm_add_epi64(vaddr0, hdr_room);
-		vaddr1 = _mm_add_epi64(vaddr1, hdr_room);
-
 		/* convert pa to dma_addr hdr/data */
 		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
 		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
 
-		/* fill va into t0 def pkt template */
-		vaddr0 = _mm_unpacklo_epi64(def_low, vaddr0);
-		vaddr1 = _mm_unpacklo_epi64(def_low, vaddr1);
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
 
 		/* flush desc with pa dma_addr */
 		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
 		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
-
-		/* flush mbuf with pkt template */
-		_mm_store_si128((__m128i *)&mb0->next, vaddr0);
-		_mm_store_si128((__m128i *)&mb1->next, vaddr1);
-
-		/* update refcnt per pkt */
-		rte_mbuf_refcnt_set(mb0, 1);
-		rte_mbuf_refcnt_set(mb1, 1);
 	}
 
 	rxq->rxrearm_start += RTE_IXGBE_RXQ_REARM_THRESH;
@@ -189,7 +175,13 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 	int pos;
 	uint64_t var;
 	__m128i shuf_msk;
-	__m128i in_port;
+	__m128i crc_adjust = _mm_set_epi16(
+				0, 0, 0, 0, /* ignore non-length fields */
+				0,          /* ignore high-16bits of pkt_len */
+				-rxq->crc_len, /* sub crc on pkt_len */
+				-rxq->crc_len, /* sub crc on data_len */
+				0            /* ignore pkt_type field */
+			);
 	__m128i dd_check;
 
 	if (unlikely(nb_pkts < RTE_IXGBE_VPMD_RX_BURST))
@@ -222,8 +214,8 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		15, 14,      /* octet 14~15, low 16 bits vlan_macip */
 		0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
 		13, 12,      /* octet 12~13, low 16 bits pkt_len */
-		0xFF, 0xFF,  /* skip nb_segs and in_port, zero out */
-		13, 12       /* octet 12~13, 16 bits data_len */
+		13, 12,      /* octet 12~13, 16 bits data_len */
+		0xFF, 0xFF   /* skip pkt_type field */
 		);
 
 
@@ -231,9 +223,6 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 	 * the next 'n' mbufs into the cache */
 	sw_ring = &rxq->sw_ring[rxq->rx_tail];
 
-	/* in_port, nb_seg = 1, crc_len */
-	in_port = rxq->misc_info;
-
 	/*
 	 * A. load 4 packet in one loop
 	 * B. copy 4 mbuf point from swring to rx_pkts
@@ -285,8 +274,8 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		desc_to_olflags_v(descs, &rx_pkts[pos]);
 
 		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
-		pkt_mb4 = _mm_add_epi16(pkt_mb4, in_port);
-		pkt_mb3 = _mm_add_epi16(pkt_mb3, in_port);
+		pkt_mb4 = _mm_add_epi16(pkt_mb4, crc_adjust);
+		pkt_mb3 = _mm_add_epi16(pkt_mb3, crc_adjust);
 
 		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
 		pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk);
@@ -297,23 +286,23 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		staterr = _mm_unpacklo_epi32(sterr_tmp1, sterr_tmp2);
 
 		/* D.3 copy final 3,4 data to rx_pkts */
-		_mm_storeu_si128((__m128i *)&(rx_pkts[pos+3]->data_len),
+		_mm_storeu_si128((void *)&rx_pkts[pos+3]->rx_descriptor_fields1,
 				pkt_mb4);
-		_mm_storeu_si128((__m128i *)&(rx_pkts[pos+2]->data_len),
+		_mm_storeu_si128((void *)&rx_pkts[pos+2]->rx_descriptor_fields1,
 				pkt_mb3);
 
 		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
-		pkt_mb2 = _mm_add_epi16(pkt_mb2, in_port);
-		pkt_mb1 = _mm_add_epi16(pkt_mb1, in_port);
+		pkt_mb2 = _mm_add_epi16(pkt_mb2, crc_adjust);
+		pkt_mb1 = _mm_add_epi16(pkt_mb1, crc_adjust);
 
 		/* C.3 calc available number of desc */
 		staterr = _mm_and_si128(staterr, dd_check);
 		staterr = _mm_packs_epi32(staterr, zero);
 
 		/* D.3 copy final 1,2 data to rx_pkts */
-		_mm_storeu_si128((__m128i *)&(rx_pkts[pos+1]->data_len),
+		_mm_storeu_si128((void *)&rx_pkts[pos+1]->rx_descriptor_fields1,
 				pkt_mb2);
-		_mm_storeu_si128((__m128i *)&(rx_pkts[pos]->data_len),
+		_mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
 				pkt_mb1);
 
 		/* C.4 calc available number of desc */
@@ -330,46 +319,19 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 
 	return nb_pkts_recd;
 }
-
 static inline void
 vtx1(volatile union ixgbe_adv_tx_desc *txdp,
-		struct rte_mbuf *pkt, __m128i flags)
+		struct rte_mbuf *pkt, uint64_t flags)
 {
-	__m128i t0, t1, offset, ols, ba, ctl;
-
-	/* load buf_addr/buf_physaddr in t0 */
-	t0 = _mm_loadu_si128((__m128i *)&(pkt->buf_addr));
-	/* load data, ... pkt_len in t1 */
-	t1 = _mm_loadu_si128((__m128i *)&(pkt->data));
-
-	/* calc offset = (data - buf_adr) */
-	offset = _mm_sub_epi64(t1, t0);
-
-	/* cmd_type_len: pkt_len |= DCMD_DTYP_FLAGS */
-	ctl = _mm_or_si128(t1, flags);
-
-	/* reorder as buf_physaddr/buf_addr */
-	offset = _mm_shuffle_epi32(offset, 0x4E);
-
-	/* olinfo_stats: pkt_len << IXGBE_ADVTXD_PAYLEN_SHIFT */
-	ols = _mm_slli_epi32(t1, IXGBE_ADVTXD_PAYLEN_SHIFT);
-
-	/* buffer_addr = buf_physaddr + offset */
-	ba = _mm_add_epi64(t0, offset);
-
-	/* format cmd_type_len/olinfo_status */
-	ctl = _mm_unpackhi_epi32(ctl, ols);
-
-	/* format buf_physaddr/cmd_type_len */
-	ba = _mm_unpackhi_epi64(ba, ctl);
-
-	/* write desc */
-	_mm_store_si128((__m128i *)&txdp->read, ba);
+	__m128i descriptor = _mm_set_epi64x((uint64_t)pkt->pkt_len << 46 |
+			flags | pkt->data_len,
+			pkt->buf_physaddr + pkt->data_off);
+	_mm_store_si128((__m128i *)&txdp->read, descriptor);
 }
 
 static inline void
 vtx(volatile union ixgbe_adv_tx_desc *txdp,
-		struct rte_mbuf **pkt, uint16_t nb_pkts,  __m128i flags)
+		struct rte_mbuf **pkt, uint16_t nb_pkts,  uint64_t flags)
 {
 	int i;
 	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
@@ -456,9 +418,8 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 	struct igb_tx_entry_v *txep;
 	struct igb_tx_entry_seq *txsp;
 	uint16_t n, nb_commit, tx_id;
-	__m128i flags = _mm_set_epi32(DCMD_DTYP_FLAGS, 0, 0, 0);
-	__m128i rs = _mm_set_epi32(IXGBE_ADVTXD_DCMD_RS|DCMD_DTYP_FLAGS,
-			0, 0, 0);
+	uint64_t flags = DCMD_DTYP_FLAGS;
+	uint64_t rs = IXGBE_ADVTXD_DCMD_RS|DCMD_DTYP_FLAGS;
 	int i;
 
 	if (unlikely(nb_pkts > RTE_IXGBE_VPMD_TX_BURST))
@@ -610,6 +571,23 @@ static struct ixgbe_txq_ops vec_txq_ops = {
 	.reset = ixgbe_reset_tx_queue,
 };
 
+int
+ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq)
+{
+	static struct rte_mbuf mb_def = {
+		.nb_segs = 1,
+		.data_off = RTE_PKTMBUF_HEADROOM,
+#ifdef RTE_MBUF_REFCNT
+		.refcnt = 1,
+#endif
+	};
+
+	mb_def.buf_len = rxq->mb_pool->elt_size - sizeof(struct rte_mbuf);
+	mb_def.port = rxq->port_id;
+	rxq->mbuf_initializer = *((uint64_t *)&mb_def.rearm_data);
+	return 0;
+}
+
 int ixgbe_txq_vec_setup(struct igb_tx_queue *txq,
 			unsigned int socket_id)
 {
@@ -637,21 +615,6 @@ int ixgbe_txq_vec_setup(struct igb_tx_queue *txq,
 	return 0;
 }
 
-int ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq,
-			__rte_unused unsigned int socket_id)
-{
-	rxq->misc_info =
-		_mm_set_epi16(
-			0, 0, 0, 0, 0,
-			(uint16_t)-rxq->crc_len, /* sub crc on pkt_len */
-			(uint16_t)(rxq->port_id << 8 | 1),
-			/* 8b port_id and 8b nb_seg*/
-			(uint16_t)-rxq->crc_len  /* sub crc on data_len */
-			);
-
-	return 0;
-}
-
 int ixgbe_rx_vec_condition_check(struct rte_eth_dev *dev)
 {
 #ifndef RTE_LIBRTE_IEEE1588
@@ -683,3 +646,4 @@ int ixgbe_rx_vec_condition_check(struct rte_eth_dev *dev)
 	return -1;
 #endif
 }
-- 
1.9.3


* [dpdk-dev] [PATCH 10/13] mbuf: split mbuf across two cache lines.
  2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                   ` (8 preceding siblings ...)
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 09/13] ixgbe: rework vector pmd following mbuf changes Bruce Richardson
@ 2014-09-03 15:49 ` Bruce Richardson
  2014-09-08 12:10   ` Olivier MATZ
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 11/13] mbuf: move l2_len and l3_len to second cache line Bruce Richardson
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 62+ messages in thread
From: Bruce Richardson @ 2014-09-03 15:49 UTC (permalink / raw)
  To: dev

This change splits the mbuf in two to move the pool and next pointers
to the second cache line. This frees up 16 bytes in the first cache
line.
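
A sketch of the compile-time checks that pin this layout down, assuming
the cacheline1 marker added earlier in the series and the
CACHE_LINE_SIZE constant used elsewhere in DPDK:

    #include <stddef.h>

    /* pool/next must start exactly one cache line into the struct */
    RTE_BUILD_BUG_ON(offsetof(struct rte_mbuf, cacheline1) != CACHE_LINE_SIZE);
    RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != 2 * CACHE_LINE_SIZE);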

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 app/test/test_mbuf.c       | 2 +-
 lib/librte_mbuf/rte_mbuf.h | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index e3d896b..82136de 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -782,7 +782,7 @@ test_failing_mbuf_sanity_check(void)
 static int
 test_mbuf(void)
 {
-	RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != 64);
+	RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != CACHE_LINE_SIZE * 2);
 
 	/* create pktmbuf pool if it does not exist */
 	if (pktmbuf_pool == NULL) {
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 591be95..db079ac 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -171,7 +171,8 @@ struct rte_mbuf {
 		uint32_t sched; /**< Hierarchical scheduler */
 	} hash;                 /**< hash information */
 
-	/* fields only used in slow path or on TX */
+	/* second cache line - fields only used in slow path or on TX */
+	MARKER cacheline1 __rte_cache_aligned;
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
 	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
 
-- 
1.9.3


* [dpdk-dev] [PATCH 11/13] mbuf: move l2_len and l3_len to second cache line
  2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                   ` (9 preceding siblings ...)
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 10/13] mbuf: split mbuf across two cache lines Bruce Richardson
@ 2014-09-03 15:49 ` Bruce Richardson
  2014-09-04  5:08   ` Yerden Zhumabekov
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 12/13] ixgbe: Fix perf regression due to moved pool ptr Bruce Richardson
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 62+ messages in thread
From: Bruce Richardson @ 2014-09-03 15:49 UTC (permalink / raw)
  To: dev

The l2_len and l3_len fields are used for TX offloads and so should be
put on the second cache line, along with the other fields only used on
TX.
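
These fields are written by the application on transmit only, which is
what makes the second cache line the right home for them. A minimal
sketch of the TX-side usage, assuming a plain IPv4-over-Ethernet frame
and an assumed mbuf pointer mb:

    /* describe the headers so the NIC can insert the IP checksum */
    mb->l2_len = sizeof(struct ether_hdr);  /* 14 bytes */
    mb->l3_len = sizeof(struct ipv4_hdr);   /* 20 bytes, no options */
    mb->ol_flags |= PKT_TX_IP_CKSUM;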

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index db079ac..d3c1745 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -159,8 +159,7 @@ struct rte_mbuf {
 	uint16_t packet_type;   /**< Type of packet, e.g. protocols used */
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segments. */
-	uint16_t l3_len:9;      /**< L3 (IP) Header Length. */
-	uint16_t l2_len:7;      /**< L2 (MAC) Header Length. */
+	uint16_t reserved;
 	uint16_t vlan_tci;      /**< VLAN Tag Control Identifier (CPU order). */
 	union {
 		uint32_t rss;   /**< RSS hash result if RSS enabled */
@@ -176,6 +175,9 @@ struct rte_mbuf {
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
 	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
 
+	/* fields to support TX offloads */
+	uint16_t l3_len:9;      /**< L3 (IP) Header Length. */
+	uint16_t l2_len:7;      /**< L2 (MAC) Header Length. */
 } __rte_cache_aligned;
 
 #define RTE_MBUF_METADATA_UINT8(mbuf, offset)              \
-- 
1.9.3


* [dpdk-dev] [PATCH 12/13] ixgbe: Fix perf regression due to moved pool ptr
  2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                   ` (10 preceding siblings ...)
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 11/13] mbuf: move l2_len and l3_len to second cache line Bruce Richardson
@ 2014-09-03 15:49 ` Bruce Richardson
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 13/13] ixgbe: Improve slow-path perf: vector scattered RX Bruce Richardson
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
  13 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-03 15:49 UTC (permalink / raw)
  To: dev

Adjust the fast-path code to fix the performance regression caused by
the pool pointer moving to the second cache line. This change adjusts
the prefetching and also the way in which the mbufs are freed back to
the mempool.
Note: the slow path, e.g. the path supporting jumbo frames, is still
slower, but that is dealt with by a later commit.
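
The new free logic batches mbufs into rte_mempool_put_bulk() runs for
as long as consecutive mbufs come from the same pool, and flushes the
run whenever the pool changes. A condensed sketch of that pattern,
leaving out the refcnt/prefree handling visible in the diff:

    struct rte_mbuf *free[RTE_IXGBE_TX_MAX_FREE_BUF_SZ];
    int nb_free = 0;

    for (i = 0; i < n; i++) {
        struct rte_mbuf *m = txep[i].mbuf;
        if (nb_free > 0 && m->pool != free[0]->pool) {
            /* pool changed: flush the current run */
            rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
            nb_free = 0;
        }
        free[nb_free++] = m;
    }
    if (nb_free > 0)
        rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);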

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c     |  8 ++-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h     | 14 +-----
 lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 93 +++++++++++++----------------------
 3 files changed, 39 insertions(+), 76 deletions(-)

diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index f48c62a..ebbcee8 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -142,10 +142,6 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 	 */
 	txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);
 
-	/* prefetch the mbufs that are about to be freed */
-	for (i = 0; i < txq->tx_rs_thresh; ++i)
-		rte_prefetch0((txep + i)->mbuf);
-
 	/* free buffers one at a time */
 	if ((txq->txq_flags & (uint32_t)ETH_TXQ_FLAGS_NOREFCOUNT) != 0) {
 		for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
@@ -186,6 +182,7 @@ tx4(volatile union ixgbe_adv_tx_desc *txdp, struct rte_mbuf **pkts)
 				((uint32_t)DCMD_DTYP_FLAGS | pkt_len);
 		txdp->read.olinfo_status =
 				(pkt_len << IXGBE_ADVTXD_PAYLEN_SHIFT);
+		rte_prefetch0(&(*pkts)->pool);
 	}
 }
 
@@ -205,6 +202,7 @@ tx1(volatile union ixgbe_adv_tx_desc *txdp, struct rte_mbuf **pkts)
 			((uint32_t)DCMD_DTYP_FLAGS | pkt_len);
 	txdp->read.olinfo_status =
 			(pkt_len << IXGBE_ADVTXD_PAYLEN_SHIFT);
+	rte_prefetch0(&(*pkts)->pool);
 }
 
 /*
@@ -1876,7 +1874,7 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
 		PMD_INIT_LOG(INFO, "Using simple tx code path\n");
 #ifdef RTE_IXGBE_INC_VECTOR
 		if (txq->tx_rs_thresh <= RTE_IXGBE_TX_MAX_FREE_BUF_SZ &&
-		    ixgbe_txq_vec_setup(txq, socket_id) == 0) {
+		    ixgbe_txq_vec_setup(txq) == 0) {
 			PMD_INIT_LOG(INFO, "Vector tx enabled.\n");
 			dev->tx_pkt_burst = ixgbe_xmit_pkts_vec;
 		}
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index 5e98b21..dbb57af 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -96,14 +96,6 @@ struct igb_tx_entry_v {
 };
 
 /**
- * continuous entry sequence, gather by the same mempool
- */
-struct igb_tx_entry_seq {
-	const struct rte_mempool* pool;
-	uint32_t same_pool;
-};
-
-/**
  * Structure associated with each RX queue.
  */
 struct igb_rx_queue {
@@ -191,10 +183,6 @@ struct igb_tx_queue {
 	volatile union ixgbe_adv_tx_desc *tx_ring;
 	uint64_t            tx_ring_phys_addr; /**< TX ring DMA address. */
 	struct igb_tx_entry *sw_ring;      /**< virtual address of SW ring. */
-#ifdef RTE_IXGBE_INC_VECTOR
-	/** continuous tx entry sequence within the same mempool */
-	struct igb_tx_entry_seq *sw_ring_seq;
-#endif
 	volatile uint32_t   *tdt_reg_addr; /**< Address of TDT register. */
 	uint16_t            nb_tx_desc;    /**< number of TX descriptors. */
 	uint16_t            tx_tail;       /**< current value of TDT reg. */
@@ -259,7 +247,7 @@ struct ixgbe_txq_ops {
 #ifdef RTE_IXGBE_INC_VECTOR
 uint16_t ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
-int ixgbe_txq_vec_setup(struct igb_tx_queue *txq, unsigned int socket_id);
+int ixgbe_txq_vec_setup(struct igb_tx_queue *txq);
 int ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq);
 int ixgbe_rx_vec_condition_check(struct rte_eth_dev *dev);
 #endif
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
index 3332a92..4f63086 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
@@ -342,9 +342,8 @@ static inline int __attribute__((always_inline))
 ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 {
 	struct igb_tx_entry_v *txep;
-	struct igb_tx_entry_seq *txsp;
 	uint32_t status;
-	uint32_t n, k;
+	uint32_t n;
 #ifdef RTE_MBUF_REFCNT
 	uint32_t i;
 	int nb_free = 0;
@@ -364,23 +363,39 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 	 */
 	txep = &((struct igb_tx_entry_v *)txq->sw_ring)[txq->tx_next_dd -
 			(n - 1)];
-	txsp = &txq->sw_ring_seq[txq->tx_next_dd - (n - 1)];
-
-	while (n > 0) {
-		k = RTE_MIN(n, txsp[n-1].same_pool);
 #ifdef RTE_MBUF_REFCNT
-		for (i = 0; i < k; i++) {
-			m = __rte_pktmbuf_prefree_seg((txep+n-k+i)->mbuf);
-			if (m != NULL)
-				free[nb_free++] = m;
-		}
-		rte_mempool_put_bulk((void *)txsp[n-1].pool,
-				(void **)free, nb_free);
+	m = __rte_pktmbuf_prefree_seg(txep[0].mbuf);
+#else
+	m = txep[0].mbuf;
+#endif
+	if (likely(m != NULL)) {
+		free[0] = m;
+		nb_free = 1;
+		for (i = 1; i < n; i++) {
+#ifdef RTE_MBUF_REFCNT
+			m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
 #else
-		rte_mempool_put_bulk((void *)txsp[n-1].pool,
-				(void **)(txep+n-k), k);
+			m = txep[i].mbuf;
 #endif
-		n -= k;
+			if (likely(m != NULL)) {
+				if (likely(m->pool == free[0]->pool))
+					free[nb_free++] = m;
+				else {
+					rte_mempool_put_bulk(free[0]->pool,
+							(void **)free, nb_free);
+					free[0] = m;
+					nb_free = 1;
+				}
+			}
+		}
+		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+	}
+	else {
+		for (i = 1; i < n; i++) {
+			m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (m != NULL)
+				rte_mempool_put(m->pool, m);
+		}
 	}
 
 	/* buffers were freed, update counters */
@@ -394,19 +409,11 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 
 static inline void __attribute__((always_inline))
 tx_backlog_entry(struct igb_tx_entry_v *txep,
-		 struct igb_tx_entry_seq *txsp,
 		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 {
 	int i;
-	for (i = 0; i < (int)nb_pkts; ++i) {
+	for (i = 0; i < (int)nb_pkts; ++i)
 		txep[i].mbuf = tx_pkts[i];
-		/* check and update sequence number */
-		txsp[i].pool = tx_pkts[i]->pool;
-		if (txsp[i-1].pool == tx_pkts[i]->pool)
-			txsp[i].same_pool = txsp[i-1].same_pool + 1;
-		else
-			txsp[i].same_pool = 1;
-	}
 }
 
 uint16_t
@@ -416,7 +423,6 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 	struct igb_tx_queue *txq = (struct igb_tx_queue *)tx_queue;
 	volatile union ixgbe_adv_tx_desc *txdp;
 	struct igb_tx_entry_v *txep;
-	struct igb_tx_entry_seq *txsp;
 	uint16_t n, nb_commit, tx_id;
 	uint64_t flags = DCMD_DTYP_FLAGS;
 	uint64_t rs = IXGBE_ADVTXD_DCMD_RS|DCMD_DTYP_FLAGS;
@@ -435,14 +441,13 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 	tx_id = txq->tx_tail;
 	txdp = &txq->tx_ring[tx_id];
 	txep = &((struct igb_tx_entry_v *)txq->sw_ring)[tx_id];
-	txsp = &txq->sw_ring_seq[tx_id];
 
 	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
 
 	n = (uint16_t)(txq->nb_tx_desc - tx_id);
 	if (nb_commit >= n) {
 
-		tx_backlog_entry(txep, txsp, tx_pkts, n);
+		tx_backlog_entry(txep, tx_pkts, n);
 
 		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
 			vtx1(txdp, *tx_pkts, flags);
@@ -457,10 +462,9 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 		/* avoid reach the end of ring */
 		txdp = &(txq->tx_ring[tx_id]);
 		txep = &(((struct igb_tx_entry_v *)txq->sw_ring)[tx_id]);
-		txsp = &(txq->sw_ring_seq[tx_id]);
 	}
 
-	tx_backlog_entry(txep, txsp, tx_pkts, nb_commit);
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
 
 	vtx(txdp, tx_pkts, nb_commit, flags);
 
@@ -484,7 +488,6 @@ ixgbe_tx_queue_release_mbufs(struct igb_tx_queue *txq)
 {
 	unsigned i;
 	struct igb_tx_entry_v *txe;
-	struct igb_tx_entry_seq *txs;
 	uint16_t nb_free, max_desc;
 
 	if (txq->sw_ring != NULL) {
@@ -502,10 +505,6 @@ ixgbe_tx_queue_release_mbufs(struct igb_tx_queue *txq)
 		for (i = 0; i < txq->nb_tx_desc; i++) {
 			txe = (struct igb_tx_entry_v *)&txq->sw_ring[i];
 			txe->mbuf = NULL;
-
-			txs = &txq->sw_ring_seq[i];
-			txs->pool = NULL;
-			txs->same_pool = 0;
 		}
 	}
 }
@@ -520,11 +519,6 @@ ixgbe_tx_free_swring(struct igb_tx_queue *txq)
 		rte_free((struct igb_rx_entry *)txq->sw_ring - 1);
 		txq->sw_ring = NULL;
 	}
-
-	if (txq->sw_ring_seq != NULL) {
-		rte_free(txq->sw_ring_seq - 1);
-		txq->sw_ring_seq = NULL;
-	}
 }
 
 static void
@@ -533,7 +527,6 @@ ixgbe_reset_tx_queue(struct igb_tx_queue *txq)
 	static const union ixgbe_adv_tx_desc zeroed_desc = { .read = {
 			.buffer_addr = 0} };
 	struct igb_tx_entry_v *txe = (struct igb_tx_entry_v *)txq->sw_ring;
-	struct igb_tx_entry_seq *txs = txq->sw_ring_seq;
 	uint16_t i;
 
 	/* Zero out HW ring memory */
@@ -545,8 +538,6 @@ ixgbe_reset_tx_queue(struct igb_tx_queue *txq)
 		volatile union ixgbe_adv_tx_desc *txd = &txq->tx_ring[i];
 		txd->wb.status = IXGBE_TXD_STAT_DD;
 		txe[i].mbuf = NULL;
-		txs[i].pool = NULL;
-		txs[i].same_pool = 0;
 	}
 
 	txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
@@ -588,28 +579,14 @@ ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq)
 	return 0;
 }
 
-int ixgbe_txq_vec_setup(struct igb_tx_queue *txq,
-			unsigned int socket_id)
+int ixgbe_txq_vec_setup(struct igb_tx_queue *txq)
 {
-	uint16_t nb_desc;
-
 	if (txq->sw_ring == NULL)
 		return -1;
 
-	/* request addtional one entry for continous sequence check */
-	nb_desc = (uint16_t)(txq->nb_tx_desc + 1);
-
-	txq->sw_ring_seq = rte_zmalloc_socket("txq->sw_ring_seq",
-				sizeof(struct igb_tx_entry_seq) * nb_desc,
-				CACHE_LINE_SIZE, socket_id);
-	if (txq->sw_ring_seq == NULL)
-		return -1;
-
-
 	/* leave the first one for overflow */
 	txq->sw_ring = (struct igb_tx_entry *)
 		((struct igb_tx_entry_v *)txq->sw_ring + 1);
-	txq->sw_ring_seq += 1;
 	txq->ops = &vec_txq_ops;
 
 	return 0;
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH 13/13] ixgbe: Improve slow-path perf: vector scattered RX
  2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                   ` (11 preceding siblings ...)
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 12/13] ixgbe: Fix perf regression due to moved pool ptr Bruce Richardson
@ 2014-09-03 15:49 ` Bruce Richardson
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
  13 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-03 15:49 UTC (permalink / raw)
  To: dev

Provide a wrapper routine to enable receive of scattered packets with a
vector driver.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c     |  16 ++++
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h     |   1 +
 lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 173 ++++++++++++++++++++++++++++++++--
 3 files changed, 183 insertions(+), 7 deletions(-)

diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index ebbcee8..8a2e0ee 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -3477,12 +3477,20 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 		if ((dev->data->dev_conf.rxmode.max_rx_pkt_len +
 				2 * IXGBE_VLAN_TAG_SIZE) > buf_size){
 			dev->data->scattered_rx = 1;
+#ifdef RTE_IXGBE_INC_VECTOR
+			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
+#else
 			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
+#endif
 		}
 	}
 
 	if (dev->data->dev_conf.rxmode.enable_scatter) {
+#ifdef RTE_IXGBE_INC_VECTOR
+		dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
+#else
 		dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
+#endif
 		dev->data->scattered_rx = 1;
 	}
 
@@ -3970,12 +3978,20 @@ ixgbevf_dev_rx_init(struct rte_eth_dev *dev)
 		if ((dev->data->dev_conf.rxmode.max_rx_pkt_len +
 				2 * IXGBE_VLAN_TAG_SIZE) > buf_size) {
 			dev->data->scattered_rx = 1;
+#ifdef RTE_IXGBE_INC_VECTOR
+			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
+#else
 			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
+#endif
 		}
 	}
 
 	if (dev->data->dev_conf.rxmode.enable_scatter) {
+#ifdef RTE_IXGBE_INC_VECTOR
+		dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
+#else
 		dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
+#endif
 		dev->data->scattered_rx = 1;
 	}
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index dbb57af..e2e037b 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -246,6 +246,7 @@ struct ixgbe_txq_ops {
 
 #ifdef RTE_IXGBE_INC_VECTOR
 uint16_t ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
+uint16_t ixgbe_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
 int ixgbe_txq_vec_setup(struct igb_tx_queue *txq);
 int ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq);
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
index 4f63086..168996f 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
@@ -164,12 +164,11 @@ desc_to_olflags_v(__m128i descs[4], struct rte_mbuf **rx_pkts)
  *   numbers of DD bit
  * - don't support ol_flags for rss and csum err
  */
-uint16_t
-ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
-		uint16_t nb_pkts)
+static inline uint16_t
+_recv_raw_pkts_vec(struct igb_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts, uint8_t *split_packet)
 {
 	volatile union ixgbe_adv_rx_desc *rxdp;
-	struct igb_rx_queue *rxq = rx_queue;
 	struct igb_rx_entry *sw_ring;
 	uint16_t nb_pkts_recd;
 	int pos;
@@ -182,7 +181,7 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				-rxq->crc_len, /* sub crc on data_len */
 				0            /* ignore pkt_type field */
 			);
-	__m128i dd_check;
+	__m128i dd_check, eop_check;
 
 	if (unlikely(nb_pkts < RTE_IXGBE_VPMD_RX_BURST))
 		return 0;
@@ -207,6 +206,9 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 	/* 4 packets DD mask */
 	dd_check = _mm_set_epi64x(0x0000000100000001LL, 0x0000000100000001LL);
 
+	/* 4 packets EOP mask */
+	eop_check = _mm_set_epi64x(0x0000000200000002LL, 0x0000000200000002LL);
+
 	/* mask to shuffle from desc. to mbuf */
 	shuf_msk = _mm_set_epi8(
 		7, 6, 5, 4,  /* octet 4~7, 32bits rss */
@@ -218,7 +220,6 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		0xFF, 0xFF   /* skip pkt_type field */
 		);
 
-
 	/* Cache is empty -> need to scan the buffer rings, but first move
 	 * the next 'n' mbufs into the cache */
 	sw_ring = &rxq->sw_ring[rxq->rx_tail];
@@ -227,6 +228,7 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 	 * A. load 4 packet in one loop
 	 * B. copy 4 mbuf point from swring to rx_pkts
 	 * C. calc the number of DD bits among the 4 packets
+	 * [C*. extract the end-of-packet bit, if requested]
 	 * D. fill info. from desc to mbuf
 	 */
 	for (pos = 0, nb_pkts_recd = 0; pos < RTE_IXGBE_VPMD_RX_BURST;
@@ -237,6 +239,13 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		__m128i zero, staterr, sterr_tmp1, sterr_tmp2;
 		__m128i mbp1, mbp2; /* two mbuf pointer in one XMM reg. */
 
+		if (split_packet) {
+			rte_prefetch0(&rx_pkts[pos]->cacheline1);
+			rte_prefetch0(&rx_pkts[pos + 1]->cacheline1);
+			rte_prefetch0(&rx_pkts[pos + 2]->cacheline1);
+			rte_prefetch0(&rx_pkts[pos + 3]->cacheline1);
+		}
+
 		/* B.1 load 1 mbuf point */
 		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
 
@@ -295,7 +304,36 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		pkt_mb2 = _mm_add_epi16(pkt_mb2, crc_adjust);
 		pkt_mb1 = _mm_add_epi16(pkt_mb1, crc_adjust);
 
-		/* C.3 calc avaialbe number of desc */
+		/* C* extract and record EOP bit */
+		if (split_packet) {
+			__m128i eop_shuf_mask = _mm_set_epi8(
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0x04, 0x0C, 0x00, 0x08
+					);
+
+			/* and with mask to extract bits, flipping 1-0 */
+			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
+			/* the staterr values are not in order, as the count
+			 * of dd bits doesn't care. However, for end of
+			 * packet tracking, we do care, so shuffle. This also
+			 * compresses the 32-bit values to 8-bit */
+			eop_bits = _mm_shuffle_epi8(eop_bits, eop_shuf_mask);
+			/* store the resulting 32-bit value */
+			*(int *)split_packet = _mm_cvtsi128_si32(eop_bits);
+			split_packet += RTE_IXGBE_DESCS_PER_LOOP;
+
+			/* zero-out next pointers */
+			rx_pkts[pos]->next = NULL;
+			rx_pkts[pos + 1]->next = NULL;
+			rx_pkts[pos + 2]->next = NULL;
+			rx_pkts[pos + 3]->next = NULL;
+		}
+
+		/* C.3 calc available number of desc */
 		staterr = _mm_and_si128(staterr, dd_check);
 		staterr = _mm_packs_epi32(staterr, zero);
 
@@ -319,6 +357,127 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 
 	return nb_pkts_recd;
 }
+
+/*
+ * vPMD receive routine, now only accepts (nb_pkts == RTE_IXGBE_VPMD_RX_BURST)
+ * in one loop
+ *
+ * Notice:
+ * - nb_pkts < RTE_IXGBE_VPMD_RX_BURST, just returns no packets
+ * - nb_pkts > RTE_IXGBE_VPMD_RX_BURST, only scans RTE_IXGBE_VPMD_RX_BURST
+ *   DD bits
+ * - don't support ol_flags for rss and csum err
+ */
+uint16_t
+ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
+}
+
+static inline uint16_t
+reassemble_packets(struct igb_rx_queue *rxq, struct rte_mbuf **rx_bufs,
+		uint16_t nb_bufs, uint8_t *split_flags)
+{
+	struct rte_mbuf *pkts[RTE_IXGBE_VPMD_RX_BURST]; /* finished pkts */
+	struct rte_mbuf *start = rxq->pkt_first_seg;
+	struct rte_mbuf *end = rxq->pkt_last_seg;
+	unsigned pkt_idx = 0, buf_idx = 0;
+
+	while (buf_idx < nb_bufs) {
+		if (end != NULL) {
+			/* processing a split packet */
+			end->next = rx_bufs[buf_idx];
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+
+			start->nb_segs++;
+			start->pkt_len += rx_bufs[buf_idx]->data_len;
+			end = end->next;
+
+			if (!split_flags[buf_idx]) {
+				/* it's the last packet of the set */
+				start->hash = end->hash;
+				start->ol_flags = end->ol_flags;
+				/* we need to strip crc for the whole packet */
+				start->pkt_len -= rxq->crc_len;
+				if (end->data_len > rxq->crc_len)
+					end->data_len -= rxq->crc_len;
+				else {
+					/* free up last mbuf */
+					struct rte_mbuf *secondlast = start;
+					while (secondlast->next != end)
+						secondlast = secondlast->next;
+					secondlast->data_len -= (rxq->crc_len -
+							end->data_len);
+					secondlast->next = NULL;
+					rte_pktmbuf_free_seg(end);
+					end = secondlast;
+				}
+				pkts[pkt_idx++] = start;
+				start = end = NULL;
+			}
+		} else {
+			/* not processing a split packet */
+			if (!split_flags[buf_idx]) {
+				/* not a split packet, save and skip */
+				pkts[pkt_idx++] = rx_bufs[buf_idx++];
+				continue;
+			}
+			end = start = rx_bufs[buf_idx];
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+			rx_bufs[buf_idx]->pkt_len += rxq->crc_len;
+		}
+		buf_idx++;
+	}
+
+	/* save the partial packet for next time */
+	rxq->pkt_first_seg = start;
+	rxq->pkt_last_seg = end;
+	memcpy(rx_bufs, pkts, pkt_idx * (sizeof(*pkts)));
+	return pkt_idx;
+}
+
+/*
+ * vPMD receive routine that reassembles scattered packets
+ *
+ * Notice:
+ * - don't support ol_flags for rss and csum err
+ * - now only accepts (nb_pkts == RTE_IXGBE_VPMD_RX_BURST)
+ */
+uint16_t
+ixgbe_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts)
+{
+	struct igb_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[RTE_IXGBE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec(rxq, rx_pkts, nb_pkts,
+			split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint32_t *split_fl32 = (uint32_t *)split_flags;
+	if (rxq->pkt_first_seg == NULL &&
+			split_fl32[0] == 0 && split_fl32[1] == 0 &&
+			split_fl32[2] == 0 && split_fl32[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly */
+	unsigned i = 0;
+	if (rxq->pkt_first_seg == NULL) {
+		/* find the first split flag, and only reassemble from there */
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+		&split_flags[i]);
+}
+
 static inline void
 vtx1(volatile union ixgbe_adv_tx_desc *txdp,
 		struct rte_mbuf *pkt, uint64_t flags)
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 11/13] mbuf: move l2_len and l3_len to second cache line
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 11/13] mbuf: move l2_len and l3_len to second cache line Bruce Richardson
@ 2014-09-04  5:08   ` Yerden Zhumabekov
  2014-09-04 10:27     ` Bruce Richardson
  0 siblings, 1 reply; 62+ messages in thread
From: Yerden Zhumabekov @ 2014-09-04  5:08 UTC (permalink / raw)
  To: dev

Hello Bruce,

I'm a little bit concerned about the performance issues that would arise
if these fields were moved to the 2nd cache line.

For example, the l2_len and l3_len fields are used by librte_ip_frag to
find the L3 and L4 header positions inside the mbuf data. Thus, these
values should be calculated by NIC offload, or by the user on the RX leg.

Secondly, (I wouldn't say on behalf of everyone, but) we use these
fields in our libraries as well for classification purposes, for
instance when supporting other ethertypes which are not handled by NIC
offload (MPLS, IPX etc) but where you still need to point out the L3 and
L4 headers.

If my concerns are valid, what would be the possible suggestions?

03.09.2014 21:49, Bruce Richardson wrote:
> The l2_len and l3_len fields are used for TX offloads and so should be
> put on the second cache line, along with the other fields only used on
> TX.
>
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>

-- 
Sincerely,

Yerden Zhumabekov
STS, ACI
Astana, KZ

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 11/13] mbuf: move l2_len and l3_len to second cache line
  2014-09-04  5:08   ` Yerden Zhumabekov
@ 2014-09-04 10:27     ` Bruce Richardson
  2014-09-04 11:00       ` Yerden Zhumabekov
  0 siblings, 1 reply; 62+ messages in thread
From: Bruce Richardson @ 2014-09-04 10:27 UTC (permalink / raw)
  To: Yerden Zhumabekov; +Cc: dev

On Thu, Sep 04, 2014 at 11:08:57AM +0600, Yerden Zhumabekov wrote:
> Hello Bruce,
> 
> I'm a little bit concerned about the performance issues that would arise
> if these fields were moved to the 2nd cache line.
> 
> For example, the l2_len and l3_len fields are used by librte_ip_frag to
> find the L3 and L4 header positions inside the mbuf data. Thus, these
> values should be calculated by NIC offload, or by the user on the RX leg.
> 
> Secondly, (I wouldn't say on behalf of everyone, but) we use these
> fields in our libraries as well for classification purposes, for
> instance when supporting other ethertypes which are not handled by NIC
> offload (MPLS, IPX etc) but where you still need to point out the L3 and
> L4 headers.
> 
> If my concerns are valid, what would be the possible suggestions?

Hi Yerden,

I understand your concerns and it's good to have this discussion.

There are a number of reasons why I've moved these particular fields
to the second cache line. Firstly, the main reason is that, obviously enough,
not all fields will fit in cache line 0, and we need to prioritize what does
get stored there. The guiding principle behind what fields get moved or not
that I've chosen to use for this patch set is to move fields that are not
used on the receive path (or the fastpath receive path, more specifically -
so that we can move fields only used by jumbo frames that span mbufs) to the
second cache line. From a search through the existing codebase, there are no
drivers which set the l2/l3 length fields on RX, this is only used in
reassembly libraries/apps and by the drivers on TX.

The other reason for moving it to the second cache line is that it logically
belongs with all the other length fields that we need to add to enable
tunneling support. [To get an idea of the extra fields that I propose adding
to the mbuf, please see the RFC patchset I sent out previously as "[RFC 
PATCH 00/14] Extend the mbuf structure"]. While we probably can fit the 16-bits
needed for l2/l3 length on the mbuf line 0, there is not enough room for all
the lengths so we would end up splitting them with other fields in between.

So, in terms of what to do about this particular issue: I would hope that for
applications that use these fields the impact should be small and/or possible
to work around, e.g. by prefetching the second cache line on RX in the driver.
If not, then I'm happy to see about withdrawing this particular change and
seeing if we can keep l2/l3 lengths on cache line zero, with other length
fields being on cache line 1.
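
As a rough illustration, that driver-level workaround could be as small as
prefetching the second cache line of each received buffer - a minimal,
untested sketch using the cacheline1 marker this series adds:

	for (i = 0; i < nb_rx; i++)
		rte_prefetch0(&rx_pkts[i]->cacheline1);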

Question: would you consider the ip fragmentation and reassembly example apps
in the Intel DPDK releases good examples to test to see the impacts of this
change, or is there some other test you would prefer that I look to do? 
Can you perhaps test out the patch sets for the mbuf that I've upstreamed so
far and let me know what regressions, if any, you see in your use-case
scenarios?

Regards,
/Bruce

> 
> 03.09.2014 21:49, Bruce Richardson ?????:
> > The l2_len and l3_len fields are used for TX offloads and so should be
> > put on the second cache line, along with the other fields only used on
> > TX.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> 
> -- 
> Sincerely,
> 
> Yerden Zhumabekov
> STS, ACI
> Astana, KZ
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 11/13] mbuf: move l2_len and l3_len to second cache line
  2014-09-04 10:27     ` Bruce Richardson
@ 2014-09-04 11:00       ` Yerden Zhumabekov
  2014-09-04 11:55         ` Bruce Richardson
  0 siblings, 1 reply; 62+ messages in thread
From: Yerden Zhumabekov @ 2014-09-04 11:00 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

I get your point. I've also read through the code of various PMDs and
have found no indication of the l2_len/l3_len fields being set either.

As for testing, we'd be happy to test the patchset but we are now in
process of building our testing facilities so we are not ready to
provide enough workload for the hardware/software. I was also wondering
if anyone has run some test and can provide some numbers on that matter.

Personally, I don't think the frag/reassembly app is a good example for
evaluating the 2nd cache line performance penalty. The offsets to the L3
and L4 headers need to be calculated for all TCP/IP traffic, and
fragmented traffic is not representative in this case. Maybe it would be
better to write an app which calculates these offsets for different sets
of mbufs and provides some stats. For example, l2fwd/l3fwd + an
additional l2_len and l3_len calculation.
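
For instance, on top of l2fwd each packet could get something like the
following (a rough sketch only, assuming untagged Ethernet + IPv4):

	struct ether_hdr *eth = rte_pktmbuf_mtod(m, struct ether_hdr *);
	m->l2_len = sizeof(*eth);
	if (eth->ether_type == rte_cpu_to_be_16(ETHER_TYPE_IPv4)) {
		struct ipv4_hdr *ip = (struct ipv4_hdr *)(eth + 1);
		/* low nibble of version_ihl is the IHL, in 32-bit words */
		m->l3_len = (ip->version_ihl & 0x0f) * 4;
	}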

And I'm also figuring out how to rewrite our app/libs (prefetch etc) to
reflect the future changes in mbuf, hence my concerns :)


04.09.2014 16:27, Bruce Richardson wrote:
> Hi Yerden,
>
> I understand your concerns and it's good to have this discussion.
>
> There are a number of reasons why I've moved these particular fields
> to the second cache line. Firstly, the main reason is that, obviously enough,
> not all fields will fit in cache line 0, and we need to prioritize what does
> get stored there. The guiding principle behind what fields get moved or not
> that I've chosen to use for this patch set is to move fields that are not
> used on the receive path (or the fastpath receive path, more specifically -
> so that we can move fields only used by jumbo frames that span mbufs) to the
> second cache line. From a search through the existing codebase, there are no
> drivers which set the l2/l3 length fields on RX, this is only used in
> reassembly libraries/apps and by the drivers on TX.
>
> The other reason for moving it to the second cache line is that it logically
> belongs with all the other length fields that we need to add to enable
> tunneling support. [To get an idea of the extra fields that I propose adding
> to the mbuf, please see the RFC patchset I sent out previously as "[RFC 
> PATCH 00/14] Extend the mbuf structure"]. While we probably can fit the 16-bits
> needed for l2/l3 length on the mbuf line 0, there is not enough room for all
> the lengths so we would end up splitting them with other fields in between.
>
> So, in terms of what to do about this particular issue: I would hope that for
> applications that use these fields the impact should be small and/or possible
> to work around, e.g. by prefetching the second cache line on RX in the driver.
> If not, then I'm happy to see about withdrawing this particular change and
> seeing if we can keep l2/l3 lengths on cache line zero, with other length
> fields being on cache line 1.
>
> Question: would you consider the ip fragmentation and reassembly example apps
> in the Intel DPDK releases good examples to test to see the impacts of this
> change, or is there some other test you would prefer that I look to do? 
> Can you perhaps test out the patch sets for the mbuf that I've upstreamed so
> far and let me know what regressions, if any, you see in your use-case
> scenarios?
>
> Regards,
> /Bruce
>
-- 
Sincerely,

Yerden Zhumabekov
STS, ACI
Astana, KZ

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 11/13] mbuf: move l2_len and l3_len to second cache line
  2014-09-04 11:00       ` Yerden Zhumabekov
@ 2014-09-04 11:55         ` Bruce Richardson
  0 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-04 11:55 UTC (permalink / raw)
  To: Yerden Zhumabekov; +Cc: dev

On Thu, Sep 04, 2014 at 05:00:12PM +0600, Yerden Zhumabekov wrote:
> I get your point. I've also read through the code of various PMDs and
> have found no indication of the l2_len/l3_len fields being set either.
> 
> As for testing, we'd be happy to test the patchset but we are now in
> process of building our testing facilities so we are not ready to
> provide enough workload for the hardware/software. I was also wondering
> if anyone has run some test and can provide some numbers on that matter.
> 
> Personally, I don't think the frag/reassembly app is a good example for
> evaluating the 2nd cache line performance penalty. The offsets to the L3
> and L4 headers need to be calculated for all TCP/IP traffic, and
> fragmented traffic is not representative in this case. Maybe it would be
> better to write an app which calculates these offsets for different sets
> of mbufs and provides some stats. For example, l2fwd/l3fwd + an
> additional l2_len and l3_len calculation.
> 
> And I'm also figuring out how to rewrite our app/libs (prefetch etc) to
> reflect the future changes in mbuf, hence my concerns :)
>
Just a final point on this. Note that the second cache line is always being
read by the TX leg of the code to free back mbufs to their mbuf pool post-
transmit. The overall fast-path RX+TX benchmarks show no performance
degradation due to that access.
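
(That access is just the normal per-mbuf free on the TX side, i.e.
something like:

	rte_mempool_put(m->pool, m);

where the pool pointer now sits on the second cache line.)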

For sample apps, you make a good point indeed about the existing apps not
being very useful, as they work on larger packets. I'll see what I can throw
together here to make a more realistic test.

/Bruce
 
> 
> 04.09.2014 16:27, Bruce Richardson wrote:
> > Hi Yerden,
> >
> > I understand your concerns and it's good to have this discussion.
> >
> > There are a number of reasons why I've moved these particular fields
> > to the second cache line. Firstly, the main reason is that, obviously enough,
> > not all fields will fit in cache line 0, and we need to prioritize what does
> > get stored there. The guiding principle behind what fields get moved or not
> > that I've chosen to use for this patch set is to move fields that are not
> > used on the receive path (or the fastpath receive path, more specifically -
> > so that we can move fields only used by jumbo frames that span mbufs) to the
> > second cache line. From a search through the existing codebase, there are no
> > drivers which set the l2/l3 length fields on RX, this is only used in
> > reassembly libraries/apps and by the drivers on TX.
> >
> > The other reason for moving it to the second cache line is that it logically
> > belongs with all the other length fields that we need to add to enable
> > tunneling support. [To get an idea of the extra fields that I propose adding
> > to the mbuf, please see the RFC patchset I sent out previously as "[RFC 
> > PATCH 00/14] Extend the mbuf structure"]. While we probably can fit the 16-bits
> > needed for l2/l3 length on the mbuf line 0, there is not enough room for all
> > the lengths so we would end up splitting them with other fields in between.
> >
> > So, in terms of what to do about this particular issue: I would hope that for
> > applications that use these fields the impact should be small and/or possible
> > to work around, e.g. by prefetching the second cache line on RX in the driver.
> > If not, then I'm happy to see about withdrawing this particular change and
> > seeing if we can keep l2/l3 lengths on cache line zero, with other length
> > fields being on cache line 1.
> >
> > Question: would you consider the ip fragmentation and reassembly example apps
> > in the Intel DPDK releases good examples to test to see the impacts of this
> > change, or is there some other test you would prefer that I look to do? 
> > Can you perhaps test out the patch sets for the mbuf that I've upstreamed so
> > far and let me know what regressions, if any, you see in your use-case
> > scenarios?
> >
> > Regards,
> > /Bruce
> >
> -- 
> Sincerely,
> 
> Yerden Zhumabekov
> STS, ACI
> Astana, KZ
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 01/13] mbuf: replace data pointer by an offset
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 01/13] mbuf: replace data pointer by an offset Bruce Richardson
@ 2014-09-08  9:52   ` Olivier MATZ
  2014-09-08  9:55     ` Olivier MATZ
  0 siblings, 1 reply; 62+ messages in thread
From: Olivier MATZ @ 2014-09-08  9:52 UTC (permalink / raw)
  To: Bruce Richardson, dev

Hi Bruce,

On 09/03/2014 05:49 PM, Bruce Richardson wrote:
> From: Olivier Matz <olivier.matz@6wind.com>
> 
> Original patch:
>  The mbuf structure already contains a pointer to the beginning of the
>  buffer (m->buf_addr). It is not needed to use 8 bytes again to store
>  another pointer to the beginning of the data.
> 
>  Using a 16 bits unsigned integer is enough as we know that a mbuf is
>  never longer than 64KB. We gain 6 bytes in the structure thanks to
>  this modification.
> 
>  Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
> 
> This version:
> * Updated original patch to apply to latest on mainline.
> * Disabled vector PMD in config as it relies heavily on the mbuf layout
>   This will be re-enabled in a subsequent commit once vPMD has been
>   reworked to take account of mbuf changes.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>
> [...]
>
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 32e8474..669e7f5 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -119,6 +119,13 @@ struct rte_mbuf {
>  	void *buf_addr;           /**< Virtual address of segment buffer. */
>  	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
>  	uint16_t buf_len;         /**< Length of segment buffer. */
> +
> +	/* valid for any segment */
> +	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> +	uint16_t data_off;
> +	uint16_t data_len;        /**< Amount of data in segment buffer. */
> +	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
> +

Here, the compiler will add some padding between "buf_len" and "next"
(6 bytes on 64 bits, 2 on 32 bits). Shouldn't we add "reserved" fields,
or at least a comment? Another idea would be to reorganize the fields
to avoid padding, if possible.
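
The hole is easy to check with a throwaway snippet (needs <stdio.h> and
<stddef.h>; not part of the patch):

	printf("hole after buf_len: %zu bytes\n",
		offsetof(struct rte_mbuf, next) -
		offsetof(struct rte_mbuf, buf_len) - sizeof(uint16_t));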

Except for this remark, Acked-by: Olivier Matz <olivier.matz@6wind.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 01/13] mbuf: replace data pointer by an offset
  2014-09-08  9:52   ` Olivier MATZ
@ 2014-09-08  9:55     ` Olivier MATZ
  0 siblings, 0 replies; 62+ messages in thread
From: Olivier MATZ @ 2014-09-08  9:55 UTC (permalink / raw)
  To: Bruce Richardson, dev


>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
>> index 32e8474..669e7f5 100644
>> --- a/lib/librte_mbuf/rte_mbuf.h
>> +++ b/lib/librte_mbuf/rte_mbuf.h
>> @@ -119,6 +119,13 @@ struct rte_mbuf {
>>  	void *buf_addr;           /**< Virtual address of segment buffer. */
>>  	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
>>  	uint16_t buf_len;         /**< Length of segment buffer. */
>> +
>> +	/* valid for any segment */
>> +	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
>> +	uint16_t data_off;
>> +	uint16_t data_len;        /**< Amount of data in segment buffer. */
>> +	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
>> +
> 
> Here, the compiler will add some padding between "buf_len" and "next"
> (6 bytes in 64bits, 2 in 32bits). Shouldn't we add "reserved" fields
> or at least a comment ? Another idea would be to reorganize the fields
> to avoid padding, if possible.

I've just seen that patch 02/13 reorders the mbuf fields. So maybe this
remark is useless.


Olivier

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 02/13] mbuf: reorder fields by time of use
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 02/13] mbuf: reorder fields by time of use Bruce Richardson
@ 2014-09-08 10:17   ` Olivier MATZ
  0 siblings, 0 replies; 62+ messages in thread
From: Olivier MATZ @ 2014-09-08 10:17 UTC (permalink / raw)
  To: Bruce Richardson, dev

Hi Bruce,

On 09/03/2014 05:49 PM, Bruce Richardson wrote:
> *  Reorder the fields in the mbuf so that we have fields that are used$
> together side-by-side in the structure. This means that we have a$
> contiguous block of 8-bytes in the mbuf which are used to reset an mbuf$
> of descriptor rearm, and a block of 16-bytes of data (excluding flags)
> which are set on RX from the received packet descriptor.

Some unwanted $ characters here.

> * Use dummy fields as appropraite to ensure alignment or to reserve gaps
> for later field additions.

s/appropraite/appropriate

> * Place most items which are not used by fast-path RX separately at the end
> of the structure so they can later be moved to a separate cache line.
> [The l2/l3 length fields are not moved at this stage as doing so will
> cause overflow to the next cache line].
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>

Acked-by: Olivier Matz <olivier.matz@6wind.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field Bruce Richardson
@ 2014-09-08 10:17   ` Olivier MATZ
  2014-09-08 10:33     ` Yerden Zhumabekov
  0 siblings, 1 reply; 62+ messages in thread
From: Olivier MATZ @ 2014-09-08 10:17 UTC (permalink / raw)
  To: Bruce Richardson, dev

Hi Bruce,

On 09/03/2014 05:49 PM, Bruce Richardson wrote:
> Replace a reserved slot with the new packet type field used to identify
> the type of the packet, i.e. what protocols are used.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
>  lib/librte_mbuf/rte_mbuf.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index f136d37..8d0c6fb 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -146,7 +146,7 @@ struct rte_mbuf {
>  	uint32_t reserved1;     /**< Unused field. Required for padding */
>  
>  	/* remaining bytes are set on RX when pulling packet from descriptor */
> -	uint16_t reserved2;     /**< Unused field. Required for padding */
> +	uint16_t packet_type;   /**< Type of packet, e.g. protocols used */
>  	uint16_t data_len;      /**< Amount of data in segment buffer. */
>  	uint32_t pkt_len;       /**< Total pkt len: sum of all segments. */
>  	uint16_t l3_len:9;      /**< L3 (IP) Header Length. */
> 

This patch adds a new field that nobody uses. So why should we add it?

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 04/13] mbuf: expand ol_flags field to 64-bits
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 04/13] mbuf: expand ol_flags field to 64-bits Bruce Richardson
@ 2014-09-08 10:25   ` Olivier MATZ
  2014-09-09  9:00     ` Richardson, Bruce
  0 siblings, 1 reply; 62+ messages in thread
From: Olivier MATZ @ 2014-09-08 10:25 UTC (permalink / raw)
  To: Bruce Richardson, dev

Hi Bruce,

On 09/03/2014 05:49 PM, Bruce Richardson wrote:
> The offload flags field (ol_flags) was 16-bits and had no further room
> for expansion. This patch increases the field size to 64-bits, using up
> the remaining reserved space in the single-cache-line mbuf.
> 
> NOTE: none of the values for existing flags have been changed, i.e. no
> new numbers have been explicitly reserved between existing flag
> definitions.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>

The initial series I proposed [1][2] had one more enhancement: the
first patch [1] allowed removing the definition of flag names in
testpmd. Indeed, this is not really good because they must be kept
synchronized with the flags in rte_mbuf. What do you think about this
patch? Should it be integrated in your series? Or later? Or never? ;)

The second patch [2] changes the value of the flags. This is not needed
now, but if we do it in the future, we should not forget to change
app/test-pmd/cmdline.c accordingly. Maybe this could go in your patch
directly as it does not hurt?

Olivier


[1] http://dpdk.org/ml/archives/dev/2014-May/002545.html
[2] http://dpdk.org/ml/archives/dev/2014-May/002546.html

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field
  2014-09-08 10:17   ` Olivier MATZ
@ 2014-09-08 10:33     ` Yerden Zhumabekov
  2014-09-08 11:17       ` Olivier MATZ
  0 siblings, 1 reply; 62+ messages in thread
From: Yerden Zhumabekov @ 2014-09-08 10:33 UTC (permalink / raw)
  To: Olivier MATZ, Bruce Richardson, dev

I would use it :)
It's useful to store the IP protocol number (UDP, TCP etc) and the IP
version (4, 6) and then relay the packet to a specific handler.
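
For instance (a sketch only - the packet_type values and handlers below
are our own application-level names, not anything defined by the NIC):

	enum app_ptype { APP_PTYPE_OTHER = 0, APP_PTYPE_IPV4_TCP,
			APP_PTYPE_IPV4_UDP, APP_PTYPE_IPV6_TCP };

	switch (m->packet_type) {
	case APP_PTYPE_IPV4_TCP: handle_tcp_v4(m); break;
	case APP_PTYPE_IPV4_UDP: handle_udp_v4(m); break;
	case APP_PTYPE_IPV6_TCP: handle_tcp_v6(m); break;
	default:                 handle_other(m); break;
	}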

08.09.2014 16:17, Olivier MATZ wrote:
> Hi Bruce,
>
> On 09/03/2014 05:49 PM, Bruce Richardson wrote:
>> Replace a reserved slot with the new packet type field used to identify
>> the type of the packet, i.e. what protocols are used.
>>
>> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
>> ---
>>  lib/librte_mbuf/rte_mbuf.h | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
>> index f136d37..8d0c6fb 100644
>> --- a/lib/librte_mbuf/rte_mbuf.h
>> +++ b/lib/librte_mbuf/rte_mbuf.h
>> @@ -146,7 +146,7 @@ struct rte_mbuf {
>>  	uint32_t reserved1;     /**< Unused field. Required for padding */
>>  
>>  	/* remaining bytes are set on RX when pulling packet from descriptor */
>> -	uint16_t reserved2;     /**< Unused field. Required for padding */
>> +	uint16_t packet_type;   /**< Type of packet, e.g. protocols used */
>>  	uint16_t data_len;      /**< Amount of data in segment buffer. */
>>  	uint32_t pkt_len;       /**< Total pkt len: sum of all segments. */
>>  	uint16_t l3_len:9;      /**< L3 (IP) Header Length. */
>>
> This patch adds a new field that nobody uses. So why should we add it?

-- 
Sincerely,

Yerden Zhumabekov
State Technical Service
Astana, KZ

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field
  2014-09-08 10:33     ` Yerden Zhumabekov
@ 2014-09-08 11:17       ` Olivier MATZ
  2014-09-09  3:59         ` Zhang, Helin
  2014-09-09 15:05         ` Jim Thompson
  0 siblings, 2 replies; 62+ messages in thread
From: Olivier MATZ @ 2014-09-08 11:17 UTC (permalink / raw)
  To: Yerden Zhumabekov, Bruce Richardson, dev

Hi Yerden,

On 09/08/2014 12:33 PM, Yerden Zhumabekov wrote:
> 08.09.2014 16:17, Olivier MATZ wrote:
>>> --- a/lib/librte_mbuf/rte_mbuf.h
>>> +++ b/lib/librte_mbuf/rte_mbuf.h
>>> @@ -146,7 +146,7 @@ struct rte_mbuf {
>>>  	uint32_t reserved1;     /**< Unused field. Required for padding */
>>>  
>>>  	/* remaining bytes are set on RX when pulling packet from descriptor */
>>> -	uint16_t reserved2;     /**< Unused field. Required for padding */
>>> +	uint16_t packet_type;   /**< Type of packet, e.g. protocols used */
>>>  	uint16_t data_len;      /**< Amount of data in segment buffer. */
>>>  	uint32_t pkt_len;       /**< Total pkt len: sum of all segments. */
>>>  	uint16_t l3_len:9;      /**< L3 (IP) Header Length. */
>>>
>> This patch adds a new field that nobody uses. So why should we add it?
> 
> I would use it :)
> It's useful to store the IP protocol number (UDP, TCP etc) and the IP
> version (4, 6) and then relay the packet to a specific handler.

I'm not saying this field is useless. But even if it's useful
for some applications like yours, it does not mean that it should go in
the generic mbuf structure.

Also, for a new field, we should define who is in charge of filling it.
Is it the driver? Does it mean that all drivers have to be modified to
fill it? Or is it just a placeholder for applications? In this case,
shouldn't we use application-specific metadata? In the other direction
(TX), we would also need to define if this field must be filled by the
application before transmitting a mbuf to a driver.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 05/13] mbuf: introduce a flag to indicate a control mbuf
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 05/13] mbuf: introduce a flag to indicate a control mbuf Bruce Richardson
@ 2014-09-08 11:53   ` Olivier MATZ
  0 siblings, 0 replies; 62+ messages in thread
From: Olivier MATZ @ 2014-09-08 11:53 UTC (permalink / raw)
  To: Bruce Richardson, dev

Hi Bruce,

On 09/03/2014 05:49 PM, Bruce Richardson wrote:
> Since the flags field is now 64-bits, we can allow one bit to be used to
> indicate a control i.e. non-packet mbuf. Dedicate the high bit (bit 63)
> for this purpose and add in a utility macro to test if a given mbuf has
> the bit set or not.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> 
> [...]
>
> +/**
> + * Tests if an mbuf is a control mbuf
> + *
> + * @param m
> + *   The mbuf.to be tested
> + * @return
> + *   - True (1) if the mbuf is a control mbuf
> + *   - False(0) otherwise
> + */

Typo s/mbuf.to/mbuf to/
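
The test itself amounts to something like this (the flag value comes from
the patch description; the exact macro name in the patch may differ):

	#define CTRL_MBUF_FLAG (1ULL << 63)	/* bit 63 of ol_flags */
	#define RTE_MBUF_IS_CTRL(m) (!!((m)->ol_flags & CTRL_MBUF_FLAG))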


Acked-by: Olivier Matz <olivier.matz@6wind.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 06/13] mbuf: minor changes for readability
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 06/13] mbuf: minor changes for readability Bruce Richardson
@ 2014-09-08 12:03   ` Olivier MATZ
  0 siblings, 0 replies; 62+ messages in thread
From: Olivier MATZ @ 2014-09-08 12:03 UTC (permalink / raw)
  To: Bruce Richardson, dev

On 09/03/2014 05:49 PM, Bruce Richardson wrote:
> * Ensure comments line up correctly
> * Simplify the #ifdefs around the refcnt fields to make them clearer
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>

Acked-by: Olivier Matz <olivier.matz@6wind.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata Bruce Richardson
@ 2014-09-08 12:05   ` Olivier MATZ
  2014-09-09  9:01     ` Richardson, Bruce
  2014-09-10 15:09     ` Bruce Richardson
  0 siblings, 2 replies; 62+ messages in thread
From: Olivier MATZ @ 2014-09-08 12:05 UTC (permalink / raw)
  To: Bruce Richardson, dev

Hi Bruce,

On 09/03/2014 05:49 PM, Bruce Richardson wrote:
> Removed the explicit zero-sized metadata definition at the end of the
> mbuf data structure. Updated the metadata macros to take account of this
> change so that all existing code which uses those macros still works.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
>  lib/librte_mbuf/rte_mbuf.h | 22 ++++++++--------------
>  1 file changed, 8 insertions(+), 14 deletions(-)
> 
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 5260001..ca66d9a 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -166,31 +166,25 @@ struct rte_mbuf {
>  	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
>  	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
>  
> -	union {
> -		uint8_t metadata[0];
> -		uint16_t metadata16[0];
> -		uint32_t metadata32[0];
> -		uint64_t metadata64[0];
> -	} __rte_cache_aligned;
>  } __rte_cache_aligned;
>  
>  #define RTE_MBUF_METADATA_UINT8(mbuf, offset)              \
> -	(mbuf->metadata[offset])
> +	(((uint8_t *)&(mbuf)[1])[offset])
>  #define RTE_MBUF_METADATA_UINT16(mbuf, offset)             \
> -	(mbuf->metadata16[offset/sizeof(uint16_t)])
> +	(((uint16_t *)&(mbuf)[1])[offset/sizeof(uint16_t)])
>  #define RTE_MBUF_METADATA_UINT32(mbuf, offset)             \
> -	(mbuf->metadata32[offset/sizeof(uint32_t)])
> +	(((uint32_t *)&(mbuf)[1])[offset/sizeof(uint32_t)])
>  #define RTE_MBUF_METADATA_UINT64(mbuf, offset)             \
> -	(mbuf->metadata64[offset/sizeof(uint64_t)])
> +	(((uint64_t *)&(mbuf)[1])[offset/sizeof(uint64_t)])
>  
>  #define RTE_MBUF_METADATA_UINT8_PTR(mbuf, offset)          \
> -	(&mbuf->metadata[offset])
> +	(&RTE_MBUF_METADATA_UINT8(mbuf, offset))
>  #define RTE_MBUF_METADATA_UINT16_PTR(mbuf, offset)         \
> -	(&mbuf->metadata16[offset/sizeof(uint16_t)])
> +	(&RTE_MBUF_METADATA_UINT16(mbuf, offset))
>  #define RTE_MBUF_METADATA_UINT32_PTR(mbuf, offset)         \
> -	(&mbuf->metadata32[offset/sizeof(uint32_t)])
> +	(&RTE_MBUF_METADATA_UINT32(mbuf, offset))
>  #define RTE_MBUF_METADATA_UINT64_PTR(mbuf, offset)         \
> -	(&mbuf->metadata64[offset/sizeof(uint64_t)])
> +	(&RTE_MBUF_METADATA_UINT64(mbuf, offset))
>  
>  /**
>   * Given the buf_addr returns the pointer to corresponding mbuf.
> 

I think it goes in the right direction. So:
Acked-by: Olivier Matz <olivier.matz@6wind.com>

Just one question: why not remove the RTE_MBUF_METADATA*() macros?
I'd just provide one macro that gives a (void*) to the first byte
after the mbuf structure.

The format of the metadata is up to the application, which usually
casts (m + 1) into a private structure, making the macros not very
useful. I suggest moving these macros outside rte_mbuf.h, into an
application-specific or library-specific header; what do you think?
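
In other words, something along these lines (macro and struct names are
hypothetical):

	#define RTE_MBUF_METADATA(m) ((void *)((m) + 1))

	struct app_metadata {		/* application-private layout */
		uint32_t flow_id;
		uint8_t color;
	};
	struct app_metadata *meta = RTE_MBUF_METADATA(m);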

Regards,
Olivier

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 08/13] mbuf: add named points inside the mbuf structure
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 08/13] mbuf: add named points inside the mbuf structure Bruce Richardson
@ 2014-09-08 12:08   ` Olivier MATZ
  0 siblings, 0 replies; 62+ messages in thread
From: Olivier MATZ @ 2014-09-08 12:08 UTC (permalink / raw)
  To: Bruce Richardson, dev

On 09/03/2014 05:49 PM, Bruce Richardson wrote:
> Add markers or "labels" at given points inside the mbuf which can be
> used instead of individual fields to identify the start of logical
> sections inside the mbuf.
> 
> The use of typedefs and dummy fields was chosen over using unions
> because of a couple reasons:
> * unions cause an extra level of indentation (more likely two levels as
>   a union containing a struct for multiple fields would be needed). This
>   makes the lines longer than they need to be and increases the need for
>   wrapping. [This was the main reason]
> * with markers, you can apply multiple markers at the same point if
>   wanted.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>

Acked-by: Olivier Matz <olivier.matz@6wind.com>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 10/13] mbuf: split mbuf across two cache lines.
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 10/13] mbuf: split mbuf across two cache lines Bruce Richardson
@ 2014-09-08 12:10   ` Olivier MATZ
  0 siblings, 0 replies; 62+ messages in thread
From: Olivier MATZ @ 2014-09-08 12:10 UTC (permalink / raw)
  To: Bruce Richardson, dev

Hi Bruce,

On 09/03/2014 05:49 PM, Bruce Richardson wrote:
> This change splits the mbuf in two to move the pool and next pointers to
> the second cache line. This frees up 16 bytes in first cache line.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>

I think you should provide more explanation in the commit log about why
this change is needed. I remember you already provided the arguments on
the mailing list, but it could be interesting to have them in the log.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field
  2014-09-08 11:17       ` Olivier MATZ
@ 2014-09-09  3:59         ` Zhang, Helin
       [not found]           ` <540EB428.9060706@6wind.com>
  2014-09-09 15:05         ` Jim Thompson
  1 sibling, 1 reply; 62+ messages in thread
From: Zhang, Helin @ 2014-09-09  3:59 UTC (permalink / raw)
  To: Olivier MATZ, Yerden Zhumabekov, Richardson, Bruce, dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier MATZ
> Sent: Monday, September 8, 2014 7:17 PM
> To: Yerden Zhumabekov; Richardson, Bruce; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field
> 
> Hi Yerden,
> 
> On 09/08/2014 12:33 PM, Yerden Zhumabekov wrote:
> > 08.09.2014 16:17, Olivier MATZ wrote:
> >>> --- a/lib/librte_mbuf/rte_mbuf.h
> >>> +++ b/lib/librte_mbuf/rte_mbuf.h
> >>> @@ -146,7 +146,7 @@ struct rte_mbuf {
> >>>  	uint32_t reserved1;     /**< Unused field. Required for padding */
> >>>
> >>>  	/* remaining bytes are set on RX when pulling packet from descriptor
> */
> >>> -	uint16_t reserved2;     /**< Unused field. Required for padding */
> >>> +	uint16_t packet_type;   /**< Type of packet, e.g. protocols used */
> >>>  	uint16_t data_len;      /**< Amount of data in segment buffer. */
> >>>  	uint32_t pkt_len;       /**< Total pkt len: sum of all segments. */
> >>>  	uint16_t l3_len:9;      /**< L3 (IP) Header Length. */
> >>>
> >> This patch adds a new field that nobody uses. So why should we add it?
> >
> > I would use it :)
> > It's useful to store the IP protocol number (UDP, TCP etc) and the IP
> > version (4, 6) and then relay the packet to a specific handler.

It is a common field which the i40e PMD will use to store the 'packet type ID'. i40e
hardware can recognize more than a hundred packet types in received packets,
which is quite useful for an upper layer stack or application. So this field is quite
useful and will be filled by the PMD.
ixgbe/igb have fewer than 10 packet types, which are marked in the offload flags. From
now on, it would be better to have a new field here to hold the hardware-offloaded
packet type, and it could also be used by future NICs.

> 
> I'm not saying this field is useless. But even if it's useful for some applications
> like yours, it does not mean that it should go in the generic mbuf structure.
> 
> Also, for a new field, we should define who is in charge of filling it.
> Is it the driver? Does it mean that all drivers have to be modified to fill it? Or is
> it just a placeholder for applications? In this case, shouldn't we use
> application-specific metadata? In the other direction (TX), we would also need
> to define if this field must be filled by the application before transmitting a mbuf
> to a driver.
Yes, the PMD will fill it. The i40e PMD will be the first one; ixgbe/igb can be kept as
they are, or modified to be consistent. It is used on the RX side only; for the TX side,
it can be investigated whether it can be used as well. Some new features in
development may make use of it.
Anyway, it is quite a useful field for i40e and future generations of NICs.
> 
> Regards,
> Olivier

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field
       [not found]           ` <540EB428.9060706@6wind.com>
@ 2014-09-09  8:45             ` Zhang, Helin
  2014-09-09  9:47             ` Richardson, Bruce
  1 sibling, 0 replies; 62+ messages in thread
From: Zhang, Helin @ 2014-09-09  8:45 UTC (permalink / raw)
  To: Olivier MATZ, Yerden Zhumabekov, Richardson, Bruce, dev



> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Tuesday, September 9, 2014 4:03 PM
> To: Zhang, Helin; Yerden Zhumabekov; Richardson, Bruce; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field
> 
> Hello,
> 
> On 09/09/2014 05:59 AM, Zhang, Helin wrote:
> > It is a common field which the i40e PMD will use to store the 'packet
> > type ID'. i40e hardware can recognize more than a hundred packet types
> > in received packets, which is quite useful for an upper layer stack
> > or application. So this field is quite useful and will be filled by the PMD.
> > ixgbe/igb have fewer than 10 packet types, which are marked in the
> > offload flags. From now on, it would be better to have a new field here
> > to hold the hardware-offloaded packet type, and it could also be used by
> > future NICs.
> >
> >>
> >> I'm not saying this field is useless. But even if it's useful for
> >> some applications like yours, it does not mean that it should go in the
> >> generic mbuf structure.
> >>
> >> Also, for a new field, we should define who is in charge of filling it.
> >> Is it the driver? Does it mean that all drivers have to be modified
> >> to fill it? Or is it just a placeholder for applications? In this
> >> case, shouldn't we use application-specific metadata? In the other
> >> direction (TX), we would also need to define if this field must be
> >> filled by the application before transmitting a mbuf to a driver.
> > Yes, the PMD will fill it. The i40e PMD will be the first one; ixgbe/igb can
> > be kept as they are, or modified to be consistent. It is used on the RX side
> > only; for the TX side, it can be investigated whether it can be used as
> > well. Some new features in development may make use of it.
> > Anyway, it is quite a useful field for i40e and future generations of NICs.
> 
> To me, having hardware support for a feature is not a sufficient
> reason for adding this field. There are many hardware features that will never
> be integrated in dpdk.

At least this field is quite important for i40e.
E.g. packet type = 43 means that the hardware recognized it as a VXLAN packet. To avoid having software check the packet type, the hardware can offload that and fill the packet type ID into this field.
It cannot be put in ol_flags anymore, as the hardware can recognize more than 100 packet types. Without it, the vxlan feature cannot be implemented at all.

> 
> This first version of the patch:
> 
> - just adds a field that is not used by any code, so it is useless.
>    At least testpmd or an application example should show how to
>    use it.

It will be used at least by the vxlan feature which is in development. Without it, vxlan cannot be completed. So this is a very important field for i40e and future NICs.

>
> - does not describe what enhancement is provided by adding the
>    field (performance? in this case, numbers + use case would help
>    to convince people).

i40e hardware can classify received packets into different packet types, and about 256 packet types can be recognized by the hardware. It is not an enhancement, it is key to at least the i40e features.

> 
> - does not describe what can be the content of the field. Is it
>    a protocol number?
> 

The packet type is a hardware offload feature; each value identifies one type of packet recognized by the i40e hardware. E.g.

| Packet type | Description              |
| 0           | Reserved                 |
| 1           | MAC, PAY2                |
| 2           | MAC, TimeSync, PAY2      |
| ...         | ...                      |
| 43          | MAC, IPV4, GRENAT, PAY3  |

> - does not explain if all drivers must fill this field. If yes,
>    the patch has to update all drivers. If not, something must be
>    done to mark the packet field as unknown by default.
> 
i40e needs it.
igb/ixgbe can be changed to support it, but that is not mandatory, as ol_flags can represent their types.
Actually, ixgbe and igb have packet types also, but there are fewer than 10 of them, so they can be put in ol_flags. For i40e and future NICs, there could be 256 or more, and ol_flags does not have that many bits for them. The best idea is to fill the packet type ID directly into a field.
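
To illustrate, an application could translate the hardware value with a
simple lookup table (the classification names below are invented, and it
assumes packet type values stay below 256):

	static const uint8_t ptype_class[256] = {
		[1]  = APP_CLASS_L2,		/* MAC, PAY2 */
		[43] = APP_CLASS_TUNNEL,	/* MAC, IPV4, GRENAT, PAY3 */
		/* ... one entry per hardware packet type ... */
	};
	uint8_t class = ptype_class[m->packet_type];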

> Regards,
> Olivier

Regards,
Helin


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 04/13] mbuf: expand ol_flags field to 64-bits
  2014-09-08 10:25   ` Olivier MATZ
@ 2014-09-09  9:00     ` Richardson, Bruce
  0 siblings, 0 replies; 62+ messages in thread
From: Richardson, Bruce @ 2014-09-09  9:00 UTC (permalink / raw)
  To: Olivier MATZ, dev

> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Monday, September 08, 2014 11:26 AM
> To: Richardson, Bruce; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 04/13] mbuf: expand ol_flags field to 64-bits
> 
> Hi Bruce,
> 
> On 09/03/2014 05:49 PM, Bruce Richardson wrote:
> > The offload flags field (ol_flags) was 16-bits and had no further room
> > for expansion. This patch increases the field size to 64-bits, using up
> > the remaining reserved space in the single-cache-line mbuf.
> >
> > NOTE: none of the values for existing flags have been changed, i.e. no
> > new numbers have been explicitly reserved between existing flag
> > definitions.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> 
> The initial series I've proposed [1][2] had one more enhancement: the
> first patch [1] allowed removing the definition of flag names in
> testpmd. Indeed, the current situation is not really good because those
> names must be kept synchronized with the flags in rte_mbuf. What do you
> think about this patch? Should it be integrated in your series? Or
> later? Or never? ;)

No, it is a good change - I've just kept it out of my series for simplicity, as I'm largely trying to keep the scope as small as possible. I would love to see it go in as a separate patch, maybe once the mbuf rework is finished.
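
For anyone following along, my understanding is that [1] boils down to keeping a single flag-to-name table beside the flag definitions instead of duplicating the names in testpmd - roughly like this (sketch only, names invented; see the patch itself for the real shape):

    struct flag_name {
        uint64_t flag;
        const char *name;
    };

    /* defined once, next to the PKT_RX_* flags themselves */
    static const struct flag_name rx_flag_names[] = {
        { PKT_RX_VLAN_PKT,     "PKT_RX_VLAN_PKT" },
        { PKT_RX_IEEE1588_PTP, "PKT_RX_IEEE1588_PTP" },
        /* ... one entry per flag ... */
    };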

> 
> The second patch [2] changes the value of the flags. This is not needed
> now, but if we do it in the future, we should not forget to change
> app/test-pmd/cmdline.c accordingly. Maybe this could go in your patch
> directly as it does not hurt?

As above, for now. Right now I'm just trying to get the structure worked out and deal with any performance regressions that are found (such as the one Pablo found last Friday :-( ).

/Bruce

> 
> Olivier
> 
> 
> [1] http://dpdk.org/ml/archives/dev/2014-May/002545.html
> [2] http://dpdk.org/ml/archives/dev/2014-May/002546.html

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata
  2014-09-08 12:05   ` Olivier MATZ
@ 2014-09-09  9:01     ` Richardson, Bruce
  2014-09-12 16:56       ` Dumitrescu, Cristian
  2014-09-10 15:09     ` Bruce Richardson
  1 sibling, 1 reply; 62+ messages in thread
From: Richardson, Bruce @ 2014-09-09  9:01 UTC (permalink / raw)
  To: Olivier MATZ, dev

> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Monday, September 08, 2014 1:06 PM
> To: Richardson, Bruce; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the
> mbuf metadata
> 
> Hi Bruce,
> 
> On 09/03/2014 05:49 PM, Bruce Richardson wrote:
> > Removed the explicit zero-sized metadata definition at the end of the
> > mbuf data structure. Updated the metadata macros to take account of this
> > change so that all existing code which uses those macros still works.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > ---
> >  lib/librte_mbuf/rte_mbuf.h | 22 ++++++++--------------
> >  1 file changed, 8 insertions(+), 14 deletions(-)
> >
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 5260001..ca66d9a 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -166,31 +166,25 @@ struct rte_mbuf {
> >  	struct rte_mempool *pool; /**< Pool from which mbuf was allocated.
> */
> >  	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> >
> > -	union {
> > -		uint8_t metadata[0];
> > -		uint16_t metadata16[0];
> > -		uint32_t metadata32[0];
> > -		uint64_t metadata64[0];
> > -	} __rte_cache_aligned;
> >  } __rte_cache_aligned;
> >
> >  #define RTE_MBUF_METADATA_UINT8(mbuf, offset)              \
> > -	(mbuf->metadata[offset])
> > +	(((uint8_t *)&(mbuf)[1])[offset])
> >  #define RTE_MBUF_METADATA_UINT16(mbuf, offset)             \
> > -	(mbuf->metadata16[offset/sizeof(uint16_t)])
> > +	(((uint16_t *)&(mbuf)[1])[offset/sizeof(uint16_t)])
> >  #define RTE_MBUF_METADATA_UINT32(mbuf, offset)             \
> > -	(mbuf->metadata32[offset/sizeof(uint32_t)])
> > +	(((uint32_t *)&(mbuf)[1])[offset/sizeof(uint32_t)])
> >  #define RTE_MBUF_METADATA_UINT64(mbuf, offset)             \
> > -	(mbuf->metadata64[offset/sizeof(uint64_t)])
> > +	(((uint64_t *)&(mbuf)[1])[offset/sizeof(uint64_t)])
> >
> >  #define RTE_MBUF_METADATA_UINT8_PTR(mbuf, offset)          \
> > -	(&mbuf->metadata[offset])
> > +	(&RTE_MBUF_METADATA_UINT8(mbuf, offset))
> >  #define RTE_MBUF_METADATA_UINT16_PTR(mbuf, offset)         \
> > -	(&mbuf->metadata16[offset/sizeof(uint16_t)])
> > +	(&RTE_MBUF_METADATA_UINT16(mbuf, offset))
> >  #define RTE_MBUF_METADATA_UINT32_PTR(mbuf, offset)         \
> > -	(&mbuf->metadata32[offset/sizeof(uint32_t)])
> > +	(&RTE_MBUF_METADATA_UINT32(mbuf, offset))
> >  #define RTE_MBUF_METADATA_UINT64_PTR(mbuf, offset)         \
> > -	(&mbuf->metadata64[offset/sizeof(uint64_t)])
> > +	(&RTE_MBUF_METADATA_UINT64(mbuf, offset))
> >
> >  /**
> >   * Given the buf_addr returns the pointer to corresponding mbuf.
> >
> 
> I think it goes in the good direction. So:
> Acked-by: Olivier Matz <olivier.matz@6wind.com>
> 
> Just one question: why not removing RTE_MBUF_METADATA*() macros?
> I'd just provide one macro that gives a (void*) to the first byte
> after the mbuf structure.
> 
> The format of the metadata is up to the application, that usually
> casts (m + 1) into a private structure, making the macros not very
> useful. I suggest to move these macros outside rte_mbuf.h, in the
> application-specific or library-specific header, what do you think?
> 
> Regards,
> Olivier
> 
Yes, I'll look into that.

/Bruce

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field
       [not found]           ` <540EB428.9060706@6wind.com>
  2014-09-09  8:45             ` Zhang, Helin
@ 2014-09-09  9:47             ` Richardson, Bruce
  1 sibling, 0 replies; 62+ messages in thread
From: Richardson, Bruce @ 2014-09-09  9:47 UTC (permalink / raw)
  To: Olivier MATZ, Zhang, Helin, Yerden Zhumabekov, dev



> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Tuesday, September 09, 2014 9:03 AM
> To: Zhang, Helin; Yerden Zhumabekov; Richardson, Bruce; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field
> 
> Hello,
> 
> On 09/09/2014 05:59 AM, Zhang, Helin wrote:
> > It is a common field which i40e PMD will use it to store the 'packet
> > type ID'. i40e hardware can recognize more than a hundred of packet
> > types of received packets, this is quite useful for upper layer stack
> > or application. So this field is quite useful and will be filled by PMD.
> > In ixgbe/igb, it has less than 10 packet types which are marked in
> > offload flags. From now on, it would be better to have new field here
> > to put the hardware offloaded packet type in and it could be used for
> > future NICs.
> >
> >>
> >> I'm not saying this field is useless. But even if it's useful for some
> >> applications like yours, it does not mean that it should go in the
> >> generic mbuf structure.
> >>
> >> Also, for a new field, we should define who is in charge of filling it.
> >> Is it the driver? Does it mean that all drivers have to be modified to
> >> fill it? Or is it just a placeholder for applications? In this case,
> >> shouldn't we use application-specific metadata? In the other direction
> >> (TX), we would also need to define if this field must be filled by the
> >> application before transmitting a mbuf to a driver.
> > Yes, PMD will fill it. I40e PMD will be the first one, ixgbe/igb can be
> > kept as it is, or modified to be consistent. It is used for RX side
> > only, and for TX side, it can be investigated to see if it can be used
> > also. I think some new features in development can think of that.
> > Anyway, it is a quite useful field for i40e and future generation of NICs.
> 
> To me, having the support in a hardware for that feature is not a
> sufficient reason for adding this field. There are many hardware
> features that will never be integrated in dpdk.
> 
> This first version of the patch:
> 
> - just adds a field that is not used by any code, so it is useless.
>    At least testpmd or an application example should show how to
>    use it.
> 
> - does not describe what enhancement is provided by adding the
>    field (performance? in this case, numbers + use case would help
>    to convince people).
> 
> - does not describe what can be the content of the field. Is it
>    a protocol number?
> 
> - does not explain if all drivers must fill this field. If yes,
>    the patch has to update all drivers. If not, something must be
>    done to mark the packet field as unknown by default.
> 
> Regards,
> Olivier

Hi,

Points taken. Really, this patch doesn't belong in this set as I had planned things; it belongs better in patch set 3 (coming soon, I hope), which should propose the new field additions. I simply put it here to avoid having to start renumbering and renaming reserved fields in the structure, though perhaps the renumbering would have been the lesser of the two evils.

However, with regard to adding new fields, I would like to have some agreement that I can add fields without actually pushing in the patch that uses them - so long as sufficient rationale is provided for the field and there is a change pending soon that will actually use it. This patch did not meet that criterion for explanation, but if updated to do so, I would like to have it accepted on the basis of that explanation, so as to enable those working on the drivers to make use of it.

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field
  2014-09-08 11:17       ` Olivier MATZ
  2014-09-09  3:59         ` Zhang, Helin
@ 2014-09-09 15:05         ` Jim Thompson
  1 sibling, 0 replies; 62+ messages in thread
From: Jim Thompson @ 2014-09-09 15:05 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev


> On Sep 8, 2014, at 4:17 AM, Olivier MATZ <olivier.matz@6wind.com> wrote:
> 
> Hi Yerden,
> 
> On 09/08/2014 12:33 PM, Yerden Zhumabekov wrote:
>> 08.09.2014 16:17, Olivier MATZ wrote:
>>>> --- a/lib/librte_mbuf/rte_mbuf.h
>>>> +++ b/lib/librte_mbuf/rte_mbuf.h
>>>> @@ -146,7 +146,7 @@ struct rte_mbuf {
>>>> 	uint32_t reserved1;     /**< Unused field. Required for padding */
>>>> 
>>>> 	/* remaining bytes are set on RX when pulling packet from descriptor */
>>>> -	uint16_t reserved2;     /**< Unused field. Required for padding */
>>>> +	uint16_t packet_type;   /**< Type of packet, e.g. protocols used */
>>>> 	uint16_t data_len;      /**< Amount of data in segment buffer. */
>>>> 	uint32_t pkt_len;       /**< Total pkt len: sum of all segments. */
>>>> 	uint16_t l3_len:9;      /**< L3 (IP) Header Length. */
>>>> 
>>> This patch adds a new fields that nobody uses. So why should we add it ?
>> 
>> I would use it :)
>> It's useful to store the IP protocol number (UDP, TCP etc) and version
>> of IP (4, 6) and then relay packet to specific handler.
> 
> I'm not saying this field is useless. But even if it's useful
> for some applications like yours, it does not mean that it should go in
> the generic mbuf structure.
> 
> Also, for a new field, we should define who is in charge of filling it.
> Is it the driver? Does it mean that all drivers have to be modified to
> fill it? Or is it just a placeholder for applications? In this case,
> shouldn't we use application-specific metadata? In the other direction
> (TX), we would also need to define if this field must be filled by the
> application before transmitting a mbuf to a driver.


Funny, but these new fields (and extended mbuf) were prominent during the dpdk summit.

I think it’s going to be quite useful.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata
  2014-09-08 12:05   ` Olivier MATZ
  2014-09-09  9:01     ` Richardson, Bruce
@ 2014-09-10 15:09     ` Bruce Richardson
  2014-09-10 15:31       ` Olivier MATZ
  1 sibling, 1 reply; 62+ messages in thread
From: Bruce Richardson @ 2014-09-10 15:09 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev

On Mon, Sep 08, 2014 at 02:05:41PM +0200, Olivier MATZ wrote:
> Hi Bruce,
> 
> On 09/03/2014 05:49 PM, Bruce Richardson wrote:
> > Removed the explicit zero-sized metadata definition at the end of the
> > mbuf data structure. Updated the metadata macros to take account of this
> > change so that all existing code which uses those macros still works.
> > 
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > ---
> >  lib/librte_mbuf/rte_mbuf.h | 22 ++++++++--------------
> >  1 file changed, 8 insertions(+), 14 deletions(-)
> > 
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 5260001..ca66d9a 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -166,31 +166,25 @@ struct rte_mbuf {
> >  	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
> >  	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> >  
> > -	union {
> > -		uint8_t metadata[0];
> > -		uint16_t metadata16[0];
> > -		uint32_t metadata32[0];
> > -		uint64_t metadata64[0];
> > -	} __rte_cache_aligned;
> >  } __rte_cache_aligned;
> >  
> >  #define RTE_MBUF_METADATA_UINT8(mbuf, offset)              \
> > -	(mbuf->metadata[offset])
> > +	(((uint8_t *)&(mbuf)[1])[offset])
> >  #define RTE_MBUF_METADATA_UINT16(mbuf, offset)             \
> > -	(mbuf->metadata16[offset/sizeof(uint16_t)])
> > +	(((uint16_t *)&(mbuf)[1])[offset/sizeof(uint16_t)])
> >  #define RTE_MBUF_METADATA_UINT32(mbuf, offset)             \
> > -	(mbuf->metadata32[offset/sizeof(uint32_t)])
> > +	(((uint32_t *)&(mbuf)[1])[offset/sizeof(uint32_t)])
> >  #define RTE_MBUF_METADATA_UINT64(mbuf, offset)             \
> > -	(mbuf->metadata64[offset/sizeof(uint64_t)])
> > +	(((uint64_t *)&(mbuf)[1])[offset/sizeof(uint64_t)])
> >  
> >  #define RTE_MBUF_METADATA_UINT8_PTR(mbuf, offset)          \
> > -	(&mbuf->metadata[offset])
> > +	(&RTE_MBUF_METADATA_UINT8(mbuf, offset))
> >  #define RTE_MBUF_METADATA_UINT16_PTR(mbuf, offset)         \
> > -	(&mbuf->metadata16[offset/sizeof(uint16_t)])
> > +	(&RTE_MBUF_METADATA_UINT16(mbuf, offset))
> >  #define RTE_MBUF_METADATA_UINT32_PTR(mbuf, offset)         \
> > -	(&mbuf->metadata32[offset/sizeof(uint32_t)])
> > +	(&RTE_MBUF_METADATA_UINT32(mbuf, offset))
> >  #define RTE_MBUF_METADATA_UINT64_PTR(mbuf, offset)         \
> > -	(&mbuf->metadata64[offset/sizeof(uint64_t)])
> > +	(&RTE_MBUF_METADATA_UINT64(mbuf, offset))
> >  
> >  /**
> >   * Given the buf_addr returns the pointer to corresponding mbuf.
> > 
> 
> I think it goes in the good direction. So:
> Acked-by: Olivier Matz <olivier.matz@6wind.com>
> 
> Just one question: why not removing RTE_MBUF_METADATA*() macros?
> I'd just provide one macro that gives a (void*) to the first byte
> after the mbuf structure.
> 
> The format of the metadata is up to the application, that usually
> casts (m + 1) into a private structure, making the macros not very
> useful. I suggest to move these macros outside rte_mbuf.h, in the
> application-specific or library-specific header, what do you think?
> 

Things look to work if I just move the definitions en-masse into
rte_port.h. Is that the sort of change you would be thinking of? I was
wondering about replacing the typed macros here with a generic one to
access just beyond the definition of the mbuf structure, but on further
thought, I believe that using the buf_addr pointer in the mbuf data
structure is probably enough for most applications. [An alternative to
moving the definitions into rte_port.h is to move them into rte_table.h
and have port_frag.c use the buf_addr pointer instead of a macro to get
at the metadata. All other references to the macros, apart from two in
that port file, are in the tables or in apps that use the tables lib.]
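
For reference, the pattern you describe is then simply (sketch, with a
made-up application struct):

    struct app_meta {       /* application-defined layout */
        uint32_t flow_id;
        uint8_t  color;
    };

    /* the metadata lives in the bytes directly after the mbuf header */
    struct app_meta *meta = (struct app_meta *)(m + 1);

    meta->flow_id = 42;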
What do you think?

/Bruce

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata
  2014-09-10 15:09     ` Bruce Richardson
@ 2014-09-10 15:31       ` Olivier MATZ
  0 siblings, 0 replies; 62+ messages in thread
From: Olivier MATZ @ 2014-09-10 15:31 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

Hi Bruce,

On 09/10/2014 05:09 PM, Bruce Richardson wrote:
>> Just one question: why not removing RTE_MBUF_METADATA*() macros?
>> I'd just provide one macro that gives a (void*) to the first byte
>> after the mbuf structure.
>>
>> The format of the metadata is up to the application, that usually
>> casts (m + 1) into a private structure, making the macros not very
>> useful. I suggest to move these macros outside rte_mbuf.h, in the
>> application-specific or library-specific header, what do you think?
>>
>
> Things look to work if I just move the definitions en-masse into
> rte_port.h. Is that the sort of change you would be thinking of?
> I was wondering about replacing the typed macros here with a
> generic one to access just beyond the definition of the mbuf
> structure, but on further thought, I believe that using the
> buf_addr pointer in the mbuf data structure is probably enough
> for most applications. [An alternative to moving the definitions
> into rte_port.h is to move them into rte_table.h and have
> port_frag.c use the buf_addr pointer instead of a macro to get at
> the metadata. All other references to the macros, apart from two
> in that port file, are in the tables or in apps that use the
> tables lib.]
> What do you think?

Yes, moving the macros in rte_port.h looks good to me, since
the libraries using rte_ports are the users of this specific
metadata format.

Thanks!
Olivier

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2
  2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                   ` (12 preceding siblings ...)
  2014-09-03 15:49 ` [dpdk-dev] [PATCH 13/13] ixgbe: Improve slow-path perf: vector scattered RX Bruce Richardson
@ 2014-09-11 13:15 ` Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 01/13] mbuf: replace data pointer by an offset Bruce Richardson
                     ` (13 more replies)
  13 siblings, 14 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-11 13:15 UTC (permalink / raw)
  To: dev

This patch set continues on from the changes in part 1, and depends
upon that patch set.

This patch set reorders the fields in the mbuf structure and splits
the structure across two cache lines, giving lots of new space for new
fields to be added. This set uses some of that space by expanding the 
ol_flags field. A part 3 patchset is planned to introduce some other
new fields into the new mbuf structure.

With the splitting of the mbuf across multiple cache lines, performance
degradations are seen inside the drivers, both fast-path and slow path.
For the fast-path, this patchset reworks the way in which the pool pointer
is used to free packets post-TX, which removes the perf regression. For
the slow path, an alternative approach is taken - a new scattered packets
RX function is introduced into the vector PMD. Using this function,
throughput for the slow path RX-TX using testpmd is increased by up to 20%
over the original baseline.

Changes in V2:
* General minor updates following comments on the V1 set
* Updated a number of patches to include KNI mbuf changes where needed
* Deferred the patch to add the packet type field to a future patch set
* After removing the meta-data element from the mbuf structure, also added
  a patch to move the macros to rte_port/rte_port.h

Bruce Richardson (11):
  mbuf: reorder fields by time of use
  mbuf: expand ol_flags field to 64-bits
  mbuf: introduce a flag to indicate a control mbuf
  mbuf: minor changes for readability
  mbuf: use macros only to access the mbuf metadata
  mbuf: add named points inside the mbuf structure
  ixgbe: rework vector pmd following mbuf changes
  mbuf: split mbuf across two cache lines.
  mbuf: move l2_len and l3_len to second cache line
  ixgbe: Fix perf regression due to moved pool ptr
  ixgbe: Improve slow-path perf: vector scattered RX

Olivier Matz (1):
  mbuf: replace data pointer by an offset

 app/test-pmd/config.c                              |   8 +-
 app/test-pmd/csumonly.c                            |  10 +-
 app/test-pmd/flowgen.c                             |   2 +-
 app/test-pmd/icmpecho.c                            |   2 +-
 app/test-pmd/ieee1588fwd.c                         |   4 +-
 app/test-pmd/macfwd-retry.c                        |   2 +-
 app/test-pmd/macfwd.c                              |   2 +-
 app/test-pmd/macswap.c                             |   2 +-
 app/test-pmd/rxonly.c                              |   4 +-
 app/test-pmd/testpmd.c                             |   2 +-
 app/test-pmd/testpmd.h                             |   4 +-
 app/test-pmd/txonly.c                              |   9 +-
 app/test/packet_burst_generator.c                  |   7 +-
 app/test/test_mbuf.c                               |   8 +-
 app/test/test_table_acl.c                          |   7 +-
 app/test/test_table_pipeline.c                     |   8 +-
 examples/exception_path/main.c                     |   3 +-
 examples/vhost/main.c                              |  37 +-
 examples/vhost_xen/main.c                          |  14 +-
 .../linuxapp/eal/include/exec-env/rte_kni_common.h |  20 +-
 lib/librte_eal/linuxapp/kni/kni_net.c              |  18 +-
 lib/librte_ip_frag/rte_ipv4_fragmentation.c        |   6 +-
 lib/librte_ip_frag/rte_ipv6_fragmentation.c        |   6 +-
 lib/librte_mbuf/rte_mbuf.c                         |  10 +-
 lib/librte_mbuf/rte_mbuf.h                         | 139 ++++----
 lib/librte_pmd_bond/rte_eth_bond_pmd.c             |   4 +-
 lib/librte_pmd_e1000/em_rxtx.c                     |  46 ++-
 lib/librte_pmd_e1000/igb_rxtx.c                    |  75 ++--
 lib/librte_pmd_i40e/i40e_rxtx.c                    |  48 +--
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c                  | 100 +++---
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h                  |  24 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c              | 389 +++++++++++++--------
 lib/librte_pmd_pcap/rte_eth_pcap.c                 |   9 +-
 lib/librte_pmd_virtio/virtio_rxtx.c                |  10 +-
 lib/librte_pmd_virtio/virtqueue.h                  |   3 +-
 lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c              |   5 +-
 lib/librte_pmd_xenvirt/rte_eth_xenvirt.c           |   2 +-
 37 files changed, 586 insertions(+), 463 deletions(-)

-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH v2 01/13] mbuf: replace data pointer by an offset
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
@ 2014-09-11 13:15   ` Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 02/13] mbuf: reorder fields by time of use Bruce Richardson
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-11 13:15 UTC (permalink / raw)
  To: dev

From: Olivier Matz <olivier.matz@6wind.com>

Original patch:
 The mbuf structure already contains a pointer to the beginning of the
 buffer (m->buf_addr), so there is no need to spend another 8 bytes on a
 second pointer to the beginning of the data.

 Using a 16-bit unsigned integer is enough, as we know that an mbuf is
 never longer than 64KB. We gain 6 bytes in the structure thanks to
 this modification.
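
 In other words (illustrative sketch, not part of the patch itself):

  /* before: the mbuf carried a second pointer into its own buffer */
  char *start_old = m->data;

  /* after: the start of data is derived from the buffer base, which
   * is what rte_pktmbuf_mtod() now expands to */
  char *start_new = (char *)m->buf_addr + m->data_off;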

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>

This version:
* Updated original patch to apply to latest on mainline.
* Disabled vector PMD in config as it relies heavily on the mbuf layout.
  This will be re-enabled in a subsequent commit once vPMD has been
  reworked to take account of mbuf changes.

Updated in V2:
* Updated the KNI buffer structure to match that of new mbuf
* fixed checkpatch errors.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 app/test-pmd/csumonly.c                            |  2 +-
 app/test-pmd/flowgen.c                             |  2 +-
 app/test-pmd/icmpecho.c                            |  2 +-
 app/test-pmd/ieee1588fwd.c                         |  4 +--
 app/test-pmd/macfwd-retry.c                        |  2 +-
 app/test-pmd/macfwd.c                              |  2 +-
 app/test-pmd/macswap.c                             |  2 +-
 app/test-pmd/rxonly.c                              |  2 +-
 app/test-pmd/testpmd.c                             |  2 +-
 app/test-pmd/txonly.c                              |  7 ++--
 app/test/packet_burst_generator.c                  |  7 ++--
 app/test/test_mbuf.c                               |  6 ++--
 app/test/test_table_acl.c                          |  7 ++--
 app/test/test_table_pipeline.c                     |  8 ++---
 config/common_linuxapp                             |  2 +-
 examples/exception_path/main.c                     |  3 +-
 examples/vhost/main.c                              | 37 ++++++++++---------
 examples/vhost_xen/main.c                          | 14 ++++----
 .../linuxapp/eal/include/exec-env/rte_kni_common.h | 10 +++---
 lib/librte_eal/linuxapp/kni/kni_net.c              | 18 ++++++----
 lib/librte_ip_frag/rte_ipv4_fragmentation.c        |  6 ++--
 lib/librte_ip_frag/rte_ipv6_fragmentation.c        |  6 ++--
 lib/librte_mbuf/rte_mbuf.c                         |  6 ++--
 lib/librte_mbuf/rte_mbuf.h                         | 42 ++++++++++------------
 lib/librte_pmd_bond/rte_eth_bond_pmd.c             |  4 +--
 lib/librte_pmd_e1000/em_rxtx.c                     | 12 +++----
 lib/librte_pmd_e1000/igb_rxtx.c                    | 13 ++++---
 lib/librte_pmd_i40e/i40e_rxtx.c                    | 17 +++++----
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c                  | 13 +++----
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h                  |  3 +-
 lib/librte_pmd_pcap/rte_eth_pcap.c                 |  9 +++--
 lib/librte_pmd_virtio/virtio_rxtx.c                | 10 +++---
 lib/librte_pmd_virtio/virtqueue.h                  |  3 +-
 lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c              |  5 ++-
 lib/librte_pmd_xenvirt/rte_eth_xenvirt.c           |  2 +-
 35 files changed, 146 insertions(+), 144 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 28b66f5..2ce4c42 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -263,7 +263,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 		pkt_ol_flags = mb->ol_flags;
 		ol_flags = (uint16_t) (pkt_ol_flags & (~PKT_TX_L4_MASK));
 
-		eth_hdr = (struct ether_hdr *) mb->data;
+		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		eth_type = rte_be_to_cpu_16(eth_hdr->ether_type);
 		if (eth_type == ETHER_TYPE_VLAN) {
 			/* Only allow single VLAN label here */
diff --git a/app/test-pmd/flowgen.c b/app/test-pmd/flowgen.c
index b091b6d..8b4ed9a 100644
--- a/app/test-pmd/flowgen.c
+++ b/app/test-pmd/flowgen.c
@@ -175,7 +175,7 @@ pkt_burst_flow_gen(struct fwd_stream *fs)
 		pkt->next = NULL;
 
 		/* Initialize Ethernet header. */
-		eth_hdr = (struct ether_hdr *)pkt->data;
+		eth_hdr = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
 		ether_addr_copy(&cfg_ether_dst, &eth_hdr->d_addr);
 		ether_addr_copy(&cfg_ether_src, &eth_hdr->s_addr);
 		eth_hdr->ether_type = rte_cpu_to_be_16(ETHER_TYPE_IPv4);
diff --git a/app/test-pmd/icmpecho.c b/app/test-pmd/icmpecho.c
index 4a277b8..7fd4b6d 100644
--- a/app/test-pmd/icmpecho.c
+++ b/app/test-pmd/icmpecho.c
@@ -330,7 +330,7 @@ reply_to_icmp_echo_rqsts(struct fwd_stream *fs)
 	nb_replies = 0;
 	for (i = 0; i < nb_rx; i++) {
 		pkt = pkts_burst[i];
-		eth_h = (struct ether_hdr *) pkt->data;
+		eth_h = rte_pktmbuf_mtod(pkt, struct ether_hdr *);
 		eth_type = RTE_BE_TO_CPU_16(eth_h->ether_type);
 		l2_len = sizeof(struct ether_hdr);
 		if (verbose_level > 0) {
diff --git a/app/test-pmd/ieee1588fwd.c b/app/test-pmd/ieee1588fwd.c
index ab5e06e..976aa28 100644
--- a/app/test-pmd/ieee1588fwd.c
+++ b/app/test-pmd/ieee1588fwd.c
@@ -546,7 +546,7 @@ ieee1588_packet_fwd(struct fwd_stream *fs)
 	 * Check that the received packet is a PTP packet that was detected
 	 * by the hardware.
 	 */
-	eth_hdr = (struct ether_hdr *)mb->data;
+	eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 	eth_type = rte_be_to_cpu_16(eth_hdr->ether_type);
 	if (! (mb->ol_flags & PKT_RX_IEEE1588_PTP)) {
 		if (eth_type == ETHER_TYPE_1588) {
@@ -574,7 +574,7 @@ ieee1588_packet_fwd(struct fwd_stream *fs)
 	 * Check that the received PTP packet is a PTP V2 packet of type
 	 * PTP_SYNC_MESSAGE.
 	 */
-	ptp_hdr = (struct ptpv2_msg *) ((char *) mb->data +
+	ptp_hdr = (struct ptpv2_msg *) (rte_pktmbuf_mtod(mb, char *) +
 					sizeof(struct ether_hdr));
 	if (ptp_hdr->version != 0x02) {
 		printf("Port %u Received PTP V2 Ethernet frame with wrong PTP"
diff --git a/app/test-pmd/macfwd-retry.c b/app/test-pmd/macfwd-retry.c
index 5122983..83da26f 100644
--- a/app/test-pmd/macfwd-retry.c
+++ b/app/test-pmd/macfwd-retry.c
@@ -119,7 +119,7 @@ pkt_burst_mac_retry_forward(struct fwd_stream *fs)
 	fs->rx_packets += nb_rx;
 	for (i = 0; i < nb_rx; i++) {
 		mb = pkts_burst[i];
-		eth_hdr = (struct ether_hdr *) mb->data;
+		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		ether_addr_copy(&peer_eth_addrs[fs->peer_addr],
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 4b905cd..38bae23 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -110,7 +110,7 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 	txp = &ports[fs->tx_port];
 	for (i = 0; i < nb_rx; i++) {
 		mb = pkts_burst[i];
-		eth_hdr = (struct ether_hdr *) mb->data;
+		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		ether_addr_copy(&peer_eth_addrs[fs->peer_addr],
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
diff --git a/app/test-pmd/macswap.c b/app/test-pmd/macswap.c
index c5b3a0c..1786095 100644
--- a/app/test-pmd/macswap.c
+++ b/app/test-pmd/macswap.c
@@ -110,7 +110,7 @@ pkt_burst_mac_swap(struct fwd_stream *fs)
 	txp = &ports[fs->tx_port];
 	for (i = 0; i < nb_rx; i++) {
 		mb = pkts_burst[i];
-		eth_hdr = (struct ether_hdr *) mb->data;
+		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 
 		/* Swap dest and src mac addresses. */
 		ether_addr_copy(&eth_hdr->d_addr, &addr);
diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index 5bc74da..7ba36a1 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -149,7 +149,7 @@ pkt_burst_receive(struct fwd_stream *fs)
 			rte_pktmbuf_free(mb);
 			continue;
 		}
-		eth_hdr = (struct ether_hdr *) mb->data;
+		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		ol_flags = mb->ol_flags;
 		print_ether_addr("  src=", &eth_hdr->s_addr);
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index b426dfc..9f6cdc4 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -404,7 +404,7 @@ testpmd_mbuf_ctor(struct rte_mempool *mp,
 			mb_ctor_arg->seg_buf_offset);
 	mb->buf_len      = mb_ctor_arg->seg_buf_size;
 	mb->ol_flags     = 0;
-	mb->data         = (char *) mb->buf_addr + RTE_PKTMBUF_HEADROOM;
+	mb->data_off     = RTE_PKTMBUF_HEADROOM;
 	mb->nb_segs      = 1;
 	mb->l2_l3_len       = 0;
 	mb->vlan_tci     = 0;
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index 8135264..2b3f0b9 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -111,13 +111,13 @@ copy_buf_to_pkt_segs(void* buf, unsigned len, struct rte_mbuf *pkt,
 		seg = seg->next;
 	}
 	copy_len = seg->data_len - offset;
-	seg_buf = ((char *) seg->data + offset);
+	seg_buf = (rte_pktmbuf_mtod(seg, char *) + offset);
 	while (len > copy_len) {
 		rte_memcpy(seg_buf, buf, (size_t) copy_len);
 		len -= copy_len;
 		buf = ((char*) buf + copy_len);
 		seg = seg->next;
-		seg_buf = seg->data;
+		seg_buf = rte_pktmbuf_mtod(seg, char *);
 	}
 	rte_memcpy(seg_buf, buf, (size_t) len);
 }
@@ -126,7 +126,8 @@ static inline void
 copy_buf_to_pkt(void* buf, unsigned len, struct rte_mbuf *pkt, unsigned offset)
 {
 	if (offset + len <= pkt->data_len) {
-		rte_memcpy(((char *) pkt->data + offset), buf, (size_t) len);
+		rte_memcpy((rte_pktmbuf_mtod(pkt, char *) + offset),
+			buf, (size_t) len);
 		return;
 	}
 	copy_buf_to_pkt_segs(buf, len, pkt, offset);
diff --git a/app/test/packet_burst_generator.c b/app/test/packet_burst_generator.c
index db7f023..9e747a4 100644
--- a/app/test/packet_burst_generator.c
+++ b/app/test/packet_burst_generator.c
@@ -59,13 +59,13 @@ copy_buf_to_pkt_segs(void *buf, unsigned len, struct rte_mbuf *pkt,
 		seg = seg->next;
 	}
 	copy_len = seg->data_len - offset;
-	seg_buf = ((char *) seg->data + offset);
+	seg_buf = rte_pktmbuf_mtod(seg, char *) + offset;
 	while (len > copy_len) {
 		rte_memcpy(seg_buf, buf, (size_t) copy_len);
 		len -= copy_len;
 		buf = ((char *) buf + copy_len);
 		seg = seg->next;
-		seg_buf = seg->data;
+		seg_buf = rte_pktmbuf_mtod(seg, void *);
 	}
 	rte_memcpy(seg_buf, buf, (size_t) len);
 }
@@ -74,7 +74,8 @@ static inline void
 copy_buf_to_pkt(void *buf, unsigned len, struct rte_mbuf *pkt, unsigned offset)
 {
 	if (offset + len <= pkt->data_len) {
-		rte_memcpy(((char *) pkt->data + offset), buf, (size_t) len);
+		rte_memcpy(rte_pktmbuf_mtod(pkt, char *) + offset,
+				buf, (size_t) len);
 		return;
 	}
 	copy_buf_to_pkt_segs(buf, len, pkt, offset);
diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index b81e622..1b25481 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -432,7 +432,7 @@ test_pktmbuf_pool_ptr(void)
 			printf("rte_pktmbuf_alloc() failed (%u)\n", i);
 			ret = -1;
 		}
-		m[i]->data = RTE_PTR_ADD(m[i]->data, 64);
+		m[i]->data_off += 64;
 	}
 
 	/* free them */
@@ -451,8 +451,8 @@ test_pktmbuf_pool_ptr(void)
 			printf("rte_pktmbuf_alloc() failed (%u)\n", i);
 			ret = -1;
 		}
-		if (m[i]->data != RTE_PTR_ADD(m[i]->buf_addr, RTE_PKTMBUF_HEADROOM)) {
-			printf ("data pointer not set properly\n");
+		if (m[i]->data_off != RTE_PKTMBUF_HEADROOM) {
+			printf("invalid data_off\n");
 			ret = -1;
 		}
 	}
diff --git a/app/test/test_table_acl.c b/app/test/test_table_acl.c
index 4db680a..0f2b57e 100644
--- a/app/test/test_table_acl.c
+++ b/app/test/test_table_acl.c
@@ -513,7 +513,7 @@ test_pipeline_single_filter(int expected_count)
 			struct rte_mbuf *mbuf;
 
 			mbuf = rte_pktmbuf_alloc(pool);
-			memset(mbuf->data, 0x00,
+			memset(rte_pktmbuf_mtod(mbuf, char *), 0x00,
 				sizeof(struct ipv4_5tuple));
 
 			five_tuple.proto = j;
@@ -522,7 +522,7 @@ test_pipeline_single_filter(int expected_count)
 			five_tuple.port_src = rte_bswap16(100 + j);
 			five_tuple.port_dst = rte_bswap16(200 + j);
 
-			memcpy(mbuf->data, &five_tuple,
+			memcpy(rte_pktmbuf_mtod(mbuf, char *), &five_tuple,
 				sizeof(struct ipv4_5tuple));
 			RTE_LOG(INFO, PIPELINE, "%s: Enqueue onto ring %d\n",
 				__func__, i);
@@ -549,7 +549,8 @@ test_pipeline_single_filter(int expected_count)
 			printf("Got %d object(s) from ring %d!\n", ret, i);
 			for (j = 0; j < ret; j++) {
 				mbuf = (struct rte_mbuf *)objs[j];
-				rte_hexdump(stdout, "mbuf", mbuf->data, 64);
+				rte_hexdump(stdout, "mbuf",
+					rte_pktmbuf_mtod(mbuf, char *), 64);
 				rte_pktmbuf_free(mbuf);
 			}
 			tx_count += ret;
diff --git a/app/test/test_table_pipeline.c b/app/test/test_table_pipeline.c
index 15a038b..a0a9e04 100644
--- a/app/test/test_table_pipeline.c
+++ b/app/test/test_table_pipeline.c
@@ -39,11 +39,6 @@
 #include "test_table.h"
 #include "test_table_pipeline.h"
 
-#define RTE_CBUF_UINT8_PTR(cbuf, offset)			\
-	(&cbuf->data[offset])
-#define RTE_CBUF_UINT32_PTR(cbuf, offset)			\
-	(&cbuf->data32[offset/sizeof(uint32_t)])
-
 #if 0
 
 static rte_pipeline_port_out_action_handler port_action_0x00
@@ -498,7 +493,8 @@ test_pipeline_single_filter(int test_type, int expected_count)
 			printf("Got %d object(s) from ring %d!\n", ret, i);
 			for (j = 0; j < ret; j++) {
 				mbuf = (struct rte_mbuf *)objs[j];
-				rte_hexdump(stdout, "Object:", mbuf->data,
+				rte_hexdump(stdout, "Object:",
+					rte_pktmbuf_mtod(mbuf, char *),
 					mbuf->data_len);
 				rte_pktmbuf_free(mbuf);
 			}
diff --git a/config/common_linuxapp b/config/common_linuxapp
index 5bee910..b140af7 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -191,7 +191,7 @@ CONFIG_RTE_LIBRTE_IXGBE_DEBUG_DRIVER=n
 CONFIG_RTE_LIBRTE_IXGBE_PF_DISABLE_STRIP_CRC=n
 CONFIG_RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC=y
 CONFIG_RTE_LIBRTE_IXGBE_ALLOW_UNSUPPORTED_SFP=n
-CONFIG_RTE_IXGBE_INC_VECTOR=y
+CONFIG_RTE_IXGBE_INC_VECTOR=n
 CONFIG_RTE_IXGBE_RX_OLFLAGS_ENABLE=y
 
 #
diff --git a/examples/exception_path/main.c b/examples/exception_path/main.c
index 5045ef8..f286bf2 100644
--- a/examples/exception_path/main.c
+++ b/examples/exception_path/main.c
@@ -302,7 +302,8 @@ main_loop(__attribute__((unused)) void *arg)
 			if (m == NULL)
 				continue;
 
-			ret = read(tap_fd, m->data, MAX_PACKET_SZ);
+			ret = read(tap_fd, rte_pktmbuf_mtod(m, void *),
+				MAX_PACKET_SZ);
 			lcore_stats[lcore_id].rx++;
 			if (unlikely(ret < 0)) {
 				FATAL_ERROR("Reading from %s interface failed",
diff --git a/examples/vhost/main.c b/examples/vhost/main.c
index 4e1c103..85ee8b8 100644
--- a/examples/vhost/main.c
+++ b/examples/vhost/main.c
@@ -1033,7 +1033,7 @@ virtio_dev_rx(struct virtio_net *dev, struct rte_mbuf **pkts, uint32_t count)
 
 		/* Copy mbuf data to buffer */
 		rte_memcpy((void *)(uintptr_t)buff_addr,
-			(const void *)buff->data,
+			rte_pktmbuf_mtod(buff, const void *),
 			rte_pktmbuf_data_len(buff));
 		PRINT_PACKET(dev, (uintptr_t)buff_addr,
 			rte_pktmbuf_data_len(buff), 0);
@@ -1438,7 +1438,7 @@ link_vmdq(struct virtio_net *dev, struct rte_mbuf *m)
 	int i, ret;
 
 	/* Learn MAC address of guest device from packet */
-	pkt_hdr = (struct ether_hdr *)m->data;
+	pkt_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
 
 	dev_ll = ll_root_used;
 
@@ -1525,7 +1525,7 @@ virtio_tx_local(struct virtio_net *dev, struct rte_mbuf *m)
 	struct ether_hdr *pkt_hdr;
 	uint64_t ret = 0;
 
-	pkt_hdr = (struct ether_hdr *)m->data;
+	pkt_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
 
 	/*get the used devices list*/
 	dev_ll = ll_root_used;
@@ -1593,7 +1593,7 @@ virtio_tx_route(struct virtio_net* dev, struct rte_mbuf *m, struct rte_mempool *
 	unsigned len, ret, offset = 0;
 	const uint16_t lcore_id = rte_lcore_id();
 	struct virtio_net_data_ll *dev_ll = ll_root_used;
-	struct ether_hdr *pkt_hdr = (struct ether_hdr *)m->data;
+	struct ether_hdr *pkt_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
 
 	/*check if destination is local VM*/
 	if ((vm2vm_mode == VM2VM_SOFTWARE) && (virtio_tx_local(dev, m) == 0))
@@ -1652,18 +1652,21 @@ virtio_tx_route(struct virtio_net* dev, struct rte_mbuf *m, struct rte_mempool *
 	mbuf->nb_segs = m->nb_segs;
 
 	/* Copy ethernet header to mbuf. */
-	rte_memcpy((void*)mbuf->data, (const void*)m->data, ETH_HLEN);
+	rte_memcpy(rte_pktmbuf_mtod(mbuf, void *),
+		rte_pktmbuf_mtod(m, const void *),
+		ETH_HLEN);
 
 
 	/* Setup vlan header. Bytes need to be re-ordered for network with htons()*/
-	vlan_hdr = (struct vlan_ethhdr *) mbuf->data;
+	vlan_hdr = rte_pktmbuf_mtod(mbuf, struct vlan_ethhdr *);
 	vlan_hdr->h_vlan_encapsulated_proto = vlan_hdr->h_vlan_proto;
 	vlan_hdr->h_vlan_proto = htons(ETH_P_8021Q);
 	vlan_hdr->h_vlan_TCI = htons(vlan_tag);
 
 	/* Copy the remaining packet contents to the mbuf. */
-	rte_memcpy((void*) ((uint8_t*)mbuf->data + VLAN_ETH_HLEN),
-		(const void*) ((uint8_t*)m->data + ETH_HLEN), (m->data_len - ETH_HLEN));
+	rte_memcpy((void *)(rte_pktmbuf_mtod(mbuf, uint8_t *) + VLAN_ETH_HLEN),
+		(const void *)(rte_pktmbuf_mtod(m, uint8_t *) + ETH_HLEN),
+		(m->data_len - ETH_HLEN));
 
 	/* Copy the remaining segments for the whole packet. */
 	prev = mbuf;
@@ -1778,7 +1781,7 @@ virtio_dev_tx(struct virtio_net* dev, struct rte_mempool *mbuf_pool)
 		/* Setup dummy mbuf. This is copied to a real mbuf if transmitted out the physical port. */
 		m.data_len = desc->len;
 		m.pkt_len = desc->len;
-		m.data = (void*)(uintptr_t)buff_addr;
+		m.data_off = 0;
 
 		PRINT_PACKET(dev, (uintptr_t)buff_addr, desc->len, 0);
 
@@ -2333,7 +2336,7 @@ attach_rxmbuf_zcp(struct virtio_net *dev)
 	}
 
 	mbuf->buf_addr = (void *)(uintptr_t)(buff_addr - RTE_PKTMBUF_HEADROOM);
-	mbuf->data = (void *)(uintptr_t)(buff_addr);
+	mbuf->data_off = RTE_PKTMBUF_HEADROOM;
 	mbuf->buf_physaddr = phys_addr - RTE_PKTMBUF_HEADROOM;
 	mbuf->data_len = desc->len;
 	MBUF_HEADROOM_UINT32(mbuf) = (uint32_t)desc_idx;
@@ -2370,7 +2373,7 @@ static inline void pktmbuf_detach_zcp(struct rte_mbuf *m)
 
 	buf_ofs = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
 			RTE_PKTMBUF_HEADROOM : m->buf_len;
-	m->data = (char *) m->buf_addr + buf_ofs;
+	m->data_off = buf_ofs;
 
 	m->data_len = 0;
 }
@@ -2604,7 +2607,7 @@ virtio_tx_route_zcp(struct virtio_net *dev, struct rte_mbuf *m,
 	unsigned len, ret, offset = 0;
 	struct vpool *vpool;
 	struct virtio_net_data_ll *dev_ll = ll_root_used;
-	struct ether_hdr *pkt_hdr = (struct ether_hdr *)m->data;
+	struct ether_hdr *pkt_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
 	uint16_t vlan_tag = (uint16_t)vlan_tags[(uint16_t)dev->device_fh];
 
 	/*Add packet to the port tx queue*/
@@ -2681,11 +2684,11 @@ virtio_tx_route_zcp(struct virtio_net *dev, struct rte_mbuf *m,
 	mbuf->pkt_len = mbuf->data_len;
 	if (unlikely(need_copy)) {
 		/* Copy the packet contents to the mbuf. */
-		rte_memcpy((void *)((uint8_t *)mbuf->data),
-			(const void *) ((uint8_t *)m->data),
+		rte_memcpy(rte_pktmbuf_mtod(mbuf, void *),
+			rte_pktmbuf_mtod(m, void *),
 			m->data_len);
 	} else {
-		mbuf->data = m->data;
+		mbuf->data_off = m->data_off;
 		mbuf->buf_physaddr = m->buf_physaddr;
 		mbuf->buf_addr = m->buf_addr;
 	}
@@ -2819,8 +2822,8 @@ virtio_dev_tx_zcp(struct virtio_net *dev)
 		m.data_len = desc->len;
 		m.nb_segs = 1;
 		m.next = NULL;
-		m.data = (void *)(uintptr_t)buff_addr;
-		m.buf_addr = m.data;
+		m.data_off = 0;
+		m.buf_addr = (void *)(uintptr_t)buff_addr;
 		m.buf_physaddr = phys_addr;
 
 		/*
diff --git a/examples/vhost_xen/main.c b/examples/vhost_xen/main.c
index 8162cd8..56ffec8 100644
--- a/examples/vhost_xen/main.c
+++ b/examples/vhost_xen/main.c
@@ -808,7 +808,7 @@ virtio_tx_local(struct virtio_net *dev, struct rte_mbuf *m)
 	struct ether_hdr *pkt_hdr;
 	uint64_t ret = 0;
 
-	pkt_hdr = (struct ether_hdr *)m->data;
+	pkt_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
 
 	/*get the used devices list*/
 	dev_ll = ll_root_used;
@@ -883,18 +883,20 @@ virtio_tx_route(struct virtio_net* dev, struct rte_mbuf *m, struct rte_mempool *
 	mbuf->pkt_len = mbuf->data_len;
 
 	/* Copy ethernet header to mbuf. */
-	rte_memcpy((void*)mbuf->data, (const void*)m->data, ETH_HLEN);
+	rte_memcpy(rte_pktmbuf_mtod(mbuf, void*),
+			rte_pktmbuf_mtod(m, const void*), ETH_HLEN);
 
 
 	/* Setup vlan header. Bytes need to be re-ordered for network with htons()*/
-	vlan_hdr = (struct vlan_ethhdr *) mbuf->data;
+	vlan_hdr = rte_pktmbuf_mtod(mbuf, struct vlan_ethhdr *);
 	vlan_hdr->h_vlan_encapsulated_proto = vlan_hdr->h_vlan_proto;
 	vlan_hdr->h_vlan_proto = htons(ETH_P_8021Q);
 	vlan_hdr->h_vlan_TCI = htons(vlan_tag);
 
 	/* Copy the remaining packet contents to the mbuf. */
-	rte_memcpy((void*) ((uint8_t*)mbuf->data + VLAN_ETH_HLEN),
-		(const void*) ((uint8_t*)m->data + ETH_HLEN), (m->data_len - ETH_HLEN));
+	rte_memcpy((void *)(rte_pktmbuf_mtod(mbuf, uint8_t *) + VLAN_ETH_HLEN),
+		(const void *)(rte_pktmbuf_mtod(m, uint8_t *) + ETH_HLEN),
+		(m->data_len - ETH_HLEN));
 	tx_q->m_table[len] = mbuf;
 	len++;
 	if (enable_stats) {
@@ -981,7 +983,7 @@ virtio_dev_tx(struct virtio_net* dev, struct rte_mempool *mbuf_pool)
 
 		/* Setup dummy mbuf. This is copied to a real mbuf if transmitted out the physical port. */
 		m.data_len = desc->len;
-		m.data = (void*)(uintptr_t)buff_addr;
+		m.data_off = 0;
 		m.nb_segs = 1;
 
 		virtio_tx_route(dev, &m, mbuf_pool, 0);
diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
index d0b82da..07908ac 100644
--- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
+++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
@@ -110,13 +110,13 @@ struct rte_kni_fifo {
 struct rte_kni_mbuf {
 	void *pool;
 	void *buf_addr;
-	char pad0[14];
-	uint16_t ol_flags;      /**< Offload features. */
+	char pad0[16];
 	void *next;
-	void *data;             /**< Start address of data in segment buffer. */
+	uint16_t data_off;      /**< Start address of data in segment buffer. */
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
-	char pad2[2];
-	uint16_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
+	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
+	char pad2[4];
+	uint16_t ol_flags;      /**< Offload features. */
 } __attribute__((__aligned__(64)));
 
 /*
diff --git a/lib/librte_eal/linuxapp/kni/kni_net.c b/lib/librte_eal/linuxapp/kni/kni_net.c
index fa5fd3e..dd95db5 100644
--- a/lib/librte_eal/linuxapp/kni/kni_net.c
+++ b/lib/librte_eal/linuxapp/kni/kni_net.c
@@ -162,7 +162,8 @@ kni_net_rx_normal(struct kni_dev *kni)
 	for (i = 0; i < num; i++) {
 		kva = (void *)va[i] - kni->mbuf_va + kni->mbuf_kva;
 		len = kva->data_len;
-		data_kva = kva->data - kni->mbuf_va + kni->mbuf_kva;
+		data_kva = kva->buf_addr + kva->data_off - kni->mbuf_va
+				+ kni->mbuf_kva;
 
 		skb = dev_alloc_skb(len + 2);
 		if (!skb) {
@@ -246,13 +247,14 @@ kni_net_rx_lo_fifo(struct kni_dev *kni)
 		for (i = 0; i < num; i++) {
 			kva = (void *)va[i] - kni->mbuf_va + kni->mbuf_kva;
 			len = kva->pkt_len;
-			data_kva = kva->data - kni->mbuf_va +
-						kni->mbuf_kva;
+			data_kva = kva->buf_addr + kva->data_off -
+					kni->mbuf_va + kni->mbuf_kva;
 
 			alloc_kva = (void *)alloc_va[i] - kni->mbuf_va +
 							kni->mbuf_kva;
-			alloc_data_kva = alloc_kva->data - kni->mbuf_va +
-								kni->mbuf_kva;
+			alloc_data_kva = alloc_kva->buf_addr +
+					alloc_kva->data_off - kni->mbuf_va +
+							kni->mbuf_kva;
 			memcpy(alloc_data_kva, data_kva, len);
 			alloc_kva->pkt_len = len;
 			alloc_kva->data_len = len;
@@ -321,7 +323,8 @@ kni_net_rx_lo_fifo_skb(struct kni_dev *kni)
 	for (i = 0; i < num; i++) {
 		kva = (void *)va[i] - kni->mbuf_va + kni->mbuf_kva;
 		len = kva->data_len;
-		data_kva = kva->data - kni->mbuf_va + kni->mbuf_kva;
+		data_kva = kva->buf_addr + kva->data_off - kni->mbuf_va +
+				kni->mbuf_kva;
 
 		skb = dev_alloc_skb(len + 2);
 		if (skb == NULL)
@@ -423,7 +426,8 @@ kni_net_tx(struct sk_buff *skb, struct net_device *dev)
 		void *data_kva;
 
 		pkt_kva = (void *)pkt_va - kni->mbuf_va + kni->mbuf_kva;
-		data_kva = pkt_kva->data - kni->mbuf_va + kni->mbuf_kva;
+		data_kva = pkt_kva->buf_addr + pkt_kva->data_off - kni->mbuf_va
+				+ kni->mbuf_kva;
 
 		len = skb->len;
 		memcpy(data_kva, skb->data, len);
diff --git a/lib/librte_ip_frag/rte_ipv4_fragmentation.c b/lib/librte_ip_frag/rte_ipv4_fragmentation.c
index 6b9f07d..a4ed923 100644
--- a/lib/librte_ip_frag/rte_ipv4_fragmentation.c
+++ b/lib/librte_ip_frag/rte_ipv4_fragmentation.c
@@ -109,7 +109,7 @@ rte_ipv4_fragment_packet(struct rte_mbuf *pkt_in,
 	/* Fragment size should be a multiply of 8. */
 	IP_FRAG_ASSERT((frag_size & IPV4_HDR_FO_MASK) == 0);
 
-	in_hdr = (struct ipv4_hdr *) pkt_in->data;
+	in_hdr = rte_pktmbuf_mtod(pkt_in, struct ipv4_hdr *);
 	flag_offset = rte_cpu_to_be_16(in_hdr->fragment_offset);
 
 	/* If Don't Fragment flag is set */
@@ -165,7 +165,7 @@ rte_ipv4_fragment_packet(struct rte_mbuf *pkt_in,
 			if (len > (in_seg->data_len - in_seg_data_pos)) {
 				len = in_seg->data_len - in_seg_data_pos;
 			}
-			out_seg->data = (char*) in_seg->data + (uint16_t)in_seg_data_pos;
+			out_seg->data_off = in_seg->data_off + in_seg_data_pos;
 			out_seg->data_len = (uint16_t)len;
 			out_pkt->pkt_len = (uint16_t)(len +
 			    out_pkt->pkt_len);
@@ -188,7 +188,7 @@ rte_ipv4_fragment_packet(struct rte_mbuf *pkt_in,
 
 		/* Build the IP header */
 
-		out_hdr = (struct ipv4_hdr*) out_pkt->data;
+		out_hdr = rte_pktmbuf_mtod(out_pkt, struct ipv4_hdr *);
 
 		__fill_ipv4hdr_frag(out_hdr, in_hdr,
 		    (uint16_t)out_pkt->pkt_len,
diff --git a/lib/librte_ip_frag/rte_ipv6_fragmentation.c b/lib/librte_ip_frag/rte_ipv6_fragmentation.c
index e007662..4ffcc7c 100644
--- a/lib/librte_ip_frag/rte_ipv6_fragmentation.c
+++ b/lib/librte_ip_frag/rte_ipv6_fragmentation.c
@@ -125,7 +125,7 @@ rte_ipv6_fragment_packet(struct rte_mbuf *pkt_in,
 	    (uint16_t)(pkt_in->pkt_len - sizeof (struct ipv6_hdr))))
 		return (-EINVAL);
 
-	in_hdr = (struct ipv6_hdr *) pkt_in->data;
+	in_hdr = rte_pktmbuf_mtod(pkt_in, struct ipv6_hdr *);
 
 	in_seg = pkt_in;
 	in_seg_data_pos = sizeof(struct ipv6_hdr);
@@ -171,7 +171,7 @@ rte_ipv6_fragment_packet(struct rte_mbuf *pkt_in,
 			if (len > (in_seg->data_len - in_seg_data_pos)) {
 				len = in_seg->data_len - in_seg_data_pos;
 			}
-			out_seg->data = (char *) in_seg->data + (uint16_t) in_seg_data_pos;
+			out_seg->data_off = in_seg->data_off + in_seg_data_pos;
 			out_seg->data_len = (uint16_t)len;
 			out_pkt->pkt_len = (uint16_t)(len +
 			    out_pkt->pkt_len);
@@ -196,7 +196,7 @@ rte_ipv6_fragment_packet(struct rte_mbuf *pkt_in,
 
 		/* Build the IP header */
 
-		out_hdr = (struct ipv6_hdr *) out_pkt->data;
+		out_hdr = rte_pktmbuf_mtod(out_pkt, struct ipv6_hdr *);
 
 		__fill_ipv6hdr_frag(out_hdr, in_hdr,
 		    (uint16_t) out_pkt->pkt_len - sizeof(struct ipv6_hdr),
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index c1b2176..26e36eb 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -117,7 +117,7 @@ rte_pktmbuf_init(struct rte_mempool *mp,
 	m->buf_len = (uint16_t)buf_len;
 
 	/* keep some headroom between start of buffer and data */
-	m->data = (char*) m->buf_addr + RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
+	m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
 
 	/* init some constant fields */
 	m->pool = mp;
@@ -183,12 +183,12 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 		__rte_mbuf_sanity_check(m, 0);
 
 		fprintf(f, "  segment at 0x%p, data=0x%p, data_len=%u\n",
-		       m, m->data, (unsigned)m->data_len);
+			m, rte_pktmbuf_mtod(m, void *), (unsigned)m->data_len);
 		len = dump_len;
 		if (len > m->data_len)
 			len = m->data_len;
 		if (len != 0)
-			rte_hexdump(f, NULL, m->data, len);
+			rte_hexdump(f, NULL, rte_pktmbuf_mtod(m, void *), len);
 		dump_len -= len;
 		m = m->next;
 		nb_segs --;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index a523d82..71080e5 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -119,6 +119,13 @@ struct rte_mbuf {
 	void *buf_addr;           /**< Virtual address of segment buffer. */
 	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
 	uint16_t buf_len;         /**< Length of segment buffer. */
+
+	/* valid for any segment */
+	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
+	uint16_t data_off;
+	uint16_t data_len;        /**< Amount of data in segment buffer. */
+	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
+
 #ifdef RTE_MBUF_REFCNT
 	/**
 	 * 16-bit Reference counter.
@@ -138,15 +145,9 @@ struct rte_mbuf {
 	uint16_t reserved;            /**< Unused field. Required for padding */
 	uint16_t ol_flags;            /**< Offload features. */
 
-	/* valid for any segment */
-	struct rte_mbuf *next;  /**< Next segment of scattered packet. */
-	void* data;             /**< Start address of data in segment buffer. */
-	uint16_t data_len;      /**< Amount of data in segment buffer. */
-
 	/* these fields are valid for first segment only */
 	uint8_t nb_segs;        /**< Number of segments. */
 	uint8_t port;           /**< Input port. */
-	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
 
 	/* offload features, valid for first segment only */
 	union {
@@ -171,7 +172,7 @@ struct rte_mbuf {
 		uint16_t metadata16[0];
 		uint32_t metadata32[0];
 		uint64_t metadata64[0];
-	};
+	} __rte_cache_aligned;
 } __rte_cache_aligned;
 
 #define RTE_MBUF_METADATA_UINT8(mbuf, offset)              \
@@ -457,7 +458,7 @@ void rte_ctrlmbuf_init(struct rte_mempool *mp, void *opaque_arg,
  * @param m
  *   The control mbuf.
  */
-#define rte_ctrlmbuf_data(m) ((m)->data)
+#define rte_ctrlmbuf_data(m) ((char *)((m)->buf_addr) + (m)->data_off)
 
 /**
  * A macro that returns the length of the carried data.
@@ -522,8 +523,6 @@ void rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg);
  */
 static inline void rte_pktmbuf_reset(struct rte_mbuf *m)
 {
-	uint32_t buf_ofs;
-
 	m->next = NULL;
 	m->pkt_len = 0;
 	m->l2_l3_len = 0;
@@ -532,9 +531,8 @@ static inline void rte_pktmbuf_reset(struct rte_mbuf *m)
 	m->port = 0xff;
 
 	m->ol_flags = 0;
-	buf_ofs = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
+	m->data_off = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
 			RTE_PKTMBUF_HEADROOM : m->buf_len;
-	m->data = (char*) m->buf_addr + buf_ofs;
 
 	m->data_len = 0;
 	__rte_mbuf_sanity_check(m, 1);
@@ -591,7 +589,7 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *md)
 	mi->buf_len = md->buf_len;
 
 	mi->next = md->next;
-	mi->data = md->data;
+	mi->data_off = md->data_off;
 	mi->data_len = md->data_len;
 	mi->port = md->port;
 	mi->vlan_tci = md->vlan_tci;
@@ -621,16 +619,14 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 {
 	const struct rte_mempool *mp = m->pool;
 	void *buf = RTE_MBUF_TO_BADDR(m);
-	uint32_t buf_ofs;
 	uint32_t buf_len = mp->elt_size - sizeof(*m);
 	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + sizeof (*m);
 
 	m->buf_addr = buf;
 	m->buf_len = (uint16_t)buf_len;
 
-	buf_ofs = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
+	m->data_off = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
 			RTE_PKTMBUF_HEADROOM : m->buf_len;
-	m->data = (char*) m->buf_addr + buf_ofs;
 
 	m->data_len = 0;
 }
@@ -794,7 +790,7 @@ static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 static inline uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
-	return (uint16_t) ((char*) m->data - (char*) m->buf_addr);
+	return m->data_off;
 }
 
 /**
@@ -842,7 +838,7 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
  * @param t
  *   The type to cast the result into.
  */
-#define rte_pktmbuf_mtod(m, t) ((t)((m)->data))
+#define rte_pktmbuf_mtod(m, t) ((t)((char *)(m)->buf_addr + (m)->data_off))
 
 /**
  * A macro that returns the length of the packet.
@@ -887,11 +883,11 @@ static inline char *rte_pktmbuf_prepend(struct rte_mbuf *m,
 	if (unlikely(len > rte_pktmbuf_headroom(m)))
 		return NULL;
 
-	m->data = (char*) m->data - len;
+	m->data_off -= len;
 	m->data_len = (uint16_t)(m->data_len + len);
 	m->pkt_len  = (m->pkt_len + len);
 
-	return (char*) m->data;
+	return (char *)m->buf_addr + m->data_off;
 }
 
 /**
@@ -920,7 +916,7 @@ static inline char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
 	if (unlikely(len > rte_pktmbuf_tailroom(m_last)))
 		return NULL;
 
-	tail = (char*) m_last->data + m_last->data_len;
+	tail = (char *)m_last->buf_addr + m_last->data_off + m_last->data_len;
 	m_last->data_len = (uint16_t)(m_last->data_len + len);
 	m->pkt_len  = (m->pkt_len + len);
 	return (char*) tail;
@@ -948,9 +944,9 @@ static inline char *rte_pktmbuf_adj(struct rte_mbuf *m, uint16_t len)
 		return NULL;
 
 	m->data_len = (uint16_t)(m->data_len - len);
-	m->data = ((char*) m->data + len);
+	m->data_off += len;
 	m->pkt_len  = (m->pkt_len - len);
-	return (char*) m->data;
+	return (char *)m->buf_addr + m->data_off;
 }
 
 /**
diff --git a/lib/librte_pmd_bond/rte_eth_bond_pmd.c b/lib/librte_pmd_bond/rte_eth_bond_pmd.c
index 5979ce5..506a448 100644
--- a/lib/librte_pmd_bond/rte_eth_bond_pmd.c
+++ b/lib/librte_pmd_bond/rte_eth_bond_pmd.c
@@ -198,14 +198,14 @@ xmit_slave_hash(const struct rte_mbuf *buf, uint8_t slave_count, uint8_t policy)
 
 	switch (policy) {
 	case BALANCE_XMIT_POLICY_LAYER2:
-		eth_hdr = (struct ether_hdr *)buf->data;
+		eth_hdr = rte_pktmbuf_mtod(buf, struct ether_hdr *);
 
 		hash = ether_hash(eth_hdr);
 		hash ^= hash >> 8;
 		return hash % slave_count;
 
 	case BALANCE_XMIT_POLICY_LAYER23:
-		eth_hdr = (struct ether_hdr *)buf->data;
+		eth_hdr = rte_pktmbuf_mtod(buf, struct ether_hdr *);
 
 		if (buf->ol_flags & PKT_RX_VLAN_PKT)
 			eth_offset = sizeof(struct ether_hdr) + sizeof(struct vlan_hdr);
diff --git a/lib/librte_pmd_e1000/em_rxtx.c b/lib/librte_pmd_e1000/em_rxtx.c
index ba7e3a9..ed3a6fc 100644
--- a/lib/librte_pmd_e1000/em_rxtx.c
+++ b/lib/librte_pmd_e1000/em_rxtx.c
@@ -90,8 +90,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 }
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb)             \
-	(uint64_t) ((mb)->buf_physaddr +       \
-	(uint64_t) ((char *)((mb)->data) - (char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
@@ -793,8 +792,8 @@ eth_em_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		pkt_len = (uint16_t) (rte_le_to_cpu_16(rxd.length) -
 				rxq->crc_len);
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rte_packet_prefetch(rxm->data);
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
+		rte_packet_prefetch((char *)rxm->buf_addr + rxm->data_off);
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		rxm->pkt_len = pkt_len;
@@ -963,7 +962,7 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		data_len = rte_le_to_cpu_16(rxd.length);
 		rxm->data_len = data_len;
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 		/*
 		 * If this is the first buffer of the received packet,
@@ -1035,7 +1034,8 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rxm->vlan_tci = rte_le_to_cpu_16(rxd.special);
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_packet_prefetch(first_seg->data);
+		rte_packet_prefetch((char *)first_seg->buf_addr +
+			first_seg->data_off);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
index d4a803e..41727e7 100644
--- a/lib/librte_pmd_e1000/igb_rxtx.c
+++ b/lib/librte_pmd_e1000/igb_rxtx.c
@@ -95,9 +95,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 }
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr +		   \
-			(uint64_t) ((char *)((mb)->data) -     \
-				(char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
@@ -774,8 +772,8 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		pkt_len = (uint16_t) (rte_le_to_cpu_16(rxd.wb.upper.length) -
 				      rxq->crc_len);
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rte_packet_prefetch(rxm->data);
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
+		rte_packet_prefetch((char *)rxm->buf_addr + rxm->data_off);
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		rxm->pkt_len = pkt_len;
@@ -950,7 +948,7 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
 		rxm->data_len = data_len;
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 		/*
 		 * If this is the first buffer of the received packet,
@@ -1031,7 +1029,8 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		first_seg->ol_flags = pkt_flags;
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_packet_prefetch(first_seg->data);
+		rte_packet_prefetch((char *)first_seg->buf_addr +
+			first_seg->data_off);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
diff --git a/lib/librte_pmd_i40e/i40e_rxtx.c b/lib/librte_pmd_i40e/i40e_rxtx.c
index e41e8d0..25a5f6f 100644
--- a/lib/librte_pmd_i40e/i40e_rxtx.c
+++ b/lib/librte_pmd_i40e/i40e_rxtx.c
@@ -78,9 +78,7 @@
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	((uint64_t)((mb)->buf_physaddr + \
-	(uint64_t)((char *)((mb)->data) - \
-	(char *)(mb)->buf_addr)))
+	((uint64_t)((mb)->buf_physaddr + (mb)->data_off))
 
 static const struct rte_memzone *
 i40e_ring_dma_zone_reserve(struct rte_eth_dev *dev,
@@ -685,7 +683,7 @@ i40e_rx_alloc_bufs(struct i40e_rx_queue *rxq)
 		mb = rxep[i].mbuf;
 		rte_mbuf_refcnt_set(mb, 1);
 		mb->next = NULL;
-		mb->data = (char *)mb->buf_addr + RTE_PKTMBUF_HEADROOM;
+		mb->data_off = RTE_PKTMBUF_HEADROOM;
 		mb->nb_segs = 1;
 		mb->port = rxq->port_id;
 		dma_addr = rte_cpu_to_le_64(\
@@ -842,8 +840,8 @@ i40e_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 		rx_packet_len = ((qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
 				I40E_RXD_QW1_LENGTH_PBUF_SHIFT) - rxq->crc_len;
 
-		rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rte_prefetch0(rxm->data);
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
+		rte_prefetch0(RTE_PTR_ADD(rxm->buf_addr, RTE_PKTMBUF_HEADROOM));
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		rxm->pkt_len = rx_packet_len;
@@ -946,7 +944,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 		rx_packet_len = (qword1 & I40E_RXD_QW1_LENGTH_PBUF_MASK) >>
 					I40E_RXD_QW1_LENGTH_PBUF_SHIFT;
 		rxm->data_len = rx_packet_len;
-		rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 		/**
 		 * If this is the first buffer of the received packet, set the
@@ -1015,7 +1013,8 @@ i40e_recv_scattered_pkts(void *rx_queue,
 				rte_le_to_cpu_32(rxd.wb.qword0.hi_dword.rss);
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_prefetch0(first_seg->data);
+		rte_prefetch0(RTE_PTR_ADD(first_seg->buf_addr,
+			first_seg->data_off));
 		rx_pkts[nb_rx++] = first_seg;
 		first_seg = NULL;
 	}
@@ -2131,7 +2130,7 @@ i40e_alloc_rx_queue_mbufs(struct i40e_rx_queue *rxq)
 
 		rte_mbuf_refcnt_set(mbuf, 1);
 		mbuf->next = NULL;
-		mbuf->data = (char *)mbuf->buf_addr + RTE_PKTMBUF_HEADROOM;
+		mbuf->data_off = RTE_PKTMBUF_HEADROOM;
 		mbuf->nb_segs = 1;
 		mbuf->port = rxq->port_id;
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 575a014..c661335 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -997,7 +997,7 @@ ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
 		mb = rxep[i].mbuf;
 		rte_mbuf_refcnt_set(mb, 1);
 		mb->next = NULL;
-		mb->data = (char *)mb->buf_addr + RTE_PKTMBUF_HEADROOM;
+		mb->data_off = RTE_PKTMBUF_HEADROOM;
 		mb->nb_segs = 1;
 		mb->port = rxq->port_id;
 
@@ -1248,8 +1248,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		pkt_len = (uint16_t) (rte_le_to_cpu_16(rxd.wb.upper.length) -
 				      rxq->crc_len);
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rte_packet_prefetch(rxm->data);
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
+		rte_packet_prefetch((char *)rxm->buf_addr + rxm->data_off);
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		rxm->pkt_len = pkt_len;
@@ -1431,7 +1431,7 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
 		rxm->data_len = data_len;
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 		/*
 		 * If this is the first buffer of the received packet,
@@ -1522,7 +1522,8 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		}
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_packet_prefetch(first_seg->data);
+		rte_packet_prefetch((char *)first_seg->buf_addr +
+			first_seg->data_off);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
@@ -3212,7 +3213,7 @@ ixgbe_alloc_rx_queue_mbufs(struct igb_rx_queue *rxq)
 
 		rte_mbuf_refcnt_set(mbuf, 1);
 		mbuf->next = NULL;
-		mbuf->data = (char *)mbuf->buf_addr + RTE_PKTMBUF_HEADROOM;
+		mbuf->data_off = RTE_PKTMBUF_HEADROOM;
 		mbuf->nb_segs = 1;
 		mbuf->port = rxq->port_id;
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index 41042ac..fd0023e 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -47,8 +47,7 @@
 #endif
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->data) - \
-	(char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
diff --git a/lib/librte_pmd_pcap/rte_eth_pcap.c b/lib/librte_pmd_pcap/rte_eth_pcap.c
index 121de65..2ff1e77 100644
--- a/lib/librte_pmd_pcap/rte_eth_pcap.c
+++ b/lib/librte_pmd_pcap/rte_eth_pcap.c
@@ -151,7 +151,8 @@ eth_pcap_rx(void *queue,
 
 		if (header.len <= buf_size) {
 			/* pcap packet will fit in the mbuf, go ahead and copy */
-			rte_memcpy(mbuf->data, packet, header.len);
+			rte_memcpy(rte_pktmbuf_mtod(mbuf, void *), packet,
+					header.len);
 			mbuf->data_len = (uint16_t)header.len;
 			mbuf->pkt_len = mbuf->data_len;
 			bufs[num_rx] = mbuf;
@@ -202,7 +203,8 @@ eth_pcap_tx_dumper(void *queue,
 		calculate_timestamp(&header.ts);
 		header.len = mbuf->data_len;
 		header.caplen = header.len;
-		pcap_dump((u_char*) dumper_q->dumper, &header, mbuf->data);
+		pcap_dump((u_char *)dumper_q->dumper, &header,
+				rte_pktmbuf_mtod(mbuf, void*));
 		rte_pktmbuf_free(mbuf);
 		num_tx++;
 	}
@@ -237,7 +239,8 @@ eth_pcap_tx(void *queue,
 
 	for (i = 0; i < nb_pkts; i++) {
 		mbuf = bufs[i];
-		ret = pcap_sendpacket(tx_queue->pcap, (u_char*) mbuf->data,
+		ret = pcap_sendpacket(tx_queue->pcap,
+				rte_pktmbuf_mtod(mbuf, u_char *),
 				mbuf->data_len);
 		if (unlikely(ret != 0))
 			break;
diff --git a/lib/librte_pmd_virtio/virtio_rxtx.c b/lib/librte_pmd_virtio/virtio_rxtx.c
index 132ee45..29c9cea 100644
--- a/lib/librte_pmd_virtio/virtio_rxtx.c
+++ b/lib/librte_pmd_virtio/virtio_rxtx.c
@@ -118,7 +118,7 @@ virtqueue_dequeue_burst_rx(struct virtqueue *vq, struct rte_mbuf **rx_pkts,
 		}
 
 		rte_prefetch0(cookie);
-		rte_packet_prefetch(cookie->data);
+		rte_packet_prefetch(rte_pktmbuf_mtod(cookie, void *));
 		rx_pkts[i]  = cookie;
 		vq->vq_used_cons_idx++;
 		vq_ring_free_chain(vq, desc_idx);
@@ -480,7 +480,7 @@ virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 		}
 
 		rxm->port = rxvq->port_id;
-		rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
@@ -584,7 +584,7 @@ virtio_recv_mergeable_pkts(void *rx_queue,
 		if (seg_num == 0)
 			seg_num = 1;
 
-		rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 		rxm->nb_segs = seg_num;
 		rxm->next = NULL;
 		rxm->pkt_len = (uint32_t)(len[0] - hdr_size);
@@ -622,9 +622,7 @@ virtio_recv_mergeable_pkts(void *rx_queue,
 			while (extra_idx < rcv_cnt) {
 				rxm = rcv_pkts[extra_idx];
 
-				rxm->data =
-					(char *)rxm->buf_addr +
-					RTE_PKTMBUF_HEADROOM - hdr_size;
+				rxm->data_off = RTE_PKTMBUF_HEADROOM - hdr_size;
 				rxm->next = NULL;
 				rxm->pkt_len = (uint32_t)(len[extra_idx]);
 				rxm->data_len = (uint16_t)(len[extra_idx]);
diff --git a/lib/librte_pmd_virtio/virtqueue.h b/lib/librte_pmd_virtio/virtqueue.h
index d777feb..fdee054 100644
--- a/lib/librte_pmd_virtio/virtqueue.h
+++ b/lib/librte_pmd_virtio/virtqueue.h
@@ -59,8 +59,7 @@
 #define VIRTQUEUE_MAX_NAME_SZ 32
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->data) - \
-	(char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define VTNET_SQ_RQ_QUEUE_IDX 0
 #define VTNET_SQ_TQ_QUEUE_IDX 1
diff --git a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
index e74b6fd..263f9ce 100644
--- a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
+++ b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
@@ -79,8 +79,7 @@
 
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->data) - \
-	(char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
@@ -565,7 +564,7 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 			rxm->data_len = (uint16_t)rcd->len;
 			rxm->port = rxq->port_id;
 			rxm->vlan_tci = 0;
-			rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+			rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 			rx_pkts[nb_rx++] = rxm;
 
diff --git a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
index 22215ed..891cb58 100644
--- a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
+++ b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
@@ -110,7 +110,7 @@ eth_xenvirt_rx(void *q, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 		rxm = rx_pkts[i];
 		PMD_RX_LOG(DEBUG, "packet len:%d\n", len[i]);
 		rxm->next = NULL;
-		rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 		rxm->data_len = (uint16_t)(len[i] - sizeof(struct virtio_net_hdr));
 		rxm->nb_segs = 1;
 		rxm->port = pi->port_id;
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread
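
To make the offset-based scheme in the patch above concrete, here is a
minimal sketch, not part of the patch itself, of how the two data
addresses are derived once the data pointer is replaced by data_off:

    /* illustrative only: virtual and DMA data addresses from the offset */
    static inline void *
    mbuf_data_addr(const struct rte_mbuf *m)
    {
        /* same computation as the updated rte_pktmbuf_mtod() macro */
        return (char *)m->buf_addr + m->data_off;
    }

    static inline uint64_t
    mbuf_data_dma_addr(const struct rte_mbuf *m)
    {
        /* same computation as the updated RTE_MBUF_DATA_DMA_ADDR() */
        return (uint64_t)m->buf_physaddr + m->data_off;
    }

Prepend/adjust operations then only ever modify the 16-bit offset; the
buffer addresses themselves never change after allocation.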

* [dpdk-dev] [PATCH v2 02/13] mbuf: reorder fields by time of use
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 01/13] mbuf: replace data pointer by an offset Bruce Richardson
@ 2014-09-11 13:15   ` Bruce Richardson
  2014-09-15  7:11     ` Liu, Jijiang
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 03/13] mbuf: expand ol_flags field to 64-bits Bruce Richardson
                     ` (11 subsequent siblings)
  13 siblings, 1 reply; 62+ messages in thread
From: Bruce Richardson @ 2014-09-11 13:15 UTC (permalink / raw)
  To: dev

* Reorder the fields in the mbuf so that fields which are used together
sit side-by-side in the structure. This gives a contiguous block of
8 bytes in the mbuf which is used to reset an mbuf on descriptor rearm
(see the sketch after this list), and a block of 16 bytes of data
(excluding flags) which is set on RX from the received packet
descriptor.
* Use dummy fields as appropriate to ensure alignment or to reserve gaps
for later field additions.
* Place most items which are not used by fast-path RX separately at the
end of the structure so they can later be moved to a separate cache line.
[The l2/l3 length fields are not moved at this stage, as doing so would
cause overflow onto the next cache line.]
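
As a rough illustration of the rearm block (a sketch only; the helper
and the precomputed template are hypothetical, not part of this patch):
buf_len, data_off, the refcnt field, nb_segs and port now form one
aligned 8-byte block after buf_addr and buf_physaddr, so all five can
be reset with a single 64-bit store:

    /* illustrative only: rearm five fields in one write, using a
     * template value prepared once at queue setup time */
    static inline void
    mbuf_rearm(struct rte_mbuf *m, uint64_t rearm_template)
    {
        *(uint64_t *)&m->buf_len = rearm_template;
    }

This is the kind of single-store reset that the vector PMD rework later
in this series can take advantage of.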

Updated in V2:
* Updated the KNI buffer structure to match that of new mbuf
* Cleaned up commit message typos.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 .../linuxapp/eal/include/exec-env/rte_kni_common.h | 12 ++++++-----
 lib/librte_mbuf/rte_mbuf.h                         | 25 ++++++++++++----------
 2 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
index 07908ac..f2b502c 100644
--- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
+++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
@@ -108,15 +108,17 @@ struct rte_kni_fifo {
  * Padding is necessary to assure the offsets of these fields
  */
 struct rte_kni_mbuf {
-	void *pool;
 	void *buf_addr;
-	char pad0[16];
-	void *next;
+	char pad0[10];
 	uint16_t data_off;      /**< Start address of data in segment buffer. */
+	char pad1[4];
+	uint16_t ol_flags;      /**< Offload features. */
+	char pad2[8];
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
-	char pad2[4];
-	uint16_t ol_flags;      /**< Offload features. */
+	char pad3[8];
+	void *pool;
+	void *next;
 } __attribute__((__aligned__(64)));
 
 /*
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 71080e5..413a5a1 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -115,16 +115,12 @@ extern "C" {
  * The generic rte_mbuf, containing a packet mbuf.
  */
 struct rte_mbuf {
-	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
 	void *buf_addr;           /**< Virtual address of segment buffer. */
 	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
-	uint16_t buf_len;         /**< Length of segment buffer. */
 
-	/* valid for any segment */
-	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
+	/* next 8 bytes are initialised on RX descriptor rearm */
+	uint16_t buf_len;         /**< Length of segment buffer. */
 	uint16_t data_off;
-	uint16_t data_len;        /**< Amount of data in segment buffer. */
-	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
 
 #ifdef RTE_MBUF_REFCNT
 	/**
@@ -142,14 +138,17 @@ struct rte_mbuf {
 #else
 	uint16_t refcnt_reserved;     /**< Do not use this field */
 #endif
-	uint16_t reserved;            /**< Unused field. Required for padding */
-	uint16_t ol_flags;            /**< Offload features. */
-
-	/* these fields are valid for first segment only */
 	uint8_t nb_segs;        /**< Number of segments. */
 	uint8_t port;           /**< Input port. */
 
-	/* offload features, valid for first segment only */
+	uint16_t ol_flags;      /**< Offload features. */
+	uint16_t reserved0;     /**< Unused field. Required for padding */
+	uint32_t reserved1;     /**< Unused field. Required for padding */
+
+	/* remaining bytes are set on RX when pulling packet from descriptor */
+	uint16_t reserved2;     /**< Unused field. Required for padding */
+	uint16_t data_len;      /**< Amount of data in segment buffer. */
+	uint32_t pkt_len;       /**< Total pkt len: sum of all segments. */
 	union {
 		uint16_t l2_l3_len; /**< combined l2/l3 lengths as single var */
 		struct {
@@ -167,6 +166,10 @@ struct rte_mbuf {
 		uint32_t sched;     /**< Hierarchical scheduler */
 	} hash;                 /**< hash information */
 
+	/* fields only used in slow path or on TX */
+	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
+	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
+
 	union {
 		uint8_t metadata[0];
 		uint16_t metadata16[0];
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH v2 03/13] mbuf: expand ol_flags field to 64-bits
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 01/13] mbuf: replace data pointer by an offset Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 02/13] mbuf: reorder fields by time of use Bruce Richardson
@ 2014-09-11 13:15   ` Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 04/13] mbuf: introduce a flag to indicate a control mbuf Bruce Richardson
                     ` (10 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-11 13:15 UTC (permalink / raw)
  To: dev

The offload flags field (ol_flags) was 16 bits wide and had no further
room for expansion. This patch increases the field size to 64 bits,
using up the remaining reserved space in the single-cache-line mbuf.

NOTE: none of the values for existing flags have been changed, i.e. no
new numbers have been explicitly reserved between existing flag
definitions.
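
One practical consequence, shown here as a sketch with a hypothetical
flag name: any new flag defined above bit 31 must be written as a
64-bit constant, otherwise the shift is performed at 32-bit width:

    /* illustrative only: 1ULL keeps the constant 64 bits wide;
     * the flag name below is hypothetical */
    #define PKT_RX_NEW_FEATURE   (1ULL << 40)

    if (mb->ol_flags & PKT_RX_NEW_FEATURE)
        /* handle the feature */;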

Updates in V2:
* Modified ol_flags field in kni mbuf structure

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>

Updated kni data structure
---
 app/test-pmd/config.c                              |  8 +--
 app/test-pmd/csumonly.c                            |  8 +--
 app/test-pmd/rxonly.c                              |  2 +-
 app/test-pmd/testpmd.h                             |  4 +-
 app/test-pmd/txonly.c                              |  2 +-
 .../linuxapp/eal/include/exec-env/rte_kni_common.h |  4 +-
 lib/librte_mbuf/rte_mbuf.c                         |  2 +-
 lib/librte_mbuf/rte_mbuf.h                         |  4 +-
 lib/librte_pmd_e1000/em_rxtx.c                     | 34 ++++++------
 lib/librte_pmd_e1000/igb_rxtx.c                    | 62 ++++++++++------------
 lib/librte_pmd_i40e/i40e_rxtx.c                    | 31 +++++------
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c                  | 61 ++++++++++-----------
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h                  |  2 +-
 13 files changed, 106 insertions(+), 118 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 606e34a..adfa9a8 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -1714,14 +1714,14 @@ set_qmap(portid_t port_id, uint8_t is_rx, uint16_t queue_id, uint8_t map_value)
 }
 
 void
-tx_cksum_set(portid_t port_id, uint8_t cksum_mask)
+tx_cksum_set(portid_t port_id, uint64_t ol_flags)
 {
-	uint16_t tx_ol_flags;
+	uint64_t tx_ol_flags;
 	if (port_id_is_invalid(port_id))
 		return;
 	/* Clear last 4 bits and then set L3/4 checksum mask again */
-	tx_ol_flags = (uint16_t) (ports[port_id].tx_ol_flags & 0xFFF0);
-	ports[port_id].tx_ol_flags = (uint16_t) ((cksum_mask & 0xf) | tx_ol_flags);
+	tx_ol_flags = ports[port_id].tx_ol_flags & (~0x0Full);
+	ports[port_id].tx_ol_flags = ((ol_flags & 0xf) | tx_ol_flags);
 }
 
 void
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 2ce4c42..fcc4876 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -217,9 +217,9 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t i;
-	uint16_t ol_flags;
-	uint16_t pkt_ol_flags;
-	uint16_t tx_ol_flags;
+	uint64_t ol_flags;
+	uint64_t pkt_ol_flags;
+	uint64_t tx_ol_flags;
 	uint16_t l4_proto;
 	uint16_t eth_type;
 	uint8_t  l2_len;
@@ -261,7 +261,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 		mb = pkts_burst[i];
 		l2_len  = sizeof(struct ether_hdr);
 		pkt_ol_flags = mb->ol_flags;
-		ol_flags = (uint16_t) (pkt_ol_flags & (~PKT_TX_L4_MASK));
+		ol_flags = (pkt_ol_flags & (~PKT_TX_L4_MASK));
 
 		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		eth_type = rte_be_to_cpu_16(eth_hdr->ether_type);
diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index 7ba36a1..98c788b 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -109,7 +109,7 @@ pkt_burst_receive(struct fwd_stream *fs)
 	struct rte_mbuf  *mb;
 	struct ether_hdr *eth_hdr;
 	uint16_t eth_type;
-	uint16_t ol_flags;
+	uint64_t ol_flags;
 	uint16_t nb_rx;
 	uint16_t i;
 #ifdef RTE_TEST_PMD_RECORD_CORE_CYCLES
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 09923a8..142091d 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -141,7 +141,7 @@ struct rte_port {
 	struct fwd_stream       *rx_stream; /**< Port RX stream, if unique */
 	struct fwd_stream       *tx_stream; /**< Port TX stream, if unique */
 	unsigned int            socket_id;  /**< For NUMA support */
-	uint16_t                tx_ol_flags;/**< Offload Flags of TX packets. */
+	uint64_t                tx_ol_flags;/**< Offload Flags of TX packets. */
 	uint16_t                tx_vlan_id; /**< Tag Id. in TX VLAN packets. */
 	void                    *fwd_ctx;   /**< Forwarding mode context */
 	uint64_t                rx_bad_ip_csum; /**< rx pkts with bad ip checksum  */
@@ -494,7 +494,7 @@ void tx_vlan_pvid_set(portid_t port_id, uint16_t vlan_id, int on);
 
 void set_qmap(portid_t port_id, uint8_t is_rx, uint16_t queue_id, uint8_t map_value);
 
-void tx_cksum_set(portid_t port_id, uint8_t cksum_mask);
+void tx_cksum_set(portid_t port_id, uint64_t ol_flags);
 
 void set_verbose_level(uint16_t vb_level);
 void set_tx_pkt_segments(unsigned *seg_lengths, unsigned nb_segs);
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index 2b3f0b9..3d08005 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -203,7 +203,7 @@ pkt_burst_transmit(struct fwd_stream *fs)
 	uint16_t nb_tx;
 	uint16_t nb_pkt;
 	uint16_t vlan_tci;
-	uint16_t ol_flags;
+	uint64_t ol_flags;
 	uint8_t  i;
 #ifdef RTE_TEST_PMD_RECORD_CORE_CYCLES
 	uint64_t start_tsc;
diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
index f2b502c..ab022bd 100644
--- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
+++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
@@ -112,8 +112,8 @@ struct rte_kni_mbuf {
 	char pad0[10];
 	uint16_t data_off;      /**< Start address of data in segment buffer. */
 	char pad1[4];
-	uint16_t ol_flags;      /**< Offload features. */
-	char pad2[8];
+	uint64_t ol_flags;      /**< Offload features. */
+	char pad2[2];
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
 	char pad3[8];
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 26e36eb..9dfcac3 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -174,7 +174,7 @@ rte_pktmbuf_dump(FILE *f, const struct rte_mbuf *m, unsigned dump_len)
 
 	fprintf(f, "dump mbuf at 0x%p, phys=%"PRIx64", buf_len=%u\n",
 	       m, (uint64_t)m->buf_physaddr, (unsigned)m->buf_len);
-	fprintf(f, "  pkt_len=%"PRIu32", ol_flags=%"PRIx16", nb_segs=%u, "
+	fprintf(f, "  pkt_len=%"PRIu32", ol_flags=%"PRIx64", nb_segs=%u, "
 	       "in_port=%u\n", m->pkt_len, m->ol_flags,
 	       (unsigned)m->nb_segs, (unsigned)m->port);
 	nb_segs = m->nb_segs;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 413a5a1..0e20e31 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -141,9 +141,7 @@ struct rte_mbuf {
 	uint8_t nb_segs;        /**< Number of segments. */
 	uint8_t port;           /**< Input port. */
 
-	uint16_t ol_flags;      /**< Offload features. */
-	uint16_t reserved0;     /**< Unused field. Required for padding */
-	uint32_t reserved1;     /**< Unused field. Required for padding */
+	uint64_t ol_flags;      /**< Offload features. */
 
 	/* remaining bytes are set on RX when pulling packet from descriptor */
 	uint16_t reserved2;     /**< Unused field. Required for padding */
diff --git a/lib/librte_pmd_e1000/em_rxtx.c b/lib/librte_pmd_e1000/em_rxtx.c
index ed3a6fc..b8423b4 100644
--- a/lib/librte_pmd_e1000/em_rxtx.c
+++ b/lib/librte_pmd_e1000/em_rxtx.c
@@ -168,7 +168,7 @@ union em_vlan_macip {
  * Structure to check if new context need be built
  */
 struct em_ctx_info {
-	uint16_t flags;              /**< ol_flags related to context build. */
+	uint64_t flags;              /**< ol_flags related to context build. */
 	uint32_t cmp_mask;           /**< compare mask */
 	union em_vlan_macip hdrlen;  /**< L2 and L3 header lengths */
 };
@@ -238,7 +238,7 @@ struct em_tx_queue {
 static inline void
 em_set_xmit_ctx(struct em_tx_queue* txq,
 		volatile struct e1000_context_desc *ctx_txd,
-		uint16_t flags,
+		uint64_t flags,
 		union em_vlan_macip hdrlen)
 {
 	uint32_t cmp_mask, cmd_len;
@@ -304,7 +304,7 @@ em_set_xmit_ctx(struct em_tx_queue* txq,
  * or create a new context descriptor.
  */
 static inline uint32_t
-what_ctx_update(struct em_tx_queue *txq, uint16_t flags,
+what_ctx_update(struct em_tx_queue *txq, uint64_t flags,
 		union em_vlan_macip hdrlen)
 {
 	/* If match with the current context */
@@ -377,7 +377,7 @@ em_xmit_cleanup(struct em_tx_queue *txq)
 }
 
 static inline uint32_t
-tx_desc_cksum_flags_to_upper(uint16_t ol_flags)
+tx_desc_cksum_flags_to_upper(uint64_t ol_flags)
 {
 	static const uint32_t l4_olinfo[2] = {0, E1000_TXD_POPTS_TXSM << 8};
 	static const uint32_t l3_olinfo[2] = {0, E1000_TXD_POPTS_IXSM << 8};
@@ -403,12 +403,12 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint32_t popts_spec;
 	uint32_t cmd_type_len;
 	uint16_t slen;
-	uint16_t ol_flags;
+	uint64_t ol_flags;
 	uint16_t tx_id;
 	uint16_t tx_last;
 	uint16_t nb_tx;
 	uint16_t nb_used;
-	uint16_t tx_ol_req;
+	uint64_t tx_ol_req;
 	uint32_t ctx;
 	uint32_t new_ctx;
 	union em_vlan_macip hdrlen;
@@ -438,8 +438,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		ol_flags = tx_pkt->ol_flags;
 
 		/* If hardware offload required */
-		tx_ol_req = (uint16_t)(ol_flags & (PKT_TX_IP_CKSUM |
-							PKT_TX_L4_MASK));
+		tx_ol_req = (ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_L4_MASK));
 		if (tx_ol_req) {
 			hdrlen.f.vlan_tci = tx_pkt->vlan_tci;
 			hdrlen.f.l2_len = tx_pkt->l2_len;
@@ -642,22 +641,21 @@ end_of_tx:
  *
  **********************************************************************/
 
-static inline uint16_t
+static inline uint64_t
 rx_desc_status_to_pkt_flags(uint32_t rx_status)
 {
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	/* Check if VLAN present */
-	pkt_flags = (uint16_t)((rx_status & E1000_RXD_STAT_VP) ?
-						PKT_RX_VLAN_PKT : 0);
+	pkt_flags = ((rx_status & E1000_RXD_STAT_VP) ?  PKT_RX_VLAN_PKT : 0);
 
 	return pkt_flags;
 }
 
-static inline uint16_t
+static inline uint64_t
 rx_desc_error_to_pkt_flags(uint32_t rx_error)
 {
-	uint16_t pkt_flags = 0;
+	uint64_t pkt_flags = 0;
 
 	if (rx_error & E1000_RXD_ERR_IPE)
 		pkt_flags |= PKT_RX_IP_CKSUM_BAD;
@@ -801,8 +799,8 @@ eth_em_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rxm->port = rxq->port_id;
 
 		rxm->ol_flags = rx_desc_status_to_pkt_flags(status);
-		rxm->ol_flags = (uint16_t)(rxm->ol_flags |
-				rx_desc_error_to_pkt_flags(rxd.errors));
+		rxm->ol_flags = rxm->ol_flags |
+				rx_desc_error_to_pkt_flags(rxd.errors);
 
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
 		rxm->vlan_tci = rte_le_to_cpu_16(rxd.special);
@@ -1027,8 +1025,8 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		first_seg->port = rxq->port_id;
 
 		first_seg->ol_flags = rx_desc_status_to_pkt_flags(status);
-		first_seg->ol_flags = (uint16_t)(first_seg->ol_flags |
-					rx_desc_error_to_pkt_flags(rxd.errors));
+		first_seg->ol_flags = first_seg->ol_flags |
+					rx_desc_error_to_pkt_flags(rxd.errors);
 
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
 		rxm->vlan_tci = rte_le_to_cpu_16(rxd.special);
diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
index 41727e7..56d1dfc 100644
--- a/lib/librte_pmd_e1000/igb_rxtx.c
+++ b/lib/librte_pmd_e1000/igb_rxtx.c
@@ -175,7 +175,7 @@ union igb_vlan_macip {
  * Structure to check if new context need be built
  */
 struct igb_advctx_info {
-	uint16_t flags;           /**< ol_flags related to context build. */
+	uint64_t flags;           /**< ol_flags related to context build. */
 	uint32_t cmp_mask;        /**< compare mask for vlan_macip_lens */
 	union igb_vlan_macip vlan_macip_lens; /**< vlan, mac & ip length. */
 };
@@ -243,7 +243,7 @@ struct igb_tx_queue {
 static inline void
 igbe_set_xmit_ctx(struct igb_tx_queue* txq,
 		volatile struct e1000_adv_tx_context_desc *ctx_txd,
-		uint16_t ol_flags, uint32_t vlan_macip_lens)
+		uint64_t ol_flags, uint32_t vlan_macip_lens)
 {
 	uint32_t type_tucmd_mlhl;
 	uint32_t mss_l4len_idx;
@@ -308,7 +308,7 @@ igbe_set_xmit_ctx(struct igb_tx_queue* txq,
  * or create a new context descriptor.
  */
 static inline uint32_t
-what_advctx_update(struct igb_tx_queue *txq, uint16_t flags,
+what_advctx_update(struct igb_tx_queue *txq, uint64_t flags,
 		uint32_t vlan_macip_lens)
 {
 	/* If match with the current context */
@@ -331,7 +331,7 @@ what_advctx_update(struct igb_tx_queue *txq, uint16_t flags,
 }
 
 static inline uint32_t
-tx_desc_cksum_flags_to_olinfo(uint16_t ol_flags)
+tx_desc_cksum_flags_to_olinfo(uint64_t ol_flags)
 {
 	static const uint32_t l4_olinfo[2] = {0, E1000_ADVTXD_POPTS_TXSM};
 	static const uint32_t l3_olinfo[2] = {0, E1000_ADVTXD_POPTS_IXSM};
@@ -343,7 +343,7 @@ tx_desc_cksum_flags_to_olinfo(uint16_t ol_flags)
 }
 
 static inline uint32_t
-tx_desc_vlan_flags_to_cmdtype(uint16_t ol_flags)
+tx_desc_vlan_flags_to_cmdtype(uint64_t ol_flags)
 {
 	static uint32_t vlan_cmd[2] = {0, E1000_ADVTXD_DCMD_VLE};
 	return vlan_cmd[(ol_flags & PKT_TX_VLAN_PKT) != 0];
@@ -366,12 +366,12 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint32_t cmd_type_len;
 	uint32_t pkt_len;
 	uint16_t slen;
-	uint16_t ol_flags;
+	uint64_t ol_flags;
 	uint16_t tx_end;
 	uint16_t tx_id;
 	uint16_t tx_last;
 	uint16_t nb_tx;
-	uint16_t tx_ol_req;
+	uint64_t tx_ol_req;
 	uint32_t new_ctx = 0;
 	uint32_t ctx = 0;
 
@@ -400,7 +400,7 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		ol_flags = tx_pkt->ol_flags;
 		vlan_macip_lens.f.vlan_tci = tx_pkt->vlan_tci;
 		vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
-		tx_ol_req = (uint16_t)(ol_flags & PKT_TX_OFFLOAD_MASK);
+		tx_ol_req = ol_flags & PKT_TX_OFFLOAD_MASK;
 
 		/* If a Context Descriptor need be built . */
 		if (tx_ol_req) {
@@ -587,12 +587,12 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
  *  RX functions
  *
  **********************************************************************/
-static inline uint16_t
+static inline uint64_t
 rx_desc_hlen_type_rss_to_pkt_flags(uint32_t hl_tp_rs)
 {
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
-	static uint16_t ip_pkt_types_map[16] = {
+	static uint64_t ip_pkt_types_map[16] = {
 		0, PKT_RX_IPV4_HDR, PKT_RX_IPV4_HDR_EXT, PKT_RX_IPV4_HDR_EXT,
 		PKT_RX_IPV6_HDR, 0, 0, 0,
 		PKT_RX_IPV6_HDR_EXT, 0, 0, 0,
@@ -605,34 +605,32 @@ rx_desc_hlen_type_rss_to_pkt_flags(uint32_t hl_tp_rs)
 		0, 0, 0, 0,
 	};
 
-	pkt_flags = (uint16_t)((hl_tp_rs & E1000_RXDADV_PKTTYPE_ETQF) ?
+	pkt_flags = (hl_tp_rs & E1000_RXDADV_PKTTYPE_ETQF) ?
 				ip_pkt_etqf_map[(hl_tp_rs >> 4) & 0x07] :
-				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
+				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F];
 #else
-	pkt_flags = (uint16_t)((hl_tp_rs & E1000_RXDADV_PKTTYPE_ETQF) ? 0 :
-				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
+	pkt_flags = (hl_tp_rs & E1000_RXDADV_PKTTYPE_ETQF) ? 0 :
+				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F];
 #endif
-	return (uint16_t)(pkt_flags | (((hl_tp_rs & 0x0F) == 0) ?
-						0 : PKT_RX_RSS_HASH));
+	return pkt_flags | (((hl_tp_rs & 0x0F) == 0) ?  0 : PKT_RX_RSS_HASH);
 }
 
-static inline uint16_t
+static inline uint64_t
 rx_desc_status_to_pkt_flags(uint32_t rx_status)
 {
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	/* Check if VLAN present */
-	pkt_flags = (uint16_t)((rx_status & E1000_RXD_STAT_VP) ?
-						PKT_RX_VLAN_PKT : 0);
+	pkt_flags = (rx_status & E1000_RXD_STAT_VP) ?  PKT_RX_VLAN_PKT : 0;
 
 #if defined(RTE_LIBRTE_IEEE1588)
 	if (rx_status & E1000_RXD_STAT_TMST)
-		pkt_flags = (uint16_t)(pkt_flags | PKT_RX_IEEE1588_TMST);
+		pkt_flags = pkt_flags | PKT_RX_IEEE1588_TMST;
 #endif
 	return pkt_flags;
 }
 
-static inline uint16_t
+static inline uint64_t
 rx_desc_error_to_pkt_flags(uint32_t rx_status)
 {
 	/*
@@ -640,7 +638,7 @@ rx_desc_error_to_pkt_flags(uint32_t rx_status)
 	 * Bit 29: L4I, L4I integrity error
 	 */
 
-	static uint16_t error_to_pkt_flags_map[4] = {
+	static uint64_t error_to_pkt_flags_map[4] = {
 		0,  PKT_RX_L4_CKSUM_BAD, PKT_RX_IP_CKSUM_BAD,
 		PKT_RX_IP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD
 	};
@@ -667,7 +665,7 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 	uint16_t rx_id;
 	uint16_t nb_rx;
 	uint16_t nb_hold;
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	nb_rx = 0;
 	nb_hold = 0;
@@ -786,10 +784,8 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rxm->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
 
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_status_to_pkt_flags(staterr));
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_error_to_pkt_flags(staterr));
+		pkt_flags = pkt_flags | rx_desc_status_to_pkt_flags(staterr);
+		pkt_flags = pkt_flags | rx_desc_error_to_pkt_flags(staterr);
 		rxm->ol_flags = pkt_flags;
 
 		/*
@@ -846,7 +842,7 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 	uint16_t nb_rx;
 	uint16_t nb_hold;
 	uint16_t data_len;
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	nb_rx = 0;
 	nb_hold = 0;
@@ -1022,10 +1018,8 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		first_seg->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_status_to_pkt_flags(staterr));
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_error_to_pkt_flags(staterr));
+		pkt_flags = pkt_flags | rx_desc_status_to_pkt_flags(staterr);
+		pkt_flags = pkt_flags | rx_desc_error_to_pkt_flags(staterr);
 		first_seg->ol_flags = pkt_flags;
 
 		/* Prefetch data of first segment, if configured to do so. */
diff --git a/lib/librte_pmd_i40e/i40e_rxtx.c b/lib/librte_pmd_i40e/i40e_rxtx.c
index 25a5f6f..cd05654 100644
--- a/lib/librte_pmd_i40e/i40e_rxtx.c
+++ b/lib/librte_pmd_i40e/i40e_rxtx.c
@@ -91,27 +91,27 @@ static uint16_t i40e_xmit_pkts_simple(void *tx_queue,
 				      uint16_t nb_pkts);
 
 /* Translate the rx descriptor status to pkt flags */
-static inline uint16_t
+static inline uint64_t
 i40e_rxd_status_to_pkt_flags(uint64_t qword)
 {
-	uint16_t flags;
+	uint64_t flags;
 
 	/* Check if VLAN packet */
-	flags = (uint16_t)(qword & (1 << I40E_RX_DESC_STATUS_L2TAG1P_SHIFT) ?
-							PKT_RX_VLAN_PKT : 0);
+	flags = qword & (1 << I40E_RX_DESC_STATUS_L2TAG1P_SHIFT) ?
+							PKT_RX_VLAN_PKT : 0;
 
 	/* Check if RSS_HASH */
-	flags |= (uint16_t)((((qword >> I40E_RX_DESC_STATUS_FLTSTAT_SHIFT) &
+	flags |= (((qword >> I40E_RX_DESC_STATUS_FLTSTAT_SHIFT) &
 					I40E_RX_DESC_FLTSTAT_RSS_HASH) ==
-			I40E_RX_DESC_FLTSTAT_RSS_HASH) ? PKT_RX_RSS_HASH : 0);
+			I40E_RX_DESC_FLTSTAT_RSS_HASH) ? PKT_RX_RSS_HASH : 0;
 
 	return flags;
 }
 
-static inline uint16_t
+static inline uint64_t
 i40e_rxd_error_to_pkt_flags(uint64_t qword)
 {
-	uint16_t flags = 0;
+	uint64_t flags = 0;
 	uint64_t error_bits = (qword >> I40E_RXD_QW1_ERROR_SHIFT);
 
 #define I40E_RX_ERR_BITS 0x3f
@@ -143,12 +143,12 @@ i40e_rxd_error_to_pkt_flags(uint64_t qword)
 }
 
 /* Translate pkt types to pkt flags */
-static inline uint16_t
+static inline uint64_t
 i40e_rxd_ptype_to_pkt_flags(uint64_t qword)
 {
 	uint8_t ptype = (uint8_t)((qword & I40E_RXD_QW1_PTYPE_MASK) >>
 					I40E_RXD_QW1_PTYPE_SHIFT);
-	static const uint16_t ip_ptype_map[I40E_MAX_PKT_TYPE] = {
+	static const uint64_t ip_ptype_map[I40E_MAX_PKT_TYPE] = {
 		0, /* PTYPE 0 */
 		0, /* PTYPE 1 */
 		0, /* PTYPE 2 */
@@ -567,7 +567,7 @@ i40e_rx_scan_hw_ring(struct i40e_rx_queue *rxq)
 	uint32_t rx_status;
 	int32_t s[I40E_LOOK_AHEAD], nb_dd;
 	int32_t i, j, nb_rx = 0;
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	rxdp = &rxq->rx_ring[rxq->rx_tail];
 	rxep = &rxq->sw_ring[rxq->rx_tail];
@@ -789,7 +789,7 @@ i40e_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 	uint16_t rx_packet_len;
 	uint16_t rx_id, nb_hold;
 	uint64_t dma_addr;
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	nb_rx = 0;
 	nb_hold = 0;
@@ -896,10 +896,11 @@ i40e_recv_scattered_pkts(void *rx_queue,
 	struct rte_mbuf *last_seg = rxq->pkt_last_seg;
 	struct rte_mbuf *nmb, *rxm;
 	uint16_t rx_id = rxq->rx_tail;
-	uint16_t nb_rx = 0, nb_hold = 0, rx_packet_len, pkt_flags;
+	uint16_t nb_rx = 0, nb_hold = 0, rx_packet_len;
 	uint32_t rx_status;
 	uint64_t qword1;
 	uint64_t dma_addr;
+	uint64_t pkt_flags;
 
 	while (nb_rx < nb_pkts) {
 		rxdp = &rx_ring[rx_id];
@@ -1046,7 +1047,7 @@ i40e_recv_scattered_pkts(void *rx_queue,
 
 /* Check if the context descriptor is needed for TX offloading */
 static inline uint16_t
-i40e_calc_context_desc(uint16_t flags)
+i40e_calc_context_desc(uint64_t flags)
 {
 	uint16_t mask = 0;
 
@@ -1075,7 +1076,7 @@ i40e_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 	uint32_t td_offset;
 	uint32_t tx_flags;
 	uint32_t td_tag;
-	uint16_t ol_flags;
+	uint64_t ol_flags;
 	uint8_t l2_len;
 	uint8_t l3_len;
 	uint16_t nb_used;
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index c661335..26cb176 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -358,7 +358,7 @@ ixgbe_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
 static inline void
 ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
 		volatile struct ixgbe_adv_tx_context_desc *ctx_txd,
-		uint16_t ol_flags, uint32_t vlan_macip_lens)
+		uint64_t ol_flags, uint32_t vlan_macip_lens)
 {
 	uint32_t type_tucmd_mlhl;
 	uint32_t mss_l4len_idx;
@@ -421,7 +421,7 @@ ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
  * or create a new context descriptor.
  */
 static inline uint32_t
-what_advctx_update(struct igb_tx_queue *txq, uint16_t flags,
+what_advctx_update(struct igb_tx_queue *txq, uint64_t flags,
 		uint32_t vlan_macip_lens)
 {
 	/* If match with the current used context */
@@ -444,7 +444,7 @@ what_advctx_update(struct igb_tx_queue *txq, uint16_t flags,
 }
 
 static inline uint32_t
-tx_desc_cksum_flags_to_olinfo(uint16_t ol_flags)
+tx_desc_cksum_flags_to_olinfo(uint64_t ol_flags)
 {
 	static const uint32_t l4_olinfo[2] = {0, IXGBE_ADVTXD_POPTS_TXSM};
 	static const uint32_t l3_olinfo[2] = {0, IXGBE_ADVTXD_POPTS_IXSM};
@@ -456,7 +456,7 @@ tx_desc_cksum_flags_to_olinfo(uint16_t ol_flags)
 }
 
 static inline uint32_t
-tx_desc_vlan_flags_to_cmdtype(uint16_t ol_flags)
+tx_desc_vlan_flags_to_cmdtype(uint64_t ol_flags)
 {
 	static const uint32_t vlan_cmd[2] = {0, IXGBE_ADVTXD_DCMD_VLE};
 	return vlan_cmd[(ol_flags & PKT_TX_VLAN_PKT) != 0];
@@ -546,12 +546,12 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint32_t cmd_type_len;
 	uint32_t pkt_len;
 	uint16_t slen;
-	uint16_t ol_flags;
+	uint64_t ol_flags;
 	uint16_t tx_id;
 	uint16_t tx_last;
 	uint16_t nb_tx;
 	uint16_t nb_used;
-	uint16_t tx_ol_req;
+	uint64_t tx_ol_req;
 	uint32_t ctx = 0;
 	uint32_t new_ctx;
 
@@ -583,7 +583,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		vlan_macip_lens.f.l2_l3_len = tx_pkt->l2_l3_len;
 
 		/* If hardware offload required */
-		tx_ol_req = (uint16_t)(ol_flags & PKT_TX_OFFLOAD_MASK);
+		tx_ol_req = ol_flags & PKT_TX_OFFLOAD_MASK;
 		if (tx_ol_req) {
 			/* If new context need be built or reuse the exist ctx. */
 			ctx = what_advctx_update(txq, tx_ol_req,
@@ -813,19 +813,19 @@ end_of_tx:
  *  RX functions
  *
  **********************************************************************/
-static inline uint16_t
+static inline uint64_t
 rx_desc_hlen_type_rss_to_pkt_flags(uint32_t hl_tp_rs)
 {
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
-	static uint16_t ip_pkt_types_map[16] = {
+	static uint64_t ip_pkt_types_map[16] = {
 		0, PKT_RX_IPV4_HDR, PKT_RX_IPV4_HDR_EXT, PKT_RX_IPV4_HDR_EXT,
 		PKT_RX_IPV6_HDR, 0, 0, 0,
 		PKT_RX_IPV6_HDR_EXT, 0, 0, 0,
 		PKT_RX_IPV6_HDR_EXT, 0, 0, 0,
 	};
 
-	static uint16_t ip_rss_types_map[16] = {
+	static uint64_t ip_rss_types_map[16] = {
 		0, PKT_RX_RSS_HASH, PKT_RX_RSS_HASH, PKT_RX_RSS_HASH,
 		0, PKT_RX_RSS_HASH, 0, PKT_RX_RSS_HASH,
 		PKT_RX_RSS_HASH, 0, 0, 0,
@@ -838,45 +838,44 @@ rx_desc_hlen_type_rss_to_pkt_flags(uint32_t hl_tp_rs)
 		0, 0, 0, 0,
 	};
 
-	pkt_flags = (uint16_t) ((hl_tp_rs & IXGBE_RXDADV_PKTTYPE_ETQF) ?
-				ip_pkt_etqf_map[(hl_tp_rs >> 4) & 0x07] :
-				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
+	pkt_flags = (hl_tp_rs & IXGBE_RXDADV_PKTTYPE_ETQF) ?
+			ip_pkt_etqf_map[(hl_tp_rs >> 4) & 0x07] :
+			ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F];
 #else
-	pkt_flags = (uint16_t) ((hl_tp_rs & IXGBE_RXDADV_PKTTYPE_ETQF) ? 0 :
-				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
+	pkt_flags = (hl_tp_rs & IXGBE_RXDADV_PKTTYPE_ETQF) ? 0 :
+			ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F];
 
 #endif
-	return (uint16_t)(pkt_flags | ip_rss_types_map[hl_tp_rs & 0xF]);
+	return pkt_flags | ip_rss_types_map[hl_tp_rs & 0xF];
 }
 
-static inline uint16_t
+static inline uint64_t
 rx_desc_status_to_pkt_flags(uint32_t rx_status)
 {
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	/*
 	 * Check if VLAN present only.
 	 * Do not check whether L3/L4 rx checksum done by NIC or not,
 	 * That can be found from rte_eth_rxmode.hw_ip_checksum flag
 	 */
-	pkt_flags = (uint16_t)((rx_status & IXGBE_RXD_STAT_VP) ?
-						PKT_RX_VLAN_PKT : 0);
+	pkt_flags = (rx_status & IXGBE_RXD_STAT_VP) ?  PKT_RX_VLAN_PKT : 0;
 
 #ifdef RTE_LIBRTE_IEEE1588
 	if (rx_status & IXGBE_RXD_STAT_TMST)
-		pkt_flags = (uint16_t)(pkt_flags | PKT_RX_IEEE1588_TMST);
+		pkt_flags = pkt_flags | PKT_RX_IEEE1588_TMST;
 #endif
 	return pkt_flags;
 }
 
-static inline uint16_t
+static inline uint64_t
 rx_desc_error_to_pkt_flags(uint32_t rx_status)
 {
 	/*
 	 * Bit 31: IPE, IPv4 checksum error
 	 * Bit 30: L4I, L4I integrity error
 	 */
-	static uint16_t error_to_pkt_flags_map[4] = {
+	static uint64_t error_to_pkt_flags_map[4] = {
 		0,  PKT_RX_L4_CKSUM_BAD, PKT_RX_IP_CKSUM_BAD,
 		PKT_RX_IP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD
 	};
@@ -947,10 +946,10 @@ ixgbe_rx_scan_hw_ring(struct igb_rx_queue *rxq)
 			mb->ol_flags  = rx_desc_hlen_type_rss_to_pkt_flags(
 					rxdp[j].wb.lower.lo_dword.data);
 			/* reuse status field from scan list */
-			mb->ol_flags = (uint16_t)(mb->ol_flags |
-					rx_desc_status_to_pkt_flags(s[j]));
-			mb->ol_flags = (uint16_t)(mb->ol_flags |
-					rx_desc_error_to_pkt_flags(s[j]));
+			mb->ol_flags = mb->ol_flags |
+					rx_desc_status_to_pkt_flags(s[j]);
+			mb->ol_flags = mb->ol_flags |
+					rx_desc_error_to_pkt_flags(s[j]);
 		}
 
 		/* Move mbuf pointers from the S/W ring to the stage */
@@ -1143,7 +1142,7 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 	uint16_t rx_id;
 	uint16_t nb_rx;
 	uint16_t nb_hold;
-	uint16_t pkt_flags;
+	uint64_t pkt_flags;
 
 	nb_rx = 0;
 	nb_hold = 0;
@@ -1261,10 +1260,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rxm->vlan_tci = rte_le_to_cpu_16(rxd.wb.upper.vlan);
 
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_status_to_pkt_flags(staterr));
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_error_to_pkt_flags(staterr));
+		pkt_flags = pkt_flags | rx_desc_status_to_pkt_flags(staterr);
+		pkt_flags = pkt_flags | rx_desc_error_to_pkt_flags(staterr);
 		rxm->ol_flags = pkt_flags;
 
 		if (likely(pkt_flags & PKT_RX_RSS_HASH))
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index fd0023e..0d633d6 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -177,7 +177,7 @@ union ixgbe_vlan_macip {
  */
 
 struct ixgbe_advctx_info {
-	uint16_t flags;           /**< ol_flags for context build. */
+	uint64_t flags;           /**< ol_flags for context build. */
 	uint32_t cmp_mask;        /**< compare mask for vlan_macip_lens */
 	union ixgbe_vlan_macip vlan_macip_lens; /**< vlan, mac ip length. */
 };
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH v2 04/13] mbuf: introduce a flag to indicate a control mbuf
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                     ` (2 preceding siblings ...)
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 03/13] mbuf: expand ol_flags field to 64-bits Bruce Richardson
@ 2014-09-11 13:15   ` Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 05/13] mbuf: minor changes for readability Bruce Richardson
                     ` (9 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-11 13:15 UTC (permalink / raw)
  To: dev

Since the flags field is now 64 bits, we can allow one bit to be used
to indicate a control, i.e. non-packet, mbuf. Dedicate the high bit
(bit 63) for this purpose and add a utility function,
rte_is_ctrlmbuf(), to test whether a given mbuf has the bit set.
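
A short usage sketch (the two handler functions are hypothetical, shown
only to illustrate the test):

    /* illustrative only: split a burst of mbufs into control and
     * packet mbufs using the test added by this patch */
    static void
    dispatch_burst(struct rte_mbuf **pkts, uint16_t n)
    {
        uint16_t i;

        for (i = 0; i < n; i++) {
            if (rte_is_ctrlmbuf(pkts[i]))
                handle_ctrl(pkts[i]);   /* hypothetical handler */
            else
                handle_pkt(pkts[i]);    /* hypothetical handler */
        }
    }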

Updated in V2:
* Fix typo in comment in rte_mbuf.h

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.c |  2 ++
 lib/librte_mbuf/rte_mbuf.h | 19 +++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 9dfcac3..52e7574 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -69,7 +69,9 @@ rte_ctrlmbuf_init(struct rte_mempool *mp,
 		void *_m,
 		__attribute__((unused)) unsigned i)
 {
+	struct rte_mbuf *m = _m;
 	rte_pktmbuf_init(mp, opaque_arg, _m, i);
+	m->ol_flags |= CTRL_MBUF_FLAG;
 }
 
 /*
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 0e20e31..b9a1c66 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -91,6 +91,7 @@ extern "C" {
 #define PKT_TX_IPV4_CSUM     0x1000 /**< Alias of PKT_TX_IP_CKSUM. */
 #define PKT_TX_IPV4          PKT_RX_IPV4_HDR /**< IPv4 with no IP checksum offload. */
 #define PKT_TX_IPV6          PKT_RX_IPV6_HDR /**< IPv6 packet */
+
 /*
  * Bit 14~13 used for L4 packet type with checksum enabled.
  *     00: Reserved
@@ -106,6 +107,9 @@ extern "C" {
 /* Bit 15 */
 #define PKT_TX_IEEE1588_TMST 0x8000 /**< TX IEEE1588 packet to timestamp. */
 
+/* Use final bit of flags to indicate a control mbuf */
+#define CTRL_MBUF_FLAG       (1ULL << 63)
+
 /**
  * Bit Mask to indicate what bits required for building TX context
  */
@@ -471,6 +475,21 @@ void rte_ctrlmbuf_init(struct rte_mempool *mp, void *opaque_arg,
  */
 #define rte_ctrlmbuf_len(m) rte_pktmbuf_data_len(m)
 
+/**
+ * Tests if an mbuf is a control mbuf
+ *
+ * @param m
+ *   The mbuf to be tested
+ * @return
+ *   - True (1) if the mbuf is a control mbuf
+ *   - False(0) otherwise
+ */
+static inline int
+rte_is_ctrlmbuf(struct rte_mbuf *m)
+{
+	return (!!(m->ol_flags & CTRL_MBUF_FLAG));
+}
+
 /* Operations on pkt mbuf */
 
 /**
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH v2 05/13] mbuf: minor changes for readability
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                     ` (3 preceding siblings ...)
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 04/13] mbuf: introduce a flag to indicate a control mbuf Bruce Richardson
@ 2014-09-11 13:15   ` Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 06/13] mbuf: use macros only to access the mbuf metadata Bruce Richardson
                     ` (8 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-11 13:15 UTC (permalink / raw)
  To: dev

* Ensure comments line up correctly
* Simplify the #ifdefs around the refcnt fields to make them clearer

Updates in V2:
* None

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index b9a1c66..6ce852a 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -126,7 +126,6 @@ struct rte_mbuf {
 	uint16_t buf_len;         /**< Length of segment buffer. */
 	uint16_t data_off;
 
-#ifdef RTE_MBUF_REFCNT
 	/**
 	 * 16-bit Reference counter.
 	 * It should only be accessed using the following functions:
@@ -136,21 +135,21 @@ struct rte_mbuf {
 	 * config option.
 	 */
 	union {
-		rte_atomic16_t refcnt_atomic;   /**< Atomically accessed refcnt */
-		uint16_t refcnt;                /**< Non-atomically accessed refcnt */
-	};
-#else
-	uint16_t refcnt_reserved;     /**< Do not use this field */
+#ifdef RTE_MBUF_REFCNT
+		rte_atomic16_t refcnt_atomic; /**< Atomically accessed refcnt */
+		uint16_t refcnt;              /**< Non-atomically accessed refcnt */
 #endif
-	uint8_t nb_segs;        /**< Number of segments. */
-	uint8_t port;           /**< Input port. */
+		uint16_t refcnt_reserved;     /**< Do not use this field */
+	};
+	uint8_t nb_segs;          /**< Number of segments. */
+	uint8_t port;             /**< Input port. */
 
-	uint64_t ol_flags;      /**< Offload features. */
+	uint64_t ol_flags;        /**< Offload features. */
 
 	/* remaining bytes are set on RX when pulling packet from descriptor */
-	uint16_t reserved2;     /**< Unused field. Required for padding */
-	uint16_t data_len;      /**< Amount of data in segment buffer. */
-	uint32_t pkt_len;       /**< Total pkt len: sum of all segments. */
+	uint16_t reserved2;       /**< Unused field. Required for padding */
+	uint16_t data_len;        /**< Amount of data in segment buffer. */
+	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
 	union {
 		uint16_t l2_l3_len; /**< combined l2/l3 lengths as single var */
 		struct {
@@ -158,15 +157,15 @@ struct rte_mbuf {
 			uint16_t l2_len:7;      /**< L2 (MAC) Header Length. */
 		};
 	};
-	uint16_t vlan_tci;      /**< VLAN Tag Control Identifier (CPU order). */
+	uint16_t vlan_tci;        /**< VLAN Tag Control Identifier (CPU order) */
 	union {
-		uint32_t rss;       /**< RSS hash result if RSS enabled */
+		uint32_t rss;     /**< RSS hash result if RSS enabled */
 		struct {
 			uint16_t hash;
 			uint16_t id;
-		} fdir;             /**< Filter identifier if FDIR enabled */
-		uint32_t sched;     /**< Hierarchical scheduler */
-	} hash;                 /**< hash information */
+		} fdir;           /**< Filter identifier if FDIR enabled */
+		uint32_t sched;   /**< Hierarchical scheduler */
+	} hash;                   /**< hash information */
 
 	/* fields only used in slow path or on TX */
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH v2 06/13] mbuf: use macros only to access the mbuf metadata
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                     ` (4 preceding siblings ...)
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 05/13] mbuf: minor changes for readability Bruce Richardson
@ 2014-09-11 13:15   ` Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 07/13] mbuf: move metadata macros to rte_port library Bruce Richardson
                     ` (7 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-11 13:15 UTC (permalink / raw)
  To: dev

Removed the explicit zero-sized metadata definition at the end of the
mbuf data structure. Updated the metadata macros to take account of this
change so that all existing code which uses those macros still works.
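
Call sites are unchanged; only the macro internals differ. A short
usage sketch (offset and value are arbitrary examples):

    /* illustrative only: read/write a 32-bit word at offset 0 of the
     * metadata area, which now starts directly after the structure
     * rather than at a named zero-sized member */
    RTE_MBUF_METADATA_UINT32(m, 0) = 0xdeadbeef;
    uint32_t v = RTE_MBUF_METADATA_UINT32(m, 0);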

Updates in V2:
* None

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 6ce852a..f855e64 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -171,31 +171,25 @@ struct rte_mbuf {
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
 	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
 
-	union {
-		uint8_t metadata[0];
-		uint16_t metadata16[0];
-		uint32_t metadata32[0];
-		uint64_t metadata64[0];
-	} __rte_cache_aligned;
 } __rte_cache_aligned;
 
 #define RTE_MBUF_METADATA_UINT8(mbuf, offset)              \
-	(mbuf->metadata[offset])
+	(((uint8_t *)&(mbuf)[1])[offset])
 #define RTE_MBUF_METADATA_UINT16(mbuf, offset)             \
-	(mbuf->metadata16[offset/sizeof(uint16_t)])
+	(((uint16_t *)&(mbuf)[1])[offset/sizeof(uint16_t)])
 #define RTE_MBUF_METADATA_UINT32(mbuf, offset)             \
-	(mbuf->metadata32[offset/sizeof(uint32_t)])
+	(((uint32_t *)&(mbuf)[1])[offset/sizeof(uint32_t)])
 #define RTE_MBUF_METADATA_UINT64(mbuf, offset)             \
-	(mbuf->metadata64[offset/sizeof(uint64_t)])
+	(((uint64_t *)&(mbuf)[1])[offset/sizeof(uint64_t)])
 
 #define RTE_MBUF_METADATA_UINT8_PTR(mbuf, offset)          \
-	(&mbuf->metadata[offset])
+	(&RTE_MBUF_METADATA_UINT8(mbuf, offset))
 #define RTE_MBUF_METADATA_UINT16_PTR(mbuf, offset)         \
-	(&mbuf->metadata16[offset/sizeof(uint16_t)])
+	(&RTE_MBUF_METADATA_UINT16(mbuf, offset))
 #define RTE_MBUF_METADATA_UINT32_PTR(mbuf, offset)         \
-	(&mbuf->metadata32[offset/sizeof(uint32_t)])
+	(&RTE_MBUF_METADATA_UINT32(mbuf, offset))
 #define RTE_MBUF_METADATA_UINT64_PTR(mbuf, offset)         \
-	(&mbuf->metadata64[offset/sizeof(uint64_t)])
+	(&RTE_MBUF_METADATA_UINT64(mbuf, offset))
 
 /**
  * Given the buf_addr returns the pointer to corresponding mbuf.
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH v2 07/13] mbuf: move metadata macros to rte_port library
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                     ` (5 preceding siblings ...)
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 06/13] mbuf: use macros only to access the mbuf metadata Bruce Richardson
@ 2014-09-11 13:15   ` Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 08/13] mbuf: add named points inside the mbuf structure Bruce Richardson
                     ` (6 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-11 13:15 UTC (permalink / raw)
  To: dev

The metadata macros are only used by libs and apps using the rte_port
packet framework library, so move them to a header file there.
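
As an illustration of usage after the move (hypothetical application
code; the offset and helper name are made up for the example):

  #include <rte_port.h>

  #define APP_FLOW_ID_OFFSET 32 /* app-chosen offset into the headroom */

  static inline void
  app_set_flow_id(struct rte_mbuf *m, uint32_t flow_id)
  {
          RTE_MBUF_METADATA_UINT32(m, APP_FLOW_ID_OFFSET) = flow_id;
  }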

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 18 ------------------
 lib/librte_port/rte_port.h | 22 ++++++++++++++++++++++
 2 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index f855e64..2f1bd4a 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -173,24 +173,6 @@ struct rte_mbuf {
 
 } __rte_cache_aligned;
 
-#define RTE_MBUF_METADATA_UINT8(mbuf, offset)              \
-	(((uint8_t *)&(mbuf)[1])[offset])
-#define RTE_MBUF_METADATA_UINT16(mbuf, offset)             \
-	(((uint16_t *)&(mbuf)[1])[offset/sizeof(uint16_t)])
-#define RTE_MBUF_METADATA_UINT32(mbuf, offset)             \
-	(((uint32_t *)&(mbuf)[1])[offset/sizeof(uint32_t)])
-#define RTE_MBUF_METADATA_UINT64(mbuf, offset)             \
-	(((uint64_t *)&(mbuf)[1])[offset/sizeof(uint64_t)])
-
-#define RTE_MBUF_METADATA_UINT8_PTR(mbuf, offset)          \
-	(&RTE_MBUF_METADATA_UINT8(mbuf, offset))
-#define RTE_MBUF_METADATA_UINT16_PTR(mbuf, offset)         \
-	(&RTE_MBUF_METADATA_UINT16(mbuf, offset))
-#define RTE_MBUF_METADATA_UINT32_PTR(mbuf, offset)         \
-	(&RTE_MBUF_METADATA_UINT32(mbuf, offset))
-#define RTE_MBUF_METADATA_UINT64_PTR(mbuf, offset)         \
-	(&RTE_MBUF_METADATA_UINT64(mbuf, offset))
-
 /**
  * Given the buf_addr returns the pointer to corresponding mbuf.
  */
diff --git a/lib/librte_port/rte_port.h b/lib/librte_port/rte_port.h
index 1d763c2..6bd5039 100644
--- a/lib/librte_port/rte_port.h
+++ b/lib/librte_port/rte_port.h
@@ -50,6 +50,28 @@ extern "C" {
 #include <stdint.h>
 #include <rte_mbuf.h>
 
+/**
+ * Macros to allow accessing metadata stored in the mbuf headroom
+ * just beyond the end of the mbuf data structure returned by a port
+ */
+#define RTE_MBUF_METADATA_UINT8(mbuf, offset)              \
+	(((uint8_t *)&(mbuf)[1])[offset])
+#define RTE_MBUF_METADATA_UINT16(mbuf, offset)             \
+	(((uint16_t *)&(mbuf)[1])[offset/sizeof(uint16_t)])
+#define RTE_MBUF_METADATA_UINT32(mbuf, offset)             \
+	(((uint32_t *)&(mbuf)[1])[offset/sizeof(uint32_t)])
+#define RTE_MBUF_METADATA_UINT64(mbuf, offset)             \
+	(((uint64_t *)&(mbuf)[1])[offset/sizeof(uint64_t)])
+
+#define RTE_MBUF_METADATA_UINT8_PTR(mbuf, offset)          \
+	(&RTE_MBUF_METADATA_UINT8(mbuf, offset))
+#define RTE_MBUF_METADATA_UINT16_PTR(mbuf, offset)         \
+	(&RTE_MBUF_METADATA_UINT16(mbuf, offset))
+#define RTE_MBUF_METADATA_UINT32_PTR(mbuf, offset)         \
+	(&RTE_MBUF_METADATA_UINT32(mbuf, offset))
+#define RTE_MBUF_METADATA_UINT64_PTR(mbuf, offset)         \
+	(&RTE_MBUF_METADATA_UINT64(mbuf, offset))
+
 /*
  * Port IN
  *
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH v2 08/13] mbuf: add named points inside the mbuf structure
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                     ` (6 preceding siblings ...)
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 07/13] mbuf: move metadata macros to rte_port library Bruce Richardson
@ 2014-09-11 13:15   ` Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 09/13] ixgbe: rework vector pmd following mbuf changes Bruce Richardson
                     ` (5 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-11 13:15 UTC (permalink / raw)
  To: dev

Add markers or "labels" at given points inside the mbuf which can be
used instead of individual fields to identify the start of logical
sections of the structure.

The use of typedefs and dummy fields was chosen over using unions
for a couple of reasons (a short usage sketch follows after this list):
* unions cause an extra level of indentation (more likely two levels,
  as a union containing a struct would be needed to cover multiple
  fields). This makes the lines longer than they need to be and
  increases the need for wrapping. [This was the main reason]
* with markers, multiple markers can be applied at the same point if
  wanted.
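
Usage sketch (helper name hypothetical): since a marker is a
zero-length array, it names an address without adding any size, so a
group of fields can be accessed in one operation:

  /* reset the 8 bytes of rearm fields (buf_len, data_off, ...) with a
   * single 64-bit store through the MARKER64 */
  static inline void
  mbuf_apply_template(struct rte_mbuf *mb, uint64_t rearm_template)
  {
          mb->rearm_data[0] = rearm_template;
  }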

Updates in V2:
* None

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 2f1bd4a..34900d4 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -115,14 +115,22 @@ extern "C" {
  */
 #define PKT_TX_OFFLOAD_MASK (PKT_TX_VLAN_PKT | PKT_TX_IP_CKSUM | PKT_TX_L4_MASK)
 
+/* define a set of marker types that can be used to refer to set points in the
+ * mbuf */
+typedef void    *MARKER[0];   /**< generic marker for a point in a structure */
+typedef uint64_t MARKER64[0]; /**< marker that allows us to overwrite 8 bytes
+                               * with a single assignment */
 /**
  * The generic rte_mbuf, containing a packet mbuf.
  */
 struct rte_mbuf {
+	MARKER cacheline0;
+
 	void *buf_addr;           /**< Virtual address of segment buffer. */
 	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
 
 	/* next 8 bytes are initialised on RX descriptor rearm */
+	MARKER64 rearm_data;
 	uint16_t buf_len;         /**< Length of segment buffer. */
 	uint16_t data_off;
 
@@ -147,6 +155,7 @@ struct rte_mbuf {
 	uint64_t ol_flags;        /**< Offload features. */
 
 	/* remaining bytes are set on RX when pulling packet from descriptor */
+	MARKER rx_descriptor_fields1;
 	uint16_t reserved2;       /**< Unused field. Required for padding */
 	uint16_t data_len;        /**< Amount of data in segment buffer. */
 	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH v2 09/13] ixgbe: rework vector pmd following mbuf changes
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                     ` (7 preceding siblings ...)
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 08/13] mbuf: add named points inside the mbuf structure Bruce Richardson
@ 2014-09-11 13:15   ` Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 10/13] mbuf: split mbuf across two cache lines Bruce Richardson
                     ` (4 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-11 13:15 UTC (permalink / raw)
  To: dev

The vector PMD expects fields to be in a specific order so that it can
do vector operations on multiple fields at a time. Following mbuf
rework, adjust the driver to take account of the new layout and re-enable it
in the config.
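
In outline (sketch only, helper name hypothetical): with the
descriptor-written fields adjacent behind the rx_descriptor_fields1
marker, one 16-byte store fills data_len, pkt_len, vlan_tci and the
hash at once:

  #include <emmintrin.h>
  #include <rte_mbuf.h>

  /* pkt_mb holds the descriptor fields already shuffled into mbuf
   * layout; a single unaligned store sets all of them */
  static inline void
  fill_rx_fields(struct rte_mbuf *mb, __m128i pkt_mb)
  {
          _mm_storeu_si128((void *)&mb->rx_descriptor_fields1, pkt_mb);
  }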

Updates in V2:
* Remove extra line at end of file to eliminate git warning

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 config/common_linuxapp                |   2 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c     |   2 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h     |   4 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 135 ++++++++++++----------------------
 4 files changed, 53 insertions(+), 90 deletions(-)

diff --git a/config/common_linuxapp b/config/common_linuxapp
index b140af7..5bee910 100644
--- a/config/common_linuxapp
+++ b/config/common_linuxapp
@@ -191,7 +191,7 @@ CONFIG_RTE_LIBRTE_IXGBE_DEBUG_DRIVER=n
 CONFIG_RTE_LIBRTE_IXGBE_PF_DISABLE_STRIP_CRC=n
 CONFIG_RTE_LIBRTE_IXGBE_RX_ALLOW_BULK_ALLOC=y
 CONFIG_RTE_LIBRTE_IXGBE_ALLOW_UNSUPPORTED_SFP=n
-CONFIG_RTE_IXGBE_INC_VECTOR=n
+CONFIG_RTE_IXGBE_INC_VECTOR=y
 CONFIG_RTE_IXGBE_RX_OLFLAGS_ENABLE=y
 
 #
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 26cb176..1a46393 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -2166,7 +2166,7 @@ ixgbe_dev_rx_queue_setup(struct rte_eth_dev *dev,
 		if (!ixgbe_rx_vec_condition_check(dev)) {
 			PMD_INIT_LOG(INFO, "Vector rx enabled, please make "
 				     "sure RX burst size no less than 32.\n");
-			ixgbe_rxq_vec_setup(rxq, socket_id);
+			ixgbe_rxq_vec_setup(rxq);
 			dev->rx_pkt_burst = ixgbe_recv_pkts_vec;
 		}
 #endif
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index 0d633d6..e92a864 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -115,6 +115,7 @@ struct igb_rx_queue {
 	struct igb_rx_entry *sw_ring; /**< address of RX software ring. */
 	struct rte_mbuf *pkt_first_seg; /**< First segment of current packet. */
 	struct rte_mbuf *pkt_last_seg; /**< Last segment of current packet. */
+	uint64_t            mbuf_initializer; /**< value to init mbufs */
 	uint16_t            nb_rx_desc; /**< number of RX descriptors. */
 	uint16_t            rx_tail;  /**< current value of RDT register. */
 	uint16_t            nb_rx_hold; /**< number of held free RX desc. */
@@ -126,7 +127,6 @@ struct igb_rx_queue {
 #ifdef RTE_IXGBE_INC_VECTOR
 	uint16_t            rxrearm_nb; /**< the idx we start the re-arming from */
 	uint16_t            rxrearm_start;  /**< number of remaining to be re-armed */
-	__m128i             misc_info;  /**< cache XMM combine port_id/crc/nb_segs */
 #endif
 	uint16_t            rx_free_thresh; /**< max free RX desc to hold. */
 	uint16_t            queue_id; /**< RX queue index. */
@@ -259,7 +259,7 @@ struct ixgbe_txq_ops {
 uint16_t ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
 int ixgbe_txq_vec_setup(struct igb_tx_queue *txq, unsigned int socket_id);
-int ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq, unsigned int socket_id);
+int ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq);
 int ixgbe_rx_vec_condition_check(struct rte_eth_dev *dev);
 #endif
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
index bafb215..d53e239 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
@@ -47,15 +47,11 @@
 static inline void
 ixgbe_rxq_rearm(struct igb_rx_queue *rxq)
 {
-	static const struct rte_mbuf mb_def = {
-		.nb_segs = 1,
-	};
 	int i;
 	uint16_t rx_id;
 	volatile union ixgbe_adv_rx_desc *rxdp;
 	struct igb_rx_entry *rxep = &rxq->sw_ring[rxq->rxrearm_start];
 	struct rte_mbuf *mb0, *mb1;
-	__m128i def_low;
 	__m128i hdr_room = _mm_set_epi64x(RTE_PKTMBUF_HEADROOM,
 			RTE_PKTMBUF_HEADROOM);
 
@@ -66,8 +62,6 @@ ixgbe_rxq_rearm(struct igb_rx_queue *rxq)
 
 	rxdp = rxq->rx_ring + rxq->rxrearm_start;
 
-	def_low = _mm_load_si128((__m128i *)&(mb_def.next));
-
 	/* Initialize the mbufs in vector, process 2 mbufs in one loop */
 	for (i = 0; i < RTE_IXGBE_RXQ_REARM_THRESH; i += 2, rxep += 2) {
 		__m128i dma_addr0, dma_addr1;
@@ -76,33 +70,25 @@ ixgbe_rxq_rearm(struct igb_rx_queue *rxq)
 		mb0 = rxep[0].mbuf;
 		mb1 = rxep[1].mbuf;
 
+		/* flush mbuf with pkt template */
+		mb0->rearm_data[0] = rxq->mbuf_initializer;
+		mb1->rearm_data[0] = rxq->mbuf_initializer;
+
 		/* load buf_addr(lo 64bit) and buf_physaddr(hi 64bit) */
 		vaddr0 = _mm_loadu_si128((__m128i *)&(mb0->buf_addr));
 		vaddr1 = _mm_loadu_si128((__m128i *)&(mb1->buf_addr));
 
-		/* calc va/pa of pkt data point */
-		vaddr0 = _mm_add_epi64(vaddr0, hdr_room);
-		vaddr1 = _mm_add_epi64(vaddr1, hdr_room);
-
 		/* convert pa to dma_addr hdr/data */
 		dma_addr0 = _mm_unpackhi_epi64(vaddr0, vaddr0);
 		dma_addr1 = _mm_unpackhi_epi64(vaddr1, vaddr1);
 
-		/* fill va into t0 def pkt template */
-		vaddr0 = _mm_unpacklo_epi64(def_low, vaddr0);
-		vaddr1 = _mm_unpacklo_epi64(def_low, vaddr1);
+		/* add headroom to pa values */
+		dma_addr0 = _mm_add_epi64(dma_addr0, hdr_room);
+		dma_addr1 = _mm_add_epi64(dma_addr1, hdr_room);
 
 		/* flush desc with pa dma_addr */
 		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr0);
 		_mm_store_si128((__m128i *)&rxdp++->read, dma_addr1);
-
-		/* flush mbuf with pkt template */
-		_mm_store_si128((__m128i *)&mb0->next, vaddr0);
-		_mm_store_si128((__m128i *)&mb1->next, vaddr1);
-
-		/* update refcnt per pkt */
-		rte_mbuf_refcnt_set(mb0, 1);
-		rte_mbuf_refcnt_set(mb1, 1);
 	}
 
 	rxq->rxrearm_start += RTE_IXGBE_RXQ_REARM_THRESH;
@@ -189,7 +175,13 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 	int pos;
 	uint64_t var;
 	__m128i shuf_msk;
-	__m128i in_port;
+	__m128i crc_adjust = _mm_set_epi16(
+				0, 0, 0, 0, /* ignore non-length fields */
+				0,          /* ignore high-16bits of pkt_len */
+				-rxq->crc_len, /* sub crc on pkt_len */
+				-rxq->crc_len, /* sub crc on data_len */
+				0            /* ignore pkt_type field */
+			);
 	__m128i dd_check;
 
 	if (unlikely(nb_pkts < RTE_IXGBE_VPMD_RX_BURST))
@@ -222,8 +214,8 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		15, 14,      /* octet 14~15, low 16 bits vlan_macip */
 		0xFF, 0xFF,  /* skip high 16 bits pkt_len, zero out */
 		13, 12,      /* octet 12~13, low 16 bits pkt_len */
-		0xFF, 0xFF,  /* skip nb_segs and in_port, zero out */
-		13, 12       /* octet 12~13, 16 bits data_len */
+		13, 12,      /* octet 12~13, 16 bits data_len */
+		0xFF, 0xFF   /* skip pkt_type field */
 		);
 
 
@@ -231,9 +223,6 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 	 * the next 'n' mbufs into the cache */
 	sw_ring = &rxq->sw_ring[rxq->rx_tail];
 
-	/* in_port, nb_seg = 1, crc_len */
-	in_port = rxq->misc_info;
-
 	/*
 	 * A. load 4 packet in one loop
 	 * B. copy 4 mbuf point from swring to rx_pkts
@@ -285,8 +274,8 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		desc_to_olflags_v(descs, &rx_pkts[pos]);
 
 		/* D.2 pkt 3,4 set in_port/nb_seg and remove crc */
-		pkt_mb4 = _mm_add_epi16(pkt_mb4, in_port);
-		pkt_mb3 = _mm_add_epi16(pkt_mb3, in_port);
+		pkt_mb4 = _mm_add_epi16(pkt_mb4, crc_adjust);
+		pkt_mb3 = _mm_add_epi16(pkt_mb3, crc_adjust);
 
 		/* D.1 pkt 1,2 convert format from desc to pktmbuf */
 		pkt_mb2 = _mm_shuffle_epi8(descs[1], shuf_msk);
@@ -297,23 +286,23 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		staterr = _mm_unpacklo_epi32(sterr_tmp1, sterr_tmp2);
 
 		/* D.3 copy final 3,4 data to rx_pkts */
-		_mm_storeu_si128((__m128i *)&(rx_pkts[pos+3]->data_len),
+		_mm_storeu_si128((void *)&rx_pkts[pos+3]->rx_descriptor_fields1,
 				pkt_mb4);
-		_mm_storeu_si128((__m128i *)&(rx_pkts[pos+2]->data_len),
+		_mm_storeu_si128((void *)&rx_pkts[pos+2]->rx_descriptor_fields1,
 				pkt_mb3);
 
 		/* D.2 pkt 1,2 set in_port/nb_seg and remove crc */
-		pkt_mb2 = _mm_add_epi16(pkt_mb2, in_port);
-		pkt_mb1 = _mm_add_epi16(pkt_mb1, in_port);
+		pkt_mb2 = _mm_add_epi16(pkt_mb2, crc_adjust);
+		pkt_mb1 = _mm_add_epi16(pkt_mb1, crc_adjust);
 
 		/* C.3 calc avaialbe number of desc */
 		staterr = _mm_and_si128(staterr, dd_check);
 		staterr = _mm_packs_epi32(staterr, zero);
 
 		/* D.3 copy final 1,2 data to rx_pkts */
-		_mm_storeu_si128((__m128i *)&(rx_pkts[pos+1]->data_len),
+		_mm_storeu_si128((void *)&rx_pkts[pos+1]->rx_descriptor_fields1,
 				pkt_mb2);
-		_mm_storeu_si128((__m128i *)&(rx_pkts[pos]->data_len),
+		_mm_storeu_si128((void *)&rx_pkts[pos]->rx_descriptor_fields1,
 				pkt_mb1);
 
 		/* C.4 calc avaialbe number of desc */
@@ -330,46 +319,19 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 
 	return nb_pkts_recd;
 }
-
 static inline void
 vtx1(volatile union ixgbe_adv_tx_desc *txdp,
-		struct rte_mbuf *pkt, __m128i flags)
+		struct rte_mbuf *pkt, uint64_t flags)
 {
-	__m128i t0, t1, offset, ols, ba, ctl;
-
-	/* load buf_addr/buf_physaddr in t0 */
-	t0 = _mm_loadu_si128((__m128i *)&(pkt->buf_addr));
-	/* load data, ... pkt_len in t1 */
-	t1 = _mm_loadu_si128((__m128i *)&(pkt->data));
-
-	/* calc offset = (data - buf_adr) */
-	offset = _mm_sub_epi64(t1, t0);
-
-	/* cmd_type_len: pkt_len |= DCMD_DTYP_FLAGS */
-	ctl = _mm_or_si128(t1, flags);
-
-	/* reorder as buf_physaddr/buf_addr */
-	offset = _mm_shuffle_epi32(offset, 0x4E);
-
-	/* olinfo_stats: pkt_len << IXGBE_ADVTXD_PAYLEN_SHIFT */
-	ols = _mm_slli_epi32(t1, IXGBE_ADVTXD_PAYLEN_SHIFT);
-
-	/* buffer_addr = buf_physaddr + offset */
-	ba = _mm_add_epi64(t0, offset);
-
-	/* format cmd_type_len/olinfo_status */
-	ctl = _mm_unpackhi_epi32(ctl, ols);
-
-	/* format buf_physaddr/cmd_type_len */
-	ba = _mm_unpackhi_epi64(ba, ctl);
-
-	/* write desc */
-	_mm_store_si128((__m128i *)&txdp->read, ba);
+	__m128i descriptor = _mm_set_epi64x((uint64_t)pkt->pkt_len << 46 |
+			flags | pkt->data_len,
+			pkt->buf_physaddr + pkt->data_off);
+	_mm_store_si128((__m128i *)&txdp->read, descriptor);
 }
 
 static inline void
 vtx(volatile union ixgbe_adv_tx_desc *txdp,
-		struct rte_mbuf **pkt, uint16_t nb_pkts,  __m128i flags)
+		struct rte_mbuf **pkt, uint16_t nb_pkts,  uint64_t flags)
 {
 	int i;
 	for (i = 0; i < nb_pkts; ++i, ++txdp, ++pkt)
@@ -456,9 +418,8 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 	struct igb_tx_entry_v *txep;
 	struct igb_tx_entry_seq *txsp;
 	uint16_t n, nb_commit, tx_id;
-	__m128i flags = _mm_set_epi32(DCMD_DTYP_FLAGS, 0, 0, 0);
-	__m128i rs = _mm_set_epi32(IXGBE_ADVTXD_DCMD_RS|DCMD_DTYP_FLAGS,
-			0, 0, 0);
+	uint64_t flags = DCMD_DTYP_FLAGS;
+	uint64_t rs = IXGBE_ADVTXD_DCMD_RS|DCMD_DTYP_FLAGS;
 	int i;
 
 	if (unlikely(nb_pkts > RTE_IXGBE_VPMD_TX_BURST))
@@ -610,6 +571,23 @@ static struct ixgbe_txq_ops vec_txq_ops = {
 	.reset = ixgbe_reset_tx_queue,
 };
 
+int
+ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq)
+{
+	static struct rte_mbuf mb_def = {
+		.nb_segs = 1,
+		.data_off = RTE_PKTMBUF_HEADROOM,
+#ifdef RTE_MBUF_REFCNT
+		.refcnt = 1,
+#endif
+	};
+
+	mb_def.buf_len = rxq->mb_pool->elt_size - sizeof(struct rte_mbuf);
+	mb_def.port = rxq->port_id;
+	rxq->mbuf_initializer = *((uint64_t *)&mb_def.rearm_data);
+	return 0;
+}
+
 int ixgbe_txq_vec_setup(struct igb_tx_queue *txq,
 			unsigned int socket_id)
 {
@@ -637,21 +615,6 @@ int ixgbe_txq_vec_setup(struct igb_tx_queue *txq,
 	return 0;
 }
 
-int ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq,
-			__rte_unused unsigned int socket_id)
-{
-	rxq->misc_info =
-		_mm_set_epi16(
-			0, 0, 0, 0, 0,
-			(uint16_t)-rxq->crc_len, /* sub crc on pkt_len */
-			(uint16_t)(rxq->port_id << 8 | 1),
-			/* 8b port_id and 8b nb_seg*/
-			(uint16_t)-rxq->crc_len  /* sub crc on data_len */
-			);
-
-	return 0;
-}
-
 int ixgbe_rx_vec_condition_check(struct rte_eth_dev *dev)
 {
 #ifndef RTE_LIBRTE_IEEE1588
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH v2 10/13] mbuf: split mbuf across two cache lines.
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                     ` (8 preceding siblings ...)
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 09/13] ixgbe: rework vector pmd following mbuf changes Bruce Richardson
@ 2014-09-11 13:15   ` Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 11/13] mbuf: move l2_len and l3_len to second cache line Bruce Richardson
                     ` (3 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-11 13:15 UTC (permalink / raw)
  To: dev

This change splits the mbuf in two to move the pool and next pointers to
the second cache line. This frees up 16 bytes in the first cache line.

The reason for this change is that we believe that there is no possible
way that we can ever fit all the fields we need to fit into a 64-byte
mbuf, and so we need to start looking at a 128-byte mbuf instead. Examples
of new fields that need to fit in, include -
* 32-bits more for filter information for support for the new filters in
  the i40e driver (and possibly other future drivers)
* an additional 2-4 bytes for storing info on a second vlan tag to allow
  drivers to support double Vlan/QinQ
* 4-bytes for storing a sequence number to enable out of order packet
  processing and subsequent packet reordering
as well as potentially a number of other fields or splitting out fields
that are superimposed over each other right now, e.g. for the qos scheduler.
We also want to allow space for use by other non-Intel NIC drivers that may
be open-sourced to dpdk.org in the future too, where they support fields
and offloads that currently supported hardware doesn't.

If we accept the fact of a 2-cache-line mbuf, then the issue becomes
how to rework things so that we spread our fields over the two
cache lines while causing the lowest slow-down possible. The general
approach that we are looking to take is to focus the first cache
line on fields that are updated on RX, so that receive only deals
with one cache line. The second cache line can be used for application
data and information that will only be used on the TX leg. This would
allow us to work on the first cache line in RX as now, and have the
second cache line being prefetched in the background so that it is
available when necessary. Hardware prefetches should help us out
here. We may also move rarely used or slow-path RX fields, e.g. those
for chained mbufs with jumbo frames, to the second
cache line, depending upon the performance impact and bytes savings
achieved.
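
As an illustrative sketch of the intended pattern (helper name
hypothetical), RX code can hide the cost of the second cache line by
prefetching it while it works on the first:

  #include <rte_prefetch.h>

  static inline void
  prefetch_second_line(struct rte_mbuf *mb)
  {
          /* pool and next now live on cache line 1; warm it up in the
           * background for any later slow-path or TX processing */
          rte_prefetch0(&mb->cacheline1);
  }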

Updated in V2:
* Expanded commit description to include contents of a previous mail
  describing some of the logic behind expanding mbuf to two cache lines
* Update kni mbuf structure

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 app/test/test_mbuf.c                                          | 2 +-
 lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h | 6 +++---
 lib/librte_mbuf/rte_mbuf.h                                    | 3 ++-
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 1b25481..66bcbc5 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -782,7 +782,7 @@ test_failing_mbuf_sanity_check(void)
 static int
 test_mbuf(void)
 {
-	RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != 64);
+	RTE_BUILD_BUG_ON(sizeof(struct rte_mbuf) != CACHE_LINE_SIZE * 2);
 
 	/* create pktmbuf pool if it does not exist */
 	if (pktmbuf_pool == NULL) {
diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
index ab022bd..25ed672 100644
--- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
+++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
@@ -108,7 +108,7 @@ struct rte_kni_fifo {
  * Padding is necessary to assure the offsets of these fields
  */
 struct rte_kni_mbuf {
-	void *buf_addr;
+	void *buf_addr __attribute__((__aligned__(64)));
 	char pad0[10];
 	uint16_t data_off;      /**< Start address of data in segment buffer. */
 	char pad1[4];
@@ -117,9 +117,9 @@ struct rte_kni_mbuf {
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
 	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
 	char pad3[8];
-	void *pool;
+	void *pool __attribute__((__aligned__(64)));
 	void *next;
-} __attribute__((__aligned__(64)));
+};
 
 /*
  * Struct used to create a KNI device. Passed to the kernel in IOCTL call
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 34900d4..508021b 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -176,7 +176,8 @@ struct rte_mbuf {
 		uint32_t sched;   /**< Hierarchical scheduler */
 	} hash;                   /**< hash information */
 
-	/* fields only used in slow path or on TX */
+	/* second cache line - fields only used in slow path or on TX */
+	MARKER cacheline1 __rte_cache_aligned;
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
 	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
 
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH v2 11/13] mbuf: move l2_len and l3_len to second cache line
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                     ` (9 preceding siblings ...)
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 10/13] mbuf: split mbuf across two cache lines Bruce Richardson
@ 2014-09-11 13:15   ` Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 12/13] ixgbe: Fix perf regression due to moved pool ptr Bruce Richardson
                     ` (2 subsequent siblings)
  13 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-11 13:15 UTC (permalink / raw)
  To: dev

The l2_len and l3_len fields are used for TX offloads and so should be
put on the second cache line, along with the other fields only used on
TX.

Updates in V2:
* The l2 and l3 lengths can be accessed as a single uint16_t for
  performance, as well as individually.
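
A minimal sketch of the combined access (assuming the usual
little-endian bit-field layout, where l3_len occupies the low 9 bits;
helper name hypothetical):

  static inline void
  set_hdr_lens(struct rte_mbuf *mb, uint16_t l2, uint16_t l3)
  {
          /* one 16-bit store sets both header lengths at once */
          mb->l2_l3_len = (uint16_t)((l2 << 9) | l3);
  }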

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_mbuf/rte_mbuf.h | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 508021b..1c6e115 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -159,13 +159,7 @@ struct rte_mbuf {
 	uint16_t reserved2;       /**< Unused field. Required for padding */
 	uint16_t data_len;        /**< Amount of data in segment buffer. */
 	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
-	union {
-		uint16_t l2_l3_len; /**< combined l2/l3 lengths as single var */
-		struct {
-			uint16_t l3_len:9;      /**< L3 (IP) Header Length. */
-			uint16_t l2_len:7;      /**< L2 (MAC) Header Length. */
-		};
-	};
+	uint16_t reserved;
 	uint16_t vlan_tci;        /**< VLAN Tag Control Identifier (CPU order) */
 	union {
 		uint32_t rss;     /**< RSS hash result if RSS enabled */
@@ -181,6 +175,14 @@ struct rte_mbuf {
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
 	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
 
+	/* fields to support TX offloads */
+	union {
+		uint16_t l2_l3_len; /**< combined l2/l3 lengths as single var */
+		struct {
+			uint16_t l3_len:9;      /**< L3 (IP) Header Length. */
+			uint16_t l2_len:7;      /**< L2 (MAC) Header Length. */
+		};
+	};
 } __rte_cache_aligned;
 
 /**
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH v2 12/13] ixgbe: Fix perf regression due to moved pool ptr
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                     ` (10 preceding siblings ...)
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 11/13] mbuf: move l2_len and l3_len to second cache line Bruce Richardson
@ 2014-09-11 13:15   ` Bruce Richardson
  2014-09-15 16:20     ` [dpdk-dev] [PATCH v3 " Bruce Richardson
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 13/13] ixgbe: Improve slow-path perf: vector scattered RX Bruce Richardson
  2014-09-17 22:35   ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Thomas Monjalon
  13 siblings, 1 reply; 62+ messages in thread
From: Bruce Richardson @ 2014-09-11 13:15 UTC (permalink / raw)
  To: dev

Adjust the fast-path code to fix the regression caused by the pool
pointer moving to the second cache line. This change adjusts the
prefetching and also the way in which the mbufs are freed back to the
mempool.
Note: the slow path, e.g. the path supporting jumbo frames, is still
slower, but this is dealt with by a later commit.
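
The new freeing scheme, in simplified form (sketch of the logic in the
diff below; assumes at most 64 buffers per call and omits refcounting):
consecutive mbufs from the same mempool are batched into one bulk put:

  #include <rte_mbuf.h>
  #include <rte_mempool.h>

  static inline void
  free_bufs_grouped(struct rte_mbuf **mbufs, unsigned n)
  {
          struct rte_mbuf *free[64];
          unsigned i, nb_free = 0;

          for (i = 0; i < n; i++) {
                  /* flush the batch whenever the pool changes */
                  if (nb_free > 0 && mbufs[i]->pool != free[0]->pool) {
                          rte_mempool_put_bulk(free[0]->pool,
                                          (void **)free, nb_free);
                          nb_free = 0;
                  }
                  free[nb_free++] = mbufs[i];
          }
          if (nb_free > 0)
                  rte_mempool_put_bulk(free[0]->pool,
                                  (void **)free, nb_free);
  }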

Updates in V2:
* fixup checkpatch issue

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c     |  8 ++-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h     | 14 +-----
 lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 92 +++++++++++++----------------------
 3 files changed, 38 insertions(+), 76 deletions(-)

diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 1a46393..d6448a4 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -142,10 +142,6 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 	 */
 	txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);
 
-	/* prefetch the mbufs that are about to be freed */
-	for (i = 0; i < txq->tx_rs_thresh; ++i)
-		rte_prefetch0((txep + i)->mbuf);
-
 	/* free buffers one at a time */
 	if ((txq->txq_flags & (uint32_t)ETH_TXQ_FLAGS_NOREFCOUNT) != 0) {
 		for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
@@ -186,6 +182,7 @@ tx4(volatile union ixgbe_adv_tx_desc *txdp, struct rte_mbuf **pkts)
 				((uint32_t)DCMD_DTYP_FLAGS | pkt_len);
 		txdp->read.olinfo_status =
 				(pkt_len << IXGBE_ADVTXD_PAYLEN_SHIFT);
+		rte_prefetch0(&(*pkts)->pool);
 	}
 }
 
@@ -205,6 +202,7 @@ tx1(volatile union ixgbe_adv_tx_desc *txdp, struct rte_mbuf **pkts)
 			((uint32_t)DCMD_DTYP_FLAGS | pkt_len);
 	txdp->read.olinfo_status =
 			(pkt_len << IXGBE_ADVTXD_PAYLEN_SHIFT);
+	rte_prefetch0(&(*pkts)->pool);
 }
 
 /*
@@ -1875,7 +1873,7 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
 		PMD_INIT_LOG(INFO, "Using simple tx code path\n");
 #ifdef RTE_IXGBE_INC_VECTOR
 		if (txq->tx_rs_thresh <= RTE_IXGBE_TX_MAX_FREE_BUF_SZ &&
-		    ixgbe_txq_vec_setup(txq, socket_id) == 0) {
+		    ixgbe_txq_vec_setup(txq) == 0) {
 			PMD_INIT_LOG(INFO, "Vector tx enabled.\n");
 			dev->tx_pkt_burst = ixgbe_xmit_pkts_vec;
 		}
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index e92a864..a97fddb 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -96,14 +96,6 @@ struct igb_tx_entry_v {
 };
 
 /**
- * continuous entry sequence, gather by the same mempool
- */
-struct igb_tx_entry_seq {
-	const struct rte_mempool* pool;
-	uint32_t same_pool;
-};
-
-/**
  * Structure associated with each RX queue.
  */
 struct igb_rx_queue {
@@ -190,10 +182,6 @@ struct igb_tx_queue {
 	volatile union ixgbe_adv_tx_desc *tx_ring;
 	uint64_t            tx_ring_phys_addr; /**< TX ring DMA address. */
 	struct igb_tx_entry *sw_ring;      /**< virtual address of SW ring. */
-#ifdef RTE_IXGBE_INC_VECTOR
-	/** continuous tx entry sequence within the same mempool */
-	struct igb_tx_entry_seq *sw_ring_seq;
-#endif
 	volatile uint32_t   *tdt_reg_addr; /**< Address of TDT register. */
 	uint16_t            nb_tx_desc;    /**< number of TX descriptors. */
 	uint16_t            tx_tail;       /**< current value of TDT reg. */
@@ -258,7 +246,7 @@ struct ixgbe_txq_ops {
 #ifdef RTE_IXGBE_INC_VECTOR
 uint16_t ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
-int ixgbe_txq_vec_setup(struct igb_tx_queue *txq, unsigned int socket_id);
+int ixgbe_txq_vec_setup(struct igb_tx_queue *txq);
 int ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq);
 int ixgbe_rx_vec_condition_check(struct rte_eth_dev *dev);
 #endif
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
index d53e239..8f34f59 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
@@ -342,9 +342,8 @@ static inline int __attribute__((always_inline))
 ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 {
 	struct igb_tx_entry_v *txep;
-	struct igb_tx_entry_seq *txsp;
 	uint32_t status;
-	uint32_t n, k;
+	uint32_t n;
 #ifdef RTE_MBUF_REFCNT
 	uint32_t i;
 	int nb_free = 0;
@@ -364,23 +363,38 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 	 */
 	txep = &((struct igb_tx_entry_v *)txq->sw_ring)[txq->tx_next_dd -
 			(n - 1)];
-	txsp = &txq->sw_ring_seq[txq->tx_next_dd - (n - 1)];
-
-	while (n > 0) {
-		k = RTE_MIN(n, txsp[n-1].same_pool);
 #ifdef RTE_MBUF_REFCNT
-		for (i = 0; i < k; i++) {
-			m = __rte_pktmbuf_prefree_seg((txep+n-k+i)->mbuf);
-			if (m != NULL)
-				free[nb_free++] = m;
-		}
-		rte_mempool_put_bulk((void *)txsp[n-1].pool,
-				(void **)free, nb_free);
+	m = __rte_pktmbuf_prefree_seg(txep[0].mbuf);
 #else
-		rte_mempool_put_bulk((void *)txsp[n-1].pool,
-				(void **)(txep+n-k), k);
+	m = txep[0].mbuf;
 #endif
-		n -= k;
+	if (likely(m != NULL)) {
+		free[0] = m;
+		nb_free = 1;
+		for (i = 1; i < n; i++) {
+#ifdef RTE_MBUF_REFCNT
+			m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
+#else
+			m = txep[i].mbuf;
+#endif
+			if (likely(m != NULL)) {
+				if (likely(m->pool == free[0]->pool))
+					free[nb_free++] = m;
+				else {
+					rte_mempool_put_bulk(free[0]->pool,
+							(void *)free, nb_free);
+					free[0] = m;
+					nb_free = 1;
+				}
+			}
+		}
+		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+	} else {
+		for (i = 1; i < n; i++) {
+			m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (m != NULL)
+				rte_mempool_put(m->pool, m);
+		}
 	}
 
 	/* buffers were freed, update counters */
@@ -394,19 +408,11 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 
 static inline void __attribute__((always_inline))
 tx_backlog_entry(struct igb_tx_entry_v *txep,
-		 struct igb_tx_entry_seq *txsp,
 		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 {
 	int i;
-	for (i = 0; i < (int)nb_pkts; ++i) {
+	for (i = 0; i < (int)nb_pkts; ++i)
 		txep[i].mbuf = tx_pkts[i];
-		/* check and update sequence number */
-		txsp[i].pool = tx_pkts[i]->pool;
-		if (txsp[i-1].pool == tx_pkts[i]->pool)
-			txsp[i].same_pool = txsp[i-1].same_pool + 1;
-		else
-			txsp[i].same_pool = 1;
-	}
 }
 
 uint16_t
@@ -416,7 +422,6 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 	struct igb_tx_queue *txq = (struct igb_tx_queue *)tx_queue;
 	volatile union ixgbe_adv_tx_desc *txdp;
 	struct igb_tx_entry_v *txep;
-	struct igb_tx_entry_seq *txsp;
 	uint16_t n, nb_commit, tx_id;
 	uint64_t flags = DCMD_DTYP_FLAGS;
 	uint64_t rs = IXGBE_ADVTXD_DCMD_RS|DCMD_DTYP_FLAGS;
@@ -435,14 +440,13 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 	tx_id = txq->tx_tail;
 	txdp = &txq->tx_ring[tx_id];
 	txep = &((struct igb_tx_entry_v *)txq->sw_ring)[tx_id];
-	txsp = &txq->sw_ring_seq[tx_id];
 
 	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
 
 	n = (uint16_t)(txq->nb_tx_desc - tx_id);
 	if (nb_commit >= n) {
 
-		tx_backlog_entry(txep, txsp, tx_pkts, n);
+		tx_backlog_entry(txep, tx_pkts, n);
 
 		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
 			vtx1(txdp, *tx_pkts, flags);
@@ -457,10 +461,9 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 		/* avoid reach the end of ring */
 		txdp = &(txq->tx_ring[tx_id]);
 		txep = &(((struct igb_tx_entry_v *)txq->sw_ring)[tx_id]);
-		txsp = &(txq->sw_ring_seq[tx_id]);
 	}
 
-	tx_backlog_entry(txep, txsp, tx_pkts, nb_commit);
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
 
 	vtx(txdp, tx_pkts, nb_commit, flags);
 
@@ -484,7 +487,6 @@ ixgbe_tx_queue_release_mbufs(struct igb_tx_queue *txq)
 {
 	unsigned i;
 	struct igb_tx_entry_v *txe;
-	struct igb_tx_entry_seq *txs;
 	uint16_t nb_free, max_desc;
 
 	if (txq->sw_ring != NULL) {
@@ -502,10 +504,6 @@ ixgbe_tx_queue_release_mbufs(struct igb_tx_queue *txq)
 		for (i = 0; i < txq->nb_tx_desc; i++) {
 			txe = (struct igb_tx_entry_v *)&txq->sw_ring[i];
 			txe->mbuf = NULL;
-
-			txs = &txq->sw_ring_seq[i];
-			txs->pool = NULL;
-			txs->same_pool = 0;
 		}
 	}
 }
@@ -520,11 +518,6 @@ ixgbe_tx_free_swring(struct igb_tx_queue *txq)
 		rte_free((struct igb_rx_entry *)txq->sw_ring - 1);
 		txq->sw_ring = NULL;
 	}
-
-	if (txq->sw_ring_seq != NULL) {
-		rte_free(txq->sw_ring_seq - 1);
-		txq->sw_ring_seq = NULL;
-	}
 }
 
 static void
@@ -533,7 +526,6 @@ ixgbe_reset_tx_queue(struct igb_tx_queue *txq)
 	static const union ixgbe_adv_tx_desc zeroed_desc = { .read = {
 			.buffer_addr = 0} };
 	struct igb_tx_entry_v *txe = (struct igb_tx_entry_v *)txq->sw_ring;
-	struct igb_tx_entry_seq *txs = txq->sw_ring_seq;
 	uint16_t i;
 
 	/* Zero out HW ring memory */
@@ -545,8 +537,6 @@ ixgbe_reset_tx_queue(struct igb_tx_queue *txq)
 		volatile union ixgbe_adv_tx_desc *txd = &txq->tx_ring[i];
 		txd->wb.status = IXGBE_TXD_STAT_DD;
 		txe[i].mbuf = NULL;
-		txs[i].pool = NULL;
-		txs[i].same_pool = 0;
 	}
 
 	txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
@@ -588,28 +578,14 @@ ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq)
 	return 0;
 }
 
-int ixgbe_txq_vec_setup(struct igb_tx_queue *txq,
-			unsigned int socket_id)
+int ixgbe_txq_vec_setup(struct igb_tx_queue *txq)
 {
-	uint16_t nb_desc;
-
 	if (txq->sw_ring == NULL)
 		return -1;
 
-	/* request addtional one entry for continous sequence check */
-	nb_desc = (uint16_t)(txq->nb_tx_desc + 1);
-
-	txq->sw_ring_seq = rte_zmalloc_socket("txq->sw_ring_seq",
-				sizeof(struct igb_tx_entry_seq) * nb_desc,
-				CACHE_LINE_SIZE, socket_id);
-	if (txq->sw_ring_seq == NULL)
-		return -1;
-
-
 	/* leave the first one for overflow */
 	txq->sw_ring = (struct igb_tx_entry *)
 		((struct igb_tx_entry_v *)txq->sw_ring + 1);
-	txq->sw_ring_seq += 1;
 	txq->ops = &vec_txq_ops;
 
 	return 0;
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH v2 13/13] ixgbe: Improve slow-path perf: vector scattered RX
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                     ` (11 preceding siblings ...)
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 12/13] ixgbe: Fix perf regression due to moved pool ptr Bruce Richardson
@ 2014-09-11 13:15   ` Bruce Richardson
  2014-09-17 22:35   ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Thomas Monjalon
  13 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-11 13:15 UTC (permalink / raw)
  To: dev

Provide a wrapper routine to enable receiving scattered packets with
the vector driver. This improves the performance of the slow-path RX.
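
From the application's point of view usage is unchanged (sketch;
port_id is a placeholder): the same burst call now transparently
returns reassembled multi-segment mbufs:

  struct rte_mbuf *pkts[32]; /* the vector PMD expects bursts of 32 */
  uint16_t nb_rx = rte_eth_rx_burst(port_id, 0 /* queue */, pkts, 32);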

Updates in V2:
* Fixed up some checkpatch errors and warnings

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c     |  16 ++++
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h     |   8 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 171 ++++++++++++++++++++++++++++++++--
 3 files changed, 186 insertions(+), 9 deletions(-)

diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index d6448a4..a80cade 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -3476,12 +3476,20 @@ ixgbe_dev_rx_init(struct rte_eth_dev *dev)
 		if ((dev->data->dev_conf.rxmode.max_rx_pkt_len +
 				2 * IXGBE_VLAN_TAG_SIZE) > buf_size){
 			dev->data->scattered_rx = 1;
+#ifdef RTE_IXGBE_INC_VECTOR
+			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
+#else
 			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
+#endif
 		}
 	}
 
 	if (dev->data->dev_conf.rxmode.enable_scatter) {
+#ifdef RTE_IXGBE_INC_VECTOR
+		dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
+#else
 		dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
+#endif
 		dev->data->scattered_rx = 1;
 	}
 
@@ -3969,12 +3977,20 @@ ixgbevf_dev_rx_init(struct rte_eth_dev *dev)
 		if ((dev->data->dev_conf.rxmode.max_rx_pkt_len +
 				2 * IXGBE_VLAN_TAG_SIZE) > buf_size) {
 			dev->data->scattered_rx = 1;
+#ifdef RTE_IXGBE_INC_VECTOR
+			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
+#else
 			dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
+#endif
 		}
 	}
 
 	if (dev->data->dev_conf.rxmode.enable_scatter) {
+#ifdef RTE_IXGBE_INC_VECTOR
+		dev->rx_pkt_burst = ixgbe_recv_scattered_pkts_vec;
+#else
 		dev->rx_pkt_burst = ixgbe_recv_scattered_pkts;
+#endif
 		dev->data->scattered_rx = 1;
 	}
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index a97fddb..7b5ac0e 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -244,8 +244,12 @@ struct ixgbe_txq_ops {
 			 IXGBE_ADVTXD_DCMD_EOP)
 
 #ifdef RTE_IXGBE_INC_VECTOR
-uint16_t ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
-uint16_t ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
+uint16_t ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts);
+uint16_t ixgbe_recv_scattered_pkts_vec(void *rx_queue,
+		struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
+uint16_t ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
+		uint16_t nb_pkts);
 int ixgbe_txq_vec_setup(struct igb_tx_queue *txq);
 int ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq);
 int ixgbe_rx_vec_condition_check(struct rte_eth_dev *dev);
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
index 8f34f59..56f5314 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
@@ -164,12 +164,11 @@ desc_to_olflags_v(__m128i descs[4], struct rte_mbuf **rx_pkts)
  *   numbers of DD bit
  * - don't support ol_flags for rss and csum err
  */
-uint16_t
-ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
-		uint16_t nb_pkts)
+static inline uint16_t
+_recv_raw_pkts_vec(struct igb_rx_queue *rxq, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts, uint8_t *split_packet)
 {
 	volatile union ixgbe_adv_rx_desc *rxdp;
-	struct igb_rx_queue *rxq = rx_queue;
 	struct igb_rx_entry *sw_ring;
 	uint16_t nb_pkts_recd;
 	int pos;
@@ -182,7 +181,7 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 				-rxq->crc_len, /* sub crc on data_len */
 				0            /* ignore pkt_type field */
 			);
-	__m128i dd_check;
+	__m128i dd_check, eop_check;
 
 	if (unlikely(nb_pkts < RTE_IXGBE_VPMD_RX_BURST))
 		return 0;
@@ -207,6 +206,9 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 	/* 4 packets DD mask */
 	dd_check = _mm_set_epi64x(0x0000000100000001LL, 0x0000000100000001LL);
 
+	/* 4 packets EOP mask */
+	eop_check = _mm_set_epi64x(0x0000000200000002LL, 0x0000000200000002LL);
+
 	/* mask to shuffle from desc. to mbuf */
 	shuf_msk = _mm_set_epi8(
 		7, 6, 5, 4,  /* octet 4~7, 32bits rss */
@@ -218,7 +220,6 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		0xFF, 0xFF   /* skip pkt_type field */
 		);
 
-
 	/* Cache is empty -> need to scan the buffer rings, but first move
 	 * the next 'n' mbufs into the cache */
 	sw_ring = &rxq->sw_ring[rxq->rx_tail];
@@ -227,6 +228,7 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 	 * A. load 4 packet in one loop
 	 * B. copy 4 mbuf point from swring to rx_pkts
 	 * C. calc the number of DD bits among the 4 packets
+	 * [C*. extract the end-of-packet bit, if requested]
 	 * D. fill info. from desc to mbuf
 	 */
 	for (pos = 0, nb_pkts_recd = 0; pos < RTE_IXGBE_VPMD_RX_BURST;
@@ -237,6 +239,13 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		__m128i zero, staterr, sterr_tmp1, sterr_tmp2;
 		__m128i mbp1, mbp2; /* two mbuf pointer in one XMM reg. */
 
+		if (split_packet) {
+			rte_prefetch0(&rx_pkts[pos]->cacheline1);
+			rte_prefetch0(&rx_pkts[pos + 1]->cacheline1);
+			rte_prefetch0(&rx_pkts[pos + 2]->cacheline1);
+			rte_prefetch0(&rx_pkts[pos + 3]->cacheline1);
+		}
+
 		/* B.1 load 1 mbuf point */
 		mbp1 = _mm_loadu_si128((__m128i *)&sw_ring[pos]);
 
@@ -295,7 +304,34 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 		pkt_mb2 = _mm_add_epi16(pkt_mb2, crc_adjust);
 		pkt_mb1 = _mm_add_epi16(pkt_mb1, crc_adjust);
 
-		/* C.3 calc avaialbe number of desc */
+		/* C* extract and record EOP bit */
+		if (split_packet) {
+			__m128i eop_shuf_mask = _mm_set_epi8(
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0xFF, 0xFF, 0xFF, 0xFF,
+					0x04, 0x0C, 0x00, 0x08
+					);
+
+			/* and with mask to extract bits, flipping 1-0 */
+			__m128i eop_bits = _mm_andnot_si128(staterr, eop_check);
+			/* the staterr values are not in order, as the
+			 * count of dd bits doesn't care. However, for end of
+			 * packet tracking, we do care, so shuffle. This also
+			 * compresses the 32-bit values to 8-bit */
+			eop_bits = _mm_shuffle_epi8(eop_bits, eop_shuf_mask);
+			/* store the resulting 32-bit value */
+			*(int *)split_packet = _mm_cvtsi128_si32(eop_bits);
+			split_packet += RTE_IXGBE_DESCS_PER_LOOP;
+
+			/* zero-out next pointers */
+			rx_pkts[pos]->next = NULL;
+			rx_pkts[pos + 1]->next = NULL;
+			rx_pkts[pos + 2]->next = NULL;
+			rx_pkts[pos + 3]->next = NULL;
+		}
+
+		/* C.3 calc available number of desc */
 		staterr = _mm_and_si128(staterr, dd_check);
 		staterr = _mm_packs_epi32(staterr, zero);
 
@@ -319,6 +355,127 @@ ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
 
 	return nb_pkts_recd;
 }
+
+/*
+ * vPMD receive routine, now only accept (nb_pkts == RTE_IXGBE_VPMD_RX_BURST)
+ * in one loop
+ *
+ * Notice:
+ * - nb_pkts < RTE_IXGBE_VPMD_RX_BURST, just return no packet
+ * - nb_pkts > RTE_IXGBE_VPMD_RX_BURST, only scan RTE_IXGBE_VPMD_RX_BURST
+ *   numbers of DD bit
+ * - don't support ol_flags for rss and csum err
+ */
+uint16_t
+ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts)
+{
+	return _recv_raw_pkts_vec(rx_queue, rx_pkts, nb_pkts, NULL);
+}
+
+static inline uint16_t
+reassemble_packets(struct igb_rx_queue *rxq, struct rte_mbuf **rx_bufs,
+		uint16_t nb_bufs, uint8_t *split_flags)
+{
+	struct rte_mbuf *pkts[RTE_IXGBE_VPMD_RX_BURST]; /*finished pkts*/
+	struct rte_mbuf *start = rxq->pkt_first_seg;
+	struct rte_mbuf *end =  rxq->pkt_last_seg;
+	unsigned pkt_idx = 0, buf_idx = 0;
+
+
+	while (buf_idx < nb_bufs) {
+		if (end != NULL) {
+			/* processing a split packet */
+			end->next = rx_bufs[buf_idx];
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+
+			start->nb_segs++;
+			start->pkt_len += rx_bufs[buf_idx]->data_len;
+			end = end->next;
+
+			if (!split_flags[buf_idx]) {
+				/* it's the last packet of the set */
+				start->hash = end->hash;
+				start->ol_flags = end->ol_flags;
+				/* we need to strip crc for the whole packet */
+				start->pkt_len -= rxq->crc_len;
+				if (end->data_len > rxq->crc_len)
+					end->data_len -= rxq->crc_len;
+				else {
+					/* free up last mbuf */
+					struct rte_mbuf *secondlast = start;
+					while (secondlast->next != end)
+						secondlast = secondlast->next;
+					secondlast->data_len -= (rxq->crc_len -
+							end->data_len);
+					secondlast->next = NULL;
+					rte_pktmbuf_free_seg(end);
+					end = secondlast;
+				}
+				pkts[pkt_idx++] = start;
+				start = end = NULL;
+			}
+		} else {
+			/* not processing a split packet */
+			if (!split_flags[buf_idx]) {
+				/* not a split packet, save and skip */
+				pkts[pkt_idx++] = rx_bufs[buf_idx];
+				buf_idx++;
+				continue;
+			}
+			end = start = rx_bufs[buf_idx];
+			rx_bufs[buf_idx]->data_len += rxq->crc_len;
+			rx_bufs[buf_idx]->pkt_len += rxq->crc_len;
+		}
+		buf_idx++;
+	}
+
+	/* save the partial packet for next time */
+	rxq->pkt_first_seg = start;
+	rxq->pkt_last_seg = end;
+	memcpy(rx_bufs, pkts, pkt_idx * (sizeof(*pkts)));
+	return pkt_idx;
+}
+
+/*
+ * vPMD receive routine that reassembles scattered packets
+ *
+ * Notice:
+ * - don't support ol_flags for rss and csum err
+ * - now only accept (nb_pkts == RTE_IXGBE_VPMD_RX_BURST)
+ */
+uint16_t
+ixgbe_recv_scattered_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts,
+		uint16_t nb_pkts)
+{
+	struct igb_rx_queue *rxq = rx_queue;
+	uint8_t split_flags[RTE_IXGBE_VPMD_RX_BURST] = {0};
+
+	/* get some new buffers */
+	uint16_t nb_bufs = _recv_raw_pkts_vec(rxq, rx_pkts, nb_pkts,
+			split_flags);
+	if (nb_bufs == 0)
+		return 0;
+
+	/* happy day case, full burst + no packets to be joined */
+	const uint32_t *split_fl32 = (uint32_t *)split_flags;
+	if (rxq->pkt_first_seg == NULL &&
+			split_fl32[0] == 0 && split_fl32[1] == 0 &&
+			split_fl32[2] == 0 && split_fl32[3] == 0)
+		return nb_bufs;
+
+	/* reassemble any packets that need reassembly */
+	unsigned i = 0;
+	if (rxq->pkt_first_seg == NULL) {
+		/* find the first split flag, and only reassemble from there */
+		while (i < nb_bufs && !split_flags[i])
+			i++;
+		if (i == nb_bufs)
+			return nb_bufs;
+	}
+	return i + reassemble_packets(rxq, &rx_pkts[i], nb_bufs - i,
+		&split_flags[i]);
+}
+
 static inline void
 vtx1(volatile union ixgbe_adv_tx_desc *txdp,
 		struct rte_mbuf *pkt, uint64_t flags)
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata
  2014-09-09  9:01     ` Richardson, Bruce
@ 2014-09-12 16:56       ` Dumitrescu, Cristian
  2014-09-12 21:02         ` Olivier MATZ
  0 siblings, 1 reply; 62+ messages in thread
From: Dumitrescu, Cristian @ 2014-09-12 16:56 UTC (permalink / raw)
  To: Richardson, Bruce, Olivier MATZ, dev

Bruce, Olivier, 

What is the reason for removing this field? Please explain the rationale behind this change.

We previously agreed we need to provide an easy and standard mechanism for applications to extend the mandatory per buffer metadata (mbuf) with optional application-dependent metadata. This field just provides a clean way for the apps to know where the end of the mandatory metadata is, i.e. the first location in the packet buffer where the app can add its own metadata (of course, the app has to manage the headroom space before the first byte of packet data). A zero-size field is the standard mechanism that DPDK uses extensively in pretty much every library to access memory immediately after a header structure.
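
For instance (simplified sketch), the trailing zero-size field pattern
looks like this:

struct header {
	uint32_t len;
	uint8_t data[0]; /* names the memory right after the header */
};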

The impact of removing this field is that there is no standard way to identify where the end of the mandatory metadata is, so each application will have to reinvent this. With no clear convention, we will end up with a lot of non-standard ways. Every time the format of the mbuf structure is going to be changed, this can potentially break applications that use custom metadata, while using this simple standard mechanism would prevent this. So why remove this?

Having applications define their optional meta-data is a real need. Please take a look at the Service Chaining IETF emerging protocols (https://datatracker.ietf.org/wg/sfc/documents/), which provide standard mechanisms for applications to define their own packet meta-data and share it between the elements of the processing pipeline (for Service Chaining, these are typically virtual machines scattered amongst the data center).

And, in my opinion, there is no negative impact/cost associated with keeping this field.

Regards,
Cristian


-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Richardson, Bruce
Sent: Tuesday, September 9, 2014 10:01 AM
To: Olivier MATZ; dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata

> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Monday, September 08, 2014 1:06 PM
> To: Richardson, Bruce; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the
> mbuf metadata
> 
> Hi Bruce,
> 
> On 09/03/2014 05:49 PM, Bruce Richardson wrote:
> > Removed the explicit zero-sized metadata definition at the end of the
> > mbuf data structure. Updated the metadata macros to take account of this
> > change so that all existing code which uses those macros still works.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > ---
> >  lib/librte_mbuf/rte_mbuf.h | 22 ++++++++--------------
> >  1 file changed, 8 insertions(+), 14 deletions(-)
> >
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> > index 5260001..ca66d9a 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -166,31 +166,25 @@ struct rte_mbuf {
> >  	struct rte_mempool *pool; /**< Pool from which mbuf was allocated.
> */
> >  	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> >
> > -	union {
> > -		uint8_t metadata[0];
> > -		uint16_t metadata16[0];
> > -		uint32_t metadata32[0];
> > -		uint64_t metadata64[0];
> > -	} __rte_cache_aligned;
> >  } __rte_cache_aligned;
> >
> >  #define RTE_MBUF_METADATA_UINT8(mbuf, offset)              \
> > -	(mbuf->metadata[offset])
> > +	(((uint8_t *)&(mbuf)[1])[offset])
> >  #define RTE_MBUF_METADATA_UINT16(mbuf, offset)             \
> > -	(mbuf->metadata16[offset/sizeof(uint16_t)])
> > +	(((uint16_t *)&(mbuf)[1])[offset/sizeof(uint16_t)])
> >  #define RTE_MBUF_METADATA_UINT32(mbuf, offset)             \
> > -	(mbuf->metadata32[offset/sizeof(uint32_t)])
> > +	(((uint32_t *)&(mbuf)[1])[offset/sizeof(uint32_t)])
> >  #define RTE_MBUF_METADATA_UINT64(mbuf, offset)             \
> > -	(mbuf->metadata64[offset/sizeof(uint64_t)])
> > +	(((uint64_t *)&(mbuf)[1])[offset/sizeof(uint64_t)])
> >
> >  #define RTE_MBUF_METADATA_UINT8_PTR(mbuf, offset)          \
> > -	(&mbuf->metadata[offset])
> > +	(&RTE_MBUF_METADATA_UINT8(mbuf, offset))
> >  #define RTE_MBUF_METADATA_UINT16_PTR(mbuf, offset)         \
> > -	(&mbuf->metadata16[offset/sizeof(uint16_t)])
> > +	(&RTE_MBUF_METADATA_UINT16(mbuf, offset))
> >  #define RTE_MBUF_METADATA_UINT32_PTR(mbuf, offset)         \
> > -	(&mbuf->metadata32[offset/sizeof(uint32_t)])
> > +	(&RTE_MBUF_METADATA_UINT32(mbuf, offset))
> >  #define RTE_MBUF_METADATA_UINT64_PTR(mbuf, offset)         \
> > -	(&mbuf->metadata64[offset/sizeof(uint64_t)])
> > +	(&RTE_MBUF_METADATA_UINT64(mbuf, offset))
> >
> >  /**
> >   * Given the buf_addr returns the pointer to corresponding mbuf.
> >
> 
> I think it goes in the good direction. So:
> Acked-by: Olivier Matz <olivier.matz@6wind.com>
> 
> Just one question: why not removing RTE_MBUF_METADATA*() macros?
> I'd just provide one macro that gives a (void*) to the first byte
> after the mbuf structure.
> 
> The format of the metadata is up to the application, that usually
> casts (m + 1) into a private structure, making the macros not very
> useful. I suggest to move these macros outside rte_mbuf.h, in the
> application-specific or library-specific header, what do you think?
> 
> Regards,
> Olivier
> 
Yes, I'll look into that.

/Bruce

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata
  2014-09-12 16:56       ` Dumitrescu, Cristian
@ 2014-09-12 21:02         ` Olivier MATZ
  2014-09-16 20:07           ` Dumitrescu, Cristian
  0 siblings, 1 reply; 62+ messages in thread
From: Olivier MATZ @ 2014-09-12 21:02 UTC (permalink / raw)
  To: Dumitrescu, Cristian, Richardson, Bruce, dev

Hello Cristian,

> What is the reason to remove this field? Please explain the
> rationale of removing this field.

The rationale is explained in
http://dpdk.org/ml/archives/dev/2014-September/005232.html

"The format of the metadata is up to the application".

The type of data the application stores after the mbuf does not have
to be defined in the mbuf. These macros limit the types of
metadata to uint8, uint16, uint32, uint64. What should I do
if I need a void *, or a struct foo? Should we add a macro for
each possible type?

> We previously agreed we need to provide an easy and standard
> mechanism for applications to extend the mandatory per buffer
> metadata (mbuf) with optional application-dependent
> metadata.

Defining a structure in the application which does not pollute
the rte_mbuf structure is "easy and standard(TM)" too.

> This field just provides a clean way for the apps to
> know where the end of the mandatory metadata is, i.e. the first
> location in the packet buffer where the app can add its own
> metadata (of course, the app has to manage the headroom space
> before the first byte of packet data). A zero-size field is the
> standard mechanism that DPDK uses extensively in pretty much
> every library to access memory immediately after a header
> structure.

Having the following is clean too:

struct metadata {
     ...
};

struct app_mbuf {
     struct rte_mbuf mbuf;
     struct metadata metadata;
};

There is no need to define anything in the mbuf structure.
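
For example (sketch, with a hypothetical flow_id metadata member):

struct app_mbuf *am = (struct app_mbuf *)m;
/* valid as long as the mempool element size covers struct app_mbuf */
am->metadata.flow_id = 42;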

> 
> The impact of removing this field is that there is no standard
> way to identify where the end of the mandatory metadata is, so
> each application will have to reinvent this. With no clear
> convention, we will end up with a lot of non-standard ways. Every
> time the format of the mbuf structure is going to be changed,
> this can potentially break applications that use custom metadata,
> while using this simple standard mechanism would prevent this. So
> why remove this?

Waow. Five occurences of "standard" until now. Could you give a
reference to the standard you're refering to? :)

Our application defines private metadata in mbufs in the way described
above; we have never changed that for as long as we have supported the
dpdk. So I don't understand when you say that each time the mbuf is
reworked it breaks the application.

> Having applications define their optional meta-data is a real
> need. 

Sure. This patch does not prevent that at all. You can continue
to do exactly the same thing, but in the application concerned, not
in the generic mbuf structure.

> Please take a look at the Service Chaining IETF emerging protocols
> protocols (https://datatracker.ietf.org/wg/sfc/documents/), which
> provide standard mechanisms for applications to define their own

Six :)

I'm not sure these documents define the way to extend a packet
structure with metadata in a C program. Again, Bruce's patch does
not prevent you from doing what you need; it just moves it to the
proper place.

> packet meta-data and share it between the elements of the
> processing pipeline (for Service Chaining, these are typically
> virtual machines scattered amongst the data center).
> 
> And, in my opinion, there is no negative impact/cost associated
> with keeping this field.

To summarize what I think:

- this patch does not prevent you from doing what you want to do
- removing the macros helps keep the mbuf structure shorter and more
  comprehensible
- the previous approach does not scale because it would require
  a macro per type


> --------------------------------------------------------------
> Intel Shannon Limited
> Registered in Ireland
> Registered Office: Collinstown Industrial Park, Leixlip, County Kildare
> Registered Number: 308263
> Business address: Dromore House, East Park, Shannon, Co. Clare
> 
> This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.

This is a public mailing list, this disclaimer is invalid.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH v2 02/13] mbuf: reorder fields by time of use
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 02/13] mbuf: reorder fields by time of use Bruce Richardson
@ 2014-09-15  7:11     ` Liu, Jijiang
  2014-09-15  8:19       ` Richardson, Bruce
  0 siblings, 1 reply; 62+ messages in thread
From: Liu, Jijiang @ 2014-09-15  7:11 UTC (permalink / raw)
  To: Richardson, Bruce, dev

Hi Bruce,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Thursday, September 11, 2014 9:16 PM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH v2 02/13] mbuf: reorder fields by time of use
> 
> *  Reorder the fields in the mbuf so that we have fields that are used together
> side-by-side in the structure. This means that we have a contiguous block of
> 8 bytes in the mbuf which is used to reset an mbuf on descriptor rearm, and a
> block of 16 bytes of data (excluding flags) which is set on RX from the
> received packet descriptor.
> * Use dummy fields as appropriate to ensure alignment or to reserve gaps for
> later field additions.
> * Place most items which are not used by fast-path RX separately at the end of
> the structure so they can later be moved to a separate cache line.
> [The l2/l3 length fields are not moved at this stage as doing so will cause
> overflow to the next cache line].
> 
> Updated in V2:
> * Updated the KNI buffer structure to match that of new mbuf
> * Cleaned up commit message typos.
> 
> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
>  .../linuxapp/eal/include/exec-env/rte_kni_common.h | 12 ++++++-----
>  lib/librte_mbuf/rte_mbuf.h                         | 25
> ++++++++++++----------
>  2 files changed, 21 insertions(+), 16 deletions(-)
> 
> diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> index 07908ac..f2b502c 100644
> --- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> +++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> @@ -108,15 +108,17 @@ struct rte_kni_fifo {
>   * Padding is necessary to assure the offsets of these fields
>   */
>  struct rte_kni_mbuf {
> -	void *pool;
>  	void *buf_addr;
> -	char pad0[16];
> -	void *next;
> +	char pad0[10];
>  	uint16_t data_off;      /**< Start address of data in segment buffer. */
> +	char pad1[4];
> +	uint16_t ol_flags;      /**< Offload features. */
> +	char pad2[8];
>  	uint16_t data_len;      /**< Amount of data in segment buffer. */
>  	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len.
> */
> -	char pad2[4];
> -	uint16_t ol_flags;      /**< Offload features. */
> +	char pad3[8];
> +	void *pool;
> +	void *next;
>  } __attribute__((__aligned__(64)));
> 
>  /*
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h index
> 71080e5..413a5a1 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -115,16 +115,12 @@ extern "C" {
>   * The generic rte_mbuf, containing a packet mbuf.
>   */
>  struct rte_mbuf {
> -	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
>  	void *buf_addr;           /**< Virtual address of segment buffer. */
>  	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
> -	uint16_t buf_len;         /**< Length of segment buffer. */
> 
> -	/* valid for any segment */
> -	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> +	/* next 8 bytes are initialised on RX descriptor rearm */
> +	uint16_t buf_len;         /**< Length of segment buffer. */
>  	uint16_t data_off;
> -	uint16_t data_len;        /**< Amount of data in segment buffer. */
> -	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
> 
>  #ifdef RTE_MBUF_REFCNT
>  	/**
> @@ -142,14 +138,17 @@ struct rte_mbuf {
>  #else
>  	uint16_t refcnt_reserved;     /**< Do not use this field */
>  #endif
> -	uint16_t reserved;            /**< Unused field. Required for padding
> */
> -	uint16_t ol_flags;            /**< Offload features. */
> -
> -	/* these fields are valid for first segment only */
>  	uint8_t nb_segs;        /**< Number of segments. */
>  	uint8_t port;           /**< Input port. */
> 
> -	/* offload features, valid for first segment only */
> +	uint16_t ol_flags;      /**< Offload features. */
> +	uint16_t reserved0;     /**< Unused field. Required for padding */
> +	uint32_t reserved1;     /**< Unused field. Required for padding */
> +
> +	/* remaining bytes are set on RX when pulling packet from descriptor */
> +	uint16_t reserved2;     /**< Unused field. Required for padding */
> +	uint16_t data_len;      /**< Amount of data in segment buffer. */
> +	uint32_t pkt_len;       /**< Total pkt len: sum of all segments. */
>  	union {
>  		uint16_t l2_l3_len; /**< combined l2/l3 lengths as single var */
>  		struct {
> @@ -167,6 +166,10 @@ struct rte_mbuf {
>  		uint32_t sched;     /**< Hierarchical scheduler */
>  	} hash;                 /**< hash information */
> 
> +	/* fields only used in slow path or on TX */
> +	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
> +	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> +
>  	union {
>  		uint8_t metadata[0];
>  		uint16_t metadata16[0];
> --
> 1.9.3

It seems that you removed the inner_l3_len and inner_l2_len fields from the rte_mbuf structure; what is the reason?

/Jijiang Liu 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH v2 02/13] mbuf: reorder fields by time of use
  2014-09-15  7:11     ` Liu, Jijiang
@ 2014-09-15  8:19       ` Richardson, Bruce
  0 siblings, 0 replies; 62+ messages in thread
From: Richardson, Bruce @ 2014-09-15  8:19 UTC (permalink / raw)
  To: Liu, Jijiang, dev

> -----Original Message-----
> From: Liu, Jijiang
> Sent: Monday, September 15, 2014 8:12 AM
> To: Richardson, Bruce; dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH v2 02/13] mbuf: reorder fields by time of use
> 
> Hi Bruce,
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> > Sent: Thursday, September 11, 2014 9:16 PM
> > To: dev@dpdk.org
> > Subject: [dpdk-dev] [PATCH v2 02/13] mbuf: reorder fields by time of use
> >
> > *  Reorder the fields in the mbuf so that we have fields that are used together
> > side-by-side in the structure. This means that we have a contiguous block of
> > 8 bytes in the mbuf which is used to reset an mbuf on descriptor rearm, and a
> > block of 16 bytes of data (excluding flags) which is set on RX from the
> > received packet descriptor.
> > * Use dummy fields as appropriate to ensure alignment or to reserve gaps for
> > later field additions.
> > * Place most items which are not used by fast-path RX separately at the end of
> > the structure so they can later be moved to a separate cache line.
> > [The l2/l3 length fields are not moved at this stage as doing so will cause
> > overflow to the next cache line].
> >
> > Updated in V2:
> > * Updated the KNI buffer structure to match that of new mbuf
> > * Cleaned up commit message typos.
> >
> > Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
> > ---
> >  .../linuxapp/eal/include/exec-env/rte_kni_common.h | 12 ++++++-----
> >  lib/librte_mbuf/rte_mbuf.h                         | 25
> > ++++++++++++----------
> >  2 files changed, 21 insertions(+), 16 deletions(-)
> >
> > diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> > b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> > index 07908ac..f2b502c 100644
> > --- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> > +++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
> > @@ -108,15 +108,17 @@ struct rte_kni_fifo {
> >   * Padding is necessary to assure the offsets of these fields
> >   */
> >  struct rte_kni_mbuf {
> > -	void *pool;
> >  	void *buf_addr;
> > -	char pad0[16];
> > -	void *next;
> > +	char pad0[10];
> >  	uint16_t data_off;      /**< Start address of data in segment buffer. */
> > +	char pad1[4];
> > +	uint16_t ol_flags;      /**< Offload features. */
> > +	char pad2[8];
> >  	uint16_t data_len;      /**< Amount of data in segment buffer. */
> >  	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len.
> > */
> > -	char pad2[4];
> > -	uint16_t ol_flags;      /**< Offload features. */
> > +	char pad3[8];
> > +	void *pool;
> > +	void *next;
> >  } __attribute__((__aligned__(64)));
> >
> >  /*
> > diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h index
> > 71080e5..413a5a1 100644
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -115,16 +115,12 @@ extern "C" {
> >   * The generic rte_mbuf, containing a packet mbuf.
> >   */
> >  struct rte_mbuf {
> > -	struct rte_mempool *pool; /**< Pool from which mbuf was allocated.
> */
> >  	void *buf_addr;           /**< Virtual address of segment buffer. */
> >  	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
> > -	uint16_t buf_len;         /**< Length of segment buffer. */
> >
> > -	/* valid for any segment */
> > -	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> > +	/* next 8 bytes are initialised on RX descriptor rearm */
> > +	uint16_t buf_len;         /**< Length of segment buffer. */
> >  	uint16_t data_off;
> > -	uint16_t data_len;        /**< Amount of data in segment buffer. */
> > -	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
> >
> >  #ifdef RTE_MBUF_REFCNT
> >  	/**
> > @@ -142,14 +138,17 @@ struct rte_mbuf {
> >  #else
> >  	uint16_t refcnt_reserved;     /**< Do not use this field */
> >  #endif
> > -	uint16_t reserved;            /**< Unused field. Required for padding
> > */
> > -	uint16_t ol_flags;            /**< Offload features. */
> > -
> > -	/* these fields are valid for first segment only */
> >  	uint8_t nb_segs;        /**< Number of segments. */
> >  	uint8_t port;           /**< Input port. */
> >
> > -	/* offload features, valid for first segment only */
> > +	uint16_t ol_flags;      /**< Offload features. */
> > +	uint16_t reserved0;     /**< Unused field. Required for padding */
> > +	uint32_t reserved1;     /**< Unused field. Required for padding */
> > +
> > +	/* remaining bytes are set on RX when pulling packet from descriptor */
> > +	uint16_t reserved2;     /**< Unused field. Required for padding */
> > +	uint16_t data_len;      /**< Amount of data in segment buffer. */
> > +	uint32_t pkt_len;       /**< Total pkt len: sum of all segments. */
> >  	union {
> >  		uint16_t l2_l3_len; /**< combined l2/l3 lengths as single var */
> >  		struct {
> > @@ -167,6 +166,10 @@ struct rte_mbuf {
> >  		uint32_t sched;     /**< Hierarchical scheduler */
> >  	} hash;                 /**< hash information */
> >
> > +	/* fields only used in slow path or on TX */
> > +	struct rte_mempool *pool; /**< Pool from which mbuf was allocated.
> */
> > +	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> > +
> >  	union {
> >  		uint8_t metadata[0];
> >  		uint16_t metadata16[0];
> > --
> > 1.9.3
> 
> It seems that you removed the inner_l3_len and inner_l2_len fields from the
> rte_mbuf structure; what is the reason?
> 
> /Jijiang Liu

No, those fields were never added to the mbuf; they were simply proposed as a new field addition in an RFC patch set. Adding new fields to the mbuf is beyond the scope of the two patch sets already submitted. Those patch sets merely reorder the existing l2 and l3 length fields without changing their size or name.

/Bruce
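
As context for the quoted commit message's "8 bytes are initialised on RX descriptor rearm": grouping buf_len, data_off, refcnt and nb_segs/port contiguously lets a driver reset all of them with a single 64-bit store. A minimal sketch (illustrative only, not code from the patch set; it assumes the field layout shown in the diff above and ignores strict-aliasing concerns):

#include <stdint.h>
#include <rte_mbuf.h>

/* buf_len(2) + data_off(2) + refcnt(2) + nb_segs(1) + port(1) = 8 bytes */
static inline void
rearm_mbuf_sketch(struct rte_mbuf *m, uint64_t mbuf_init)
{
	/* mbuf_init would be precomputed once at queue-setup time */
	*(uint64_t *)(uintptr_t)&m->buf_len = mbuf_init;
}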

^ permalink raw reply	[flat|nested] 62+ messages in thread

* [dpdk-dev] [PATCH v3 12/13] ixgbe: Fix perf regression due to moved pool ptr
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 12/13] ixgbe: Fix perf regression due to moved pool ptr Bruce Richardson
@ 2014-09-15 16:20     ` Bruce Richardson
  0 siblings, 0 replies; 62+ messages in thread
From: Bruce Richardson @ 2014-09-15 16:20 UTC (permalink / raw)
  To: dev

Adjust the fast-path code to fix the regression caused by the pool
pointer moving to the second cache line. This change adjusts the
prefetching and also the way in which the mbufs are freed back to the
mempool.
Note: the slow path, e.g. the path supporting jumbo frames, is still
slower, but this is dealt with by a later commit.
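
The replacement scheme, visible in the ixgbe_tx_free_bufs() hunk below, batches consecutive mbufs that share a mempool and returns each batch with a single rte_mempool_put_bulk() call. Reduced to a standalone sketch (simplified, not the exact driver code; the batch size is illustrative):

#define FREE_BATCH 64	/* illustrative batch size */

static void
free_bufs_sketch(struct rte_mbuf **mbufs, unsigned int n)
{
	struct rte_mbuf *batch[FREE_BATCH];
	unsigned int nb = 0, i;

	for (i = 0; i < n; i++) {
		struct rte_mbuf *m = mbufs[i];
		if (m == NULL)
			continue;
		/* flush the batch when the pool changes or it is full */
		if (nb > 0 && (m->pool != batch[0]->pool || nb == FREE_BATCH)) {
			rte_mempool_put_bulk(batch[0]->pool, (void **)batch, nb);
			nb = 0;
		}
		batch[nb++] = m;
	}
	if (nb > 0)
		rte_mempool_put_bulk(batch[0]->pool, (void **)batch, nb);
}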

Updates in V2:
* fixup checkpatch issue

Updates in V3:
* The variable definitions for freeing mbufs now need to be included
whether or not reference counting is enabled.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c     |  8 ++-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h     | 14 +-----
 lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c | 94 +++++++++++++----------------------
 3 files changed, 38 insertions(+), 78 deletions(-)

diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 1a46393..d6448a4 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -142,10 +142,6 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 	 */
 	txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);
 
-	/* prefetch the mbufs that are about to be freed */
-	for (i = 0; i < txq->tx_rs_thresh; ++i)
-		rte_prefetch0((txep + i)->mbuf);
-
 	/* free buffers one at a time */
 	if ((txq->txq_flags & (uint32_t)ETH_TXQ_FLAGS_NOREFCOUNT) != 0) {
 		for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
@@ -186,6 +182,7 @@ tx4(volatile union ixgbe_adv_tx_desc *txdp, struct rte_mbuf **pkts)
 				((uint32_t)DCMD_DTYP_FLAGS | pkt_len);
 		txdp->read.olinfo_status =
 				(pkt_len << IXGBE_ADVTXD_PAYLEN_SHIFT);
+		rte_prefetch0(&(*pkts)->pool);
 	}
 }
 
@@ -205,6 +202,7 @@ tx1(volatile union ixgbe_adv_tx_desc *txdp, struct rte_mbuf **pkts)
 			((uint32_t)DCMD_DTYP_FLAGS | pkt_len);
 	txdp->read.olinfo_status =
 			(pkt_len << IXGBE_ADVTXD_PAYLEN_SHIFT);
+	rte_prefetch0(&(*pkts)->pool);
 }
 
 /*
@@ -1875,7 +1873,7 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
 		PMD_INIT_LOG(INFO, "Using simple tx code path\n");
 #ifdef RTE_IXGBE_INC_VECTOR
 		if (txq->tx_rs_thresh <= RTE_IXGBE_TX_MAX_FREE_BUF_SZ &&
-		    ixgbe_txq_vec_setup(txq, socket_id) == 0) {
+		    ixgbe_txq_vec_setup(txq) == 0) {
 			PMD_INIT_LOG(INFO, "Vector tx enabled.\n");
 			dev->tx_pkt_burst = ixgbe_xmit_pkts_vec;
 		}
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index e92a864..a97fddb 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -96,14 +96,6 @@ struct igb_tx_entry_v {
 };
 
 /**
- * continuous entry sequence, gather by the same mempool
- */
-struct igb_tx_entry_seq {
-	const struct rte_mempool* pool;
-	uint32_t same_pool;
-};
-
-/**
  * Structure associated with each RX queue.
  */
 struct igb_rx_queue {
@@ -190,10 +182,6 @@ struct igb_tx_queue {
 	volatile union ixgbe_adv_tx_desc *tx_ring;
 	uint64_t            tx_ring_phys_addr; /**< TX ring DMA address. */
 	struct igb_tx_entry *sw_ring;      /**< virtual address of SW ring. */
-#ifdef RTE_IXGBE_INC_VECTOR
-	/** continuous tx entry sequence within the same mempool */
-	struct igb_tx_entry_seq *sw_ring_seq;
-#endif
 	volatile uint32_t   *tdt_reg_addr; /**< Address of TDT register. */
 	uint16_t            nb_tx_desc;    /**< number of TX descriptors. */
 	uint16_t            tx_tail;       /**< current value of TDT reg. */
@@ -258,7 +246,7 @@ struct ixgbe_txq_ops {
 #ifdef RTE_IXGBE_INC_VECTOR
 uint16_t ixgbe_recv_pkts_vec(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts);
 uint16_t ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts);
-int ixgbe_txq_vec_setup(struct igb_tx_queue *txq, unsigned int socket_id);
+int ixgbe_txq_vec_setup(struct igb_tx_queue *txq);
 int ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq);
 int ixgbe_rx_vec_condition_check(struct rte_eth_dev *dev);
 #endif
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
index d53e239..9869b8b 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx_vec.c
@@ -342,14 +342,11 @@ static inline int __attribute__((always_inline))
 ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 {
 	struct igb_tx_entry_v *txep;
-	struct igb_tx_entry_seq *txsp;
 	uint32_t status;
-	uint32_t n, k;
-#ifdef RTE_MBUF_REFCNT
+	uint32_t n;
 	uint32_t i;
 	int nb_free = 0;
 	struct rte_mbuf *m, *free[RTE_IXGBE_TX_MAX_FREE_BUF_SZ];
-#endif
 
 	/* check DD bit on threshold descriptor */
 	status = txq->tx_ring[txq->tx_next_dd].wb.status;
@@ -364,23 +361,38 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 	 */
 	txep = &((struct igb_tx_entry_v *)txq->sw_ring)[txq->tx_next_dd -
 			(n - 1)];
-	txsp = &txq->sw_ring_seq[txq->tx_next_dd - (n - 1)];
-
-	while (n > 0) {
-		k = RTE_MIN(n, txsp[n-1].same_pool);
 #ifdef RTE_MBUF_REFCNT
-		for (i = 0; i < k; i++) {
-			m = __rte_pktmbuf_prefree_seg((txep+n-k+i)->mbuf);
-			if (m != NULL)
-				free[nb_free++] = m;
-		}
-		rte_mempool_put_bulk((void *)txsp[n-1].pool,
-				(void **)free, nb_free);
+	m = __rte_pktmbuf_prefree_seg(txep[0].mbuf);
 #else
-		rte_mempool_put_bulk((void *)txsp[n-1].pool,
-				(void **)(txep+n-k), k);
+	m = txep[0].mbuf;
 #endif
-		n -= k;
+	if (likely(m != NULL)) {
+		free[0] = m;
+		nb_free = 1;
+		for (i = 1; i < n; i++) {
+#ifdef RTE_MBUF_REFCNT
+			m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
+#else
+			m = txep[i].mbuf;
+#endif
+			if (likely(m != NULL)) {
+				if (likely(m->pool == free[0]->pool))
+					free[nb_free++] = m;
+				else {
+					rte_mempool_put_bulk(free[0]->pool,
+							(void *)free, nb_free);
+					free[0] = m;
+					nb_free = 1;
+				}
+			}
+		}
+		rte_mempool_put_bulk(free[0]->pool, (void **)free, nb_free);
+	} else {
+		for (i = 1; i < n; i++) {
+			m = __rte_pktmbuf_prefree_seg(txep[i].mbuf);
+			if (m != NULL)
+				rte_mempool_put(m->pool, m);
+		}
 	}
 
 	/* buffers were freed, update counters */
@@ -394,19 +406,11 @@ ixgbe_tx_free_bufs(struct igb_tx_queue *txq)
 
 static inline void __attribute__((always_inline))
 tx_backlog_entry(struct igb_tx_entry_v *txep,
-		 struct igb_tx_entry_seq *txsp,
 		 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 {
 	int i;
-	for (i = 0; i < (int)nb_pkts; ++i) {
+	for (i = 0; i < (int)nb_pkts; ++i)
 		txep[i].mbuf = tx_pkts[i];
-		/* check and update sequence number */
-		txsp[i].pool = tx_pkts[i]->pool;
-		if (txsp[i-1].pool == tx_pkts[i]->pool)
-			txsp[i].same_pool = txsp[i-1].same_pool + 1;
-		else
-			txsp[i].same_pool = 1;
-	}
 }
 
 uint16_t
@@ -416,7 +420,6 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 	struct igb_tx_queue *txq = (struct igb_tx_queue *)tx_queue;
 	volatile union ixgbe_adv_tx_desc *txdp;
 	struct igb_tx_entry_v *txep;
-	struct igb_tx_entry_seq *txsp;
 	uint16_t n, nb_commit, tx_id;
 	uint64_t flags = DCMD_DTYP_FLAGS;
 	uint64_t rs = IXGBE_ADVTXD_DCMD_RS|DCMD_DTYP_FLAGS;
@@ -435,14 +438,13 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 	tx_id = txq->tx_tail;
 	txdp = &txq->tx_ring[tx_id];
 	txep = &((struct igb_tx_entry_v *)txq->sw_ring)[tx_id];
-	txsp = &txq->sw_ring_seq[tx_id];
 
 	txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
 
 	n = (uint16_t)(txq->nb_tx_desc - tx_id);
 	if (nb_commit >= n) {
 
-		tx_backlog_entry(txep, txsp, tx_pkts, n);
+		tx_backlog_entry(txep, tx_pkts, n);
 
 		for (i = 0; i < n - 1; ++i, ++tx_pkts, ++txdp)
 			vtx1(txdp, *tx_pkts, flags);
@@ -457,10 +459,9 @@ ixgbe_xmit_pkts_vec(void *tx_queue, struct rte_mbuf **tx_pkts,
 		/* avoid reach the end of ring */
 		txdp = &(txq->tx_ring[tx_id]);
 		txep = &(((struct igb_tx_entry_v *)txq->sw_ring)[tx_id]);
-		txsp = &(txq->sw_ring_seq[tx_id]);
 	}
 
-	tx_backlog_entry(txep, txsp, tx_pkts, nb_commit);
+	tx_backlog_entry(txep, tx_pkts, nb_commit);
 
 	vtx(txdp, tx_pkts, nb_commit, flags);
 
@@ -484,7 +485,6 @@ ixgbe_tx_queue_release_mbufs(struct igb_tx_queue *txq)
 {
 	unsigned i;
 	struct igb_tx_entry_v *txe;
-	struct igb_tx_entry_seq *txs;
 	uint16_t nb_free, max_desc;
 
 	if (txq->sw_ring != NULL) {
@@ -502,10 +502,6 @@ ixgbe_tx_queue_release_mbufs(struct igb_tx_queue *txq)
 		for (i = 0; i < txq->nb_tx_desc; i++) {
 			txe = (struct igb_tx_entry_v *)&txq->sw_ring[i];
 			txe->mbuf = NULL;
-
-			txs = &txq->sw_ring_seq[i];
-			txs->pool = NULL;
-			txs->same_pool = 0;
 		}
 	}
 }
@@ -520,11 +516,6 @@ ixgbe_tx_free_swring(struct igb_tx_queue *txq)
 		rte_free((struct igb_rx_entry *)txq->sw_ring - 1);
 		txq->sw_ring = NULL;
 	}
-
-	if (txq->sw_ring_seq != NULL) {
-		rte_free(txq->sw_ring_seq - 1);
-		txq->sw_ring_seq = NULL;
-	}
 }
 
 static void
@@ -533,7 +524,6 @@ ixgbe_reset_tx_queue(struct igb_tx_queue *txq)
 	static const union ixgbe_adv_tx_desc zeroed_desc = { .read = {
 			.buffer_addr = 0} };
 	struct igb_tx_entry_v *txe = (struct igb_tx_entry_v *)txq->sw_ring;
-	struct igb_tx_entry_seq *txs = txq->sw_ring_seq;
 	uint16_t i;
 
 	/* Zero out HW ring memory */
@@ -545,8 +535,6 @@ ixgbe_reset_tx_queue(struct igb_tx_queue *txq)
 		volatile union ixgbe_adv_tx_desc *txd = &txq->tx_ring[i];
 		txd->wb.status = IXGBE_TXD_STAT_DD;
 		txe[i].mbuf = NULL;
-		txs[i].pool = NULL;
-		txs[i].same_pool = 0;
 	}
 
 	txq->tx_next_dd = (uint16_t)(txq->tx_rs_thresh - 1);
@@ -588,28 +576,14 @@ ixgbe_rxq_vec_setup(struct igb_rx_queue *rxq)
 	return 0;
 }
 
-int ixgbe_txq_vec_setup(struct igb_tx_queue *txq,
-			unsigned int socket_id)
+int ixgbe_txq_vec_setup(struct igb_tx_queue *txq)
 {
-	uint16_t nb_desc;
-
 	if (txq->sw_ring == NULL)
 		return -1;
 
-	/* request addtional one entry for continous sequence check */
-	nb_desc = (uint16_t)(txq->nb_tx_desc + 1);
-
-	txq->sw_ring_seq = rte_zmalloc_socket("txq->sw_ring_seq",
-				sizeof(struct igb_tx_entry_seq) * nb_desc,
-				CACHE_LINE_SIZE, socket_id);
-	if (txq->sw_ring_seq == NULL)
-		return -1;
-
-
 	/* leave the first one for overflow */
 	txq->sw_ring = (struct igb_tx_entry *)
 		((struct igb_tx_entry_v *)txq->sw_ring + 1);
-	txq->sw_ring_seq += 1;
 	txq->ops = &vec_txq_ops;
 
 	return 0;
-- 
1.9.3

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata
  2014-09-12 21:02         ` Olivier MATZ
@ 2014-09-16 20:07           ` Dumitrescu, Cristian
  2014-09-16 22:06             ` Ramia, Kannan Babu
  0 siblings, 1 reply; 62+ messages in thread
From: Dumitrescu, Cristian @ 2014-09-16 20:07 UTC (permalink / raw)
  To: Olivier MATZ, Richardson, Bruce, dev

Hi Olivier,

I agree that your suggested approach for application-dependent metadata makes sense; in fact, the two approaches work in exactly the same way (packet metadata immediately after the regular mbuf). There is only a subtle difference, which relates to defining consistent DPDK usage guidelines.

1. Advertising the presence of application-dependent meta-data as a supported mechanism
If we explicitly have a metadata zero-size field at the end of the mbuf, we basically tell people that adding their own application meta-data at the end of the mandatory meta-data (mbuf structure) is a mechanism that DPDK allows and supports, and will continue to do so for the foreseeable future. In other words, we guarantee that an application doing so will continue to build successfully with future releases of DPDK, and we will not introduce changes in DPDK that could potentially break this mechanism. It is also a hint to people about where to put their application-dependent meta-data.

2. Defining a standard base address for the application-dependent metadata
- There are also libraries in DPDK that work with application dependent meta-data, currently these are the Packet Framework libraries: librte_port, librte_table, librte_pipeline. Of course, the library does not have the knowledge of the application dependent meta-data format, so they treat it as opaque array of bytes, with the offset and size of the array given as arguments. In my opinion, it is safer (and more elegant) if these libraries (and others) can rely on an mbuf API to access the application dependent meta-data (in an opaque way) rather than make an assumption about the mbuf (i.e. the location of custom metadata relative to the mbuf) that is not clearly supported/defined by the mbuf library. 
- By having this API, we basically say: we define the custom meta-data base address (first location where custom metadata _could_ be placed) immediately after the mbuf, so libraries and apps accessing custom meta-data should do so by using a relative offset from this base rather than each application defining its own base: immediately after mbuf, or 128 bytes after mbuf, or 64 bytes before the end of the buffer, or other.
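
To make the second point concrete, the offset-based, opaque access described here can be sketched as follows (illustrative only; not the exact macros that later moved into rte_port.h):

/* Metadata is an opaque byte array at 'offset' from the agreed base,
 * i.e. the first byte after the mandatory struct rte_mbuf. */
#define MBUF_METADATA_PTR(m, offset) \
	((void *)((char *)((m) + 1) + (offset)))

uint32_t *flow_id = MBUF_METADATA_PTR(pkt, 0);	/* pkt: struct rte_mbuf * */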

More (minor) comments inline below.

Thanks,
Cristian

-----Original Message-----
From: Olivier MATZ [mailto:olivier.matz@6wind.com] 
Sent: Friday, September 12, 2014 10:02 PM
To: Dumitrescu, Cristian; Richardson, Bruce; dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata

Hello Cristian,

> What is the reason to remove this field? Please explain the
> rationale of removing this field.

The rationale is explained in
http://dpdk.org/ml/archives/dev/2014-September/005232.html

"The format of the metadata is up to the application".

The type of data the application stores after the mbuf does not
have to be defined in the mbuf. These macros limit the types of
metadata to uint8, uint16, uint32 and uint64. What should I do
if I need a void *, or a struct foo? Should we add a macro for
each possible type?

[Cristian] Actually, this is not correct, as macros to access metadata through pointers (to void or scalar types) are provided as well. This pointer can be converted by the application to the format it defines.

> We previously agreed we need to provide an easy and standard
> mechanism for applications to extend the mandatory per buffer
> metadata (mbuf) with optional application-dependent
> metadata.

Defining a structure in the application which does not pollute
the rte_mbuf structure is "easy and standard(TM)" too.

[Cristian] I agree, both approaches work the same really; it is just the difference in advertising the presence of meta-data as a supported mechanism and defining a standard base address for it.

> This field just provides a clean way for the apps to
> know where is the end of the mandatory metadata, i.e. the first
> location in the packet buffer where the app can add its own
> metadata (of course, the app has to manage the headroom space
> before the first byte of packet data). A zero-size field is the
> standard mechanism that DPDK uses extensively in pretty much
> every library to access memory immediately after a header
> structure.

Having the following is clean too:

struct metadata {
     ...
};

struct app_mbuf {
     struct rte_mbuf mbuf;
     struct metadata metadata;
};

There is no need to define anything in the mbuf structure.

[Cristian] I agree, both approaches work the same really; it is just the difference in advertising the presence of meta-data as a supported mechanism and defining a standard base address for it.

> 
> The impact of removing this field is that there is no standard
> way to identify where the end of the mandatory metadata is, so
> each application will have to reinvent this. With no clear
> convention, we will end up with a lot of non-standard ways. Every
> time the format of the mbuf structure is going to be changed,
> this can potentially break applications that use custom metadata,
> while using this simple standard mechanism would prevent this. So
> why remove this?

Wow. Five occurrences of "standard" so far.
[Cristian] I am sorry :)

Could you give a
reference to the standard you're referring to? :)

[Cristian] See the IETF Service Function Chaining link below, the environment is different (data center pipeline vs. CPU core-level pipeline), but the concepts are very similar.

Our application defines private metadata in mbufs in the way described
above, and we have never changed that since we started supporting the
DPDK. So I don't understand your point that the application breaks
every time the mbuf is reworked.

> Having applications define their optional meta-data is a real
> need. 

Sure. This patch does not prevent that at all. You can continue
to do exactly the same thing, but in the application concerned, not
in the generic mbuf structure.


> Please take a look at the Service Chaining IETF emerging protocols
> protocols (https://datatracker.ietf.org/wg/sfc/documents/), which
> provide standard mechanisms for applications to define their own

Six :)

I'm not sure these documents define the way to extend a packet
structure with metadata in a C program. Again, Bruce's patch does
not prevent you from doing what you need; it just moves it to the
proper place.

> packet meta-data and share it between the elements of the
> processing pipeline (for Service Chaining, these are typically
> virtual machines scattered amongst the data center).
> 
> And, in my opinion, there is no negative impact/cost associated
> with keeping this field.

To summarize what I think:

- this patch does not prevent you from doing what you want to do
- removing the macros helps keep the mbuf structure shorter and more
  comprehensible
- the previous approach does not scale because it would require
  a macro per type


> --------------------------------------------------------------
> Intel Shannon Limited
> Registered in Ireland
> Registered Office: Collinstown Industrial Park, Leixlip, County Kildare
> Registered Number: 308263
> Business address: Dromore House, East Park, Shannon, Co. Clare
> 
> This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.

This is a public mailing list, this disclaimer is invalid.

Regards,
Olivier


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata
  2014-09-16 20:07           ` Dumitrescu, Cristian
@ 2014-09-16 22:06             ` Ramia, Kannan Babu
  2014-09-17 10:31               ` Richardson, Bruce
  0 siblings, 1 reply; 62+ messages in thread
From: Ramia, Kannan Babu @ 2014-09-16 22:06 UTC (permalink / raw)
  To: Dumitrescu, Cristian, Olivier MATZ, Richardson, Bruce, dev

I completely agree with Cristian here. Instead of leaving it to applications to decide where to place their metadata, we can provide guidance by having this field about the placement of application metadata, while maintaining transparency on the contents of the application metadata.

Regards
Kannan Babu Ramia
Sr.System Architect
Communication Storage Infrastructure Group
DCG
-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Dumitrescu, Cristian
Sent: Wednesday, September 17, 2014 1:37 AM
To: Olivier MATZ; Richardson, Bruce; dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata

Hi Olivier,

I agree that your suggested approach for application-dependent metadata makes sense; in fact, the two approaches work in exactly the same way (packet metadata immediately after the regular mbuf). There is only a subtle difference, which relates to defining consistent DPDK usage guidelines.

1. Advertising the presence of application-dependent meta-data as a supported mechanism. If we explicitly have a metadata zero-size field at the end of the mbuf, we basically tell people that adding their own application meta-data at the end of the mandatory meta-data (mbuf structure) is a mechanism that DPDK allows and supports, and will continue to do so for the foreseeable future. In other words, we guarantee that an application doing so will continue to build successfully with future releases of DPDK, and we will not introduce changes in DPDK that could potentially break this mechanism. It is also a hint to people about where to put their application-dependent meta-data.

2. Defining a standard base address for the application-dependent metadata
- There are also libraries in DPDK that work with application dependent meta-data, currently these are the Packet Framework libraries: librte_port, librte_table, librte_pipeline. Of course, the library does not have the knowledge of the application dependent meta-data format, so they treat it as opaque array of bytes, with the offset and size of the array given as arguments. In my opinion, it is safer (and more elegant) if these libraries (and others) can rely on an mbuf API to access the application dependent meta-data (in an opaque way) rather than make an assumption about the mbuf (i.e. the location of custom metadata relative to the mbuf) that is not clearly supported/defined by the mbuf library. 
- By having this API, we basically say: we define the custom meta-data base address (first location where custom metadata _could_ be placed) immediately after the mbuf, so libraries and apps accessing custom meta-data should do so by using a relative offset from this base rather than each application defining its own base: immediately after mbuf, or 128 bytes after mbuf, or 64 bytes before the end of the buffer, or other.

More (minor) comments inline below.

Thanks,
Cristian

-----Original Message-----
From: Olivier MATZ [mailto:olivier.matz@6wind.com]
Sent: Friday, September 12, 2014 10:02 PM
To: Dumitrescu, Cristian; Richardson, Bruce; dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata

Hello Cristian,

> What is the reason to remove this field? Please explain the rationale 
> of removing this field.

The rationale is explained in
http://dpdk.org/ml/archives/dev/2014-September/005232.html

"The format of the metadata is up to the application".

The type of data the application stores after the mbuf does not have to be defined in the mbuf. These macros limit the types of metadata to uint8, uint16, uint32 and uint64. What should I do if I need a void *, or a struct foo? Should we add a macro for each possible type?

[Cristian] Actually, this is not correct, as macros to access metadata through pointers (to void or scalar types) are provided as well. This pointer can be converted by the application to the format it defines.

> We previously agreed we need to provide an easy and standard mechanism 
> for applications to extend the mandatory per buffer metadata (mbuf) 
> with optional application-dependent metadata.

Defining a structure in the application which does not pollute the rte_mbuf structure is "easy and standard(TM)" too.

[Cristian] I agree, both approaches work the same really; it is just the difference in advertising the presence of meta-data as a supported mechanism and defining a standard base address for it.

> This field just provides a clean way for the apps to know where is the 
> end of the mandatory metadata, i.e. the first location in the packet 
> buffer where the app can add its own metadata (of course, the app has 
> to manage the headroom space before the first byte of packet data). A 
> zero-size field is the standard mechanism that DPDK uses extensively 
> in pretty much every library to access memory immediately after a 
> header structure.

Having the following is clean too:

struct metadata {
     ...
};

struct app_mbuf {
     struct rte_mbuf mbuf;
     struct metadata metadata;
};

There is no need to define anything in the mbuf structure.

[Cristian] I agree, both approaches work the same really; it is just the difference in advertising the presence of meta-data as a supported mechanism and defining a standard base address for it.

> 
> The impact of removing this field is that there is no standard way to 
> identify where the end of the mandatory metadata is, so each 
> application will have to reinvent this. With no clear convention, we 
> will end up with a lot of non-standard ways. Every time the format of 
> the mbuf structure is going to be changed, this can potentially break 
> applications that use custom metadata, while using this simple 
> standard mechanism would prevent this. So why remove this?

Wow. Five occurrences of "standard" so far.
[Cristian] I am sorry :)

Could you give a
reference to the standard you're referring to? :)

[Cristian] See the IETF Service Function Chaining link below, the environment is different (data center pipeline vs. CPU core-level pipeline), but the concepts are very similar.

Our application defines private metadata in mbufs in the way described above, and we have never changed that since we started supporting the DPDK. So I don't understand your point that the application breaks every time the mbuf is reworked.

> Having applications define their optional meta-data is a real need.

Sure. This patch does not prevent that at all. You can continue to do exactly the same thing, but in the application concerned, not in the generic mbuf structure.


> Please take a look at the Service Chaining IETF emerging protocols
> (https://datatracker.ietf.org/wg/sfc/documents/), which provide 
> standard mechanisms for applications to define their own

Six :)

I'm not sure these documents define the way to extend a packet structure with metadata in a C program. Again, Bruce's patch does not prevent you from doing what you need; it just moves it to the proper place.

> packet meta-data and share it between the elements of the processing 
> pipeline (for Service Chaining, these are typically virtual machines 
> scattered amongst the data center).
> 
> And, in my opinion, there is no negative impact/cost associated with 
> keeping this field.

To summarize what I think:

- this patch does not prevent you from doing what you want to do
- removing the macros helps keep the mbuf structure shorter and more
  comprehensible
- the previous approach does not scale because it would require
  a macro per type


> --------------------------------------------------------------
> Intel Shannon Limited
> Registered in Ireland
> Registered Office: Collinstown Industrial Park, Leixlip, County 
> Kildare Registered Number: 308263 Business address: Dromore House, 
> East Park, Shannon, Co. Clare
> 
> This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.

This is a public mailing list, this disclaimer is invalid.

Regards,
Olivier


^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata
  2014-09-16 22:06             ` Ramia, Kannan Babu
@ 2014-09-17 10:31               ` Richardson, Bruce
  2014-09-17 14:01                 ` Thomas Monjalon
  0 siblings, 1 reply; 62+ messages in thread
From: Richardson, Bruce @ 2014-09-17 10:31 UTC (permalink / raw)
  To: Ramia, Kannan Babu, Dumitrescu, Cristian, Olivier MATZ, dev

> -----Original Message-----
> From: Ramia, Kannan Babu
> Sent: Tuesday, September 16, 2014 11:06 PM
> To: Dumitrescu, Cristian; Olivier MATZ; Richardson, Bruce; dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the
> mbuf metadata
> 
> I completely agree with Cristian here. Instead of leaving it to applications to
> decide where to place their metadata, we can provide guidance by having this
> field about the placement of application metadata, while maintaining transparency
> on the contents of the application metadata.
> 
My opinion on this is that this is better served via documentation or a comment in the code. The reason is that this approach is not going to be suitable for all applications. The mbuf headroom being used by the metadata is actually designed to be used for any additional headers to be added to the packet - though other things can obviously be stored in it too. Therefore the amount of metadata that can be stored in it will vary from application to application, as any apps doing e.g. tunnelling will need the headroom for tunnelling headers and may only be able to store a small amount of metadata - potentially none. For larger amounts of metadata - I would feel that anything over 64 bytes or so - I have proposed adding a separate userdata pointer in the mbuf structure so that apps have the option of storing the metadata externally, e.g. pointing to a flow table entry or similar. [Please see the mbuf rework patch set 3 proposal.]
Because of this, I think it's better to put a comment in the code indicating that metadata can go in the headroom, document this properly - including caveats and limitations - in the documentation, and provide an example of doing so - something we already have in the packet framework.
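
As a rough illustration of the headroom option (a sketch only; it assumes the application reserves enough headroom for both this struct and any headers it will prepend, and the struct name is invented):

struct app_meta {		/* application-defined, illustrative */
	uint32_t flow_id;
};

static inline struct app_meta *
headroom_meta(struct rte_mbuf *m)
{
	/* Stash the struct at the very start of the buffer; it survives as
	 * long as prepended headers never reach back to buf_addr. */
	return (struct app_meta *)m->buf_addr;
}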

All that being said, and while I think this is a good patch, I don't feel too strongly about it. I'm happy enough if this particular patch does not get merged in for 1.8, as it's incidental to the overall mbuf changes.

Regards,
/Bruce


> Regards
> Kannan Babu Ramia
> Sr.System Architect
> Communication Storage Infrastructure Group
> DCG
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Dumitrescu, Cristian
> Sent: Wednesday, September 17, 2014 1:37 AM
> To: Olivier MATZ; Richardson, Bruce; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the
> mbuf metadata
> 
> Hi Olivier,
> 
> I agree that your suggested approach for application-dependent metadata
> makes sense; in fact, the two approaches work in exactly the same way (packet
> metadata immediately after the regular mbuf). There is only a subtle difference,
> which relates to defining consistent DPDK usage guidelines.
> 
> 1. Advertising the presence of application-dependent meta-data as a supported
> mechanism. If we explicitly have a metadata zero-size field at the end of the
> mbuf, we basically tell people that adding their own application meta-data at
> the end of the mandatory meta-data (mbuf structure) is a mechanism that DPDK
> allows and supports, and will continue to do so for the foreseeable future. In
> other words, we guarantee that an application doing so will continue to build
> successfully with future releases of DPDK, and we will not introduce changes in
> DPDK that could potentially break this mechanism. It is also a hint to people
> about where to put their application-dependent meta-data.
> 
> 2. Defining a standard base address for the application-dependent metadata
> - There are also libraries in DPDK that work with application dependent meta-
> data, currently these are the Packet Framework libraries: librte_port,
> librte_table, librte_pipeline. Of course, the library does not have the knowledge
> of the application dependent meta-data format, so they treat it as opaque array
> of bytes, with the offset and size of the array given as arguments. In my opinion,
> it is safer (and more elegant) if these libraries (and others) can rely on an mbuf
> API to access the application dependent meta-data (in an opaque way) rather
> than make an assumption about the mbuf (i.e. the location of custom metadata
> relative to the mbuf) that is not clearly supported/defined by the mbuf library.
> - By having this API, we basically say: we define the custom meta-data base
> address (first location where custom metadata _could_ be placed) immediately
> after the mbuf, so libraries and apps accessing custom meta-data should do so
> by using a relative offset from this base rather than each application defining its
> own base: immediately after mbuf, or 128 bytes after mbuf, or 64 bytes before
> the end of the buffer, or other.
> 
> More (minor) comments inline below.
> 
> Thanks,
> Cristian
> 
> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Friday, September 12, 2014 10:02 PM
> To: Dumitrescu, Cristian; Richardson, Bruce; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the
> mbuf metadata
> 
> Hello Cristian,
> 
> > What is the reason to remove this field? Please explain the rationale
> > of removing this field.
> 
> The rationale is explained in
> http://dpdk.org/ml/archives/dev/2014-September/005232.html
> 
> "The format of the metadata is up to the application".
> 
> The type of data the application stores after the mbuf does not have to be defined
> in the mbuf. These macros limit the types of metadata to uint8, uint16, uint32
> and uint64. What should I do if I need a void *, or a struct foo? Should we add a
> macro for each possible type?
> 
> [Cristian] Actually, this is not correct, as macros to access metadata through
> pointers (to void or scalar types) are provided as well. This pointer can be
> converted by the application to the format it defines.
> 
> > We previously agreed we need to provide an easy and standard mechanism
> > for applications to extend the mandatory per buffer metadata (mbuf)
> > with optional application-dependent metadata.
> 
> Defining a structure in the application which does not pollute the rte_mbuf
> structure is "easy and standard(TM)" too.
> 
> [Cristian] I agree, both approaches work the same really, it is just the difference
> in advertising the presence of meta-data as a supported mechanism and defining a
> standard base address for it.
> 
> > This field just provides a clean way for the apps to know where is the
> > end of the mandatory metadata, i.e. the first location in the packet
> > buffer where the app can add its own metadata (of course, the app has
> > to manage the headroom space before the first byte of packet data). A
> > zero-size field is the standard mechanism that DPDK uses extensively
> > in pretty much every library to access memory immediately after a
> > header structure.
> 
> Having the following is clean too:
> 
> struct metadata {
>      ...
> };
> 
> struct app_mbuf {
>      struct rte_mbuf mbuf;
>      struct metadata metadata;
> };
> 
> There is no need to define anything in the mbuf structure.
> 
> [Cristian] I agree, both approaches work the same really, it is just the difference
> in advertising the presence of meta-data as a supported mechanism and defining a
> standard base address for it.
> 
> >
> > The impact of removing this field is that there is no standard way to
> > identify where the end of the mandatory metadata is, so each
> > application will have to reinvent this. With no clear convention, we
> > will end up with a lot of non-standard ways. Every time the format of
> > the mbuf structure is going to be changed, this can potentially break
> > applications that use custom metadata, while using this simple
> > standard mechanism would prevent this. So why remove this?
> 
> Wow. Five occurrences of "standard" so far.
> [Cristian] I am sorry :)
> 
> Could you give a
> reference to the standard you're referring to? :)
> 
> [Cristian] See the IETF Service Function Chaining link below, the environment is
> different (data center pipeline vs. CPU core-level pipeline), but the concepts are
> very similar.
> 
> Our application defines private metadata in mbufs in the way described above,
> and we have never changed that since we started supporting the DPDK. So I don't
> understand your point that the application breaks every time the mbuf is reworked.
> 
> > Having applications define their optional meta-data is a real need.
> 
> Sure. This patch does not prevent that at all. You can continue to do exactly
> the same thing, but in the application concerned, not in the generic mbuf structure.
> 
> 
> > Please take a look at the Service Chaining IETF emerging protocols
> > (https://datatracker.ietf.org/wg/sfc/documents/), which provide
> > standard mechanisms for applications to define their own
> 
> Six :)
> 
> I'm not sure these documents define the way to extend a packet structure with
> metadata in a C program. Again, Bruce's patch does not prevent you from doing
> what you need; it just moves it to the proper place.
> 
> > packet meta-data and share it between the elements of the processing
> > pipeline (for Service Chaining, these are typically virtual machines
> > scattered amongst the data center).
> >
> > And, in my opinion, there is no negative impact/cost associated with
> > keeping this field.
> 
> To summarize what I think:
> 
> - this patch does not prevent you from doing what you want to do
> - removing the macros helps keep the mbuf structure shorter and more
>   comprehensible
> - the previous approach does not scale because it would require
>   a macro per type
> 
> 
> > --------------------------------------------------------------
> > Intel Shannon Limited
> > Registered in Ireland
> > Registered Office: Collinstown Industrial Park, Leixlip, County
> > Kildare Registered Number: 308263 Business address: Dromore House,
> > East Park, Shannon, Co. Clare
> >
> > This e-mail and any attachments may contain confidential material for the sole
> use of the intended recipient(s). Any review or distribution by others is strictly
> prohibited. If you are not the intended recipient, please contact the sender and
> delete all copies.
> 
> This is a public mailing list, this disclaimer is invalid.
> 
> Regards,
> Olivier
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata
  2014-09-17 10:31               ` Richardson, Bruce
@ 2014-09-17 14:01                 ` Thomas Monjalon
  0 siblings, 0 replies; 62+ messages in thread
From: Thomas Monjalon @ 2014-09-17 14:01 UTC (permalink / raw)
  To: Richardson, Bruce; +Cc: dev

2014-09-17 10:31, Richardson, Bruce:
> > From: Ramia, Kannan Babu
> > 
> > I completely agree with Cristian here. Instead of leaving it to applications
> > to decide where to place their metadata, we can provide guidance by having
> > this field about the placement of application metadata, while maintaining
> > transparency on the contents of the application metadata.
> 
> My opinion on this is that this is better served via documentation or a
> comment in the code. The reason is that this approach is not going to be
> suitable for all applications. The mbuf headroom being used by the metadata
> is actually designed to be used for any additional headers to be added to
> the packet - though other things can obviously be stored in it too.
> Therefore the amount of metadata that can be stored in it will vary from
> application to application, as any apps doing e.g. tunnelling will need the
> headroom for tunnelling headers and may only be able to store a small
> amount of metadata - potentially none. For larger amounts of metadata - I
> would feel that anything over 64 bytes or so - I have proposed adding a
> separate userdata pointer in the mbuf structure so that apps have the
> option of storing the metadata externally e.g. pointing to a flow table
> entry or similar. [Please see mbuf rework patch set 3 proposal]. Because of
> this, I think it's better to put a comment in the code indicating that
> metadata can go in the headroom, document this properly - including caveats
> and limitations - in the documentation, and provide an example of doing
> so - something we already have in the packet framework.

I agree that replacing these markers with documentation gives more accurate
information and (bonus) is simpler.
When the documentation is embedded in the git repository, I'd like to see
a patch documenting these mbuf usages.

> All that being said, and while I think this is a good patch, I don't feel
> too strongly about it. I'm happy enough if this particular patch does not
> get merged in for 1.8, as it's incidental to the overall mbuf changes.

I also think it's a good patch, so I am keeping it.

-- 
Thomas

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2
  2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
                     ` (12 preceding siblings ...)
  2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 13/13] ixgbe: Improve slow-path perf: vector scattered RX Bruce Richardson
@ 2014-09-17 22:35   ` Thomas Monjalon
  13 siblings, 0 replies; 62+ messages in thread
From: Thomas Monjalon @ 2014-09-17 22:35 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

2014-09-11 14:15, Bruce Richardson:
> This patch set continues on from the changes in part 1, and depends
> upon that patch set.
> 
> This patch set reorders the fields in the mbuf structure and splits
> the structure across two cache lines, giving lots of new space for new
> fields to be added. This set uses some of that space by expanding the
> ol_flags field. A part 3 patchset is planned to introduce some other
> new fields into the new mbuf structure.
> 
> With the splitting of the mbuf across multiple cache lines, performance
> degradations are seen inside the drivers, on both the fast path and the
> slow path. For the fast path, this patchset reworks the way in which the
> pool pointer is used to free packets post-TX, which removes the perf
> regression. For the slow path, an alternative approach is taken - a new
> scattered-packets RX function is introduced into the vector PMD. Using
> this function, slow-path RX-TX throughput using testpmd is increased by
> up to 20% over the original baseline.
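> 
> To make the fast-path idea concrete, a minimal sketch (a hypothetical
> helper, not the actual driver code) of freeing a burst of transmitted
> mbufs while reading the pool pointer only once - valid under the vector
> PMD's assumptions that all mbufs in the ring come from one mempool and
> have a reference count of 1:
> 
>     #include <rte_mbuf.h>
>     #include <rte_mempool.h>
> 
>     #define TX_FREE_BURST 32
> 
>     static inline void
>     tx_free_burst(struct rte_mbuf **txep, unsigned int n)
>     {
>             void *to_free[TX_FREE_BURST];
>             /* read the pool pointer once per burst, rather than
>              * dereferencing it in every mbuf freed */
>             struct rte_mempool *pool = txep[0]->pool;
>             unsigned int i;
> 
>             for (i = 0; i < n && i < TX_FREE_BURST; i++)
>                     to_free[i] = txep[i];
>             rte_mempool_put_bulk(pool, to_free, i);
>     }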
> 
> Changes in V2:
> * General minor updates following comments on the V1 set
> * Updated a number of patches to include KNI mbuf changes where needed
> * Deferred the patch to add the packet_type field to a future patch set
> * After removing the meta-data element from the mbuf structure, also added
>   a patch to move the macros to rte_port/rte_port.h
> 
> Bruce Richardson (11):
>   mbuf: reorder fields by time of use
>   mbuf: expand ol_flags field to 64-bits
>   mbuf: introduce a flag to indicate a control mbuf
>   mbuf: minor changes for readability
>   mbuf: use macros only to access the mbuf metadata
>   mbuf: add named points inside the mbuf structure
>   ixgbe: rework vector pmd following mbuf changes
>   mbuf: split mbuf across two cache lines.
>   mbuf: move l2_len and l3_len to second cache line
>   ixgbe: Fix perf regression due to moved pool ptr
>   ixgbe: Improve slow-path perf: vector scattered RX
> 
> Olivier Matz (1):
>   mbuf: replace data pointer by an offset

Applied for version 1.8.0.

Note that the ixgbe patches have not been reviewed, but I am making an
exception here, because Intel is responsible for ixgbe performance and we
have some time to fix it (if needed) before the release.

-- 
Thomas

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread

Thread overview: 62+ messages
2014-09-03 15:49 [dpdk-dev] [PATCH 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
2014-09-03 15:49 ` [dpdk-dev] [PATCH 01/13] mbuf: replace data pointer by an offset Bruce Richardson
2014-09-08  9:52   ` Olivier MATZ
2014-09-08  9:55     ` Olivier MATZ
2014-09-03 15:49 ` [dpdk-dev] [PATCH 02/13] mbuf: reorder fields by time of use Bruce Richardson
2014-09-08 10:17   ` Olivier MATZ
2014-09-03 15:49 ` [dpdk-dev] [PATCH 03/13] mbuf: add packet_type field Bruce Richardson
2014-09-08 10:17   ` Olivier MATZ
2014-09-08 10:33     ` Yerden Zhumabekov
2014-09-08 11:17       ` Olivier MATZ
2014-09-09  3:59         ` Zhang, Helin
     [not found]           ` <540EB428.9060706@6wind.com>
2014-09-09  8:45             ` Zhang, Helin
2014-09-09  9:47             ` Richardson, Bruce
2014-09-09 15:05         ` Jim Thompson
2014-09-03 15:49 ` [dpdk-dev] [PATCH 04/13] mbuf: expand ol_flags field to 64-bits Bruce Richardson
2014-09-08 10:25   ` Olivier MATZ
2014-09-09  9:00     ` Richardson, Bruce
2014-09-03 15:49 ` [dpdk-dev] [PATCH 05/13] mbuf: introduce a flag to indicate a control mbuf Bruce Richardson
2014-09-08 11:53   ` Olivier MATZ
2014-09-03 15:49 ` [dpdk-dev] [PATCH 06/13] mbuf: minor changes for readability Bruce Richardson
2014-09-08 12:03   ` Olivier MATZ
2014-09-03 15:49 ` [dpdk-dev] [PATCH 07/13] mbuf: use macros only to access the mbuf metadata Bruce Richardson
2014-09-08 12:05   ` Olivier MATZ
2014-09-09  9:01     ` Richardson, Bruce
2014-09-12 16:56       ` Dumitrescu, Cristian
2014-09-12 21:02         ` Olivier MATZ
2014-09-16 20:07           ` Dumitrescu, Cristian
2014-09-16 22:06             ` Ramia, Kannan Babu
2014-09-17 10:31               ` Richardson, Bruce
2014-09-17 14:01                 ` Thomas Monjalon
2014-09-10 15:09     ` Bruce Richardson
2014-09-10 15:31       ` Olivier MATZ
2014-09-03 15:49 ` [dpdk-dev] [PATCH 08/13] mbuf: add named points inside the mbuf structure Bruce Richardson
2014-09-08 12:08   ` Olivier MATZ
2014-09-03 15:49 ` [dpdk-dev] [PATCH 09/13] ixgbe: rework vector pmd following mbuf changes Bruce Richardson
2014-09-03 15:49 ` [dpdk-dev] [PATCH 10/13] mbuf: split mbuf across two cache lines Bruce Richardson
2014-09-08 12:10   ` Olivier MATZ
2014-09-03 15:49 ` [dpdk-dev] [PATCH 11/13] mbuf: move l2_len and l3_len to second cache line Bruce Richardson
2014-09-04  5:08   ` Yerden Zhumabekov
2014-09-04 10:27     ` Bruce Richardson
2014-09-04 11:00       ` Yerden Zhumabekov
2014-09-04 11:55         ` Bruce Richardson
2014-09-03 15:49 ` [dpdk-dev] [PATCH 12/13] ixgbe: Fix perf regression due to moved pool ptr Bruce Richardson
2014-09-03 15:49 ` [dpdk-dev] [PATCH 13/13] ixgbe: Improve slow-path perf: vector scattered RX Bruce Richardson
2014-09-11 13:15 ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Bruce Richardson
2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 01/13] mbuf: replace data pointer by an offset Bruce Richardson
2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 02/13] mbuf: reorder fields by time of use Bruce Richardson
2014-09-15  7:11     ` Liu, Jijiang
2014-09-15  8:19       ` Richardson, Bruce
2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 03/13] mbuf: expand ol_flags field to 64-bits Bruce Richardson
2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 04/13] mbuf: introduce a flag to indicate a control mbuf Bruce Richardson
2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 05/13] mbuf: minor changes for readability Bruce Richardson
2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 06/13] mbuf: use macros only to access the mbuf metadata Bruce Richardson
2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 07/13] mbuf: move metadata macros to rte_port library Bruce Richardson
2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 08/13] mbuf: add named points inside the mbuf structure Bruce Richardson
2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 09/13] ixgbe: rework vector pmd following mbuf changes Bruce Richardson
2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 10/13] mbuf: split mbuf across two cache lines Bruce Richardson
2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 11/13] mbuf: move l2_len and l3_len to second cache line Bruce Richardson
2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 12/13] ixgbe: Fix perf regression due to moved pool ptr Bruce Richardson
2014-09-15 16:20     ` [dpdk-dev] [PATCH v3 " Bruce Richardson
2014-09-11 13:15   ` [dpdk-dev] [PATCH v2 13/13] ixgbe: Improve slow-path perf: vector scattered RX Bruce Richardson
2014-09-17 22:35   ` [dpdk-dev] [PATCH v2 00/13] Mbuf Structure Rework, part 2 Thomas Monjalon
