DPDK patches and discussions
* [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support
@ 2014-05-09 14:50 Olivier Matz
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 01/11] igb/ixgbe: fix IP checksum calculation Olivier Matz
                   ` (12 more replies)
  0 siblings, 13 replies; 51+ messages in thread
From: Olivier Matz @ 2014-05-09 14:50 UTC (permalink / raw)
  To: dev

This series adds TSO support to the ixgbe DPDK driver. As discussed
previously on the list [1], one problem is that there is not enough room
in rte_mbuf today to store the information required to implement this
feature:
  - a new ol_flag
  - the MSS
  - the L4 header len

One solution would be to increase the size of the mbuf to 2 cache lines,
but that could hurt performance. This series instead proposes some
rework to reduce the size of the rte_mbuf structure before implementing
TSO, which avoids growing the mbuf to 128 bytes.

After the rework of the mbuf structures, the size of the rte_mbuf
structure is reduced by 9 bytes. Implementing TSO requires doubling the
size of ol_flags (16 to 32 bits) and doubling the size of the offload
information to add the MSS and the L4 header length (32 to 64 bits). At
the end of the whole series, sizeof(rte_mbuf) is still 64 bytes and
4 bytes are available for future use.

This rework changes the mbuf structure substantially, which requires
changes in applications that access the mbuf structure fields directly
instead of using the API functions (sometimes there is no function).
That is why this series is an RFC. In my opinion, this is the right
moment for this change, as the 1.7.0 window is open.

About TSO, the new fields in mbuf try to be generic enough to apply to
other hardware in the future. To delegate the TCP segmentation to the
hardware, the user has to:

  - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
    PKT_TX_IP_CKSUM and PKT_TX_TCP_CKSUM)
  - fill the mbuf->hw_offload information: l2_len, l3_len, l4_len, mss
  - calculate the pseudo header checksum and set it in the TCP header,
    as required when doing hardware TCP checksum offload
  - set the IP checksum to 0
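
The four steps above can be sketched as follows. This is only an
illustration under stated assumptions: the sketch_* names, the flag
value and the struct layout are hypothetical stand-ins, not the
definitions added by this series. The pseudo-header sum is stored
without the final inversion, which is my understanding of what
checksum-offload hardware usually expects. Whether the L4 length is
part of that sum for segmented packets depends on the NIC, so the sketch
takes it as a parameter rather than deciding.

```c
#include <stdint.h>

/* Hypothetical stand-in for the new flag; the value is illustrative. */
#define SKETCH_TX_TCP_SEG (1u << 2)

/* Hypothetical layout for the offload information named in the text. */
struct sketch_hw_offload {
	uint16_t l2_len; /* Ethernet (+ VLAN) header length */
	uint16_t l3_len; /* IP header length */
	uint16_t l4_len; /* TCP header length, options included */
	uint16_t mss;    /* TCP payload bytes per generated segment */
};

struct sketch_mbuf {
	uint32_t ol_flags;
	struct sketch_hw_offload hw_offload;
};

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t sketch_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffffu) + (sum >> 16);
	return (uint16_t)sum;
}

/* IPv4/TCP pseudo-header sum (not inverted), read from network-order bytes. */
static uint16_t sketch_ipv4_phdr_sum(const uint8_t *src, const uint8_t *dst,
				     uint16_t l4_len_term)
{
	uint32_t sum = 0;

	sum += ((uint32_t)src[0] << 8) | src[1];
	sum += ((uint32_t)src[2] << 8) | src[3];
	sum += ((uint32_t)dst[0] << 8) | dst[1];
	sum += ((uint32_t)dst[2] << 8) | dst[3];
	sum += 6; /* IP protocol number of TCP */
	sum += l4_len_term;
	return sketch_fold16(sum);
}

/* Apply the four steps to a packet whose IPv4 and TCP headers are given. */
static void sketch_prepare_tso(struct sketch_mbuf *m, uint8_t *ip4_hdr,
			       uint8_t *tcp_hdr, uint16_t l2_len,
			       uint16_t mss, uint16_t l4_len_term)
{
	uint16_t psum;

	/* 1. request segmentation (implies IP and TCP checksum offload) */
	m->ol_flags |= SKETCH_TX_TCP_SEG;

	/* 2. fill the offload information from the headers */
	m->hw_offload.l2_len = l2_len;
	m->hw_offload.l3_len = (uint16_t)((ip4_hdr[0] & 0x0f) * 4);
	m->hw_offload.l4_len = (uint16_t)((tcp_hdr[12] >> 4) * 4);
	m->hw_offload.mss = mss;

	/* 3. store the pseudo-header sum in the TCP checksum field */
	psum = sketch_ipv4_phdr_sum(ip4_hdr + 12, ip4_hdr + 16, l4_len_term);
	tcp_hdr[16] = (uint8_t)(psum >> 8);
	tcp_hdr[17] = (uint8_t)(psum & 0xff);

	/* 4. clear the IP checksum field */
	ip4_hdr[10] = 0;
	ip4_hdr[11] = 0;
}
```

A caller would invoke sketch_prepare_tso() once per packet before
handing it to the transmit function, with l2_len set to 14 for an
untagged Ethernet frame.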

Compilation of DPDK and examples is tested for the following
targets: x86_64-*-linuxapp-gcc, i686-*-linuxapp-gcc, x86_64-*-bsdapp-gcc

The mbuf rework series is validated with autotests:

  cd dpdk.org/
  make install T=x86_64-default-linuxapp-gcc
  cd x86_64-default-linuxapp-gcc/
  modprobe uio
  insmod kmod/igb_uio.ko
  python ../tools/igb_uio_bind.py -b igb_uio 0000:02:00.0
  echo 0 > /proc/sys/kernel/randomize_va_space
  echo 1000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  echo 1000 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
  mount -t hugetlbfs none /mnt/huge
  make test

TSO is validated with IPv4 and IPv6 with testpmd (see the commit log of
last patch for details).

Performance non-regression has been tested with the 6WINDGate fast path.

Note: these patches may conflict with patch [2], which is not pushed yet
but will probably be integrated before this series.

[1] http://dpdk.org/ml/archives/dev/2013-October/thread.html#572
[2] http://dpdk.org/ml/archives/dev/2014-April/002166.html


Olivier Matz (11):
  igb/ixgbe: fix IP checksum calculation
  mbuf: rename RTE_MBUF_SCATTER_GATHER into RTE_MBUF_REFCNT
  mbuf: remove rte_ctrlmbuf
  mbuf: remove the rte_pktmbuf structure
  mbuf: merge physaddr and buf_len in a bitfield
  mbuf: replace data pointer by an offset
  mbuf: add functions to get the name of an ol_flag
  mbuf: change ol_flags to 32 bits
  mbuf: rename vlan_macip_len in hw_offload and increase its size
  testpmd: modify source address to validate checksum calculation
  ixgbe/mbuf: add TSO support

 app/test-pmd/cmdline.c                             |  60 ++-
 app/test-pmd/config.c                              |  18 +-
 app/test-pmd/csumonly.c                            |  50 ++-
 app/test-pmd/ieee1588fwd.c                         |   6 +-
 app/test-pmd/macfwd-retry.c                        |   2 +-
 app/test-pmd/macfwd.c                              |   8 +-
 app/test-pmd/rxonly.c                              |  47 +-
 app/test-pmd/testpmd.c                             |  10 +-
 app/test-pmd/testpmd.h                             |  15 +-
 app/test-pmd/txonly.c                              |  47 +-
 app/test/commands.c                                |   2 -
 app/test/test_mbuf.c                               | 100 +----
 app/test/test_sched.c                              |   4 +-
 config/defconfig_i686-default-linuxapp-gcc         |   2 +-
 config/defconfig_i686-default-linuxapp-icc         |   2 +-
 config/defconfig_x86_64-default-bsdapp-gcc         |   2 +-
 config/defconfig_x86_64-default-linuxapp-gcc       |   2 +-
 config/defconfig_x86_64-default-linuxapp-icc       |   2 +-
 doc/doxy-api.conf                                  |   2 +-
 examples/dpdk_qat/crypto.c                         |  22 +-
 examples/dpdk_qat/main.c                           |   2 +-
 examples/exception_path/main.c                     |  11 +-
 examples/ip_reassembly/ipv4_rsmbl.h                |  20 +-
 examples/ip_reassembly/main.c                      |   6 +-
 examples/ipv4_frag/Makefile                        |   4 +-
 examples/ipv4_frag/main.c                          |   4 +-
 examples/ipv4_frag/rte_ipv4_frag.h                 |  42 +-
 examples/ipv4_multicast/Makefile                   |   4 +-
 examples/ipv4_multicast/main.c                     |  16 +-
 examples/l3fwd-power/main.c                        |   2 +-
 examples/l3fwd-vf/main.c                           |   2 +-
 examples/l3fwd/main.c                              |  10 +-
 examples/load_balancer/runtime.c                   |   2 +-
 .../client_server_mp/mp_client/client.c            |   2 +-
 examples/quota_watermark/qw/main.c                 |   4 +-
 examples/vhost/main.c                              |  27 +-
 examples/vhost_xen/main.c                          |  22 +-
 .../bsdapp/eal/include/exec-env/rte_kni_common.h   |   2 +-
 .../linuxapp/eal/include/exec-env/rte_kni_common.h |   2 +-
 lib/librte_mbuf/rte_mbuf.c                         |  91 ++--
 lib/librte_mbuf/rte_mbuf.h                         | 476 +++++++++------------
 lib/librte_pmd_e1000/em_rxtx.c                     | 143 ++++---
 lib/librte_pmd_e1000/igb_rxtx.c                    | 192 +++++----
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c                  | 381 ++++++++++-------
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h                  |   8 +-
 lib/librte_pmd_pcap/rte_eth_pcap.c                 |  14 +-
 lib/librte_pmd_virtio/virtio_rxtx.c                |  18 +-
 lib/librte_pmd_virtio/virtqueue.h                  |   7 +-
 lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c              |  29 +-
 lib/librte_pmd_xenvirt/rte_eth_xenvirt.c           |  14 +-
 lib/librte_pmd_xenvirt/virtqueue.h                 |   4 +-
 lib/librte_sched/rte_sched.c                       |  14 +-
 lib/librte_sched/rte_sched.h                       |  10 +-
 53 files changed, 986 insertions(+), 1002 deletions(-)

-- 
1.9.2


* [dpdk-dev] [PATCH RFC 01/11] igb/ixgbe: fix IP checksum calculation
  2014-05-09 14:50 [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support Olivier Matz
@ 2014-05-09 14:50 ` Olivier Matz
  2014-05-15 10:40   ` Ananyev, Konstantin
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 02/11] mbuf: rename RTE_MBUF_SCATTER_GATHER into RTE_MBUF_REFCNT Olivier Matz
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Olivier Matz @ 2014-05-09 14:50 UTC (permalink / raw)
  To: dev

According to the Intel® 82599 10 GbE Controller Datasheet (Table 7-38),
both L2 and L3 lengths are needed to offload the IP checksum.

Note that the e1000 driver does not need to be patched as it already
contains the fix.
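
To illustrate why the comparison mask matters, here is a minimal sketch
of a context-cache check. The sketch_* names are hypothetical, and the
mask values and the bit layout (VLAN tag in bits 31:16, L2 length in
bits 15:9, L3 length in bits 8:0) are assumptions modeled on the driver,
not copied from it. With a mask that covers only the L2 length, two
packets that differ only in IP header length look identical, so a stale
context could be reused.

```c
#include <stdint.h>

/* Assumed mask values; treat them as illustrative. */
#define SKETCH_MAC_LEN_CMP_MASK   0x0000FE00u /* L2 length only */
#define SKETCH_MACIP_LEN_CMP_MASK 0x0000FFFFu /* L2 + L3 lengths */

/* Pack the lengths into one 32-bit word using the assumed layout. */
static uint32_t sketch_pack_lens(uint16_t vlan_tci, uint8_t l2_len,
				 uint16_t l3_len)
{
	return ((uint32_t)vlan_tci << 16) |
	       ((uint32_t)(l2_len & 0x7fu) << 9) |
	       (uint32_t)(l3_len & 0x1ffu);
}

/* Return 1 if the cached context is considered reusable for the packet. */
static int sketch_ctx_matches(uint32_t cached, uint32_t pkt,
			      uint32_t cmp_mask)
{
	return (cached & cmp_mask) == (pkt & cmp_mask);
}
```

For a cached context built for a 20-byte IP header and a packet with a
24-byte IP header (same 14-byte L2 header), the check passes under the
L2-only mask and fails under the combined mask, which is the behavior
the patch is after.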

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 lib/librte_pmd_e1000/igb_rxtx.c   | 2 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
index 4608595..b3c8149 100644
--- a/lib/librte_pmd_e1000/igb_rxtx.c
+++ b/lib/librte_pmd_e1000/igb_rxtx.c
@@ -233,7 +233,7 @@ igbe_set_xmit_ctx(struct igb_tx_queue* txq,
 
 	if (ol_flags & PKT_TX_IP_CKSUM) {
 		type_tucmd_mlhl = E1000_ADVTXD_TUCMD_IPV4;
-		cmp_mask |= TX_MAC_LEN_CMP_MASK;
+		cmp_mask |= TX_MACIP_LEN_CMP_MASK;
 	}
 
 	/* Specify which HW CTX to upload. */
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 55414b9..4e307c2 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -367,7 +367,7 @@ ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
 
 	if (ol_flags & PKT_TX_IP_CKSUM) {
 		type_tucmd_mlhl = IXGBE_ADVTXD_TUCMD_IPV4;
-		cmp_mask |= TX_MAC_LEN_CMP_MASK;
+		cmp_mask |= TX_MACIP_LEN_CMP_MASK;
 	}
 
 	/* Specify which HW CTX to upload. */
-- 
1.9.2


* [dpdk-dev] [PATCH RFC 02/11] mbuf: rename RTE_MBUF_SCATTER_GATHER into RTE_MBUF_REFCNT
  2014-05-09 14:50 [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support Olivier Matz
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 01/11] igb/ixgbe: fix IP checksum calculation Olivier Matz
@ 2014-05-09 14:50 ` Olivier Matz
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf Olivier Matz
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Olivier Matz @ 2014-05-09 14:50 UTC (permalink / raw)
  To: dev

It seems that RTE_MBUF_SCATTER_GATHER is not the proper name for the
feature it provides. "Scatter gather" means that data is stored using
several buffers. RTE_MBUF_REFCNT seems to be a better name for that
feature as it provides a reference counter for mbufs.

The macro RTE_MBUF_SCATTER_GATHER is poisoned to ensure this
modification is seen by drivers or applications using it.
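
As a small illustration of the poisoning mechanism, using hypothetical
macro names rather than the real ones: once an identifier is poisoned,
GCC rejects any later appearance of it in the translation unit, so code
that still tests the old name fails at compile time instead of silently
compiling the feature out.

```c
/* Hypothetical names standing in for the renamed option. */
#define SKETCH_NEW_NAME 1

/* From here on, any use of SKETCH_OLD_NAME is a compile error with GCC. */
#pragma GCC poison SKETCH_OLD_NAME

/* Code converted to the new name keeps working. */
static int sketch_feature_enabled(void)
{
#ifdef SKETCH_NEW_NAME
	return 1;
#else
	return 0;
#endif
}

/*
 * Code that was not converted would no longer build. For example, a
 * preprocessor test on the old name placed after the pragma is rejected
 * by GCC with an "attempt to use poisoned" error.
 */
```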

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test/test_mbuf.c                         | 16 +++++++-------
 config/defconfig_i686-default-linuxapp-gcc   |  2 +-
 config/defconfig_i686-default-linuxapp-icc   |  2 +-
 config/defconfig_x86_64-default-bsdapp-gcc   |  2 +-
 config/defconfig_x86_64-default-linuxapp-gcc |  2 +-
 config/defconfig_x86_64-default-linuxapp-icc |  2 +-
 doc/doxy-api.conf                            |  2 +-
 examples/ipv4_frag/Makefile                  |  4 ++--
 examples/ipv4_multicast/Makefile             |  4 ++--
 lib/librte_mbuf/rte_mbuf.c                   |  2 +-
 lib/librte_mbuf/rte_mbuf.h                   | 31 +++++++++++++++-------------
 11 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index f443734..fe0f4f6 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -82,7 +82,7 @@
 static struct rte_mempool *pktmbuf_pool = NULL;
 static struct rte_mempool *ctrlmbuf_pool = NULL;
 
-#if defined RTE_MBUF_SCATTER_GATHER  && defined RTE_MBUF_REFCNT_ATOMIC
+#if defined RTE_MBUF_REFCNT  && defined RTE_MBUF_REFCNT_ATOMIC
 
 static struct rte_mempool *refcnt_pool = NULL;
 static struct rte_ring *refcnt_mbuf_ring = NULL;
@@ -365,7 +365,7 @@ fail:
 static int
 testclone_testupdate_testdetach(void)
 {
-#ifndef RTE_MBUF_SCATTER_GATHER
+#ifndef RTE_MBUF_REFCNT
 	return 0;
 #else
 	struct rte_mbuf *mc = NULL;
@@ -406,7 +406,7 @@ fail:
 	if (mc)
 		rte_pktmbuf_free(mc);
 	return -1;
-#endif /* RTE_MBUF_SCATTER_GATHER */
+#endif /* RTE_MBUF_REFCNT */
 }
 #undef GOTO_FAIL
 
@@ -439,7 +439,7 @@ test_pktmbuf_pool(void)
 		printf("Error pool not empty");
 		ret = -1;
 	}
-#ifdef RTE_MBUF_SCATTER_GATHER
+#ifdef RTE_MBUF_REFCNT
 	extra = rte_pktmbuf_clone(m[0], pktmbuf_pool);
 	if(extra != NULL) {
 		printf("Error pool not empty");
@@ -548,11 +548,11 @@ test_pktmbuf_free_segment(void)
 /*
  * Stress test for rte_mbuf atomic refcnt.
  * Implies that:
- * RTE_MBUF_SCATTER_GATHER and RTE_MBUF_REFCNT_ATOMIC are both defined.
+ * RTE_MBUF_REFCNT and RTE_MBUF_REFCNT_ATOMIC are both defined.
  * For more efficency, recomended to run with RTE_LIBRTE_MBUF_DEBUG defined.
  */
 
-#if defined RTE_MBUF_SCATTER_GATHER  && defined RTE_MBUF_REFCNT_ATOMIC
+#if defined RTE_MBUF_REFCNT  && defined RTE_MBUF_REFCNT_ATOMIC
 
 static int
 test_refcnt_slave(__attribute__((unused)) void *arg)
@@ -657,7 +657,7 @@ test_refcnt_master(void)
 static int
 test_refcnt_mbuf(void)
 {
-#if defined RTE_MBUF_SCATTER_GATHER  && defined RTE_MBUF_REFCNT_ATOMIC
+#if defined RTE_MBUF_REFCNT  && defined RTE_MBUF_REFCNT_ATOMIC
 
 	unsigned lnum, master, slave, tref;
 
@@ -808,7 +808,7 @@ test_failing_mbuf_sanity_check(void)
 		return -1;
 	}
 
-#ifdef RTE_MBUF_SCATTER_GATHER
+#ifdef RTE_MBUF_REFCNT
 	badbuf = *buf;
 	badbuf.refcnt = 0;
 	if (verify_mbuf_check_panics(&badbuf)) {
diff --git a/config/defconfig_i686-default-linuxapp-gcc b/config/defconfig_i686-default-linuxapp-gcc
index 14bd3d1..dd0f0d0 100644
--- a/config/defconfig_i686-default-linuxapp-gcc
+++ b/config/defconfig_i686-default-linuxapp-gcc
@@ -235,7 +235,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_MBUF=y
 CONFIG_RTE_LIBRTE_MBUF_DEBUG=n
-CONFIG_RTE_MBUF_SCATTER_GATHER=y
+CONFIG_RTE_MBUF_REFCNT=y
 CONFIG_RTE_MBUF_REFCNT_ATOMIC=y
 CONFIG_RTE_PKTMBUF_HEADROOM=128
 
diff --git a/config/defconfig_i686-default-linuxapp-icc b/config/defconfig_i686-default-linuxapp-icc
index ec3386e..ef11051 100644
--- a/config/defconfig_i686-default-linuxapp-icc
+++ b/config/defconfig_i686-default-linuxapp-icc
@@ -234,7 +234,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_MBUF=y
 CONFIG_RTE_LIBRTE_MBUF_DEBUG=n
-CONFIG_RTE_MBUF_SCATTER_GATHER=y
+CONFIG_RTE_MBUF_REFCNT=y
 CONFIG_RTE_MBUF_REFCNT_ATOMIC=y
 CONFIG_RTE_PKTMBUF_HEADROOM=128
 
diff --git a/config/defconfig_x86_64-default-bsdapp-gcc b/config/defconfig_x86_64-default-bsdapp-gcc
index d960e1d..f5f2140 100644
--- a/config/defconfig_x86_64-default-bsdapp-gcc
+++ b/config/defconfig_x86_64-default-bsdapp-gcc
@@ -210,7 +210,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_MBUF=y
 CONFIG_RTE_LIBRTE_MBUF_DEBUG=n
-CONFIG_RTE_MBUF_SCATTER_GATHER=y
+CONFIG_RTE_MBUF_REFCNT=y
 CONFIG_RTE_MBUF_REFCNT_ATOMIC=y
 CONFIG_RTE_PKTMBUF_HEADROOM=128
 
diff --git a/config/defconfig_x86_64-default-linuxapp-gcc b/config/defconfig_x86_64-default-linuxapp-gcc
index f11ffbf..25a7e1a 100644
--- a/config/defconfig_x86_64-default-linuxapp-gcc
+++ b/config/defconfig_x86_64-default-linuxapp-gcc
@@ -237,7 +237,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_MBUF=y
 CONFIG_RTE_LIBRTE_MBUF_DEBUG=n
-CONFIG_RTE_MBUF_SCATTER_GATHER=y
+CONFIG_RTE_MBUF_REFCNT=y
 CONFIG_RTE_MBUF_REFCNT_ATOMIC=y
 CONFIG_RTE_PKTMBUF_HEADROOM=128
 
diff --git a/config/defconfig_x86_64-default-linuxapp-icc b/config/defconfig_x86_64-default-linuxapp-icc
index 4eaca4c..d209874 100644
--- a/config/defconfig_x86_64-default-linuxapp-icc
+++ b/config/defconfig_x86_64-default-linuxapp-icc
@@ -234,7 +234,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
 #
 CONFIG_RTE_LIBRTE_MBUF=y
 CONFIG_RTE_LIBRTE_MBUF_DEBUG=n
-CONFIG_RTE_MBUF_SCATTER_GATHER=y
+CONFIG_RTE_MBUF_REFCNT=y
 CONFIG_RTE_MBUF_REFCNT_ATOMIC=y
 CONFIG_RTE_PKTMBUF_HEADROOM=128
 
diff --git a/doc/doxy-api.conf b/doc/doxy-api.conf
index 642f77a..92e48c3 100644
--- a/doc/doxy-api.conf
+++ b/doc/doxy-api.conf
@@ -49,7 +49,7 @@ FILE_PATTERNS           = rte_*.h \
                           cmdline.h
 PREDEFINED              = __DOXYGEN__ \
                           __attribute__(x)= \
-                          RTE_MBUF_SCATTER_GATHER
+                          RTE_MBUF_REFCNT
 
 OPTIMIZE_OUTPUT_FOR_C   = YES
 ENABLE_PREPROCESSING    = YES
diff --git a/examples/ipv4_frag/Makefile b/examples/ipv4_frag/Makefile
index 97fa452..b964017 100644
--- a/examples/ipv4_frag/Makefile
+++ b/examples/ipv4_frag/Makefile
@@ -39,8 +39,8 @@ RTE_TARGET ?= x86_64-default-linuxapp-gcc
 
 include $(RTE_SDK)/mk/rte.vars.mk
 
-ifneq ($(CONFIG_RTE_MBUF_SCATTER_GATHER),y)
-$(error This application requires RTE_MBUF_SCATTER_GATHER to be enabled)
+ifneq ($(CONFIG_RTE_MBUF_REFCNT),y)
+$(error This application requires RTE_MBUF_REFCNT to be enabled)
 endif
 
 # binary name
diff --git a/examples/ipv4_multicast/Makefile b/examples/ipv4_multicast/Makefile
index f76f8d8..5a9612d 100644
--- a/examples/ipv4_multicast/Makefile
+++ b/examples/ipv4_multicast/Makefile
@@ -39,8 +39,8 @@ RTE_TARGET ?= x86_64-default-linuxapp-gcc
 
 include $(RTE_SDK)/mk/rte.vars.mk
 
-ifneq ($(CONFIG_RTE_MBUF_SCATTER_GATHER),y)
-$(error This application requires RTE_MBUF_SCATTER_GATHER to be enabled)
+ifneq ($(CONFIG_RTE_MBUF_REFCNT),y)
+$(error This application requires RTE_MBUF_REFCNT to be enabled)
 endif
 
 # binary name
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 6129b1a..bffc2c4 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -160,7 +160,7 @@ rte_mbuf_sanity_check(const struct rte_mbuf *m, enum rte_mbuf_type t,
 	if (m->buf_addr == NULL)
 		rte_panic("bad virt addr\n");
 
-#ifdef RTE_MBUF_SCATTER_GATHER
+#ifdef RTE_MBUF_REFCNT
 	uint16_t cnt = rte_mbuf_refcnt_read(m);
 	if ((cnt == 0) || (cnt == UINT16_MAX))
 		rte_panic("bad ref cnt\n");
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index edffc2c..1b1a84e 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -67,6 +67,9 @@
 extern "C" {
 #endif
 
+/* deprecated feature, renamed in RTE_MBUF_REFCNT */
+#pragma GCC poison RTE_MBUF_SCATTER_GATHER
+
 /**
  * A control message buffer.
  */
@@ -177,7 +180,7 @@ struct rte_mbuf {
 	void *buf_addr;           /**< Virtual address of segment buffer. */
 	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
 	uint16_t buf_len;         /**< Length of segment buffer. */
-#ifdef RTE_MBUF_SCATTER_GATHER
+#ifdef RTE_MBUF_REFCNT
 	/**
 	 * 16-bit Reference counter.
 	 * It should only be accessed using the following functions:
@@ -265,7 +268,7 @@ if (!(exp)) {                                                        \
 
 #endif /*  RTE_LIBRTE_MBUF_DEBUG */
 
-#ifdef RTE_MBUF_SCATTER_GATHER
+#ifdef RTE_MBUF_REFCNT
 #ifdef RTE_MBUF_REFCNT_ATOMIC
 
 /**
@@ -347,14 +350,14 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
 		rte_prefetch0(m);               \
 } while (0)
 
-#else /* ! RTE_MBUF_SCATTER_GATHER */
+#else /* ! RTE_MBUF_REFCNT */
 
 /** Mbuf prefetch */
 #define RTE_MBUF_PREFETCH_TO_FREE(m) do { } while(0)
 
 #define rte_mbuf_refcnt_set(m,v) do { } while(0)
 
-#endif /* RTE_MBUF_SCATTER_GATHER */
+#endif /* RTE_MBUF_REFCNT */
 
 
 /**
@@ -393,10 +396,10 @@ static inline struct rte_mbuf *__rte_mbuf_raw_alloc(struct rte_mempool *mp)
 	if (rte_mempool_get(mp, &mb) < 0)
 		return NULL;
 	m = (struct rte_mbuf *)mb;
-#ifdef RTE_MBUF_SCATTER_GATHER
+#ifdef RTE_MBUF_REFCNT
 	RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m) == 0);
 	rte_mbuf_refcnt_set(m, 1);
-#endif /* RTE_MBUF_SCATTER_GATHER */
+#endif /* RTE_MBUF_REFCNT */
 	return (m);
 }
 
@@ -411,9 +414,9 @@ static inline struct rte_mbuf *__rte_mbuf_raw_alloc(struct rte_mempool *mp)
 static inline void __attribute__((always_inline))
 __rte_mbuf_raw_free(struct rte_mbuf *m)
 {
-#ifdef RTE_MBUF_SCATTER_GATHER
+#ifdef RTE_MBUF_REFCNT
 	RTE_MBUF_ASSERT(rte_mbuf_refcnt_read(m) == 0);
-#endif /* RTE_MBUF_SCATTER_GATHER */
+#endif /* RTE_MBUF_REFCNT */
 	rte_mempool_put(m->pool, m);
 }
 
@@ -590,7 +593,7 @@ static inline struct rte_mbuf *rte_pktmbuf_alloc(struct rte_mempool *mp)
 	return (m);
 }
 
-#ifdef RTE_MBUF_SCATTER_GATHER
+#ifdef RTE_MBUF_REFCNT
 
 /**
  * Attach packet mbuf to another packet mbuf.
@@ -658,7 +661,7 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 	m->pkt.data_len = 0;
 }
 
-#endif /* RTE_MBUF_SCATTER_GATHER */
+#endif /* RTE_MBUF_REFCNT */
 
 
 static inline struct rte_mbuf* __attribute__((always_inline))
@@ -666,7 +669,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 0);
 
-#ifdef RTE_MBUF_SCATTER_GATHER
+#ifdef RTE_MBUF_REFCNT
 	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
 			likely (rte_mbuf_refcnt_update(m, -1) == 0)) {
 		struct rte_mbuf *md = RTE_MBUF_FROM_BADDR(m->buf_addr);
@@ -684,7 +687,7 @@ __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 		}
 #endif
 		return(m);
-#ifdef RTE_MBUF_SCATTER_GATHER
+#ifdef RTE_MBUF_REFCNT
 	}
 	return (NULL);
 #endif
@@ -728,7 +731,7 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
 	}
 }
 
-#ifdef RTE_MBUF_SCATTER_GATHER
+#ifdef RTE_MBUF_REFCNT
 
 /**
  * Creates a "clone" of the given packet mbuf.
@@ -804,7 +807,7 @@ static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 	} while ((m = m->pkt.next) != NULL);
 }
 
-#endif /* RTE_MBUF_SCATTER_GATHER */
+#endif /* RTE_MBUF_REFCNT */
 
 /**
  * Get the headroom in a packet mbuf.
-- 
1.9.2


* [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf
  2014-05-09 14:50 [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support Olivier Matz
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 01/11] igb/ixgbe: fix IP checksum calculation Olivier Matz
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 02/11] mbuf: rename RTE_MBUF_SCATTER_GATHER into RTE_MBUF_REFCNT Olivier Matz
@ 2014-05-09 14:50 ` Olivier Matz
  2014-05-25 21:39   ` Gilmore, Walter E
  2014-05-27  0:17   ` Stephen Hemminger
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 04/11] mbuf: remove the rte_pktmbuf structure Olivier Matz
                   ` (9 subsequent siblings)
  12 siblings, 2 replies; 51+ messages in thread
From: Olivier Matz @ 2014-05-09 14:50 UTC (permalink / raw)
  To: dev

The initial role of rte_ctrlmbuf is to carry generic messages (data
pointer + data length), but it is not used by the DPDK or its
applications. Keeping it implies:
  - losing 1 byte in the rte_mbuf structure
  - having some dead code in rte_mbuf.[ch]

This patch removes this feature. Thanks to it, it is now possible to
simplify the rte_mbuf structure by merging the rte_pktmbuf structure
into it. This is done in the next commit.

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test-pmd/cmdline.c                   |   1 -
 app/test-pmd/testpmd.c                   |   2 -
 app/test-pmd/txonly.c                    |   2 +-
 app/test/commands.c                      |   1 -
 app/test/test_mbuf.c                     |  72 +------------
 examples/ipv4_multicast/main.c           |   2 +-
 lib/librte_mbuf/rte_mbuf.c               |  65 +++---------
 lib/librte_mbuf/rte_mbuf.h               | 175 ++++++-------------------------
 lib/librte_pmd_e1000/em_rxtx.c           |   2 +-
 lib/librte_pmd_e1000/igb_rxtx.c          |   2 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c        |   4 +-
 lib/librte_pmd_virtio/virtio_rxtx.c      |   2 +-
 lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c    |   2 +-
 lib/librte_pmd_xenvirt/rte_eth_xenvirt.c |   2 +-
 14 files changed, 54 insertions(+), 280 deletions(-)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 7becedc..e3d1849 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -5010,7 +5010,6 @@ dump_struct_sizes(void)
 #define DUMP_SIZE(t) printf("sizeof(" #t ") = %u\n", (unsigned)sizeof(t));
 	DUMP_SIZE(struct rte_mbuf);
 	DUMP_SIZE(struct rte_pktmbuf);
-	DUMP_SIZE(struct rte_ctrlmbuf);
 	DUMP_SIZE(struct rte_mempool);
 	DUMP_SIZE(struct rte_ring);
 #undef DUMP_SIZE
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 9c56914..76b3823 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -389,13 +389,11 @@ testpmd_mbuf_ctor(struct rte_mempool *mp,
 	mb_ctor_arg = (struct mbuf_ctor_arg *) opaque_arg;
 	mb = (struct rte_mbuf *) raw_mbuf;
 
-	mb->type         = RTE_MBUF_PKT;
 	mb->pool         = mp;
 	mb->buf_addr     = (void *) ((char *)mb + mb_ctor_arg->seg_buf_offset);
 	mb->buf_physaddr = (uint64_t) (rte_mempool_virt2phy(mp, mb) +
 			mb_ctor_arg->seg_buf_offset);
 	mb->buf_len      = mb_ctor_arg->seg_buf_size;
-	mb->type         = RTE_MBUF_PKT;
 	mb->ol_flags     = 0;
 	mb->pkt.data     = (char *) mb->buf_addr + RTE_PKTMBUF_HEADROOM;
 	mb->pkt.nb_segs  = 1;
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index 1cf2574..1f066d0 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -93,7 +93,7 @@ tx_mbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
diff --git a/app/test/commands.c b/app/test/commands.c
index b145036..c69544b 100644
--- a/app/test/commands.c
+++ b/app/test/commands.c
@@ -262,7 +262,6 @@ dump_struct_sizes(void)
 #define DUMP_SIZE(t) printf("sizeof(" #t ") = %u\n", (unsigned)sizeof(t));
 	DUMP_SIZE(struct rte_mbuf);
 	DUMP_SIZE(struct rte_pktmbuf);
-	DUMP_SIZE(struct rte_ctrlmbuf);
 	DUMP_SIZE(struct rte_mempool);
 	DUMP_SIZE(struct rte_ring);
 #undef DUMP_SIZE
diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index fe0f4f6..07b5551 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -80,7 +80,6 @@
 #define MAKE_STRING(x)          # x
 
 static struct rte_mempool *pktmbuf_pool = NULL;
-static struct rte_mempool *ctrlmbuf_pool = NULL;
 
 #if defined RTE_MBUF_REFCNT  && defined RTE_MBUF_REFCNT_ATOMIC
 
@@ -272,8 +271,8 @@ test_one_pktmbuf(void)
 		GOTO_FAIL("Buffer should be continuous");
 	memset(hdr, 0x55, MBUF_TEST_HDR2_LEN);
 
-	rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
-	rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 0);
+	rte_mbuf_sanity_check(m, 1);
+	rte_mbuf_sanity_check(m, 0);
 	rte_pktmbuf_dump(m, 0);
 
 	/* this prepend should fail */
@@ -320,48 +319,6 @@ fail:
 	return -1;
 }
 
-/*
- * test control mbuf
- */
-static int
-test_one_ctrlmbuf(void)
-{
-	struct rte_mbuf *m = NULL;
-	char message[] = "This is a message carried by a ctrlmbuf";
-
-	printf("Test ctrlmbuf API\n");
-
-	/* alloc a mbuf */
-
-	m = rte_ctrlmbuf_alloc(ctrlmbuf_pool);
-	if (m == NULL)
-		GOTO_FAIL("Cannot allocate mbuf");
-	if (rte_ctrlmbuf_len(m) != 0)
-		GOTO_FAIL("Bad length");
-
-	/* set data */
-	rte_ctrlmbuf_data(m) = &message;
-	rte_ctrlmbuf_len(m) = sizeof(message);
-
-	/* read data */
-	if (rte_ctrlmbuf_data(m) != message)
-		GOTO_FAIL("Invalid data pointer");
-	if (rte_ctrlmbuf_len(m) != sizeof(message))
-		GOTO_FAIL("Invalid len");
-
-	rte_mbuf_sanity_check(m, RTE_MBUF_CTRL, 0);
-
-	/* free mbuf */
-	rte_ctrlmbuf_free(m);
-	m = NULL;
-	return 0;
-
-fail:
-	if (m)
-		rte_ctrlmbuf_free(m);
-	return -1;
-}
-
 static int
 testclone_testupdate_testdetach(void)
 {
@@ -744,7 +701,7 @@ verify_mbuf_check_panics(struct rte_mbuf *buf)
 	pid = fork();
 
 	if (pid == 0) {
-		rte_mbuf_sanity_check(buf, RTE_MBUF_PKT, 1); /* should panic */
+		rte_mbuf_sanity_check(buf, 1); /* should panic */
 		exit(0);  /* return normally if it doesn't panic */
 	} else if (pid < 0){
 		printf("Fork Failed\n");
@@ -781,13 +738,6 @@ test_failing_mbuf_sanity_check(void)
 	}
 
 	badbuf = *buf;
-	badbuf.type = (uint8_t)-1;
-	if (verify_mbuf_check_panics(&badbuf)) {
-		printf("Error with bad-type mbuf test\n");
-		return -1;
-	}
-
-	badbuf = *buf;
 	badbuf.pool = NULL;
 	if (verify_mbuf_check_panics(&badbuf)) {
 		printf("Error with bad-pool mbuf test\n");
@@ -889,22 +839,6 @@ test_mbuf(void)
 		return -1;
 	}
 
-	/* create ctrlmbuf pool if it does not exist */
-	if (ctrlmbuf_pool == NULL) {
-		ctrlmbuf_pool =
-			rte_mempool_create("test_ctrlmbuf_pool", NB_MBUF,
-					   sizeof(struct rte_mbuf), 32, 0,
-					   NULL, NULL,
-					   rte_ctrlmbuf_init, NULL,
-					   SOCKET_ID_ANY, 0);
-	}
-
-	/* test control mbuf */
-	if (test_one_ctrlmbuf() < 0) {
-		printf("test_one_ctrlmbuf() failed\n");
-		return -1;
-	}
-
 	/* test free pktmbuf segment one by one */
 	if (test_pktmbuf_free_segment() < 0) {
 		printf("test_pktmbuf_free_segment() failed.\n");
diff --git a/examples/ipv4_multicast/main.c b/examples/ipv4_multicast/main.c
index 3bd37e4..3967d7a 100644
--- a/examples/ipv4_multicast/main.c
+++ b/examples/ipv4_multicast/main.c
@@ -343,7 +343,7 @@ mcast_out_pkt(struct rte_mbuf *pkt, int use_clone)
 
 	hdr->ol_flags = pkt->ol_flags;
 
-	__rte_mbuf_sanity_check(hdr, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(hdr, 1);
 	return (hdr);
 }
 
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index bffc2c4..b2e2f0f 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -60,32 +60,6 @@
 #include <rte_hexdump.h>
 
 /*
- * ctrlmbuf constructor, given as a callback function to
- * rte_mempool_create()
- */
-void
-rte_ctrlmbuf_init(struct rte_mempool *mp,
-		  __attribute__((unused)) void *opaque_arg,
-		  void *_m,
-		  __attribute__((unused)) unsigned i)
-{
-	struct rte_mbuf *m = _m;
-
-	memset(m, 0, mp->elt_size);
-
-	/* start of buffer is just after mbuf structure */
-	m->buf_addr = (char *)m + sizeof(struct rte_mbuf);
-	m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
-			sizeof(struct rte_mbuf);
-	m->buf_len = (uint16_t) (mp->elt_size - sizeof(struct rte_mbuf));
-
-	/* init some constant fields */
-	m->type = RTE_MBUF_CTRL;
-	m->ctrl.data = (char *)m->buf_addr;
-	m->pool = (struct rte_mempool *)mp;
-}
-
-/*
  * pktmbuf pool constructor, given as a callback function to
  * rte_mempool_create()
  */
@@ -133,7 +107,6 @@ rte_pktmbuf_init(struct rte_mempool *mp,
 	m->pkt.data = (char*) m->buf_addr + RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
 
 	/* init some constant fields */
-	m->type = RTE_MBUF_PKT;
 	m->pool = mp;
 	m->pkt.nb_segs = 1;
 	m->pkt.in_port = 0xff;
@@ -141,16 +114,13 @@ rte_pktmbuf_init(struct rte_mempool *mp,
 
 /* do some sanity checks on a mbuf: panic if it fails */
 void
-rte_mbuf_sanity_check(const struct rte_mbuf *m, enum rte_mbuf_type t,
-		      int is_header)
+rte_mbuf_sanity_check(const struct rte_mbuf *m, int is_header)
 {
 	const struct rte_mbuf *m_seg;
 	unsigned nb_segs;
 
 	if (m == NULL)
 		rte_panic("mbuf is NULL\n");
-	if (m->type != (uint8_t)t)
-		rte_panic("bad mbuf type\n");
 
 	/* generic checks */
 	if (m->pool == NULL)
@@ -166,29 +136,18 @@ rte_mbuf_sanity_check(const struct rte_mbuf *m, enum rte_mbuf_type t,
 		rte_panic("bad ref cnt\n");
 #endif
 
-	/* nothing to check for ctrl messages */
-	if (m->type == RTE_MBUF_CTRL)
+	/* nothing to check for sub-segments */
+	if (is_header == 0)
 		return;
 
-	/* check pkt consistency */
-	else if (m->type == RTE_MBUF_PKT) {
-
-		/* nothing to check for sub-segments */
-		if (is_header == 0)
-			return;
-
-		nb_segs = m->pkt.nb_segs;
-		m_seg = m;
-		while (m_seg && nb_segs != 0) {
-			m_seg = m_seg->pkt.next;
-			nb_segs --;
-		}
-		if (nb_segs != 0)
-			rte_panic("bad nb_segs\n");
-		return;
+	nb_segs = m->pkt.nb_segs;
+	m_seg = m;
+	while (m_seg && nb_segs != 0) {
+		m_seg = m_seg->pkt.next;
+		nb_segs --;
 	}
-
-	rte_panic("unknown mbuf type\n");
+	if (nb_segs != 0)
+		rte_panic("bad nb_segs\n");
 }
 
 /* dump a mbuf on console */
@@ -198,7 +157,7 @@ rte_pktmbuf_dump(const struct rte_mbuf *m, unsigned dump_len)
 	unsigned int len;
 	unsigned nb_segs;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	printf("dump mbuf at 0x%p, phys=%"PRIx64", buf_len=%u\n",
 	       m, (uint64_t)m->buf_physaddr, (unsigned)m->buf_len);
@@ -208,7 +167,7 @@ rte_pktmbuf_dump(const struct rte_mbuf *m, unsigned dump_len)
 	nb_segs = m->pkt.nb_segs;
 
 	while (m && nb_segs != 0) {
-		__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 0);
+		__rte_mbuf_sanity_check(m, 0);
 
 		printf("  segment at 0x%p, data=0x%p, data_len=%u\n",
 		       m, m->pkt.data, (unsigned)m->pkt.data_len);
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 1b1a84e..22e1ac1 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -43,18 +43,13 @@
  * buffers. The message buffers are stored in a mempool, using the
  * RTE mempool library.
  *
- * This library provide an API to allocate/free mbufs, manipulate
- * control message buffer (ctrlmbuf), which are generic message
- * buffers, and packet buffers (pktmbuf), which are used to carry
- * network packets.
+ * This library provides an API to allocate/free packet mbufs, which are
+ * used to carry network packets.
  *
  * To understand the concepts of packet buffers or mbufs, you
  * should read "TCP/IP Illustrated, Volume 2: The Implementation,
  * Addison-Wesley, 1995, ISBN 0-201-63354-X from Richard Stevens"
  * http://www.kohala.com/start/tcpipiv2.html
- *
- * The main modification of this implementation is the use of mbuf for
- * transports other than packets. mbufs can have other types.
  */
 
 #include <stdint.h>
@@ -70,15 +65,6 @@ extern "C" {
 /* deprecated feature, renamed in RTE_MBUF_REFCNT */
 #pragma GCC poison RTE_MBUF_SCATTER_GATHER
 
-/**
- * A control message buffer.
- */
-struct rte_ctrlmbuf {
-	void *data;        /**< Pointer to data. */
-	uint32_t data_len; /**< Length of data. */
-};
-
-
 /*
  * Packet Offload Features Flags. It also carry packet type information.
  * Critical resources. Both rx/tx shared these bits. Be cautious on any change
@@ -165,15 +151,7 @@ struct rte_pktmbuf {
 };
 
 /**
- * This enum indicates the mbuf type.
- */
-enum rte_mbuf_type {
-	RTE_MBUF_CTRL,  /**< Control mbuf. */
-	RTE_MBUF_PKT,   /**< Packet mbuf. */
-};
-
-/**
- * The generic rte_mbuf, containing a packet mbuf or a control mbuf.
+ * The generic rte_mbuf, containing a packet mbuf.
  */
 struct rte_mbuf {
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
@@ -196,14 +174,10 @@ struct rte_mbuf {
 #else
 	uint16_t refcnt_reserved;     /**< Do not use this field */
 #endif
-	uint8_t type;                 /**< Type of mbuf. */
-	uint8_t reserved;             /**< Unused field. Required for padding. */
+	uint16_t reserved;            /**< Unused field. Required for padding. */
 	uint16_t ol_flags;            /**< Offload features. */
 
-	union {
-		struct rte_ctrlmbuf ctrl;
-		struct rte_pktmbuf pkt;
-	};
+	struct rte_pktmbuf pkt;
 } __rte_cache_aligned;
 
 /**
@@ -241,12 +215,12 @@ struct rte_pktmbuf_pool_private {
 #ifdef RTE_LIBRTE_MBUF_DEBUG
 
 /**  check mbuf type in debug mode */
-#define __rte_mbuf_sanity_check(m, t, is_h) rte_mbuf_sanity_check(m, t, is_h)
+#define __rte_mbuf_sanity_check(m, is_h) rte_mbuf_sanity_check(m, is_h)
 
 /**  check mbuf type in debug mode if mbuf pointer is not null */
-#define __rte_mbuf_sanity_check_raw(m, t, is_h)	do {       \
+#define __rte_mbuf_sanity_check_raw(m, is_h)	do {       \
 	if ((m) != NULL)                                   \
-		rte_mbuf_sanity_check(m, t, is_h);          \
+		rte_mbuf_sanity_check(m, is_h);          \
 } while (0)
 
 /**  MBUF asserts in debug mode */
@@ -258,10 +232,10 @@ if (!(exp)) {                                                        \
 #else /*  RTE_LIBRTE_MBUF_DEBUG */
 
 /**  check mbuf type in debug mode */
-#define __rte_mbuf_sanity_check(m, t, is_h) do { } while(0)
+#define __rte_mbuf_sanity_check(m, is_h) do { } while(0)
 
 /**  check mbuf type in debug mode if mbuf pointer is not null */
-#define __rte_mbuf_sanity_check_raw(m, t, is_h) do { } while(0)
+#define __rte_mbuf_sanity_check_raw(m, is_h) do { } while(0)
 
 /**  MBUF asserts in debug mode */
 #define RTE_MBUF_ASSERT(exp)                do { } while(0)
@@ -368,20 +342,17 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
  *
  * @param m
  *   The mbuf to be checked.
- * @param t
- *   The expected type of the mbuf.
  * @param is_header
  *   True if the mbuf is a packet header, false if it is a sub-segment
  *   of a packet (in this case, some fields like nb_segs are not checked)
  */
 void
-rte_mbuf_sanity_check(const struct rte_mbuf *m, enum rte_mbuf_type t,
-		      int is_header);
+rte_mbuf_sanity_check(const struct rte_mbuf *m, int is_header);
 
 /**
  * @internal Allocate a new mbuf from mempool *mp*.
  * The use of that function is reserved for RTE internal needs.
- * Please use either rte_ctrlmbuf_alloc() or rte_pktmbuf_alloc().
+ * Please use rte_pktmbuf_alloc().
  *
  * @param mp
  *   The mempool from which mbuf is allocated.
@@ -406,7 +377,7 @@ static inline struct rte_mbuf *__rte_mbuf_raw_alloc(struct rte_mempool *mp)
 /**
  * @internal Put mbuf back into its original mempool.
  * The use of that function is reserved for RTE internal needs.
- * Please use either rte_ctrlmbuf_free() or rte_pktmbuf_free().
+ * Please use rte_pktmbuf_free().
  *
  * @param m
  *   The mbuf to be freed.
@@ -420,95 +391,11 @@ __rte_mbuf_raw_free(struct rte_mbuf *m)
 	rte_mempool_put(m->pool, m);
 }
 
-/* Operations on ctrl mbuf */
-
-/**
- * The control mbuf constructor.
- *
- * This function initializes some fields in an mbuf structure that are
- * not modified by the user once created (mbuf type, origin pool, buffer
- * start address, and so on). This function is given as a callback function
- * to rte_mempool_create() at pool creation time.
- *
- * @param mp
- *   The mempool from which the mbuf is allocated.
- * @param opaque_arg
- *   A pointer that can be used by the user to retrieve useful information
- *   for mbuf initialization. This pointer comes from the ``init_arg``
- *   parameter of rte_mempool_create().
- * @param m
- *   The mbuf to initialize.
- * @param i
- *   The index of the mbuf in the pool table.
- */
-void rte_ctrlmbuf_init(struct rte_mempool *mp, void *opaque_arg,
-		       void *m, unsigned i);
-
-/**
- * Allocate a new mbuf (type is ctrl) from mempool *mp*.
- *
- * This new mbuf is initialized with data pointing to the beginning of
- * buffer, and with a length of zero.
- *
- * @param mp
- *   The mempool from which the mbuf is allocated.
- * @return
- *   - The pointer to the new mbuf on success.
- *   - NULL if allocation failed.
- */
-static inline struct rte_mbuf *rte_ctrlmbuf_alloc(struct rte_mempool *mp)
-{
-	struct rte_mbuf *m;
-	if ((m = __rte_mbuf_raw_alloc(mp)) != NULL) {
-		m->ctrl.data = m->buf_addr;
-		m->ctrl.data_len = 0;
-		__rte_mbuf_sanity_check(m, RTE_MBUF_CTRL, 0);
-	}
-	return (m);
-}
-
-/**
- * Free a control mbuf back into its original mempool.
- *
- * @param m
- *   The control mbuf to be freed.
- */
-static inline void rte_ctrlmbuf_free(struct rte_mbuf *m)
-{
-	__rte_mbuf_sanity_check(m, RTE_MBUF_CTRL, 0);
-#ifdef RTE_MBUF_SCATTER_GATHER
-	if (rte_mbuf_refcnt_update(m, -1) == 0)
-#endif /* RTE_MBUF_SCATTER_GATHER */
-		__rte_mbuf_raw_free(m);
-}
-
-/**
- * A macro that returns the pointer to the carried data.
- *
- * The value that can be read or assigned.
- *
- * @param m
- *   The control mbuf.
- */
-#define rte_ctrlmbuf_data(m) ((m)->ctrl.data)
-
-/**
- * A macro that returns the length of the carried data.
- *
- * The value that can be read or assigned.
- *
- * @param m
- *   The control mbuf.
- */
-#define rte_ctrlmbuf_len(m) ((m)->ctrl.data_len)
-
-/* Operations on pkt mbuf */
-
 /**
  * The packet mbuf constructor.
  *
- * This function initializes some fields in the mbuf structure that are not
- * modified by the user once created (mbuf type, origin pool, buffer start
+ * This function initializes some fields in the mbuf structure that are
+ * not modified by the user once created (origin pool, buffer start
  * address, and so on). This function is given as a callback function to
  * rte_mempool_create() at pool creation time.
  *
@@ -569,11 +456,11 @@ static inline void rte_pktmbuf_reset(struct rte_mbuf *m)
 	m->pkt.data = (char*) m->buf_addr + buf_ofs;
 
 	m->pkt.data_len = 0;
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 }
 
 /**
- * Allocate a new mbuf (type is pkt) from a mempool.
+ * Allocate a new mbuf from a mempool.
  *
  * This new mbuf contains one segment, which has a length of 0. The pointer
  * to data is initialized to have some bytes of headroom in the buffer
@@ -629,8 +516,8 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *md)
 	mi->pkt.nb_segs = 1;
 	mi->ol_flags = md->ol_flags;
 
-	__rte_mbuf_sanity_check(mi, RTE_MBUF_PKT, 1);
-	__rte_mbuf_sanity_check(md, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check(mi, 1);
+	__rte_mbuf_sanity_check(md, 0);
 }
 
 /**
@@ -667,7 +554,7 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 static inline struct rte_mbuf* __attribute__((always_inline))
 __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check(m, 0);
 
 #ifdef RTE_MBUF_REFCNT
 	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
@@ -722,7 +609,7 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
 {
 	struct rte_mbuf *m_next;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	while (m != NULL) {
 		m_next = m->pkt.next;
@@ -783,7 +670,7 @@ static inline struct rte_mbuf *rte_pktmbuf_clone(struct rte_mbuf *md,
 		return (NULL);
 	}
 
-	__rte_mbuf_sanity_check(mc, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(mc, 1);
 	return (mc);
 }
 
@@ -800,7 +687,7 @@ static inline struct rte_mbuf *rte_pktmbuf_clone(struct rte_mbuf *md,
  */
 static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	do {
 		rte_mbuf_refcnt_update(m, v);
@@ -819,7 +706,7 @@ static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
  */
 static inline uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 	return (uint16_t) ((char*) m->pkt.data - (char*) m->buf_addr);
 }
 
@@ -833,7 +720,7 @@ static inline uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m)
  */
 static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 	return (uint16_t)(m->buf_len - rte_pktmbuf_headroom(m) -
 			  m->pkt.data_len);
 }
@@ -850,7 +737,7 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 {
 	struct rte_mbuf *m2 = (struct rte_mbuf *)m;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 	while (m2->pkt.next != NULL)
 		m2 = m2->pkt.next;
 	return m2;
@@ -908,7 +795,7 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 static inline char *rte_pktmbuf_prepend(struct rte_mbuf *m,
 					uint16_t len)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	if (unlikely(len > rte_pktmbuf_headroom(m)))
 		return NULL;
@@ -940,7 +827,7 @@ static inline char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
 	void *tail;
 	struct rte_mbuf *m_last;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	m_last = rte_pktmbuf_lastseg(m);
 	if (unlikely(len > rte_pktmbuf_tailroom(m_last)))
@@ -968,7 +855,7 @@ static inline char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
  */
 static inline char *rte_pktmbuf_adj(struct rte_mbuf *m, uint16_t len)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	if (unlikely(len > m->pkt.data_len))
 		return NULL;
@@ -997,7 +884,7 @@ static inline int rte_pktmbuf_trim(struct rte_mbuf *m, uint16_t len)
 {
 	struct rte_mbuf *m_last;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	m_last = rte_pktmbuf_lastseg(m);
 	if (unlikely(len > m_last->pkt.data_len))
@@ -1019,7 +906,7 @@ static inline int rte_pktmbuf_trim(struct rte_mbuf *m, uint16_t len)
  */
 static inline int rte_pktmbuf_is_contiguous(const struct rte_mbuf *m)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 	return !!(m->pkt.nb_segs == 1);
 }
 
diff --git a/lib/librte_pmd_e1000/em_rxtx.c b/lib/librte_pmd_e1000/em_rxtx.c
index 78c0c44..31f480a 100644
--- a/lib/librte_pmd_e1000/em_rxtx.c
+++ b/lib/librte_pmd_e1000/em_rxtx.c
@@ -85,7 +85,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
index b3c8149..62ff7bc 100644
--- a/lib/librte_pmd_e1000/igb_rxtx.c
+++ b/lib/librte_pmd_e1000/igb_rxtx.c
@@ -79,7 +79,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 4e307c2..76448ab 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -88,7 +88,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
@@ -987,7 +987,6 @@ ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
 		/* populate the static rte mbuf fields */
 		mb = rxep[i].mbuf;
 		rte_mbuf_refcnt_set(mb, 1);
-		mb->type = RTE_MBUF_PKT;
 		mb->pkt.next = NULL;
 		mb->pkt.data = (char *)mb->buf_addr + RTE_PKTMBUF_HEADROOM;
 		mb->pkt.nb_segs = 1;
@@ -3084,7 +3083,6 @@ ixgbe_alloc_rx_queue_mbufs(struct igb_rx_queue *rxq)
 		}
 
 		rte_mbuf_refcnt_set(mbuf, 1);
-		mbuf->type = RTE_MBUF_PKT;
 		mbuf->pkt.next = NULL;
 		mbuf->pkt.data = (char *)mbuf->buf_addr + RTE_PKTMBUF_HEADROOM;
 		mbuf->pkt.nb_segs = 1;
diff --git a/lib/librte_pmd_virtio/virtio_rxtx.c b/lib/librte_pmd_virtio/virtio_rxtx.c
index fe94a3f..0db3ba0 100644
--- a/lib/librte_pmd_virtio/virtio_rxtx.c
+++ b/lib/librte_pmd_virtio/virtio_rxtx.c
@@ -66,7 +66,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 
 	return (m);
 }
diff --git a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
index 9fdd441..d91404a 100644
--- a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
+++ b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
@@ -101,7 +101,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
diff --git a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
index 533aa76..5cd1cdb 100644
--- a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
+++ b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
@@ -80,7 +80,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 
 	return m;
 }
-- 
1.9.2
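
For readers who want to experiment with the check that remains in
rte_mbuf_sanity_check() once the type test is gone, here is a minimal
standalone sketch of the nb_segs walk. It is only a model: struct seg,
check_nb_segs() and example() are hypothetical names, the structure
keeps just the two fields the loop reads, and the function returns -1
where the patch calls rte_panic().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal segment: NOT the real struct rte_mbuf. */
struct seg {
	struct seg *next;  /* next segment of the packet, or NULL */
	uint8_t nb_segs;   /* segment count, meaningful in the header only */
};

/*
 * Model of the simplified check: return 0 if the chain starting at m
 * holds at least m->nb_segs segments, -1 otherwise.
 */
static int check_nb_segs(const struct seg *m, int is_header)
{
	unsigned nb_segs;

	if (m == NULL)
		return -1;

	/* nothing to check for sub-segments */
	if (is_header == 0)
		return 0;

	nb_segs = m->nb_segs;
	while (m != NULL && nb_segs != 0) {
		m = m->next;
		nb_segs--;
	}
	return (nb_segs != 0) ? -1 : 0;
}

/* Usage: a two-segment chain passes, an overstated count fails. */
static void example(void)
{
	struct seg tail = { NULL, 0 };
	struct seg head = { &tail, 2 };

	assert(check_nb_segs(&head, 1) == 0);
	head.nb_segs = 3;
	assert(check_nb_segs(&head, 1) == -1);
}
```

Note that the walk only detects a chain that is shorter than nb_segs;
a chain with extra trailing segments is not reported by this loop.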


* [dpdk-dev] [PATCH RFC 04/11] mbuf: remove the rte_pktmbuf structure
  2014-05-09 14:50 [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support Olivier Matz
                   ` (2 preceding siblings ...)
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf Olivier Matz
@ 2014-05-09 14:50 ` Olivier Matz
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield Olivier Matz
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Olivier Matz @ 2014-05-09 14:50 UTC (permalink / raw)
  To: dev

The rte_pktmbuf structure was originally embedded in the rte_mbuf
structure. This was needed when there were two types of mbuf (ctrl and
packet). Now that the control mbuf has been removed, rte_pktmbuf can be
merged into the rte_mbuf structure.

Advantages of doing this:
  - access to mbuf fields is simpler (e.g. m->data instead of m->pkt.data)
  - the structure becomes more consistent: for instance, there was no
    reason to keep the ol_flags field in rte_mbuf, apart from the other
    packet fields
  - it allows a deeper reorganization of the rte_mbuf structure in the
    next commits, which will save several bytes in it

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
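
As an illustration of what the merge means for application code, the
sketch below contrasts the two access patterns on heavily simplified
structures. Everything in it is an assumption made for the example:
old_mbuf, old_pktmbuf, new_mbuf and the helper functions are
hypothetical and only mirror a few of the field names touched by this
patch, not the real layout of struct rte_mbuf.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct old_mbuf;

/* Before the patch: packet fields live in a nested structure. */
struct old_pktmbuf {
	struct old_mbuf *next;
	void *data;
	uint16_t data_len;
	uint8_t nb_segs;
	uint32_t pkt_len;
};

struct old_mbuf {
	void *buf_addr;
	uint16_t ol_flags;
	struct old_pktmbuf pkt;  /* reached as m->pkt.<field> */
};

/* After the patch: the same fields sit directly in the mbuf. */
struct new_mbuf {
	void *buf_addr;
	uint16_t ol_flags;
	struct new_mbuf *next;
	void *data;
	uint16_t data_len;
	uint8_t nb_segs;
	uint32_t pkt_len;
};

/* Headroom computed through the nested member (old style). */
static inline uint16_t old_headroom(const struct old_mbuf *m)
{
	return (uint16_t)((const char *)m->pkt.data -
			  (const char *)m->buf_addr);
}

/* Same computation with direct field access (new style). */
static inline uint16_t new_headroom(const struct new_mbuf *m)
{
	return (uint16_t)((const char *)m->data -
			  (const char *)m->buf_addr);
}

/* Both accessors agree for the same buffer and data offset. */
static inline void check_same_headroom(char *buf, uint16_t off)
{
	struct old_mbuf o = { .buf_addr = buf };
	struct new_mbuf n = { .buf_addr = buf, .data = buf + off };

	o.pkt.data = buf + off;
	assert(old_headroom(&o) == new_headroom(&n));
}
```

For most callers the conversion is mechanical: each "->pkt." in a field
access becomes "->", which is what the bulk of the diff below does.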
 app/test-pmd/cmdline.c                             |   1 -
 app/test-pmd/csumonly.c                            |   6 +-
 app/test-pmd/ieee1588fwd.c                         |   6 +-
 app/test-pmd/macfwd-retry.c                        |   2 +-
 app/test-pmd/macfwd.c                              |   8 +-
 app/test-pmd/rxonly.c                              |  12 +-
 app/test-pmd/testpmd.c                             |   8 +-
 app/test-pmd/testpmd.h                             |   2 +-
 app/test-pmd/txonly.c                              |  42 +++----
 app/test/commands.c                                |   1 -
 app/test/test_mbuf.c                               |  12 +-
 app/test/test_sched.c                              |   4 +-
 examples/dpdk_qat/crypto.c                         |  22 ++--
 examples/dpdk_qat/main.c                           |   2 +-
 examples/exception_path/main.c                     |  10 +-
 examples/ip_reassembly/ipv4_rsmbl.h                |  20 +--
 examples/ip_reassembly/main.c                      |   6 +-
 examples/ipv4_frag/main.c                          |   4 +-
 examples/ipv4_frag/rte_ipv4_frag.h                 |  42 +++----
 examples/ipv4_multicast/main.c                     |  14 +--
 examples/l3fwd-power/main.c                        |   2 +-
 examples/l3fwd-vf/main.c                           |   2 +-
 examples/l3fwd/main.c                              |  10 +-
 examples/load_balancer/runtime.c                   |   2 +-
 .../client_server_mp/mp_client/client.c            |   2 +-
 examples/quota_watermark/qw/main.c                 |   4 +-
 examples/vhost/main.c                              |  22 ++--
 examples/vhost_xen/main.c                          |  22 ++--
 lib/librte_mbuf/rte_mbuf.c                         |  26 ++--
 lib/librte_mbuf/rte_mbuf.h                         | 140 ++++++++++-----------
 lib/librte_pmd_e1000/em_rxtx.c                     |  64 +++++-----
 lib/librte_pmd_e1000/igb_rxtx.c                    |  68 +++++-----
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c                  | 100 +++++++--------
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h                  |   2 +-
 lib/librte_pmd_pcap/rte_eth_pcap.c                 |  14 +--
 lib/librte_pmd_virtio/virtio_rxtx.c                |  16 +--
 lib/librte_pmd_virtio/virtqueue.h                  |   6 +-
 lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c              |  26 ++--
 lib/librte_pmd_xenvirt/rte_eth_xenvirt.c           |  12 +-
 lib/librte_pmd_xenvirt/virtqueue.h                 |   4 +-
 lib/librte_sched/rte_sched.c                       |  14 +--
 lib/librte_sched/rte_sched.h                       |  10 +-
 42 files changed, 394 insertions(+), 398 deletions(-)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index e3d1849..c507c46 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -5009,7 +5009,6 @@ dump_struct_sizes(void)
 {
 #define DUMP_SIZE(t) printf("sizeof(" #t ") = %u\n", (unsigned)sizeof(t));
 	DUMP_SIZE(struct rte_mbuf);
-	DUMP_SIZE(struct rte_pktmbuf);
 	DUMP_SIZE(struct rte_mempool);
 	DUMP_SIZE(struct rte_ring);
 #undef DUMP_SIZE
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 3568ba0..ee82eb6 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -263,7 +263,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 		pkt_ol_flags = mb->ol_flags;
 		ol_flags = (uint16_t) (pkt_ol_flags & (~PKT_TX_L4_MASK));
 
-		eth_hdr = (struct ether_hdr *) mb->pkt.data;
+		eth_hdr = (struct ether_hdr *) mb->data;
 		eth_type = rte_be_to_cpu_16(eth_hdr->ether_type);
 		if (eth_type == ETHER_TYPE_VLAN) {
 			/* Only allow single VLAN label here */
@@ -430,8 +430,8 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 		}
 
 		/* Combine the packet header write. VLAN is not consider here */
-		mb->pkt.vlan_macip.f.l2_len = l2_len;
-		mb->pkt.vlan_macip.f.l3_len = l3_len;
+		mb->vlan_macip.f.l2_len = l2_len;
+		mb->vlan_macip.f.l3_len = l3_len;
 		mb->ol_flags = ol_flags;
 	}
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_rx);
diff --git a/app/test-pmd/ieee1588fwd.c b/app/test-pmd/ieee1588fwd.c
index 44f0a89..4f18183 100644
--- a/app/test-pmd/ieee1588fwd.c
+++ b/app/test-pmd/ieee1588fwd.c
@@ -546,7 +546,7 @@ ieee1588_packet_fwd(struct fwd_stream *fs)
 	 * Check that the received packet is a PTP packet that was detected
 	 * by the hardware.
 	 */
-	eth_hdr = (struct ether_hdr *)mb->pkt.data;
+	eth_hdr = (struct ether_hdr *)mb->data;
 	eth_type = rte_be_to_cpu_16(eth_hdr->ether_type);
 	if (! (mb->ol_flags & PKT_RX_IEEE1588_PTP)) {
 		if (eth_type == ETHER_TYPE_1588) {
@@ -557,7 +557,7 @@ ieee1588_packet_fwd(struct fwd_stream *fs)
 			printf("Port %u Received non PTP packet type=0x%4x "
 			       "len=%u\n",
 			       (unsigned) fs->rx_port, eth_type,
-			       (unsigned) mb->pkt.pkt_len);
+			       (unsigned) mb->pkt_len);
 		}
 		rte_pktmbuf_free(mb);
 		return;
@@ -574,7 +574,7 @@ ieee1588_packet_fwd(struct fwd_stream *fs)
 	 * Check that the received PTP packet is a PTP V2 packet of type
 	 * PTP_SYNC_MESSAGE.
 	 */
-	ptp_hdr = (struct ptpv2_msg *) ((char *) mb->pkt.data +
+	ptp_hdr = (struct ptpv2_msg *) ((char *) mb->data +
 					sizeof(struct ether_hdr));
 	if (ptp_hdr->version != 0x02) {
 		printf("Port %u Received PTP V2 Ethernet frame with wrong PTP"
diff --git a/app/test-pmd/macfwd-retry.c b/app/test-pmd/macfwd-retry.c
index 98fc037..687ff8d 100644
--- a/app/test-pmd/macfwd-retry.c
+++ b/app/test-pmd/macfwd-retry.c
@@ -119,7 +119,7 @@ pkt_burst_mac_retry_forward(struct fwd_stream *fs)
 	fs->rx_packets += nb_rx;
 	for (i = 0; i < nb_rx; i++) {
 		mb = pkts_burst[i];
-		eth_hdr = (struct ether_hdr *) mb->pkt.data;
+		eth_hdr = (struct ether_hdr *) mb->data;
 		ether_addr_copy(&peer_eth_addrs[fs->peer_addr],
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 3099792..8d7612c 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -110,15 +110,15 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 	txp = &ports[fs->tx_port];
 	for (i = 0; i < nb_rx; i++) {
 		mb = pkts_burst[i];
-		eth_hdr = (struct ether_hdr *) mb->pkt.data;
+		eth_hdr = (struct ether_hdr *) mb->data;
 		ether_addr_copy(&peer_eth_addrs[fs->peer_addr],
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
 				&eth_hdr->s_addr);
 		mb->ol_flags = txp->tx_ol_flags;
-		mb->pkt.vlan_macip.f.l2_len = sizeof(struct ether_hdr);
-		mb->pkt.vlan_macip.f.l3_len = sizeof(struct ipv4_hdr);
-		mb->pkt.vlan_macip.f.vlan_tci = txp->tx_vlan_id;
+		mb->vlan_macip.f.l2_len = sizeof(struct ether_hdr);
+		mb->vlan_macip.f.l3_len = sizeof(struct ipv4_hdr);
+		mb->vlan_macip.f.vlan_tci = txp->tx_vlan_id;
 	}
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_rx);
 	fs->tx_packets += nb_tx;
diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index 30f8195..b77c8ce 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -149,24 +149,24 @@ pkt_burst_receive(struct fwd_stream *fs)
 			rte_pktmbuf_free(mb);
 			continue;
 		}
-		eth_hdr = (struct ether_hdr *) mb->pkt.data;
+		eth_hdr = (struct ether_hdr *) mb->data;
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		ol_flags = mb->ol_flags;
 		print_ether_addr("  src=", &eth_hdr->s_addr);
 		print_ether_addr(" - dst=", &eth_hdr->d_addr);
 		printf(" - type=0x%04x - length=%u - nb_segs=%d",
-		       eth_type, (unsigned) mb->pkt.pkt_len,
-		       (int)mb->pkt.nb_segs);
+		       eth_type, (unsigned) mb->pkt_len,
+		       (int)mb->nb_segs);
 		if (ol_flags & PKT_RX_RSS_HASH) {
-			printf(" - RSS hash=0x%x", (unsigned) mb->pkt.hash.rss);
+			printf(" - RSS hash=0x%x", (unsigned) mb->hash.rss);
 			printf(" - RSS queue=0x%x",(unsigned) fs->rx_queue);
 		}
 		else if (ol_flags & PKT_RX_FDIR)
 			printf(" - FDIR hash=0x%x - FDIR id=0x%x ",
-			       mb->pkt.hash.fdir.hash, mb->pkt.hash.fdir.id);
+			       mb->hash.fdir.hash, mb->hash.fdir.id);
 		if (ol_flags & PKT_RX_VLAN_PKT)
 			printf(" - VLAN tci=0x%x",
-				mb->pkt.vlan_macip.f.vlan_tci);
+				mb->vlan_macip.f.vlan_tci);
 		printf("\n");
 		if (ol_flags != 0) {
 			int rxf;
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 76b3823..1964020 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -395,10 +395,10 @@ testpmd_mbuf_ctor(struct rte_mempool *mp,
 			mb_ctor_arg->seg_buf_offset);
 	mb->buf_len      = mb_ctor_arg->seg_buf_size;
 	mb->ol_flags     = 0;
-	mb->pkt.data     = (char *) mb->buf_addr + RTE_PKTMBUF_HEADROOM;
-	mb->pkt.nb_segs  = 1;
-	mb->pkt.vlan_macip.data = 0;
-	mb->pkt.hash.rss = 0;
+	mb->data         = (char *) mb->buf_addr + RTE_PKTMBUF_HEADROOM;
+	mb->nb_segs      = 1;
+	mb->vlan_macip.data = 0;
+	mb->hash.rss     = 0;
 }
 
 static void
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 5b4ee6f..bb10d3b 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -60,7 +60,7 @@ int main(int argc, char **argv);
  * The maximum number of segments per packet is used when creating
  * scattered transmit packets composed of a list of mbufs.
  */
-#define RTE_MAX_SEGS_PER_PKT 255 /**< pkt.nb_segs is a 8-bit unsigned char. */
+#define RTE_MAX_SEGS_PER_PKT 255 /**< nb_segs is an 8-bit unsigned char. */
 
 #define MAX_PKT_BURST 512
 #define DEF_PKT_BURST 16
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index 1f066d0..3baa0c8 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -106,18 +106,18 @@ copy_buf_to_pkt_segs(void* buf, unsigned len, struct rte_mbuf *pkt,
 	unsigned copy_len;
 
 	seg = pkt;
-	while (offset >= seg->pkt.data_len) {
-		offset -= seg->pkt.data_len;
-		seg = seg->pkt.next;
+	while (offset >= seg->data_len) {
+		offset -= seg->data_len;
+		seg = seg->next;
 	}
-	copy_len = seg->pkt.data_len - offset;
-	seg_buf = ((char *) seg->pkt.data + offset);
+	copy_len = seg->data_len - offset;
+	seg_buf = ((char *) seg->data + offset);
 	while (len > copy_len) {
 		rte_memcpy(seg_buf, buf, (size_t) copy_len);
 		len -= copy_len;
 		buf = ((char*) buf + copy_len);
-		seg = seg->pkt.next;
-		seg_buf = seg->pkt.data;
+		seg = seg->next;
+		seg_buf = seg->data;
 	}
 	rte_memcpy(seg_buf, buf, (size_t) len);
 }
@@ -125,8 +125,8 @@ copy_buf_to_pkt_segs(void* buf, unsigned len, struct rte_mbuf *pkt,
 static inline void
 copy_buf_to_pkt(void* buf, unsigned len, struct rte_mbuf *pkt, unsigned offset)
 {
-	if (offset + len <= pkt->pkt.data_len) {
-		rte_memcpy(((char *) pkt->pkt.data + offset), buf, (size_t) len);
+	if (offset + len <= pkt->data_len) {
+		rte_memcpy(((char *) pkt->data + offset), buf, (size_t) len);
 		return;
 	}
 	copy_buf_to_pkt_segs(buf, len, pkt, offset);
@@ -225,19 +225,19 @@ pkt_burst_transmit(struct fwd_stream *fs)
 				return;
 			break;
 		}
-		pkt->pkt.data_len = tx_pkt_seg_lengths[0];
+		pkt->data_len = tx_pkt_seg_lengths[0];
 		pkt_seg = pkt;
 		for (i = 1; i < tx_pkt_nb_segs; i++) {
-			pkt_seg->pkt.next = tx_mbuf_alloc(mbp);
-			if (pkt_seg->pkt.next == NULL) {
-				pkt->pkt.nb_segs = i;
+			pkt_seg->next = tx_mbuf_alloc(mbp);
+			if (pkt_seg->next == NULL) {
+				pkt->nb_segs = i;
 				rte_pktmbuf_free(pkt);
 				goto nomore_mbuf;
 			}
-			pkt_seg = pkt_seg->pkt.next;
-			pkt_seg->pkt.data_len = tx_pkt_seg_lengths[i];
+			pkt_seg = pkt_seg->next;
+			pkt_seg->data_len = tx_pkt_seg_lengths[i];
 		}
-		pkt_seg->pkt.next = NULL; /* Last segment of packet. */
+		pkt_seg->next = NULL; /* Last segment of packet. */
 
 		/*
 		 * Initialize Ethernet header.
@@ -260,12 +260,12 @@ pkt_burst_transmit(struct fwd_stream *fs)
 		 * Complete first mbuf of packet and append it to the
 		 * burst of packets to be transmitted.
 		 */
-		pkt->pkt.nb_segs = tx_pkt_nb_segs;
-		pkt->pkt.pkt_len = tx_pkt_length;
+		pkt->nb_segs = tx_pkt_nb_segs;
+		pkt->pkt_len = tx_pkt_length;
 		pkt->ol_flags = ol_flags;
-		pkt->pkt.vlan_macip.f.vlan_tci  = vlan_tci;
-		pkt->pkt.vlan_macip.f.l2_len = sizeof(struct ether_hdr);
-		pkt->pkt.vlan_macip.f.l3_len = sizeof(struct ipv4_hdr);
+		pkt->vlan_macip.f.vlan_tci  = vlan_tci;
+		pkt->vlan_macip.f.l2_len = sizeof(struct ether_hdr);
+		pkt->vlan_macip.f.l3_len = sizeof(struct ipv4_hdr);
 		pkts_burst[nb_pkt] = pkt;
 	}
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_pkt);
diff --git a/app/test/commands.c b/app/test/commands.c
index c69544b..ef66fdd 100644
--- a/app/test/commands.c
+++ b/app/test/commands.c
@@ -261,7 +261,6 @@ dump_struct_sizes(void)
 {
 #define DUMP_SIZE(t) printf("sizeof(" #t ") = %u\n", (unsigned)sizeof(t));
 	DUMP_SIZE(struct rte_mbuf);
-	DUMP_SIZE(struct rte_pktmbuf);
 	DUMP_SIZE(struct rte_mempool);
 	DUMP_SIZE(struct rte_ring);
 #undef DUMP_SIZE
diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 07b5551..320d76f 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -344,8 +344,8 @@ testclone_testupdate_testdetach(void)
 		GOTO_FAIL("cannot clone data\n");
 	rte_pktmbuf_free(clone);
 
-	mc->pkt.next = rte_pktmbuf_alloc(pktmbuf_pool);
-	if(mc->pkt.next == NULL)
+	mc->next = rte_pktmbuf_alloc(pktmbuf_pool);
+	if(mc->next == NULL)
 		GOTO_FAIL("Next Pkt Null\n");
 
 	clone = rte_pktmbuf_clone(mc, pktmbuf_pool);
@@ -432,7 +432,7 @@ test_pktmbuf_pool_ptr(void)
 			printf("rte_pktmbuf_alloc() failed (%u)\n", i);
 			ret = -1;
 		}
-		m[i]->pkt.data = RTE_PTR_ADD(m[i]->pkt.data, 64);
+		m[i]->data = RTE_PTR_ADD(m[i]->data, 64);
 	}
 
 	/* free them */
@@ -451,8 +451,8 @@ test_pktmbuf_pool_ptr(void)
 			printf("rte_pktmbuf_alloc() failed (%u)\n", i);
 			ret = -1;
 		}
-		if (m[i]->pkt.data != RTE_PTR_ADD(m[i]->buf_addr, RTE_PKTMBUF_HEADROOM)) {
-			printf ("pkt.data pointer not set properly\n");
+		if (m[i]->data != RTE_PTR_ADD(m[i]->buf_addr, RTE_PKTMBUF_HEADROOM)) {
+			printf ("data pointer not set properly\n");
 			ret = -1;
 		}
 	}
@@ -493,7 +493,7 @@ test_pktmbuf_free_segment(void)
 			mb = m[i];
 			while(mb != NULL) {
 				mt = mb;
-				mb = mb->pkt.next;
+				mb = mb->next;
 				rte_pktmbuf_free_seg(mt);
 			}
 		}
diff --git a/app/test/test_sched.c b/app/test/test_sched.c
index 0de5b1c..f729ade 100644
--- a/app/test/test_sched.c
+++ b/app/test/test_sched.c
@@ -148,8 +148,8 @@ prepare_pkt(struct rte_mbuf *mbuf)
 	rte_sched_port_pkt_write(mbuf, SUBPORT, PIPE, TC, QUEUE, e_RTE_METER_YELLOW);
 
 	/* 64 byte packet */
-	mbuf->pkt.pkt_len  = 60;
-	mbuf->pkt.data_len = 60;
+	mbuf->pkt_len  = 60;
+	mbuf->data_len = 60;
 }
 
 
diff --git a/examples/dpdk_qat/crypto.c b/examples/dpdk_qat/crypto.c
index 7606d3d..e519e25 100644
--- a/examples/dpdk_qat/crypto.c
+++ b/examples/dpdk_qat/crypto.c
@@ -183,7 +183,7 @@ struct glob_keys g_crypto_hash_keys = {
  *
  */
 #define PACKET_DATA_START_PHYS(p) \
-		((p)->buf_physaddr + ((char *)p->pkt.data - (char *)p->buf_addr))
+		((p)->buf_physaddr + ((char *)p->data - (char *)p->buf_addr))
 
 /*
  * A fixed offset to where the crypto is to be performed, which is the first
@@ -773,7 +773,7 @@ enum crypto_result
 crypto_encrypt(struct rte_mbuf *rte_buff, enum cipher_alg c, enum hash_alg h)
 {
 	CpaCySymDpOpData *opData =
-			(CpaCySymDpOpData *) ((char *) (rte_buff->pkt.data)
+			(CpaCySymDpOpData *) ((char *) (rte_buff->data)
 					+ CRYPTO_OFFSET_TO_OPDATA);
 	uint32_t lcore_id;
 
@@ -785,7 +785,7 @@ crypto_encrypt(struct rte_mbuf *rte_buff, enum cipher_alg c, enum hash_alg h)
 	bzero(opData, sizeof(CpaCySymDpOpData));
 
 	opData->srcBuffer = opData->dstBuffer = PACKET_DATA_START_PHYS(rte_buff);
-	opData->srcBufferLen = opData->dstBufferLen = rte_buff->pkt.data_len;
+	opData->srcBufferLen = opData->dstBufferLen = rte_buff->data_len;
 	opData->sessionCtx = qaCoreConf[lcore_id].encryptSessionHandleTbl[c][h];
 	opData->thisPhys = PACKET_DATA_START_PHYS(rte_buff)
 			+ CRYPTO_OFFSET_TO_OPDATA;
@@ -805,7 +805,7 @@ crypto_encrypt(struct rte_mbuf *rte_buff, enum cipher_alg c, enum hash_alg h)
 			opData->ivLenInBytes = IV_LENGTH_8_BYTES;
 
 		opData->cryptoStartSrcOffsetInBytes = CRYPTO_START_OFFSET;
-		opData->messageLenToCipherInBytes = rte_buff->pkt.data_len
+		opData->messageLenToCipherInBytes = rte_buff->data_len
 				- CRYPTO_START_OFFSET;
 		/*
 		 * Work around for padding, message length has to be a multiple of
@@ -818,7 +818,7 @@ crypto_encrypt(struct rte_mbuf *rte_buff, enum cipher_alg c, enum hash_alg h)
 	if (NO_HASH != h) {
 
 		opData->hashStartSrcOffsetInBytes = HASH_START_OFFSET;
-		opData->messageLenToHashInBytes = rte_buff->pkt.data_len
+		opData->messageLenToHashInBytes = rte_buff->data_len
 				- HASH_START_OFFSET;
 		/*
 		 * Work around for padding, message length has to be a multiple of block
@@ -831,7 +831,7 @@ crypto_encrypt(struct rte_mbuf *rte_buff, enum cipher_alg c, enum hash_alg h)
 		 * Assumption: Ok ignore the passed digest pointer and place HMAC at end
 		 * of packet.
 		 */
-		opData->digestResult = rte_buff->buf_physaddr + rte_buff->pkt.data_len;
+		opData->digestResult = rte_buff->buf_physaddr + rte_buff->data_len;
 	}
 
 	if (CPA_STATUS_SUCCESS != enqueueOp(opData, lcore_id)) {
@@ -848,7 +848,7 @@ enum crypto_result
 crypto_decrypt(struct rte_mbuf *rte_buff, enum cipher_alg c, enum hash_alg h)
 {
 
-	CpaCySymDpOpData *opData = (void*) (((char *) rte_buff->pkt.data)
+	CpaCySymDpOpData *opData = (void*) (((char *) rte_buff->data)
 			+ CRYPTO_OFFSET_TO_OPDATA);
 	uint32_t lcore_id;
 
@@ -860,7 +860,7 @@ crypto_decrypt(struct rte_mbuf *rte_buff, enum cipher_alg c, enum hash_alg h)
 	bzero(opData, sizeof(CpaCySymDpOpData));
 
 	opData->dstBuffer = opData->srcBuffer = PACKET_DATA_START_PHYS(rte_buff);
-	opData->dstBufferLen = opData->srcBufferLen = rte_buff->pkt.data_len;
+	opData->dstBufferLen = opData->srcBufferLen = rte_buff->data_len;
 	opData->thisPhys = PACKET_DATA_START_PHYS(rte_buff)
 			+ CRYPTO_OFFSET_TO_OPDATA;
 	opData->sessionCtx = qaCoreConf[lcore_id].decryptSessionHandleTbl[c][h];
@@ -880,7 +880,7 @@ crypto_decrypt(struct rte_mbuf *rte_buff, enum cipher_alg c, enum hash_alg h)
 			opData->ivLenInBytes = IV_LENGTH_8_BYTES;
 
 		opData->cryptoStartSrcOffsetInBytes = CRYPTO_START_OFFSET;
-		opData->messageLenToCipherInBytes = rte_buff->pkt.data_len
+		opData->messageLenToCipherInBytes = rte_buff->data_len
 				- CRYPTO_START_OFFSET;
 
 		/*
@@ -892,7 +892,7 @@ crypto_decrypt(struct rte_mbuf *rte_buff, enum cipher_alg c, enum hash_alg h)
 	}
 	if (NO_HASH != h) {
 		opData->hashStartSrcOffsetInBytes = HASH_START_OFFSET;
-		opData->messageLenToHashInBytes = rte_buff->pkt.data_len
+		opData->messageLenToHashInBytes = rte_buff->data_len
 				- HASH_START_OFFSET;
 		/*
 		 * Work around for padding, message length has to be a multiple of block
@@ -900,7 +900,7 @@ crypto_decrypt(struct rte_mbuf *rte_buff, enum cipher_alg c, enum hash_alg h)
 		 */
 		opData->messageLenToHashInBytes -= opData->messageLenToHashInBytes
 				% HASH_BLOCK_DEFAULT_SIZE;
-		opData->digestResult = rte_buff->buf_physaddr + rte_buff->pkt.data_len;
+		opData->digestResult = rte_buff->buf_physaddr + rte_buff->data_len;
 	}
 
 	if (CPA_STATUS_SUCCESS != enqueueOp(opData, lcore_id)) {
diff --git a/examples/dpdk_qat/main.c b/examples/dpdk_qat/main.c
index cdf6832..1401d6b 100644
--- a/examples/dpdk_qat/main.c
+++ b/examples/dpdk_qat/main.c
@@ -384,7 +384,7 @@ main_loop(__attribute__((unused)) void *dummy)
 			}
 		}
 
-		port = dst_ports[pkt->pkt.in_port];
+		port = dst_ports[pkt->in_port];
 
 		/* Transmit the packet */
 		nic_tx_send_packet(pkt, (uint8_t)port);
diff --git a/examples/exception_path/main.c b/examples/exception_path/main.c
index 0bc149d..d9a85b5 100644
--- a/examples/exception_path/main.c
+++ b/examples/exception_path/main.c
@@ -302,16 +302,16 @@ main_loop(__attribute__((unused)) void *arg)
 			if (m == NULL)
 				continue;
 
-			ret = read(tap_fd, m->pkt.data, MAX_PACKET_SZ);
+			ret = read(tap_fd, m->data, MAX_PACKET_SZ);
 			lcore_stats[lcore_id].rx++;
 			if (unlikely(ret < 0)) {
 				FATAL_ERROR("Reading from %s interface failed",
 				            tap_name);
 			}
-			m->pkt.nb_segs = 1;
-			m->pkt.next = NULL;
-			m->pkt.pkt_len = (uint16_t)ret;
-			m->pkt.data_len = (uint16_t)ret;
+			m->nb_segs = 1;
+			m->next = NULL;
+			m->pkt_len = (uint16_t)ret;
+			m->data_len = (uint16_t)ret;
 			ret = rte_eth_tx_burst(port_ids[lcore_id], 0, &m, 1);
 			if (unlikely(ret < 1)) {
 				rte_pktmbuf_free(m);
diff --git a/examples/ip_reassembly/ipv4_rsmbl.h b/examples/ip_reassembly/ipv4_rsmbl.h
index 58ec1ee..9b647fb 100644
--- a/examples/ip_reassembly/ipv4_rsmbl.h
+++ b/examples/ip_reassembly/ipv4_rsmbl.h
@@ -168,20 +168,20 @@ ipv4_frag_chain(struct rte_mbuf *mn, struct rte_mbuf *mp)
 	struct rte_mbuf *ms;
 
 	/* adjust start of the last fragment data. */
-	rte_pktmbuf_adj(mp, (uint16_t)(mp->pkt.vlan_macip.f.l2_len +
-		mp->pkt.vlan_macip.f.l3_len));
+	rte_pktmbuf_adj(mp, (uint16_t)(mp->vlan_macip.f.l2_len +
+		mp->vlan_macip.f.l3_len));
 				
 	/* chain two fragments. */
 	ms = rte_pktmbuf_lastseg(mn);
-	ms->pkt.next = mp;
+	ms->next = mp;
 
 	/* accumulate number of segments and total length. */
-	mn->pkt.nb_segs = (uint8_t)(mn->pkt.nb_segs + mp->pkt.nb_segs);
-	mn->pkt.pkt_len += mp->pkt.pkt_len;
+	mn->nb_segs = (uint8_t)(mn->nb_segs + mp->nb_segs);
+	mn->pkt_len += mp->pkt_len;
 					
 	/* reset pkt_len and nb_segs for chained fragment. */
-	mp->pkt.pkt_len = mp->pkt.data_len;
-	mp->pkt.nb_segs = 1;
+	mp->pkt_len = mp->data_len;
+	mp->nb_segs = 1;
 }
 
 /*
@@ -233,10 +233,10 @@ ipv4_frag_reassemble(const struct ipv4_frag_pkt *fp)
 
 	/* update ipv4 header for the reassmebled packet */
 	ip_hdr = (struct ipv4_hdr*)(rte_pktmbuf_mtod(m, uint8_t *) +
-		m->pkt.vlan_macip.f.l2_len);
+		m->vlan_macip.f.l2_len);
 
 	ip_hdr->total_length = rte_cpu_to_be_16((uint16_t)(fp->total_size +
-		m->pkt.vlan_macip.f.l3_len));
+		m->vlan_macip.f.l3_len));
 	ip_hdr->fragment_offset = (uint16_t)(ip_hdr->fragment_offset &
 		rte_cpu_to_be_16(IPV4_HDR_DF_FLAG));
 	ip_hdr->hdr_checksum = 0;
@@ -377,7 +377,7 @@ ipv4_frag_mbuf(struct ipv4_frag_tbl *tbl, struct ipv4_frag_death_row *dr,
 
 	ip_ofs *= IPV4_HDR_OFFSET_UNITS;
 	ip_len = (uint16_t)(rte_be_to_cpu_16(ip_hdr->total_length) -
-		mb->pkt.vlan_macip.f.l3_len);
+		mb->vlan_macip.f.l3_len);
 
 	IPV4_FRAG_LOG(DEBUG, "%s:%d:\n"
 		"mbuf: %p, tms: %" PRIu64
diff --git a/examples/ip_reassembly/main.c b/examples/ip_reassembly/main.c
index 4880a5f..5c5626a 100644
--- a/examples/ip_reassembly/main.c
+++ b/examples/ip_reassembly/main.c
@@ -655,7 +655,7 @@ l3fwd_simple_forward(struct rte_mbuf *m, uint8_t portid, uint32_t queue,
 
 #ifdef DO_RFC_1812_CHECKS
 		/* Check to make sure the packet is valid (RFC1812) */
-		if (is_valid_ipv4_pkt(ipv4_hdr, m->pkt.pkt_len) < 0) {
+		if (is_valid_ipv4_pkt(ipv4_hdr, m->pkt_len) < 0) {
 			rte_pktmbuf_free(m);
 			return;
 		}
@@ -680,8 +680,8 @@ l3fwd_simple_forward(struct rte_mbuf *m, uint8_t portid, uint32_t queue,
 			dr = &qconf->death_row;
 
 			/* prepare mbuf: setup l2_len/l3_len. */
-			m->pkt.vlan_macip.f.l2_len = sizeof(*eth_hdr);
-			m->pkt.vlan_macip.f.l3_len = sizeof(*ipv4_hdr);
+			m->vlan_macip.f.l2_len = sizeof(*eth_hdr);
+			m->vlan_macip.f.l3_len = sizeof(*ipv4_hdr);
 
 			/* process this fragment. */
 			if ((mo = ipv4_frag_mbuf(tbl, dr, m, tms, ipv4_hdr,
diff --git a/examples/ipv4_frag/main.c b/examples/ipv4_frag/main.c
index 93664c8..b950b87 100644
--- a/examples/ipv4_frag/main.c
+++ b/examples/ipv4_frag/main.c
@@ -257,7 +257,7 @@ l3fwd_simple_forward(struct rte_mbuf *m, uint8_t port_in)
 	len = qconf->tx_mbufs[port_out].len;
 
 	/* if we don't need to do any fragmentation */
-	if (likely (IPV4_MTU_DEFAULT  >= m->pkt.pkt_len)) {
+	if (likely (IPV4_MTU_DEFAULT  >= m->pkt_len)) {
 		qconf->tx_mbufs[port_out].m_table[len] = m;
 		len2 = 1;
 	} else {
@@ -283,7 +283,7 @@ l3fwd_simple_forward(struct rte_mbuf *m, uint8_t port_in)
 			rte_panic("No headroom in mbuf.\n");
 		}
 
-		m->pkt.vlan_macip.f.l2_len = sizeof(struct ether_hdr);
+		m->vlan_macip.f.l2_len = sizeof(struct ether_hdr);
 
 		ether_addr_copy(&remote_eth_addr, &eth_hdr->d_addr);
 		ether_addr_copy(&ports_eth_addr[port_out], &eth_hdr->s_addr);
diff --git a/examples/ipv4_frag/rte_ipv4_frag.h b/examples/ipv4_frag/rte_ipv4_frag.h
index 84fa9c9..6234224 100644
--- a/examples/ipv4_frag/rte_ipv4_frag.h
+++ b/examples/ipv4_frag/rte_ipv4_frag.h
@@ -145,9 +145,9 @@ static inline int32_t rte_ipv4_fragmentation(struct rte_mbuf *pkt_in,
 
 	/* Fragment size should be a multiply of 8. */
 	RTE_IPV4_FRAG_ASSERT(IPV4_MAX_FRAGS_PER_PACKET * frag_size >=
-	    (uint16_t)(pkt_in->pkt.pkt_len - sizeof (struct ipv4_hdr)));
+	    (uint16_t)(pkt_in->pkt_len - sizeof (struct ipv4_hdr)));
 
-	in_hdr = (struct ipv4_hdr*) pkt_in->pkt.data;
+	in_hdr = (struct ipv4_hdr*) pkt_in->data;
 	flag_offset = rte_cpu_to_be_16(in_hdr->fragment_offset);
 
 	/* If Don't Fragment flag is set */
@@ -156,7 +156,7 @@ static inline int32_t rte_ipv4_fragmentation(struct rte_mbuf *pkt_in,
 
 	/* Check that pkts_out is big enough to hold all fragments */
 	if (unlikely (frag_size * nb_pkts_out <
-	    (uint16_t)(pkt_in->pkt.pkt_len - sizeof (struct ipv4_hdr))))
+	    (uint16_t)(pkt_in->pkt_len - sizeof (struct ipv4_hdr))))
 		return (-EINVAL);
 
 	in_seg = pkt_in;
@@ -178,8 +178,8 @@ static inline int32_t rte_ipv4_fragmentation(struct rte_mbuf *pkt_in,
 		}
 
 		/* Reserve space for the IP header that will be built later */
-		out_pkt->pkt.data_len = sizeof(struct ipv4_hdr);
-		out_pkt->pkt.pkt_len = sizeof(struct ipv4_hdr);
+		out_pkt->data_len = sizeof(struct ipv4_hdr);
+		out_pkt->pkt_len = sizeof(struct ipv4_hdr);
 
 		out_seg_prev = out_pkt;
 		more_out_segs = 1;
@@ -194,30 +194,30 @@ static inline int32_t rte_ipv4_fragmentation(struct rte_mbuf *pkt_in,
 				__free_fragments(pkts_out, out_pkt_pos);
 				return (-ENOMEM);
 			}
-			out_seg_prev->pkt.next = out_seg;
+			out_seg_prev->next = out_seg;
 			out_seg_prev = out_seg;
 
 			/* Prepare indirect buffer */
 			rte_pktmbuf_attach(out_seg, in_seg);
-			len = mtu_size - out_pkt->pkt.pkt_len;
-			if (len > (in_seg->pkt.data_len - in_seg_data_pos)) {
-				len = in_seg->pkt.data_len - in_seg_data_pos;
+			len = mtu_size - out_pkt->pkt_len;
+			if (len > (in_seg->data_len - in_seg_data_pos)) {
+				len = in_seg->data_len - in_seg_data_pos;
 			}
-			out_seg->pkt.data = (char*) in_seg->pkt.data + (uint16_t)in_seg_data_pos;
-			out_seg->pkt.data_len = (uint16_t)len;
-			out_pkt->pkt.pkt_len = (uint16_t)(len +
-			    out_pkt->pkt.pkt_len);
-			out_pkt->pkt.nb_segs += 1;
+			out_seg->data = (char*) in_seg->data + (uint16_t)in_seg_data_pos;
+			out_seg->data_len = (uint16_t)len;
+			out_pkt->pkt_len = (uint16_t)(len +
+			    out_pkt->pkt_len);
+			out_pkt->nb_segs += 1;
 			in_seg_data_pos += len;
 
 			/* Current output packet (i.e. fragment) done ? */
-			if (unlikely(out_pkt->pkt.pkt_len >= mtu_size)) {
+			if (unlikely(out_pkt->pkt_len >= mtu_size)) {
 				more_out_segs = 0;
 			}
 
 			/* Current input segment done ? */
-			if (unlikely(in_seg_data_pos == in_seg->pkt.data_len)) {
-				in_seg = in_seg->pkt.next;
+			if (unlikely(in_seg_data_pos == in_seg->data_len)) {
+				in_seg = in_seg->next;
 				in_seg_data_pos = 0;
 
 				if (unlikely(in_seg == NULL)) {
@@ -228,17 +228,17 @@ static inline int32_t rte_ipv4_fragmentation(struct rte_mbuf *pkt_in,
 
 		/* Build the IP header */
 
-		out_hdr = (struct ipv4_hdr*) out_pkt->pkt.data;
+		out_hdr = (struct ipv4_hdr*) out_pkt->data;
 
 		__fill_ipv4hdr_frag(out_hdr, in_hdr,
-		    (uint16_t)out_pkt->pkt.pkt_len,
+		    (uint16_t)out_pkt->pkt_len,
 		    flag_offset, fragment_offset, more_in_segs);
 
 		fragment_offset = (uint16_t)(fragment_offset +
-		    out_pkt->pkt.pkt_len - sizeof(struct ipv4_hdr));
+		    out_pkt->pkt_len - sizeof(struct ipv4_hdr));
 
 		out_pkt->ol_flags |= PKT_TX_IP_CKSUM;
-		out_pkt->pkt.vlan_macip.f.l3_len = sizeof(struct ipv4_hdr);
+		out_pkt->vlan_macip.f.l3_len = sizeof(struct ipv4_hdr);
 
 		/* Write the fragment to the output list */
 		pkts_out[out_pkt_pos] = out_pkt;
diff --git a/examples/ipv4_multicast/main.c b/examples/ipv4_multicast/main.c
index 3967d7a..9c57ce0 100644
--- a/examples/ipv4_multicast/main.c
+++ b/examples/ipv4_multicast/main.c
@@ -329,17 +329,17 @@ mcast_out_pkt(struct rte_mbuf *pkt, int use_clone)
 	}
 
 	/* prepend new header */
-	hdr->pkt.next = pkt;
+	hdr->next = pkt;
 
 
 	/* update header's fields */
-	hdr->pkt.pkt_len = (uint16_t)(hdr->pkt.data_len + pkt->pkt.pkt_len);
-	hdr->pkt.nb_segs = (uint8_t)(pkt->pkt.nb_segs + 1);
+	hdr->pkt_len = (uint16_t)(hdr->data_len + pkt->pkt_len);
+	hdr->nb_segs = (uint8_t)(pkt->nb_segs + 1);
 
 	/* copy metadata from source packet*/
-	hdr->pkt.in_port = pkt->pkt.in_port;
-	hdr->pkt.vlan_macip = pkt->pkt.vlan_macip;
-	hdr->pkt.hash = pkt->pkt.hash;
+	hdr->in_port = pkt->in_port;
+	hdr->vlan_macip = pkt->vlan_macip;
+	hdr->hash = pkt->hash;
 
 	hdr->ol_flags = pkt->ol_flags;
 
@@ -412,7 +412,7 @@ mcast_forward(struct rte_mbuf *m, struct lcore_queue_conf *qconf)
 
 	/* Should we use rte_pktmbuf_clone() or not. */
 	use_clone = (port_num <= MCAST_CLONE_PORTS &&
-	    m->pkt.nb_segs <= MCAST_CLONE_SEGS);
+	    m->nb_segs <= MCAST_CLONE_SEGS);
 
 	/* Mark all packet's segments as referenced port_num times */
 	if (use_clone == 0)
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 219f802..a991809 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -687,7 +687,7 @@ l3fwd_simple_forward(struct rte_mbuf *m, uint8_t portid,
 
 #ifdef DO_RFC_1812_CHECKS
 		/* Check to make sure the packet is valid (RFC1812) */
-		if (is_valid_ipv4_pkt(ipv4_hdr, m->pkt.pkt_len) < 0) {
+		if (is_valid_ipv4_pkt(ipv4_hdr, m->pkt_len) < 0) {
 			rte_pktmbuf_free(m);
 			return;
 		}
diff --git a/examples/l3fwd-vf/main.c b/examples/l3fwd-vf/main.c
index fb811fa..7420d89 100644
--- a/examples/l3fwd-vf/main.c
+++ b/examples/l3fwd-vf/main.c
@@ -489,7 +489,7 @@ l3fwd_simple_forward(struct rte_mbuf *m, uint8_t portid, lookup_struct_t * l3fwd
 
 #ifdef DO_RFC_1812_CHECKS
 	/* Check to make sure the packet is valid (RFC1812) */
-	if (is_valid_ipv4_pkt(ipv4_hdr, m->pkt.pkt_len) < 0) {
+	if (is_valid_ipv4_pkt(ipv4_hdr, m->pkt_len) < 0) {
 		rte_pktmbuf_free(m);
 		return;
 	}
diff --git a/examples/l3fwd/main.c b/examples/l3fwd/main.c
index 1ba4ca2..3eff4ea 100755
--- a/examples/l3fwd/main.c
+++ b/examples/l3fwd/main.c
@@ -705,19 +705,19 @@ simple_ipv4_fwd_4pkts(struct rte_mbuf* m[4], uint8_t portid, struct lcore_conf *
 #ifdef DO_RFC_1812_CHECKS
 	/* Check to make sure the packet is valid (RFC1812) */
 	uint8_t valid_mask = MASK_ALL_PKTS;
-	if (is_valid_ipv4_pkt(ipv4_hdr[0], m[0]->pkt.pkt_len) < 0) {
+	if (is_valid_ipv4_pkt(ipv4_hdr[0], m[0]->pkt_len) < 0) {
 		rte_pktmbuf_free(m[0]);
 		valid_mask &= EXECLUDE_1ST_PKT;
 	}
-	if (is_valid_ipv4_pkt(ipv4_hdr[1], m[1]->pkt.pkt_len) < 0) {
+	if (is_valid_ipv4_pkt(ipv4_hdr[1], m[1]->pkt_len) < 0) {
 		rte_pktmbuf_free(m[1]);
 		valid_mask &= EXECLUDE_2ND_PKT;
 	}
-	if (is_valid_ipv4_pkt(ipv4_hdr[2], m[2]->pkt.pkt_len) < 0) {
+	if (is_valid_ipv4_pkt(ipv4_hdr[2], m[2]->pkt_len) < 0) {
 		rte_pktmbuf_free(m[2]);
 		valid_mask &= EXECLUDE_3RD_PKT;
 	}
-	if (is_valid_ipv4_pkt(ipv4_hdr[3], m[3]->pkt.pkt_len) < 0) {
+	if (is_valid_ipv4_pkt(ipv4_hdr[3], m[3]->pkt_len) < 0) {
 		rte_pktmbuf_free(m[3]);
 		valid_mask &= EXECLUDE_4TH_PKT;
 	}
@@ -905,7 +905,7 @@ l3fwd_simple_forward(struct rte_mbuf *m, uint8_t portid, struct lcore_conf *qcon
 
 #ifdef DO_RFC_1812_CHECKS
 		/* Check to make sure the packet is valid (RFC1812) */
-		if (is_valid_ipv4_pkt(ipv4_hdr, m->pkt.pkt_len) < 0) {
+		if (is_valid_ipv4_pkt(ipv4_hdr, m->pkt_len) < 0) {
 			rte_pktmbuf_free(m);
 			return;
 		}
diff --git a/examples/load_balancer/runtime.c b/examples/load_balancer/runtime.c
index e85abdb..bfa7c58 100644
--- a/examples/load_balancer/runtime.c
+++ b/examples/load_balancer/runtime.c
@@ -540,7 +540,7 @@ app_lcore_worker(
 			ipv4_dst = rte_be_to_cpu_32(ipv4_hdr->dst_addr);
 
 			if (unlikely(rte_lpm_lookup(lp->lpm_table, ipv4_dst, &port) != 0)) {
-				port = pkt->pkt.in_port;
+				port = pkt->in_port;
 			}
 
 			pos = lp->mbuf_out[port].n_mbufs;
diff --git a/examples/multi_process/client_server_mp/mp_client/client.c b/examples/multi_process/client_server_mp/mp_client/client.c
index 7543db4..187c80f 100644
--- a/examples/multi_process/client_server_mp/mp_client/client.c
+++ b/examples/multi_process/client_server_mp/mp_client/client.c
@@ -211,7 +211,7 @@ enqueue_packet(struct rte_mbuf *buf, uint8_t port)
 static void
 handle_packet(struct rte_mbuf *buf)
 {
-	const uint8_t in_port = buf->pkt.in_port;
+	const uint8_t in_port = buf->in_port;
 	const uint8_t out_port = output_ports[in_port];
 
 	enqueue_packet(buf, out_port);
diff --git a/examples/quota_watermark/qw/main.c b/examples/quota_watermark/qw/main.c
index 21e0fc7..421b43d 100644
--- a/examples/quota_watermark/qw/main.c
+++ b/examples/quota_watermark/qw/main.c
@@ -104,8 +104,8 @@ static void send_pause_frame(uint8_t port_id, uint16_t duration)
     pause_frame->opcode = rte_cpu_to_be_16(0x0001);
     pause_frame->param  = rte_cpu_to_be_16(duration);
 
-    mbuf->pkt.pkt_len  = 60;
-    mbuf->pkt.data_len = 60;
+    mbuf->pkt_len  = 60;
+    mbuf->data_len = 60;
 
     rte_eth_tx_burst(port_id, 0, &mbuf, 1);
 }
diff --git a/examples/vhost/main.c b/examples/vhost/main.c
index 816a71a..26cfc8e 100644
--- a/examples/vhost/main.c
+++ b/examples/vhost/main.c
@@ -815,7 +815,7 @@ virtio_dev_rx(struct virtio_net *dev, struct rte_mbuf **pkts, uint32_t count)
 		vq->used->ring[res_cur_idx & (vq->size - 1)].len = packet_len;
 
 		/* Copy mbuf data to buffer */
-		rte_memcpy((void *)(uintptr_t)buff_addr, (const void*)buff->pkt.data, rte_pktmbuf_data_len(buff));
+		rte_memcpy((void *)(uintptr_t)buff_addr, (const void*)buff->data, rte_pktmbuf_data_len(buff));
 
 		res_cur_idx++;
 		packet_success++;
@@ -877,7 +877,7 @@ link_vmdq(struct virtio_net *dev, struct rte_mbuf *m)
 	int i, ret;
 
 	/* Learn MAC address of guest device from packet */
-	pkt_hdr = (struct ether_hdr *)m->pkt.data;
+	pkt_hdr = (struct ether_hdr *)m->data;
 
 	dev_ll = ll_root_used;
 
@@ -965,7 +965,7 @@ virtio_tx_local(struct virtio_net *dev, struct rte_mbuf *m)
 	struct ether_hdr *pkt_hdr;
 	uint64_t ret = 0;
 
-	pkt_hdr = (struct ether_hdr *)m->pkt.data;
+	pkt_hdr = (struct ether_hdr *)m->data;
 
 	/*get the used devices list*/
 	dev_ll = ll_root_used;
@@ -1038,22 +1038,22 @@ virtio_tx_route(struct virtio_net* dev, struct rte_mbuf *m, struct rte_mempool *
 		return;
 	}
 
-	mbuf->pkt.data_len = m->pkt.data_len + VLAN_HLEN;
-	mbuf->pkt.pkt_len = mbuf->pkt.data_len;
+	mbuf->data_len = m->data_len + VLAN_HLEN;
+	mbuf->pkt_len = mbuf->data_len;
 
 	/* Copy ethernet header to mbuf. */
-	rte_memcpy((void*)mbuf->pkt.data, (const void*)m->pkt.data, ETH_HLEN);
+	rte_memcpy((void*)mbuf->data, (const void*)m->data, ETH_HLEN);
 
 
 	/* Setup vlan header. Bytes need to be re-ordered for network with htons()*/
-	vlan_hdr = (struct vlan_ethhdr *) mbuf->pkt.data;
+	vlan_hdr = (struct vlan_ethhdr *) mbuf->data;
 	vlan_hdr->h_vlan_encapsulated_proto = vlan_hdr->h_vlan_proto;
 	vlan_hdr->h_vlan_proto = htons(ETH_P_8021Q);
 	vlan_hdr->h_vlan_TCI = htons(vlan_tag);
 
 	/* Copy the remaining packet contents to the mbuf. */
-	rte_memcpy((void*) ((uint8_t*)mbuf->pkt.data + VLAN_ETH_HLEN),
-		(const void*) ((uint8_t*)m->pkt.data + ETH_HLEN), (m->pkt.data_len - ETH_HLEN));
+	rte_memcpy((void*) ((uint8_t*)mbuf->data + VLAN_ETH_HLEN),
+		(const void*) ((uint8_t*)m->data + ETH_HLEN), (m->data_len - ETH_HLEN));
 	tx_q->m_table[len] = mbuf;
 	len++;
 	if (enable_stats) {
@@ -1143,8 +1143,8 @@ virtio_dev_tx(struct virtio_net* dev, struct rte_mempool *mbuf_pool)
 		vq->used->ring[used_idx].len = 0;
 
 		/* Setup dummy mbuf. This is copied to a real mbuf if transmitted out the physical port. */
-		m.pkt.data_len = desc->len;
-		m.pkt.data = (void*)(uintptr_t)buff_addr;
+		m.data_len = desc->len;
+		m.data = (void*)(uintptr_t)buff_addr;
 
 		PRINT_PACKET(dev, (uintptr_t)buff_addr, desc->len, 0);
 
diff --git a/examples/vhost_xen/main.c b/examples/vhost_xen/main.c
index eafc0aa..2cf0029 100644
--- a/examples/vhost_xen/main.c
+++ b/examples/vhost_xen/main.c
@@ -677,7 +677,7 @@ virtio_dev_rx(struct virtio_net *dev, struct rte_mbuf **pkts, uint32_t count)
 		vq->used->ring[res_cur_idx & (vq->size - 1)].len = packet_len;
 
 		/* Copy mbuf data to buffer */
-		rte_memcpy((void *)(uintptr_t)buff_addr, (const void*)buff->pkt.data, rte_pktmbuf_data_len(buff));
+		rte_memcpy((void *)(uintptr_t)buff_addr, (const void*)buff->data, rte_pktmbuf_data_len(buff));
 
 		res_cur_idx++;
 		packet_success++;
@@ -808,7 +808,7 @@ virtio_tx_local(struct virtio_net *dev, struct rte_mbuf *m)
 	struct ether_hdr *pkt_hdr;
 	uint64_t ret = 0;
 
-	pkt_hdr = (struct ether_hdr *)m->pkt.data;
+	pkt_hdr = (struct ether_hdr *)m->data;
 
 	/*get the used devices list*/
 	dev_ll = ll_root_used;
@@ -879,22 +879,22 @@ virtio_tx_route(struct virtio_net* dev, struct rte_mbuf *m, struct rte_mempool *
 	if(!mbuf)
 		return;
 
-	mbuf->pkt.data_len = m->pkt.data_len + VLAN_HLEN;
-	mbuf->pkt.pkt_len = mbuf->pkt.data_len;
+	mbuf->data_len = m->data_len + VLAN_HLEN;
+	mbuf->pkt_len = mbuf->data_len;
 
 	/* Copy ethernet header to mbuf. */
-	rte_memcpy((void*)mbuf->pkt.data, (const void*)m->pkt.data, ETH_HLEN);
+	rte_memcpy((void*)mbuf->data, (const void*)m->data, ETH_HLEN);
 
 
 	/* Setup vlan header. Bytes need to be re-ordered for network with htons()*/
-	vlan_hdr = (struct vlan_ethhdr *) mbuf->pkt.data;
+	vlan_hdr = (struct vlan_ethhdr *) mbuf->data;
 	vlan_hdr->h_vlan_encapsulated_proto = vlan_hdr->h_vlan_proto;
 	vlan_hdr->h_vlan_proto = htons(ETH_P_8021Q);
 	vlan_hdr->h_vlan_TCI = htons(vlan_tag);
 
 	/* Copy the remaining packet contents to the mbuf. */
-	rte_memcpy((void*) ((uint8_t*)mbuf->pkt.data + VLAN_ETH_HLEN),
-		(const void*) ((uint8_t*)m->pkt.data + ETH_HLEN), (m->pkt.data_len - ETH_HLEN));
+	rte_memcpy((void*) ((uint8_t*)mbuf->data + VLAN_ETH_HLEN),
+		(const void*) ((uint8_t*)m->data + ETH_HLEN), (m->data_len - ETH_HLEN));
 	tx_q->m_table[len] = mbuf;
 	len++;
 	if (enable_stats) {
@@ -980,9 +980,9 @@ virtio_dev_tx(struct virtio_net* dev, struct rte_mempool *mbuf_pool)
 		rte_prefetch0((void*)(uintptr_t)buff_addr);
 
 		/* Setup dummy mbuf. This is copied to a real mbuf if transmitted out the physical port. */
-		m.pkt.data_len = desc->len;
-		m.pkt.data = (void*)(uintptr_t)buff_addr;
-		m.pkt.nb_segs = 1; 
+		m.data_len = desc->len;
+		m.data = (void*)(uintptr_t)buff_addr;
+		m.nb_segs = 1;
 
 		virtio_tx_route(dev, &m, mbuf_pool, 0);
 
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index b2e2f0f..c229525 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -104,12 +104,12 @@ rte_pktmbuf_init(struct rte_mempool *mp,
 	m->buf_len = (uint16_t)buf_len;
 
 	/* keep some headroom between start of buffer and data */
-	m->pkt.data = (char*) m->buf_addr + RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
+	m->data = (char*) m->buf_addr + RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
 
 	/* init some constant fields */
 	m->pool = mp;
-	m->pkt.nb_segs = 1;
-	m->pkt.in_port = 0xff;
+	m->nb_segs = 1;
+	m->in_port = 0xff;
 }
 
 /* do some sanity checks on a mbuf: panic if it fails */
@@ -140,10 +140,10 @@ rte_mbuf_sanity_check(const struct rte_mbuf *m, int is_header)
 	if (is_header == 0)
 		return;
 
-	nb_segs = m->pkt.nb_segs;
+	nb_segs = m->nb_segs;
 	m_seg = m;
 	while (m_seg && nb_segs != 0) {
-		m_seg = m_seg->pkt.next;
+		m_seg = m_seg->next;
 		nb_segs --;
 	}
 	if (nb_segs != 0)
@@ -162,22 +162,22 @@ rte_pktmbuf_dump(const struct rte_mbuf *m, unsigned dump_len)
 	printf("dump mbuf at 0x%p, phys=%"PRIx64", buf_len=%u\n",
 	       m, (uint64_t)m->buf_physaddr, (unsigned)m->buf_len);
 	printf("  pkt_len=%"PRIu32", ol_flags=%"PRIx16", nb_segs=%u, "
-	       "in_port=%u\n", m->pkt.pkt_len, m->ol_flags,
-	       (unsigned)m->pkt.nb_segs, (unsigned)m->pkt.in_port);
-	nb_segs = m->pkt.nb_segs;
+	       "in_port=%u\n", m->pkt_len, m->ol_flags,
+	       (unsigned)m->nb_segs, (unsigned)m->in_port);
+	nb_segs = m->nb_segs;
 
 	while (m && nb_segs != 0) {
 		__rte_mbuf_sanity_check(m, 0);
 
 		printf("  segment at 0x%p, data=0x%p, data_len=%u\n",
-		       m, m->pkt.data, (unsigned)m->pkt.data_len);
+		       m, m->data, (unsigned)m->data_len);
 		len = dump_len;
-		if (len > m->pkt.data_len)
-			len = m->pkt.data_len;
+		if (len > m->data_len)
+			len = m->data_len;
 		if (len != 0)
-			rte_hexdump(NULL, m->pkt.data, len);
+			rte_hexdump(NULL, m->data, len);
 		dump_len -= len;
-		m = m->pkt.next;
+		m = m->next;
 		nb_segs --;
 	}
 }
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 22e1ac1..803b223 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -125,32 +125,6 @@ union rte_vlan_macip {
 #define TX_MACIP_LEN_CMP_MASK   (TX_MAC_LEN_CMP_MASK | TX_IP_LEN_CMP_MASK)
 
 /**
- * A packet message buffer.
- */
-struct rte_pktmbuf {
-	/* valid for any segment */
-	struct rte_mbuf *next;  /**< Next segment of scattered packet. */
-	void* data;             /**< Start address of data in segment buffer. */
-	uint16_t data_len;      /**< Amount of data in segment buffer. */
-
-	/* these fields are valid for first segment only */
-	uint8_t nb_segs;        /**< Number of segments. */
-	uint8_t in_port;        /**< Input port. */
-	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
-
-	/* offload features */
-	union rte_vlan_macip vlan_macip;
-	union {
-		uint32_t rss;       /**< RSS hash result if RSS enabled */
-		struct {
-			uint16_t hash;
-			uint16_t id;
-		} fdir;             /**< Filter identifier if FDIR enabled */
-		uint32_t sched;     /**< Hierarchical scheduler */
-	} hash;                 /**< hash information */
-};
-
-/**
  * The generic rte_mbuf, containing a packet mbuf.
  */
 struct rte_mbuf {
@@ -177,7 +151,26 @@ struct rte_mbuf {
 	uint16_t reserved;             /**< Unused field. Required for padding. */
 	uint16_t ol_flags;            /**< Offload features. */
 
-	struct rte_pktmbuf pkt;
+	/* valid for any segment */
+	struct rte_mbuf *next;  /**< Next segment of scattered packet. */
+	void* data;             /**< Start address of data in segment buffer. */
+	uint16_t data_len;      /**< Amount of data in segment buffer. */
+
+	/* these fields are valid for first segment only */
+	uint8_t nb_segs;        /**< Number of segments. */
+	uint8_t in_port;        /**< Input port. */
+	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
+
+	/* offload features, valid for first segment only */
+	union rte_vlan_macip vlan_macip;
+	union {
+		uint32_t rss;       /**< RSS hash result if RSS enabled */
+		struct {
+			uint16_t hash;
+			uint16_t id;
+		} fdir;             /**< Filter identifier if FDIR enabled */
+		uint32_t sched;     /**< Hierarchical scheduler */
+	} hash;                 /**< hash information */
 } __rte_cache_aligned;
 
 /**
@@ -444,18 +437,18 @@ static inline void rte_pktmbuf_reset(struct rte_mbuf *m)
 {
 	uint32_t buf_ofs;
 
-	m->pkt.next = NULL;
-	m->pkt.pkt_len = 0;
-	m->pkt.vlan_macip.data = 0;
-	m->pkt.nb_segs = 1;
-	m->pkt.in_port = 0xff;
+	m->next = NULL;
+	m->pkt_len = 0;
+	m->vlan_macip.data = 0;
+	m->nb_segs = 1;
+	m->in_port = 0xff;
 
 	m->ol_flags = 0;
 	buf_ofs = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
 			RTE_PKTMBUF_HEADROOM : m->buf_len;
-	m->pkt.data = (char*) m->buf_addr + buf_ofs;
+	m->data = (char*) m->buf_addr + buf_ofs;
 
-	m->pkt.data_len = 0;
+	m->data_len = 0;
 	__rte_mbuf_sanity_check(m, 1);
 }
 
@@ -509,11 +502,16 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *md)
 	mi->buf_addr = md->buf_addr;
 	mi->buf_len = md->buf_len;
 
-	mi->pkt = md->pkt;
+	mi->next = md->next;
+	mi->data = md->data;
+	mi->data_len = md->data_len;
+	mi->in_port = md->in_port;
+	mi->vlan_macip = md->vlan_macip;
+	mi->hash = md->hash;
 
-	mi->pkt.next = NULL;
-	mi->pkt.pkt_len = mi->pkt.data_len;
-	mi->pkt.nb_segs = 1;
+	mi->next = NULL;
+	mi->pkt_len = mi->data_len;
+	mi->nb_segs = 1;
 	mi->ol_flags = md->ol_flags;
 
 	__rte_mbuf_sanity_check(mi, 1);
@@ -543,9 +541,9 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 
 	buf_ofs = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
 			RTE_PKTMBUF_HEADROOM : m->buf_len;
-	m->pkt.data = (char*) m->buf_addr + buf_ofs;
+	m->data = (char*) m->buf_addr + buf_ofs;
 
-	m->pkt.data_len = 0;
+	m->data_len = 0;
 }
 
 #endif /* RTE_MBUF_REFCNT */
@@ -612,7 +610,7 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
 	__rte_mbuf_sanity_check(m, 1);
 
 	while (m != NULL) {
-		m_next = m->pkt.next;
+		m_next = m->next;
 		rte_pktmbuf_free_seg(m);
 		m = m_next;
 	}
@@ -648,21 +646,21 @@ static inline struct rte_mbuf *rte_pktmbuf_clone(struct rte_mbuf *md,
 		return (NULL);
 
 	mi = mc;
-	prev = &mi->pkt.next;
-	pktlen = md->pkt.pkt_len;
+	prev = &mi->next;
+	pktlen = md->pkt_len;
 	nseg = 0;
 
 	do {
 		nseg++;
 		rte_pktmbuf_attach(mi, md);
 		*prev = mi;
-		prev = &mi->pkt.next;
-	} while ((md = md->pkt.next) != NULL &&
+		prev = &mi->next;
+	} while ((md = md->next) != NULL &&
 	    (mi = rte_pktmbuf_alloc(mp)) != NULL);
 
 	*prev = NULL;
-	mc->pkt.nb_segs = nseg;
-	mc->pkt.pkt_len = pktlen;
+	mc->nb_segs = nseg;
+	mc->pkt_len = pktlen;
 
 	/* Allocation of new indirect segment failed */
 	if (unlikely (mi == NULL)) {
@@ -691,7 +689,7 @@ static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 
 	do {
 		rte_mbuf_refcnt_update(m, v);
-	} while ((m = m->pkt.next) != NULL);
+	} while ((m = m->next) != NULL);
 }
 
 #endif /* RTE_MBUF_REFCNT */
@@ -707,7 +705,7 @@ static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 static inline uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
-	return (uint16_t) ((char*) m->pkt.data - (char*) m->buf_addr);
+	return (uint16_t) ((char*) m->data - (char*) m->buf_addr);
 }
 
 /**
@@ -722,7 +720,7 @@ static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
 	return (uint16_t)(m->buf_len - rte_pktmbuf_headroom(m) -
-			  m->pkt.data_len);
+			  m->data_len);
 }
 
 /**
@@ -738,8 +736,8 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 	struct rte_mbuf *m2 = (struct rte_mbuf *)m;
 
 	__rte_mbuf_sanity_check(m, 1);
-	while (m2->pkt.next != NULL)
-		m2 = m2->pkt.next;
+	while (m2->next != NULL)
+		m2 = m2->next;
 	return m2;
 }
 
@@ -755,7 +753,7 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
  * @param t
  *   The type to cast the result into.
  */
-#define rte_pktmbuf_mtod(m, t) ((t)((m)->pkt.data))
+#define rte_pktmbuf_mtod(m, t) ((t)((m)->data))
 
 /**
  * A macro that returns the length of the packet.
@@ -765,7 +763,7 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
  * @param m
  *   The packet mbuf.
  */
-#define rte_pktmbuf_pkt_len(m) ((m)->pkt.pkt_len)
+#define rte_pktmbuf_pkt_len(m) ((m)->pkt_len)
 
 /**
  * A macro that returns the length of the segment.
@@ -775,7 +773,7 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
  * @param m
  *   The packet mbuf.
  */
-#define rte_pktmbuf_data_len(m) ((m)->pkt.data_len)
+#define rte_pktmbuf_data_len(m) ((m)->data_len)
 
 /**
  * Prepend len bytes to an mbuf data area.
@@ -800,11 +798,11 @@ static inline char *rte_pktmbuf_prepend(struct rte_mbuf *m,
 	if (unlikely(len > rte_pktmbuf_headroom(m)))
 		return NULL;
 
-	m->pkt.data = (char*) m->pkt.data - len;
-	m->pkt.data_len = (uint16_t)(m->pkt.data_len + len);
-	m->pkt.pkt_len  = (m->pkt.pkt_len + len);
+	m->data = (char*) m->data - len;
+	m->data_len = (uint16_t)(m->data_len + len);
+	m->pkt_len  = (m->pkt_len + len);
 
-	return (char*) m->pkt.data;
+	return (char*) m->data;
 }
 
 /**
@@ -833,9 +831,9 @@ static inline char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
 	if (unlikely(len > rte_pktmbuf_tailroom(m_last)))
 		return NULL;
 
-	tail = (char*) m_last->pkt.data + m_last->pkt.data_len;
-	m_last->pkt.data_len = (uint16_t)(m_last->pkt.data_len + len);
-	m->pkt.pkt_len  = (m->pkt.pkt_len + len);
+	tail = (char*) m_last->data + m_last->data_len;
+	m_last->data_len = (uint16_t)(m_last->data_len + len);
+	m->pkt_len  = (m->pkt_len + len);
 	return (char*) tail;
 }
 
@@ -857,13 +855,13 @@ static inline char *rte_pktmbuf_adj(struct rte_mbuf *m, uint16_t len)
 {
 	__rte_mbuf_sanity_check(m, 1);
 
-	if (unlikely(len > m->pkt.data_len))
+	if (unlikely(len > m->data_len))
 		return NULL;
 
-	m->pkt.data_len = (uint16_t)(m->pkt.data_len - len);
-	m->pkt.data = ((char*) m->pkt.data + len);
-	m->pkt.pkt_len  = (m->pkt.pkt_len - len);
-	return (char*) m->pkt.data;
+	m->data_len = (uint16_t)(m->data_len - len);
+	m->data = ((char*) m->data + len);
+	m->pkt_len  = (m->pkt_len - len);
+	return (char*) m->data;
 }
 
 /**
@@ -887,11 +885,11 @@ static inline int rte_pktmbuf_trim(struct rte_mbuf *m, uint16_t len)
 	__rte_mbuf_sanity_check(m, 1);
 
 	m_last = rte_pktmbuf_lastseg(m);
-	if (unlikely(len > m_last->pkt.data_len))
+	if (unlikely(len > m_last->data_len))
 		return -1;
 
-	m_last->pkt.data_len = (uint16_t)(m_last->pkt.data_len - len);
-	m->pkt.pkt_len  = (m->pkt.pkt_len - len);
+	m_last->data_len = (uint16_t)(m_last->data_len - len);
+	m->pkt_len  = (m->pkt_len - len);
 	return 0;
 }
 
@@ -907,7 +905,7 @@ static inline int rte_pktmbuf_trim(struct rte_mbuf *m, uint16_t len)
 static inline int rte_pktmbuf_is_contiguous(const struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
-	return !!(m->pkt.nb_segs == 1);
+	return !!(m->nb_segs == 1);
 }
 
 /**
diff --git a/lib/librte_pmd_e1000/em_rxtx.c b/lib/librte_pmd_e1000/em_rxtx.c
index 31f480a..b9e66eb 100644
--- a/lib/librte_pmd_e1000/em_rxtx.c
+++ b/lib/librte_pmd_e1000/em_rxtx.c
@@ -91,7 +91,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb)             \
 	(uint64_t) ((mb)->buf_physaddr +       \
-	(uint64_t) ((char *)((mb)->pkt.data) - (char *)(mb)->buf_addr))
+	(uint64_t) ((char *)((mb)->data) - (char *)(mb)->buf_addr))
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
@@ -421,7 +421,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		tx_ol_req = (uint16_t)(ol_flags & (PKT_TX_IP_CKSUM |
 							PKT_TX_L4_MASK));
 		if (tx_ol_req) {
-			hdrlen = tx_pkt->pkt.vlan_macip;
+			hdrlen = tx_pkt->vlan_macip;
 			/* If new context to be built or reuse the exist ctx. */
 			ctx = what_ctx_update(txq, tx_ol_req, hdrlen);
 
@@ -434,7 +434,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		 * This will always be the number of segments + the number of
 		 * Context descriptors required to transmit the packet
 		 */
-		nb_used = (uint16_t)(tx_pkt->pkt.nb_segs + new_ctx);
+		nb_used = (uint16_t)(tx_pkt->nb_segs + new_ctx);
 
 		/*
 		 * The number of descriptors that must be allocated for a
@@ -454,7 +454,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 			" tx_first=%u tx_last=%u\n",
 			(unsigned) txq->port_id,
 			(unsigned) txq->queue_id,
-			(unsigned) tx_pkt->pkt.pkt_len,
+			(unsigned) tx_pkt->pkt_len,
 			(unsigned) tx_id,
 			(unsigned) tx_last);
 
@@ -516,7 +516,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		/* Set VLAN Tag offload fields. */
 		if (ol_flags & PKT_TX_VLAN_PKT) {
 			cmd_type_len |= E1000_TXD_CMD_VLE;
-			popts_spec = tx_pkt->pkt.vlan_macip.f.vlan_tci <<
+			popts_spec = tx_pkt->vlan_macip.f.vlan_tci <<
 				E1000_TXD_VLAN_SHIFT;
 		}
 
@@ -566,7 +566,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 			/*
 			 * Set up Transmit Data Descriptor.
 			 */
-			slen = m_seg->pkt.data_len;
+			slen = m_seg->data_len;
 			buf_dma_addr = RTE_MBUF_DATA_DMA_ADDR(m_seg);
 
 			txd->buffer_addr = rte_cpu_to_le_64(buf_dma_addr);
@@ -576,7 +576,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 			txe->last_id = tx_last;
 			tx_id = txe->next_id;
 			txe = txn;
-			m_seg = m_seg->pkt.next;
+			m_seg = m_seg->next;
 		} while (m_seg != NULL);
 
 		/*
@@ -771,20 +771,20 @@ eth_em_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		pkt_len = (uint16_t) (rte_le_to_cpu_16(rxd.length) -
 				rxq->crc_len);
-		rxm->pkt.data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rte_packet_prefetch(rxm->pkt.data);
-		rxm->pkt.nb_segs = 1;
-		rxm->pkt.next = NULL;
-		rxm->pkt.pkt_len = pkt_len;
-		rxm->pkt.data_len = pkt_len;
-		rxm->pkt.in_port = rxq->port_id;
+		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rte_packet_prefetch(rxm->data);
+		rxm->nb_segs = 1;
+		rxm->next = NULL;
+		rxm->pkt_len = pkt_len;
+		rxm->data_len = pkt_len;
+		rxm->in_port = rxq->port_id;
 
 		rxm->ol_flags = rx_desc_status_to_pkt_flags(status);
 		rxm->ol_flags = (uint16_t)(rxm->ol_flags |
 				rx_desc_error_to_pkt_flags(rxd.errors));
 
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
-		rxm->pkt.vlan_macip.f.vlan_tci = rte_le_to_cpu_16(rxd.special);
+		rxm->vlan_macip.f.vlan_tci = rte_le_to_cpu_16(rxd.special);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
@@ -940,8 +940,8 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 * Set data length & data buffer address of mbuf.
 		 */
 		data_len = rte_le_to_cpu_16(rxd.length);
-		rxm->pkt.data_len = data_len;
-		rxm->pkt.data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_len = data_len;
+		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
 
 		/*
 		 * If this is the first buffer of the received packet,
@@ -953,12 +953,12 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		if (first_seg == NULL) {
 			first_seg = rxm;
-			first_seg->pkt.pkt_len = data_len;
-			first_seg->pkt.nb_segs = 1;
+			first_seg->pkt_len = data_len;
+			first_seg->nb_segs = 1;
 		} else {
-			first_seg->pkt.pkt_len += data_len;
-			first_seg->pkt.nb_segs++;
-			last_seg->pkt.next = rxm;
+			first_seg->pkt_len += data_len;
+			first_seg->nb_segs++;
+			last_seg->next = rxm;
 		}
 
 		/*
@@ -981,18 +981,18 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 *     mbuf, subtract the length of that CRC part from the
 		 *     data length of the previous mbuf.
 		 */
-		rxm->pkt.next = NULL;
+		rxm->next = NULL;
 		if (unlikely(rxq->crc_len > 0)) {
-			first_seg->pkt.pkt_len -= ETHER_CRC_LEN;
+			first_seg->pkt_len -= ETHER_CRC_LEN;
 			if (data_len <= ETHER_CRC_LEN) {
 				rte_pktmbuf_free_seg(rxm);
-				first_seg->pkt.nb_segs--;
-				last_seg->pkt.data_len = (uint16_t)
-					(last_seg->pkt.data_len -
+				first_seg->nb_segs--;
+				last_seg->data_len = (uint16_t)
+					(last_seg->data_len -
 					 (ETHER_CRC_LEN - data_len));
-				last_seg->pkt.next = NULL;
+				last_seg->next = NULL;
 			} else
-				rxm->pkt.data_len =
+				rxm->data_len =
 					(uint16_t) (data_len - ETHER_CRC_LEN);
 		}
 
@@ -1003,17 +1003,17 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 *      - IP checksum flag,
 		 *      - error flags.
 		 */
-		first_seg->pkt.in_port = rxq->port_id;
+		first_seg->in_port = rxq->port_id;
 
 		first_seg->ol_flags = rx_desc_status_to_pkt_flags(status);
 		first_seg->ol_flags = (uint16_t)(first_seg->ol_flags |
 					rx_desc_error_to_pkt_flags(rxd.errors));
 
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
-		rxm->pkt.vlan_macip.f.vlan_tci = rte_le_to_cpu_16(rxd.special);
+		rxm->vlan_macip.f.vlan_tci = rte_le_to_cpu_16(rxd.special);
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_packet_prefetch(first_seg->pkt.data);
+		rte_packet_prefetch(first_seg->data);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
index 62ff7bc..da33171 100644
--- a/lib/librte_pmd_e1000/igb_rxtx.c
+++ b/lib/librte_pmd_e1000/igb_rxtx.c
@@ -85,7 +85,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
 	(uint64_t) ((mb)->buf_physaddr +		   \
-			(uint64_t) ((char *)((mb)->pkt.data) -     \
+			(uint64_t) ((char *)((mb)->data) -     \
 				(char *)(mb)->buf_addr))
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
@@ -354,7 +354,7 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 
 	for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
 		tx_pkt = *tx_pkts++;
-		pkt_len = tx_pkt->pkt.pkt_len;
+		pkt_len = tx_pkt->pkt_len;
 
 		RTE_MBUF_PREFETCH_TO_FREE(txe->mbuf);
 
@@ -366,10 +366,10 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		 * for the packet, starting from the current position (tx_id)
 		 * in the ring.
 		 */
-		tx_last = (uint16_t) (tx_id + tx_pkt->pkt.nb_segs - 1);
+		tx_last = (uint16_t) (tx_id + tx_pkt->nb_segs - 1);
 
 		ol_flags = tx_pkt->ol_flags;
-		vlan_macip_lens = tx_pkt->pkt.vlan_macip.data;
+		vlan_macip_lens = tx_pkt->vlan_macip.data;
 		tx_ol_req = (uint16_t)(ol_flags & PKT_TX_OFFLOAD_MASK);
 
 		/* If a Context Descriptor need be built . */
@@ -516,7 +516,7 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 			/*
 			 * Set up transmit descriptor.
 			 */
-			slen = (uint16_t) m_seg->pkt.data_len;
+			slen = (uint16_t) m_seg->data_len;
 			buf_dma_addr = RTE_MBUF_DATA_DMA_ADDR(m_seg);
 			txd->read.buffer_addr =
 				rte_cpu_to_le_64(buf_dma_addr);
@@ -527,7 +527,7 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 			txe->last_id = tx_last;
 			tx_id = txe->next_id;
 			txe = txn;
-			m_seg = m_seg->pkt.next;
+			m_seg = m_seg->next;
 		} while (m_seg != NULL);
 
 		/*
@@ -742,18 +742,18 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		pkt_len = (uint16_t) (rte_le_to_cpu_16(rxd.wb.upper.length) -
 				      rxq->crc_len);
-		rxm->pkt.data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rte_packet_prefetch(rxm->pkt.data);
-		rxm->pkt.nb_segs = 1;
-		rxm->pkt.next = NULL;
-		rxm->pkt.pkt_len = pkt_len;
-		rxm->pkt.data_len = pkt_len;
-		rxm->pkt.in_port = rxq->port_id;
-
-		rxm->pkt.hash.rss = rxd.wb.lower.hi_dword.rss;
+		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rte_packet_prefetch(rxm->data);
+		rxm->nb_segs = 1;
+		rxm->next = NULL;
+		rxm->pkt_len = pkt_len;
+		rxm->data_len = pkt_len;
+		rxm->in_port = rxq->port_id;
+
+		rxm->hash.rss = rxd.wb.lower.hi_dword.rss;
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
-		rxm->pkt.vlan_macip.f.vlan_tci =
+		rxm->vlan_macip.f.vlan_tci =
 			rte_le_to_cpu_16(rxd.wb.upper.vlan);
 
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
@@ -918,8 +918,8 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 * Set data length & data buffer address of mbuf.
 		 */
 		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
-		rxm->pkt.data_len = data_len;
-		rxm->pkt.data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_len = data_len;
+		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
 
 		/*
 		 * If this is the first buffer of the received packet,
@@ -931,12 +931,12 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		if (first_seg == NULL) {
 			first_seg = rxm;
-			first_seg->pkt.pkt_len = data_len;
-			first_seg->pkt.nb_segs = 1;
+			first_seg->pkt_len = data_len;
+			first_seg->nb_segs = 1;
 		} else {
-			first_seg->pkt.pkt_len += data_len;
-			first_seg->pkt.nb_segs++;
-			last_seg->pkt.next = rxm;
+			first_seg->pkt_len += data_len;
+			first_seg->nb_segs++;
+			last_seg->next = rxm;
 		}
 
 		/*
@@ -959,18 +959,18 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 *     mbuf, subtract the length of that CRC part from the
 		 *     data length of the previous mbuf.
 		 */
-		rxm->pkt.next = NULL;
+		rxm->next = NULL;
 		if (unlikely(rxq->crc_len > 0)) {
-			first_seg->pkt.pkt_len -= ETHER_CRC_LEN;
+			first_seg->pkt_len -= ETHER_CRC_LEN;
 			if (data_len <= ETHER_CRC_LEN) {
 				rte_pktmbuf_free_seg(rxm);
-				first_seg->pkt.nb_segs--;
-				last_seg->pkt.data_len = (uint16_t)
-					(last_seg->pkt.data_len -
+				first_seg->nb_segs--;
+				last_seg->data_len = (uint16_t)
+					(last_seg->data_len -
 					 (ETHER_CRC_LEN - data_len));
-				last_seg->pkt.next = NULL;
+				last_seg->next = NULL;
 			} else
-				rxm->pkt.data_len =
+				rxm->data_len =
 					(uint16_t) (data_len - ETHER_CRC_LEN);
 		}
 
@@ -983,14 +983,14 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 *      - VLAN TCI, if any,
 		 *      - error flags.
 		 */
-		first_seg->pkt.in_port = rxq->port_id;
-		first_seg->pkt.hash.rss = rxd.wb.lower.hi_dword.rss;
+		first_seg->in_port = rxq->port_id;
+		first_seg->hash.rss = rxd.wb.lower.hi_dword.rss;
 
 		/*
 		 * The vlan_tci field is only valid when PKT_RX_VLAN_PKT is
 		 * set in the pkt_flags field.
 		 */
-		first_seg->pkt.vlan_macip.f.vlan_tci =
+		first_seg->vlan_macip.f.vlan_tci =
 			rte_le_to_cpu_16(rxd.wb.upper.vlan);
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
@@ -1001,7 +1001,7 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		first_seg->ol_flags = pkt_flags;
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_packet_prefetch(first_seg->pkt.data);
+		rte_packet_prefetch(first_seg->data);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 76448ab..6cb1640 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -170,7 +170,7 @@ tx4(volatile union ixgbe_adv_tx_desc *txdp, struct rte_mbuf **pkts)
 
 	for (i = 0; i < 4; ++i, ++txdp, ++pkts) {
 		buf_dma_addr = RTE_MBUF_DATA_DMA_ADDR(*pkts);
-		pkt_len = (*pkts)->pkt.data_len;
+		pkt_len = (*pkts)->data_len;
 
 		/* write data to descriptor */
 		txdp->read.buffer_addr = buf_dma_addr;
@@ -189,7 +189,7 @@ tx1(volatile union ixgbe_adv_tx_desc *txdp, struct rte_mbuf **pkts)
 	uint32_t pkt_len;
 
 	buf_dma_addr = RTE_MBUF_DATA_DMA_ADDR(*pkts);
-	pkt_len = (*pkts)->pkt.data_len;
+	pkt_len = (*pkts)->data_len;
 
 	/* write data to descriptor */
 	txdp->read.buffer_addr = buf_dma_addr;
@@ -562,7 +562,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 	for (nb_tx = 0; nb_tx < nb_pkts; nb_tx++) {
 		new_ctx = 0;
 		tx_pkt = *tx_pkts++;
-		pkt_len = tx_pkt->pkt.pkt_len;
+		pkt_len = tx_pkt->pkt_len;
 
 		RTE_MBUF_PREFETCH_TO_FREE(txe->mbuf);
 
@@ -571,7 +571,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		 * are needed for offload functionality.
 		 */
 		ol_flags = tx_pkt->ol_flags;
-		vlan_macip_lens = tx_pkt->pkt.vlan_macip.data;
+		vlan_macip_lens = tx_pkt->vlan_macip.data;
 
 		/* If hardware offload required */
 		tx_ol_req = (uint16_t)(ol_flags & PKT_TX_OFFLOAD_MASK);
@@ -589,7 +589,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		 * This will always be the number of segments + the number of
 		 * Context descriptors required to transmit the packet
 		 */
-		nb_used = (uint16_t)(tx_pkt->pkt.nb_segs + new_ctx);
+		nb_used = (uint16_t)(tx_pkt->nb_segs + new_ctx);
 
 		/*
 		 * The number of descriptors that must be allocated for a
@@ -749,7 +749,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 			/*
 			 * Set up Transmit Data Descriptor.
 			 */
-			slen = m_seg->pkt.data_len;
+			slen = m_seg->data_len;
 			buf_dma_addr = RTE_MBUF_DATA_DMA_ADDR(m_seg);
 			txd->read.buffer_addr =
 				rte_cpu_to_le_64(buf_dma_addr);
@@ -760,7 +760,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 			txe->last_id = tx_last;
 			tx_id = txe->next_id;
 			txe = txn;
-			m_seg = m_seg->pkt.next;
+			m_seg = m_seg->next;
 		} while (m_seg != NULL);
 
 		/*
@@ -929,10 +929,10 @@ ixgbe_rx_scan_hw_ring(struct igb_rx_queue *rxq)
 			mb = rxep[j].mbuf;
 			pkt_len = (uint16_t)(rxdp[j].wb.upper.length -
 							rxq->crc_len);
-			mb->pkt.data_len = pkt_len;
-			mb->pkt.pkt_len = pkt_len;
-			mb->pkt.vlan_macip.f.vlan_tci = rxdp[j].wb.upper.vlan;
-			mb->pkt.hash.rss = rxdp[j].wb.lower.hi_dword.rss;
+			mb->data_len = pkt_len;
+			mb->pkt_len = pkt_len;
+			mb->vlan_macip.f.vlan_tci = rxdp[j].wb.upper.vlan;
+			mb->hash.rss = rxdp[j].wb.lower.hi_dword.rss;
 
 			/* convert descriptor fields to rte mbuf flags */
 			mb->ol_flags  = rx_desc_hlen_type_rss_to_pkt_flags(
@@ -987,10 +987,10 @@ ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
 		/* populate the static rte mbuf fields */
 		mb = rxep[i].mbuf;
 		rte_mbuf_refcnt_set(mb, 1);
-		mb->pkt.next = NULL;
-		mb->pkt.data = (char *)mb->buf_addr + RTE_PKTMBUF_HEADROOM;
-		mb->pkt.nb_segs = 1;
-		mb->pkt.in_port = rxq->port_id;
+		mb->next = NULL;
+		mb->data = (char *)mb->buf_addr + RTE_PKTMBUF_HEADROOM;
+		mb->nb_segs = 1;
+		mb->in_port = rxq->port_id;
 
 		/* populate the descriptors */
 		dma_addr = (uint64_t)mb->buf_physaddr + RTE_PKTMBUF_HEADROOM;
@@ -1239,17 +1239,17 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		pkt_len = (uint16_t) (rte_le_to_cpu_16(rxd.wb.upper.length) -
 				      rxq->crc_len);
-		rxm->pkt.data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rte_packet_prefetch(rxm->pkt.data);
-		rxm->pkt.nb_segs = 1;
-		rxm->pkt.next = NULL;
-		rxm->pkt.pkt_len = pkt_len;
-		rxm->pkt.data_len = pkt_len;
-		rxm->pkt.in_port = rxq->port_id;
+		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rte_packet_prefetch(rxm->data);
+		rxm->nb_segs = 1;
+		rxm->next = NULL;
+		rxm->pkt_len = pkt_len;
+		rxm->data_len = pkt_len;
+		rxm->in_port = rxq->port_id;
 
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
-		rxm->pkt.vlan_macip.f.vlan_tci =
+		rxm->vlan_macip.f.vlan_tci =
 			rte_le_to_cpu_16(rxd.wb.upper.vlan);
 
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
@@ -1260,12 +1260,12 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rxm->ol_flags = pkt_flags;
 
 		if (likely(pkt_flags & PKT_RX_RSS_HASH))
-			rxm->pkt.hash.rss = rxd.wb.lower.hi_dword.rss;
+			rxm->hash.rss = rxd.wb.lower.hi_dword.rss;
 		else if (pkt_flags & PKT_RX_FDIR) {
-			rxm->pkt.hash.fdir.hash =
+			rxm->hash.fdir.hash =
 				(uint16_t)((rxd.wb.lower.hi_dword.csum_ip.csum)
 					   & IXGBE_ATR_HASH_MASK);
-			rxm->pkt.hash.fdir.id = rxd.wb.lower.hi_dword.csum_ip.ip_id;
+			rxm->hash.fdir.id = rxd.wb.lower.hi_dword.csum_ip.ip_id;
 		}
 		/*
 		 * Store the mbuf address into the next entry of the array
@@ -1422,8 +1422,8 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 * Set data length & data buffer address of mbuf.
 		 */
 		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
-		rxm->pkt.data_len = data_len;
-		rxm->pkt.data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_len = data_len;
+		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
 
 		/*
 		 * If this is the first buffer of the received packet,
@@ -1435,13 +1435,13 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		if (first_seg == NULL) {
 			first_seg = rxm;
-			first_seg->pkt.pkt_len = data_len;
-			first_seg->pkt.nb_segs = 1;
+			first_seg->pkt_len = data_len;
+			first_seg->nb_segs = 1;
 		} else {
-			first_seg->pkt.pkt_len = (uint16_t)(first_seg->pkt.pkt_len
+			first_seg->pkt_len = (uint16_t)(first_seg->pkt_len
 					+ data_len);
-			first_seg->pkt.nb_segs++;
-			last_seg->pkt.next = rxm;
+			first_seg->nb_segs++;
+			last_seg->next = rxm;
 		}
 
 		/*
@@ -1464,18 +1464,18 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 *     mbuf, subtract the length of that CRC part from the
 		 *     data length of the previous mbuf.
 		 */
-		rxm->pkt.next = NULL;
+		rxm->next = NULL;
 		if (unlikely(rxq->crc_len > 0)) {
-			first_seg->pkt.pkt_len -= ETHER_CRC_LEN;
+			first_seg->pkt_len -= ETHER_CRC_LEN;
 			if (data_len <= ETHER_CRC_LEN) {
 				rte_pktmbuf_free_seg(rxm);
-				first_seg->pkt.nb_segs--;
-				last_seg->pkt.data_len = (uint16_t)
-					(last_seg->pkt.data_len -
+				first_seg->nb_segs--;
+				last_seg->data_len = (uint16_t)
+					(last_seg->data_len -
 					 (ETHER_CRC_LEN - data_len));
-				last_seg->pkt.next = NULL;
+				last_seg->next = NULL;
 			} else
-				rxm->pkt.data_len =
+				rxm->data_len =
 					(uint16_t) (data_len - ETHER_CRC_LEN);
 		}
 
@@ -1488,13 +1488,13 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 *      - VLAN TCI, if any,
 		 *      - error flags.
 		 */
-		first_seg->pkt.in_port = rxq->port_id;
+		first_seg->in_port = rxq->port_id;
 
 		/*
 		 * The vlan_tci field is only valid when PKT_RX_VLAN_PKT is
 		 * set in the pkt_flags field.
 		 */
-		first_seg->pkt.vlan_macip.f.vlan_tci =
+		first_seg->vlan_macip.f.vlan_tci =
 				rte_le_to_cpu_16(rxd.wb.upper.vlan);
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
@@ -1505,17 +1505,17 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		first_seg->ol_flags = pkt_flags;
 
 		if (likely(pkt_flags & PKT_RX_RSS_HASH))
-			first_seg->pkt.hash.rss = rxd.wb.lower.hi_dword.rss;
+			first_seg->hash.rss = rxd.wb.lower.hi_dword.rss;
 		else if (pkt_flags & PKT_RX_FDIR) {
-			first_seg->pkt.hash.fdir.hash =
+			first_seg->hash.fdir.hash =
 				(uint16_t)((rxd.wb.lower.hi_dword.csum_ip.csum)
 					   & IXGBE_ATR_HASH_MASK);
-			first_seg->pkt.hash.fdir.id =
+			first_seg->hash.fdir.id =
 				rxd.wb.lower.hi_dword.csum_ip.ip_id;
 		}
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_packet_prefetch(first_seg->pkt.data);
+		rte_packet_prefetch(first_seg->data);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
@@ -3083,10 +3083,10 @@ ixgbe_alloc_rx_queue_mbufs(struct igb_rx_queue *rxq)
 		}
 
 		rte_mbuf_refcnt_set(mbuf, 1);
-		mbuf->pkt.next = NULL;
-		mbuf->pkt.data = (char *)mbuf->buf_addr + RTE_PKTMBUF_HEADROOM;
-		mbuf->pkt.nb_segs = 1;
-		mbuf->pkt.in_port = rxq->port_id;
+		mbuf->next = NULL;
+		mbuf->data = (char *)mbuf->buf_addr + RTE_PKTMBUF_HEADROOM;
+		mbuf->nb_segs = 1;
+		mbuf->in_port = rxq->port_id;
 
 		dma_addr =
 			rte_cpu_to_le_64(RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mbuf));
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index 446eeb7..e32a417 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -45,7 +45,7 @@
 #endif
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->pkt.data) - \
+	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->data) - \
 	(char *)(mb)->buf_addr))
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
diff --git a/lib/librte_pmd_pcap/rte_eth_pcap.c b/lib/librte_pmd_pcap/rte_eth_pcap.c
index 680dfdc..aa80478 100644
--- a/lib/librte_pmd_pcap/rte_eth_pcap.c
+++ b/lib/librte_pmd_pcap/rte_eth_pcap.c
@@ -151,9 +151,9 @@ eth_pcap_rx(void *queue,
 
 		if (header.len <= buf_size) {
 			/* pcap packet will fit in the mbuf, go ahead and copy */
-			rte_memcpy(mbuf->pkt.data, packet, header.len);
-			mbuf->pkt.data_len = (uint16_t)header.len;
-			mbuf->pkt.pkt_len = mbuf->pkt.data_len;
+			rte_memcpy(mbuf->data, packet, header.len);
+			mbuf->data_len = (uint16_t)header.len;
+			mbuf->pkt_len = mbuf->data_len;
 			bufs[i] = mbuf;
 			num_rx++;
 		} else {
@@ -200,9 +200,9 @@ eth_pcap_tx_dumper(void *queue,
 	for (i = 0; i < nb_pkts; i++) {
 		mbuf = bufs[i];
 		calculate_timestamp(&header.ts);
-		header.len = mbuf->pkt.data_len;
+		header.len = mbuf->data_len;
 		header.caplen = header.len;
-		pcap_dump((u_char*) dumper_q->dumper, &header, mbuf->pkt.data);
+		pcap_dump((u_char*) dumper_q->dumper, &header, mbuf->data);
 		rte_pktmbuf_free(mbuf);
 		num_tx++;
 	}
@@ -237,8 +237,8 @@ eth_pcap_tx(void *queue,
 
 	for (i = 0; i < nb_pkts; i++) {
 		mbuf = bufs[i];
-		ret = pcap_sendpacket(tx_queue->pcap, (u_char*) mbuf->pkt.data,
-				mbuf->pkt.data_len);
+		ret = pcap_sendpacket(tx_queue->pcap, (u_char*) mbuf->data,
+				mbuf->data_len);
 		if(likely(!ret))
 			num_tx++;
 		rte_pktmbuf_free(mbuf);
diff --git a/lib/librte_pmd_virtio/virtio_rxtx.c b/lib/librte_pmd_virtio/virtio_rxtx.c
index 0db3ba0..7deaeb6 100644
--- a/lib/librte_pmd_virtio/virtio_rxtx.c
+++ b/lib/librte_pmd_virtio/virtio_rxtx.c
@@ -267,13 +267,13 @@ virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 			hw->eth_stats.ierrors++;
 			continue;
 		}
-		rxm->pkt.in_port = rxvq->port_id;
-		rxm->pkt.data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rxm->pkt.nb_segs = 1;
-		rxm->pkt.next = NULL;
-		rxm->pkt.pkt_len  = (uint32_t)(len[i] - sizeof(struct virtio_net_hdr));
-		rxm->pkt.data_len = (uint16_t)(len[i] - sizeof(struct virtio_net_hdr));
-		VIRTIO_DUMP_PACKET(rxm, rxm->pkt.data_len);
+		rxm->in_port = rxvq->port_id;
+		rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->nb_segs = 1;
+		rxm->next = NULL;
+		rxm->pkt_len  = (uint32_t)(len[i] - sizeof(struct virtio_net_hdr));
+		rxm->data_len = (uint16_t)(len[i] - sizeof(struct virtio_net_hdr));
+		VIRTIO_DUMP_PACKET(rxm, rxm->data_len);
 		rx_pkts[nb_rx++] = rxm;
 		hw->eth_stats.ibytes += len[i] - sizeof(struct virtio_net_hdr);
 	}
@@ -346,7 +346,7 @@ virtio_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 				break;
 			}
 	 		nb_tx++;
-			hw->eth_stats.obytes += txm->pkt.data_len;
+			hw->eth_stats.obytes += txm->data_len;
 		} else {
 			PMD_TX_LOG(ERR, "No free tx descriptors to transmit\n");
 			break;
diff --git a/lib/librte_pmd_virtio/virtqueue.h b/lib/librte_pmd_virtio/virtqueue.h
index b67c223..210944e 100644
--- a/lib/librte_pmd_virtio/virtqueue.h
+++ b/lib/librte_pmd_virtio/virtqueue.h
@@ -59,7 +59,7 @@
 #define VIRTQUEUE_MAX_NAME_SZ 32
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->pkt.data) - \
+	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->data) - \
 	(char *)(mb)->buf_addr))
 
 #define VTNET_SQ_RQ_QUEUE_IDX 0
@@ -330,7 +330,7 @@ virtqueue_enqueue_xmit(struct virtqueue *txvq, struct rte_mbuf *cookie)
 	start_dp[idx].flags = VRING_DESC_F_NEXT;
 	idx = start_dp[idx].next;
 	start_dp[idx].addr  = RTE_MBUF_DATA_DMA_ADDR(cookie);
-	start_dp[idx].len   = cookie->pkt.data_len;
+	start_dp[idx].len   = cookie->data_len;
 	start_dp[idx].flags = 0;
 	idx = start_dp[idx].next;
 	txvq->vq_desc_head_idx = idx;
@@ -363,7 +363,7 @@ virtqueue_dequeue_burst_rx(struct virtqueue *vq, struct rte_mbuf **rx_pkts, uint
 			break;
 		}
 		rte_prefetch0(cookie);
-		rte_packet_prefetch(cookie->pkt.data);
+		rte_packet_prefetch(cookie->data);
 		rx_pkts[i]  = cookie;
 		vq->vq_used_cons_idx++;
 		vq_ring_free_chain(vq, desc_idx);
diff --git a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
index d91404a..60f26ba 100644
--- a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
+++ b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
@@ -80,7 +80,7 @@
 
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->pkt.data) - \
+	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->data) - \
 	(char *)(mb)->buf_addr))
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
@@ -276,7 +276,7 @@ vmxnet3_xmit_pkts( void *tx_queue, struct rte_mbuf **tx_pkts,
 
 			txm = tx_pkts[nb_tx];
 			/* Don't support scatter packets yet, free them if met */
-			if (txm->pkt.nb_segs != 1) {
+			if (txm->nb_segs != 1) {
 				PMD_TX_LOG(DEBUG, "Don't support scatter packets yet, drop!\n");
 				rte_pktmbuf_free(tx_pkts[nb_tx]);
 				txq->stats.drop_total++;
@@ -286,7 +286,7 @@ vmxnet3_xmit_pkts( void *tx_queue, struct rte_mbuf **tx_pkts,
 			}
 
 			/* Needs to minus ether header len */
-			if(txm->pkt.data_len > (hw->cur_mtu + ETHER_HDR_LEN)) {
+			if(txm->data_len > (hw->cur_mtu + ETHER_HDR_LEN)) {
 				PMD_TX_LOG(DEBUG, "Packet data_len higher than MTU\n");
 				rte_pktmbuf_free(tx_pkts[nb_tx]);
 				txq->stats.drop_total++;
@@ -301,7 +301,7 @@ vmxnet3_xmit_pkts( void *tx_queue, struct rte_mbuf **tx_pkts,
 			tbi = txq->cmd_ring.buf_info + txq->cmd_ring.next2fill;
 			tbi->bufPA = RTE_MBUF_DATA_DMA_ADDR(txm);
 			txd->addr = tbi->bufPA;
-			txd->len = txm->pkt.data_len;
+			txd->len = txm->data_len;
 
 			/* Mark the last descriptor as End of Packet. */
 			txd->cq = 1;
@@ -537,21 +537,21 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 					rte_pktmbuf_mtod(rxm, void *));
 #endif
 				//Copy vlan tag in packet buffer
-				rxm->pkt.vlan_macip.f.vlan_tci =
+				rxm->vlan_macip.f.vlan_tci =
 					rte_le_to_cpu_16((uint16_t)rcd->tci);
 
 			} else
 				rxm->ol_flags = 0;
 
 			/* Initialize newly received packet buffer */
-			rxm->pkt.in_port = rxq->port_id;
-			rxm->pkt.nb_segs = 1;
-			rxm->pkt.next = NULL;
-			rxm->pkt.pkt_len = (uint16_t)rcd->len;
-			rxm->pkt.data_len = (uint16_t)rcd->len;
-			rxm->pkt.in_port = rxq->port_id;
-			rxm->pkt.vlan_macip.f.vlan_tci = 0;
-			rxm->pkt.data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+			rxm->in_port = rxq->port_id;
+			rxm->nb_segs = 1;
+			rxm->next = NULL;
+			rxm->pkt_len = (uint16_t)rcd->len;
+			rxm->data_len = (uint16_t)rcd->len;
+			rxm->in_port = rxq->port_id;
+			rxm->vlan_macip.f.vlan_tci = 0;
+			rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
 
 			rx_pkts[nb_rx++] = rxm;
 
diff --git a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
index 5cd1cdb..40108c4 100644
--- a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
+++ b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
@@ -108,12 +108,12 @@ eth_xenvirt_rx(void *q, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 	for (i = 0; i < num ; i ++) {
 		rxm = rx_pkts[i];
 		PMD_RX_LOG(DEBUG, "packet len:%d\n", len[i]);
-		rxm->pkt.next = NULL;
-		rxm->pkt.data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rxm->pkt.data_len = (uint16_t)(len[i] - sizeof(struct virtio_net_hdr));
-		rxm->pkt.nb_segs = 1;
-		rxm->pkt.in_port = pi->port_id;
-		rxm->pkt.pkt_len  = (uint32_t)(len[i] - sizeof(struct virtio_net_hdr));
+		rxm->next = NULL;
+		rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_len = (uint16_t)(len[i] - sizeof(struct virtio_net_hdr));
+		rxm->nb_segs = 1;
+		rxm->in_port = pi->port_id;
+		rxm->pkt_len  = (uint32_t)(len[i] - sizeof(struct virtio_net_hdr));
 	}
 	/* allocate new mbuf for the used descriptor */
 	while (likely(!virtqueue_full(rxvq))) {
diff --git a/lib/librte_pmd_xenvirt/virtqueue.h b/lib/librte_pmd_xenvirt/virtqueue.h
index 3844448..f36030f 100644
--- a/lib/librte_pmd_xenvirt/virtqueue.h
+++ b/lib/librte_pmd_xenvirt/virtqueue.h
@@ -54,7 +54,7 @@
  * rather than gpa<->hva in virito spec.
  */
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	((uint64_t)((mb)->pkt.data))
+	((uint64_t)((mb)->data))
 
 enum { VTNET_RQ = 0, VTNET_TQ = 1, VTNET_CQ = 2 };
 
@@ -238,7 +238,7 @@ virtqueue_enqueue_xmit(struct virtqueue *txvq, struct rte_mbuf *cookie)
 	start_dp[idx].addr  = (uintptr_t)NULL;
 	idx = start_dp[idx].next;
 	start_dp[idx].addr  = RTE_MBUF_DATA_DMA_ADDR(cookie);
-	start_dp[idx].len   = cookie->pkt.data_len;
+	start_dp[idx].len   = cookie->data_len;
 	start_dp[idx].flags = 0;
 	idx = start_dp[idx].next;
 	txvq->vq_desc_head_idx = idx;
diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c
index 24e8bdf..7f0bf67 100644
--- a/lib/librte_sched/rte_sched.c
+++ b/lib/librte_sched/rte_sched.c
@@ -1032,7 +1032,7 @@ rte_sched_port_update_subport_stats(struct rte_sched_port *port, uint32_t qindex
 {
 	struct rte_sched_subport *s = port->subport + (qindex / rte_sched_port_queues_per_subport(port));
 	uint32_t tc_index = (qindex >> 2) & 0x3;
-	uint32_t pkt_len = pkt->pkt.pkt_len;
+	uint32_t pkt_len = pkt->pkt_len;
 	
 	s->stats.n_pkts_tc[tc_index] += 1;
 	s->stats.n_bytes_tc[tc_index] += pkt_len;
@@ -1043,7 +1043,7 @@ rte_sched_port_update_subport_stats_on_drop(struct rte_sched_port *port, uint32_
 {
 	struct rte_sched_subport *s = port->subport + (qindex / rte_sched_port_queues_per_subport(port));
 	uint32_t tc_index = (qindex >> 2) & 0x3;
-	uint32_t pkt_len = pkt->pkt.pkt_len;
+	uint32_t pkt_len = pkt->pkt_len;
 	
 	s->stats.n_pkts_tc_dropped[tc_index] += 1;
 	s->stats.n_bytes_tc_dropped[tc_index] += pkt_len;
@@ -1053,7 +1053,7 @@ static inline void
 rte_sched_port_update_queue_stats(struct rte_sched_port *port, uint32_t qindex, struct rte_mbuf *pkt)
 {
 	struct rte_sched_queue_extra *qe = port->queue_extra + qindex;
-	uint32_t pkt_len = pkt->pkt.pkt_len;
+	uint32_t pkt_len = pkt->pkt_len;
 	
 	qe->stats.n_pkts += 1;
 	qe->stats.n_bytes += pkt_len;
@@ -1063,7 +1063,7 @@ static inline void
 rte_sched_port_update_queue_stats_on_drop(struct rte_sched_port *port, uint32_t qindex, struct rte_mbuf *pkt)
 {
 	struct rte_sched_queue_extra *qe = port->queue_extra + qindex;
-	uint32_t pkt_len = pkt->pkt.pkt_len;
+	uint32_t pkt_len = pkt->pkt_len;
 	
 	qe->stats.n_pkts_dropped += 1;
 	qe->stats.n_bytes_dropped += pkt_len;
@@ -1580,7 +1580,7 @@ grinder_credits_check(struct rte_sched_port *port, uint32_t pos)
 	struct rte_sched_pipe *pipe = grinder->pipe;
 	struct rte_mbuf *pkt = grinder->pkt;
 	uint32_t tc_index = grinder->tc_index;
-	uint32_t pkt_len = pkt->pkt.pkt_len + port->frame_overhead;
+	uint32_t pkt_len = pkt->pkt_len + port->frame_overhead;
 	uint32_t subport_tb_credits = subport->tb_credits;
 	uint32_t subport_tc_credits = subport->tc_credits[tc_index];
 	uint32_t pipe_tb_credits = pipe->tb_credits;
@@ -1616,7 +1616,7 @@ grinder_credits_check(struct rte_sched_port *port, uint32_t pos)
 	struct rte_sched_pipe *pipe = grinder->pipe;
 	struct rte_mbuf *pkt = grinder->pkt;
 	uint32_t tc_index = grinder->tc_index;
-	uint32_t pkt_len = pkt->pkt.pkt_len + port->frame_overhead;
+	uint32_t pkt_len = pkt->pkt_len + port->frame_overhead;
 	uint32_t subport_tb_credits = subport->tb_credits;
 	uint32_t subport_tc_credits = subport->tc_credits[tc_index];
 	uint32_t pipe_tb_credits = pipe->tb_credits;
@@ -1657,7 +1657,7 @@ grinder_schedule(struct rte_sched_port *port, uint32_t pos)
 	struct rte_sched_grinder *grinder = port->grinder + pos;
 	struct rte_sched_queue *queue = grinder->queue[grinder->qpos];
 	struct rte_mbuf *pkt = grinder->pkt;
-	uint32_t pkt_len = pkt->pkt.pkt_len + port->frame_overhead;
+	uint32_t pkt_len = pkt->pkt_len + port->frame_overhead;
 
 #if RTE_SCHED_TS_CREDITS_CHECK
 	if (!grinder_credits_check(port, pos)) {
diff --git a/lib/librte_sched/rte_sched.h b/lib/librte_sched/rte_sched.h
index 1c4ebc5..5c4c0cd 100644
--- a/lib/librte_sched/rte_sched.h
+++ b/lib/librte_sched/rte_sched.h
@@ -106,7 +106,7 @@ extern "C" {
    2. Start of Frame Delimiter (SFD):       1 byte;
    3. Frame Check Sequence (FCS):           4 bytes;
    4. Inter Frame Gap (IFG):               12 bytes.
-The FCS is considered overhead only if not included in the packet length (field pkt.pkt_len
+The FCS is considered overhead only if not included in the packet length (field pkt_len
 of struct rte_mbuf). */
 #ifndef RTE_SCHED_FRAME_OVERHEAD_DEFAULT
 #define RTE_SCHED_FRAME_OVERHEAD_DEFAULT      24
@@ -196,7 +196,7 @@ struct rte_sched_port_params {
 };
 
 /** Path through the scheduler hierarchy used by the scheduler enqueue operation to
-identify the destination queue for the current packet. Stored in the field pkt.hash.sched
+identify the destination queue for the current packet. Stored in the field hash.sched
 of struct rte_mbuf of each packet, typically written by the classification stage and read by 
 scheduler enqueue.*/
 struct rte_sched_port_hierarchy {
@@ -352,7 +352,7 @@ static inline void
 rte_sched_port_pkt_write(struct rte_mbuf *pkt, 
 	uint32_t subport, uint32_t pipe, uint32_t traffic_class, uint32_t queue, enum rte_meter_color color)
 {
-	struct rte_sched_port_hierarchy *sched = (struct rte_sched_port_hierarchy *) &pkt->pkt.hash.sched;
+	struct rte_sched_port_hierarchy *sched = (struct rte_sched_port_hierarchy *) &pkt->hash.sched;
 	
 	sched->color = (uint32_t) color;
 	sched->subport = subport;
@@ -381,7 +381,7 @@ rte_sched_port_pkt_write(struct rte_mbuf *pkt,
 static inline void
 rte_sched_port_pkt_read_tree_path(struct rte_mbuf *pkt, uint32_t *subport, uint32_t *pipe, uint32_t *traffic_class, uint32_t *queue)
 {
-	struct rte_sched_port_hierarchy *sched = (struct rte_sched_port_hierarchy *) &pkt->pkt.hash.sched;
+	struct rte_sched_port_hierarchy *sched = (struct rte_sched_port_hierarchy *) &pkt->hash.sched;
 	
 	*subport = sched->subport;
 	*pipe = sched->pipe;
@@ -392,7 +392,7 @@ rte_sched_port_pkt_read_tree_path(struct rte_mbuf *pkt, uint32_t *subport, uint3
 static inline enum rte_meter_color
 rte_sched_port_pkt_read_color(struct rte_mbuf *pkt)
 {
-	struct rte_sched_port_hierarchy *sched = (struct rte_sched_port_hierarchy *) &pkt->pkt.hash.sched;
+	struct rte_sched_port_hierarchy *sched = (struct rte_sched_port_hierarchy *) &pkt->hash.sched;
 
 	return (enum rte_meter_color) sched->color;
 }
-- 
1.9.2

* [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield
  2014-05-09 14:50 [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support Olivier Matz
                   ` (3 preceding siblings ...)
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 04/11] mbuf: remove the rte_pktmbuf structure Olivier Matz
@ 2014-05-09 14:50 ` Olivier Matz
  2014-05-09 15:39   ` Shaw, Jeffrey B
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset Olivier Matz
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Olivier Matz @ 2014-05-09 14:50 UTC (permalink / raw)
  To: dev

The physical address is never greater than (1 << 48) = 256 TB, so it
fits in 48 bits. We can save 2 bytes in the mbuf structure by merging
the physical address and the buffer length into a single 64-bit bitfield.
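
As a rough illustration of the packing described above, here is a minimal
sketch. The names addr_len_packed and packed_set are hypothetical and not
part of the patch. It also assumes a compiler that accepts 64-bit
bit-fields (GCC and Clang do; the C standard leaves this
implementation-defined).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two merged rte_mbuf fields. */
struct addr_len_packed {
	uint64_t buf_physaddr:48; /* physical address, at most 48 bits */
	uint64_t buf_len:16;      /* buffer length, at most 64 KB - 1 */
};

/* Both fields are expected to share one 64-bit word (assumption: GCC/Clang layout). */
static_assert(sizeof(struct addr_len_packed) == 8,
	"address and length should share one 64-bit word");

/*
 * Store both values. Returns 0 on success, or -1 if the address does not
 * fit in 48 bits (a plain assignment would silently truncate it).
 */
static int packed_set(struct addr_len_packed *p, uint64_t physaddr,
	uint16_t len)
{
	if (physaddr >> 48)
		return -1;
	p->buf_physaddr = physaddr;
	p->buf_len = len;
	return 0;
}
```

In the real structure the saving shows up as the reserved field growing
from 16 to 32 bits, as the rte_mbuf.h hunk below shows. The explicit range
check is an addition of this sketch only; the patch relies on the 48-bit
assumption holding.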

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 lib/librte_mbuf/rte_mbuf.c | 3 ++-
 lib/librte_mbuf/rte_mbuf.h | 7 ++++---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index c229525..9879095 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -104,7 +104,8 @@ rte_pktmbuf_init(struct rte_mempool *mp,
 	m->buf_len = (uint16_t)buf_len;
 
 	/* keep some headroom between start of buffer and data */
-	m->data = (char*) m->buf_addr + RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
+	m->data = (char*) m->buf_addr + RTE_MIN(RTE_PKTMBUF_HEADROOM,
+		(uint16_t)m->buf_len);
 
 	/* init some constant fields */
 	m->pool = mp;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 803b223..275f6b2 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -130,8 +130,8 @@ union rte_vlan_macip {
 struct rte_mbuf {
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
 	void *buf_addr;           /**< Virtual address of segment buffer. */
-	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
-	uint16_t buf_len;         /**< Length of segment buffer. */
+	uint64_t buf_physaddr:48; /**< Physical address of segment buffer. */
+	uint64_t buf_len:16;      /**< Length of segment buffer. */
 #ifdef RTE_MBUF_REFCNT
 	/**
 	 * 16-bit Reference counter.
@@ -148,8 +148,9 @@ struct rte_mbuf {
 #else
 	uint16_t refcnt_reserved;     /**< Do not use this field */
 #endif
-	uint16_t reserved;             /**< Unused field. Required for padding. */
+
 	uint16_t ol_flags;            /**< Offload features. */
+	uint32_t reserved;             /**< Unused field. Required for padding. */
 
 	/* valid for any segment */
 	struct rte_mbuf *next;  /**< Next segment of scattered packet. */
-- 
1.9.2

* [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
  2014-05-09 14:50 [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support Olivier Matz
                   ` (4 preceding siblings ...)
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield Olivier Matz
@ 2014-05-09 14:50 ` Olivier Matz
  2014-05-12 14:12   ` Thomas Monjalon
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 07/11] mbuf: add functions to get the name of an ol_flag Olivier Matz
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 51+ messages in thread
From: Olivier Matz @ 2014-05-09 14:50 UTC (permalink / raw)
  To: dev

The mbuf structure already contains a pointer to the beginning of the
buffer (m->buf_addr). There is no need to spend another 8 bytes on a
second pointer to the beginning of the data.

A 16-bit unsigned offset from m->buf_addr is enough, as we know that an
mbuf buffer is never longer than 64KB. This modification saves 6 bytes
in the structure.
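
To make the offset scheme concrete, here is a minimal sketch. sketch_mbuf,
SKETCH_HEADROOM, sketch_mtod, sketch_reset and sketch_prepend are
hypothetical stand-ins and not DPDK API. The macro body and the prepend
logic follow the rte_pktmbuf_mtod() and rte_pktmbuf_prepend() changes in
this patch, reduced to the fields involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HEADROOM 128 /* stand-in for RTE_PKTMBUF_HEADROOM */

/* Hypothetical subset of struct rte_mbuf: only the fields used here. */
struct sketch_mbuf {
	void *buf_addr;    /* start of the segment buffer */
	uint16_t buf_len;  /* length of the segment buffer */
	uint16_t data_off; /* offset of the data from buf_addr */
	uint16_t data_len; /* amount of data in the segment */
};

/* Same expression as the new rte_pktmbuf_mtod() macro. */
#define sketch_mtod(m, t) ((t)((char *)(m)->buf_addr + (m)->data_off))

/* Attach a buffer and leave the default headroom before the data. */
static void sketch_reset(struct sketch_mbuf *m, void *buf, uint16_t buf_len)
{
	assert(buf != NULL);
	m->buf_addr = buf;
	m->buf_len = buf_len;
	m->data_off = (SKETCH_HEADROOM <= buf_len) ? SKETCH_HEADROOM : buf_len;
	m->data_len = 0;
}

/* Grow the data area into the headroom; NULL if there is not enough room. */
static char *sketch_prepend(struct sketch_mbuf *m, uint16_t len)
{
	if (len > m->data_off) /* data_off is the current headroom */
		return NULL;
	m->data_off = (uint16_t)(m->data_off - len);
	m->data_len = (uint16_t)(m->data_len + len);
	return sketch_mtod(m, char *);
}
```

The headroom is now simply data_off, so the pointer subtraction that
rte_pktmbuf_headroom() and the RTE_MBUF_DATA_DMA_ADDR() macros used to
perform is no longer needed. The cost is one addition each time the data
pointer is required.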

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test-pmd/csumonly.c               |  2 +-
 app/test-pmd/macfwd-retry.c           |  2 +-
 app/test-pmd/macfwd.c                 |  2 +-
 app/test-pmd/rxonly.c                 |  2 +-
 app/test-pmd/testpmd.c                |  2 +-
 app/test-pmd/txonly.c                 |  7 ++--
 app/test/test_mbuf.c                  |  6 ++--
 examples/exception_path/main.c        |  3 +-
 examples/vhost/main.c                 | 21 +++++++-----
 examples/vhost_xen/main.c             |  2 +-
 lib/librte_mbuf/rte_mbuf.c            |  7 ++--
 lib/librte_mbuf/rte_mbuf.h            | 62 ++++++++++++++++-------------------
 lib/librte_pmd_e1000/em_rxtx.c        | 12 +++----
 lib/librte_pmd_e1000/igb_rxtx.c       | 13 ++++----
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c     | 13 ++++----
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h     |  3 +-
 lib/librte_pmd_virtio/virtio_rxtx.c   |  2 +-
 lib/librte_pmd_virtio/virtqueue.h     |  5 ++-
 lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c |  5 ++-
 19 files changed, 85 insertions(+), 86 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index ee82eb6..3313b87 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -263,7 +263,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 		pkt_ol_flags = mb->ol_flags;
 		ol_flags = (uint16_t) (pkt_ol_flags & (~PKT_TX_L4_MASK));
 
-		eth_hdr = (struct ether_hdr *) mb->data;
+		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		eth_type = rte_be_to_cpu_16(eth_hdr->ether_type);
 		if (eth_type == ETHER_TYPE_VLAN) {
 			/* Only allow single VLAN label here */
diff --git a/app/test-pmd/macfwd-retry.c b/app/test-pmd/macfwd-retry.c
index 687ff8d..7749c9e 100644
--- a/app/test-pmd/macfwd-retry.c
+++ b/app/test-pmd/macfwd-retry.c
@@ -119,7 +119,7 @@ pkt_burst_mac_retry_forward(struct fwd_stream *fs)
 	fs->rx_packets += nb_rx;
 	for (i = 0; i < nb_rx; i++) {
 		mb = pkts_burst[i];
-		eth_hdr = (struct ether_hdr *) mb->data;
+		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		ether_addr_copy(&peer_eth_addrs[fs->peer_addr],
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index 8d7612c..ab74d0c 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -110,7 +110,7 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 	txp = &ports[fs->tx_port];
 	for (i = 0; i < nb_rx; i++) {
 		mb = pkts_burst[i];
-		eth_hdr = (struct ether_hdr *) mb->data;
+		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		ether_addr_copy(&peer_eth_addrs[fs->peer_addr],
 				&eth_hdr->d_addr);
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index b77c8ce..5751b0b 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -149,7 +149,7 @@ pkt_burst_receive(struct fwd_stream *fs)
 			rte_pktmbuf_free(mb);
 			continue;
 		}
-		eth_hdr = (struct ether_hdr *) mb->data;
+		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		eth_type = RTE_BE_TO_CPU_16(eth_hdr->ether_type);
 		ol_flags = mb->ol_flags;
 		print_ether_addr("  src=", &eth_hdr->s_addr);
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 1964020..572c3aa 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -395,7 +395,7 @@ testpmd_mbuf_ctor(struct rte_mempool *mp,
 			mb_ctor_arg->seg_buf_offset);
 	mb->buf_len      = mb_ctor_arg->seg_buf_size;
 	mb->ol_flags     = 0;
-	mb->data         = (char *) mb->buf_addr + RTE_PKTMBUF_HEADROOM;
+	mb->data_off     = RTE_PKTMBUF_HEADROOM;
 	mb->nb_segs      = 1;
 	mb->vlan_macip.data = 0;
 	mb->hash.rss     = 0;
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index 3baa0c8..c28f3dd 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -111,13 +111,13 @@ copy_buf_to_pkt_segs(void* buf, unsigned len, struct rte_mbuf *pkt,
 		seg = seg->next;
 	}
 	copy_len = seg->data_len - offset;
-	seg_buf = ((char *) seg->data + offset);
+	seg_buf = (rte_pktmbuf_mtod(seg, char *) + offset);
 	while (len > copy_len) {
 		rte_memcpy(seg_buf, buf, (size_t) copy_len);
 		len -= copy_len;
 		buf = ((char*) buf + copy_len);
 		seg = seg->next;
-		seg_buf = seg->data;
+		seg_buf = rte_pktmbuf_mtod(seg, char *);
 	}
 	rte_memcpy(seg_buf, buf, (size_t) len);
 }
@@ -126,7 +126,8 @@ static inline void
 copy_buf_to_pkt(void* buf, unsigned len, struct rte_mbuf *pkt, unsigned offset)
 {
 	if (offset + len <= pkt->data_len) {
-		rte_memcpy(((char *) pkt->data + offset), buf, (size_t) len);
+		rte_memcpy((rte_pktmbuf_mtod(pkt, char *) + offset),
+			buf, (size_t) len);
 		return;
 	}
 	copy_buf_to_pkt_segs(buf, len, pkt, offset);
diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index 320d76f..f6ce8ac 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -432,7 +432,7 @@ test_pktmbuf_pool_ptr(void)
 			printf("rte_pktmbuf_alloc() failed (%u)\n", i);
 			ret = -1;
 		}
-		m[i]->data = RTE_PTR_ADD(m[i]->data, 64);
+		m[i]->data_off += 64;
 	}
 
 	/* free them */
@@ -451,8 +451,8 @@ test_pktmbuf_pool_ptr(void)
 			printf("rte_pktmbuf_alloc() failed (%u)\n", i);
 			ret = -1;
 		}
-		if (m[i]->data != RTE_PTR_ADD(m[i]->buf_addr, RTE_PKTMBUF_HEADROOM)) {
-			printf ("data pointer not set properly\n");
+		if (m[i]->data_off != RTE_PKTMBUF_HEADROOM) {
+			printf ("invalid data_off\n");
 			ret = -1;
 		}
 	}
diff --git a/examples/exception_path/main.c b/examples/exception_path/main.c
index d9a85b5..ca65511 100644
--- a/examples/exception_path/main.c
+++ b/examples/exception_path/main.c
@@ -302,7 +302,8 @@ main_loop(__attribute__((unused)) void *arg)
 			if (m == NULL)
 				continue;
 
-			ret = read(tap_fd, m->data, MAX_PACKET_SZ);
+			ret = read(tap_fd, rte_pktmbuf_mtod(m, void *),
+				MAX_PACKET_SZ);
 			lcore_stats[lcore_id].rx++;
 			if (unlikely(ret < 0)) {
 				FATAL_ERROR("Reading from %s interface failed",
diff --git a/examples/vhost/main.c b/examples/vhost/main.c
index 26cfc8e..c65f24d 100644
--- a/examples/vhost/main.c
+++ b/examples/vhost/main.c
@@ -815,7 +815,9 @@ virtio_dev_rx(struct virtio_net *dev, struct rte_mbuf **pkts, uint32_t count)
 		vq->used->ring[res_cur_idx & (vq->size - 1)].len = packet_len;
 
 		/* Copy mbuf data to buffer */
-		rte_memcpy((void *)(uintptr_t)buff_addr, (const void*)buff->data, rte_pktmbuf_data_len(buff));
+		rte_memcpy((void *)(uintptr_t)buff_addr,
+			rte_pktmbuf_mtod(buff, const void *),
+			rte_pktmbuf_data_len(buff));
 
 		res_cur_idx++;
 		packet_success++;
@@ -877,7 +879,7 @@ link_vmdq(struct virtio_net *dev, struct rte_mbuf *m)
 	int i, ret;
 
 	/* Learn MAC address of guest device from packet */
-	pkt_hdr = (struct ether_hdr *)m->data;
+	pkt_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
 
 	dev_ll = ll_root_used;
 
@@ -965,7 +967,7 @@ virtio_tx_local(struct virtio_net *dev, struct rte_mbuf *m)
 	struct ether_hdr *pkt_hdr;
 	uint64_t ret = 0;
 
-	pkt_hdr = (struct ether_hdr *)m->data;
+	pkt_hdr = rte_pktmbuf_mtod(m, struct ether_hdr *);
 
 	/*get the used devices list*/
 	dev_ll = ll_root_used;
@@ -1042,18 +1044,21 @@ virtio_tx_route(struct virtio_net* dev, struct rte_mbuf *m, struct rte_mempool *
 	mbuf->pkt_len = mbuf->data_len;
 
 	/* Copy ethernet header to mbuf. */
-	rte_memcpy((void*)mbuf->data, (const void*)m->data, ETH_HLEN);
+	rte_memcpy(rte_pktmbuf_mtod(mbuf, void *),
+		rte_pktmbuf_mtod(m, const void *),
+		ETH_HLEN);
 
 
 	/* Setup vlan header. Bytes need to be re-ordered for network with htons()*/
-	vlan_hdr = (struct vlan_ethhdr *) mbuf->data;
+	vlan_hdr = rte_pktmbuf_mtod(mbuf, struct vlan_ethhdr *);
 	vlan_hdr->h_vlan_encapsulated_proto = vlan_hdr->h_vlan_proto;
 	vlan_hdr->h_vlan_proto = htons(ETH_P_8021Q);
 	vlan_hdr->h_vlan_TCI = htons(vlan_tag);
 
 	/* Copy the remaining packet contents to the mbuf. */
-	rte_memcpy((void*) ((uint8_t*)mbuf->data + VLAN_ETH_HLEN),
-		(const void*) ((uint8_t*)m->data + ETH_HLEN), (m->data_len - ETH_HLEN));
+	rte_memcpy((void *)(rte_pktmbuf_mtod(mbuf, uint8_t *) + VLAN_ETH_HLEN),
+		(const void *)(rte_pktmbuf_mtod(m, uint8_t *) + ETH_HLEN),
+		(m->data_len - ETH_HLEN));
 	tx_q->m_table[len] = mbuf;
 	len++;
 	if (enable_stats) {
@@ -1144,7 +1149,7 @@ virtio_dev_tx(struct virtio_net* dev, struct rte_mempool *mbuf_pool)
 
 		/* Setup dummy mbuf. This is copied to a real mbuf if transmitted out the physical port. */
 		m.data_len = desc->len;
-		m.data = (void*)(uintptr_t)buff_addr;
+		m.data_off = 0;
 
 		PRINT_PACKET(dev, (uintptr_t)buff_addr, desc->len, 0);
 
diff --git a/examples/vhost_xen/main.c b/examples/vhost_xen/main.c
index 2cf0029..6da1b22 100644
--- a/examples/vhost_xen/main.c
+++ b/examples/vhost_xen/main.c
@@ -981,7 +981,7 @@ virtio_dev_tx(struct virtio_net* dev, struct rte_mempool *mbuf_pool)
 
 		/* Setup dummy mbuf. This is copied to a real mbuf if transmitted out the physical port. */
 		m.data_len = desc->len;
-		m.data = (void*)(uintptr_t)buff_addr;
+		m.data_off = 0;
 		m.nb_segs = 1; 
 
 		virtio_tx_route(dev, &m, mbuf_pool, 0);
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 9879095..43e6d32 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -104,8 +104,7 @@ rte_pktmbuf_init(struct rte_mempool *mp,
 	m->buf_len = (uint16_t)buf_len;
 
 	/* keep some headroom between start of buffer and data */
-	m->data = (char*) m->buf_addr + RTE_MIN(RTE_PKTMBUF_HEADROOM,
-		(uint16_t)m->buf_len);
+	m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
 
 	/* init some constant fields */
 	m->pool = mp;
@@ -171,12 +170,12 @@ rte_pktmbuf_dump(const struct rte_mbuf *m, unsigned dump_len)
 		__rte_mbuf_sanity_check(m, 0);
 
 		printf("  segment at 0x%p, data=0x%p, data_len=%u\n",
-		       m, m->data, (unsigned)m->data_len);
+			m, rte_pktmbuf_mtod(m, void *), (unsigned)m->data_len);
 		len = dump_len;
 		if (len > m->data_len)
 			len = m->data_len;
 		if (len != 0)
-			rte_hexdump(NULL, m->data, len);
+			rte_hexdump(NULL, rte_pktmbuf_mtod(m, void *), len);
 		dump_len -= len;
 		m = m->next;
 		nb_segs --;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 275f6b2..8fa781b 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -132,6 +132,13 @@ struct rte_mbuf {
 	void *buf_addr;           /**< Virtual address of segment buffer. */
 	uint64_t buf_physaddr:48; /**< Physical address of segment buffer. */
 	uint64_t buf_len:16;      /**< Length of segment buffer. */
+
+	/* valid for any segment */
+	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
+	uint16_t data_off;
+	uint16_t data_len;        /**< Amount of data in segment buffer. */
+	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
+
 #ifdef RTE_MBUF_REFCNT
 	/**
 	 * 16-bit Reference counter.
@@ -142,36 +149,30 @@ struct rte_mbuf {
 	 * config option.
 	 */
 	union {
-		rte_atomic16_t refcnt_atomic;   /**< Atomically accessed refcnt */
-		uint16_t refcnt;                /**< Non-atomically accessed refcnt */
+		rte_atomic16_t refcnt_atomic; /**< Atomically accessed refcnt */
+		uint16_t refcnt;  /**< Non-atomically accessed refcnt */
 	};
 #else
-	uint16_t refcnt_reserved;     /**< Do not use this field */
+	uint16_t refcnt_reserved; /**< Do not use this field */
 #endif
 
-	uint16_t ol_flags;            /**< Offload features. */
-	uint32_t reserved;             /**< Unused field. Required for padding. */
-
-	/* valid for any segment */
-	struct rte_mbuf *next;  /**< Next segment of scattered packet. */
-	void* data;             /**< Start address of data in segment buffer. */
-	uint16_t data_len;      /**< Amount of data in segment buffer. */
-
 	/* these fields are valid for first segment only */
-	uint8_t nb_segs;        /**< Number of segments. */
-	uint8_t in_port;        /**< Input port. */
-	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
+	uint8_t nb_segs;          /**< Number of segments. */
+	uint8_t in_port;          /**< Input port. */
+	uint16_t ol_flags;        /**< Offload features. */
+	uint16_t reserved;        /**< Unused field. Required for padding. */
 
 	/* offload features, valid for first segment only */
 	union rte_vlan_macip vlan_macip;
 	union {
-		uint32_t rss;       /**< RSS hash result if RSS enabled */
+		uint32_t rss;     /**< RSS hash result if RSS enabled */
 		struct {
 			uint16_t hash;
 			uint16_t id;
-		} fdir;             /**< Filter identifier if FDIR enabled */
-		uint32_t sched;     /**< Hierarchical scheduler */
-	} hash;                 /**< hash information */
+		} fdir;           /**< Filter identifier if FDIR enabled */
+		uint32_t sched;   /**< Hierarchical scheduler */
+	} hash;                   /**< hash information */
+	uint64_t reserved2;       /**< Unused field. Required for padding. */
 } __rte_cache_aligned;
 
 /**
@@ -436,8 +437,6 @@ void rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg);
  */
 static inline void rte_pktmbuf_reset(struct rte_mbuf *m)
 {
-	uint32_t buf_ofs;
-
 	m->next = NULL;
 	m->pkt_len = 0;
 	m->vlan_macip.data = 0;
@@ -445,9 +444,8 @@ static inline void rte_pktmbuf_reset(struct rte_mbuf *m)
 	m->in_port = 0xff;
 
 	m->ol_flags = 0;
-	buf_ofs = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
+	m->data_off = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
 			RTE_PKTMBUF_HEADROOM : m->buf_len;
-	m->data = (char*) m->buf_addr + buf_ofs;
 
 	m->data_len = 0;
 	__rte_mbuf_sanity_check(m, 1);
@@ -504,7 +502,7 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *md)
 	mi->buf_len = md->buf_len;
 
 	mi->next = md->next;
-	mi->data = md->data;
+	mi->data_off = md->data_off;
 	mi->data_len = md->data_len;
 	mi->in_port = md->in_port;
 	mi->vlan_macip = md->vlan_macip;
@@ -533,16 +531,14 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 {
 	const struct rte_mempool *mp = m->pool;
 	void *buf = RTE_MBUF_TO_BADDR(m);
-	uint32_t buf_ofs;
 	uint32_t buf_len = mp->elt_size - sizeof(*m);
 	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + sizeof (*m);
 
 	m->buf_addr = buf;
 	m->buf_len = (uint16_t)buf_len;
 
-	buf_ofs = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
+	m->data_off = (RTE_PKTMBUF_HEADROOM <= m->buf_len) ?
 			RTE_PKTMBUF_HEADROOM : m->buf_len;
-	m->data = (char*) m->buf_addr + buf_ofs;
 
 	m->data_len = 0;
 }
@@ -706,7 +702,7 @@ static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 static inline uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m)
 {
 	__rte_mbuf_sanity_check(m, 1);
-	return (uint16_t) ((char*) m->data - (char*) m->buf_addr);
+	return m->data_off;
 }
 
 /**
@@ -754,7 +750,7 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
  * @param t
  *   The type to cast the result into.
  */
-#define rte_pktmbuf_mtod(m, t) ((t)((m)->data))
+#define rte_pktmbuf_mtod(m, t) ((t)((char *)(m)->buf_addr + (m)->data_off))
 
 /**
  * A macro that returns the length of the packet.
@@ -799,11 +795,11 @@ static inline char *rte_pktmbuf_prepend(struct rte_mbuf *m,
 	if (unlikely(len > rte_pktmbuf_headroom(m)))
 		return NULL;
 
-	m->data = (char*) m->data - len;
+	m->data_off -= len;
 	m->data_len = (uint16_t)(m->data_len + len);
 	m->pkt_len  = (m->pkt_len + len);
 
-	return (char*) m->data;
+	return (char*) m->buf_addr + m->data_off;
 }
 
 /**
@@ -832,7 +828,7 @@ static inline char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
 	if (unlikely(len > rte_pktmbuf_tailroom(m_last)))
 		return NULL;
 
-	tail = (char*) m_last->data + m_last->data_len;
+	tail = (char*) m_last->buf_addr + m_last->data_off + m_last->data_len;
 	m_last->data_len = (uint16_t)(m_last->data_len + len);
 	m->pkt_len  = (m->pkt_len + len);
 	return (char*) tail;
@@ -860,9 +856,9 @@ static inline char *rte_pktmbuf_adj(struct rte_mbuf *m, uint16_t len)
 		return NULL;
 
 	m->data_len = (uint16_t)(m->data_len - len);
-	m->data = ((char*) m->data + len);
+	m->data_off += len;
 	m->pkt_len  = (m->pkt_len - len);
-	return (char*) m->data;
+	return (char *)m->buf_addr + m->data_off;
 }
 
 /**
diff --git a/lib/librte_pmd_e1000/em_rxtx.c b/lib/librte_pmd_e1000/em_rxtx.c
index b9e66eb..b050a6b 100644
--- a/lib/librte_pmd_e1000/em_rxtx.c
+++ b/lib/librte_pmd_e1000/em_rxtx.c
@@ -90,8 +90,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 }
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb)             \
-	(uint64_t) ((mb)->buf_physaddr +       \
-	(uint64_t) ((char *)((mb)->data) - (char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
@@ -771,8 +770,8 @@ eth_em_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		pkt_len = (uint16_t) (rte_le_to_cpu_16(rxd.length) -
 				rxq->crc_len);
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rte_packet_prefetch(rxm->data);
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
+		rte_packet_prefetch((char *)rxm->buf_addr + rxm->data_off);
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		rxm->pkt_len = pkt_len;
@@ -941,7 +940,7 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		data_len = rte_le_to_cpu_16(rxd.length);
 		rxm->data_len = data_len;
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 		/*
 		 * If this is the first buffer of the received packet,
@@ -1013,7 +1012,8 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rxm->vlan_macip.f.vlan_tci = rte_le_to_cpu_16(rxd.special);
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_packet_prefetch(first_seg->data);
+		rte_packet_prefetch((char *)first_seg->buf_addr +
+			first_seg->data_off);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
index da33171..ab0ff01 100644
--- a/lib/librte_pmd_e1000/igb_rxtx.c
+++ b/lib/librte_pmd_e1000/igb_rxtx.c
@@ -84,9 +84,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 }
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr +		   \
-			(uint64_t) ((char *)((mb)->data) -     \
-				(char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
@@ -742,8 +740,8 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		pkt_len = (uint16_t) (rte_le_to_cpu_16(rxd.wb.upper.length) -
 				      rxq->crc_len);
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rte_packet_prefetch(rxm->data);
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
+		rte_packet_prefetch((char *)rxm->buf_addr + rxm->data_off);
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		rxm->pkt_len = pkt_len;
@@ -919,7 +917,7 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
 		rxm->data_len = data_len;
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 		/*
 		 * If this is the first buffer of the received packet,
@@ -1001,7 +999,8 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		first_seg->ol_flags = pkt_flags;
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_packet_prefetch(first_seg->data);
+		rte_packet_prefetch((char *)first_seg->buf_addr +
+			first_seg->data_off);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 6cb1640..3d3316d 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -988,7 +988,7 @@ ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
 		mb = rxep[i].mbuf;
 		rte_mbuf_refcnt_set(mb, 1);
 		mb->next = NULL;
-		mb->data = (char *)mb->buf_addr + RTE_PKTMBUF_HEADROOM;
+		mb->data_off = RTE_PKTMBUF_HEADROOM;
 		mb->nb_segs = 1;
 		mb->in_port = rxq->port_id;
 
@@ -1239,8 +1239,8 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		pkt_len = (uint16_t) (rte_le_to_cpu_16(rxd.wb.upper.length) -
 				      rxq->crc_len);
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
-		rte_packet_prefetch(rxm->data);
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
+		rte_packet_prefetch((char *)rxm->buf_addr + rxm->data_off);
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		rxm->pkt_len = pkt_len;
@@ -1423,7 +1423,7 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 */
 		data_len = rte_le_to_cpu_16(rxd.wb.upper.length);
 		rxm->data_len = data_len;
-		rxm->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 		/*
 		 * If this is the first buffer of the received packet,
@@ -1515,7 +1515,8 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		}
 
 		/* Prefetch data of first segment, if configured to do so. */
-		rte_packet_prefetch(first_seg->data);
+		rte_packet_prefetch((char *)first_seg->buf_addr + 
+			first_seg->data_off);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
@@ -3084,7 +3085,7 @@ ixgbe_alloc_rx_queue_mbufs(struct igb_rx_queue *rxq)
 
 		rte_mbuf_refcnt_set(mbuf, 1);
 		mbuf->next = NULL;
-		mbuf->data = (char *)mbuf->buf_addr + RTE_PKTMBUF_HEADROOM;
+		mbuf->data_off = RTE_PKTMBUF_HEADROOM;
 		mbuf->nb_segs = 1;
 		mbuf->in_port = rxq->port_id;
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index e32a417..66e741f 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -45,8 +45,7 @@
 #endif
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->data) - \
-	(char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
diff --git a/lib/librte_pmd_virtio/virtio_rxtx.c b/lib/librte_pmd_virtio/virtio_rxtx.c
index 7deaeb6..ac1b7e0 100644
--- a/lib/librte_pmd_virtio/virtio_rxtx.c
+++ b/lib/librte_pmd_virtio/virtio_rxtx.c
@@ -268,7 +268,7 @@ virtio_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 			continue;
 		}
 		rxm->in_port = rxvq->port_id;
-		rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+		rxm->data_off = RTE_PKTMBUF_HEADROOM;
 		rxm->nb_segs = 1;
 		rxm->next = NULL;
 		rxm->pkt_len  = (uint32_t)(len[i] - sizeof(struct virtio_net_hdr));
diff --git a/lib/librte_pmd_virtio/virtqueue.h b/lib/librte_pmd_virtio/virtqueue.h
index 210944e..11f908c 100644
--- a/lib/librte_pmd_virtio/virtqueue.h
+++ b/lib/librte_pmd_virtio/virtqueue.h
@@ -59,8 +59,7 @@
 #define VIRTQUEUE_MAX_NAME_SZ 32
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->data) - \
-	(char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define VTNET_SQ_RQ_QUEUE_IDX 0
 #define VTNET_SQ_TQ_QUEUE_IDX 1
@@ -363,7 +362,7 @@ virtqueue_dequeue_burst_rx(struct virtqueue *vq, struct rte_mbuf **rx_pkts, uint
 			break;
 		}
 		rte_prefetch0(cookie);
-		rte_packet_prefetch(cookie->data);
+		rte_packet_prefetch((char *)cookie->buf_addr + cookie->data_off);
 		rx_pkts[i]  = cookie;
 		vq->vq_used_cons_idx++;
 		vq_ring_free_chain(vq, desc_idx);
diff --git a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
index 60f26ba..b5450b2 100644
--- a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
+++ b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
@@ -80,8 +80,7 @@
 
 
 #define RTE_MBUF_DATA_DMA_ADDR(mb) \
-	(uint64_t) ((mb)->buf_physaddr + (uint64_t)((char *)((mb)->data) - \
-	(char *)(mb)->buf_addr))
+	(uint64_t) ((mb)->buf_physaddr + (mb)->data_off)
 
 #define RTE_MBUF_DATA_DMA_ADDR_DEFAULT(mb) \
 	(uint64_t) ((mb)->buf_physaddr + RTE_PKTMBUF_HEADROOM)
@@ -551,7 +550,7 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 			rxm->data_len = (uint16_t)rcd->len;
 			rxm->in_port = rxq->port_id;
 			rxm->vlan_macip.f.vlan_tci = 0;
-			rxm->data = (char *)rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
+			rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 			rx_pkts[nb_rx++] = rxm;
 
-- 
1.9.2


* [dpdk-dev] [PATCH RFC 07/11] mbuf: add functions to get the name of an ol_flag
  2014-05-09 14:50 [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support Olivier Matz
                   ` (5 preceding siblings ...)
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset Olivier Matz
@ 2014-05-09 14:50 ` Olivier Matz
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 08/11] mbuf: change ol_flags to 32 bits Olivier Matz
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Olivier Matz @ 2014-05-09 14:50 UTC (permalink / raw)
  To: dev

In test-pmd (rxonly.c), the code is able to dump the list of ol_flags.
The issue is that the list of flags in the application has to be
synchronized with the flags defined in rte_mbuf.h.

This patch introduces two new functions, rte_get_rx_ol_flag_name()
and rte_get_tx_ol_flag_name(), that return the name of a flag from
its mask. It also fixes rxonly.c to use these new functions and to
display the proper flags.
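
As a rough illustration of the intended usage, the sketch below walks
every bit of a 16-bit flags word and prints the name of each known RX
flag. It is standalone: the DEMO_ constants, demo_rx_flag_name() and
demo_dump_rx_flags() are hypothetical stand-ins (only three flags are
covered, with values copied from rte_mbuf.h at this point in the
series), not the DPDK definitions.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical local copies of three PKT_RX_* values (16-bit layout). */
#define DEMO_RX_VLAN_PKT     0x0001
#define DEMO_RX_RSS_HASH     0x0002
#define DEMO_RX_IP_CKSUM_BAD 0x0010

/* Stand-in for rte_get_rx_ol_flag_name(): one bit in, name or NULL out. */
static const char *demo_rx_flag_name(uint16_t mask)
{
	switch (mask) {
	case DEMO_RX_VLAN_PKT: return "PKT_RX_VLAN_PKT";
	case DEMO_RX_RSS_HASH: return "PKT_RX_RSS_HASH";
	case DEMO_RX_IP_CKSUM_BAD: return "PKT_RX_IP_CKSUM_BAD";
	default: return NULL;
	}
}

/* Print one line per known flag set in ol_flags; return how many. */
static unsigned demo_dump_rx_flags(uint16_t ol_flags)
{
	unsigned bit, count = 0;

	for (bit = 0; bit < sizeof(ol_flags) * 8; bit++) {
		uint16_t mask = (uint16_t)(1u << bit);
		const char *name;

		if ((ol_flags & mask) == 0)
			continue;
		name = demo_rx_flag_name(mask);
		if (name == NULL)
			continue; /* bit set, but not a flag we know */
		printf("  %s\n", name);
		count++;
	}
	return count;
}
```

Because the lookup takes a single-bit mask, the caller does not need
its own table of names, which is the synchronization problem described
above.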

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test-pmd/rxonly.c      | 33 ++++++++++-----------------------
 lib/librte_mbuf/rte_mbuf.h | 46 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 54 insertions(+), 25 deletions(-)

diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index 5751b0b..94f71c7 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -69,23 +69,6 @@
 
 #include "testpmd.h"
 
-#define MAX_PKT_RX_FLAGS 11
-static const char *pkt_rx_flag_names[MAX_PKT_RX_FLAGS] = {
-	"VLAN_PKT",
-	"RSS_HASH",
-	"PKT_RX_FDIR",
-	"IP_CKSUM",
-	"IP_CKSUM_BAD",
-
-	"IPV4_HDR",
-	"IPV4_HDR_EXT",
-	"IPV6_HDR",
-	"IPV6_HDR_EXT",
-
-	"IEEE1588_PTP",
-	"IEEE1588_TMST",
-};
-
 static inline void
 print_ether_addr(const char *what, struct ether_addr *eth_addr)
 {
@@ -169,12 +152,16 @@ pkt_burst_receive(struct fwd_stream *fs)
 				mb->vlan_macip.f.vlan_tci);
 		printf("\n");
 		if (ol_flags != 0) {
-			int rxf;
-
-			for (rxf = 0; rxf < MAX_PKT_RX_FLAGS; rxf++) {
-				if (ol_flags & (1 << rxf))
-					printf("  PKT_RX_%s\n",
-					       pkt_rx_flag_names[rxf]);
+			uint16_t rxf;
+			const char *name;
+
+			for (rxf = 0; rxf < sizeof(mb->ol_flags) * 8; rxf++) {
+				if ((ol_flags & (1 << rxf)) == 0)
+					continue;
+				name = rte_get_rx_ol_flag_name(1 << rxf);
+				if (name == NULL)
+					continue;
+				printf("  %s\n", name);
 			}
 		}
 		rte_pktmbuf_free(mb);
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 8fa781b..55a993a 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -99,9 +99,51 @@ extern "C" {
 #define PKT_TX_IEEE1588_TMST 0x8000 /**< TX IEEE1588 packet to timestamp. */
 
 /**
- * Bit Mask to indicate what bits required for building TX context
+ * Get the name of a RX offload flag
+ *
+ * @param mask
+ *   The mask describing the flag (only one bit must be set)
+ * @return
+ *   The name of this flag, or NULL if it's not a valid RX flag.
  */
-#define PKT_TX_OFFLOAD_MASK (PKT_TX_VLAN_PKT | PKT_TX_IP_CKSUM | PKT_TX_L4_MASK)
+static inline const char *rte_get_rx_ol_flag_name(uint16_t mask)
+{
+	switch (mask) {
+	case PKT_RX_VLAN_PKT: return "PKT_RX_VLAN_PKT";
+	case PKT_RX_RSS_HASH: return "PKT_RX_RSS_HASH";
+	case PKT_RX_FDIR: return "PKT_RX_FDIR";
+	case PKT_RX_L4_CKSUM_BAD: return "PKT_RX_L4_CKSUM_BAD";
+	case PKT_RX_IP_CKSUM_BAD: return "PKT_RX_IP_CKSUM_BAD";
+	case PKT_RX_IPV4_HDR: return "PKT_RX_IPV4_HDR";
+	case PKT_RX_IPV4_HDR_EXT: return "PKT_RX_IPV4_HDR_EXT";
+	case PKT_RX_IPV6_HDR: return "PKT_RX_IPV6_HDR";
+	case PKT_RX_IPV6_HDR_EXT: return "PKT_RX_IPV6_HDR_EXT";
+	case PKT_RX_IEEE1588_PTP: return "PKT_RX_IEEE1588_PTP";
+	case PKT_RX_IEEE1588_TMST: return "PKT_RX_IEEE1588_TMST";
+	default: return NULL;
+	}
+}
+
+/**
+ * Get the name of a TX offload flag
+ *
+ * @param mask
+ *   The mask describing the flag (only one bit must be set)
+ * @return
+ *   The name of this flag, or NULL if it's not a valid TX flag.
+ */
+static inline const char *rte_get_tx_ol_flag_name(uint16_t mask)
+{
+	switch (mask) {
+	case PKT_TX_VLAN_PKT: return "PKT_TX_VLAN_PKT";
+	case PKT_TX_IP_CKSUM: return "PKT_TX_IP_CKSUM";
+	case PKT_TX_TCP_CKSUM: return "PKT_TX_TCP_CKSUM";
+	case PKT_TX_SCTP_CKSUM: return "PKT_TX_SCTP_CKSUM";
+	case PKT_TX_UDP_CKSUM: return "PKT_TX_UDP_CKSUM";
+	case PKT_TX_IEEE1588_TMST: return "PKT_TX_IEEE1588_TMST";
+	default: return NULL;
+	}
+}
 
 /** Offload features */
 union rte_vlan_macip {
-- 
1.9.2


* [dpdk-dev] [PATCH RFC 08/11] mbuf: change ol_flags to 32 bits
  2014-05-09 14:50 [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support Olivier Matz
                   ` (6 preceding siblings ...)
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 07/11] mbuf: add functions to get the name of an ol_flag Olivier Matz
@ 2014-05-09 14:50 ` Olivier Matz
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 09/11] mbuf: rename vlan_macip_len in hw_offload and increase its size Olivier Matz
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Olivier Matz @ 2014-05-09 14:50 UTC (permalink / raw)
  To: dev

There is no room to add other offload flags in the current 16-bit
field. Since we now have more room in the mbuf structure, we can widen
ol_flags to 32 bits.

A later commit adds support for TSO (TCP Segmentation Offload), which
requires a new ol_flag; that is the motivation for this commit.
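
One point worth keeping in mind with the wider layout: the L4 checksum
request is a two-bit field rather than three independent bits, so
PKT_TX_UDP_CKSUM (0x000C0000) has both the TCP and the SCTP bit set. A
plain test such as (flags & PKT_TX_UDP_CKSUM) is therefore also true
for a TCP-only request. The sketch below compares the masked field
instead. It uses local DEMO_ copies of the values from this patch, and
demo_l4_request_is() is a hypothetical helper name, not part of the
series.

```c
#include <stdint.h>

/* Local copies of the 32-bit TX L4 values introduced by this patch. */
#define DEMO_TX_L4_MASK    0x000C0000u
#define DEMO_TX_TCP_CKSUM  0x00040000u
#define DEMO_TX_SCTP_CKSUM 0x00080000u
#define DEMO_TX_UDP_CKSUM  0x000C0000u

/*
 * Return non-zero if the L4 checksum request encoded in ol_flags is
 * exactly l4_type (one of the DEMO_TX_*_CKSUM values above).
 */
static int demo_l4_request_is(uint32_t ol_flags, uint32_t l4_type)
{
	return (ol_flags & DEMO_TX_L4_MASK) == l4_type;
}
```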

Thanks to this modification, another possible improvement (which is not
part of this series) could be to change the checksum flags from:
  PKT_RX_L4_CKSUM_BAD, PKT_RX_IP_CKSUM_BAD
to:
  PKT_RX_L4_CKSUM, PKT_RX_IP_CKSUM, PKT_RX_L4_CKSUM_BAD, PKT_RX_IP_CKSUM_BAD
in order to detect whether the checksum has been processed by hardware.
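
To make the idea concrete, here is a minimal sketch of how an
application could tell "not checked" from "checked and good" once a
PKT_RX_IP_CKSUM flag exists. Everything in it is an assumption: the
series does not define PKT_RX_IP_CKSUM, the bit value chosen for it is
arbitrary, and the rule that the _BAD bit is only meaningful when the
"processed" bit is set is one possible convention.

```c
#include <stdint.h>

/* Hypothetical flag values, for illustration only. */
#define DEMO_RX_IP_CKSUM     0x00000800u /* hw processed the IP checksum */
#define DEMO_RX_IP_CKSUM_BAD 0x00000010u /* hw found the checksum wrong */

enum demo_cksum_status {
	DEMO_CKSUM_UNKNOWN, /* hw did not process the checksum */
	DEMO_CKSUM_GOOD,
	DEMO_CKSUM_BAD
};

/* Classify the IP checksum state of a received packet from its flags. */
static enum demo_cksum_status demo_ip_cksum_status(uint32_t ol_flags)
{
	if ((ol_flags & DEMO_RX_IP_CKSUM) == 0)
		return DEMO_CKSUM_UNKNOWN;
	if (ol_flags & DEMO_RX_IP_CKSUM_BAD)
		return DEMO_CKSUM_BAD;
	return DEMO_CKSUM_GOOD;
}
```

With only the _BAD flags, the first case cannot be distinguished from
the last one, since both leave the bit clear.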

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test-pmd/cmdline.c                             | 13 +++-
 app/test-pmd/config.c                              | 10 +--
 app/test-pmd/csumonly.c                            | 26 ++++----
 app/test-pmd/rxonly.c                              |  4 +-
 app/test-pmd/testpmd.h                             | 11 +---
 app/test-pmd/txonly.c                              |  2 +-
 .../bsdapp/eal/include/exec-env/rte_kni_common.h   |  2 +-
 .../linuxapp/eal/include/exec-env/rte_kni_common.h |  2 +-
 lib/librte_mbuf/rte_mbuf.c                         |  2 +-
 lib/librte_mbuf/rte_mbuf.h                         | 52 +++++++--------
 lib/librte_pmd_e1000/em_rxtx.c                     | 35 +++++-----
 lib/librte_pmd_e1000/igb_rxtx.c                    | 71 ++++++++++----------
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c                  | 77 +++++++++++-----------
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h                  |  2 +-
 14 files changed, 157 insertions(+), 152 deletions(-)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index c507c46..a95b279 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -2264,8 +2264,17 @@ cmd_tx_cksum_set_parsed(void *parsed_result,
 		       __attribute__((unused)) void *data)
 {
 	struct cmd_tx_cksum_set_result *res = parsed_result;
-
-	tx_cksum_set(res->port_id, res->cksum_mask);
+	uint32_t ol_flags = 0;
+
+	if (res->cksum_mask & 0x1)
+		ol_flags |= PKT_TX_IP_CKSUM;
+	if (res->cksum_mask & 0x2)
+		ol_flags |= PKT_TX_TCP_CKSUM;
+	if (res->cksum_mask & 0x4)
+		ol_flags |= PKT_TX_UDP_CKSUM;
+	if (res->cksum_mask & 0x8)
+		ol_flags |= PKT_TX_SCTP_CKSUM;
+	tx_cksum_set(res->port_id, ol_flags);
 }
 
 cmdline_parse_token_string_t cmd_tx_cksum_set_tx_cksum =
diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 1feb133..cd82f60 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -1442,14 +1442,16 @@ set_qmap(portid_t port_id, uint8_t is_rx, uint16_t queue_id, uint8_t map_value)
 }
 
 void
-tx_cksum_set(portid_t port_id, uint8_t cksum_mask)
+tx_cksum_set(portid_t port_id, uint32_t ol_flags)
 {
-	uint16_t tx_ol_flags;
+	uint32_t cksum_mask = PKT_TX_IP_CKSUM | PKT_TX_L4_MASK;
+
 	if (port_id_is_invalid(port_id))
 		return;
+
 	/* Clear last 4 bits and then set L3/4 checksum mask again */
-	tx_ol_flags = (uint16_t) (ports[port_id].tx_ol_flags & 0xFFF0);
-	ports[port_id].tx_ol_flags = (uint16_t) ((cksum_mask & 0xf) | tx_ol_flags);
+	ports[port_id].tx_ol_flags &= ~cksum_mask;
+	ports[port_id].tx_ol_flags |= (ol_flags & cksum_mask);
 }
 
 void
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 3313b87..69b90a7 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -217,9 +217,9 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 	uint16_t nb_rx;
 	uint16_t nb_tx;
 	uint16_t i;
-	uint16_t ol_flags;
-	uint16_t pkt_ol_flags;
-	uint16_t tx_ol_flags;
+	uint32_t ol_flags;
+	uint32_t pkt_ol_flags;
+	uint32_t tx_ol_flags;
 	uint16_t l4_proto;
 	uint16_t eth_type;
 	uint8_t  l2_len;
@@ -261,7 +261,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 		mb = pkts_burst[i];
 		l2_len  = sizeof(struct ether_hdr);
 		pkt_ol_flags = mb->ol_flags;
-		ol_flags = (uint16_t) (pkt_ol_flags & (~PKT_TX_L4_MASK));
+		ol_flags = (pkt_ol_flags & (~PKT_TX_L4_MASK));
 
 		eth_hdr = rte_pktmbuf_mtod(mb, struct ether_hdr *);
 		eth_type = rte_be_to_cpu_16(eth_hdr->ether_type);
@@ -274,8 +274,8 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 		}
 
 		/* Update the L3/L4 checksum error packet count  */
-		rx_bad_ip_csum += (uint16_t) ((pkt_ol_flags & PKT_RX_IP_CKSUM_BAD) != 0);
-		rx_bad_l4_csum += (uint16_t) ((pkt_ol_flags & PKT_RX_L4_CKSUM_BAD) != 0);
+		rx_bad_ip_csum += ((pkt_ol_flags & PKT_RX_IP_CKSUM_BAD) != 0);
+		rx_bad_l4_csum += ((pkt_ol_flags & PKT_RX_L4_CKSUM_BAD) != 0);
 
 		/*
 		 * Try to figure out L3 packet type by SW.
@@ -308,7 +308,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			/* Do not delete, this is required by HW*/
 			ipv4_hdr->hdr_checksum = 0;
 
-			if (tx_ol_flags & 0x1) {
+			if (tx_ol_flags & PKT_TX_IP_CKSUM) {
 				/* HW checksum */
 				ol_flags |= PKT_TX_IP_CKSUM;
 			}
@@ -321,7 +321,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			if (l4_proto == IPPROTO_UDP) {
 				udp_hdr = (struct udp_hdr*) (rte_pktmbuf_mtod(mb,
 						unsigned char *) + l2_len + l3_len);
-				if (tx_ol_flags & 0x2) {
+				if (tx_ol_flags & PKT_TX_UDP_CKSUM) {
 					/* HW Offload */
 					ol_flags |= PKT_TX_UDP_CKSUM;
 					/* Pseudo header sum need be set properly */
@@ -337,7 +337,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			else if (l4_proto == IPPROTO_TCP){
 				tcp_hdr = (struct tcp_hdr*) (rte_pktmbuf_mtod(mb,
 						unsigned char *) + l2_len + l3_len);
-				if (tx_ol_flags & 0x4) {
+				if (tx_ol_flags & PKT_TX_TCP_CKSUM) {
 					ol_flags |= PKT_TX_TCP_CKSUM;
 					tcp_hdr->cksum = get_ipv4_psd_sum(ipv4_hdr);
 				}
@@ -351,7 +351,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 				sctp_hdr = (struct sctp_hdr*) (rte_pktmbuf_mtod(mb,
 						unsigned char *) + l2_len + l3_len);
 
-				if (tx_ol_flags & 0x8) {
+				if (tx_ol_flags & PKT_TX_SCTP_CKSUM) {
 					ol_flags |= PKT_TX_SCTP_CKSUM;
 					sctp_hdr->cksum = 0;
 
@@ -377,7 +377,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			if (l4_proto == IPPROTO_UDP) {
 				udp_hdr = (struct udp_hdr*) (rte_pktmbuf_mtod(mb,
 						unsigned char *) + l2_len + l3_len);
-				if (tx_ol_flags & 0x2) {
+				if (tx_ol_flags & PKT_TX_UDP_CKSUM) {
 					/* HW Offload */
 					ol_flags |= PKT_TX_UDP_CKSUM;
 					udp_hdr->dgram_cksum = get_ipv6_psd_sum(ipv6_hdr);
@@ -393,7 +393,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 			else if (l4_proto == IPPROTO_TCP) {
 				tcp_hdr = (struct tcp_hdr*) (rte_pktmbuf_mtod(mb,
 						unsigned char *) + l2_len + l3_len);
-				if (tx_ol_flags & 0x4) {
+				if (tx_ol_flags & PKT_TX_TCP_CKSUM) {
 					ol_flags |= PKT_TX_TCP_CKSUM;
 					tcp_hdr->cksum = get_ipv6_psd_sum(ipv6_hdr);
 				}
@@ -407,7 +407,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 				sctp_hdr = (struct sctp_hdr*) (rte_pktmbuf_mtod(mb,
 						unsigned char *) + l2_len + l3_len);
 
-				if (tx_ol_flags & 0x8) {
+				if (tx_ol_flags & PKT_TX_SCTP_CKSUM) {
 					ol_flags |= PKT_TX_SCTP_CKSUM;
 					sctp_hdr->cksum = 0;
 					/* Sanity check, only number of 4 bytes supported by HW */
diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index 94f71c7..0bf4440 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -92,7 +92,7 @@ pkt_burst_receive(struct fwd_stream *fs)
 	struct rte_mbuf  *mb;
 	struct ether_hdr *eth_hdr;
 	uint16_t eth_type;
-	uint16_t ol_flags;
+	uint32_t ol_flags;
 	uint16_t nb_rx;
 	uint16_t i;
 #ifdef RTE_TEST_PMD_RECORD_CORE_CYCLES
@@ -152,7 +152,7 @@ pkt_burst_receive(struct fwd_stream *fs)
 				mb->vlan_macip.f.vlan_tci);
 		printf("\n");
 		if (ol_flags != 0) {
-			uint16_t rxf;
+			uint32_t rxf;
 			const char *name;
 
 			for (rxf = 0; rxf < sizeof(mb->ol_flags) * 8; rxf++) {
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index bb10d3b..77dcc30 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -122,12 +122,7 @@ struct fwd_stream {
 
 /**
  * The data structure associated with each port.
- * tx_ol_flags is slightly different from ol_flags of rte_mbuf.
- *   Bit  0: Insert IP checksum
- *   Bit  1: Insert UDP checksum
- *   Bit  2: Insert TCP checksum
- *   Bit  3: Insert SCTP checksum
- *   Bit 11: Insert VLAN Label
+ * tx_ol_flags uses the same flags as ol_flags of rte_mbuf.
  */
 struct rte_port {
 	struct rte_eth_dev_info dev_info;   /**< PCI info + driver name */
@@ -138,7 +133,7 @@ struct rte_port {
 	struct fwd_stream       *rx_stream; /**< Port RX stream, if unique */
 	struct fwd_stream       *tx_stream; /**< Port TX stream, if unique */
 	unsigned int            socket_id;  /**< For NUMA support */
-	uint16_t                tx_ol_flags;/**< Offload Flags of TX packets. */
+	uint32_t                tx_ol_flags;/**< Offload Flags of TX packets. */
 	uint16_t                tx_vlan_id; /**< Tag Id. in TX VLAN packets. */
 	void                    *fwd_ctx;   /**< Forwarding mode context */
 	uint64_t                rx_bad_ip_csum; /**< rx pkts with bad ip checksum  */
@@ -484,7 +479,7 @@ void tx_vlan_reset(portid_t port_id);
 
 void set_qmap(portid_t port_id, uint8_t is_rx, uint16_t queue_id, uint8_t map_value);
 
-void tx_cksum_set(portid_t port_id, uint8_t cksum_mask);
+void tx_cksum_set(portid_t port_id, uint32_t ol_flags);
 
 void set_verbose_level(uint16_t vb_level);
 void set_tx_pkt_segments(unsigned *seg_lengths, unsigned nb_segs);
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index c28f3dd..5d93209 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -203,7 +203,7 @@ pkt_burst_transmit(struct fwd_stream *fs)
 	uint16_t nb_tx;
 	uint16_t nb_pkt;
 	uint16_t vlan_tci;
-	uint16_t ol_flags;
+	uint32_t ol_flags;
 	uint8_t  i;
 #ifdef RTE_TEST_PMD_RECORD_CORE_CYCLES
 	uint64_t start_tsc;
diff --git a/lib/librte_eal/bsdapp/eal/include/exec-env/rte_kni_common.h b/lib/librte_eal/bsdapp/eal/include/exec-env/rte_kni_common.h
index ad73feb..267dbc0 100755
--- a/lib/librte_eal/bsdapp/eal/include/exec-env/rte_kni_common.h
+++ b/lib/librte_eal/bsdapp/eal/include/exec-env/rte_kni_common.h
@@ -111,7 +111,7 @@ struct rte_kni_mbuf {
 	void *pool;
 	void *buf_addr;
 	char pad0[14];
-	uint16_t ol_flags;      /**< Offload features. */
+	uint32_t ol_flags;      /**< Offload features. */
 	void *next;
 	void *data;             /**< Start address of data in segment buffer. */
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
diff --git a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
index ad73feb..267dbc0 100755
--- a/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
+++ b/lib/librte_eal/linuxapp/eal/include/exec-env/rte_kni_common.h
@@ -111,7 +111,7 @@ struct rte_kni_mbuf {
 	void *pool;
 	void *buf_addr;
 	char pad0[14];
-	uint16_t ol_flags;      /**< Offload features. */
+	uint32_t ol_flags;      /**< Offload features. */
 	void *next;
 	void *data;             /**< Start address of data in segment buffer. */
 	uint16_t data_len;      /**< Amount of data in segment buffer. */
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index 43e6d32..0128eec 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -161,7 +161,7 @@ rte_pktmbuf_dump(const struct rte_mbuf *m, unsigned dump_len)
 
 	printf("dump mbuf at 0x%p, phys=%"PRIx64", buf_len=%u\n",
 	       m, (uint64_t)m->buf_physaddr, (unsigned)m->buf_len);
-	printf("  pkt_len=%"PRIu32", ol_flags=%"PRIx16", nb_segs=%u, "
+	printf("  pkt_len=%"PRIu32", ol_flags=%"PRIx32", nb_segs=%u, "
 	       "in_port=%u\n", m->pkt_len, m->ol_flags,
 	       (unsigned)m->nb_segs, (unsigned)m->in_port);
 	nb_segs = m->nb_segs;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 55a993a..1cd51c2 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -69,34 +69,33 @@ extern "C" {
  * Packet Offload Features Flags. It also carry packet type information.
  * Critical resources. Both rx/tx shared these bits. Be cautious on any change
  */
-#define PKT_RX_VLAN_PKT      0x0001 /**< RX packet is a 802.1q VLAN packet. */
-#define PKT_RX_RSS_HASH      0x0002 /**< RX packet with RSS hash result. */
-#define PKT_RX_FDIR          0x0004 /**< RX packet with FDIR infos. */
-#define PKT_RX_L4_CKSUM_BAD  0x0008 /**< L4 cksum of RX pkt. is not OK. */
-#define PKT_RX_IP_CKSUM_BAD  0x0010 /**< IP cksum of RX pkt. is not OK. */
-#define PKT_RX_IPV4_HDR      0x0020 /**< RX packet with IPv4 header. */
-#define PKT_RX_IPV4_HDR_EXT  0x0040 /**< RX packet with extended IPv4 header. */
-#define PKT_RX_IPV6_HDR      0x0080 /**< RX packet with IPv6 header. */
-#define PKT_RX_IPV6_HDR_EXT  0x0100 /**< RX packet with extended IPv6 header. */
-#define PKT_RX_IEEE1588_PTP  0x0200 /**< RX IEEE1588 L2 Ethernet PT Packet. */
-#define PKT_RX_IEEE1588_TMST 0x0400 /**< RX IEEE1588 L2/L4 timestamped packet.*/
-
-#define PKT_TX_VLAN_PKT      0x0800 /**< TX packet is a 802.1q VLAN packet. */
-#define PKT_TX_IP_CKSUM      0x1000 /**< IP cksum of TX pkt. computed by NIC. */
+#define PKT_RX_VLAN_PKT      0x00000001 /**< RX packet is a 802.1q VLAN packet. */
+#define PKT_RX_RSS_HASH      0x00000002 /**< RX packet with RSS hash result. */
+#define PKT_RX_FDIR          0x00000004 /**< RX packet with FDIR infos. */
+#define PKT_RX_L4_CKSUM_BAD  0x00000008 /**< L4 cksum of RX pkt. is not OK. */
+#define PKT_RX_IP_CKSUM_BAD  0x00000010 /**< IP cksum of RX pkt. is not OK. */
+#define PKT_RX_IPV4_HDR      0x00000020 /**< RX packet with IPv4 header. */
+#define PKT_RX_IPV4_HDR_EXT  0x00000040 /**< RX packet with extended IPv4 header. */
+#define PKT_RX_IPV6_HDR      0x00000080 /**< RX packet with IPv6 header. */
+#define PKT_RX_IPV6_HDR_EXT  0x00000100 /**< RX packet with extended IPv6 header. */
+#define PKT_RX_IEEE1588_PTP  0x00000200 /**< RX IEEE1588 L2 Ethernet PT Packet. */
+#define PKT_RX_IEEE1588_TMST 0x00000400 /**< RX IEEE1588 L2/L4 timestamped packet.*/
+
+#define PKT_TX_VLAN_PKT      0x00010000 /**< TX packet is a 802.1q VLAN packet. */
+#define PKT_TX_IP_CKSUM      0x00020000 /**< IP cksum of TX pkt. computed by NIC. */
 /*
- * Bit 14~13 used for L4 packet type with checksum enabled.
+ * Bits used for L4 packet type with checksum enabled.
  *     00: Reserved
  *     01: TCP checksum
  *     10: SCTP checksum
  *     11: UDP checksum
  */
-#define PKT_TX_L4_MASK       0x6000 /**< Mask bits for L4 checksum offload request. */
-#define PKT_TX_L4_NO_CKSUM   0x0000 /**< Disable L4 cksum of TX pkt. */
-#define PKT_TX_TCP_CKSUM     0x2000 /**< TCP cksum of TX pkt. computed by NIC. */
-#define PKT_TX_SCTP_CKSUM    0x4000 /**< SCTP cksum of TX pkt. computed by NIC. */
-#define PKT_TX_UDP_CKSUM     0x6000 /**< UDP cksum of TX pkt. computed by NIC. */
-/* Bit 15 */
-#define PKT_TX_IEEE1588_TMST 0x8000 /**< TX IEEE1588 packet to timestamp. */
+#define PKT_TX_L4_MASK       0x000C0000 /**< Mask bits for L4 checksum offload request. */
+#define PKT_TX_L4_NO_CKSUM   0x00000000 /**< Disable L4 cksum of TX pkt. */
+#define PKT_TX_TCP_CKSUM     0x00040000 /**< TCP cksum of TX pkt. computed by NIC. */
+#define PKT_TX_SCTP_CKSUM    0x00080000 /**< SCTP cksum of TX pkt. computed by NIC. */
+#define PKT_TX_UDP_CKSUM     0x000C0000 /**< UDP cksum of TX pkt. computed by NIC. */
+#define PKT_TX_IEEE1588_TMST 0x00100000 /**< TX IEEE1588 packet to timestamp. */
 
 /**
  * Get the name of a RX offload flag
@@ -106,7 +105,7 @@ extern "C" {
  * @return
  *   The name of this flag, or NULL if it's not a valid RX flag.
  */
-static inline const char *rte_get_rx_ol_flag_name(uint16_t mask)
+static inline const char *rte_get_rx_ol_flag_name(uint32_t mask)
 {
 	switch (mask) {
 	case PKT_RX_VLAN_PKT: return "PKT_RX_VLAN_PKT";
@@ -132,7 +131,7 @@ static inline const char *rte_get_rx_ol_flag_name(uint16_t mask)
  * @return
  *   The name of this flag, or NULL if it's not a valid TX flag.
  */
-static inline const char *rte_get_tx_ol_flag_name(uint16_t mask)
+static inline const char *rte_get_tx_ol_flag_name(uint32_t mask)
 {
 	switch (mask) {
 	case PKT_TX_VLAN_PKT: return "PKT_TX_VLAN_PKT";
@@ -201,8 +200,7 @@ struct rte_mbuf {
 	/* these fields are valid for first segment only */
 	uint8_t nb_segs;          /**< Number of segments. */
 	uint8_t in_port;          /**< Input port. */
-	uint16_t ol_flags;        /**< Offload features. */
-	uint16_t reserved;        /**< Unused field. Required for padding. */
+	uint32_t ol_flags;        /**< Offload features. */
 
 	/* offload features, valid for first segment only */
 	union rte_vlan_macip vlan_macip;
@@ -214,7 +212,7 @@ struct rte_mbuf {
 		} fdir;           /**< Filter identifier if FDIR enabled */
 		uint32_t sched;   /**< Hierarchical scheduler */
 	} hash;                   /**< hash information */
-	uint64_t reserved2;       /**< Unused field. Required for padding. */
+	uint64_t reserved;        /**< Unused field. Required for padding. */
 } __rte_cache_aligned;
 
 /**
diff --git a/lib/librte_pmd_e1000/em_rxtx.c b/lib/librte_pmd_e1000/em_rxtx.c
index b050a6b..015c0af 100644
--- a/lib/librte_pmd_e1000/em_rxtx.c
+++ b/lib/librte_pmd_e1000/em_rxtx.c
@@ -147,7 +147,7 @@ enum {
  * Structure to check if new context need be built
  */
 struct em_ctx_info {
-	uint16_t flags;               /**< ol_flags related to context build. */
+	uint32_t flags;               /**< ol_flags related to context build. */
 	uint32_t cmp_mask;            /**< compare mask */
 	union rte_vlan_macip hdrlen;  /**< L2 and L3 header lenghts */
 };
@@ -217,7 +217,7 @@ struct em_tx_queue {
 static inline void
 em_set_xmit_ctx(struct em_tx_queue* txq,
 		volatile struct e1000_context_desc *ctx_txd,
-		uint16_t flags,
+		uint32_t flags,
 		union rte_vlan_macip hdrlen)
 {
 	uint32_t cmp_mask, cmd_len;
@@ -283,7 +283,7 @@ em_set_xmit_ctx(struct em_tx_queue* txq,
  * or create a new context descriptor.
  */
 static inline uint32_t
-what_ctx_update(struct em_tx_queue *txq, uint16_t flags,
+what_ctx_update(struct em_tx_queue *txq, uint32_t flags,
 		union rte_vlan_macip hdrlen)
 {
 	/* If match with the current context */
@@ -356,7 +356,7 @@ em_xmit_cleanup(struct em_tx_queue *txq)
 }
 
 static inline uint32_t
-tx_desc_cksum_flags_to_upper(uint16_t ol_flags)
+tx_desc_cksum_flags_to_upper(uint32_t ol_flags)
 {
 	static const uint32_t l4_olinfo[2] = {0, E1000_TXD_POPTS_TXSM << 8};
 	static const uint32_t l3_olinfo[2] = {0, E1000_TXD_POPTS_IXSM << 8};
@@ -382,12 +382,12 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint32_t popts_spec;
 	uint32_t cmd_type_len;
 	uint16_t slen;
-	uint16_t ol_flags;
+	uint32_t ol_flags;
 	uint16_t tx_id;
 	uint16_t tx_last;
 	uint16_t nb_tx;
 	uint16_t nb_used;
-	uint16_t tx_ol_req;
+	uint32_t tx_ol_req;
 	uint32_t ctx;
 	uint32_t new_ctx;
 	union rte_vlan_macip hdrlen;
@@ -417,8 +417,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		ol_flags = tx_pkt->ol_flags;
 
 		/* If hardware offload required */
-		tx_ol_req = (uint16_t)(ol_flags & (PKT_TX_IP_CKSUM |
-							PKT_TX_L4_MASK));
+		tx_ol_req = (ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_L4_MASK));
 		if (tx_ol_req) {
 			hdrlen = tx_pkt->vlan_macip;
 			/* If new context to be built or reuse the exist ctx. */
@@ -620,22 +619,22 @@ end_of_tx:
  *
  **********************************************************************/
 
-static inline uint16_t
+static inline uint32_t
 rx_desc_status_to_pkt_flags(uint32_t rx_status)
 {
-	uint16_t pkt_flags;
+	uint32_t pkt_flags;
 
 	/* Check if VLAN present */
-	pkt_flags = (uint16_t)((rx_status & E1000_RXD_STAT_VP) ?
-						PKT_RX_VLAN_PKT : 0);
+	pkt_flags = ((rx_status & E1000_RXD_STAT_VP) ?
+		PKT_RX_VLAN_PKT : 0);
 
 	return pkt_flags;
 }
 
-static inline uint16_t
+static inline uint32_t
 rx_desc_error_to_pkt_flags(uint32_t rx_error)
 {
-	uint16_t pkt_flags = 0;
+	uint32_t pkt_flags = 0;
 
 	if (rx_error & E1000_RXD_ERR_IPE)
 		pkt_flags |= PKT_RX_IP_CKSUM_BAD;
@@ -779,8 +778,8 @@ eth_em_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rxm->in_port = rxq->port_id;
 
 		rxm->ol_flags = rx_desc_status_to_pkt_flags(status);
-		rxm->ol_flags = (uint16_t)(rxm->ol_flags |
-				rx_desc_error_to_pkt_flags(rxd.errors));
+		rxm->ol_flags = rxm->ol_flags |
+			rx_desc_error_to_pkt_flags(rxd.errors);
 
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
 		rxm->vlan_macip.f.vlan_tci = rte_le_to_cpu_16(rxd.special);
@@ -1005,8 +1004,8 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		first_seg->in_port = rxq->port_id;
 
 		first_seg->ol_flags = rx_desc_status_to_pkt_flags(status);
-		first_seg->ol_flags = (uint16_t)(first_seg->ol_flags |
-					rx_desc_error_to_pkt_flags(rxd.errors));
+		first_seg->ol_flags = first_seg->ol_flags |
+			rx_desc_error_to_pkt_flags(rxd.errors);
 
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
 		rxm->vlan_macip.f.vlan_tci = rte_le_to_cpu_16(rxd.special);
diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
index ab0ff01..322dfa0 100644
--- a/lib/librte_pmd_e1000/igb_rxtx.c
+++ b/lib/librte_pmd_e1000/igb_rxtx.c
@@ -144,7 +144,7 @@ enum igb_advctx_num {
  * Strucutre to check if new context need be built
  */
 struct igb_advctx_info {
-	uint16_t flags;           /**< ol_flags related to context build. */
+	uint32_t flags;           /**< ol_flags related to context build. */
 	uint32_t cmp_mask;        /**< compare mask for vlan_macip_lens */
 	union rte_vlan_macip vlan_macip_lens; /**< vlan, mac & ip length. */
 };
@@ -212,7 +212,7 @@ struct igb_tx_queue {
 static inline void
 igbe_set_xmit_ctx(struct igb_tx_queue* txq,
 		volatile struct e1000_adv_tx_context_desc *ctx_txd,
-		uint16_t ol_flags, uint32_t vlan_macip_lens)
+		uint32_t ol_flags, uint32_t vlan_macip_lens)
 {
 	uint32_t type_tucmd_mlhl;
 	uint32_t mss_l4len_idx;
@@ -277,7 +277,7 @@ igbe_set_xmit_ctx(struct igb_tx_queue* txq,
  * or create a new context descriptor.
  */
 static inline uint32_t
-what_advctx_update(struct igb_tx_queue *txq, uint16_t flags,
+what_advctx_update(struct igb_tx_queue *txq, uint32_t flags,
 		uint32_t vlan_macip_lens)
 {
 	/* If match with the current context */
@@ -300,7 +300,7 @@ what_advctx_update(struct igb_tx_queue *txq, uint16_t flags,
 }
 
 static inline uint32_t
-tx_desc_cksum_flags_to_olinfo(uint16_t ol_flags)
+tx_desc_cksum_flags_to_olinfo(uint32_t ol_flags)
 {
 	static const uint32_t l4_olinfo[2] = {0, E1000_ADVTXD_POPTS_TXSM};
 	static const uint32_t l3_olinfo[2] = {0, E1000_ADVTXD_POPTS_IXSM};
@@ -312,7 +312,7 @@ tx_desc_cksum_flags_to_olinfo(uint16_t ol_flags)
 }
 
 static inline uint32_t
-tx_desc_vlan_flags_to_cmdtype(uint16_t ol_flags)
+tx_desc_vlan_flags_to_cmdtype(uint32_t ol_flags)
 {
 	static uint32_t vlan_cmd[2] = {0, E1000_ADVTXD_DCMD_VLE};
 	return vlan_cmd[(ol_flags & PKT_TX_VLAN_PKT) != 0];
@@ -334,12 +334,12 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint32_t cmd_type_len;
 	uint32_t pkt_len;
 	uint16_t slen;
-	uint16_t ol_flags;
+	uint32_t ol_flags;
 	uint16_t tx_end;
 	uint16_t tx_id;
 	uint16_t tx_last;
 	uint16_t nb_tx;
-	uint16_t tx_ol_req;
+	uint32_t tx_ol_req;
 	uint32_t new_ctx = 0;
 	uint32_t ctx = 0;
 	uint32_t vlan_macip_lens;
@@ -368,7 +368,8 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 
 		ol_flags = tx_pkt->ol_flags;
 		vlan_macip_lens = tx_pkt->vlan_macip.data;
-		tx_ol_req = (uint16_t)(ol_flags & PKT_TX_OFFLOAD_MASK);
+		tx_ol_req = ol_flags &
+			(PKT_TX_VLAN_PKT | PKT_TX_IP_CKSUM | PKT_TX_L4_MASK);
 
 		/* If a Context Descriptor need be built . */
 		if (tx_ol_req) {
@@ -555,12 +556,12 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
  *  RX functions
  *
  **********************************************************************/
-static inline uint16_t
+static inline uint32_t
 rx_desc_hlen_type_rss_to_pkt_flags(uint32_t hl_tp_rs)
 {
-	uint16_t pkt_flags;
+	uint32_t pkt_flags;
 
-	static uint16_t ip_pkt_types_map[16] = {
+	static uint32_t ip_pkt_types_map[16] = {
 		0, PKT_RX_IPV4_HDR, PKT_RX_IPV4_HDR_EXT, PKT_RX_IPV4_HDR_EXT,
 		PKT_RX_IPV6_HDR, 0, 0, 0,
 		PKT_RX_IPV6_HDR_EXT, 0, 0, 0,
@@ -573,34 +574,34 @@ rx_desc_hlen_type_rss_to_pkt_flags(uint32_t hl_tp_rs)
 		0, 0, 0, 0,
 	};
 
-	pkt_flags = (uint16_t)((hl_tp_rs & E1000_RXDADV_PKTTYPE_ETQF) ?
-				ip_pkt_etqf_map[(hl_tp_rs >> 4) & 0x07] :
-				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
+	pkt_flags = ((hl_tp_rs & E1000_RXDADV_PKTTYPE_ETQF) ?
+		ip_pkt_etqf_map[(hl_tp_rs >> 4) & 0x07] :
+		ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
 #else
-	pkt_flags = (uint16_t)((hl_tp_rs & E1000_RXDADV_PKTTYPE_ETQF) ? 0 :
-				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
+	pkt_flags = ((hl_tp_rs & E1000_RXDADV_PKTTYPE_ETQF) ? 0 :
+		ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
 #endif
-	return (uint16_t)(pkt_flags | (((hl_tp_rs & 0x0F) == 0) ?
-						0 : PKT_RX_RSS_HASH));
+	return (pkt_flags | (((hl_tp_rs & 0x0F) == 0) ?
+			0 : PKT_RX_RSS_HASH));
 }
 
-static inline uint16_t
+static inline uint32_t
 rx_desc_status_to_pkt_flags(uint32_t rx_status)
 {
-	uint16_t pkt_flags;
+	uint32_t pkt_flags;
 
 	/* Check if VLAN present */
-	pkt_flags = (uint16_t)((rx_status & E1000_RXD_STAT_VP) ?
-						PKT_RX_VLAN_PKT : 0);
+	pkt_flags = ((rx_status & E1000_RXD_STAT_VP) ?
+		PKT_RX_VLAN_PKT : 0);
 
 #if defined(RTE_LIBRTE_IEEE1588)
 	if (rx_status & E1000_RXD_STAT_TMST)
-		pkt_flags = (uint16_t)(pkt_flags | PKT_RX_IEEE1588_TMST);
+		pkt_flags = (pkt_flags | PKT_RX_IEEE1588_TMST);
 #endif
 	return pkt_flags;
 }
 
-static inline uint16_t
+static inline uint32_t
 rx_desc_error_to_pkt_flags(uint32_t rx_status)
 {
 	/*
@@ -608,7 +609,7 @@ rx_desc_error_to_pkt_flags(uint32_t rx_status)
 	 * Bit 29: L4I, L4I integrity error
 	 */
 
-	static uint16_t error_to_pkt_flags_map[4] = {
+	static uint32_t error_to_pkt_flags_map[4] = {
 		0,  PKT_RX_L4_CKSUM_BAD, PKT_RX_IP_CKSUM_BAD,
 		PKT_RX_IP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD
 	};
@@ -635,7 +636,7 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 	uint16_t rx_id;
 	uint16_t nb_rx;
 	uint16_t nb_hold;
-	uint16_t pkt_flags;
+	uint32_t pkt_flags;
 
 	nb_rx = 0;
 	nb_hold = 0;
@@ -755,10 +756,10 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 			rte_le_to_cpu_16(rxd.wb.upper.vlan);
 
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_status_to_pkt_flags(staterr));
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_error_to_pkt_flags(staterr));
+		pkt_flags = pkt_flags |
+			rx_desc_status_to_pkt_flags(staterr);
+		pkt_flags = pkt_flags |
+			rx_desc_error_to_pkt_flags(staterr);
 		rxm->ol_flags = pkt_flags;
 
 		/*
@@ -815,7 +816,7 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 	uint16_t nb_rx;
 	uint16_t nb_hold;
 	uint16_t data_len;
-	uint16_t pkt_flags;
+	uint32_t pkt_flags;
 
 	nb_rx = 0;
 	nb_hold = 0;
@@ -992,10 +993,10 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 			rte_le_to_cpu_16(rxd.wb.upper.vlan);
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_status_to_pkt_flags(staterr));
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_error_to_pkt_flags(staterr));
+		pkt_flags = pkt_flags |
+			rx_desc_status_to_pkt_flags(staterr);
+		pkt_flags = pkt_flags |
+			rx_desc_error_to_pkt_flags(staterr);
 		first_seg->ol_flags = pkt_flags;
 
 		/* Prefetch data of first segment, if configured to do so. */
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 3d3316d..7096ea6 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -350,7 +350,7 @@ ixgbe_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
 static inline void
 ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
 		volatile struct ixgbe_adv_tx_context_desc *ctx_txd,
-		uint16_t ol_flags, uint32_t vlan_macip_lens)
+		uint32_t ol_flags, uint32_t vlan_macip_lens)
 {
 	uint32_t type_tucmd_mlhl;
 	uint32_t mss_l4len_idx;
@@ -413,7 +413,7 @@ ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
  * or create a new context descriptor.
  */
 static inline uint32_t
-what_advctx_update(struct igb_tx_queue *txq, uint16_t flags,
+what_advctx_update(struct igb_tx_queue *txq, uint32_t flags,
 		uint32_t vlan_macip_lens)
 {
 	/* If match with the current used context */
@@ -436,7 +436,7 @@ what_advctx_update(struct igb_tx_queue *txq, uint16_t flags,
 }
 
 static inline uint32_t
-tx_desc_cksum_flags_to_olinfo(uint16_t ol_flags)
+tx_desc_cksum_flags_to_olinfo(uint32_t ol_flags)
 {
 	static const uint32_t l4_olinfo[2] = {0, IXGBE_ADVTXD_POPTS_TXSM};
 	static const uint32_t l3_olinfo[2] = {0, IXGBE_ADVTXD_POPTS_IXSM};
@@ -448,7 +448,7 @@ tx_desc_cksum_flags_to_olinfo(uint16_t ol_flags)
 }
 
 static inline uint32_t
-tx_desc_vlan_flags_to_cmdtype(uint16_t ol_flags)
+tx_desc_vlan_flags_to_cmdtype(uint32_t ol_flags)
 {
 	static const uint32_t vlan_cmd[2] = {0, IXGBE_ADVTXD_DCMD_VLE};
 	return vlan_cmd[(ol_flags & PKT_TX_VLAN_PKT) != 0];
@@ -537,12 +537,12 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint32_t cmd_type_len;
 	uint32_t pkt_len;
 	uint16_t slen;
-	uint16_t ol_flags;
+	uint32_t ol_flags;
 	uint16_t tx_id;
 	uint16_t tx_last;
 	uint16_t nb_tx;
 	uint16_t nb_used;
-	uint16_t tx_ol_req;
+	uint32_t tx_ol_req;
 	uint32_t vlan_macip_lens;
 	uint32_t ctx = 0;
 	uint32_t new_ctx;
@@ -574,7 +574,8 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		vlan_macip_lens = tx_pkt->vlan_macip.data;
 
 		/* If hardware offload required */
-		tx_ol_req = (uint16_t)(ol_flags & PKT_TX_OFFLOAD_MASK);
+		tx_ol_req = ol_flags &
+			(PKT_TX_VLAN_PKT | PKT_TX_IP_CKSUM | PKT_TX_L4_MASK);
 		if (tx_ol_req) {
 			/* If new context need be built or reuse the exist ctx. */
 			ctx = what_advctx_update(txq, tx_ol_req,
@@ -804,19 +805,19 @@ end_of_tx:
  *  RX functions
  *
  **********************************************************************/
-static inline uint16_t
+static inline uint32_t
 rx_desc_hlen_type_rss_to_pkt_flags(uint32_t hl_tp_rs)
 {
-	uint16_t pkt_flags;
+	uint32_t pkt_flags;
 
-	static uint16_t ip_pkt_types_map[16] = {
+	static uint32_t ip_pkt_types_map[16] = {
 		0, PKT_RX_IPV4_HDR, PKT_RX_IPV4_HDR_EXT, PKT_RX_IPV4_HDR_EXT,
 		PKT_RX_IPV6_HDR, 0, 0, 0,
 		PKT_RX_IPV6_HDR_EXT, 0, 0, 0,
 		PKT_RX_IPV6_HDR_EXT, 0, 0, 0,
 	};
 
-	static uint16_t ip_rss_types_map[16] = {
+	static uint32_t ip_rss_types_map[16] = {
 		0, PKT_RX_RSS_HASH, PKT_RX_RSS_HASH, PKT_RX_RSS_HASH,
 		0, PKT_RX_RSS_HASH, 0, PKT_RX_RSS_HASH,
 		PKT_RX_RSS_HASH, 0, 0, 0,
@@ -829,45 +830,45 @@ rx_desc_hlen_type_rss_to_pkt_flags(uint32_t hl_tp_rs)
 		0, 0, 0, 0,
 	};
 
-	pkt_flags = (uint16_t) ((hl_tp_rs & IXGBE_RXDADV_PKTTYPE_ETQF) ?
-				ip_pkt_etqf_map[(hl_tp_rs >> 4) & 0x07] :
-				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
+	pkt_flags = ((hl_tp_rs & IXGBE_RXDADV_PKTTYPE_ETQF) ?
+		ip_pkt_etqf_map[(hl_tp_rs >> 4) & 0x07] :
+		ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
 #else
-	pkt_flags = (uint16_t) ((hl_tp_rs & IXGBE_RXDADV_PKTTYPE_ETQF) ? 0 :
-				ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
+	pkt_flags = ((hl_tp_rs & IXGBE_RXDADV_PKTTYPE_ETQF) ? 0 :
+		ip_pkt_types_map[(hl_tp_rs >> 4) & 0x0F]);
 
 #endif
-	return (uint16_t)(pkt_flags | ip_rss_types_map[hl_tp_rs & 0xF]);
+	return (pkt_flags | ip_rss_types_map[hl_tp_rs & 0xF]);
 }
 
-static inline uint16_t
+static inline uint32_t
 rx_desc_status_to_pkt_flags(uint32_t rx_status)
 {
-	uint16_t pkt_flags;
+	uint32_t pkt_flags;
 
 	/*
 	 * Check if VLAN present only.
 	 * Do not check whether L3/L4 rx checksum done by NIC or not,
 	 * That can be found from rte_eth_rxmode.hw_ip_checksum flag
 	 */
-	pkt_flags = (uint16_t)((rx_status & IXGBE_RXD_STAT_VP) ?
-						PKT_RX_VLAN_PKT : 0);
+	pkt_flags = ((rx_status & IXGBE_RXD_STAT_VP) ?
+		PKT_RX_VLAN_PKT : 0);
 
 #ifdef RTE_LIBRTE_IEEE1588
 	if (rx_status & IXGBE_RXD_STAT_TMST)
-		pkt_flags = (uint16_t)(pkt_flags | PKT_RX_IEEE1588_TMST);
+		pkt_flags = pkt_flags | PKT_RX_IEEE1588_TMST;
 #endif
 	return pkt_flags;
 }
 
-static inline uint16_t
+static inline uint32_t
 rx_desc_error_to_pkt_flags(uint32_t rx_status)
 {
 	/*
 	 * Bit 31: IPE, IPv4 checksum error
 	 * Bit 30: L4I, L4I integrity error
 	 */
-	static uint16_t error_to_pkt_flags_map[4] = {
+	static uint32_t error_to_pkt_flags_map[4] = {
 		0,  PKT_RX_L4_CKSUM_BAD, PKT_RX_IP_CKSUM_BAD,
 		PKT_RX_IP_CKSUM_BAD | PKT_RX_L4_CKSUM_BAD
 	};
@@ -938,10 +939,10 @@ ixgbe_rx_scan_hw_ring(struct igb_rx_queue *rxq)
 			mb->ol_flags  = rx_desc_hlen_type_rss_to_pkt_flags(
 					rxdp[j].wb.lower.lo_dword.data);
 			/* reuse status field from scan list */
-			mb->ol_flags = (uint16_t)(mb->ol_flags |
-					rx_desc_status_to_pkt_flags(s[j]));
-			mb->ol_flags = (uint16_t)(mb->ol_flags |
-					rx_desc_error_to_pkt_flags(s[j]));
+			mb->ol_flags = mb->ol_flags |
+				rx_desc_status_to_pkt_flags(s[j]);
+			mb->ol_flags = mb->ol_flags |
+				rx_desc_error_to_pkt_flags(s[j]);
 		}
 
 		/* Move mbuf pointers from the S/W ring to the stage */
@@ -1134,7 +1135,7 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 	uint16_t rx_id;
 	uint16_t nb_rx;
 	uint16_t nb_hold;
-	uint16_t pkt_flags;
+	uint32_t pkt_flags;
 
 	nb_rx = 0;
 	nb_hold = 0;
@@ -1253,10 +1254,10 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 			rte_le_to_cpu_16(rxd.wb.upper.vlan);
 
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_status_to_pkt_flags(staterr));
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_error_to_pkt_flags(staterr));
+		pkt_flags = pkt_flags |
+			rx_desc_status_to_pkt_flags(staterr);
+		pkt_flags = pkt_flags |
+			rx_desc_error_to_pkt_flags(staterr);
 		rxm->ol_flags = pkt_flags;
 
 		if (likely(pkt_flags & PKT_RX_RSS_HASH))
@@ -1321,7 +1322,7 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 	uint16_t nb_rx;
 	uint16_t nb_hold;
 	uint16_t data_len;
-	uint16_t pkt_flags;
+	uint32_t pkt_flags;
 
 	nb_rx = 0;
 	nb_hold = 0;
@@ -1498,10 +1499,10 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 				rte_le_to_cpu_16(rxd.wb.upper.vlan);
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_status_to_pkt_flags(staterr));
-		pkt_flags = (uint16_t)(pkt_flags |
-				rx_desc_error_to_pkt_flags(staterr));
+		pkt_flags = pkt_flags |
+			rx_desc_status_to_pkt_flags(staterr);
+		pkt_flags = pkt_flags |
+			rx_desc_error_to_pkt_flags(staterr);
 		first_seg->ol_flags = pkt_flags;
 
 		if (likely(pkt_flags & PKT_RX_RSS_HASH))
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index 66e741f..571d2ca 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -150,7 +150,7 @@ enum ixgbe_advctx_num {
  */
 
 struct ixgbe_advctx_info {
-	uint16_t flags;           /**< ol_flags for context build. */
+	uint32_t flags;           /**< ol_flags for context build. */
 	uint32_t cmp_mask;        /**< compare mask for vlan_macip_lens */
 	union rte_vlan_macip vlan_macip_lens; /**< vlan, mac ip length. */
 };
-- 
1.9.2


* [dpdk-dev] [PATCH RFC 09/11] mbuf: rename vlan_macip_len in hw_offload and increase its size
  2014-05-09 14:50 [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support Olivier Matz
                   ` (7 preceding siblings ...)
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 08/11] mbuf: change ol_flags to 32 bits Olivier Matz
@ 2014-05-09 14:50 ` Olivier Matz
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 10/11] testpmd: modify source address to validate checksum calculation Olivier Matz
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Olivier Matz @ 2014-05-09 14:50 UTC (permalink / raw)
  To: dev

To implement TCP segmentation offload, we will need to add
more meta information to the mbuf, such as the length of the
L4 header and the MSS.

To prepare for this modification, this patch renames vlan_macip_len to
hw_offload and changes its size from 32 bits to 64 bits.

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
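For readers who want to see the masked context compare in isolation, here
is a minimal standalone sketch. It is not the driver code: the names
hw_offload_sketch, ctx_cache_sketch, ctx_store and ctx_match are
hypothetical, the cache has a single entry, and the bit-field layout
assumes a compiler such as GCC or Clang with C11 anonymous structs. It only
illustrates the idea used in the diff below: store the offload fields
already masked, then compare an incoming mbuf's fields under the same mask.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in mirroring the 64-bit union introduced by this patch
 * (illustration only, not the DPDK header). */
union hw_offload_sketch {
	uint64_t u64;
	struct {
		uint32_t l2_len:7;    /* L2 (MAC) header length */
		uint32_t l3_len:9;    /* L3 (IP) header length */
		uint32_t reserved:16;
		uint16_t vlan_tci;    /* VLAN TCI, CPU order */
		uint16_t reserved2;
	};
};

/* Assumption: the compiler packs the fields into exactly 64 bits. */
static_assert(sizeof(union hw_offload_sketch) == sizeof(uint64_t),
	      "offload union is expected to be 64 bits");

/* Hypothetical one-entry context cache. */
struct ctx_cache_sketch {
	union hw_offload_sketch hw_offload;   /* values, already masked */
	union hw_offload_sketch offload_mask; /* fields that matter */
};

/* Record a context: keep only the fields the offload depends on. */
static void
ctx_store(struct ctx_cache_sketch *c, union hw_offload_sketch hw,
	  int use_vlan, int use_lens)
{
	union hw_offload_sketch mask;

	mask.u64 = 0;
	if (use_vlan)
		mask.vlan_tci = 0xffff;
	if (use_lens) {
		mask.l2_len = 0x7f;
		mask.l3_len = 0x1ff;
	}
	c->offload_mask = mask;
	c->hw_offload.u64 = hw.u64 & mask.u64;
}

/* Return non-zero if the cached context can be reused for hw. */
static int
ctx_match(const struct ctx_cache_sketch *c, union hw_offload_sketch hw)
{
	return c->hw_offload.u64 == (c->offload_mask.u64 & hw.u64);
}
```

With a context stored for checksum offload only (use_lens set, use_vlan
clear), a packet that differs only in vlan_tci still matches, while a
packet with a different l3_len does not.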
 app/test-pmd/csumonly.c               |  4 +--
 app/test-pmd/macfwd.c                 |  6 ++--
 app/test-pmd/rxonly.c                 |  2 +-
 app/test-pmd/testpmd.c                |  2 +-
 app/test-pmd/txonly.c                 |  6 ++--
 examples/ip_reassembly/ipv4_rsmbl.h   | 10 +++----
 examples/ip_reassembly/main.c         |  4 +--
 lib/librte_mbuf/rte_mbuf.h            | 34 ++++++++++-----------
 lib/librte_pmd_e1000/em_rxtx.c        | 50 +++++++++++++++++--------------
 lib/librte_pmd_e1000/igb_rxtx.c       | 56 ++++++++++++++++++++---------------
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c     | 54 +++++++++++++++++++--------------
 lib/librte_pmd_ixgbe/ixgbe_rxtx.h     |  3 +-
 lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c |  4 +--
 13 files changed, 126 insertions(+), 109 deletions(-)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 69b90a7..9caad8f 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -430,8 +430,8 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 		}
 
 		/* Combine the packet header write. VLAN is not consider here */
-		mb->vlan_macip.f.l2_len = l2_len;
-		mb->vlan_macip.f.l3_len = l3_len;
+		mb->hw_offload.l2_len = l2_len;
+		mb->hw_offload.l3_len = l3_len;
 		mb->ol_flags = ol_flags;
 	}
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_rx);
diff --git a/app/test-pmd/macfwd.c b/app/test-pmd/macfwd.c
index ab74d0c..d137f92 100644
--- a/app/test-pmd/macfwd.c
+++ b/app/test-pmd/macfwd.c
@@ -116,9 +116,9 @@ pkt_burst_mac_forward(struct fwd_stream *fs)
 		ether_addr_copy(&ports[fs->tx_port].eth_addr,
 				&eth_hdr->s_addr);
 		mb->ol_flags = txp->tx_ol_flags;
-		mb->vlan_macip.f.l2_len = sizeof(struct ether_hdr);
-		mb->vlan_macip.f.l3_len = sizeof(struct ipv4_hdr);
-		mb->vlan_macip.f.vlan_tci = txp->tx_vlan_id;
+		mb->hw_offload.l2_len = sizeof(struct ether_hdr);
+		mb->hw_offload.l3_len = sizeof(struct ipv4_hdr);
+		mb->hw_offload.vlan_tci = txp->tx_vlan_id;
 	}
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_rx);
 	fs->tx_packets += nb_tx;
diff --git a/app/test-pmd/rxonly.c b/app/test-pmd/rxonly.c
index 0bf4440..6283482 100644
--- a/app/test-pmd/rxonly.c
+++ b/app/test-pmd/rxonly.c
@@ -149,7 +149,7 @@ pkt_burst_receive(struct fwd_stream *fs)
 			       mb->hash.fdir.hash, mb->hash.fdir.id);
 		if (ol_flags & PKT_RX_VLAN_PKT)
 			printf(" - VLAN tci=0x%x",
-				mb->vlan_macip.f.vlan_tci);
+				mb->hw_offload.vlan_tci);
 		printf("\n");
 		if (ol_flags != 0) {
 			uint32_t rxf;
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 572c3aa..3085be5 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -397,7 +397,7 @@ testpmd_mbuf_ctor(struct rte_mempool *mp,
 	mb->ol_flags     = 0;
 	mb->data_off     = RTE_PKTMBUF_HEADROOM;
 	mb->nb_segs      = 1;
-	mb->vlan_macip.data = 0;
+	mb->hw_offload.u64 = 0;
 	mb->hash.rss     = 0;
 }
 
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index 5d93209..97e381a 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -264,9 +264,9 @@ pkt_burst_transmit(struct fwd_stream *fs)
 		pkt->nb_segs = tx_pkt_nb_segs;
 		pkt->pkt_len = tx_pkt_length;
 		pkt->ol_flags = ol_flags;
-		pkt->vlan_macip.f.vlan_tci  = vlan_tci;
-		pkt->vlan_macip.f.l2_len = sizeof(struct ether_hdr);
-		pkt->vlan_macip.f.l3_len = sizeof(struct ipv4_hdr);
+		pkt->hw_offload.vlan_tci  = vlan_tci;
+		pkt->hw_offload.l2_len = sizeof(struct ether_hdr);
+		pkt->hw_offload.l3_len = sizeof(struct ipv4_hdr);
 		pkts_burst[nb_pkt] = pkt;
 	}
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_pkt);
diff --git a/examples/ip_reassembly/ipv4_rsmbl.h b/examples/ip_reassembly/ipv4_rsmbl.h
index 9b647fb..c653993 100644
--- a/examples/ip_reassembly/ipv4_rsmbl.h
+++ b/examples/ip_reassembly/ipv4_rsmbl.h
@@ -168,8 +168,8 @@ ipv4_frag_chain(struct rte_mbuf *mn, struct rte_mbuf *mp)
 	struct rte_mbuf *ms;
 
 	/* adjust start of the last fragment data. */
-	rte_pktmbuf_adj(mp, (uint16_t)(mp->vlan_macip.f.l2_len +
-		mp->vlan_macip.f.l3_len));
+	rte_pktmbuf_adj(mp, (uint16_t)(mp->hw_offload.l2_len +
+		mp->hw_offload.l3_len));
 				
 	/* chain two fragments. */
 	ms = rte_pktmbuf_lastseg(mn);
@@ -233,10 +233,10 @@ ipv4_frag_reassemble(const struct ipv4_frag_pkt *fp)
 
 	/* update ipv4 header for the reassmebled packet */
 	ip_hdr = (struct ipv4_hdr*)(rte_pktmbuf_mtod(m, uint8_t *) +
-		m->vlan_macip.f.l2_len);
+		m->hw_offload.l2_len);
 
 	ip_hdr->total_length = rte_cpu_to_be_16((uint16_t)(fp->total_size +
-		m->vlan_macip.f.l3_len));
+		m->hw_offload.l3_len));
 	ip_hdr->fragment_offset = (uint16_t)(ip_hdr->fragment_offset &
 		rte_cpu_to_be_16(IPV4_HDR_DF_FLAG));
 	ip_hdr->hdr_checksum = 0;
@@ -377,7 +377,7 @@ ipv4_frag_mbuf(struct ipv4_frag_tbl *tbl, struct ipv4_frag_death_row *dr,
 
 	ip_ofs *= IPV4_HDR_OFFSET_UNITS;
 	ip_len = (uint16_t)(rte_be_to_cpu_16(ip_hdr->total_length) -
-		mb->vlan_macip.f.l3_len);
+		mb->hw_offload.l3_len);
 
 	IPV4_FRAG_LOG(DEBUG, "%s:%d:\n"
 		"mbuf: %p, tms: %" PRIu64
diff --git a/examples/ip_reassembly/main.c b/examples/ip_reassembly/main.c
index 5c5626a..a817d3d 100644
--- a/examples/ip_reassembly/main.c
+++ b/examples/ip_reassembly/main.c
@@ -680,8 +680,8 @@ l3fwd_simple_forward(struct rte_mbuf *m, uint8_t portid, uint32_t queue,
 			dr = &qconf->death_row;
 
 			/* prepare mbuf: setup l2_len/l3_len. */
-			m->vlan_macip.f.l2_len = sizeof(*eth_hdr);
-			m->vlan_macip.f.l3_len = sizeof(*ipv4_hdr);
+			m->hw_offload.l2_len = sizeof(*eth_hdr);
+			m->hw_offload.l3_len = sizeof(*ipv4_hdr);
 
 			/* process this fragment. */
 			if ((mo = ipv4_frag_mbuf(tbl, dr, m, tms, ipv4_hdr,
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 1cd51c2..d71c86c 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -145,26 +145,22 @@ static inline const char *rte_get_tx_ol_flag_name(uint32_t mask)
 }
 
 /** Offload features */
-union rte_vlan_macip {
-	uint32_t data;
+union rte_hw_offload {
+	uint64_t u64;
 	struct {
-		uint16_t l3_len:9; /**< L3 (IP) Header Length. */
-		uint16_t l2_len:7; /**< L2 (MAC) Header Length. */
+#define HW_OFFLOAD_L2_LEN_MASK 0x7f
+#define HW_OFFLOAD_L3_LEN_MASK 0x1ff
+#define HW_OFFLOAD_L4_LEN_MASK 0xff
+		uint32_t l2_len:7; /**< L2 (MAC) Header Length. */
+		uint32_t l3_len:9; /**< L3 (IP) Header Length. */
+		uint32_t reserved:16;
+
 		uint16_t vlan_tci;
 		/**< VLAN Tag Control Identifier (CPU order). */
-	} f;
+		uint16_t reserved2;
+	};
 };
 
-/*
- * Compare mask for vlan_macip_len.data,
- * should be in sync with rte_vlan_macip.f layout.
- * */
-#define TX_VLAN_CMP_MASK        0xFFFF0000  /**< VLAN length - 16-bits. */
-#define TX_MAC_LEN_CMP_MASK     0x0000FE00  /**< MAC length - 7-bits. */
-#define TX_IP_LEN_CMP_MASK      0x000001FF  /**< IP  length - 9-bits. */
-/**< MAC+IP  length. */
-#define TX_MACIP_LEN_CMP_MASK   (TX_MAC_LEN_CMP_MASK | TX_IP_LEN_CMP_MASK)
-
 /**
  * The generic rte_mbuf, containing a packet mbuf.
  */
@@ -203,7 +199,7 @@ struct rte_mbuf {
 	uint32_t ol_flags;        /**< Offload features. */
 
 	/* offload features, valid for first segment only */
-	union rte_vlan_macip vlan_macip;
+	union rte_hw_offload hw_offload;
 	union {
 		uint32_t rss;     /**< RSS hash result if RSS enabled */
 		struct {
@@ -212,7 +208,7 @@ struct rte_mbuf {
 		} fdir;           /**< Filter identifier if FDIR enabled */
 		uint32_t sched;   /**< Hierarchical scheduler */
 	} hash;                   /**< hash information */
-	uint64_t reserved;        /**< Unused field. Required for padding. */
+	uint32_t reserved;        /**< Unused field. Required for padding. */
 } __rte_cache_aligned;
 
 /**
@@ -479,7 +475,7 @@ static inline void rte_pktmbuf_reset(struct rte_mbuf *m)
 {
 	m->next = NULL;
 	m->pkt_len = 0;
-	m->vlan_macip.data = 0;
+	m->hw_offload.u64 = 0;
 	m->nb_segs = 1;
 	m->in_port = 0xff;
 
@@ -545,7 +541,7 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *md)
 	mi->data_off = md->data_off;
 	mi->data_len = md->data_len;
 	mi->in_port = md->in_port;
-	mi->vlan_macip = md->vlan_macip;
+	mi->hw_offload.u64 = md->hw_offload.u64;
 	mi->hash = md->hash;
 
 	mi->next = NULL;
diff --git a/lib/librte_pmd_e1000/em_rxtx.c b/lib/librte_pmd_e1000/em_rxtx.c
index 015c0af..69bd666 100644
--- a/lib/librte_pmd_e1000/em_rxtx.c
+++ b/lib/librte_pmd_e1000/em_rxtx.c
@@ -148,8 +148,8 @@ enum {
  */
 struct em_ctx_info {
 	uint32_t flags;               /**< ol_flags related to context build. */
-	uint32_t cmp_mask;            /**< compare mask */
-	union rte_vlan_macip hdrlen;  /**< L2 and L3 header lenghts */
+	union rte_hw_offload hw_offload;   /**< l2/l3/l4 length, vlan, mss. */
+	union rte_hw_offload offload_mask; /**< compare mask for hw_offload */
 };
 
 /**
@@ -217,18 +217,18 @@ struct em_tx_queue {
 static inline void
 em_set_xmit_ctx(struct em_tx_queue* txq,
 		volatile struct e1000_context_desc *ctx_txd,
-		uint32_t flags,
-		union rte_vlan_macip hdrlen)
+		uint32_t flags, union rte_hw_offload hw_offload)
 {
-	uint32_t cmp_mask, cmd_len;
+	uint32_t cmd_len;
 	uint16_t ipcse, l2len;
 	struct e1000_context_desc ctx;
+	union rte_hw_offload offload_mask;
 
-	cmp_mask = 0;
+	offload_mask.u64 = 0;
 	cmd_len = E1000_TXD_CMD_DEXT | E1000_TXD_DTYP_C;
 
-	l2len = hdrlen.f.l2_len;
-	ipcse = (uint16_t)(l2len + hdrlen.f.l3_len);
+	l2len = hw_offload.l2_len;
+	ipcse = (uint16_t)(l2len + hw_offload.l3_len);
 
 	/* setup IPCS* fields */
 	ctx.lower_setup.ip_fields.ipcss = (uint8_t)l2len;
@@ -243,7 +243,8 @@ em_set_xmit_ctx(struct em_tx_queue* txq,
 		ctx.lower_setup.ip_fields.ipcse =
 			(uint16_t)rte_cpu_to_le_16(ipcse - 1);
 		cmd_len |= E1000_TXD_CMD_IP;
-		cmp_mask |= TX_MACIP_LEN_CMP_MASK;
+		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
 	} else {
 		ctx.lower_setup.ip_fields.ipcse = 0;
 	}
@@ -256,13 +257,15 @@ em_set_xmit_ctx(struct em_tx_queue* txq,
 	case PKT_TX_UDP_CKSUM:
 		ctx.upper_setup.tcp_fields.tucso = (uint8_t)(ipcse +
 				offsetof(struct udp_hdr, dgram_cksum));
-		cmp_mask |= TX_MACIP_LEN_CMP_MASK;
+		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
 		break;
 	case PKT_TX_TCP_CKSUM:
 		ctx.upper_setup.tcp_fields.tucso = (uint8_t)(ipcse +
 				offsetof(struct tcp_hdr, cksum));
 		cmd_len |= E1000_TXD_CMD_TCP;
-		cmp_mask |= TX_MACIP_LEN_CMP_MASK;
+		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
 		break;
 	default:
 		ctx.upper_setup.tcp_fields.tucso = 0;
@@ -274,8 +277,9 @@ em_set_xmit_ctx(struct em_tx_queue* txq,
 	*ctx_txd = ctx;
 
 	txq->ctx_cache.flags = flags;
-	txq->ctx_cache.cmp_mask = cmp_mask;
-	txq->ctx_cache.hdrlen = hdrlen;
+	txq->ctx_cache.hw_offload.u64  =
+		offload_mask.u64 & hw_offload.u64;
+	txq->ctx_cache.offload_mask    = offload_mask;
 }
 
 /*
@@ -284,12 +288,12 @@ em_set_xmit_ctx(struct em_tx_queue* txq,
  */
 static inline uint32_t
 what_ctx_update(struct em_tx_queue *txq, uint32_t flags,
-		union rte_vlan_macip hdrlen)
+		union rte_hw_offload hw_offload)
 {
 	/* If match with the current context */
 	if (likely (txq->ctx_cache.flags == flags &&
-			((txq->ctx_cache.hdrlen.data ^ hdrlen.data) &
-			txq->ctx_cache.cmp_mask) == 0))
+			((txq->ctx_cache.hw_offload.u64 ^ hw_offload.u64) &
+			txq->ctx_cache.offload_mask.u64) == 0))
 		return (EM_CTX_0);
 
 	/* Mismatch */
@@ -390,7 +394,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint32_t tx_ol_req;
 	uint32_t ctx;
 	uint32_t new_ctx;
-	union rte_vlan_macip hdrlen;
+	union rte_hw_offload hw_offload;
 
 	txq = tx_queue;
 	sw_ring = txq->sw_ring;
@@ -419,9 +423,9 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		/* If hardware offload required */
 		tx_ol_req = (ol_flags & (PKT_TX_IP_CKSUM | PKT_TX_L4_MASK));
 		if (tx_ol_req) {
-			hdrlen = tx_pkt->vlan_macip;
+			hw_offload.u64 = tx_pkt->hw_offload.u64;
 			/* If new context to be built or reuse the exist ctx. */
-			ctx = what_ctx_update(txq, tx_ol_req, hdrlen);
+			ctx = what_ctx_update(txq, tx_ol_req, hw_offload);
 
 			/* Only allocate context descriptor if required*/
 			new_ctx = (ctx == EM_CTX_NUM);
@@ -514,7 +518,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		/* Set VLAN Tag offload fields. */
 		if (ol_flags & PKT_TX_VLAN_PKT) {
 			cmd_type_len |= E1000_TXD_CMD_VLE;
-			popts_spec = tx_pkt->vlan_macip.f.vlan_tci <<
+			popts_spec = tx_pkt->hw_offload.vlan_tci <<
 				E1000_TXD_VLAN_SHIFT;
 		}
 
@@ -537,7 +541,7 @@ eth_em_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 				}
 
 				em_set_xmit_ctx(txq, ctx_txd, tx_ol_req,
-					hdrlen);
+					hw_offload);
 
 				txe->last_id = tx_last;
 				tx_id = txe->next_id;
@@ -782,7 +786,7 @@ eth_em_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 			rx_desc_error_to_pkt_flags(rxd.errors);
 
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
-		rxm->vlan_macip.f.vlan_tci = rte_le_to_cpu_16(rxd.special);
+		rxm->hw_offload.vlan_tci = rte_le_to_cpu_16(rxd.special);
 
 		/*
 		 * Store the mbuf address into the next entry of the array
@@ -1008,7 +1012,7 @@ eth_em_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 			rx_desc_error_to_pkt_flags(rxd.errors);
 
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
-		rxm->vlan_macip.f.vlan_tci = rte_le_to_cpu_16(rxd.special);
+		rxm->hw_offload.vlan_tci = rte_le_to_cpu_16(rxd.special);
 
 		/* Prefetch data of first segment, if configured to do so. */
 		rte_packet_prefetch((char *)first_seg->buf_addr +
diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
index 322dfa0..2db496f 100644
--- a/lib/librte_pmd_e1000/igb_rxtx.c
+++ b/lib/librte_pmd_e1000/igb_rxtx.c
@@ -145,8 +145,8 @@ enum igb_advctx_num {
  */
 struct igb_advctx_info {
 	uint32_t flags;           /**< ol_flags related to context build. */
-	uint32_t cmp_mask;        /**< compare mask for vlan_macip_lens */
-	union rte_vlan_macip vlan_macip_lens; /**< vlan, mac & ip length. */
+	union rte_hw_offload hw_offload;   /**< l2/l3/l4 length, vlan, mss. */
+	union rte_hw_offload offload_mask; /**< compare mask for hw_offload */
 };
 
 /**
@@ -212,26 +212,28 @@ struct igb_tx_queue {
 static inline void
 igbe_set_xmit_ctx(struct igb_tx_queue* txq,
 		volatile struct e1000_adv_tx_context_desc *ctx_txd,
-		uint32_t ol_flags, uint32_t vlan_macip_lens)
+		uint32_t ol_flags, union rte_hw_offload hw_offload)
 {
 	uint32_t type_tucmd_mlhl;
 	uint32_t mss_l4len_idx;
 	uint32_t ctx_idx, ctx_curr;
-	uint32_t cmp_mask;
+	uint32_t vlan_macip_lens;
+	union rte_hw_offload offload_mask;
 
 	ctx_curr = txq->ctx_curr;
 	ctx_idx = ctx_curr + txq->ctx_start;
 
-	cmp_mask = 0;
+	offload_mask.u64 = 0;
 	type_tucmd_mlhl = 0;
 
 	if (ol_flags & PKT_TX_VLAN_PKT) {
-		cmp_mask |= TX_VLAN_CMP_MASK;
+		offload_mask.vlan_tci = 0xffff;
 	}
 
 	if (ol_flags & PKT_TX_IP_CKSUM) {
 		type_tucmd_mlhl = E1000_ADVTXD_TUCMD_IPV4;
-		cmp_mask |= TX_MACIP_LEN_CMP_MASK;
+		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
 	}
 
 	/* Specify which HW CTX to upload. */
@@ -241,19 +243,22 @@ igbe_set_xmit_ctx(struct igb_tx_queue* txq,
 		type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_L4T_UDP |
 				E1000_ADVTXD_DTYP_CTXT | E1000_ADVTXD_DCMD_DEXT;
 		mss_l4len_idx |= sizeof(struct udp_hdr) << E1000_ADVTXD_L4LEN_SHIFT;
-		cmp_mask |= TX_MACIP_LEN_CMP_MASK;
+		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
 		break;
 	case PKT_TX_TCP_CKSUM:
 		type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_L4T_TCP |
 				E1000_ADVTXD_DTYP_CTXT | E1000_ADVTXD_DCMD_DEXT;
 		mss_l4len_idx |= sizeof(struct tcp_hdr) << E1000_ADVTXD_L4LEN_SHIFT;
-		cmp_mask |= TX_MACIP_LEN_CMP_MASK;
+		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
 		break;
 	case PKT_TX_SCTP_CKSUM:
 		type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_L4T_SCTP |
 				E1000_ADVTXD_DTYP_CTXT | E1000_ADVTXD_DCMD_DEXT;
 		mss_l4len_idx |= sizeof(struct sctp_hdr) << E1000_ADVTXD_L4LEN_SHIFT;
-		cmp_mask |= TX_MACIP_LEN_CMP_MASK;
+		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
 		break;
 	default:
 		type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_L4T_RSV |
@@ -262,11 +267,14 @@ igbe_set_xmit_ctx(struct igb_tx_queue* txq,
 	}
 
 	txq->ctx_cache[ctx_curr].flags           = ol_flags;
-	txq->ctx_cache[ctx_curr].cmp_mask        = cmp_mask;
-	txq->ctx_cache[ctx_curr].vlan_macip_lens.data =
-		vlan_macip_lens & cmp_mask;
+	txq->ctx_cache[ctx_curr].hw_offload.u64  =
+		offload_mask.u64 & hw_offload.u64;
+	txq->ctx_cache[ctx_curr].offload_mask    = offload_mask;
 
 	ctx_txd->type_tucmd_mlhl = rte_cpu_to_le_32(type_tucmd_mlhl);
+	vlan_macip_lens = hw_offload.l3_len;
+	vlan_macip_lens |= (hw_offload.l2_len << E1000_ADVTXD_MACLEN_SHIFT);
+	vlan_macip_lens |= ((uint32_t)hw_offload.vlan_tci << E1000_ADVTXD_VLAN_SHIFT);
 	ctx_txd->vlan_macip_lens = rte_cpu_to_le_32(vlan_macip_lens);
 	ctx_txd->mss_l4len_idx   = rte_cpu_to_le_32(mss_l4len_idx);
 	ctx_txd->seqnum_seed     = 0;
@@ -278,20 +286,20 @@ igbe_set_xmit_ctx(struct igb_tx_queue* txq,
  */
 static inline uint32_t
 what_advctx_update(struct igb_tx_queue *txq, uint32_t flags,
-		uint32_t vlan_macip_lens)
+		union rte_hw_offload hw_offload)
 {
 	/* If match with the current context */
 	if (likely((txq->ctx_cache[txq->ctx_curr].flags == flags) &&
-		(txq->ctx_cache[txq->ctx_curr].vlan_macip_lens.data ==
-		(txq->ctx_cache[txq->ctx_curr].cmp_mask & vlan_macip_lens)))) {
+		(txq->ctx_cache[txq->ctx_curr].hw_offload.u64 ==
+		(txq->ctx_cache[txq->ctx_curr].offload_mask.u64 & hw_offload.u64)))) {
 			return txq->ctx_curr;
 	}
 
 	/* If match with the second context */
 	txq->ctx_curr ^= 1;
 	if (likely((txq->ctx_cache[txq->ctx_curr].flags == flags) &&
-		(txq->ctx_cache[txq->ctx_curr].vlan_macip_lens.data ==
-		(txq->ctx_cache[txq->ctx_curr].cmp_mask & vlan_macip_lens)))) {
+		(txq->ctx_cache[txq->ctx_curr].hw_offload.u64 ==
+		(txq->ctx_cache[txq->ctx_curr].offload_mask.u64 & hw_offload.u64)))) {
 			return txq->ctx_curr;
 	}
 
@@ -342,7 +350,7 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint32_t tx_ol_req;
 	uint32_t new_ctx = 0;
 	uint32_t ctx = 0;
-	uint32_t vlan_macip_lens;
+	union rte_hw_offload hw_offload;
 
 	txq = tx_queue;
 	sw_ring = txq->sw_ring;
@@ -367,14 +375,14 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		tx_last = (uint16_t) (tx_id + tx_pkt->nb_segs - 1);
 
 		ol_flags = tx_pkt->ol_flags;
-		vlan_macip_lens = tx_pkt->vlan_macip.data;
+		hw_offload.u64 = tx_pkt->hw_offload.u64;
 		tx_ol_req = ol_flags &
 			(PKT_TX_VLAN_PKT | PKT_TX_IP_CKSUM | PKT_TX_L4_MASK);
 
 		/* If a Context Descriptor need be built . */
 		if (tx_ol_req) {
 			ctx = what_advctx_update(txq, tx_ol_req,
-				vlan_macip_lens);
+				hw_offload);
 			/* Only allocate context descriptor if required*/
 			new_ctx = (ctx == IGB_CTX_NUM);
 			ctx = txq->ctx_curr;
@@ -490,7 +498,7 @@ eth_igb_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 				}
 
 				igbe_set_xmit_ctx(txq, ctx_txd, tx_ol_req,
-				    vlan_macip_lens);
+				    hw_offload);
 
 				txe->last_id = tx_last;
 				tx_id = txe->next_id;
@@ -752,7 +760,7 @@ eth_igb_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		rxm->hash.rss = rxd.wb.lower.hi_dword.rss;
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
-		rxm->vlan_macip.f.vlan_tci =
+		rxm->hw_offload.vlan_tci =
 			rte_le_to_cpu_16(rxd.wb.upper.vlan);
 
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
@@ -989,7 +997,7 @@ eth_igb_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 * The vlan_tci field is only valid when PKT_RX_VLAN_PKT is
 		 * set in the pkt_flags field.
 		 */
-		first_seg->vlan_macip.f.vlan_tci =
+		first_seg->hw_offload.vlan_tci =
 			rte_le_to_cpu_16(rxd.wb.upper.vlan);
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 7096ea6..d52482e 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -350,24 +350,26 @@ ixgbe_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
 static inline void
 ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
 		volatile struct ixgbe_adv_tx_context_desc *ctx_txd,
-		uint32_t ol_flags, uint32_t vlan_macip_lens)
+		uint32_t ol_flags, union rte_hw_offload hw_offload)
 {
 	uint32_t type_tucmd_mlhl;
 	uint32_t mss_l4len_idx;
 	uint32_t ctx_idx;
-	uint32_t cmp_mask;
+	uint32_t vlan_macip_lens;
+	union rte_hw_offload offload_mask;
 
 	ctx_idx = txq->ctx_curr;
-	cmp_mask = 0;
+	offload_mask.u64 = 0;
 	type_tucmd_mlhl = 0;
 
 	if (ol_flags & PKT_TX_VLAN_PKT) {
-		cmp_mask |= TX_VLAN_CMP_MASK;
+		offload_mask.vlan_tci = 0xffff;
 	}
 
 	if (ol_flags & PKT_TX_IP_CKSUM) {
 		type_tucmd_mlhl = IXGBE_ADVTXD_TUCMD_IPV4;
-		cmp_mask |= TX_MACIP_LEN_CMP_MASK;
+		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
 	}
 
 	/* Specify which HW CTX to upload. */
@@ -377,19 +379,22 @@ ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
 		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_UDP |
 				IXGBE_ADVTXD_DTYP_CTXT | IXGBE_ADVTXD_DCMD_DEXT;
 		mss_l4len_idx |= sizeof(struct udp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
-		cmp_mask |= TX_MACIP_LEN_CMP_MASK;
+		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
 		break;
 	case PKT_TX_TCP_CKSUM:
 		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP |
 				IXGBE_ADVTXD_DTYP_CTXT | IXGBE_ADVTXD_DCMD_DEXT;
 		mss_l4len_idx |= sizeof(struct tcp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
-		cmp_mask |= TX_MACIP_LEN_CMP_MASK;
+		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
 		break;
 	case PKT_TX_SCTP_CKSUM:
 		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_SCTP |
 				IXGBE_ADVTXD_DTYP_CTXT | IXGBE_ADVTXD_DCMD_DEXT;
 		mss_l4len_idx |= sizeof(struct sctp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
-		cmp_mask |= TX_MACIP_LEN_CMP_MASK;
+		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
 		break;
 	default:
 		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_RSV |
@@ -398,11 +403,14 @@ ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
 	}
 
 	txq->ctx_cache[ctx_idx].flags = ol_flags;
-	txq->ctx_cache[ctx_idx].cmp_mask = cmp_mask;
-	txq->ctx_cache[ctx_idx].vlan_macip_lens.data =
-		vlan_macip_lens & cmp_mask;
+	txq->ctx_cache[ctx_idx].hw_offload.u64  =
+		offload_mask.u64 & hw_offload.u64;
+	txq->ctx_cache[ctx_idx].offload_mask    = offload_mask;
 
 	ctx_txd->type_tucmd_mlhl = rte_cpu_to_le_32(type_tucmd_mlhl);
+	vlan_macip_lens = hw_offload.l3_len;
+	vlan_macip_lens |= (hw_offload.l2_len << IXGBE_ADVTXD_MACLEN_SHIFT);
+	vlan_macip_lens |= ((uint32_t)hw_offload.vlan_tci << IXGBE_ADVTXD_VLAN_SHIFT);
 	ctx_txd->vlan_macip_lens = rte_cpu_to_le_32(vlan_macip_lens);
 	ctx_txd->mss_l4len_idx   = rte_cpu_to_le_32(mss_l4len_idx);
 	ctx_txd->seqnum_seed     = 0;
@@ -414,20 +422,20 @@ ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
  */
 static inline uint32_t
 what_advctx_update(struct igb_tx_queue *txq, uint32_t flags,
-		uint32_t vlan_macip_lens)
+		union rte_hw_offload hw_offload)
 {
 	/* If match with the current used context */
 	if (likely((txq->ctx_cache[txq->ctx_curr].flags == flags) &&
-		(txq->ctx_cache[txq->ctx_curr].vlan_macip_lens.data ==
-		(txq->ctx_cache[txq->ctx_curr].cmp_mask & vlan_macip_lens)))) {
+		(txq->ctx_cache[txq->ctx_curr].hw_offload.u64 ==
+		(txq->ctx_cache[txq->ctx_curr].offload_mask.u64 & hw_offload.u64)))) {
 			return txq->ctx_curr;
 	}
 
 	/* What if match with the next context  */
 	txq->ctx_curr ^= 1;
 	if (likely((txq->ctx_cache[txq->ctx_curr].flags == flags) &&
-		(txq->ctx_cache[txq->ctx_curr].vlan_macip_lens.data ==
-		(txq->ctx_cache[txq->ctx_curr].cmp_mask & vlan_macip_lens)))) {
+		(txq->ctx_cache[txq->ctx_curr].hw_offload.u64 ==
+		(txq->ctx_cache[txq->ctx_curr].offload_mask.u64 & hw_offload.u64)))) {
 			return txq->ctx_curr;
 	}
 
@@ -543,9 +551,9 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 	uint16_t nb_tx;
 	uint16_t nb_used;
 	uint32_t tx_ol_req;
-	uint32_t vlan_macip_lens;
 	uint32_t ctx = 0;
 	uint32_t new_ctx;
+	union rte_hw_offload hw_offload;
 
 	txq = tx_queue;
 	sw_ring = txq->sw_ring;
@@ -571,7 +579,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		 * are needed for offload functionality.
 		 */
 		ol_flags = tx_pkt->ol_flags;
-		vlan_macip_lens = tx_pkt->vlan_macip.data;
+		hw_offload.u64 = tx_pkt->hw_offload.u64;
 
 		/* If hardware offload required */
 		tx_ol_req = ol_flags &
@@ -579,7 +587,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		if (tx_ol_req) {
 			/* If new context need be built or reuse the exist ctx. */
 			ctx = what_advctx_update(txq, tx_ol_req,
-				vlan_macip_lens);
+				hw_offload);
 			/* Only allocate context descriptor if required*/
 			new_ctx = (ctx == IXGBE_CTX_NUM);
 			ctx = txq->ctx_curr;
@@ -721,7 +729,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 				}
 
 				ixgbe_set_xmit_ctx(txq, ctx_txd, tx_ol_req,
-				    vlan_macip_lens);
+					hw_offload);
 
 				txe->last_id = tx_last;
 				tx_id = txe->next_id;
@@ -932,7 +940,7 @@ ixgbe_rx_scan_hw_ring(struct igb_rx_queue *rxq)
 							rxq->crc_len);
 			mb->data_len = pkt_len;
 			mb->pkt_len = pkt_len;
-			mb->vlan_macip.f.vlan_tci = rxdp[j].wb.upper.vlan;
+			mb->hw_offload.vlan_tci = rxdp[j].wb.upper.vlan;
 			mb->hash.rss = rxdp[j].wb.lower.hi_dword.rss;
 
 			/* convert descriptor fields to rte mbuf flags */
@@ -1250,7 +1258,7 @@ ixgbe_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		/* Only valid if PKT_RX_VLAN_PKT set in pkt_flags */
-		rxm->vlan_macip.f.vlan_tci =
+		rxm->hw_offload.vlan_tci =
 			rte_le_to_cpu_16(rxd.wb.upper.vlan);
 
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
@@ -1495,7 +1503,7 @@ ixgbe_recv_scattered_pkts(void *rx_queue, struct rte_mbuf **rx_pkts,
 		 * The vlan_tci field is only valid when PKT_RX_VLAN_PKT is
 		 * set in the pkt_flags field.
 		 */
-		first_seg->vlan_macip.f.vlan_tci =
+		first_seg->hw_offload.vlan_tci =
 				rte_le_to_cpu_16(rxd.wb.upper.vlan);
 		hlen_type_rss = rte_le_to_cpu_32(rxd.wb.lower.lo_dword.data);
 		pkt_flags = rx_desc_hlen_type_rss_to_pkt_flags(hlen_type_rss);
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
index 571d2ca..978bb19 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.h
@@ -152,7 +152,8 @@ enum ixgbe_advctx_num {
 struct ixgbe_advctx_info {
 	uint32_t flags;           /**< ol_flags for context build. */
 	uint32_t cmp_mask;        /**< compare mask for vlan_macip_lens */
-	union rte_vlan_macip vlan_macip_lens; /**< vlan, mac ip length. */
+	union rte_hw_offload hw_offload;   /**< l2/l3/l4 length, vlan, mss. */
+	union rte_hw_offload offload_mask; /**< compare mask for hw_offload */
 };
 
 /**
diff --git a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
index b5450b2..c85da80 100644
--- a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
+++ b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
@@ -536,7 +536,7 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 					rte_pktmbuf_mtod(rxm, void *));
 #endif
 				//Copy vlan tag in packet buffer
-				rxm->vlan_macip.f.vlan_tci =
+				rxm->hw_offload.vlan_tci =
 					rte_le_to_cpu_16((uint16_t)rcd->tci);
 
 			} else
@@ -549,7 +549,7 @@ vmxnet3_recv_pkts(void *rx_queue, struct rte_mbuf **rx_pkts, uint16_t nb_pkts)
 			rxm->pkt_len = (uint16_t)rcd->len;
 			rxm->data_len = (uint16_t)rcd->len;
 			rxm->in_port = rxq->port_id;
-			rxm->vlan_macip.f.vlan_tci = 0;
+			rxm->hw_offload.vlan_tci = 0;
 			rxm->data_off = RTE_PKTMBUF_HEADROOM;
 
 			rx_pkts[nb_rx++] = rxm;
-- 
1.9.2


* [dpdk-dev] [PATCH RFC 10/11] testpmd: modify source address to validate checksum calculation
  2014-05-09 14:50 [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support Olivier Matz
                   ` (8 preceding siblings ...)
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 09/11] mbuf: rename vlan_macip_len in hw_offload and increase its size Olivier Matz
@ 2014-05-09 14:50 ` Olivier Matz
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 11/11] ixgbe/mbuf: add TSO support Olivier Matz
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 51+ messages in thread
From: Olivier Matz @ 2014-05-09 14:50 UTC (permalink / raw)
  To: dev

Always modify the source address of the packet in order to validate
the calculation of the checksums (L3 or L4). This was already done
for the IPv4 software checksum; add it for the IPv4 hw checksum and IPv6.
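
As a rough illustration (not part of the patch), the sketch below shows why
touching the source address is a useful check: once the address changes, a
checksum that was not recomputed no longer verifies. The helper names are
hypothetical, and touch_src_addr() changes a single byte for simplicity,
whereas the patch decrements the 32-bit src_addr field.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement checksum over an IPv4 header given as raw bytes.
 * With the checksum field zeroed it returns the value to store;
 * over a header holding a correct checksum it returns 0. */
static uint16_t ipv4_hdr_cksum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Returns 1 if the checksum stored in the header still verifies. */
static int ipv4_cksum_is_valid(const uint8_t *hdr, size_t len)
{
	return ipv4_hdr_cksum(hdr, len) == 0;
}

/* Hypothetical helper: alter the last byte of the source address
 * (bytes 12..15 of the IPv4 header). */
static void touch_src_addr(uint8_t *hdr)
{
	hdr[15]--;
}
```

If the forwarding path (software or hardware) really recomputes the checksum
after the address change, the captured packet verifies again; if it does not,
the stale checksum shows up as invalid in the capture.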

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test-pmd/csumonly.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index 9caad8f..e93d75f 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -310,6 +310,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 
 			if (tx_ol_flags & PKT_TX_IP_CKSUM) {
 				/* HW checksum */
+				ipv4_hdr->src_addr--;
 				ol_flags |= PKT_TX_IP_CKSUM;
 			}
 			else {
@@ -373,6 +374,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 					unsigned char *) + l2_len);
 			l3_len = sizeof(struct ipv6_hdr) ;
 			l4_proto = ipv6_hdr->proto;
+			ipv6_hdr->src_addr[3]--;
 
 			if (l4_proto == IPPROTO_UDP) {
 				udp_hdr = (struct udp_hdr*) (rte_pktmbuf_mtod(mb,
-- 
1.9.2


* [dpdk-dev] [PATCH RFC 11/11] ixgbe/mbuf: add TSO support
  2014-05-09 14:50 [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support Olivier Matz
                   ` (9 preceding siblings ...)
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 10/11] testpmd: modify source address to validate checksum calculation Olivier Matz
@ 2014-05-09 14:50 ` Olivier Matz
  2014-05-12 14:30   ` Thomas Monjalon
  2014-05-15 15:09   ` Ananyev, Konstantin
  2014-05-09 17:04 ` [dpdk-dev] [PATCH RFC 00/11] " Stephen Hemminger
  2014-05-19 12:47 ` Thomas Monjalon
  12 siblings, 2 replies; 51+ messages in thread
From: Olivier Matz @ 2014-05-09 14:50 UTC (permalink / raw)
  To: dev

Implement TSO (TCP segmentation offload) in the ixgbe driver. To delegate
the TCP segmentation to the hardware, the user has to:

- set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
  PKT_TX_IP_CKSUM and PKT_TX_TCP_CKSUM)
- fill the mbuf->hw_offload information: l2_len, l3_len, l4_len, mss
- calculate the pseudo header checksum and set it in the TCP header,
  as required when doing hardware TCP checksum offload
- set the IP checksum to 0

This approach seems generic enough to be used for other hw/drivers
in the future.
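
The steps listed above can be sketched as follows. This is only an
illustration under stated assumptions: struct sketch_pkt and
struct sketch_offload are stand-ins that mirror the fields named in this
patch (they are not struct rte_mbuf), sketch_request_tso() is a hypothetical
helper, and the packet is assumed to be untagged Ethernet/IPv4/TCP. The
pseudo header checksum computation is left out.

```c
#include <stdint.h>

/* Flag value as added by this patch. */
#define PKT_TX_TCP_SEG 0x00200000

/* Stand-in mirroring the fields of union rte_hw_offload. */
struct sketch_offload {
	uint32_t l2_len:7;
	uint32_t l3_len:9;
	uint32_t l4_len:8;
	uint16_t mss;
};

/* Hypothetical stand-in for the parts of struct rte_mbuf used here. */
struct sketch_pkt {
	uint32_t ol_flags;
	struct sketch_offload hw_offload;
	uint8_t *data; /* points to the start of the Ethernet header */
};

/* Request TSO for an untagged Ethernet/IPv4/TCP packet. */
static void sketch_request_tso(struct sketch_pkt *p, uint16_t mss)
{
	uint8_t *ip = p->data + 14;                  /* Ethernet header: 14 bytes */
	uint8_t ihl = (uint8_t)((ip[0] & 0x0f) * 4); /* IPv4 header length, bytes */
	uint8_t *tcp = ip + ihl;

	p->ol_flags |= PKT_TX_TCP_SEG;
	p->hw_offload.l2_len = 14;
	p->hw_offload.l3_len = ihl;
	/* TCP data offset: high nibble of byte 12, counted in 32-bit words */
	p->hw_offload.l4_len = (tcp[12] & 0xf0) >> 2;
	p->hw_offload.mss = mss;

	/* IP checksum set to 0: it is left to the hardware */
	ip[10] = 0;
	ip[11] = 0;
	/* The TCP checksum field must still be filled with the pseudo
	 * header checksum by the caller; that part is omitted here. */
}
```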

In the patch, the tx_desc_cksum_flags_to_olinfo() and
tx_desc_ol_flags_to_cmdtype() functions have been reworked to make them
clearer. This does not impact performance as gcc (version 4.8 in my
case) is smart enough to convert the tests into code that does not
contain any branch instruction.
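
To make the rework concrete, here is a reduced sketch of the two styles side
by side, without the TSO case. The flag values for SCTP/UDP come from this
series; PKT_TX_IP_CKSUM and PKT_TX_L4_MASK are assumed values, and POPTS_IXSM
and POPTS_TXSM are placeholders rather than the real IXGBE_ADVTXD_POPTS_*
constants. Both functions return the same result for every flag combination;
whether the second one ends up branch-free depends on the compiler and can be
checked in the generated assembly.

```c
#include <stdint.h>

/* Assumed flag layout, for illustration only. */
#define PKT_TX_IP_CKSUM     0x00020000 /* assumed value */
#define PKT_TX_L4_NO_CKSUM  0x00000000
#define PKT_TX_L4_MASK      0x000C0000 /* assumed: TCP/SCTP/UDP cksum bits */

/* Placeholder descriptor bits, not the real ixgbe values. */
#define POPTS_IXSM 0x1
#define POPTS_TXSM 0x2

/* Previous style: index small lookup tables with a boolean. */
static uint32_t olinfo_table(uint32_t ol_flags)
{
	static const uint32_t l4_olinfo[2] = {0, POPTS_TXSM};
	static const uint32_t l3_olinfo[2] = {0, POPTS_IXSM};
	uint32_t tmp;

	tmp  = l4_olinfo[(ol_flags & PKT_TX_L4_MASK) != PKT_TX_L4_NO_CKSUM];
	tmp |= l3_olinfo[(ol_flags & PKT_TX_IP_CKSUM) != 0];
	return tmp;
}

/* Reworked style: plain tests, easier to extend with new flags. */
static uint32_t olinfo_tests(uint32_t ol_flags)
{
	uint32_t tmp = 0;

	if ((ol_flags & PKT_TX_L4_MASK) != PKT_TX_L4_NO_CKSUM)
		tmp |= POPTS_TXSM;
	if (ol_flags & PKT_TX_IP_CKSUM)
		tmp |= POPTS_IXSM;
	return tmp;
}
```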

validation
==========

platform:

  Tester (linux)   <---->   DUT (DPDK)

Run testpmd on DUT:

  cd dpdk.org/
  make install T=x86_64-default-linuxapp-gcc
  cd x86_64-default-linuxapp-gcc/
  modprobe uio
  insmod kmod/igb_uio.ko
  python ../tools/igb_uio_bind.py -b igb_uio 0000:02:00.0
  echo 0 > /proc/sys/kernel/randomize_va_space
  echo 1000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  echo 1000 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
  mount -t hugetlbfs none /mnt/huge
  ./app/testpmd -c 0x55 -n 4 -m 800 -- -i --port-topology=chained

Disable all offload features on Tester, and start capture:

  ethtool -K ixgbe0 rx off tx off tso off gso off gro off lro off
  ip l set ixgbe0 up
  tcpdump -n -e -i ixgbe0 -s 0 -w /tmp/cap

We use the following scapy script for testing:

  def test():
    ############### IPv4
    # checksum TCP
    p=Ether()/IP(src=RandIP(), dst=RandIP())/TCP(flags=0x10)/Raw(RandString(50))
    sendp(p, iface="ixgbe0", count=5)
    # checksum UDP
    p=Ether()/IP(src=RandIP(), dst=RandIP())/UDP()/Raw(RandString(50))
    sendp(p, iface="ixgbe0", count=5)
    # bad IP checksum
    p=Ether()/IP(src=RandIP(), dst=RandIP(), chksum=0x1234)/TCP(flags=0x10)/Raw(RandString(50))
    sendp(p, iface="ixgbe0", count=5)
    # bad TCP checksum
    p=Ether()/IP(src=RandIP(), dst=RandIP())/TCP(flags=0x10, chksum=0x1234)/Raw(RandString(50))
    sendp(p, iface="ixgbe0", count=5)
    # large packet
    p=Ether()/IP(src=RandIP(), dst=RandIP())/TCP(flags=0x10)/Raw(RandString(1400))
    sendp(p, iface="ixgbe0", count=5)
    ############### IPv6
    # checksum TCP
    p=Ether()/IPv6(src=RandIP6(), dst=RandIP6())/TCP(flags=0x10)/Raw(RandString(50))
    sendp(p, iface="ixgbe0", count=5)
    # checksum UDP
    p=Ether()/IPv6(src=RandIP6(), dst=RandIP6())/UDP()/Raw(RandString(50))
    sendp(p, iface="ixgbe0", count=5)
    # bad TCP checksum
    p=Ether()/IPv6(src=RandIP6(), dst=RandIP6())/TCP(flags=0x10, chksum=0x1234)/Raw(RandString(50))
    sendp(p, iface="ixgbe0", count=5)
    # large packet
    p=Ether()/IPv6(src=RandIP6(), dst=RandIP6())/TCP(flags=0x10)/Raw(RandString(1400))
    sendp(p, iface="ixgbe0", count=5)

Without hw cksum
----------------

On DUT:

  # disable hw cksum (use sw) in csumonly test, disable tso
  stop
  set fwd csum
  tx_checksum set 0x0 0
  tso set 0 0
  start

On tester:

  >>> test()

Then check the capture file.

With hw cksum
-------------

On DUT:

  # enable hw cksum in csumonly test, disable tso
  stop
  set fwd csum
  tx_checksum set 0xf 0
  tso set 0 0
  start

On tester:

  >>> test()

Then check the capture file.

With TSO
--------

On DUT:

  set fwd csum
  tx_checksum set 0xf 0
  tso set 800 0
  start

On tester:

  >>> test()

Then check the capture file.
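
As a hint for reading the capture, the sketch below computes how a TCP
payload is expected to be cut for a given MSS, assuming the hardware emits
MSS-sized segments followed by one shorter segment for the remainder. The
function names are hypothetical. With "tso set 800 0", the 1400-byte payload
of the "large packet" tests should appear as two segments of 800 and 600
bytes, while the 50-byte packets should stay in one piece.

```c
#include <stdint.h>

/* Number of TCP segments expected for a payload and an MSS.
 * An MSS of 0 means TSO is disabled: the packet is sent as is. */
static uint32_t tso_nb_segs(uint32_t payload_len, uint16_t mss)
{
	if (mss == 0 || payload_len == 0)
		return 1;
	return (payload_len + mss - 1) / mss;
}

/* Payload length carried by segment idx (0-based).
 * The caller must ensure idx < tso_nb_segs(payload_len, mss). */
static uint32_t tso_seg_len(uint32_t payload_len, uint16_t mss, uint32_t idx)
{
	uint32_t off;

	if (mss == 0)
		return payload_len;
	off = idx * mss;
	return (payload_len - off < mss) ? payload_len - off : mss;
}
```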

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test-pmd/cmdline.c            |  45 +++++++++++
 app/test-pmd/config.c             |   8 ++
 app/test-pmd/csumonly.c           |  16 ++++
 app/test-pmd/testpmd.h            |   2 +
 lib/librte_mbuf/rte_mbuf.h        |   7 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 165 ++++++++++++++++++++++++++++----------
 6 files changed, 200 insertions(+), 43 deletions(-)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index a95b279..c628773 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -2305,6 +2305,50 @@ cmdline_parse_inst_t cmd_tx_cksum_set = {
 	},
 };
 
+/* *** ENABLE HARDWARE SEGMENTATION IN TX PACKETS *** */
+struct cmd_tso_set_result {
+	cmdline_fixed_string_t tso;
+	cmdline_fixed_string_t set;
+	uint16_t mss;
+	uint8_t port_id;
+};
+
+static void
+cmd_tso_set_parsed(void *parsed_result,
+		       __attribute__((unused)) struct cmdline *cl,
+		       __attribute__((unused)) void *data)
+{
+	struct cmd_tso_set_result *res = parsed_result;
+	tso_set(res->port_id, res->mss);
+}
+
+cmdline_parse_token_string_t cmd_tso_set_tso =
+	TOKEN_STRING_INITIALIZER(struct cmd_tso_set_result,
+				tso, "tso");
+cmdline_parse_token_string_t cmd_tso_set_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_tso_set_result,
+				set, "set");
+cmdline_parse_token_num_t cmd_tso_set_mss =
+	TOKEN_NUM_INITIALIZER(struct cmd_tso_set_result,
+				mss, UINT16);
+cmdline_parse_token_num_t cmd_tso_set_portid =
+	TOKEN_NUM_INITIALIZER(struct cmd_tso_set_result,
+				port_id, UINT8);
+
+cmdline_parse_inst_t cmd_tso_set = {
+	.f = cmd_tso_set_parsed,
+	.data = NULL,
+	.help_str = "Enable hardware segmentation (set MSS to 0 to disable): "
+	"tso set <MSS> <PORT>",
+	.tokens = {
+		(void *)&cmd_tso_set_tso,
+		(void *)&cmd_tso_set_set,
+		(void *)&cmd_tso_set_mss,
+		(void *)&cmd_tso_set_portid,
+		NULL,
+	},
+};
+
 /* *** ENABLE/DISABLE FLUSH ON RX STREAMS *** */
 struct cmd_set_flush_rx {
 	cmdline_fixed_string_t set;
@@ -5151,6 +5195,7 @@ cmdline_parse_ctx_t main_ctx[] = {
 	(cmdline_parse_inst_t *)&cmd_tx_vlan_set,
 	(cmdline_parse_inst_t *)&cmd_tx_vlan_reset,
 	(cmdline_parse_inst_t *)&cmd_tx_cksum_set,
+	(cmdline_parse_inst_t *)&cmd_tso_set,
 	(cmdline_parse_inst_t *)&cmd_link_flow_control_set,
 	(cmdline_parse_inst_t *)&cmd_priority_flow_control_set,
 	(cmdline_parse_inst_t *)&cmd_config_dcb,
diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index cd82f60..a6d749d 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -1455,6 +1455,14 @@ tx_cksum_set(portid_t port_id, uint32_t ol_flags)
 }
 
 void
+tso_set(portid_t port_id, uint16_t mss)
+{
+	if (port_id_is_invalid(port_id))
+		return;
+	ports[port_id].tx_mss = mss;
+}
+
+void
 fdir_add_signature_filter(portid_t port_id, uint8_t queue_id,
 			  struct rte_fdir_filter *fdir_filter)
 {
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index e93d75f..9983618 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -220,10 +220,12 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 	uint32_t ol_flags;
 	uint32_t pkt_ol_flags;
 	uint32_t tx_ol_flags;
+	uint16_t tx_mss;
 	uint16_t l4_proto;
 	uint16_t eth_type;
 	uint8_t  l2_len;
 	uint8_t  l3_len;
+	uint8_t  l4_len;
 
 	uint32_t rx_bad_ip_csum;
 	uint32_t rx_bad_l4_csum;
@@ -255,6 +257,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 
 	txp = &ports[fs->tx_port];
 	tx_ol_flags = txp->tx_ol_flags;
+	tx_mss = txp->tx_mss;
 
 	for (i = 0; i < nb_rx; i++) {
 
@@ -272,6 +275,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 				((uintptr_t)&eth_hdr->ether_type +
 				sizeof(struct vlan_hdr)));
 		}
+		l4_len  = 0;
 
 		/* Update the L3/L4 checksum error packet count  */
 		rx_bad_ip_csum += ((pkt_ol_flags & PKT_RX_IP_CKSUM_BAD) != 0);
@@ -347,6 +351,11 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 					tcp_hdr->cksum = get_ipv4_udptcp_checksum(ipv4_hdr,
 							(uint16_t*)tcp_hdr);
 				}
+
+				if (tx_mss != 0) {
+					ol_flags |= PKT_TX_TCP_SEG;
+					l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
+				}
 			}
 			else if (l4_proto == IPPROTO_SCTP) {
 				sctp_hdr = (struct sctp_hdr*) (rte_pktmbuf_mtod(mb,
@@ -404,6 +413,11 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 					tcp_hdr->cksum = get_ipv6_udptcp_checksum(ipv6_hdr,
 							(uint16_t*)tcp_hdr);
 				}
+
+				if (tx_mss != 0) {
+					ol_flags |= PKT_TX_TCP_SEG;
+					l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
+				}
 			}
 			else if (l4_proto == IPPROTO_SCTP) {
 				sctp_hdr = (struct sctp_hdr*) (rte_pktmbuf_mtod(mb,
@@ -434,6 +448,8 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 		/* Combine the packet header write. VLAN is not consider here */
 		mb->hw_offload.l2_len = l2_len;
 		mb->hw_offload.l3_len = l3_len;
+		mb->hw_offload.l4_len = l4_len;
+		mb->hw_offload.mss = tx_mss;
 		mb->ol_flags = ol_flags;
 	}
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_rx);
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 77dcc30..6f567e7 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -134,6 +134,7 @@ struct rte_port {
 	struct fwd_stream       *tx_stream; /**< Port TX stream, if unique */
 	unsigned int            socket_id;  /**< For NUMA support */
 	uint32_t                tx_ol_flags;/**< Offload Flags of TX packets. */
+	uint16_t                tx_mss;     /**< MSS for segmentation offload. */
 	uint16_t                tx_vlan_id; /**< Tag Id. in TX VLAN packets. */
 	void                    *fwd_ctx;   /**< Forwarding mode context */
 	uint64_t                rx_bad_ip_csum; /**< rx pkts with bad ip checksum  */
@@ -480,6 +481,7 @@ void tx_vlan_reset(portid_t port_id);
 void set_qmap(portid_t port_id, uint8_t is_rx, uint16_t queue_id, uint8_t map_value);
 
 void tx_cksum_set(portid_t port_id, uint32_t ol_flags);
+void tso_set(portid_t port_id, uint16_t mss);
 
 void set_verbose_level(uint16_t vb_level);
 void set_tx_pkt_segments(unsigned *seg_lengths, unsigned nb_segs);
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index d71c86c..75298bd 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -96,6 +96,7 @@ extern "C" {
 #define PKT_TX_SCTP_CKSUM    0x00080000 /**< SCTP cksum of TX pkt. computed by NIC. */
 #define PKT_TX_UDP_CKSUM     0x000C0000 /**< UDP cksum of TX pkt. computed by NIC. */
 #define PKT_TX_IEEE1588_TMST 0x00100000 /**< TX IEEE1588 packet to timestamp. */
+#define PKT_TX_TCP_SEG       0x00200000 /**< TCP segmentation offload. */
 
 /**
  * Get the name of a RX offload flag
@@ -140,6 +141,7 @@ static inline const char *rte_get_tx_ol_flag_name(uint32_t mask)
 	case PKT_TX_SCTP_CKSUM: return "PKT_TX_SCTP_CKSUM";
 	case PKT_TX_UDP_CKSUM: return "PKT_TX_UDP_CKSUM";
 	case PKT_TX_IEEE1588_TMST: return "PKT_TX_IEEE1588_TMST";
+	case PKT_TX_TCP_SEG: return "PKT_TX_TCP_SEG";
 	default: return NULL;
 	}
 }
@@ -153,11 +155,12 @@ union rte_hw_offload {
 #define HW_OFFLOAD_L4_LEN_MASK 0xff
 		uint32_t l2_len:7; /**< L2 (MAC) Header Length. */
 		uint32_t l3_len:9; /**< L3 (IP) Header Length. */
-		uint32_t reserved:16;
+		uint32_t l4_len:8; /**< L4 (TCP/UDP) Header Length. */
+		uint32_t reserved:8;
 
 		uint16_t vlan_tci;
 		/**< VLAN Tag Control Identifier (CPU order). */
-		uint16_t reserved2;
+		uint16_t mss; /**< Maximum segment size. */
 	};
 };
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index d52482e..75ff16e 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -347,13 +347,59 @@ ixgbe_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
 	return nb_tx;
 }
 
+/* When doing TSO, the IP length must not be included in the pseudo
+ * header checksum of the packet given to the hardware */
+static inline void
+ixgbe_fix_tcp_phdr_cksum(struct rte_mbuf *m)
+{
+	char *data;
+	uint16_t *cksum_ptr;
+	uint16_t prev_cksum;
+	uint16_t new_cksum;
+	uint16_t ip_len, ip_paylen;
+	uint32_t tmp;
+	uint8_t ip_version;
+
+	/* get phdr cksum at offset 16 of TCP header */
+	data = rte_pktmbuf_mtod(m, char *);
+	cksum_ptr = (uint16_t *)(data + m->hw_offload.l2_len +
+		m->hw_offload.l3_len + 16);
+	prev_cksum = *cksum_ptr;
+
+	/* get ip_version */
+	ip_version = (*(uint8_t *)(data + m->hw_offload.l2_len)) >> 4;
+
+	/* get ip_len at offset 2 of IP header or offset 4 of IPv6 header */
+	if (ip_version == 4) {
+		/* override ip cksum to 0 */
+		data[m->hw_offload.l2_len + 10] = 0;
+		data[m->hw_offload.l2_len + 11] = 0;
+
+		ip_len = *(uint16_t *)(data + m->hw_offload.l2_len + 2);
+		ip_paylen = rte_cpu_to_be_16(rte_be_to_cpu_16(ip_len) -
+			m->hw_offload.l3_len);
+	} else {
+		ip_paylen = *(uint16_t *)(data + m->hw_offload.l2_len + 4);
+	}
+
+	/* calculate the new phdr checksum that doesn't include ip_paylen */
+	tmp = prev_cksum ^ 0xffff;
+	if (tmp < ip_paylen)
+		tmp += 0xffff;
+	tmp -= ip_paylen;
+	new_cksum = tmp;
+
+	/* replace it in the packet */
+	*cksum_ptr = new_cksum;
+}
+
 static inline void
 ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
 		volatile struct ixgbe_adv_tx_context_desc *ctx_txd,
 		uint32_t ol_flags, union rte_hw_offload hw_offload)
 {
 	uint32_t type_tucmd_mlhl;
-	uint32_t mss_l4len_idx;
+	uint32_t mss_l4len_idx = 0;
 	uint32_t ctx_idx;
 	uint32_t vlan_macip_lens;
 	union rte_hw_offload offload_mask;
@@ -362,44 +408,61 @@ ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
 	offload_mask.u64 = 0;
 	type_tucmd_mlhl = 0;
 
+	/* Specify which HW CTX to upload. */
+	mss_l4len_idx |= (ctx_idx << IXGBE_ADVTXD_IDX_SHIFT);
+
 	if (ol_flags & PKT_TX_VLAN_PKT) {
 		offload_mask.vlan_tci = 0xffff;
 	}
 
-	if (ol_flags & PKT_TX_IP_CKSUM) {
-		type_tucmd_mlhl = IXGBE_ADVTXD_TUCMD_IPV4;
+	/* check if TCP segmentation required for this packet */
+	if (ol_flags & PKT_TX_TCP_SEG) {
+		/* implies IP cksum and TCP cksum */
+		type_tucmd_mlhl = IXGBE_ADVTXD_TUCMD_IPV4 |
+			IXGBE_ADVTXD_TUCMD_L4T_TCP |
+			IXGBE_ADVTXD_DTYP_CTXT | IXGBE_ADVTXD_DCMD_DEXT;
+
 		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
 		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
-	}
+		offload_mask.l4_len = HW_OFFLOAD_L4_LEN_MASK;
+		offload_mask.mss = 0xffff;
+		mss_l4len_idx |= hw_offload.mss << IXGBE_ADVTXD_MSS_SHIFT;
+		mss_l4len_idx |= hw_offload.l4_len << IXGBE_ADVTXD_L4LEN_SHIFT;
+	} else { /* no TSO, check if hardware checksum is needed */
+		if (ol_flags & PKT_TX_IP_CKSUM) {
+			type_tucmd_mlhl = IXGBE_ADVTXD_TUCMD_IPV4;
+			offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+			offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
+		}
 
-	/* Specify which HW CTX to upload. */
-	mss_l4len_idx = (ctx_idx << IXGBE_ADVTXD_IDX_SHIFT);
-	switch (ol_flags & PKT_TX_L4_MASK) {
-	case PKT_TX_UDP_CKSUM:
-		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_UDP |
+		switch (ol_flags & PKT_TX_L4_MASK) {
+		case PKT_TX_UDP_CKSUM:
+			type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_UDP |
 				IXGBE_ADVTXD_DTYP_CTXT | IXGBE_ADVTXD_DCMD_DEXT;
-		mss_l4len_idx |= sizeof(struct udp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
-		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
-		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
-		break;
-	case PKT_TX_TCP_CKSUM:
-		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP |
+			mss_l4len_idx |= sizeof(struct udp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
+			offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+			offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
+			break;
+		case PKT_TX_TCP_CKSUM:
+			type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP |
 				IXGBE_ADVTXD_DTYP_CTXT | IXGBE_ADVTXD_DCMD_DEXT;
-		mss_l4len_idx |= sizeof(struct tcp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
-		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
-		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
-		break;
-	case PKT_TX_SCTP_CKSUM:
-		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_SCTP |
+			mss_l4len_idx |= sizeof(struct tcp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
+			offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+			offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
+			offload_mask.l4_len = HW_OFFLOAD_L4_LEN_MASK;
+			break;
+		case PKT_TX_SCTP_CKSUM:
+			type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_SCTP |
 				IXGBE_ADVTXD_DTYP_CTXT | IXGBE_ADVTXD_DCMD_DEXT;
-		mss_l4len_idx |= sizeof(struct sctp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
-		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
-		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
-		break;
-	default:
-		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_RSV |
+			mss_l4len_idx |= sizeof(struct sctp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
+			offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+			offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
+			break;
+		default:
+			type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_RSV |
 				IXGBE_ADVTXD_DTYP_CTXT | IXGBE_ADVTXD_DCMD_DEXT;
-		break;
+			break;
+		}
 	}
 
 	txq->ctx_cache[ctx_idx].flags = ol_flags;
@@ -446,20 +509,25 @@ what_advctx_update(struct igb_tx_queue *txq, uint32_t flags,
 static inline uint32_t
 tx_desc_cksum_flags_to_olinfo(uint32_t ol_flags)
 {
-	static const uint32_t l4_olinfo[2] = {0, IXGBE_ADVTXD_POPTS_TXSM};
-	static const uint32_t l3_olinfo[2] = {0, IXGBE_ADVTXD_POPTS_IXSM};
-	uint32_t tmp;
-
-	tmp  = l4_olinfo[(ol_flags & PKT_TX_L4_MASK)  != PKT_TX_L4_NO_CKSUM];
-	tmp |= l3_olinfo[(ol_flags & PKT_TX_IP_CKSUM) != 0];
+	uint32_t tmp = 0;
+	if ((ol_flags & PKT_TX_L4_MASK) != PKT_TX_L4_NO_CKSUM)
+		tmp |= IXGBE_ADVTXD_POPTS_TXSM;
+	if (ol_flags & PKT_TX_IP_CKSUM)
+		tmp |= IXGBE_ADVTXD_POPTS_IXSM;
+	if (ol_flags & PKT_TX_TCP_SEG)
+		tmp |= IXGBE_ADVTXD_POPTS_TXSM | IXGBE_ADVTXD_POPTS_IXSM;
 	return tmp;
 }
 
 static inline uint32_t
-tx_desc_vlan_flags_to_cmdtype(uint32_t ol_flags)
+tx_desc_ol_flags_to_cmdtype(uint32_t ol_flags)
 {
-	static const uint32_t vlan_cmd[2] = {0, IXGBE_ADVTXD_DCMD_VLE};
-	return vlan_cmd[(ol_flags & PKT_TX_VLAN_PKT) != 0];
+	uint32_t cmdtype = 0;
+	if (ol_flags & PKT_TX_VLAN_PKT)
+		cmdtype |= IXGBE_ADVTXD_DCMD_VLE;
+	if (ol_flags & PKT_TX_TCP_SEG)
+		cmdtype |= IXGBE_ADVTXD_DCMD_TSE;
+	return cmdtype;
 }
 
 /* Default RS bit threshold values */
@@ -583,7 +651,8 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 
 		/* If hardware offload required */
 		tx_ol_req = ol_flags &
-			(PKT_TX_VLAN_PKT | PKT_TX_IP_CKSUM | PKT_TX_L4_MASK);
+			(PKT_TX_VLAN_PKT | PKT_TX_IP_CKSUM | PKT_TX_L4_MASK |
+			PKT_TX_TCP_SEG);
 		if (tx_ol_req) {
 			/* If new context need be built or reuse the exist ctx. */
 			ctx = what_advctx_update(txq, tx_ol_req,
@@ -702,7 +771,20 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		 */
 		cmd_type_len = IXGBE_ADVTXD_DTYP_DATA |
 			IXGBE_ADVTXD_DCMD_IFCS | IXGBE_ADVTXD_DCMD_DEXT;
+
+		if (ol_flags & PKT_TX_TCP_SEG) {
+			/* paylen in descriptor is not the packet
+			 * len but the tcp payload len if TSO is on */
+			pkt_len -= (hw_offload.l2_len + hw_offload.l3_len +
+				hw_offload.l4_len);
+
+			/* the pseudo header checksum must be modified:
+			 * it should not include the ip_len */
+			ixgbe_fix_tcp_phdr_cksum(tx_pkt);
+		}
+
 		olinfo_status = (pkt_len << IXGBE_ADVTXD_PAYLEN_SHIFT);
+
 #ifdef RTE_LIBRTE_IEEE1588
 		if (ol_flags & PKT_TX_IEEE1588_TMST)
 			cmd_type_len |= IXGBE_ADVTXD_MAC_1588;
@@ -741,7 +823,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 			 * This path will go through
 			 * whatever new/reuse the context descriptor
 			 */
-			cmd_type_len  |= tx_desc_vlan_flags_to_cmdtype(ol_flags);
+			cmd_type_len  |= tx_desc_ol_flags_to_cmdtype(ol_flags);
 			olinfo_status |= tx_desc_cksum_flags_to_olinfo(ol_flags);
 			olinfo_status |= ctx << IXGBE_ADVTXD_IDX_SHIFT;
 		}
@@ -3420,9 +3502,10 @@ ixgbe_dev_tx_init(struct rte_eth_dev *dev)
 	PMD_INIT_FUNC_TRACE();
 	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
 
-	/* Enable TX CRC (checksum offload requirement) */
+	/* Enable TX CRC (checksum offload requirement) and hw padding
+	 * (TSO requirement) */
 	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
-	hlreg0 |= IXGBE_HLREG0_TXCRCEN;
+	hlreg0 |= (IXGBE_HLREG0_TXCRCEN | IXGBE_HLREG0_TXPADEN);
 	IXGBE_WRITE_REG(hw, IXGBE_HLREG0, hlreg0);
 
 	/* Setup the Base and Length of the Tx Descriptor Rings */
-- 
1.9.2


* Re: [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield Olivier Matz
@ 2014-05-09 15:39   ` Shaw, Jeffrey B
  2014-05-09 16:06     ` Olivier MATZ
  0 siblings, 1 reply; 51+ messages in thread
From: Shaw, Jeffrey B @ 2014-05-09 15:39 UTC (permalink / raw)
  To: Olivier Matz, dev

Hello Olivier, have you tested this patch to see if there is a negative impact to performance?
Wouldn't the processor have to mask the high bytes of the physical address when it is used, for example, to populate descriptors with buffer addresses?  When compute bound, this could steal CPU cycles away from packet processing.  I think we should understand the performance trade-off in order to save these 2 bytes.

It would be interesting to see how throughput is impacted when the workload is core-bound.  This could be accomplished by running testpmd in io-fwd mode across 4x 10G ports.

Thanks,
Jeff

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
Sent: Friday, May 09, 2014 7:51 AM
To: dev@dpdk.org
Subject: [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield

The physical address is never greater than (1 << 48) = 256 TB.
We can win 2 bytes in the mbuf structure by merging the physical address and the buffer length in the same bitfield.

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 lib/librte_mbuf/rte_mbuf.c | 3 ++-
 lib/librte_mbuf/rte_mbuf.h | 7 ++++---
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index c229525..9879095 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -104,7 +104,8 @@ rte_pktmbuf_init(struct rte_mempool *mp,
 	m->buf_len = (uint16_t)buf_len;
 
 	/* keep some headroom between start of buffer and data */
-	m->data = (char*) m->buf_addr + RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
+	m->data = (char*) m->buf_addr + RTE_MIN(RTE_PKTMBUF_HEADROOM,
+		(uint16_t)m->buf_len);
 
 	/* init some constant fields */
 	m->pool = mp;
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 803b223..275f6b2 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -130,8 +130,8 @@ union rte_vlan_macip {
 struct rte_mbuf {
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
 	void *buf_addr;           /**< Virtual address of segment buffer. */
-	phys_addr_t buf_physaddr; /**< Physical address of segment buffer. */
-	uint16_t buf_len;         /**< Length of segment buffer. */
+	uint64_t buf_physaddr:48; /**< Physical address of segment buffer. */
+	uint64_t buf_len:16;      /**< Length of segment buffer. */
 #ifdef RTE_MBUF_REFCNT
 	/**
 	 * 16-bit Reference counter.
@@ -148,8 +148,9 @@ struct rte_mbuf {
 #else
 	uint16_t refcnt_reserved;     /**< Do not use this field */
 #endif
-	uint16_t reserved;             /**< Unused field. Required for padding. */
+
 	uint16_t ol_flags;            /**< Offload features. */
+	uint32_t reserved;             /**< Unused field. Required for padding. */
 
 	/* valid for any segment */
 	struct rte_mbuf *next;  /**< Next segment of scattered packet. */
--
1.9.2

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield
  2014-05-09 15:39   ` Shaw, Jeffrey B
@ 2014-05-09 16:06     ` Olivier MATZ
  2014-05-09 16:11       ` Shaw, Jeffrey B
  0 siblings, 1 reply; 51+ messages in thread
From: Olivier MATZ @ 2014-05-09 16:06 UTC (permalink / raw)
  To: Shaw, Jeffrey B, dev

Hi Jeff,

Thank you for your comment.

On 05/09/2014 05:39 PM, Shaw, Jeffrey B wrote:
> have you tested this patch to see if there is a negative impact to
> performance?

Yes, but not with testpmd. I ran our internal non-regression
performance tests and they show no difference (or one below the error
margin), even with low-overhead processing such as forwarding,
whatever the number of cores I use.

> Wouldn't the processor have to mask the high bytes of the physical
> address when it is used, for example, to populate descriptors with
> buffer addresses?  When compute bound, this could steal CPU cycles
> away from packet processing.  I think we should understand the
> performance trade-off in order to save these 2 bytes.

I would naively say that the cost is negligible: accessing the
length is the same as before (it's a 16-bit field), and accessing
the physical address is just a mask or a shift, which should not
take long on an Intel processor (1 cycle?). Compare this with the
number of cycles per packet in io-fwd mode, which is probably
around 150 or 200.
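
For illustration, the merged field can be sketched as below. This is only
a sketch, assuming a compiler that accepts uint64_t bit-fields (GCC and
Clang do); struct packed_buf and its two helpers are hypothetical names
and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two mbuf fields merged by the patch. */
struct packed_buf {
	uint64_t buf_physaddr:48; /* physical address, assumed < 2^48 */
	uint64_t buf_len:16;      /* length of the segment buffer */
};

/* Store both values in the shared 64-bit word. */
static inline void
packed_buf_set(struct packed_buf *b, uint64_t physaddr, uint16_t len)
{
	/* The scheme only works if the address fits in 48 bits. */
	assert(physaddr < (UINT64_C(1) << 48));
	b->buf_physaddr = physaddr;
	b->buf_len = len;
}

/* Reading the address makes the compiler mask (or shift) the word. */
static inline uint64_t
packed_buf_physaddr(const struct packed_buf *b)
{
	return b->buf_physaddr;
}
```

With such a layout both fields share one 8-byte word on the usual x86_64
ABIs, which is where the 2 saved bytes come from.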

> It would be interesting to see how throughput is impacted when the
> workload is core-bound.  This could be accomplished by running testpmd
> in io-fwd mode across 4x 10G ports.

I agree, this is something we could check. If you agree, let's first
wait for some other comments and see if we find a consensus on the
patches.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield
  2014-05-09 16:06     ` Olivier MATZ
@ 2014-05-09 16:11       ` Shaw, Jeffrey B
  2014-05-14 14:07         ` Ananyev, Konstantin
  2014-05-19  7:27         ` Olivier MATZ
  0 siblings, 2 replies; 51+ messages in thread
From: Shaw, Jeffrey B @ 2014-05-09 16:11 UTC (permalink / raw)
  To: Olivier MATZ, dev

I agree, we should wait for comments then test the performance when the patches have settled.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support
  2014-05-09 14:50 [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support Olivier Matz
                   ` (10 preceding siblings ...)
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 11/11] ixgbe/mbuf: add TSO support Olivier Matz
@ 2014-05-09 17:04 ` Stephen Hemminger
  2014-05-09 21:49   ` Olivier MATZ
  2014-05-19 12:47 ` Thomas Monjalon
  12 siblings, 1 reply; 51+ messages in thread
From: Stephen Hemminger @ 2014-05-09 17:04 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev

On Fri,  9 May 2014 16:50:27 +0200
Olivier Matz <olivier.matz@6wind.com> wrote:

> This series add TSO support in ixgbe DPDK driver. As discussed
> previously on the list [1], one problem is that there is not enough room
> in rte_mbuf today to store the required information to implement this
> feature:
>   - a new ol_flag
>   - the MSS
>   - the L4 header len
> 
> A solution would be to increase the size of the mbuf to 2 cache lines
> but it could have a bad impact on performance. This series proposes some
> rework to drastically reduce the size of the rte_mbuf structures before
> implementing the TSO, avoiding to change the mbuf size to 128 bytes.
> 
> After the rework of mbuf structures, the size of rte_mbuf structure is
> reduced by 9 bytes. The implementation of TSO requires to double the
> size of ol_flags (16 to 32 bits) and to double the size of offload
> information in order to add the mss and the l4 header length (32 to 64
> bits). At the end of the whole series, sizeof(rte_mbuf) is still 64
> bytes and 4 bytes are available for future use.
> 
> This rework causes a lot of modifications in the mbuf structure,
> implying some changes in the applications that directly use the mbuf
> structure fields instead of using the API functions (sometimes there is
> no function). That's why this series is a RFC. In my opinion, it's the
> proper moment for this evolution as the 1.7.0 window is open.
> 
> About TSO, the new fields in mbuf try to be generic enough to apply to
> other hardware in the future. To delegate the TCP segmentation to the
> hardware, the user has to:
> 
>   - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
>     PKT_TX_IP_CKSUM and PKT_TX_TCP_CKSUM)
>   - fill the mbuf->hw_offload information: l2_len, l3_len, l4_len, mss
>   - calculate the pseudo header checksum and set it in the TCP header,
>     as required when doing hardware TCP checksum offload
>   - set the IP checksum to 0
> 
> Compilation of DPDK and examples is tested for the following
> targets: x86_64-*-linuxapp-gcc, i686-*-linuxapp-gcc, x86_64-*-bsdapp-gcc
> 
> The mbuf rework series is validated with autotests:
> 
>   cd dpdk.org/
>   make install T=x86_64-default-linuxapp-gcc
>   cd x86_64-default-linuxapp-gcc/
>   modprobe uio
>   insmod kmod/igb_uio.ko
>   python ../tools/igb_uio_bind.py -b igb_uio 0000:02:00.0
>   echo 0 > /proc/sys/kernel/randomize_va_space
>   echo 1000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>   echo 1000 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
>   mount -t hugetlbfs none /mnt/huge
>   make test
> 
> TSO is validated with IPv4 and IPv6 with testpmd (see the commit log of
> last patch for details).
> 
> The performance non-regression has been tested with 6WINDGate fast path.
> 
> Note: these patches may conflict with patch [2], which is not pushed
> yet but will probably be integrated before this series.
> 
> [1] http://dpdk.org/ml/archives/dev/2013-October/thread.html#572
> [2] http://dpdk.org/ml/archives/dev/2014-April/002166.html
> 

I would also like to propose changing the checksum offload flags.
Many devices can indicate good checksum in some cases but can't test
for many other types of packets. By changing the flags to be:
 PKT_RX_L4_CKSUM_GOOD and PKT_RX_IP_CKSUM_GOOD

It is then possible to support devices where some cases (IPv4 + TCP)
are supported but others are not.

This also better aligns with Linux checksum code for cases where mbuf
and meta data are being passed into kernel.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support
  2014-05-09 17:04 ` [dpdk-dev] [PATCH RFC 00/11] " Stephen Hemminger
@ 2014-05-09 21:49   ` Olivier MATZ
  2014-05-10  0:39     ` Stephen Hemminger
  0 siblings, 1 reply; 51+ messages in thread
From: Olivier MATZ @ 2014-05-09 21:49 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

Hi Stephen,

On 05/09/2014 07:04 PM, Stephen Hemminger wrote:
> I would also like to propose changing the checksum offload flags.
> Many devices can indicate good checksum in some cases but can't test
> for many other types of packets. By changing the flags to be:
>  PKT_RX_L4_CKSUM_GOOD and PKT_RX_IP_CKSUM_GOOD
> 
> It is then possible to support devices where some cases (IPv4 + TCP)
> are supported but others are not.

I agree. That's also what I'm talking about in the commit log of
the patch 08/11.

If there is not much rework for all the patches, I think it's feasible
to include this kind of modification in the v2 of this series.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support
  2014-05-09 21:49   ` Olivier MATZ
@ 2014-05-10  0:39     ` Stephen Hemminger
  0 siblings, 0 replies; 51+ messages in thread
From: Stephen Hemminger @ 2014-05-10  0:39 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev

On Fri, 09 May 2014 23:49:45 +0200
Olivier MATZ <olivier.matz@6wind.com> wrote:

> Hi Stephen,
> 
> On 05/09/2014 07:04 PM, Stephen Hemminger wrote:
> > I would also like to propose changing the checksum offload flags.
> > Many devices can indicate good checksum in some cases but can't test
> > for many other types of packets. By changing the flags to be:
> >  PKT_RX_L4_CKSUM_GOOD and PKT_RX_IP_CKSUM_GOOD
> > 
> > It is then possible to support devices where some cases (IPv4 + TCP)
> > are supported but others are not.
> 
> I agree. That's also what I'm talking about in the commit log of
> the patch 08/11.
> 
> If there is not much rework for all the patches, I think it's feasible
> to include this kind of modification in the v2 of this series.
> 
> Regards,
> Olivier
> 

There are three checksum states:
	1. Known good
	2. Known bad
	3. Can't tell

The current choice of flags makes handling #3 impossible. If you change it to CKSUM_GOOD
then 1 => GOOD, 2 => not GOOD, 3 => not GOOD. And for case #3 the software can
validate it.  In most cases IP checksum offload is meaningless anyway because
the IP header fits in a single cache line, and the cost to checksum is minimal.
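
A minimal sketch of how a receiver could act on such a flag is given
below. The flag value is hypothetical (only the name is proposed in this
thread), and the helper names are made up for the example. Whenever the
flag is not set, the IPv4 header checksum is simply verified in software.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit value; this thread only proposes the flag name. */
#define PKT_RX_IP_CKSUM_GOOD 0x0001u

/* Software check: the one's-complement sum of a valid header is 0xffff. */
static int
ipv4_hdr_cksum_ok(const uint8_t *hdr, size_t hdr_len)
{
	uint32_t sum = 0;
	size_t i;

	/* An IPv4 header is 20 to 60 bytes, in multiples of 4. */
	assert(hdr_len >= 20 && (hdr_len % 4) == 0);
	for (i = 0; i < hdr_len; i += 2)
		sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return sum == 0xffff;
}

/* State 1 is trusted; states 2 and 3 both fall back to software. */
static int
rx_ip_cksum_ok(uint32_t ol_flags, const uint8_t *hdr, size_t hdr_len)
{
	if (ol_flags & PKT_RX_IP_CKSUM_GOOD)
		return 1;
	return ipv4_hdr_cksum_ok(hdr, hdr_len);
}
```

The fallback costs one pass over 20 to 60 bytes that are normally already
in cache, which matches the remark that the cost to checksum is minimal.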

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset Olivier Matz
@ 2014-05-12 14:12   ` Thomas Monjalon
  2014-05-12 14:36     ` Venkatesan, Venky
  0 siblings, 1 reply; 51+ messages in thread
From: Thomas Monjalon @ 2014-05-12 14:12 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev

Hi Olivier,

2014-05-09 16:50, Olivier Matz:
> The mbuf structure already contains a pointer to the beginning of the
> buffer (m->buf_addr). It is not needed to use 8 bytes again to store
> another pointer to the beginning of the data.
> 
> Using a 16 bits unsigned integer is enough as we know that a mbuf is
> never longer than 64KB. We gain 6 bytes in the structure thanks to
> this modification.
> 
> Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
[...]
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -132,6 +132,13 @@ struct rte_mbuf {
>  	void *buf_addr;           /**< Virtual address of segment buffer. */
>  	uint64_t buf_physaddr:48; /**< Physical address of segment buffer. */
>  	uint64_t buf_len:16;      /**< Length of segment buffer. */
> +
> +	/* valid for any segment */
> +	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> +	uint16_t data_off;
> +	uint16_t data_len;        /**< Amount of data in segment buffer. */
> +	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
> +
>  #ifdef RTE_MBUF_REFCNT
>  	/**
>  	 * 16-bit Reference counter.
> @@ -142,36 +149,30 @@ struct rte_mbuf {
>  	 * config option.
>  	 */
>  	union {
> -		rte_atomic16_t refcnt_atomic;   /**< Atomically accessed refcnt */
> -		uint16_t refcnt;                /**< Non-atomically accessed refcnt */
> +		rte_atomic16_t refcnt_atomic; /**< Atomically accessed refcnt */
> +		uint16_t refcnt;  /**< Non-atomically accessed refcnt */
>  	};
>  #else
> -	uint16_t refcnt_reserved;     /**< Do not use this field */
> +	uint16_t refcnt_reserved; /**< Do not use this field */
>  #endif
> 
> -	uint16_t ol_flags;            /**< Offload features. */
> -	uint32_t reserved;             /**< Unused field. Required for padding. */
> -
> -	/* valid for any segment */
> -	struct rte_mbuf *next;  /**< Next segment of scattered packet. */
> -	void* data;             /**< Start address of data in segment buffer. */
> -	uint16_t data_len;      /**< Amount of data in segment buffer. */
> -
>  	/* these fields are valid for first segment only */
> -	uint8_t nb_segs;        /**< Number of segments. */
> -	uint8_t in_port;        /**< Input port. */
> -	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
> +	uint8_t nb_segs;          /**< Number of segments. */
> +	uint8_t in_port;          /**< Input port. */
> +	uint16_t ol_flags;        /**< Offload features. */
> +	uint16_t reserved;        /**< Unused field. Required for padding. */
> 
>  	/* offload features, valid for first segment only */
>  	union rte_vlan_macip vlan_macip;
>  	union {
> -		uint32_t rss;       /**< RSS hash result if RSS enabled */
> +		uint32_t rss;     /**< RSS hash result if RSS enabled */
>  		struct {
>  			uint16_t hash;
>  			uint16_t id;
> -		} fdir;             /**< Filter identifier if FDIR enabled */
> -		uint32_t sched;     /**< Hierarchical scheduler */
> -	} hash;                 /**< hash information */
> +		} fdir;           /**< Filter identifier if FDIR enabled */
> +		uint32_t sched;   /**< Hierarchical scheduler */
> +	} hash;                   /**< hash information */
> +	uint64_t reserved2;       /**< Unused field. Required for padding. */
>  } __rte_cache_aligned;

There are some cosmetic changes mixed with real changes.
This makes them hard to read.
Please split this patch.

-- 
Thomas

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 11/11] ixgbe/mbuf: add TSO support
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 11/11] ixgbe/mbuf: add TSO support Olivier Matz
@ 2014-05-12 14:30   ` Thomas Monjalon
  2014-05-15 15:09   ` Ananyev, Konstantin
  1 sibling, 0 replies; 51+ messages in thread
From: Thomas Monjalon @ 2014-05-12 14:30 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev

2014-05-09 16:50, Olivier Matz:
> Implement TSO (TCP segmentation offload) in ixgbe driver. To delegate
> the TCP segmentation to the hardware, the user has to:
> 
> - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
>   PKT_TX_IP_CKSUM and PKT_TX_TCP_CKSUM)
> - fill the mbuf->hw_offload information: l2_len, l3_len, l4_len, mss
> - calculate the pseudo header checksum and set it in the TCP header,
>   as required when doing hardware TCP checksum offload
> - set the IP checksum to 0
> 
> This approach seems generic enough to be used for other hw/drivers
> in the future.
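
As an illustration of the pseudo-header step listed above, here is a
minimal sketch for IPv4. The function name is hypothetical. Whether the
L4 length must be part of the sum when TSO is requested depends on the
hardware and is not stated in this thread, so it is left as a parameter
(pass 0 to leave it out). The sum is returned in host byte order and is
not inverted; the caller stores it in the TCP checksum field in network
byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One's-complement sum of the IPv4 pseudo-header for TCP:
 * source address, destination address, protocol and L4 length.
 * src and dst point to the 4 address bytes as found in the IP header.
 */
static uint16_t
ipv4_tcp_phdr_sum(const uint8_t src[4], const uint8_t dst[4], uint16_t l4_len)
{
	uint32_t sum;

	assert(src != NULL && dst != NULL);
	sum  = ((uint32_t)src[0] << 8) | src[1];
	sum += ((uint32_t)src[2] << 8) | src[3];
	sum += ((uint32_t)dst[0] << 8) | dst[1];
	sum += ((uint32_t)dst[2] << 8) | dst[3];
	sum += 6;      /* IPPROTO_TCP, zero-extended to 16 bits */
	sum += l4_len; /* assumption: 0 when the hardware adds the length */

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```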

Minor note: it would be nice to separate ixgbe support in another patch.

Thanks
-- 
Thomas

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
  2014-05-12 14:12   ` Thomas Monjalon
@ 2014-05-12 14:36     ` Venkatesan, Venky
  2014-05-12 14:41       ` Neil Horman
  0 siblings, 1 reply; 51+ messages in thread
From: Venkatesan, Venky @ 2014-05-12 14:36 UTC (permalink / raw)
  To: Thomas Monjalon, Olivier Matz; +Cc: dev

Olivier, 

This is a hugely problematic change, and it has a pretty large performance impact (because of the dependency between compute and access). We debated this for a long time during the early days of DPDK and decided against it. This is also a repeated sequence - the driver will do it twice (Rx + Tx) and the next-level stack will do it twice (Rx + Tx) ...

My vote is to reject this particular change to the mbuf.

Regards, 
-Venky


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
  2014-05-12 14:36     ` Venkatesan, Venky
@ 2014-05-12 14:41       ` Neil Horman
  2014-05-12 15:07         ` Olivier MATZ
  0 siblings, 1 reply; 51+ messages in thread
From: Neil Horman @ 2014-05-12 14:41 UTC (permalink / raw)
  To: Venkatesan, Venky; +Cc: dev

On Mon, May 12, 2014 at 02:36:12PM +0000, Venkatesan, Venky wrote:
> Olivier, 
> 
> This is a hugely problematic change, and it has a pretty large performance impact (because of the dependency between compute and access). We debated this for a long time during the early days of DPDK and decided against it. This is also a repeated sequence - the driver will do it twice (Rx + Tx) and the next-level stack will do it twice (Rx + Tx) ...
> 
> My vote is to reject this particular change to the mbuf.
> 
> Regards, 
> -Venky
> 
Do you have performance numbers to compare throughput with and without this
change?  I always feel suspicious when I see the spectre of performance used to
support or deny a change without supporting reasoning or metrics.

Neil

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> Sent: Monday, May 12, 2014 7:13 AM
> To: Olivier Matz
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
> 
> Hi Olivier,
> 
> 2014-05-09 16:50, Olivier Matz:
> > The mbuf structure already contains a pointer to the beginning of the 
> > buffer (m->buf_addr). It is not needed to use 8 bytes again to store 
> > another pointer to the beginning of the data.
> > 
> > Using a 16 bits unsigned integer is enough as we know that a mbuf is 
> > never longer than 64KB. We gain 6 bytes in the structure thanks to 
> > this modification.
> > 
> > Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
> [...]
> > --- a/lib/librte_mbuf/rte_mbuf.h
> > +++ b/lib/librte_mbuf/rte_mbuf.h
> > @@ -132,6 +132,13 @@ struct rte_mbuf {
> >  	void *buf_addr;           /**< Virtual address of segment buffer. */
> >  	uint64_t buf_physaddr:48; /**< Physical address of segment buffer. */
> >  	uint64_t buf_len:16;      /**< Length of segment buffer. */
> > +
> > +	/* valid for any segment */
> > +	struct rte_mbuf *next;    /**< Next segment of scattered packet. */
> > +	uint16_t data_off;
> > +	uint16_t data_len;        /**< Amount of data in segment buffer. */
> > +	uint32_t pkt_len;         /**< Total pkt len: sum of all segments. */
> > +
> >  #ifdef RTE_MBUF_REFCNT
> >  	/**
> >  	 * 16-bit Reference counter.
> > @@ -142,36 +149,30 @@ struct rte_mbuf {
> >  	 * config option.
> >  	 */
> >  	union {
> > -		rte_atomic16_t refcnt_atomic;   /**< Atomically accessed refcnt */
> > -		uint16_t refcnt;                /**< Non-atomically accessed refcnt */
> > +		rte_atomic16_t refcnt_atomic; /**< Atomically accessed refcnt */
> > +		uint16_t refcnt;  /**< Non-atomically accessed refcnt */
> >  	};
> >  #else
> > -	uint16_t refcnt_reserved;     /**< Do not use this field */
> > +	uint16_t refcnt_reserved; /**< Do not use this field */
> >  #endif
> > 
> > -	uint16_t ol_flags;            /**< Offload features. */
> > -	uint32_t reserved;             /**< Unused field. Required for padding. */
> > -
> > -	/* valid for any segment */
> > -	struct rte_mbuf *next;  /**< Next segment of scattered packet. */
> > -	void* data;             /**< Start address of data in segment buffer. */
> > -	uint16_t data_len;      /**< Amount of data in segment buffer. */
> > -
> >  	/* these fields are valid for first segment only */
> > -	uint8_t nb_segs;        /**< Number of segments. */
> > -	uint8_t in_port;        /**< Input port. */
> > -	uint32_t pkt_len;       /**< Total pkt len: sum of all segment data_len. */
> > +	uint8_t nb_segs;          /**< Number of segments. */
> > +	uint8_t in_port;          /**< Input port. */
> > +	uint16_t ol_flags;        /**< Offload features. */
> > +	uint16_t reserved;        /**< Unused field. Required for padding. */
> > 
> >  	/* offload features, valid for first segment only */
> >  	union rte_vlan_macip vlan_macip;
> >  	union {
> > -		uint32_t rss;       /**< RSS hash result if RSS enabled */
> > +		uint32_t rss;     /**< RSS hash result if RSS enabled */
> >  		struct {
> >  			uint16_t hash;
> >  			uint16_t id;
> > -		} fdir;             /**< Filter identifier if FDIR enabled */
> > -		uint32_t sched;     /**< Hierarchical scheduler */
> > -	} hash;                 /**< hash information */
> > +		} fdir;           /**< Filter identifier if FDIR enabled */
> > +		uint32_t sched;   /**< Hierarchical scheduler */
> > +	} hash;                   /**< hash information */
> > +	uint64_t reserved2;       /**< Unused field. Required for padding. */
> >  } __rte_cache_aligned;
> 
> There are some cosmetic changes mixed with real changes.
> > This makes them hard to read.
> Please split this patch.
> 
> --
> Thomas
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
  2014-05-12 14:41       ` Neil Horman
@ 2014-05-12 15:07         ` Olivier MATZ
  2014-05-12 15:59           ` Stephen Hemminger
  2014-05-12 16:06           ` Venkatesan, Venky
  0 siblings, 2 replies; 51+ messages in thread
From: Olivier MATZ @ 2014-05-12 15:07 UTC (permalink / raw)
  To: Neil Horman, Venkatesan, Venky; +Cc: dev

Hi Venky,

On 05/12/2014 04:41 PM, Neil Horman wrote:
>> This is a hugely problematic change, and has a pretty large
>> performance impact (because of the dependency between compute and access). We
>> debated this for a long time during the early days of DPDK and
>> decided against it. This is also a repeated sequence - the driver
>> will do it twice (Rx + Tx) and the next level stack will do it twice
>> (Rx + Tx) ...
>>
>> My vote is to reject this particular change to the mbuf.
>>
>> Regards,
>> -Venky
>>
> Do you have performance numbers to compare throughput with and without this
> change?  I always feel suspicious when I see the spectre of performance used to
> support or deny a change without supporting reasoning or metrics.

I agree with Neil. My feeling is that it won't impact performance, and
this is consistent with the forwarding tests I've done with this patch.

I don't really understand what would cost more when storing the offset
instead of the virtual address. I agree that each time the stack
accesses the beginning of the mbuf, there will be an arithmetic
operation, but this is compensated by other operations that will be
accelerated:

- When receiving a packet, the driver will do:

     m->data_off = RTE_PKTMBUF_HEADROOM;

   instead of:

     m->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;

- Each time the stack prepends data, it has to check that the headroom
   is large enough to do the operation. This will be faster as data_off
   is the headroom.

- When transmitting a packet, the driver will get the physical address:

     phys_addr = m->buf_physaddr + m->data_off

   instead of:

     phys_addr = (m->buf_physaddr +  \
         ((char *)m->data - (char *)m->buf_addr))

Moreover, these operations look negligible to me (a few cycles) compared
to the large number of arithmetic operations and tests done in the
driver.
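
The three cases above can be sketched as follows. This is an
illustration under assumptions, not the code of the patch: struct
mini_mbuf, MINI_HEADROOM and the mini_* helpers are hypothetical names,
and only the fields discussed here are kept.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MINI_HEADROOM 128 /* stand-in for RTE_PKTMBUF_HEADROOM */

/* Hypothetical reduced mbuf that stores an offset instead of a pointer. */
struct mini_mbuf {
	void *buf_addr;        /* virtual address of the segment buffer */
	uint64_t buf_physaddr; /* physical address of the segment buffer */
	uint16_t buf_len;      /* length of the segment buffer */
	uint16_t data_off;     /* offset of the data inside the buffer */
};

/* Rx: the data start is reset with a constant store. */
static inline void
mini_rx_reset(struct mini_mbuf *m)
{
	assert(m->buf_len >= MINI_HEADROOM);
	m->data_off = MINI_HEADROOM;
}

/* The headroom is data_off itself, no pointer subtraction is needed. */
static inline uint16_t
mini_headroom(const struct mini_mbuf *m)
{
	return m->data_off;
}

/* Prepend len bytes; returns the new data start, or NULL if no room. */
static inline char *
mini_prepend(struct mini_mbuf *m, uint16_t len)
{
	if (len > mini_headroom(m))
		return NULL;
	m->data_off = (uint16_t)(m->data_off - len);
	return (char *)m->buf_addr + m->data_off;
}

/* Tx: physical address of the data, one add on two mbuf fields. */
static inline uint64_t
mini_data_physaddr(const struct mini_mbuf *m)
{
	return m->buf_physaddr + m->data_off;
}
```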

Regards,
Olivier

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
  2014-05-12 15:07         ` Olivier MATZ
@ 2014-05-12 15:59           ` Stephen Hemminger
  2014-05-12 16:13             ` Olivier MATZ
  2014-05-12 16:06           ` Venkatesan, Venky
  1 sibling, 1 reply; 51+ messages in thread
From: Stephen Hemminger @ 2014-05-12 15:59 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev

On Mon, 12 May 2014 17:07:03 +0200
Olivier MATZ <olivier.matz@6wind.com> wrote:

> Hi Venky,
> 
> On 05/12/2014 04:41 PM, Neil Horman wrote:
> >> This is a hugely problematic change, and has a pretty large
> >> performance impact (because of the dependency between compute and access). We
> >> debated this for a long time during the early days of DPDK and
> >> decided against it. This is also a repeated sequence - the driver
> >> will do it twice (Rx + Tx) and the next level stack will do it twice
> >> (Rx + Tx) ...
> >>
> >> My vote is to reject this particular change to the mbuf.
> >>
> >> Regards,
> >> -Venky
> >>
> > Do you have performance numbers to compare throughput with and without this
> > change?  I always feel suspicious when I see the spectre of performance used to
> > support or deny a change without supporting reasoning or metrics.
> 
> I agree with Neil. My feeling is that it won't impact performance, and
> it is correlated with the forwarding tests I've done with this patch.
> 
> I don't really understand what would cost more by storing the offset
> instead of the virtual address. I agree that each time the stack will
> access the beginning of the mbuf, there will be an arithmetic
> operation, but it is compensated by other operations that will be
> accelerated:
> 
> - When receiving a packet, the driver will do:
> 
>      m->data_off = RTE_PKTMBUF_HEADROOM;
> 
>    instead of:
> 
>      m->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
> 
> - Each time the stack will prepend data, it has to check if the headroom
>    is large enough to do the operation. This will be faster as data_off
>    is the headroom.
> 
> - When transmitting a packet, the driver will get the physical address:
> 
>      phys_addr = m->buf_physaddr + m->data_off
> 
>    instead of:
> 
>      phys_addr = (m->buf_physaddr +  \
>          ((char *)m->data - (char *)m->buf_addr))
> 
> Moreover, these operations look negligible to me (few cycles) compared
> to the large amount of arithmetic operations and tests done in the
> driver.
> 
> Regards,
> Olivier

There is one case that this change might make problematic.
Right now it is possible to clone an mbuf and, in the cloned mbuf,
use the associated data buffer as a private metadata store.
This is convenient (like skb->cb in Linux) and avoids an additional
allocation.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
  2014-05-12 15:07         ` Olivier MATZ
  2014-05-12 15:59           ` Stephen Hemminger
@ 2014-05-12 16:06           ` Venkatesan, Venky
  2014-05-12 18:39             ` Neil Horman
  1 sibling, 1 reply; 51+ messages in thread
From: Venkatesan, Venky @ 2014-05-12 16:06 UTC (permalink / raw)
  To: Olivier MATZ, Neil Horman; +Cc: dev

Olivier, 

The impact isn't going to be felt on the driver quite as much (and can be mitigated) - the driver runs at a pretty low IPC (~1.7) compared to some of the more optimized code above it that actually accesses the data. The problem with the dependent compute is like this - in effect you are changing 

struct eth_hdr * eth = (struct eth_hdr *) m->data;
to 
struct eth_hdr * eth = (struct eth_hdr *) ( (char *)m->buf_addr + m->data_off);

We have some code that actually processes 4-8 packets in parallel (parse + hash), with a pretty high IPC. What we've done here is essentially replace a simple load with a load, load, add sequence in front of it. There is no real way to do these computations in parallel for multiple packets - they have to be done one or two at a time. What suffers, quite significantly, is the IPC of the overall function that does the parse/hash. It's those functions that I worry about more than the driver.  I haven't been able to come up with a mitigation for this yet. 
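
To make the two access patterns concrete, here is a small sketch. The
struct and function names are hypothetical and only the relevant fields
are kept. It shows the address computation that precedes the first read
of packet data in each layout; it does not measure anything, so the cost
figures discussed in this thread are not reproduced by it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETHER_TYPE_OFFSET 12 /* EtherType follows the two MAC addresses */

/* Layout before the patch: the data start is stored as a pointer. */
struct mbuf_ptr {
	void *data;
};

/* Layout after the patch: buffer start plus a 16-bit offset. */
struct mbuf_off {
	void *buf_addr;
	uint16_t data_off;
};

/* Read a big-endian 16-bit value one byte at a time. */
static inline uint16_t
load_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* One load (m->data) is needed before the packet bytes can be read. */
static inline uint16_t
ether_type_ptr(const struct mbuf_ptr *m)
{
	assert(m->data != NULL);
	return load_be16((const uint8_t *)m->data + ETHER_TYPE_OFFSET);
}

/* Two loads (buf_addr, data_off) and an add come before the same read. */
static inline uint16_t
ether_type_off(const struct mbuf_off *m)
{
	assert(m->buf_addr != NULL);
	return load_be16((const uint8_t *)m->buf_addr + m->data_off +
			 ETHER_TYPE_OFFSET);
}
```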

Neil, 

The last time we looked at this change - and it's been a while ago - the negative effect on the upper level functions built on this was on the order of about 15-20%. It will probably get worse once we tune the code even more. Hope the above explanation gives you a flavour of the problem this will introduce.

Regards, 
-Venky




-----Original Message-----
From: Olivier MATZ [mailto:olivier.matz@6wind.com] 
Sent: Monday, May 12, 2014 8:07 AM
To: Neil Horman; Venkatesan, Venky
Cc: Thomas Monjalon; dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset

Hi Venky,

On 05/12/2014 04:41 PM, Neil Horman wrote:
>> This is a hugely problematic change, and has a pretty large 
>> performance impact (because the dependency to compute and access). We 
>> debated this for a long time during the early days of DPDK and 
>> decided against it. This is also a repeated sequence - the driver 
>> will do it twice (Rx + Tx) and the next level stack will do it twice 
>> (Rx + Tx) ...
>>
>> My vote is to reject this change particular change to the mbuf.
>>
>> Regards,
>> -Venky
>>
> Do you have perforamance numbers to compare throughput with and 
> without this change?  I always feel suspcious when I see the spectre 
> of performane used to support or deny a change without supporting reasoning or metrics.

I agree with Neil. My feeling is that it won't impact performance, and it is correlated with the forwarding tests I've done with this patch.

I don't really understand what would cost more by storing the offset instead of the virtual address. I agree that each time the stack accesses the beginning of the mbuf data, there will be an arithmetic operation, but it is compensated by other operations that will be
accelerated:

- When receiving a packet, the driver will do:

     m->data_off = RTE_PKTMBUF_HEADROOM;

   instead of:

     m->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;

- Each time the stack will prepend data, it has to check if the headroom
   is large enough to do the operation. This will be faster as data_off
   is the headroom.

- When transmitting a packet, the driver will get the physical address:

     phys_addr = m->buf_physaddr + m->data_off

   instead of:

     phys_addr = (m->buf_physaddr +  \
         ((char *)m->data - (char *)m->buf_addr))

Moreover, these operations look negligible to me (few cycles) compared to the large amount of arithmetic operations and tests done in the driver.
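
The three cases listed above can be sketched as follows. This is an illustration under assumptions, not the patch itself: the reduced struct mbuf_off, the PKT_HEADROOM constant (standing in for RTE_PKTMBUF_HEADROOM) and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PKT_HEADROOM 128   /* stand-in for RTE_PKTMBUF_HEADROOM */

/* Hypothetical, reduced mbuf using the proposed data_off field. */
struct mbuf_off {
	void *buf_addr;         /* virtual address of the buffer */
	uint64_t buf_physaddr;  /* physical address of the buffer */
	uint16_t data_off;      /* offset of the packet data in the buffer */
};

/* RX: reset the data start with a constant store. */
static inline void
rx_reset(struct mbuf_off *m)
{
	m->data_off = PKT_HEADROOM;
}

/* Prepend: data_off is the headroom, so the check is one comparison. */
static inline int
prepend(struct mbuf_off *m, uint16_t len)
{
	if (len > m->data_off)
		return -1;
	m->data_off = (uint16_t)(m->data_off - len);
	return 0;
}

/* TX: the address given to the NIC is a single add. */
static inline uint64_t
tx_phys_addr(const struct mbuf_off *m)
{
	return m->buf_physaddr + m->data_off;
}
```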

Regards,
Olivier

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
  2014-05-12 15:59           ` Stephen Hemminger
@ 2014-05-12 16:13             ` Olivier MATZ
  2014-05-12 17:13               ` Stephen Hemminger
  0 siblings, 1 reply; 51+ messages in thread
From: Olivier MATZ @ 2014-05-12 16:13 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

Hi Stephen,

On 05/12/2014 05:59 PM, Stephen Hemminger wrote:
> There is one case which this case might make problematic.
> Right now it is possible to clone an mbuf and in the cloned mbuf
> use the associated data buffer as private meta data store.
> This is convenient (like skb->cb in Linux) and avoids addtional
> allocation.

I don't get your point. Why wouldn't using rte_pktmbuf_mtod(m, char *)
work in your case?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
  2014-05-12 16:13             ` Olivier MATZ
@ 2014-05-12 17:13               ` Stephen Hemminger
  2014-05-13 13:29                 ` Olivier MATZ
  0 siblings, 1 reply; 51+ messages in thread
From: Stephen Hemminger @ 2014-05-12 17:13 UTC (permalink / raw)
  To: Olivier MATZ; +Cc: dev

On Mon, 12 May 2014 18:13:26 +0200
Olivier MATZ <olivier.matz@6wind.com> wrote:

> Hi Stephen,
> 
> On 05/12/2014 05:59 PM, Stephen Hemminger wrote:
> > There is one case which this case might make problematic.
> > Right now it is possible to clone an mbuf and in the cloned mbuf
> > use the associated data buffer as private meta data store.
> > This is convenient (like skb->cb in Linux) and avoids addtional
> > allocation.
> 
> I don't get your point. Why using rte_pktmbuf_mtod(m, char *)
> wouldn't work in your case?

In cloned mbuf
rte_pktmbuf_mtod(m, char *) points to the original data.
RTE_MBUF_TO_BADDR(m) points to buffer in the mbuf which we
use for metadata (timestamp).

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
  2014-05-12 16:06           ` Venkatesan, Venky
@ 2014-05-12 18:39             ` Neil Horman
  2014-05-13 13:54               ` Venkatesan, Venky
  0 siblings, 1 reply; 51+ messages in thread
From: Neil Horman @ 2014-05-12 18:39 UTC (permalink / raw)
  To: Venkatesan, Venky; +Cc: dev

On Mon, May 12, 2014 at 04:06:23PM +0000, Venkatesan, Venky wrote:
> Olivier, 
> 
> The impact isn't going to be felt on the driver quite as much (and can be mitigated) - the driver runs a pretty low IPC (~1.7) compared to some of the more optimized code above it that actually accesses the data. The problem with the dependent compute is like this - in effect you are changing 
> 
> struct eth_hdr * eth = (struct eth_hdr *) m->data;
> to 
> struct eth_hdr * eth = (struct eth_hdr *) ( (char *)m->buf _addr + m->data_offset);
> 
> We have some code that actually processes 4-8 packets in parallel (parse + hash), with a pretty high IPC. What we've done here is essentially replaced is a simple load, with  a load, load, add sequence in front of it. There is no real way to do these computations in parallel for multiple packets - it has to be done one or two at a time. What suffers is the IPC of the overall function that does the parse/hash quite significantly. It's those functions that I worry about more than the driver.  I haven't yet been able to come up with a mitigation for this yet. 
> 
> Neil, 
> 
> The last time we looked at this change - and it's been a while ago, the negative effect on the upper level functions built on this was on the order of about 15-20%. It's probably will get worse once we tune the code even more.  Hope the above explanation gives you a flavour of the problem this will introduce. 
> 
I'm sorry, it doesn't.  I take you at your word that it was a problem, but I
don't think we can just categorically deny patches based on past testing of
potentially similar code, especially given that this series attempts to improve
some traffic pattern via the implementation of TSO (meaning the net result will
be different based on the use case).

I understand what you're saying above, that this code incurs a second load
operation (though I would think they could be implemented in parallel, or at the
very least accelerated by clever placement of data_offset relative to buf_addr
to ensure that the second load was cache hot).

Regardless, my point is, just saying that this can't be done because you saw a
performance hit with something similar in the past isn't helpful.  If you
think that's a problem, then we really need to get details of your test case and
measurements you took so that they can be reproduced, and confirmed or refuted.

Regards
Neil.

> Regards, 
> -Venky
> 
> 
> 
> 
> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com] 
> Sent: Monday, May 12, 2014 8:07 AM
> To: Neil Horman; Venkatesan, Venky
> Cc: Thomas Monjalon; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
> 
> Hi Venky,
> 
> On 05/12/2014 04:41 PM, Neil Horman wrote:
> >> This is a hugely problematic change, and has a pretty large 
> >> performance impact (because the dependency to compute and access). We 
> >> debated this for a long time during the early days of DPDK and 
> >> decided against it. This is also a repeated sequence - the driver 
> >> will do it twice (Rx + Tx) and the next level stack will do it twice 
> >> (Rx + Tx) ...
> >>
> >> My vote is to reject this change particular change to the mbuf.
> >>
> >> Regards,
> >> -Venky
> >>
> > Do you have perforamance numbers to compare throughput with and 
> > without this change?  I always feel suspcious when I see the spectre 
> > of performane used to support or deny a change without supporting reasoning or metrics.
> 
> I agree with Neil. My feeling is that it won't impact performance, and it is correlated with the forwarding tests I've done with this patch.
> 
> I don't really understand what would cost more by storing the offset instead of the virtual address. I agree that each time the stack will access to the begining of the mbuf, there will be an arithmetic operation, but it is compensated by other operations that will be
> accelerated:
> 
> - When receiving a packet, the driver will do:
> 
>      m->data_off = RTE_PKTMBUF_HEADROOM;
> 
>    instead of:
> 
>      m->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
> 
> - Each time the stack will prepend data, it has to check if the headroom
>    is large enough to do the operation. This will be faster as data_off
>    is the headroom.
> 
> - When transmitting a packet, the driver will get the physical address:
> 
>      phys_addr = m->buf_physaddr + m->data_off
> 
>    instead of:
> 
>      phys_addr = (m->buf_physaddr +  \
>          ((char *)m->data - (char *)m->buf_addr)))
> 
> Moreover, these operations look negligible to me (few cycles) compared to the large amount of arithmetic operations and tests done in the driver.
> 
> Regards,
> Olivier
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
  2014-05-12 17:13               ` Stephen Hemminger
@ 2014-05-13 13:29                 ` Olivier MATZ
  0 siblings, 0 replies; 51+ messages in thread
From: Olivier MATZ @ 2014-05-13 13:29 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

Hi Stephen,

On 05/12/2014 07:13 PM, Stephen Hemminger wrote:
> In cloned mbuf
> rte_pktmbuf_mtod(m, char *) points to the original data.
> RTE_MBUF_TO_BADDR(m) points to buffer in the mbuf which we
> use for metadata (timestamp).

I still don't see the problem. Let's take an example: m2 is a clone
of m1. Before applying the patch series, we have:

- rte_pktmbuf_mtod(m1) points to m1->pkt.data
- RTE_MBUF_TO_BADDR(m1) points to m1->buf_addr
- rte_pktmbuf_mtod(m2) points to m1->pkt.data
- RTE_MBUF_TO_BADDR(m2) points to m2->buf_addr

After the patches:

- rte_pktmbuf_mtod(m1) points to m1->buf_addr + m1->data_off
- RTE_MBUF_TO_BADDR(m1) points to m1->buf_addr
- rte_pktmbuf_mtod(m2) points to m1->buf_addr + m2->data_off
- RTE_MBUF_TO_BADDR(m2) points to m2->buf_addr

I assume this is equivalent, as m2->data_off will have the same value
as m1->data_off. Have you identified a specific test case that fails?
The mbuf autotest is successful; if you think there is something
else to test, it should be added in the test app.
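
A small model of the clone case, as a sketch only. The struct, the embedded buffer and the helpers (mtod, own_buffer, attach) are hypothetical simplifications: own_buffer() stands for what Stephen describes as the buffer inside the mbuf, and attach() for what happens when m2 becomes a clone of m1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced mbuf: header followed by its own buffer. */
struct mbuf {
	void *buf_addr;      /* buffer holding the packet data */
	uint16_t data_off;   /* offset of the data inside buf_addr */
	char embedded[256];  /* the mbuf's own buffer */
};

/* Packet data, computed from buf_addr and data_off. */
static inline char *
mtod(const struct mbuf *m)
{
	return (char *)m->buf_addr + m->data_off;
}

/* The mbuf's own buffer, wherever buf_addr currently points. */
static inline char *
own_buffer(struct mbuf *m)
{
	return m->embedded;
}

/* Attach clone m2 to m1: share the data buffer and copy the offset. */
static inline void
attach(struct mbuf *m2, const struct mbuf *m1)
{
	assert(m1 != NULL && m2 != NULL);
	m2->buf_addr = m1->buf_addr;
	m2->data_off = m1->data_off;
}
```

In this model the clone sees the same packet data as the original, while its own buffer stays separate and remains usable for private metadata.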

I don't use this feature, but by the way it seems that the macros
RTE_MBUF_TO_BADDR(mb) and RTE_MBUF_FROM_BADDR(ba) won't return
the proper value if the application initializes an mbuf pool
with an object size != sizeof(rte_mbuf). I thought it was something
quite common as it allows the application to add its own info in the
mbuf.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
  2014-05-12 18:39             ` Neil Horman
@ 2014-05-13 13:54               ` Venkatesan, Venky
  2014-05-13 14:09                 ` Thomas Monjalon
  0 siblings, 1 reply; 51+ messages in thread
From: Venkatesan, Venky @ 2014-05-13 13:54 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

An alternative way to save 6 bytes (without the side effects this change has) would be to change the mempool struct * to a uint16_t mempool_id. That limits the changes to a return function, and the performance impact of that can be mitigated quite easily. 
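
As an illustration of the idea only (no patch was posted for it in this thread), the pool pointer could be replaced by an index into a table. Everything named here is hypothetical: pool_table, MAX_POOLS, pool_register() and pool_of() are not DPDK APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_POOLS 64    /* hypothetical table size */

struct mempool { const char *name; };   /* stand-in for struct rte_mempool */

/* Hypothetical global table mapping a 16-bit id to a pool. */
static struct mempool *pool_table[MAX_POOLS];

struct mbuf_with_ptr { struct mempool *pool; };  /* 8 bytes on 64-bit targets */
struct mbuf_with_id  { uint16_t pool_id; };      /* 2 bytes */

/* Record a pool under the given id. */
static inline void
pool_register(uint16_t id, struct mempool *mp)
{
	assert(id < MAX_POOLS);
	pool_table[id] = mp;
}

/* Only the path that returns an mbuf to its pool pays the extra lookup. */
static inline struct mempool *
pool_of(const struct mbuf_with_id *m)
{
	assert(m->pool_id < MAX_POOLS);
	return pool_table[m->pool_id];
}
```

With an 8-byte pointer replaced by a 2-byte id, the field shrinks by the 6 bytes mentioned above.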

-Venky

-----Original Message-----
From: Neil Horman [mailto:nhorman@tuxdriver.com] 
Sent: Monday, May 12, 2014 11:40 AM
To: Venkatesan, Venky
Cc: Olivier MATZ; Thomas Monjalon; dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset

On Mon, May 12, 2014 at 04:06:23PM +0000, Venkatesan, Venky wrote:
> Olivier,
> 
> The impact isn't going to be felt on the driver quite as much (and can 
> be mitigated) - the driver runs a pretty low IPC (~1.7) compared to 
> some of the more optimized code above it that actually accesses the 
> data. The problem with the dependent compute is like this - in effect 
> you are changing
> 
> struct eth_hdr * eth = (struct eth_hdr *) m->data; to struct eth_hdr * 
> eth = (struct eth_hdr *) ( (char *)m->buf _addr + m->data_offset);
> 
> We have some code that actually processes 4-8 packets in parallel (parse + hash), with a pretty high IPC. What we've done here is essentially replaced is a simple load, with  a load, load, add sequence in front of it. There is no real way to do these computations in parallel for multiple packets - it has to be done one or two at a time. What suffers is the IPC of the overall function that does the parse/hash quite significantly. It's those functions that I worry about more than the driver.  I haven't yet been able to come up with a mitigation for this yet. 
> 
> Neil,
> 
> The last time we looked at this change - and it's been a while ago, the negative effect on the upper level functions built on this was on the order of about 15-20%. It's probably will get worse once we tune the code even more.  Hope the above explanation gives you a flavour of the problem this will introduce. 
> 
I'm sorry, it doesn't.  I take you at your word that it was a problem, but I don't think we can just categorically deny patches based on past testing of potentially similar code, especially given that this series attempts to improve some traffic pattern via the implementation of TSO (meaning the net result will be different based on the use case).

I understand what you're saying above, that this code incurs a second load operation (though I would think they could be implemented in parallel, or at the very least accelerated by clever placement of data_offset relative to buf_addr to ensure that the second load was cache hot).

Regardless, my point is, just saying that this can't be done because you saw a performance hit with something similar in the past isn't helpful.  If you think that's a problem, then we really need to get details of your test case and measurements you took so that they can be reproduced, and confirmed or refuted.

Regards
Neil.

> Regards,
> -Venky
> 
> 
> 
> 
> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Monday, May 12, 2014 8:07 AM
> To: Neil Horman; Venkatesan, Venky
> Cc: Thomas Monjalon; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer 
> by an offset
> 
> Hi Venky,
> 
> On 05/12/2014 04:41 PM, Neil Horman wrote:
> >> This is a hugely problematic change, and has a pretty large 
> >> performance impact (because the dependency to compute and access). 
> >> We debated this for a long time during the early days of DPDK and 
> >> decided against it. This is also a repeated sequence - the driver 
> >> will do it twice (Rx + Tx) and the next level stack will do it 
> >> twice (Rx + Tx) ...
> >>
> >> My vote is to reject this change particular change to the mbuf.
> >>
> >> Regards,
> >> -Venky
> >>
> > Do you have perforamance numbers to compare throughput with and 
> > without this change?  I always feel suspcious when I see the spectre 
> > of performane used to support or deny a change without supporting reasoning or metrics.
> 
> I agree with Neil. My feeling is that it won't impact performance, and it is correlated with the forwarding tests I've done with this patch.
> 
> I don't really understand what would cost more by storing the offset 
> instead of the virtual address. I agree that each time the stack will 
> access to the begining of the mbuf, there will be an arithmetic 
> operation, but it is compensated by other operations that will be
> accelerated:
> 
> - When receiving a packet, the driver will do:
> 
>      m->data_off = RTE_PKTMBUF_HEADROOM;
> 
>    instead of:
> 
>      m->data = (char*) rxm->buf_addr + RTE_PKTMBUF_HEADROOM;
> 
> - Each time the stack will prepend data, it has to check if the headroom
>    is large enough to do the operation. This will be faster as data_off
>    is the headroom.
> 
> - When transmitting a packet, the driver will get the physical address:
> 
>      phys_addr = m->buf_physaddr + m->data_off
> 
>    instead of:
> 
>      phys_addr = (m->buf_physaddr +  \
>          ((char *)m->data - (char *)m->buf_addr)))
> 
> Moreover, these operations look negligible to me (few cycles) compared to the large amount of arithmetic operations and tests done in the driver.
> 
> Regards,
> Olivier
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset
  2014-05-13 13:54               ` Venkatesan, Venky
@ 2014-05-13 14:09                 ` Thomas Monjalon
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Monjalon @ 2014-05-13 14:09 UTC (permalink / raw)
  To: Venkatesan, Venky; +Cc: dev

Hi Venky,

2014-05-13 13:54, Venkatesan, Venky:
> An alternative way to save 6 bytes (without the side effects this change
> has) would be to change the mempool struct * to a uint16_t mempool_id. That
> limits the changes to a return function, and the performance impact of that
> can be mitigated quite easily.

It's very difficult to compare things without code examples.
Please, provide:
- a patch for your proposal
- an example application which allows us to test and understand the
  performance issue you are pointing out

PS: please don't top post, it makes this thread difficult to read

-- 
Thomas

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield
  2014-05-09 16:11       ` Shaw, Jeffrey B
@ 2014-05-14 14:07         ` Ananyev, Konstantin
  2014-05-15  9:53           ` Olivier MATZ
  2014-05-19  7:27         ` Olivier MATZ
  1 sibling, 1 reply; 51+ messages in thread
From: Ananyev, Konstantin @ 2014-05-14 14:07 UTC (permalink / raw)
  To: Shaw, Jeffrey B, Olivier MATZ, dev

Hi Olivier,

Apart from performance impact, one more concern:
As far as I know, the theoretical limit for PA on Intel is 52 bits.
I understand that these days no one is using more than 48 bits and it will probably stay like that for the next few years.
Though if we occupy these (MAXPHYADDR - 48) bits now, it can become a potential problem in the future.
After all, the savings from that change are not that big - only 2 bytes.
As I understand, you already save an extra 7 bytes with the other proposed modifications of mbuf.
That's enough to add TSO related information into the mbuf.
So my suggestion would be to keep phys_addr 64 bits long.
Thanks
Konstantin  

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Shaw, Jeffrey B
Sent: Friday, May 09, 2014 5:12 PM
To: Olivier MATZ; dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield

I agree, we should wait for comments then test the performance when the patches have settled.


-----Original Message-----
From: Olivier MATZ [mailto:olivier.matz@6wind.com] 
Sent: Friday, May 09, 2014 9:06 AM
To: Shaw, Jeffrey B; dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield

Hi Jeff,

Thank you for your comment.

On 05/09/2014 05:39 PM, Shaw, Jeffrey B wrote:
> have you tested this patch to see if there is a negative impact to 
> performance?

Yes, but not with testpmd. I passed our internal non-regression performance tests and it shows no difference (or below the error margin), even with low overhead processing like forwarding whatever the number of cores I use.

> Wouldn't the processor have to mask the high bytes of the physical 
> address when it is used, for example, to populate descriptors with 
> buffer addresses?  When compute bound, this could steal CPU cycles 
> away from packet processing.  I think we should understand the 
> performance trade-off in order to save these 2 bytes.

I would naively say that the cost is negligible: accessing to the length is the same as before (it's a 16 bits field) and accessing the physical address is just a mask or a shift, which should not be very long on an Intel processor (1 cycle?). This is to be compared with the number of cycles per packet in io-fwd mode, which is probably around 150 or 200.

> It would be interesting to see how throughput is impacted when the 
> workload is core-bound.  This could be accomplished by running testpmd 
> in io-fwd mode across 4x 10G ports.

I agree, this is something we could check. If you agree, let's first wait for some other comments and see if we find a consensus on the patches.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield
  2014-05-14 14:07         ` Ananyev, Konstantin
@ 2014-05-15  9:53           ` Olivier MATZ
  0 siblings, 0 replies; 51+ messages in thread
From: Olivier MATZ @ 2014-05-15  9:53 UTC (permalink / raw)
  To: Ananyev, Konstantin, Shaw, Jeffrey B, dev

Hi Konstantin,

On 05/14/2014 04:07 PM, Ananyev, Konstantin wrote:
> Apart from performance impact, one more concern:
> As I know, theoretical limit for PA on Intel is 52 bits.
> I understand that these days no-one using more than 48 bits and it probably would stay like that for next few years.
> Though if we'll occupy these (MAXPHYADDR - 48) bits now, it can become a potential problem in future.
> After all the savings from that changes are not that big - only 2 bytes.
> As I understand you already save extra 7 bytes with other proposed modifications of mbuf.
> That's enough to add TSO related information into the mbuf.
> So my suggestion would be to keep phys_addr 64bit long.

I think that 2 bytes is a good saving and will probably be useful in
the near future, as it allows us to store more info in one cache line of
64 bytes. On the other hand, the 48-bit limit (256 terabytes) will
probably be reached in several years (probably 10).
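
For reference, a sketch of the kind of packing being debated. The struct and helper names are hypothetical, and bit-fields on uint64_t are an extension accepted by GCC and Clang rather than something guaranteed by the C standard. A 48-bit field addresses 2^48 bytes (256 TiB); an address that uses bits 48..51 would not fit.

```c
#include <assert.h>
#include <stdint.h>

#define PHYSADDR_BITS 48

/* Hypothetical packed field: 48-bit physical address + 16-bit length. */
struct addr_len {
	uint64_t buf_physaddr : PHYSADDR_BITS;
	uint64_t buf_len : 16;
};

/* Returns 1 if pa can be stored without losing its high bits. */
static inline int
physaddr_fits(uint64_t pa)
{
	return (pa >> PHYSADDR_BITS) == 0;
}

/* Store both values; reading buf_physaddr back costs a mask. */
static inline void
set_addr_len(struct addr_len *f, uint64_t pa, uint16_t len)
{
	assert(physaddr_fits(pa));
	f->buf_physaddr = pa;
	f->buf_len = len;
}
```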

I agree the patch is not useful today, so I'm ok to remove it if nobody
feels it useful to keep 4 bytes for future use. In my opinion, these
4 bytes will soon be used for some new feature; looking at the Linux skb
shows that there is plenty of useful information that could go in
a network packet descriptor (timestamp, queue id, protocol, qos, ...).

Regards,
Olivier

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 01/11] igb/ixgbe: fix IP checksum calculation
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 01/11] igb/ixgbe: fix IP checksum calculation Olivier Matz
@ 2014-05-15 10:40   ` Ananyev, Konstantin
  0 siblings, 0 replies; 51+ messages in thread
From: Ananyev, Konstantin @ 2014-05-15 10:40 UTC (permalink / raw)
  To: Olivier Matz, dev



-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
Sent: Friday, May 09, 2014 3:50 PM
To: dev@dpdk.org
Subject: [dpdk-dev] [PATCH RFC 01/11] igb/ixgbe: fix IP checksum calculation

According to Intel® 82599 10 GbE Controller Datasheet (Table 7-38), both
L2 and L3 lengths are needed to offload the IP checksum.

Note that the e1000 driver does not need to be patched as it already
contains the fix.

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

 


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 11/11] ixgbe/mbuf: add TSO support
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 11/11] ixgbe/mbuf: add TSO support Olivier Matz
  2014-05-12 14:30   ` Thomas Monjalon
@ 2014-05-15 15:09   ` Ananyev, Konstantin
  2014-05-15 15:39     ` Olivier MATZ
  1 sibling, 1 reply; 51+ messages in thread
From: Ananyev, Konstantin @ 2014-05-15 15:09 UTC (permalink / raw)
  To: Olivier Matz, dev

Hi Olivier,

By design, a PMD is not supposed to touch (or even look into) the actual packet data.
That is one of the reasons why we put the l2/l3/l4_len fields into the mbuf itself.
Also it seems a bit strange to calculate one pseudo-header checksum in the upper layer and then
recalculate it again inside testpmd.
So I wonder whether it is possible to move the fix_tcp_phdr_cksum() logic into the upper layer
(testpmd pkt_burst_checksum_forward())?
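
For illustration, this is roughly what computing the IPv4 pseudo-header sum in the upper layer could look like. It is a generic sketch, not the fix_tcp_phdr_cksum() code from the patch: the function names are made up, the arguments are host-order values to keep byte order out of the picture, and whether the TCP length must be included or left out for segmentation depends on the NIC and should be checked against its datasheet.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static inline uint16_t
fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * IPv4 pseudo-header sum, not complemented. src and dst are host-order
 * addresses; l4_len is the TCP length, or 0 to leave the length out.
 */
static inline uint16_t
ipv4_phdr_sum(uint32_t src, uint32_t dst, uint8_t proto, uint16_t l4_len)
{
	uint32_t sum = 0;

	sum += (src >> 16) + (src & 0xffff);
	sum += (dst >> 16) + (dst & 0xffff);
	sum += proto;       /* zero byte followed by the protocol */
	sum += l4_len;
	return fold16(sum);
}
```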

Thanks
Konstantin

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
Sent: Friday, May 09, 2014 3:51 PM
To: dev@dpdk.org
Subject: [dpdk-dev] [PATCH RFC 11/11] ixgbe/mbuf: add TSO support

Implement TSO (TCP segmentation offload) in ixgbe driver. To delegate
the TCP segmentation to the hardware, the user has to:

- set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
  PKT_TX_IP_CKSUM and PKT_TX_TCP_CKSUM)
- fill the mbuf->hw_offload information: l2_len, l3_len, l4_len, mss
- calculate the pseudo header checksum and set it in the TCP header,
  as required when doing hardware TCP checksum offload
- set the IP checksum to 0

This approach seems generic enough to be used for other hw/drivers
in the future.
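
The first two steps of the list above can be sketched as below. The flag value, the reduced struct mbuf and the helper names are hypothetical; the real bit position and layout are defined by the patch. The checksum-related steps are not covered here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit position; the real value is defined by the patch. */
#define PKT_TX_TCP_SEG (1u << 2)

/* Hypothetical, reduced versions of the offload fields. */
struct hw_offload {
	uint8_t l2_len;
	uint8_t l3_len;
	uint8_t l4_len;
	uint16_t mss;
};

struct mbuf {
	uint32_t ol_flags;
	struct hw_offload hw_offload;
};

/*
 * TCP header length in bytes from the raw data-offset byte: the upper
 * four bits count 32-bit words, so (x >> 4) * 4 == (x & 0xf0) >> 2.
 */
static inline uint8_t
tcp_hdr_len(uint8_t data_off)
{
	return (uint8_t)((data_off & 0xf0) >> 2);
}

/* Mark the packet for segmentation and fill the header lengths. */
static inline void
request_tso(struct mbuf *m, uint8_t l2_len, uint8_t l3_len,
	    uint8_t tcp_data_off, uint16_t mss)
{
	assert(mss != 0);
	m->ol_flags |= PKT_TX_TCP_SEG;
	m->hw_offload.l2_len = l2_len;
	m->hw_offload.l3_len = l3_len;
	m->hw_offload.l4_len = tcp_hdr_len(tcp_data_off);
	m->hw_offload.mss = mss;
}
```

For an untagged Ethernet/IPv4/TCP packet without options, a call such as request_tso(m, 14, 20, 0x50, 800) would set l4_len to 20 and the MSS to 800, matching the "tso set 800 0" test below.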

In the patch, the tx_desc_cksum_flags_to_olinfo() and
tx_desc_ol_flags_to_cmdtype() functions have been reworked to make them
clearer. This does not impact performance as gcc (version 4.8 in my
case) is smart enough to convert the tests into code that does not
contain any branch instruction.

validation
==========

platform:

  Tester (linux)   <---->   DUT (DPDK)

Run testpmd on DUT:

  cd dpdk.org/
  make install T=x86_64-default-linuxapp-gcc
  cd x86_64-default-linuxapp-gcc/
  modprobe uio
  insmod kmod/igb_uio.ko
  python ../tools/igb_uio_bind.py -b igb_uio 0000:02:00.0
  echo 0 > /proc/sys/kernel/randomize_va_space
  echo 1000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
  echo 1000 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
  mount -t hugetlbfs none /mnt/huge
  ./app/testpmd -c 0x55 -n 4 -m 800 -- -i --port-topology=chained

Disable all offload features on the Tester, and start the capture:

  ethtool -K ixgbe0 rx off tx off tso off gso off gro off lro off
  ip l set ixgbe0 up
  tcpdump -n -e -i ixgbe0 -s 0 -w /tmp/cap

We use the following scapy script for testing:

  def test():
    ############### IPv4
    # checksum TCP
    p=Ether()/IP(src=RandIP(), dst=RandIP())/TCP(flags=0x10)/Raw(RandString(50))
    sendp(p, iface="ixgbe0", count=5)
    # checksum UDP
    p=Ether()/IP(src=RandIP(), dst=RandIP())/UDP()/Raw(RandString(50))
    sendp(p, iface="ixgbe0", count=5)
    # bad IP checksum
    p=Ether()/IP(src=RandIP(), dst=RandIP(), chksum=0x1234)/TCP(flags=0x10)/Raw(RandString(50))
    sendp(p, iface="ixgbe0", count=5)
    # bad TCP checksum
    p=Ether()/IP(src=RandIP(), dst=RandIP())/TCP(flags=0x10, chksum=0x1234)/Raw(RandString(50))
    sendp(p, iface="ixgbe0", count=5)
    # large packet
    p=Ether()/IP(src=RandIP(), dst=RandIP())/TCP(flags=0x10)/Raw(RandString(1400))
    sendp(p, iface="ixgbe0", count=5)
    ############### IPv6
    # checksum TCP
    p=Ether()/IPv6(src=RandIP6(), dst=RandIP6())/TCP(flags=0x10)/Raw(RandString(50))
    sendp(p, iface="ixgbe0", count=5)
    # checksum UDP
    p=Ether()/IPv6(src=RandIP6(), dst=RandIP6())/UDP()/Raw(RandString(50))
    sendp(p, iface="ixgbe0", count=5)
    # bad TCP checksum
    p=Ether()/IPv6(src=RandIP6(), dst=RandIP6())/TCP(flags=0x10, chksum=0x1234)/Raw(RandString(50))
    sendp(p, iface="ixgbe0", count=5)
    # large packet
    p=Ether()/IPv6(src=RandIP6(), dst=RandIP6())/TCP(flags=0x10)/Raw(RandString(1400))
    sendp(p, iface="ixgbe0", count=5)

Without hw cksum
----------------

On DUT:

  # disable hw cksum (use sw) in csumonly test, disable tso
  stop
  set fwd csum
  tx_checksum set 0x0 0
  tso set 0 0
  start

On tester:

  >>> test()

Then check the capture file.

With hw cksum
-------------

On DUT:

  # enable hw cksum in csumonly test, disable tso
  stop
  set fwd csum
  tx_checksum set 0xf 0
  tso set 0 0
  start

On tester:

  >>> test()

Then check the capture file.

With TSO
--------

On DUT:

  set fwd csum
  tx_checksum set 0xf 0
  tso set 800 0
  start

On tester:

  >>> test()

Then check the capture file.

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test-pmd/cmdline.c            |  45 +++++++++++
 app/test-pmd/config.c             |   8 ++
 app/test-pmd/csumonly.c           |  16 ++++
 app/test-pmd/testpmd.h            |   2 +
 lib/librte_mbuf/rte_mbuf.h        |   7 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c | 165 ++++++++++++++++++++++++++++----------
 6 files changed, 200 insertions(+), 43 deletions(-)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index a95b279..c628773 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -2305,6 +2305,50 @@ cmdline_parse_inst_t cmd_tx_cksum_set = {
 	},
 };
 
+/* *** ENABLE HARDWARE SEGMENTATION IN TX PACKETS *** */
+struct cmd_tso_set_result {
+	cmdline_fixed_string_t tso;
+	cmdline_fixed_string_t set;
+	uint16_t mss;
+	uint8_t port_id;
+};
+
+static void
+cmd_tso_set_parsed(void *parsed_result,
+		       __attribute__((unused)) struct cmdline *cl,
+		       __attribute__((unused)) void *data)
+{
+	struct cmd_tso_set_result *res = parsed_result;
+	tso_set(res->port_id, res->mss);
+}
+
+cmdline_parse_token_string_t cmd_tso_set_tso =
+	TOKEN_STRING_INITIALIZER(struct cmd_tso_set_result,
+				tso, "tso");
+cmdline_parse_token_string_t cmd_tso_set_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_tso_set_result,
+				set, "set");
+cmdline_parse_token_num_t cmd_tso_set_mss =
+	TOKEN_NUM_INITIALIZER(struct cmd_tso_set_result,
+				mss, UINT16);
+cmdline_parse_token_num_t cmd_tso_set_portid =
+	TOKEN_NUM_INITIALIZER(struct cmd_tso_set_result,
+				port_id, UINT8);
+
+cmdline_parse_inst_t cmd_tso_set = {
+	.f = cmd_tso_set_parsed,
+	.data = NULL,
+	.help_str = "Enable hardware segmentation (set MSS to 0 to disable): "
+	"tso set <MSS> <PORT>",
+	.tokens = {
+		(void *)&cmd_tso_set_tso,
+		(void *)&cmd_tso_set_set,
+		(void *)&cmd_tso_set_mss,
+		(void *)&cmd_tso_set_portid,
+		NULL,
+	},
+};
+
 /* *** ENABLE/DISABLE FLUSH ON RX STREAMS *** */
 struct cmd_set_flush_rx {
 	cmdline_fixed_string_t set;
@@ -5151,6 +5195,7 @@ cmdline_parse_ctx_t main_ctx[] = {
 	(cmdline_parse_inst_t *)&cmd_tx_vlan_set,
 	(cmdline_parse_inst_t *)&cmd_tx_vlan_reset,
 	(cmdline_parse_inst_t *)&cmd_tx_cksum_set,
+	(cmdline_parse_inst_t *)&cmd_tso_set,
 	(cmdline_parse_inst_t *)&cmd_link_flow_control_set,
 	(cmdline_parse_inst_t *)&cmd_priority_flow_control_set,
 	(cmdline_parse_inst_t *)&cmd_config_dcb,
diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index cd82f60..a6d749d 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -1455,6 +1455,14 @@ tx_cksum_set(portid_t port_id, uint32_t ol_flags)
 }
 
 void
+tso_set(portid_t port_id, uint16_t mss)
+{
+	if (port_id_is_invalid(port_id))
+		return;
+	ports[port_id].tx_mss = mss;
+}
+
+void
 fdir_add_signature_filter(portid_t port_id, uint8_t queue_id,
 			  struct rte_fdir_filter *fdir_filter)
 {
diff --git a/app/test-pmd/csumonly.c b/app/test-pmd/csumonly.c
index e93d75f..9983618 100644
--- a/app/test-pmd/csumonly.c
+++ b/app/test-pmd/csumonly.c
@@ -220,10 +220,12 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 	uint32_t ol_flags;
 	uint32_t pkt_ol_flags;
 	uint32_t tx_ol_flags;
+	uint16_t tx_mss;
 	uint16_t l4_proto;
 	uint16_t eth_type;
 	uint8_t  l2_len;
 	uint8_t  l3_len;
+	uint8_t  l4_len;
 
 	uint32_t rx_bad_ip_csum;
 	uint32_t rx_bad_l4_csum;
@@ -255,6 +257,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 
 	txp = &ports[fs->tx_port];
 	tx_ol_flags = txp->tx_ol_flags;
+	tx_mss = txp->tx_mss;
 
 	for (i = 0; i < nb_rx; i++) {
 
@@ -272,6 +275,7 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 				((uintptr_t)&eth_hdr->ether_type +
 				sizeof(struct vlan_hdr)));
 		}
+		l4_len  = 0;
 
 		/* Update the L3/L4 checksum error packet count  */
 		rx_bad_ip_csum += ((pkt_ol_flags & PKT_RX_IP_CKSUM_BAD) != 0);
@@ -347,6 +351,11 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 					tcp_hdr->cksum = get_ipv4_udptcp_checksum(ipv4_hdr,
 							(uint16_t*)tcp_hdr);
 				}
+
+				if (tx_mss != 0) {
+					ol_flags |= PKT_TX_TCP_SEG;
+					l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
+				}
 			}
 			else if (l4_proto == IPPROTO_SCTP) {
 				sctp_hdr = (struct sctp_hdr*) (rte_pktmbuf_mtod(mb,
@@ -404,6 +413,11 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 					tcp_hdr->cksum = get_ipv6_udptcp_checksum(ipv6_hdr,
 							(uint16_t*)tcp_hdr);
 				}
+
+				if (tx_mss != 0) {
+					ol_flags |= PKT_TX_TCP_SEG;
+					l4_len = (tcp_hdr->data_off & 0xf0) >> 2;
+				}
 			}
 			else if (l4_proto == IPPROTO_SCTP) {
 				sctp_hdr = (struct sctp_hdr*) (rte_pktmbuf_mtod(mb,
@@ -434,6 +448,8 @@ pkt_burst_checksum_forward(struct fwd_stream *fs)
 		/* Combine the packet header write. VLAN is not consider here */
 		mb->hw_offload.l2_len = l2_len;
 		mb->hw_offload.l3_len = l3_len;
+		mb->hw_offload.l4_len = l4_len;
+		mb->hw_offload.mss = tx_mss;
 		mb->ol_flags = ol_flags;
 	}
 	nb_tx = rte_eth_tx_burst(fs->tx_port, fs->tx_queue, pkts_burst, nb_rx);
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 77dcc30..6f567e7 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -134,6 +134,7 @@ struct rte_port {
 	struct fwd_stream       *tx_stream; /**< Port TX stream, if unique */
 	unsigned int            socket_id;  /**< For NUMA support */
 	uint32_t                tx_ol_flags;/**< Offload Flags of TX packets. */
+	uint16_t                tx_mss;     /**< MSS for segmentation offload. */
 	uint16_t                tx_vlan_id; /**< Tag Id. in TX VLAN packets. */
 	void                    *fwd_ctx;   /**< Forwarding mode context */
 	uint64_t                rx_bad_ip_csum; /**< rx pkts with bad ip checksum  */
@@ -480,6 +481,7 @@ void tx_vlan_reset(portid_t port_id);
 void set_qmap(portid_t port_id, uint8_t is_rx, uint16_t queue_id, uint8_t map_value);
 
 void tx_cksum_set(portid_t port_id, uint32_t ol_flags);
+void tso_set(portid_t port_id, uint16_t mss);
 
 void set_verbose_level(uint16_t vb_level);
 void set_tx_pkt_segments(unsigned *seg_lengths, unsigned nb_segs);
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index d71c86c..75298bd 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -96,6 +96,7 @@ extern "C" {
 #define PKT_TX_SCTP_CKSUM    0x00080000 /**< SCTP cksum of TX pkt. computed by NIC. */
 #define PKT_TX_UDP_CKSUM     0x000C0000 /**< UDP cksum of TX pkt. computed by NIC. */
 #define PKT_TX_IEEE1588_TMST 0x00100000 /**< TX IEEE1588 packet to timestamp. */
+#define PKT_TX_TCP_SEG       0x00200000 /**< TCP segmentation offload. */
 
 /**
  * Get the name of a RX offload flag
@@ -140,6 +141,7 @@ static inline const char *rte_get_tx_ol_flag_name(uint32_t mask)
 	case PKT_TX_SCTP_CKSUM: return "PKT_TX_SCTP_CKSUM";
 	case PKT_TX_UDP_CKSUM: return "PKT_TX_UDP_CKSUM";
 	case PKT_TX_IEEE1588_TMST: return "PKT_TX_IEEE1588_TMST";
+	case PKT_TX_TCP_SEG: return "PKT_TX_TCP_SEG";
 	default: return NULL;
 	}
 }
@@ -153,11 +155,12 @@ union rte_hw_offload {
 #define HW_OFFLOAD_L4_LEN_MASK 0xff
 		uint32_t l2_len:7; /**< L2 (MAC) Header Length. */
 		uint32_t l3_len:9; /**< L3 (IP) Header Length. */
-		uint32_t reserved:16;
+		uint32_t l4_len:8; /**< L4 (TCP/UDP) Header Length. */
+		uint32_t reserved:8;
 
 		uint16_t vlan_tci;
 		/**< VLAN Tag Control Identifier (CPU order). */
-		uint16_t reserved2;
+		uint16_t mss; /**< Maximum segment size. */
 	};
 };
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index d52482e..75ff16e 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -347,13 +347,59 @@ ixgbe_xmit_pkts_simple(void *tx_queue, struct rte_mbuf **tx_pkts,
 	return nb_tx;
 }
 
+/* When doing TSO, the IP length must not be included in the pseudo
+ * header checksum of the packet given to the hardware */
+static inline void
+ixgbe_fix_tcp_phdr_cksum(struct rte_mbuf *m)
+{
+	char *data;
+	uint16_t *cksum_ptr;
+	uint16_t prev_cksum;
+	uint16_t new_cksum;
+	uint16_t ip_len, ip_paylen;
+	uint32_t tmp;
+	uint8_t ip_version;
+
+	/* get phdr cksum at offset 16 of TCP header */
+	data = rte_pktmbuf_mtod(m, char *);
+	cksum_ptr = (uint16_t *)(data + m->hw_offload.l2_len +
+		m->hw_offload.l3_len + 16);
+	prev_cksum = *cksum_ptr;
+
+	/* get ip_version */
+	ip_version = (*(uint8_t *)(data + m->hw_offload.l2_len)) >> 4;
+
+	/* get ip_len at offset 2 of IP header or offset 4 of IPv6 header */
+	if (ip_version == 4) {
+		/* override ip cksum to 0 */
+		data[m->hw_offload.l2_len + 10] = 0;
+		data[m->hw_offload.l2_len + 11] = 0;
+
+		ip_len = *(uint16_t *)(data + m->hw_offload.l2_len + 2);
+		ip_paylen = rte_cpu_to_be_16(rte_be_to_cpu_16(ip_len) -
+			m->hw_offload.l3_len);
+	} else {
+		ip_paylen = *(uint16_t *)(data + m->hw_offload.l2_len + 4);
+	}
+
+	/* calculate the new phdr checksum that doesn't include ip_paylen */
+	tmp = prev_cksum ^ 0xffff;
+	if (tmp < ip_paylen)
+		tmp += 0xffff;
+	tmp -= ip_paylen;
+	new_cksum = tmp;
+
+	/* replace it in the packet */
+	*cksum_ptr = new_cksum;
+}
+
 static inline void
 ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
 		volatile struct ixgbe_adv_tx_context_desc *ctx_txd,
 		uint32_t ol_flags, union rte_hw_offload hw_offload)
 {
 	uint32_t type_tucmd_mlhl;
-	uint32_t mss_l4len_idx;
+	uint32_t mss_l4len_idx = 0;
 	uint32_t ctx_idx;
 	uint32_t vlan_macip_lens;
 	union rte_hw_offload offload_mask;
@@ -362,44 +408,61 @@ ixgbe_set_xmit_ctx(struct igb_tx_queue* txq,
 	offload_mask.u64 = 0;
 	type_tucmd_mlhl = 0;
 
+	/* Specify which HW CTX to upload. */
+	mss_l4len_idx |= (ctx_idx << IXGBE_ADVTXD_IDX_SHIFT);
+
 	if (ol_flags & PKT_TX_VLAN_PKT) {
 		offload_mask.vlan_tci = 0xffff;
 	}
 
-	if (ol_flags & PKT_TX_IP_CKSUM) {
-		type_tucmd_mlhl = IXGBE_ADVTXD_TUCMD_IPV4;
+	/* check if TCP segmentation required for this packet */
+	if (ol_flags & PKT_TX_TCP_SEG) {
+		/* implies IP cksum and TCP cksum */
+		type_tucmd_mlhl = IXGBE_ADVTXD_TUCMD_IPV4 |
+			IXGBE_ADVTXD_TUCMD_L4T_TCP |
+			IXGBE_ADVTXD_DTYP_CTXT | IXGBE_ADVTXD_DCMD_DEXT;
+
 		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
 		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
-	}
+		offload_mask.l4_len = HW_OFFLOAD_L4_LEN_MASK;
+		offload_mask.mss = 0xffff;
+		mss_l4len_idx |= hw_offload.mss << IXGBE_ADVTXD_MSS_SHIFT;
+		mss_l4len_idx |= hw_offload.l4_len << IXGBE_ADVTXD_L4LEN_SHIFT;
+	} else { /* no TSO, check if hardware checksum is needed */
+		if (ol_flags & PKT_TX_IP_CKSUM) {
+			type_tucmd_mlhl = IXGBE_ADVTXD_TUCMD_IPV4;
+			offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+			offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
+		}
 
-	/* Specify which HW CTX to upload. */
-	mss_l4len_idx = (ctx_idx << IXGBE_ADVTXD_IDX_SHIFT);
-	switch (ol_flags & PKT_TX_L4_MASK) {
-	case PKT_TX_UDP_CKSUM:
-		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_UDP |
+		switch (ol_flags & PKT_TX_L4_MASK) {
+		case PKT_TX_UDP_CKSUM:
+			type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_UDP |
 				IXGBE_ADVTXD_DTYP_CTXT | IXGBE_ADVTXD_DCMD_DEXT;
-		mss_l4len_idx |= sizeof(struct udp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
-		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
-		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
-		break;
-	case PKT_TX_TCP_CKSUM:
-		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP |
+			mss_l4len_idx |= sizeof(struct udp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
+			offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+			offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
+			break;
+		case PKT_TX_TCP_CKSUM:
+			type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP |
 				IXGBE_ADVTXD_DTYP_CTXT | IXGBE_ADVTXD_DCMD_DEXT;
-		mss_l4len_idx |= sizeof(struct tcp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
-		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
-		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
-		break;
-	case PKT_TX_SCTP_CKSUM:
-		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_SCTP |
+			mss_l4len_idx |= sizeof(struct tcp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
+			offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+			offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
+			offload_mask.l4_len = HW_OFFLOAD_L4_LEN_MASK;
+			break;
+		case PKT_TX_SCTP_CKSUM:
+			type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_SCTP |
 				IXGBE_ADVTXD_DTYP_CTXT | IXGBE_ADVTXD_DCMD_DEXT;
-		mss_l4len_idx |= sizeof(struct sctp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
-		offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
-		offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
-		break;
-	default:
-		type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_RSV |
+			mss_l4len_idx |= sizeof(struct sctp_hdr) << IXGBE_ADVTXD_L4LEN_SHIFT;
+			offload_mask.l2_len = HW_OFFLOAD_L2_LEN_MASK;
+			offload_mask.l3_len = HW_OFFLOAD_L3_LEN_MASK;
+			break;
+		default:
+			type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_RSV |
 				IXGBE_ADVTXD_DTYP_CTXT | IXGBE_ADVTXD_DCMD_DEXT;
-		break;
+			break;
+		}
 	}
 
 	txq->ctx_cache[ctx_idx].flags = ol_flags;
@@ -446,20 +509,25 @@ what_advctx_update(struct igb_tx_queue *txq, uint32_t flags,
 static inline uint32_t
 tx_desc_cksum_flags_to_olinfo(uint32_t ol_flags)
 {
-	static const uint32_t l4_olinfo[2] = {0, IXGBE_ADVTXD_POPTS_TXSM};
-	static const uint32_t l3_olinfo[2] = {0, IXGBE_ADVTXD_POPTS_IXSM};
-	uint32_t tmp;
-
-	tmp  = l4_olinfo[(ol_flags & PKT_TX_L4_MASK)  != PKT_TX_L4_NO_CKSUM];
-	tmp |= l3_olinfo[(ol_flags & PKT_TX_IP_CKSUM) != 0];
+	uint32_t tmp = 0;
+	if ((ol_flags & PKT_TX_L4_MASK) != PKT_TX_L4_NO_CKSUM)
+		tmp |= IXGBE_ADVTXD_POPTS_TXSM;
+	if (ol_flags & PKT_TX_IP_CKSUM)
+		tmp |= IXGBE_ADVTXD_POPTS_IXSM;
+	if (ol_flags & PKT_TX_TCP_SEG)
+		tmp |= IXGBE_ADVTXD_POPTS_TXSM | IXGBE_ADVTXD_POPTS_IXSM;
 	return tmp;
 }
 
 static inline uint32_t
-tx_desc_vlan_flags_to_cmdtype(uint32_t ol_flags)
+tx_desc_ol_flags_to_cmdtype(uint32_t ol_flags)
 {
-	static const uint32_t vlan_cmd[2] = {0, IXGBE_ADVTXD_DCMD_VLE};
-	return vlan_cmd[(ol_flags & PKT_TX_VLAN_PKT) != 0];
+	uint32_t cmdtype = 0;
+	if (ol_flags & PKT_TX_VLAN_PKT)
+		cmdtype |= IXGBE_ADVTXD_DCMD_VLE;
+	if (ol_flags & PKT_TX_TCP_SEG)
+		cmdtype |= IXGBE_ADVTXD_DCMD_TSE;
+	return cmdtype;
 }
 
 /* Default RS bit threshold values */
@@ -583,7 +651,8 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 
 		/* If hardware offload required */
 		tx_ol_req = ol_flags &
-			(PKT_TX_VLAN_PKT | PKT_TX_IP_CKSUM | PKT_TX_L4_MASK);
+			(PKT_TX_VLAN_PKT | PKT_TX_IP_CKSUM | PKT_TX_L4_MASK |
+			PKT_TX_TCP_SEG);
 		if (tx_ol_req) {
 			/* If new context need be built or reuse the exist ctx. */
 			ctx = what_advctx_update(txq, tx_ol_req,
@@ -702,7 +771,20 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 		 */
 		cmd_type_len = IXGBE_ADVTXD_DTYP_DATA |
 			IXGBE_ADVTXD_DCMD_IFCS | IXGBE_ADVTXD_DCMD_DEXT;
+
+		if (ol_flags & PKT_TX_TCP_SEG) {
+			/* paylen in descriptor is not the packet
+			 * len but the tcp payload len if TSO is on */
+			pkt_len -= (hw_offload.l2_len + hw_offload.l3_len +
+				hw_offload.l4_len);
+
+			/* the pseudo header checksum must be modified:
+			 * it should not include the ip_len */
+			ixgbe_fix_tcp_phdr_cksum(tx_pkt);
+		}
+
 		olinfo_status = (pkt_len << IXGBE_ADVTXD_PAYLEN_SHIFT);
+
 #ifdef RTE_LIBRTE_IEEE1588
 		if (ol_flags & PKT_TX_IEEE1588_TMST)
 			cmd_type_len |= IXGBE_ADVTXD_MAC_1588;
@@ -741,7 +823,7 @@ ixgbe_xmit_pkts(void *tx_queue, struct rte_mbuf **tx_pkts,
 			 * This path will go through
 			 * whatever new/reuse the context descriptor
 			 */
-			cmd_type_len  |= tx_desc_vlan_flags_to_cmdtype(ol_flags);
+			cmd_type_len  |= tx_desc_ol_flags_to_cmdtype(ol_flags);
 			olinfo_status |= tx_desc_cksum_flags_to_olinfo(ol_flags);
 			olinfo_status |= ctx << IXGBE_ADVTXD_IDX_SHIFT;
 		}
@@ -3420,9 +3502,10 @@ ixgbe_dev_tx_init(struct rte_eth_dev *dev)
 	PMD_INIT_FUNC_TRACE();
 	hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
 
-	/* Enable TX CRC (checksum offload requirement) */
+	/* Enable TX CRC (checksum offload requirement) and hw padding
+	 * (TSO requirement) */
 	hlreg0 = IXGBE_READ_REG(hw, IXGBE_HLREG0);
-	hlreg0 |= IXGBE_HLREG0_TXCRCEN;
+	hlreg0 |= (IXGBE_HLREG0_TXCRCEN | IXGBE_HLREG0_TXPADEN);
 	IXGBE_WRITE_REG(hw, IXGBE_HLREG0, hlreg0);
 
 	/* Setup the Base and Length of the Tx Descriptor Rings */
-- 
1.9.2

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 11/11] ixgbe/mbuf: add TSO support
  2014-05-15 15:09   ` Ananyev, Konstantin
@ 2014-05-15 15:39     ` Olivier MATZ
  2014-05-15 16:30       ` Ananyev, Konstantin
  0 siblings, 1 reply; 51+ messages in thread
From: Olivier MATZ @ 2014-05-15 15:39 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev

Hi Konstantin,

On 05/15/2014 05:09 PM, Ananyev, Konstantin wrote:
> By design PMD not supposed to touch (or even look) into actual packet's data.

I agree on the principle, we should avoid that as much as possible.

> That is one of the reason why we put l2/l3/l4_len fields into the mbuf itself.
> Also it seems a bit strange to calculate one pseudo-header checksum in the upper layer and then
> Recalculate it again inside testpmd/
> So I wonder is it possible to move fix_tcp_phdr_cksum() logic into the upper layer
> (testpmd pkt_burst_checksum_forward())?

The reason why I did this is to define a generic PMD API for TSO:

Today, if you want to offload checksum calculation on tx, the
network stack has to calculate the pseudo header checksum. Even if
it could be calculated by the hardware, it's not a problem to
calculate it by software.

Now, because of the hardware, if you do TCP segmentation offload, you
have to calculate the pseudo-header checksum without including the
ip len. It seems to me that it adds some complexity because, depending
on the hardware configuration, the stack has to behave differently.

It seemed more logical to me that the network code that generates TCP
packets should be identical whatever the driver options. Moreover, if
tomorrow we add another PMD that needs the pseudo-header with ip_len in
all cases (TSO or not), we would have to write the same kind of function
as fix_tcp_phdr_cksum().

So, I tried to define the API in order to simplify the work of the
network stack developer, even if it does not map to the hardware
behavior. Nevertheless, I'm open to discussing it.

Regards,
Olivier

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 11/11] ixgbe/mbuf: add TSO support
  2014-05-15 15:39     ` Olivier MATZ
@ 2014-05-15 16:30       ` Ananyev, Konstantin
  2014-05-16 12:11         ` Olivier MATZ
  0 siblings, 1 reply; 51+ messages in thread
From: Ananyev, Konstantin @ 2014-05-15 16:30 UTC (permalink / raw)
  To: Olivier MATZ, dev

Hi Olivier,

With the current DPDK implementation the upper code would still be different for TCP checksum (without segmentation) and TCP segmentation:
the mbuf flags differ, and with TSO you need to set up the l4_len and mss fields inside the mbuf, whereas with just checksum you don't.
Plus, as I said, it is a bit confusing to calculate PSD csum once in the stack and then re-calculate in PMD.
Again - unnecessary slowdown.
So why not just have get_ipv4_psd_sum() and get_ipv4_psd_tso_sum() inside testpmd/csumonly.c and call them accordingly?
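
A minimal sketch of what two such helpers could compute, for illustration only. The names ex_ipv4_psd_sum() and ex_ipv4_psd_tso_sum(), their signatures and the host-order, uncomplemented return value are assumptions made for this example; they are not the testpmd functions. The only difference between the two is whether the TCP length is added to the pseudo-header sum.

```c
#include <stddef.h>
#include <stdint.h>

/* Add the 16-bit big-endian words of a buffer to a 32-bit accumulator. */
static uint32_t
ex_sum16_be(const uint8_t *p, size_t len, uint32_t acc)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		acc += ((uint32_t)p[i] << 8) | p[i + 1];
	if (len & 1)
		acc += (uint32_t)p[len - 1] << 8;
	return acc;
}

/* Fold the carries back in (one's complement addition). */
static uint16_t
ex_fold16(uint32_t acc)
{
	while (acc >> 16)
		acc = (acc & 0xffff) + (acc >> 16);
	return (uint16_t)acc;
}

/* Pseudo-header sum for plain checksum offload:
 * source address, destination address, protocol and L4 length. */
uint16_t
ex_ipv4_psd_sum(const uint8_t src[4], const uint8_t dst[4],
		uint8_t proto, uint16_t l4_len)
{
	uint32_t acc = 0;

	acc = ex_sum16_be(src, 4, acc);
	acc = ex_sum16_be(dst, 4, acc);
	acc += proto;
	acc += l4_len;
	return ex_fold16(acc);
}

/* Pseudo-header sum for TSO: same as above, but the length is left
 * out because the hardware adds it for each segment it generates. */
uint16_t
ex_ipv4_psd_tso_sum(const uint8_t src[4], const uint8_t dst[4],
		uint8_t proto)
{
	uint32_t acc = 0;

	acc = ex_sum16_be(src, 4, acc);
	acc = ex_sum16_be(dst, 4, acc);
	acc += proto;
	return ex_fold16(acc);
}
```

The caller would pick one or the other depending on whether it requests segmentation for the packet.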

About the potential future problem with NICs that implement TX checksum/segmentation offloads in a different manner - yes, that's true...
I think in the end all that logic could be hidden inside some function at the rte_ethdev level, something like: rte_eth_dev_prep_tx(portid, mbuf[], num).
So, based on the mbuf TX offload flags and device type, it would make the necessary modifications inside the packet.
But that's a future discussion, I suppose.
For now, I still think we need to keep pseudo checksum calculations out of PMD code.

Konstantin


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 11/11] ixgbe/mbuf: add TSO support
  2014-05-15 16:30       ` Ananyev, Konstantin
@ 2014-05-16 12:11         ` Olivier MATZ
  2014-05-16 17:01           ` Ananyev, Konstantin
  0 siblings, 1 reply; 51+ messages in thread
From: Olivier MATZ @ 2014-05-16 12:11 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev

Hi Konstantin,

On 05/15/2014 06:30 PM, Ananyev, Konstantin wrote:
> With the current DPDK implementation the upper code would still be different for TCP checksum (without segmentation) and TCP segmentation:
> different flags in mbuf, with TSO you need to setup l4_len and mss fields inside mbuf, with just checksum - you don't.

You are right on this point.
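
To make the difference concrete, here is a small sketch of the two requests. It is only an illustration: struct ex_tx_offload is a simplified stand-in for the mbuf offload fields, the helper names are hypothetical, and the value used for the TCP checksum flag is an assumption (only the TSO flag value is taken from the patch).

```c
#include <stdint.h>

/* Illustrative flag values; only the TSO value comes from the patch. */
#define EX_TX_TCP_CKSUM 0x00040000u /* assumed value */
#define EX_TX_TCP_SEG   0x00200000u /* PKT_TX_TCP_SEG in the patch */

/* Simplified stand-in for the mbuf offload fields (hypothetical type). */
struct ex_tx_offload {
	uint32_t ol_flags;
	uint32_t l2_len:7;   /* L2 (MAC) header length */
	uint32_t l3_len:9;   /* L3 (IP) header length */
	uint32_t l4_len:8;   /* L4 (TCP) header length, TSO only */
	uint32_t reserved:8;
	uint16_t mss;        /* maximum segment size, TSO only */
};

/* TCP header length in bytes from the data offset byte: the upper
 * 4 bits count 32-bit words, so shift right by 4 and multiply by 4. */
static uint8_t
ex_tcp_hdr_len(uint8_t data_off)
{
	return (uint8_t)((data_off & 0xf0) >> 2);
}

/* Checksum offload only: flags plus L2/L3 lengths. */
void
ex_request_tcp_cksum(struct ex_tx_offload *o, uint8_t l2_len,
		uint16_t l3_len)
{
	o->ol_flags |= EX_TX_TCP_CKSUM;
	o->l2_len = l2_len;
	o->l3_len = l3_len;
}

/* TSO: a different flag, and l4_len and mss must be filled as well. */
void
ex_request_tso(struct ex_tx_offload *o, uint8_t l2_len, uint16_t l3_len,
		uint8_t tcp_data_off, uint16_t mss)
{
	o->ol_flags |= EX_TX_TCP_SEG;
	o->l2_len = l2_len;
	o->l3_len = l3_len;
	o->l4_len = ex_tcp_hdr_len(tcp_data_off);
	o->mss = mss;
}
```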

> Plus, as I said, it is a bit confusing to calculate PSD csum once in the stack and then re-calculate in PMD.
> Again - unnecessary slowdown.
> So why not just have get_ipv4_psd_sum() and get_ipv4_psd_tso_sum() inside testpmd/csumonly.c and call them accordingly?

Yes, recalculating the pseudo-header checksum without the ip_len
is a slowdown. This slowdown should however be compared to the
operation in progress. When you do TSO, you are generally transmitting
a large TCP packet (several KB), and the cost of the TCP stack is
probably much higher than fixing the checksum. But that's not the
main argument: my idea was to choose the proper API that will
reduce the slowdown for most cases.

Let's take the case of a future vnic pmd driver supporting an emulation
of TSO. In this case, the calculation of the pseudo header is also an
unnecessary slowdown.

Also, some other hardware I've seen doesn't need a different
pseudo header checksum when doing TSO.

Last argument, the way Linux works is the same as what I've
implemented. See the linux ixgbe driver [1] at line 6461: there is a
call to csum_tcpudp_magic() which recomputes the checksum without
the ip_len.
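
As an illustration of why the fix is cheap, removing the length from an already computed pseudo-header sum is a single one's complement subtraction. The sketch below is not the code from the patch or from Linux: ex_csum_remove16() is a hypothetical helper that works on a folded, uncomplemented 16-bit sum in host byte order.

```c
#include <stdint.h>

/* Remove one 16-bit word from a folded one's complement sum.
 * Subtracting in one's complement arithmetic is the same as adding
 * the bitwise complement of the word and folding the carry back in.
 * Note that zero may come out as 0xffff, which is equivalent to
 * 0x0000 in one's complement arithmetic. */
uint16_t
ex_csum_remove16(uint16_t sum, uint16_t word)
{
	uint32_t acc = (uint32_t)sum + (uint16_t)~word;

	/* at most one carry can occur, so a single fold is enough */
	acc = (acc & 0xffff) + (acc >> 16);
	return (uint16_t)acc;
}
```

For example, with a pseudo-header sum of 0x8542 that includes a length of 1000 (0x03e8), removing the length gives 0x815a, which is the sum of the addresses and protocol alone.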

On the other hand, it's true that today ixgbe is the only hardware
supporting TSO in DPDK. The pragmatic approach could be to choose the
API that gives the best performance with what we have (the ixgbe
driver). I'm ok with this approach if we agree to reconsider the API
(and maybe modify it) when another PMD supporting TSO is
implemented.

> About potential future problem with NICs that implement TX checksum/segmentation offloads in a different manner - yeh that's true...
> I think at the final point all that logic could be hidden inside some function at rte_ethdev level, something like:  rte_eth_dev_prep_tx(portid, mbuf[], num).

I don't see the real difference between:

   rte_eth_dev_prep_tx(portid, mbuf[], num)
   rte_eth_dev_tx(portid, mbuf[], num)

and:

   rte_eth_dev_tx(portid, mbuf[], num) /* the tx does the cksum job */

And the second is faster because there is only one pointer dereference.

> So,  based on mbuf TX offload flags and device type, it would do necessary modifications inside the packet.
> But that's future discussion, I suppose.

To me, it's not an option for the network stack to fill the
mbuf differently depending on the device type. For instance, when doing
ethernet bonding or bridging, the stack may not know which physical
device will be used at the end. So the API to enable TSO on a packet
has to be the same whatever the device.

If fixing the checksum in the PMD is an unnecessary slowdown, forcing
the network stack to check what has to be filled in the mbuf depending
on the device type also has a cost.

> For now, I still think we need to keep pseudo checksum calculations out of PMD code.

To me, there are 2 options:

1/ Update patch to calculate the pseudo header without the ip_len when
    doing TSO. In this case the API is mapped on ixgbe behavior,
    probably providing the best performance today. If another PMD comes
    in the future, this API may change to something more generic.

2/ Try to define a generic API today, accepting that the first driver
    that supports TSO is a bit slower, but reducing the risks of changing
    the API for TSO in the future.

I'm fine with both options.

Regards,
Olivier

[1] 
http://lxr.free-electrons.com/source/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c?v=3.14#L6434

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 11/11] ixgbe/mbuf: add TSO support
  2014-05-16 12:11         ` Olivier MATZ
@ 2014-05-16 17:01           ` Ananyev, Konstantin
  2014-05-19 12:32             ` Thomas Monjalon
  0 siblings, 1 reply; 51+ messages in thread
From: Ananyev, Konstantin @ 2014-05-16 17:01 UTC (permalink / raw)
  To: Olivier MATZ, dev


Hi Olivier,

>Yes, recalculating the pseudo-header checksum without the ip_len
>is a slow down. This slow down should however be compared to the
>operation in progress. When you do TSO, you are generally transmitting
>a large TCP packet (several KB), and the cost of the TCP stack is
>probably much higher than fixing the checksum.

You can't always predict the context in which the PMD TX routine will be called.
Consider the scenario: one core does IO over several ports, while a few other cores do upper layer processing of the packets.
In that case, pseudo-header checksum (re)calculation inside the PMD TX function will slow down not only that particular packet flow,
but the RX/TX over all ports that are managed by the given core.
That's why I think that

rte_eth_dev_prep_tx(portid, mbuf[], num)
rte_eth_dev_tx(portid, mbuf[], num)

might have an advantage over

rte_eth_dev_tx(portid, mbuf[], num) /* the tx does the cksum job */

As it gives us the freedom to choose: do prep_tx() either in the same execution context as the actual tx() or in a different one.
Though yes, it comes with a price: an extra function call with all the corresponding drawbacks.
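
A rough sketch of the two-stage idea, under stated assumptions: struct ex_pkt, struct ex_dev and all ex_* functions below are hypothetical and exist only for this example; nothing here is an existing rte_ethdev call. The device provides an optional prep_tx hook that adjusts the packet for its hardware, and the caller decides on which core to run it.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_TX_TCP_SEG 0x00200000u

/* Hypothetical packet descriptor, reduced to what the example needs. */
struct ex_pkt {
	uint32_t ol_flags;
	uint16_t psd_csum;  /* pseudo-header sum set by the stack (host order) */
	uint16_t ip_paylen; /* IP payload length included in psd_csum */
};

/* Hypothetical per-device operations. */
struct ex_dev {
	void (*prep_tx)(struct ex_pkt *pkts[], size_t n); /* may be NULL */
	size_t (*tx)(struct ex_pkt *pkts[], size_t n);
};

/* One's complement subtraction of a 16-bit word. */
static uint16_t
ex_sub16(uint16_t sum, uint16_t word)
{
	uint32_t acc = (uint32_t)sum + (uint16_t)~word;

	acc = (acc & 0xffff) + (acc >> 16);
	return (uint16_t)acc;
}

/* Device-specific preparation: this imaginary hardware wants the
 * pseudo-header sum without the length when segmenting. */
static void
ex_hw_prep_tx(struct ex_pkt *pkts[], size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (pkts[i]->ol_flags & EX_TX_TCP_SEG)
			pkts[i]->psd_csum = ex_sub16(pkts[i]->psd_csum,
					pkts[i]->ip_paylen);
	}
}

/* Transmit stub: pretends every packet was queued. */
static size_t
ex_hw_tx(struct ex_pkt *pkts[], size_t n)
{
	(void)pkts;
	return n;
}

const struct ex_dev ex_hw_dev = { ex_hw_prep_tx, ex_hw_tx };

/* Generic wrappers: prep can run on a worker core, tx on the IO core. */
void
ex_dev_prep_tx(const struct ex_dev *dev, struct ex_pkt *pkts[], size_t n)
{
	if (dev->prep_tx != NULL)
		dev->prep_tx(pkts, n);
}

size_t
ex_dev_tx(const struct ex_dev *dev, struct ex_pkt *pkts[], size_t n)
{
	return dev->tx(pkts, n);
}
```

With this split the IO core only pays for ex_dev_tx(); the price is the extra indirect call on whichever core runs ex_dev_prep_tx().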

Anyway, right now we could probably argue for a while trying to define what a generic TX HW offload API should look like.
So, from your options list:

>1/ Update patch to calculate the pseudo header without the ip_len when
>    doing TSO. In this case the API is mapped on ixgbe behavior,
>    probably providing the best performance today. If another PMD comes
>   in the future, this API may change to something more generic.

>2/ Try to define a generic API today, accepting that the first driver
>    that supports TSO is a bit slower, but reducing the risks of changing
>   the API for TSO in the future.

If #1 means moving the pseudo checksum calculation out of PMD code, then my vote would be for it.

Konstantin


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield
  2014-05-09 16:11       ` Shaw, Jeffrey B
  2014-05-14 14:07         ` Ananyev, Konstantin
@ 2014-05-19  7:27         ` Olivier MATZ
  2014-05-19  8:25           ` Richardson, Bruce
  1 sibling, 1 reply; 51+ messages in thread
From: Olivier MATZ @ 2014-05-19  7:27 UTC (permalink / raw)
  To: Shaw, Jeffrey B, dev

[-- Attachment #1: Type: text/plain, Size: 2147 bytes --]

Hi Jeff,

On 05/09/2014 06:11 PM, Shaw, Jeffrey B wrote:
> I agree, we should wait for comments then test the performance when the patches have settled.

Here are some performance numbers I've measured with the TSO
patches. The test platform is:

+-----------+           +-----------+
|           |           |           |
| traffic   |-----------| dpdk      |
| generator |-----------| testpmd   |
|           |-----------|           |
|           |-----------|           |
|           |           |           |
+-----------+           +-----------+

- 4 ixgbe ports
- Sandy Bridge at 2.7 GHz

I've only included numbers for pkt_size=64. Other packet sizes
do not bring more information in this case.

I have 4 test cases:

- testpmd in iofwd mode with normal tx/rx function
- testpmd in iofwd mode with simple tx/rx function (txqflags=0xf01)
- testpmd in macfwd mode with normal tx/rx function
- testpmd in macfwd mode with simple tx/rx function (txqflags=0xf01)

I tested this for 1c1t, 1c2t, 2c2t, 2c4t, 4c8t on the following versions:

- dpdk.org head
- dpdk.org + tso patches up to 6/11 (included): this includes all mbuf
   reworks (data_offset instead of data, remove ctrl mbuf, use 48 bits
   physical address)
- dpdk.org + all tso series

The conclusion of the tests is:

Patches up to 6/11 do not bring any performance regression. On the
other hand, the full TSO patch series introduces a small performance
regression (usually corresponding to ~5 cycles per packet). This is
probably due to additional tests related to TSO done in the driver.
I suppose that this performance loss is acceptable if we consider
that TSO will bring a huge performance enhancement for many real use
cases.
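
To give an order of magnitude for what ~5 cycles per packet means, the arithmetic can be sketched as below. The function names are hypothetical and the baseline packet rate used in the example is an assumption chosen for illustration, not one of the measured values. The 20 bytes added per frame are the standard Ethernet preamble, start-of-frame delimiter and inter-frame gap.

```c
/* Theoretical packet rate of a link for a given frame size in bytes,
 * counting 20 extra bytes per frame on the wire (preamble, start of
 * frame delimiter and inter-frame gap). */
double
ex_line_rate_pps(double link_bps, unsigned int frame_bytes)
{
	return link_bps / ((frame_bytes + 20) * 8.0);
}

/* Cycle budget per packet for a core running at hz and forwarding
 * pps packets per second. */
double
ex_cycles_per_pkt(double hz, double pps)
{
	return hz / pps;
}

/* Packet rate once every packet costs extra_cycles more. */
double
ex_pps_after_overhead(double hz, double pps, double extra_cycles)
{
	return hz / (hz / pps + extra_cycles);
}
```

For instance, 64-byte frames on a 10 Gb/s link give about 14.88 Mpps. A 2.7 GHz core forwarding an assumed 10 Mpps has 270 cycles per packet; adding 5 cycles brings it to about 9.82 Mpps, a loss of a little under 2%.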

By the way, I found lower numbers in macfwd mode + simple rx/tx
with current version (without my patches) with 1c2t. It seems
reproducible.

I'll soon provide a v2 that will include:

- the split of patch 6/11 (cosmetics vs functional)
- the split of patch 11/11 (ixgbe vs generic changes)
- new checksum flags PKT_RX_L4_CKSUM_GOOD and PKT_RX_IP_CKSUM_GOOD
   proposed by Stephen
- modifications of external PMDs (memnic, virtio, vmxnet3)

Regards,
Olivier

[-- Attachment #2: iofwd_normalrxtx.png --]
[-- Type: image/png, Size: 15340 bytes --]

[-- Attachment #3: iofwd_simplerxtx.png --]
[-- Type: image/png, Size: 15346 bytes --]

[-- Attachment #4: macfwd_normalrxtx.png --]
[-- Type: image/png, Size: 15378 bytes --]

[-- Attachment #5: macfwd_simplerxtx.png --]
[-- Type: image/png, Size: 15345 bytes --]

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield
  2014-05-19  7:27         ` Olivier MATZ
@ 2014-05-19  8:25           ` Richardson, Bruce
  2014-05-19  9:30             ` Olivier MATZ
  0 siblings, 1 reply; 51+ messages in thread
From: Richardson, Bruce @ 2014-05-19  8:25 UTC (permalink / raw)
  To: Olivier MATZ, Shaw, Jeffrey B, dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier MATZ
> Sent: Monday, May 19, 2014 8:27 AM
> To: Shaw, Jeffrey B; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in
> a bitfield
> 
> Hi Jeff,
> 
> On 05/09/2014 06:11 PM, Shaw, Jeffrey B wrote:
> > I agree, we should wait for comments then test the performance when the
> patches have settled.
> 
> Here are some performance numbers I've measured with the TSO
> patches. The test platform is:
> 
> +-----------+           +-----------+
> |           |           |           |
> | traffic   |-----------| dpdk      |
> | generator |-----------| testpmd   |
> |           |-----------|           |
> |           |-----------|           |
> |           |           |           |
> +-----------+           +-----------+
> 
> - 4 ixgbe ports
> - sandy bridge at 2.7 Ghz
> 
> I've only included numbers for pkt_size=64. Other packet sizes
> do not bring more information in this case.
> 
> I have 4 test cases:
> 
> - testpmd in iofwd mode with normal tx/rx function
> - testpmd in iofwd mode with simple tx/rx function (txqflags=0xf01)
> - testpmd in macfwd mode with normal tx/rx function
> - testpmd in macfwd mode with simple tx/rx function (txqflags=0xf01)
> 
> I tested this for 1c1t, 1c2t, 2c2t, 2c4t, 4c8t on the following version:
> 
> - dpdk.org head
> - dpdk.org + tso patchs until 6/11 (included): it includes all mbuf
>    reworks (data_offset instead of data, remove ctrl mbuf, use 48 bits
>    physical address)
> - dpdk.org + all tso series

Hi Olivier,
Can you perhaps also include the specific testpmd parameters you used in your tests, as they can have a large effect on performance? On my Sandy Bridge platform, here are the testpmd flags I use for iofwd testing:

"--rxd=128 --rxfreet=32 --rxpt=8 --rxht=8 --rxwt=0 --txd=512 --txfreet=32 --txpt=32 --txht=0 --txwt=0 --txrst=32 --txqflags=0xF01 --numa --burst=32 --mbcache=250 --total-num-mbufs=16383"

Regards,
/Bruce

* Re: [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield
  2014-05-19  8:25           ` Richardson, Bruce
@ 2014-05-19  9:30             ` Olivier MATZ
  2014-05-19  9:57               ` Richardson, Bruce
  0 siblings, 1 reply; 51+ messages in thread
From: Olivier MATZ @ 2014-05-19  9:30 UTC (permalink / raw)
  To: Richardson, Bruce, Shaw, Jeffrey B, dev

Hi Bruce,

> Can you perhaps also include the specific testpmd parameter you used in your tests, as they can have a large effect on performance. On my Sandy Bridge platform here are the testpmd flags I use for iofwd testing:
>
> "--rxd=128 --rxfreet=32 --rxpt=8 --rxht=8 --rxwt=0 --txd=512 --txfreet=32 --txpt=32 --txht=0 --txwt=0 --txrst=32 --txqflags=0xF01 --numa --burst=32 --mbcache=250 --total-num-mbufs=16383"

Sure.

Common arguments:
  -b 0000:07:00.0 -b 0000:07:00.1
  -b 0000:83:00.0 -b 0000:84:00.0
  -n 4
  -c 0x00ff00ff
  --socket-mem=2048,0
  --
  --port-numa-config=0,0,1,0,2,0,3,0
  --socket-num=0
  -i
  --burst=32
  --txd=512
  --rxd=512
  --mbcache=128
  --portmask=0xf

Simple rx/tx:
   --txqflags=0xf01

1c1t:
  --coremask=0x00000080 --rxq=1 --txq=1

1c2t:
  --coremask=0x00800080 --rxq=1 --txq=1

2c2t:
  --coremask=0x00000088 --rxq=1 --txq=1

2c4t:
  --coremask=0x00880088 --rxq=1 --txq=1

4c8t:
  --coremask=0x00cc00cc --rxq=2 --txq=2


By the way, I think the absolute performance numbers are not so
important in these tests. What is really important is to show the
relative impact of the patches.

Regards,
Olivier

* Re: [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield
  2014-05-19  9:30             ` Olivier MATZ
@ 2014-05-19  9:57               ` Richardson, Bruce
  0 siblings, 0 replies; 51+ messages in thread
From: Richardson, Bruce @ 2014-05-19  9:57 UTC (permalink / raw)
  To: Olivier MATZ, Shaw, Jeffrey B, dev

> -----Original Message-----
> From: Olivier MATZ [mailto:olivier.matz@6wind.com]
> Sent: Monday, May 19, 2014 10:31 AM
> To: Richardson, Bruce; Shaw, Jeffrey B; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in
> a bitfield
> 
> 
> By the way, I think the absolute performance numbers are not so
> important in these tests. What is really important is to show the
> relative impact of the patches.
> 

Well, yes and no. The relative performance is obviously what we are mostly looking for, but on the other hand we want to ensure that we are looking at an optimised configuration. The impact to our top-line performance is of primary interest.

/Bruce 

* Re: [dpdk-dev] [PATCH RFC 11/11] ixgbe/mbuf: add TSO support
  2014-05-16 17:01           ` Ananyev, Konstantin
@ 2014-05-19 12:32             ` Thomas Monjalon
  0 siblings, 0 replies; 51+ messages in thread
From: Thomas Monjalon @ 2014-05-19 12:32 UTC (permalink / raw)
  To: Ananyev, Konstantin, Olivier MATZ; +Cc: dev

Hi,

I'll try to sum up this interesting discussion about the checksum API for TSO.

We know at least 2 checksum methods:
- the standard one
- the special one for ixgbe TSO
In Linux ixgbe, the checksum is redone in the driver for the TSO case.

We want to compute the checksum in the application/stack in order to prevent
the driver from modifying packet data, which could cause a cache miss.
But the application cannot always know which checksum method to use, because it
doesn't have to know which driver will process the packet.
So we have to choose which checksum method can be done in the application
without driver processing. It's not an easy choice.

It seems simpler and more reasonable to choose the standard pseudo-header checksum
method, as is done in Linux.
Having a stable and generic API is something important we must target.

Thanks
-- 
Thomas

* Re: [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support
  2014-05-09 14:50 [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support Olivier Matz
                   ` (11 preceding siblings ...)
  2014-05-09 17:04 ` [dpdk-dev] [PATCH RFC 00/11] " Stephen Hemminger
@ 2014-05-19 12:47 ` Thomas Monjalon
  12 siblings, 0 replies; 51+ messages in thread
From: Thomas Monjalon @ 2014-05-19 12:47 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev

2014-05-09 16:50, Olivier Matz:
> This series add TSO support in ixgbe DPDK driver. As discussed
> previously on the list [1], one problem is that there is not enough room
> in rte_mbuf today to store the required information to implement this
> feature:
>   - a new ol_flag
>   - the MSS
>   - the L4 header len
> 
> A solution would be to increase the size of the mbuf to 2 cache lines
> but it could have a bad impact on performance. This series proposes some
> rework to drastically reduce the size of the rte_mbuf structures before
> implementing the TSO, avoiding to change the mbuf size to 128 bytes.
> 
> After the rework of mbuf structures, the size of rte_mbuf structure is
> reduced by 9 bytes. The implementation of TSO requires to double the
> size of ol_flags (16 to 32 bits) and to double the size of offload
> information in order to add the mss and the l4 header length (32 to 64
> bits). At the end of the whole series, sizeof(rte_mbuf) is still 64
> bytes and 4 bytes are available for future use.
> 
> This rework causes a lot of modifications in the mbuf structure,
> implying some changes in the applications that directly use the mbuf
> structure fields instead of using the API functions (sometimes there is
> no function). That's why this series is a RFC. In my opinion, it's the
> proper moment for this evolution as the 1.7.0 window is open.
> 
> About TSO, the new fields in mbuf try to be generic enough to apply to
> other hardware in the future. To delegate the TCP segmentation to the
> hardware, the user has to:
> 
>   - set the PKT_TX_TCP_SEG flag in mbuf->ol_flags (this flag implies
>     PKT_TX_IP_CKSUM and PKT_TX_TCP_CKSUM)
>   - fill the mbuf->hw_offload information: l2_len, l3_len, l4_len, mss
>   - calculate the pseudo header checksum and set it in the TCP header,
>     as required when doing hardware TCP checksum offload
>   - set the IP checksum to 0

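The four steps quoted above could look roughly like this from the application side. This is only a sketch against mock structures: the field names ol_flags, hw_offload, l2_len, l3_len, l4_len and mss come from the cover letter, but the struct layouts, the flag value and the helper name are hypothetical and do not match the real rte_mbuf.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical bit value; the real flag is defined by the patch series. */
#define DEMO_PKT_TX_TCP_SEG 0x1u

/* Mock of the offload information described in the cover letter. */
struct demo_hw_offload {
	uint16_t l2_len;
	uint16_t l3_len;
	uint16_t l4_len;
	uint16_t mss;
};

/* Mock mbuf: only the fields this sketch touches. */
struct demo_mbuf {
	uint32_t ol_flags;
	struct demo_hw_offload hw_offload;
};

/* Mock view of the checksum fields in the packet headers. */
struct demo_hdrs {
	uint16_t ip_cksum;
	uint16_t tcp_cksum;
};

/* Apply the four steps listed in the cover letter. */
static void demo_request_tso(struct demo_mbuf *m, struct demo_hdrs *h,
			     uint16_t l2_len, uint16_t l3_len,
			     uint16_t l4_len, uint16_t mss,
			     uint16_t phdr_cksum)
{
	m->ol_flags |= DEMO_PKT_TX_TCP_SEG;   /* 1. request segmentation */
	m->hw_offload.l2_len = l2_len;        /* 2. describe the headers */
	m->hw_offload.l3_len = l3_len;
	m->hw_offload.l4_len = l4_len;
	m->hw_offload.mss = mss;
	h->tcp_cksum = phdr_cksum;            /* 3. pseudo-header checksum */
	h->ip_cksum = 0;                      /* 4. IP checksum set to 0 */
}

int main(void)
{
	struct demo_mbuf m = { 0 };
	struct demo_hdrs h = { 0x1234, 0x5678 };

	/* Ethernet (14) + IPv4 (20) + TCP (20), MSS 1460; the checksum
	 * argument is a placeholder computed elsewhere. */
	demo_request_tso(&m, &h, 14, 20, 20, 1460, 0x816e);
	printf("ol_flags=0x%x mss=%u ip_cksum=0x%04x tcp_cksum=0x%04x\n",
	       (unsigned)m.ol_flags, (unsigned)m.hw_offload.mss,
	       (unsigned)h.ip_cksum, (unsigned)h.tcp_cksum);
	return 0;
}
```
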
This patchset has triggered a lot of interesting discussions about the mbuf API, TSO
and performance impacts.
It seems everybody agrees that having TSO support is a nice improvement, and
there are some discussions on how it can be made even better.
I feel we need a v2 from Olivier in order to allow us to send improvement
patches on top of it.

Thanks everyone
-- 
Thomas

* Re: [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf Olivier Matz
@ 2014-05-25 21:39   ` Gilmore, Walter E
  2014-05-26 12:23     ` Olivier MATZ
                       ` (2 more replies)
  2014-05-27  0:17   ` Stephen Hemminger
  1 sibling, 3 replies; 51+ messages in thread
From: Gilmore, Walter E @ 2014-05-25 21:39 UTC (permalink / raw)
  To: Olivier Matz, dev

Olivier, you're making an assumption that customer application code running on the Intel DPDK isn't using the rte_ctrlmbuf structure.
Remember there are more than 300 customers using the Intel DPDK and there is no way you can predict that this is not used by them.
The purpose of this structure is to send commands, events or any other type of information between user application tasks (normally from a manager task).
It has been there since the beginning in the original design and it's up to the user to define what is in the data field and how they wish to use it.
It's one thing to fix a bug, but to remove a structure like this because you don't see it used in other parts is asking for trouble with customers.

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
Sent: Friday, May 09, 2014 10:51 AM
To: dev@dpdk.org
Subject: [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf

The initial role of rte_ctrlmbuf is to carry generic messages (data pointer + data length) but it's not used by the DPDK or its applications.
Keeping it implies:
  - losing 1 byte in the rte_mbuf structure
  - having some dead code in rte_mbuf.[ch]

This patch removes this feature. Thanks to it, it is now possible to simplify the rte_mbuf structure by merging the rte_pktmbuf structure into it. This is done in the next commit.

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test-pmd/cmdline.c                   |   1 -
 app/test-pmd/testpmd.c                   |   2 -
 app/test-pmd/txonly.c                    |   2 +-
 app/test/commands.c                      |   1 -
 app/test/test_mbuf.c                     |  72 +------------
 examples/ipv4_multicast/main.c           |   2 +-
 lib/librte_mbuf/rte_mbuf.c               |  65 +++---------
 lib/librte_mbuf/rte_mbuf.h               | 175 ++++++-------------------------
 lib/librte_pmd_e1000/em_rxtx.c           |   2 +-
 lib/librte_pmd_e1000/igb_rxtx.c          |   2 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c        |   4 +-
 lib/librte_pmd_virtio/virtio_rxtx.c      |   2 +-
 lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c    |   2 +-
 lib/librte_pmd_xenvirt/rte_eth_xenvirt.c |   2 +-
 14 files changed, 54 insertions(+), 280 deletions(-)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 7becedc..e3d1849 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -5010,7 +5010,6 @@ dump_struct_sizes(void)
 #define DUMP_SIZE(t) printf("sizeof(" #t ") = %u\n", (unsigned)sizeof(t));
 	DUMP_SIZE(struct rte_mbuf);
 	DUMP_SIZE(struct rte_pktmbuf);
-	DUMP_SIZE(struct rte_ctrlmbuf);
 	DUMP_SIZE(struct rte_mempool);
 	DUMP_SIZE(struct rte_ring);
 #undef DUMP_SIZE
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 9c56914..76b3823 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -389,13 +389,11 @@ testpmd_mbuf_ctor(struct rte_mempool *mp,
 	mb_ctor_arg = (struct mbuf_ctor_arg *) opaque_arg;
 	mb = (struct rte_mbuf *) raw_mbuf;
 
-	mb->type         = RTE_MBUF_PKT;
 	mb->pool         = mp;
 	mb->buf_addr     = (void *) ((char *)mb + mb_ctor_arg->seg_buf_offset);
 	mb->buf_physaddr = (uint64_t) (rte_mempool_virt2phy(mp, mb) +
 			mb_ctor_arg->seg_buf_offset);
 	mb->buf_len      = mb_ctor_arg->seg_buf_size;
-	mb->type         = RTE_MBUF_PKT;
 	mb->ol_flags     = 0;
 	mb->pkt.data     = (char *) mb->buf_addr + RTE_PKTMBUF_HEADROOM;
 	mb->pkt.nb_segs  = 1;
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index 1cf2574..1f066d0 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -93,7 +93,7 @@ tx_mbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
diff --git a/app/test/commands.c b/app/test/commands.c
index b145036..c69544b 100644
--- a/app/test/commands.c
+++ b/app/test/commands.c
@@ -262,7 +262,6 @@ dump_struct_sizes(void)
 #define DUMP_SIZE(t) printf("sizeof(" #t ") = %u\n", (unsigned)sizeof(t));
 	DUMP_SIZE(struct rte_mbuf);
 	DUMP_SIZE(struct rte_pktmbuf);
-	DUMP_SIZE(struct rte_ctrlmbuf);
 	DUMP_SIZE(struct rte_mempool);
 	DUMP_SIZE(struct rte_ring);
 #undef DUMP_SIZE
diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index fe0f4f6..07b5551 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -80,7 +80,6 @@
 #define MAKE_STRING(x)          # x
 
 static struct rte_mempool *pktmbuf_pool = NULL;
-static struct rte_mempool *ctrlmbuf_pool = NULL;
 
 #if defined RTE_MBUF_REFCNT  && defined RTE_MBUF_REFCNT_ATOMIC
 
@@ -272,8 +271,8 @@ test_one_pktmbuf(void)
 		GOTO_FAIL("Buffer should be continuous");
 	memset(hdr, 0x55, MBUF_TEST_HDR2_LEN);
 
-	rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
-	rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 0);
+	rte_mbuf_sanity_check(m, 1);
+	rte_mbuf_sanity_check(m, 0);
 	rte_pktmbuf_dump(m, 0);
 
 	/* this prepend should fail */
@@ -320,48 +319,6 @@ fail:
 	return -1;
 }
 
-/*
- * test control mbuf
- */
-static int
-test_one_ctrlmbuf(void)
-{
-	struct rte_mbuf *m = NULL;
-	char message[] = "This is a message carried by a ctrlmbuf";
-
-	printf("Test ctrlmbuf API\n");
-
-	/* alloc a mbuf */
-
-	m = rte_ctrlmbuf_alloc(ctrlmbuf_pool);
-	if (m == NULL)
-		GOTO_FAIL("Cannot allocate mbuf");
-	if (rte_ctrlmbuf_len(m) != 0)
-		GOTO_FAIL("Bad length");
-
-	/* set data */
-	rte_ctrlmbuf_data(m) = &message;
-	rte_ctrlmbuf_len(m) = sizeof(message);
-
-	/* read data */
-	if (rte_ctrlmbuf_data(m) != message)
-		GOTO_FAIL("Invalid data pointer");
-	if (rte_ctrlmbuf_len(m) != sizeof(message))
-		GOTO_FAIL("Invalid len");
-
-	rte_mbuf_sanity_check(m, RTE_MBUF_CTRL, 0);
-
-	/* free mbuf */
-	rte_ctrlmbuf_free(m);
-	m = NULL;
-	return 0;
-
-fail:
-	if (m)
-		rte_ctrlmbuf_free(m);
-	return -1;
-}
-
 static int
 testclone_testupdate_testdetach(void)
 {
@@ -744,7 +701,7 @@ verify_mbuf_check_panics(struct rte_mbuf *buf)
 	pid = fork();
 
 	if (pid == 0) {
-		rte_mbuf_sanity_check(buf, RTE_MBUF_PKT, 1); /* should panic */
+		rte_mbuf_sanity_check(buf, 1); /* should panic */
 		exit(0);  /* return normally if it doesn't panic */
 	} else if (pid < 0){
 		printf("Fork Failed\n");
@@ -781,13 +738,6 @@ test_failing_mbuf_sanity_check(void)
 	}
 
 	badbuf = *buf;
-	badbuf.type = (uint8_t)-1;
-	if (verify_mbuf_check_panics(&badbuf)) {
-		printf("Error with bad-type mbuf test\n");
-		return -1;
-	}
-
-	badbuf = *buf;
 	badbuf.pool = NULL;
 	if (verify_mbuf_check_panics(&badbuf)) {
 		printf("Error with bad-pool mbuf test\n");
@@ -889,22 +839,6 @@ test_mbuf(void)
 		return -1;
 	}
 
-	/* create ctrlmbuf pool if it does not exist */
-	if (ctrlmbuf_pool == NULL) {
-		ctrlmbuf_pool =
-			rte_mempool_create("test_ctrlmbuf_pool", NB_MBUF,
-					   sizeof(struct rte_mbuf), 32, 0,
-					   NULL, NULL,
-					   rte_ctrlmbuf_init, NULL,
-					   SOCKET_ID_ANY, 0);
-	}
-
-	/* test control mbuf */
-	if (test_one_ctrlmbuf() < 0) {
-		printf("test_one_ctrlmbuf() failed\n");
-		return -1;
-	}
-
 	/* test free pktmbuf segment one by one */
 	if (test_pktmbuf_free_segment() < 0) {
 		printf("test_pktmbuf_free_segment() failed.\n");
diff --git a/examples/ipv4_multicast/main.c b/examples/ipv4_multicast/main.c
index 3bd37e4..3967d7a 100644
--- a/examples/ipv4_multicast/main.c
+++ b/examples/ipv4_multicast/main.c
@@ -343,7 +343,7 @@ mcast_out_pkt(struct rte_mbuf *pkt, int use_clone)
 
 	hdr->ol_flags = pkt->ol_flags;
 
-	__rte_mbuf_sanity_check(hdr, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(hdr, 1);
 	return (hdr);
 }
 
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index bffc2c4..b2e2f0f 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -60,32 +60,6 @@
 #include <rte_hexdump.h>
 
 /*
- * ctrlmbuf constructor, given as a callback function to
- * rte_mempool_create()
- */
-void
-rte_ctrlmbuf_init(struct rte_mempool *mp,
-		  __attribute__((unused)) void *opaque_arg,
-		  void *_m,
-		  __attribute__((unused)) unsigned i)
-{
-	struct rte_mbuf *m = _m;
-
-	memset(m, 0, mp->elt_size);
-
-	/* start of buffer is just after mbuf structure */
-	m->buf_addr = (char *)m + sizeof(struct rte_mbuf);
-	m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
-			sizeof(struct rte_mbuf);
-	m->buf_len = (uint16_t) (mp->elt_size - sizeof(struct rte_mbuf));
-
-	/* init some constant fields */
-	m->type = RTE_MBUF_CTRL;
-	m->ctrl.data = (char *)m->buf_addr;
-	m->pool = (struct rte_mempool *)mp;
-}
-
-/*
  * pktmbuf pool constructor, given as a callback function to
  * rte_mempool_create()
  */
@@ -133,7 +107,6 @@ rte_pktmbuf_init(struct rte_mempool *mp,
 	m->pkt.data = (char*) m->buf_addr + RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
 
 	/* init some constant fields */
-	m->type = RTE_MBUF_PKT;
 	m->pool = mp;
 	m->pkt.nb_segs = 1;
 	m->pkt.in_port = 0xff;
@@ -141,16 +114,13 @@ rte_pktmbuf_init(struct rte_mempool *mp,
 
 /* do some sanity checks on a mbuf: panic if it fails */
 void
-rte_mbuf_sanity_check(const struct rte_mbuf *m, enum rte_mbuf_type t,
-		      int is_header)
+rte_mbuf_sanity_check(const struct rte_mbuf *m, int is_header)
 {
 	const struct rte_mbuf *m_seg;
 	unsigned nb_segs;
 
 	if (m == NULL)
 		rte_panic("mbuf is NULL\n");
-	if (m->type != (uint8_t)t)
-		rte_panic("bad mbuf type\n");
 
 	/* generic checks */
 	if (m->pool == NULL)
@@ -166,29 +136,18 @@ rte_mbuf_sanity_check(const struct rte_mbuf *m, enum rte_mbuf_type t,
 		rte_panic("bad ref cnt\n");
 #endif
 
-	/* nothing to check for ctrl messages */
-	if (m->type == RTE_MBUF_CTRL)
+	/* nothing to check for sub-segments */
+	if (is_header == 0)
 		return;
 
-	/* check pkt consistency */
-	else if (m->type == RTE_MBUF_PKT) {
-
-		/* nothing to check for sub-segments */
-		if (is_header == 0)
-			return;
-
-		nb_segs = m->pkt.nb_segs;
-		m_seg = m;
-		while (m_seg && nb_segs != 0) {
-			m_seg = m_seg->pkt.next;
-			nb_segs --;
-		}
-		if (nb_segs != 0)
-			rte_panic("bad nb_segs\n");
-		return;
+	nb_segs = m->pkt.nb_segs;
+	m_seg = m;
+	while (m_seg && nb_segs != 0) {
+		m_seg = m_seg->pkt.next;
+		nb_segs --;
 	}
-
-	rte_panic("unknown mbuf type\n");
+	if (nb_segs != 0)
+		rte_panic("bad nb_segs\n");
 }
 
 /* dump a mbuf on console */
@@ -198,7 +157,7 @@ rte_pktmbuf_dump(const struct rte_mbuf *m, unsigned dump_len)
 	unsigned int len;
 	unsigned nb_segs;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	printf("dump mbuf at 0x%p, phys=%"PRIx64", buf_len=%u\n",
 	       m, (uint64_t)m->buf_physaddr, (unsigned)m->buf_len);
@@ -208,7 +167,7 @@ rte_pktmbuf_dump(const struct rte_mbuf *m, unsigned dump_len)
 	nb_segs = m->pkt.nb_segs;
 
 	while (m && nb_segs != 0) {
-		__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 0);
+		__rte_mbuf_sanity_check(m, 0);
 
 		printf("  segment at 0x%p, data=0x%p, data_len=%u\n",
 		       m, m->pkt.data, (unsigned)m->pkt.data_len);
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 1b1a84e..22e1ac1 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -43,18 +43,13 @@
  * buffers. The message buffers are stored in a mempool, using the
  * RTE mempool library.
  *
- * This library provide an API to allocate/free mbufs, manipulate
- * control message buffer (ctrlmbuf), which are generic message
- * buffers, and packet buffers (pktmbuf), which are used to carry
- * network packets.
+ * This library provide an API to allocate/free packet mbufs, which are
+ * used to carry network packets.
  *
  * To understand the concepts of packet buffers or mbufs, you
  * should read "TCP/IP Illustrated, Volume 2: The Implementation,
  * Addison-Wesley, 1995, ISBN 0-201-63354-X from Richard Stevens"
  * http://www.kohala.com/start/tcpipiv2.html
- *
- * The main modification of this implementation is the use of mbuf for
- * transports other than packets. mbufs can have other types.
  */
 
 #include <stdint.h>
@@ -70,15 +65,6 @@ extern "C" {
 /* deprecated feature, renamed in RTE_MBUF_REFCNT */
 #pragma GCC poison RTE_MBUF_SCATTER_GATHER
 
-/**
- * A control message buffer.
- */
-struct rte_ctrlmbuf {
-	void *data;        /**< Pointer to data. */
-	uint32_t data_len; /**< Length of data. */
-};
-
-
 /*
  * Packet Offload Features Flags. It also carry packet type information.
  * Critical resources. Both rx/tx shared these bits. Be cautious on any change
@@ -165,15 +151,7 @@ struct rte_pktmbuf {
 };
 
 /**
- * This enum indicates the mbuf type.
- */
-enum rte_mbuf_type {
-	RTE_MBUF_CTRL,  /**< Control mbuf. */
-	RTE_MBUF_PKT,   /**< Packet mbuf. */
-};
-
-/**
- * The generic rte_mbuf, containing a packet mbuf or a control mbuf.
+ * The generic rte_mbuf, containing a packet mbuf.
  */
 struct rte_mbuf {
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
@@ -196,14 +174,10 @@ struct rte_mbuf {
 #else
 	uint16_t refcnt_reserved;     /**< Do not use this field */
 #endif
-	uint8_t type;                 /**< Type of mbuf. */
-	uint8_t reserved;             /**< Unused field. Required for padding. */
+	uint16_t reserved;             /**< Unused field. Required for padding. */
 	uint16_t ol_flags;            /**< Offload features. */
 
-	union {
-		struct rte_ctrlmbuf ctrl;
-		struct rte_pktmbuf pkt;
-	};
+	struct rte_pktmbuf pkt;
 } __rte_cache_aligned;
 
 /**
@@ -241,12 +215,12 @@ struct rte_pktmbuf_pool_private {
 #ifdef RTE_LIBRTE_MBUF_DEBUG
 
 /**  check mbuf type in debug mode */
-#define __rte_mbuf_sanity_check(m, t, is_h) rte_mbuf_sanity_check(m, t, is_h)
+#define __rte_mbuf_sanity_check(m, is_h) rte_mbuf_sanity_check(m, is_h)
 
 /**  check mbuf type in debug mode if mbuf pointer is not null */
-#define __rte_mbuf_sanity_check_raw(m, t, is_h)	do {       \
+#define __rte_mbuf_sanity_check_raw(m, is_h)	do {       \
 	if ((m) != NULL)                                   \
-		rte_mbuf_sanity_check(m, t, is_h);          \
+		rte_mbuf_sanity_check(m, is_h);          \
 } while (0)
 
 /**  MBUF asserts in debug mode */
@@ -258,10 +232,10 @@ if (!(exp)) {                                                        \
 #else /*  RTE_LIBRTE_MBUF_DEBUG */
 
 /**  check mbuf type in debug mode */
-#define __rte_mbuf_sanity_check(m, t, is_h) do { } while(0)
+#define __rte_mbuf_sanity_check(m, is_h) do { } while(0)
 
 /**  check mbuf type in debug mode if mbuf pointer is not null */
-#define __rte_mbuf_sanity_check_raw(m, t, is_h) do { } while(0)
+#define __rte_mbuf_sanity_check_raw(m, is_h) do { } while(0)
 
 /**  MBUF asserts in debug mode */
 #define RTE_MBUF_ASSERT(exp)                do { } while(0)
@@ -368,20 +342,17 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
  *
  * @param m
  *   The mbuf to be checked.
- * @param t
- *   The expected type of the mbuf.
  * @param is_header
  *   True if the mbuf is a packet header, false if it is a sub-segment
  *   of a packet (in this case, some fields like nb_segs are not checked)
  */
 void
-rte_mbuf_sanity_check(const struct rte_mbuf *m, enum rte_mbuf_type t,
-		      int is_header);
+rte_mbuf_sanity_check(const struct rte_mbuf *m, int is_header);
 
 /**
  * @internal Allocate a new mbuf from mempool *mp*.
  * The use of that function is reserved for RTE internal needs.
- * Please use either rte_ctrlmbuf_alloc() or rte_pktmbuf_alloc().
+ * Please use rte_pktmbuf_alloc().
  *
  * @param mp
  *   The mempool from which mbuf is allocated.
@@ -406,7 +377,7 @@ static inline struct rte_mbuf *__rte_mbuf_raw_alloc(struct rte_mempool *mp)
 /**
  * @internal Put mbuf back into its original mempool.
  * The use of that function is reserved for RTE internal needs.
- * Please use either rte_ctrlmbuf_free() or rte_pktmbuf_free().
+ * Please use rte_pktmbuf_free().
  *
  * @param m
  *   The mbuf to be freed.
@@ -420,95 +391,11 @@ __rte_mbuf_raw_free(struct rte_mbuf *m)
 	rte_mempool_put(m->pool, m);
 }
 
-/* Operations on ctrl mbuf */
-
-/**
- * The control mbuf constructor.
- *
- * This function initializes some fields in an mbuf structure that are
- * not modified by the user once created (mbuf type, origin pool, buffer
- * start address, and so on). This function is given as a callback function
- * to rte_mempool_create() at pool creation time.
- *
- * @param mp
- *   The mempool from which the mbuf is allocated.
- * @param opaque_arg
- *   A pointer that can be used by the user to retrieve useful information
- *   for mbuf initialization. This pointer comes from the ``init_arg``
- *   parameter of rte_mempool_create().
- * @param m
- *   The mbuf to initialize.
- * @param i
- *   The index of the mbuf in the pool table.
- */
-void rte_ctrlmbuf_init(struct rte_mempool *mp, void *opaque_arg,
-		       void *m, unsigned i);
-
-/**
- * Allocate a new mbuf (type is ctrl) from mempool *mp*.
- *
- * This new mbuf is initialized with data pointing to the beginning of
- * buffer, and with a length of zero.
- *
- * @param mp
- *   The mempool from which the mbuf is allocated.
- * @return
- *   - The pointer to the new mbuf on success.
- *   - NULL if allocation failed.
- */
-static inline struct rte_mbuf *rte_ctrlmbuf_alloc(struct rte_mempool *mp)
-{
-	struct rte_mbuf *m;
-	if ((m = __rte_mbuf_raw_alloc(mp)) != NULL) {
-		m->ctrl.data = m->buf_addr;
-		m->ctrl.data_len = 0;
-		__rte_mbuf_sanity_check(m, RTE_MBUF_CTRL, 0);
-	}
-	return (m);
-}
-
-/**
- * Free a control mbuf back into its original mempool.
- *
- * @param m
- *   The control mbuf to be freed.
- */
-static inline void rte_ctrlmbuf_free(struct rte_mbuf *m)
-{
-	__rte_mbuf_sanity_check(m, RTE_MBUF_CTRL, 0);
-#ifdef RTE_MBUF_SCATTER_GATHER
-	if (rte_mbuf_refcnt_update(m, -1) == 0)
-#endif /* RTE_MBUF_SCATTER_GATHER */
-		__rte_mbuf_raw_free(m);
-}
-
-/**
- * A macro that returns the pointer to the carried data.
- *
- * The value that can be read or assigned.
- *
- * @param m
- *   The control mbuf.
- */
-#define rte_ctrlmbuf_data(m) ((m)->ctrl.data)
-
-/**
- * A macro that returns the length of the carried data.
- *
- * The value that can be read or assigned.
- *
- * @param m
- *   The control mbuf.
- */
-#define rte_ctrlmbuf_len(m) ((m)->ctrl.data_len)
-
-/* Operations on pkt mbuf */
-
 /**
  * The packet mbuf constructor.
  *
- * This function initializes some fields in the mbuf structure that are not
- * modified by the user once created (mbuf type, origin pool, buffer start
+ * This function initializes some fields in the mbuf structure that are
+ * not modified by the user once created (origin pool, buffer start
  * address, and so on). This function is given as a callback function to
  * rte_mempool_create() at pool creation time.
  *
@@ -569,11 +456,11 @@ static inline void rte_pktmbuf_reset(struct rte_mbuf *m)
 	m->pkt.data = (char*) m->buf_addr + buf_ofs;
 
 	m->pkt.data_len = 0;
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 }
 
 /**
- * Allocate a new mbuf (type is pkt) from a mempool.
+ * Allocate a new mbuf from a mempool.
  *
  * This new mbuf contains one segment, which has a length of 0. The pointer
  * to data is initialized to have some bytes of headroom in the buffer
@@ -629,8 +516,8 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *md)
 	mi->pkt.nb_segs = 1;
 	mi->ol_flags = md->ol_flags;
 
-	__rte_mbuf_sanity_check(mi, RTE_MBUF_PKT, 1);
-	__rte_mbuf_sanity_check(md, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check(mi, 1);
+	__rte_mbuf_sanity_check(md, 0);
 }
 
 /**
@@ -667,7 +554,7 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 static inline struct rte_mbuf* __attribute__((always_inline))
 __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check(m, 0);
 
 #ifdef RTE_MBUF_REFCNT
 	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
@@ -722,7 +609,7 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
 {
 	struct rte_mbuf *m_next;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	while (m != NULL) {
 		m_next = m->pkt.next;
@@ -783,7 +670,7 @@ static inline struct rte_mbuf *rte_pktmbuf_clone(struct rte_mbuf *md,
 		return (NULL);
 	}
 
-	__rte_mbuf_sanity_check(mc, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(mc, 1);
 	return (mc);
 }
 
@@ -800,7 +687,7 @@ static inline struct rte_mbuf *rte_pktmbuf_clone(struct rte_mbuf *md,
  */
 static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	do {
 		rte_mbuf_refcnt_update(m, v);
@@ -819,7 +706,7 @@ static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
  */
 static inline uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 	return (uint16_t) ((char*) m->pkt.data - (char*) m->buf_addr);
 }
 
@@ -833,7 +720,7 @@ static inline uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m)
  */
 static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 	return (uint16_t)(m->buf_len - rte_pktmbuf_headroom(m) -
 			  m->pkt.data_len);
 }
@@ -850,7 +737,7 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 {
 	struct rte_mbuf *m2 = (struct rte_mbuf *)m;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 	while (m2->pkt.next != NULL)
 		m2 = m2->pkt.next;
 	return m2;
@@ -908,7 +795,7 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 static inline char *rte_pktmbuf_prepend(struct rte_mbuf *m,
 					uint16_t len)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	if (unlikely(len > rte_pktmbuf_headroom(m)))
 		return NULL;
@@ -940,7 +827,7 @@ static inline char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
 	void *tail;
 	struct rte_mbuf *m_last;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	m_last = rte_pktmbuf_lastseg(m);
 	if (unlikely(len > rte_pktmbuf_tailroom(m_last)))
@@ -968,7 +855,7 @@ static inline char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
  */
 static inline char *rte_pktmbuf_adj(struct rte_mbuf *m, uint16_t len)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	if (unlikely(len > m->pkt.data_len))
 		return NULL;
@@ -997,7 +884,7 @@ static inline int rte_pktmbuf_trim(struct rte_mbuf *m, uint16_t len)
 {
 	struct rte_mbuf *m_last;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	m_last = rte_pktmbuf_lastseg(m);
 	if (unlikely(len > m_last->pkt.data_len))
@@ -1019,7 +906,7 @@ static inline int rte_pktmbuf_trim(struct rte_mbuf *m, uint16_t len)
  */
 static inline int rte_pktmbuf_is_contiguous(const struct rte_mbuf *m)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 	return !!(m->pkt.nb_segs == 1);
 }
 
diff --git a/lib/librte_pmd_e1000/em_rxtx.c b/lib/librte_pmd_e1000/em_rxtx.c
index 78c0c44..31f480a 100644
--- a/lib/librte_pmd_e1000/em_rxtx.c
+++ b/lib/librte_pmd_e1000/em_rxtx.c
@@ -85,7 +85,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
index b3c8149..62ff7bc 100644
--- a/lib/librte_pmd_e1000/igb_rxtx.c
+++ b/lib/librte_pmd_e1000/igb_rxtx.c
@@ -79,7 +79,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 4e307c2..76448ab 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -88,7 +88,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
@@ -987,7 +987,6 @@ ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
 		/* populate the static rte mbuf fields */
 		mb = rxep[i].mbuf;
 		rte_mbuf_refcnt_set(mb, 1);
-		mb->type = RTE_MBUF_PKT;
 		mb->pkt.next = NULL;
 		mb->pkt.data = (char *)mb->buf_addr + RTE_PKTMBUF_HEADROOM;
 		mb->pkt.nb_segs = 1;
@@ -3084,7 +3083,6 @@ ixgbe_alloc_rx_queue_mbufs(struct igb_rx_queue *rxq)
 		}
 
 		rte_mbuf_refcnt_set(mbuf, 1);
-		mbuf->type = RTE_MBUF_PKT;
 		mbuf->pkt.next = NULL;
 		mbuf->pkt.data = (char *)mbuf->buf_addr + RTE_PKTMBUF_HEADROOM;
 		mbuf->pkt.nb_segs = 1;
diff --git a/lib/librte_pmd_virtio/virtio_rxtx.c b/lib/librte_pmd_virtio/virtio_rxtx.c
index fe94a3f..0db3ba0 100644
--- a/lib/librte_pmd_virtio/virtio_rxtx.c
+++ b/lib/librte_pmd_virtio/virtio_rxtx.c
@@ -66,7 +66,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 
 	return (m);
 }
diff --git a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
index 9fdd441..d91404a 100644
--- a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
+++ b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
@@ -101,7 +101,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
diff --git a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
index 533aa76..5cd1cdb 100644
--- a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
+++ b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
@@ -80,7 +80,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 
 	return m;
 }
--
1.9.2

* Re: [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf
  2014-05-25 21:39   ` Gilmore, Walter E
@ 2014-05-26 12:23     ` Olivier MATZ
  2014-05-26 16:40     ` Dumitrescu, Cristian
  2014-05-26 22:43     ` Neil Horman
  2 siblings, 0 replies; 51+ messages in thread
From: Olivier MATZ @ 2014-05-26 12:23 UTC (permalink / raw)
  To: Gilmore, Walter E, dev

Hi Walt,

> The purpose of this structure is to send commands, events or any other type
> of information between user application tasks (normally from a manager
> task).  It has been there since the beginning in the original design and
> it's up to the user to define what is in the data field and how they
> wish to use it.  It's one thing to fix a bug but to remove a structure
> like this because you don't see it used in the other parts is asking for
> trouble with customers.

To me, there is nothing that we cannot do without this structure:
depending on the use-case, it could be replaced with the same
functionalities by:

- a packet mbuf, in this case the user pointer would be stored in
   the packet data for instance. In the worst case, I would agree to
   add a flag telling that the mbuf carries control data.

- an application private structure which would contain the pointer,
   the data len (if any), plus any other field that could be useful
   for the application. This structure can be allocated in a mempool.

- nothing! I mean: if the application only wants to carry a pointer,
   why would it need an additional structure to point to it? The
   application can just give the pointer to its private data without
   allocating a control mbuf for that.
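
As an illustration of the second option, here is a minimal sketch in plain
C of an application-private message structure. The names app_ctrl_msg,
app_msg_pool and the app_msg_* helpers are hypothetical, and the
fixed-size free list only stands in for a real mempool so that the
example compiles on its own; a DPDK application would presumably create
the pool with rte_mempool_create() instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical application-private control message; not a DPDK type. */
struct app_ctrl_msg {
	void *data;        /* pointer to application-owned payload */
	uint32_t data_len; /* payload length in bytes, 0 if unused */
	uint32_t cmd;      /* application-defined command or event id */
};

#define APP_MSG_POOL_SIZE 8

/* Tiny fixed-size free list standing in for a mempool in this sketch. */
struct app_msg_pool {
	struct app_ctrl_msg elts[APP_MSG_POOL_SIZE];
	struct app_ctrl_msg *free_list[APP_MSG_POOL_SIZE];
	unsigned nb_free;
};

/* Put every element of the pool on the free list. */
static inline void app_msg_pool_init(struct app_msg_pool *p)
{
	unsigned i;

	for (i = 0; i < APP_MSG_POOL_SIZE; i++)
		p->free_list[i] = &p->elts[i];
	p->nb_free = APP_MSG_POOL_SIZE;
}

/* Return a zeroed message, or NULL if the pool is exhausted. */
static inline struct app_ctrl_msg *app_msg_alloc(struct app_msg_pool *p)
{
	struct app_ctrl_msg *msg;

	if (p->nb_free == 0)
		return NULL;
	msg = p->free_list[--p->nb_free];
	msg->data = NULL;
	msg->data_len = 0;
	msg->cmd = 0;
	return msg;
}

/* Give a message back to the pool it was allocated from. */
static inline void app_msg_free(struct app_msg_pool *p,
				struct app_ctrl_msg *msg)
{
	assert(p->nb_free < APP_MSG_POOL_SIZE);
	p->free_list[p->nb_free++] = msg;
}
```

The sender would fill data, data_len and cmd, then hand the pointer to
the receiving task, which calls app_msg_free() once it has processed the
message.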

To be honest, I don't see in which case it is useful to have this
additional structure. This modification is motivated by a gain of
bytes in the mbuf and a rationalization of the rte_mbuf structure.
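
To make the byte accounting concrete, the sketch below compares a
simplified header before and after this patch. The hdr_before and
hdr_after structures are illustrative stand-ins that keep only three
fields; they are not the real rte_mbuf layout. In this simplified view
the total size does not change with this patch alone: the one-byte type
field is folded into a 16-bit reserved field, so that byte is no longer
tied to the mbuf type and can presumably be reused later in the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for a few mbuf fields before the patch. */
struct hdr_before {
	uint16_t refcnt;
	uint8_t type;      /* control or packet mbuf */
	uint8_t reserved;  /* padding */
	uint16_t ol_flags;
};

/* Same fields after the patch: type is gone, reserved is 16 bits wide. */
struct hdr_after {
	uint16_t refcnt;
	uint16_t reserved;
	uint16_t ol_flags;
};

/* Check that ol_flags keeps its offset and the size is unchanged. */
static inline void hdr_check_layout(void)
{
	assert(offsetof(struct hdr_before, ol_flags) ==
	       offsetof(struct hdr_after, ol_flags));
	assert(sizeof(struct hdr_before) == sizeof(struct hdr_after));
}

/* Width in bytes of the reserved field once the type byte is removed. */
static inline size_t hdr_reserved_bytes_after(void)
{
	return sizeof(((struct hdr_after *)0)->reserved);
}
```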

I can add some documentation, for instance in the commit log, about how
the rte_ctrlmbuf could be replaced by something equivalent in different
situations. Are you fine with this?

If we find use cases where rte_ctrlmbuf is required, another idea would
be to keep this structure, but make it independent of rte_mbuf.

Regards,
Olivier

* Re: [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf
  2014-05-25 21:39   ` Gilmore, Walter E
  2014-05-26 12:23     ` Olivier MATZ
@ 2014-05-26 16:40     ` Dumitrescu, Cristian
  2014-05-26 22:43     ` Neil Horman
  2 siblings, 0 replies; 51+ messages in thread
From: Dumitrescu, Cristian @ 2014-05-26 16:40 UTC (permalink / raw)
  To: Gilmore, Walter E, Olivier Matz, dev

I am also using the rte_ctrlmbuf to send messages between cores in one of the Packet Framework sample apps that I am going to send as a patch tomorrow or later this week.

Removing rte_ctrlmbuf would require additional rework to this (complex) sample app. It can be done, but it is additional work very close to the code freeze cycle.

Thanks,
Cristian

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Gilmore, Walter E
Sent: Sunday, May 25, 2014 10:39 PM
To: Olivier Matz; dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf

Olivier, you're making an assumption that customer application code running on the Intel DPDK isn't using the rte_ctrlmbuf structure.
Remember there are more than 300 customers using the Intel DPDK and there is no way you can predict that this is not used by them.
The purpose of this structure is to send commands, events or any other type of information between user application tasks (normally from a manager task).
It has been there since the beginning in the original design and it's up to the user to define what is in the data field and how they wish to use it.
It's one thing to fix a bug but to remove a structure like this because you don't see it used in the other parts is asking for trouble with customers.

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
Sent: Friday, May 09, 2014 10:51 AM
To: dev@dpdk.org
Subject: [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf

The initial role of rte_ctrlmbuf is to carry generic messages (data pointer + data length) but it's not used by the DPDK or its applications.
Keeping it implies:
  - losing 1 byte in the rte_mbuf structure
  - having some dead code in rte_mbuf.[ch]

This patch removes this feature. Thanks to it, it is now possible to simplify the rte_mbuf structure by merging the rte_pktmbuf structure into it. This is done in the next commit.

Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test-pmd/cmdline.c                   |   1 -
 app/test-pmd/testpmd.c                   |   2 -
 app/test-pmd/txonly.c                    |   2 +-
 app/test/commands.c                      |   1 -
 app/test/test_mbuf.c                     |  72 +------------
 examples/ipv4_multicast/main.c           |   2 +-
 lib/librte_mbuf/rte_mbuf.c               |  65 +++---------
 lib/librte_mbuf/rte_mbuf.h               | 175 ++++++-------------------------
 lib/librte_pmd_e1000/em_rxtx.c           |   2 +-
 lib/librte_pmd_e1000/igb_rxtx.c          |   2 +-
 lib/librte_pmd_ixgbe/ixgbe_rxtx.c        |   4 +-
 lib/librte_pmd_virtio/virtio_rxtx.c      |   2 +-
 lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c    |   2 +-
 lib/librte_pmd_xenvirt/rte_eth_xenvirt.c |   2 +-
 14 files changed, 54 insertions(+), 280 deletions(-)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 7becedc..e3d1849 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -5010,7 +5010,6 @@ dump_struct_sizes(void)
 #define DUMP_SIZE(t) printf("sizeof(" #t ") = %u\n", (unsigned)sizeof(t));
 	DUMP_SIZE(struct rte_mbuf);
 	DUMP_SIZE(struct rte_pktmbuf);
-	DUMP_SIZE(struct rte_ctrlmbuf);
 	DUMP_SIZE(struct rte_mempool);
 	DUMP_SIZE(struct rte_ring);
 #undef DUMP_SIZE
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 9c56914..76b3823 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -389,13 +389,11 @@ testpmd_mbuf_ctor(struct rte_mempool *mp,
 	mb_ctor_arg = (struct mbuf_ctor_arg *) opaque_arg;
 	mb = (struct rte_mbuf *) raw_mbuf;
 
-	mb->type         = RTE_MBUF_PKT;
 	mb->pool         = mp;
 	mb->buf_addr     = (void *) ((char *)mb + mb_ctor_arg->seg_buf_offset);
 	mb->buf_physaddr = (uint64_t) (rte_mempool_virt2phy(mp, mb) +
 			mb_ctor_arg->seg_buf_offset);
 	mb->buf_len      = mb_ctor_arg->seg_buf_size;
-	mb->type         = RTE_MBUF_PKT;
 	mb->ol_flags     = 0;
 	mb->pkt.data     = (char *) mb->buf_addr + RTE_PKTMBUF_HEADROOM;
 	mb->pkt.nb_segs  = 1;
diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
index 1cf2574..1f066d0 100644
--- a/app/test-pmd/txonly.c
+++ b/app/test-pmd/txonly.c
@@ -93,7 +93,7 @@ tx_mbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
diff --git a/app/test/commands.c b/app/test/commands.c
index b145036..c69544b 100644
--- a/app/test/commands.c
+++ b/app/test/commands.c
@@ -262,7 +262,6 @@ dump_struct_sizes(void)
 #define DUMP_SIZE(t) printf("sizeof(" #t ") = %u\n", (unsigned)sizeof(t));
 	DUMP_SIZE(struct rte_mbuf);
 	DUMP_SIZE(struct rte_pktmbuf);
-	DUMP_SIZE(struct rte_ctrlmbuf);
 	DUMP_SIZE(struct rte_mempool);
 	DUMP_SIZE(struct rte_ring);
 #undef DUMP_SIZE
diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
index fe0f4f6..07b5551 100644
--- a/app/test/test_mbuf.c
+++ b/app/test/test_mbuf.c
@@ -80,7 +80,6 @@
 #define MAKE_STRING(x)          # x
 
 static struct rte_mempool *pktmbuf_pool = NULL;
-static struct rte_mempool *ctrlmbuf_pool = NULL;
 
 #if defined RTE_MBUF_REFCNT  && defined RTE_MBUF_REFCNT_ATOMIC
 
@@ -272,8 +271,8 @@ test_one_pktmbuf(void)
 		GOTO_FAIL("Buffer should be continuous");
 	memset(hdr, 0x55, MBUF_TEST_HDR2_LEN);
 
-	rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
-	rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 0);
+	rte_mbuf_sanity_check(m, 1);
+	rte_mbuf_sanity_check(m, 0);
 	rte_pktmbuf_dump(m, 0);
 
 	/* this prepend should fail */
@@ -320,48 +319,6 @@ fail:
 	return -1;
 }
 
-/*
- * test control mbuf
- */
-static int
-test_one_ctrlmbuf(void)
-{
-	struct rte_mbuf *m = NULL;
-	char message[] = "This is a message carried by a ctrlmbuf";
-
-	printf("Test ctrlmbuf API\n");
-
-	/* alloc a mbuf */
-
-	m = rte_ctrlmbuf_alloc(ctrlmbuf_pool);
-	if (m == NULL)
-		GOTO_FAIL("Cannot allocate mbuf");
-	if (rte_ctrlmbuf_len(m) != 0)
-		GOTO_FAIL("Bad length");
-
-	/* set data */
-	rte_ctrlmbuf_data(m) = &message;
-	rte_ctrlmbuf_len(m) = sizeof(message);
-
-	/* read data */
-	if (rte_ctrlmbuf_data(m) != message)
-		GOTO_FAIL("Invalid data pointer");
-	if (rte_ctrlmbuf_len(m) != sizeof(message))
-		GOTO_FAIL("Invalid len");
-
-	rte_mbuf_sanity_check(m, RTE_MBUF_CTRL, 0);
-
-	/* free mbuf */
-	rte_ctrlmbuf_free(m);
-	m = NULL;
-	return 0;
-
-fail:
-	if (m)
-		rte_ctrlmbuf_free(m);
-	return -1;
-}
-
 static int
 testclone_testupdate_testdetach(void)
 {
@@ -744,7 +701,7 @@ verify_mbuf_check_panics(struct rte_mbuf *buf)
 	pid = fork();
 
 	if (pid == 0) {
-		rte_mbuf_sanity_check(buf, RTE_MBUF_PKT, 1); /* should panic */
+		rte_mbuf_sanity_check(buf, 1); /* should panic */
 		exit(0);  /* return normally if it doesn't panic */
 	} else if (pid < 0){
 		printf("Fork Failed\n");
@@ -781,13 +738,6 @@ test_failing_mbuf_sanity_check(void)
 	}
 
 	badbuf = *buf;
-	badbuf.type = (uint8_t)-1;
-	if (verify_mbuf_check_panics(&badbuf)) {
-		printf("Error with bad-type mbuf test\n");
-		return -1;
-	}
-
-	badbuf = *buf;
 	badbuf.pool = NULL;
 	if (verify_mbuf_check_panics(&badbuf)) {
 		printf("Error with bad-pool mbuf test\n");
@@ -889,22 +839,6 @@ test_mbuf(void)
 		return -1;
 	}
 
-	/* create ctrlmbuf pool if it does not exist */
-	if (ctrlmbuf_pool == NULL) {
-		ctrlmbuf_pool =
-			rte_mempool_create("test_ctrlmbuf_pool", NB_MBUF,
-					   sizeof(struct rte_mbuf), 32, 0,
-					   NULL, NULL,
-					   rte_ctrlmbuf_init, NULL,
-					   SOCKET_ID_ANY, 0);
-	}
-
-	/* test control mbuf */
-	if (test_one_ctrlmbuf() < 0) {
-		printf("test_one_ctrlmbuf() failed\n");
-		return -1;
-	}
-
 	/* test free pktmbuf segment one by one */
 	if (test_pktmbuf_free_segment() < 0) {
 		printf("test_pktmbuf_free_segment() failed.\n");
diff --git a/examples/ipv4_multicast/main.c b/examples/ipv4_multicast/main.c
index 3bd37e4..3967d7a 100644
--- a/examples/ipv4_multicast/main.c
+++ b/examples/ipv4_multicast/main.c
@@ -343,7 +343,7 @@ mcast_out_pkt(struct rte_mbuf *pkt, int use_clone)
 
 	hdr->ol_flags = pkt->ol_flags;
 
-	__rte_mbuf_sanity_check(hdr, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(hdr, 1);
 	return (hdr);
 }
 
diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
index bffc2c4..b2e2f0f 100644
--- a/lib/librte_mbuf/rte_mbuf.c
+++ b/lib/librte_mbuf/rte_mbuf.c
@@ -60,32 +60,6 @@
 #include <rte_hexdump.h>
 
 /*
- * ctrlmbuf constructor, given as a callback function to
- * rte_mempool_create()
- */
-void
-rte_ctrlmbuf_init(struct rte_mempool *mp,
-		  __attribute__((unused)) void *opaque_arg,
-		  void *_m,
-		  __attribute__((unused)) unsigned i)
-{
-	struct rte_mbuf *m = _m;
-
-	memset(m, 0, mp->elt_size);
-
-	/* start of buffer is just after mbuf structure */
-	m->buf_addr = (char *)m + sizeof(struct rte_mbuf);
-	m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
-			sizeof(struct rte_mbuf);
-	m->buf_len = (uint16_t) (mp->elt_size - sizeof(struct rte_mbuf));
-
-	/* init some constant fields */
-	m->type = RTE_MBUF_CTRL;
-	m->ctrl.data = (char *)m->buf_addr;
-	m->pool = (struct rte_mempool *)mp;
-}
-
-/*
  * pktmbuf pool constructor, given as a callback function to
  * rte_mempool_create()
  */
@@ -133,7 +107,6 @@ rte_pktmbuf_init(struct rte_mempool *mp,
 	m->pkt.data = (char*) m->buf_addr + RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
 
 	/* init some constant fields */
-	m->type = RTE_MBUF_PKT;
 	m->pool = mp;
 	m->pkt.nb_segs = 1;
 	m->pkt.in_port = 0xff;
@@ -141,16 +114,13 @@ rte_pktmbuf_init(struct rte_mempool *mp,
 
 /* do some sanity checks on a mbuf: panic if it fails */
 void
-rte_mbuf_sanity_check(const struct rte_mbuf *m, enum rte_mbuf_type t,
-		      int is_header)
+rte_mbuf_sanity_check(const struct rte_mbuf *m, int is_header)
 {
 	const struct rte_mbuf *m_seg;
 	unsigned nb_segs;
 
 	if (m == NULL)
 		rte_panic("mbuf is NULL\n");
-	if (m->type != (uint8_t)t)
-		rte_panic("bad mbuf type\n");
 
 	/* generic checks */
 	if (m->pool == NULL)
@@ -166,29 +136,18 @@ rte_mbuf_sanity_check(const struct rte_mbuf *m, enum rte_mbuf_type t,
 		rte_panic("bad ref cnt\n");
 #endif
 
-	/* nothing to check for ctrl messages */
-	if (m->type == RTE_MBUF_CTRL)
+	/* nothing to check for sub-segments */
+	if (is_header == 0)
 		return;
 
-	/* check pkt consistency */
-	else if (m->type == RTE_MBUF_PKT) {
-
-		/* nothing to check for sub-segments */
-		if (is_header == 0)
-			return;
-
-		nb_segs = m->pkt.nb_segs;
-		m_seg = m;
-		while (m_seg && nb_segs != 0) {
-			m_seg = m_seg->pkt.next;
-			nb_segs --;
-		}
-		if (nb_segs != 0)
-			rte_panic("bad nb_segs\n");
-		return;
+	nb_segs = m->pkt.nb_segs;
+	m_seg = m;
+	while (m_seg && nb_segs != 0) {
+		m_seg = m_seg->pkt.next;
+		nb_segs --;
 	}
-
-	rte_panic("unknown mbuf type\n");
+	if (nb_segs != 0)
+		rte_panic("bad nb_segs\n");
 }
 
 /* dump a mbuf on console */
@@ -198,7 +157,7 @@ rte_pktmbuf_dump(const struct rte_mbuf *m, unsigned dump_len)
 	unsigned int len;
 	unsigned nb_segs;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	printf("dump mbuf at 0x%p, phys=%"PRIx64", buf_len=%u\n",
 	       m, (uint64_t)m->buf_physaddr, (unsigned)m->buf_len);
@@ -208,7 +167,7 @@ rte_pktmbuf_dump(const struct rte_mbuf *m, unsigned dump_len)
 	nb_segs = m->pkt.nb_segs;
 
 	while (m && nb_segs != 0) {
-		__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 0);
+		__rte_mbuf_sanity_check(m, 0);
 
 		printf("  segment at 0x%p, data=0x%p, data_len=%u\n",
 		       m, m->pkt.data, (unsigned)m->pkt.data_len);
diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
index 1b1a84e..22e1ac1 100644
--- a/lib/librte_mbuf/rte_mbuf.h
+++ b/lib/librte_mbuf/rte_mbuf.h
@@ -43,18 +43,13 @@
  * buffers. The message buffers are stored in a mempool, using the
  * RTE mempool library.
  *
- * This library provide an API to allocate/free mbufs, manipulate
- * control message buffer (ctrlmbuf), which are generic message
- * buffers, and packet buffers (pktmbuf), which are used to carry
- * network packets.
+ * This library provide an API to allocate/free packet mbufs, which are
+ * used to carry network packets.
  *
  * To understand the concepts of packet buffers or mbufs, you
  * should read "TCP/IP Illustrated, Volume 2: The Implementation,
  * Addison-Wesley, 1995, ISBN 0-201-63354-X from Richard Stevens"
  * http://www.kohala.com/start/tcpipiv2.html
- *
- * The main modification of this implementation is the use of mbuf for
- * transports other than packets. mbufs can have other types.
  */
 
 #include <stdint.h>
@@ -70,15 +65,6 @@ extern "C" {
 /* deprecated feature, renamed in RTE_MBUF_REFCNT */
 #pragma GCC poison RTE_MBUF_SCATTER_GATHER
 
-/**
- * A control message buffer.
- */
-struct rte_ctrlmbuf {
-	void *data;        /**< Pointer to data. */
-	uint32_t data_len; /**< Length of data. */
-};
-
-
 /*
  * Packet Offload Features Flags. It also carry packet type information.
  * Critical resources. Both rx/tx shared these bits. Be cautious on any change
@@ -165,15 +151,7 @@ struct rte_pktmbuf {
 };
 
 /**
- * This enum indicates the mbuf type.
- */
-enum rte_mbuf_type {
-	RTE_MBUF_CTRL,  /**< Control mbuf. */
-	RTE_MBUF_PKT,   /**< Packet mbuf. */
-};
-
-/**
- * The generic rte_mbuf, containing a packet mbuf or a control mbuf.
+ * The generic rte_mbuf, containing a packet mbuf.
  */
 struct rte_mbuf {
 	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
@@ -196,14 +174,10 @@ struct rte_mbuf {
 #else
 	uint16_t refcnt_reserved;     /**< Do not use this field */
 #endif
-	uint8_t type;                 /**< Type of mbuf. */
-	uint8_t reserved;             /**< Unused field. Required for padding. */
+	uint16_t reserved;             /**< Unused field. Required for padding. */
 	uint16_t ol_flags;            /**< Offload features. */
 
-	union {
-		struct rte_ctrlmbuf ctrl;
-		struct rte_pktmbuf pkt;
-	};
+	struct rte_pktmbuf pkt;
 } __rte_cache_aligned;
 
 /**
@@ -241,12 +215,12 @@ struct rte_pktmbuf_pool_private {
 #ifdef RTE_LIBRTE_MBUF_DEBUG
 
 /**  check mbuf type in debug mode */
-#define __rte_mbuf_sanity_check(m, t, is_h) rte_mbuf_sanity_check(m, t, is_h)
+#define __rte_mbuf_sanity_check(m, is_h) rte_mbuf_sanity_check(m, is_h)
 
 /**  check mbuf type in debug mode if mbuf pointer is not null */
-#define __rte_mbuf_sanity_check_raw(m, t, is_h)	do {       \
+#define __rte_mbuf_sanity_check_raw(m, is_h)	do {       \
 	if ((m) != NULL)                                   \
-		rte_mbuf_sanity_check(m, t, is_h);          \
+		rte_mbuf_sanity_check(m, is_h);          \
 } while (0)
 
 /**  MBUF asserts in debug mode */
@@ -258,10 +232,10 @@ if (!(exp)) {                                                        \
 #else /*  RTE_LIBRTE_MBUF_DEBUG */
 
 /**  check mbuf type in debug mode */
-#define __rte_mbuf_sanity_check(m, t, is_h) do { } while(0)
+#define __rte_mbuf_sanity_check(m, is_h) do { } while(0)
 
 /**  check mbuf type in debug mode if mbuf pointer is not null */
-#define __rte_mbuf_sanity_check_raw(m, t, is_h) do { } while(0)
+#define __rte_mbuf_sanity_check_raw(m, is_h) do { } while(0)
 
 /**  MBUF asserts in debug mode */
 #define RTE_MBUF_ASSERT(exp)                do { } while(0)
@@ -368,20 +342,17 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
  *
  * @param m
  *   The mbuf to be checked.
- * @param t
- *   The expected type of the mbuf.
  * @param is_header
  *   True if the mbuf is a packet header, false if it is a sub-segment
  *   of a packet (in this case, some fields like nb_segs are not checked)
  */
 void
-rte_mbuf_sanity_check(const struct rte_mbuf *m, enum rte_mbuf_type t,
-		      int is_header);
+rte_mbuf_sanity_check(const struct rte_mbuf *m, int is_header);
 
 /**
  * @internal Allocate a new mbuf from mempool *mp*.
  * The use of that function is reserved for RTE internal needs.
- * Please use either rte_ctrlmbuf_alloc() or rte_pktmbuf_alloc().
+ * Please use rte_pktmbuf_alloc().
  *
  * @param mp
  *   The mempool from which mbuf is allocated.
@@ -406,7 +377,7 @@ static inline struct rte_mbuf *__rte_mbuf_raw_alloc(struct rte_mempool *mp)
 /**
  * @internal Put mbuf back into its original mempool.
  * The use of that function is reserved for RTE internal needs.
- * Please use either rte_ctrlmbuf_free() or rte_pktmbuf_free().
+ * Please use rte_pktmbuf_free().
  *
  * @param m
  *   The mbuf to be freed.
@@ -420,95 +391,11 @@ __rte_mbuf_raw_free(struct rte_mbuf *m)
 	rte_mempool_put(m->pool, m);
 }
 
-/* Operations on ctrl mbuf */
-
-/**
- * The control mbuf constructor.
- *
- * This function initializes some fields in an mbuf structure that are
- * not modified by the user once created (mbuf type, origin pool, buffer
- * start address, and so on). This function is given as a callback function
- * to rte_mempool_create() at pool creation time.
- *
- * @param mp
- *   The mempool from which the mbuf is allocated.
- * @param opaque_arg
- *   A pointer that can be used by the user to retrieve useful information
- *   for mbuf initialization. This pointer comes from the ``init_arg``
- *   parameter of rte_mempool_create().
- * @param m
- *   The mbuf to initialize.
- * @param i
- *   The index of the mbuf in the pool table.
- */
-void rte_ctrlmbuf_init(struct rte_mempool *mp, void *opaque_arg,
-		       void *m, unsigned i);
-
-/**
- * Allocate a new mbuf (type is ctrl) from mempool *mp*.
- *
- * This new mbuf is initialized with data pointing to the beginning of
- * buffer, and with a length of zero.
- *
- * @param mp
- *   The mempool from which the mbuf is allocated.
- * @return
- *   - The pointer to the new mbuf on success.
- *   - NULL if allocation failed.
- */
-static inline struct rte_mbuf *rte_ctrlmbuf_alloc(struct rte_mempool *mp)
-{
-	struct rte_mbuf *m;
-	if ((m = __rte_mbuf_raw_alloc(mp)) != NULL) {
-		m->ctrl.data = m->buf_addr;
-		m->ctrl.data_len = 0;
-		__rte_mbuf_sanity_check(m, RTE_MBUF_CTRL, 0);
-	}
-	return (m);
-}
-
-/**
- * Free a control mbuf back into its original mempool.
- *
- * @param m
- *   The control mbuf to be freed.
- */
-static inline void rte_ctrlmbuf_free(struct rte_mbuf *m)
-{
-	__rte_mbuf_sanity_check(m, RTE_MBUF_CTRL, 0);
-#ifdef RTE_MBUF_SCATTER_GATHER
-	if (rte_mbuf_refcnt_update(m, -1) == 0)
-#endif /* RTE_MBUF_SCATTER_GATHER */
-		__rte_mbuf_raw_free(m);
-}
-
-/**
- * A macro that returns the pointer to the carried data.
- *
- * The value that can be read or assigned.
- *
- * @param m
- *   The control mbuf.
- */
-#define rte_ctrlmbuf_data(m) ((m)->ctrl.data)
-
-/**
- * A macro that returns the length of the carried data.
- *
- * The value that can be read or assigned.
- *
- * @param m
- *   The control mbuf.
- */
-#define rte_ctrlmbuf_len(m) ((m)->ctrl.data_len)
-
-/* Operations on pkt mbuf */
-
 /**
  * The packet mbuf constructor.
  *
- * This function initializes some fields in the mbuf structure that are not
- * modified by the user once created (mbuf type, origin pool, buffer start
+ * This function initializes some fields in the mbuf structure that are
+ * not modified by the user once created (origin pool, buffer start
  * address, and so on). This function is given as a callback function to
  * rte_mempool_create() at pool creation time.
  *
@@ -569,11 +456,11 @@ static inline void rte_pktmbuf_reset(struct rte_mbuf *m)
 	m->pkt.data = (char*) m->buf_addr + buf_ofs;
 
 	m->pkt.data_len = 0;
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 }
 
 /**
- * Allocate a new mbuf (type is pkt) from a mempool.
+ * Allocate a new mbuf from a mempool.
  *
  * This new mbuf contains one segment, which has a length of 0. The pointer
  * to data is initialized to have some bytes of headroom in the buffer
@@ -629,8 +516,8 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *md)
 	mi->pkt.nb_segs = 1;
 	mi->ol_flags = md->ol_flags;
 
-	__rte_mbuf_sanity_check(mi, RTE_MBUF_PKT, 1);
-	__rte_mbuf_sanity_check(md, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check(mi, 1);
+	__rte_mbuf_sanity_check(md, 0);
 }
 
 /**
@@ -667,7 +554,7 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
 static inline struct rte_mbuf* __attribute__((always_inline))
 __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check(m, 0);
 
 #ifdef RTE_MBUF_REFCNT
 	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
@@ -722,7 +609,7 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
 {
 	struct rte_mbuf *m_next;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	while (m != NULL) {
 		m_next = m->pkt.next;
@@ -783,7 +670,7 @@ static inline struct rte_mbuf *rte_pktmbuf_clone(struct rte_mbuf *md,
 		return (NULL);
 	}
 
-	__rte_mbuf_sanity_check(mc, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(mc, 1);
 	return (mc);
 }
 
@@ -800,7 +687,7 @@ static inline struct rte_mbuf *rte_pktmbuf_clone(struct rte_mbuf *md,
  */
 static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	do {
 		rte_mbuf_refcnt_update(m, v);
@@ -819,7 +706,7 @@ static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
  */
 static inline uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 	return (uint16_t) ((char*) m->pkt.data - (char*) m->buf_addr);
 }
 
@@ -833,7 +720,7 @@ static inline uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m)
  */
 static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 	return (uint16_t)(m->buf_len - rte_pktmbuf_headroom(m) -
 			  m->pkt.data_len);
 }
@@ -850,7 +737,7 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 {
 	struct rte_mbuf *m2 = (struct rte_mbuf *)m;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 	while (m2->pkt.next != NULL)
 		m2 = m2->pkt.next;
 	return m2;
@@ -908,7 +795,7 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
 static inline char *rte_pktmbuf_prepend(struct rte_mbuf *m,
 					uint16_t len)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	if (unlikely(len > rte_pktmbuf_headroom(m)))
 		return NULL;
@@ -940,7 +827,7 @@ static inline char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
 	void *tail;
 	struct rte_mbuf *m_last;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	m_last = rte_pktmbuf_lastseg(m);
 	if (unlikely(len > rte_pktmbuf_tailroom(m_last)))
@@ -968,7 +855,7 @@ static inline char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
  */
 static inline char *rte_pktmbuf_adj(struct rte_mbuf *m, uint16_t len)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	if (unlikely(len > m->pkt.data_len))
 		return NULL;
@@ -997,7 +884,7 @@ static inline int rte_pktmbuf_trim(struct rte_mbuf *m, uint16_t len)
 {
 	struct rte_mbuf *m_last;
 
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 
 	m_last = rte_pktmbuf_lastseg(m);
 	if (unlikely(len > m_last->pkt.data_len))
@@ -1019,7 +906,7 @@ static inline int rte_pktmbuf_trim(struct rte_mbuf *m, uint16_t len)
  */
 static inline int rte_pktmbuf_is_contiguous(const struct rte_mbuf *m)
 {
-	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
+	__rte_mbuf_sanity_check(m, 1);
 	return !!(m->pkt.nb_segs == 1);
 }
 
diff --git a/lib/librte_pmd_e1000/em_rxtx.c b/lib/librte_pmd_e1000/em_rxtx.c
index 78c0c44..31f480a 100644
--- a/lib/librte_pmd_e1000/em_rxtx.c
+++ b/lib/librte_pmd_e1000/em_rxtx.c
@@ -85,7 +85,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
index b3c8149..62ff7bc 100644
--- a/lib/librte_pmd_e1000/igb_rxtx.c
+++ b/lib/librte_pmd_e1000/igb_rxtx.c
@@ -79,7 +79,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
index 4e307c2..76448ab 100644
--- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
+++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
@@ -88,7 +88,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
@@ -987,7 +987,6 @@ ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
 		/* populate the static rte mbuf fields */
 		mb = rxep[i].mbuf;
 		rte_mbuf_refcnt_set(mb, 1);
-		mb->type = RTE_MBUF_PKT;
 		mb->pkt.next = NULL;
 		mb->pkt.data = (char *)mb->buf_addr + RTE_PKTMBUF_HEADROOM;
 		mb->pkt.nb_segs = 1;
@@ -3084,7 +3083,6 @@ ixgbe_alloc_rx_queue_mbufs(struct igb_rx_queue *rxq)
 		}
 
 		rte_mbuf_refcnt_set(mbuf, 1);
-		mbuf->type = RTE_MBUF_PKT;
 		mbuf->pkt.next = NULL;
 		mbuf->pkt.data = (char *)mbuf->buf_addr + RTE_PKTMBUF_HEADROOM;
 		mbuf->pkt.nb_segs = 1;
diff --git a/lib/librte_pmd_virtio/virtio_rxtx.c b/lib/librte_pmd_virtio/virtio_rxtx.c
index fe94a3f..0db3ba0 100644
--- a/lib/librte_pmd_virtio/virtio_rxtx.c
+++ b/lib/librte_pmd_virtio/virtio_rxtx.c
@@ -66,7 +66,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 
 	return (m);
 }
diff --git a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
index 9fdd441..d91404a 100644
--- a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
+++ b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
@@ -101,7 +101,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 	return (m);
 }
 
diff --git a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
index 533aa76..5cd1cdb 100644
--- a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
+++ b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
@@ -80,7 +80,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
 	struct rte_mbuf *m;
 
 	m = __rte_mbuf_raw_alloc(mp);
-	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
+	__rte_mbuf_sanity_check_raw(m, 0);
 
 	return m;
 }
--
1.9.2



* Re: [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf
  2014-05-25 21:39   ` Gilmore, Walter E
  2014-05-26 12:23     ` Olivier MATZ
  2014-05-26 16:40     ` Dumitrescu, Cristian
@ 2014-05-26 22:43     ` Neil Horman
  2 siblings, 0 replies; 51+ messages in thread
From: Neil Horman @ 2014-05-26 22:43 UTC (permalink / raw)
  To: Gilmore, Walter E; +Cc: dev

On Sun, May 25, 2014 at 09:39:22PM +0000, Gilmore, Walter E wrote:
> Olivier you're making an assumption that customer application code running on the Intel DPDK isn't using the rte_ctrlmbuf structure. 
> Remember there are more than 300 customers using the Intel DPDK and there is no way you can predict that this is not used by them. 
> The purpose of this structure is to send commands, events or any other type of information between user application tasks (normally from a manager task).
> It has been there since the beginning in the original design and it's up to the user to define what is in the data field and how they wish to use it. 
> It's one thing to fix a bug but to remove a structure like this because you don't see it use in the other parts is asking for trouble with customers. 
> 

Not to rub salt in this, but I'd like to point out here that this strikes me as
a case of wanting cake and eating it too.  This community seems adamant against
the notion of having a fixed API for the dpdk project, yet fractures the moment
anyone tries to change something that they, or someone they are working with,
might be using.

If you want to make sure that use cases outside the scope of the project itself
stay usable, stabilize the API, and mark it with a version.  If you do this,
then you can change the API, and mark it with a new version in the link stage,
and just focus on maintaining backward compatibility.

Neil
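
For illustration, the compatibility approach suggested above could be sketched in plain C as follows. Every name here (mbuf_check_v1, mbuf_check_v2, struct mbuf) is a hypothetical stand-in and not a DPDK symbol, and the checks return an error code instead of panicking the way rte_mbuf_sanity_check() does. With GNU symbol versioning, both functions would be exported under one public name with different version tags via a linker version script; that step is left out so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the real rte_mbuf definitions. */
struct mbuf { void *pool; unsigned nb_segs; };
enum mbuf_type { MBUF_CTRL, MBUF_PKT };

/* v2: the signature after the patch, without the type argument.
 * Returns 0 if the mbuf looks sane, -1 otherwise. */
int
mbuf_check_v2(const struct mbuf *m, int is_header)
{
	if (m == NULL || m->pool == NULL)
		return -1;
	if (is_header && m->nb_segs == 0)
		return -1;
	return 0;
}

/* v1: the old three-argument signature, kept so existing callers
 * keep building and behaving as before. */
int
mbuf_check_v1(const struct mbuf *m, enum mbuf_type t, int is_header)
{
	/* The type is still validated as an argument, but it is no
	 * longer compared against a field stored in the mbuf. */
	assert(t == MBUF_CTRL || t == MBUF_PKT);
	return mbuf_check_v2(m, is_header);
}
```

Old callers keep using the v1 entry point while new code moves to v2, so only the wrapper has to be maintained when the structure changes.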

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Olivier Matz
> Sent: Friday, May 09, 2014 10:51 AM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf
> 
> The initial role of rte_ctrlmbuf is to carry generic messages (data pointer + data length) but it's not used by the DPDK or it applications.
> Keeping it implies:
>   - loosing 1 byte in the rte_mbuf structure
>   - having some dead code rte_mbuf.[ch]
> 
> This patch removes this feature. Thanks to it, it is now possible to simplify the rte_mbuf structure by merging the rte_pktmbuf structure in it. This is done in next commit.
> 
> Signed-off-by: Olivier Matz <olivier.matz@6wind.com>
> ---
>  app/test-pmd/cmdline.c                   |   1 -
>  app/test-pmd/testpmd.c                   |   2 -
>  app/test-pmd/txonly.c                    |   2 +-
>  app/test/commands.c                      |   1 -
>  app/test/test_mbuf.c                     |  72 +------------
>  examples/ipv4_multicast/main.c           |   2 +-
>  lib/librte_mbuf/rte_mbuf.c               |  65 +++---------
>  lib/librte_mbuf/rte_mbuf.h               | 175 ++++++-------------------------
>  lib/librte_pmd_e1000/em_rxtx.c           |   2 +-
>  lib/librte_pmd_e1000/igb_rxtx.c          |   2 +-
>  lib/librte_pmd_ixgbe/ixgbe_rxtx.c        |   4 +-
>  lib/librte_pmd_virtio/virtio_rxtx.c      |   2 +-
>  lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c    |   2 +-
>  lib/librte_pmd_xenvirt/rte_eth_xenvirt.c |   2 +-
>  14 files changed, 54 insertions(+), 280 deletions(-)
> 
> diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
> index 7becedc..e3d1849 100644
> --- a/app/test-pmd/cmdline.c
> +++ b/app/test-pmd/cmdline.c
> @@ -5010,7 +5010,6 @@ dump_struct_sizes(void)
>  #define DUMP_SIZE(t) printf("sizeof(" #t ") = %u\n", (unsigned)sizeof(t));
>  	DUMP_SIZE(struct rte_mbuf);
>  	DUMP_SIZE(struct rte_pktmbuf);
> -	DUMP_SIZE(struct rte_ctrlmbuf);
>  	DUMP_SIZE(struct rte_mempool);
>  	DUMP_SIZE(struct rte_ring);
>  #undef DUMP_SIZE
> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> index 9c56914..76b3823 100644
> --- a/app/test-pmd/testpmd.c
> +++ b/app/test-pmd/testpmd.c
> @@ -389,13 +389,11 @@ testpmd_mbuf_ctor(struct rte_mempool *mp,
>  	mb_ctor_arg = (struct mbuf_ctor_arg *) opaque_arg;
>  	mb = (struct rte_mbuf *) raw_mbuf;
>  
> -	mb->type         = RTE_MBUF_PKT;
>  	mb->pool         = mp;
>  	mb->buf_addr     = (void *) ((char *)mb + mb_ctor_arg->seg_buf_offset);
>  	mb->buf_physaddr = (uint64_t) (rte_mempool_virt2phy(mp, mb) +
>  			mb_ctor_arg->seg_buf_offset);
>  	mb->buf_len      = mb_ctor_arg->seg_buf_size;
> -	mb->type         = RTE_MBUF_PKT;
>  	mb->ol_flags     = 0;
>  	mb->pkt.data     = (char *) mb->buf_addr + RTE_PKTMBUF_HEADROOM;
>  	mb->pkt.nb_segs  = 1;
> diff --git a/app/test-pmd/txonly.c b/app/test-pmd/txonly.c
> index 1cf2574..1f066d0 100644
> --- a/app/test-pmd/txonly.c
> +++ b/app/test-pmd/txonly.c
> @@ -93,7 +93,7 @@ tx_mbuf_alloc(struct rte_mempool *mp)
>  	struct rte_mbuf *m;
>  
>  	m = __rte_mbuf_raw_alloc(mp);
> -	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
> +	__rte_mbuf_sanity_check_raw(m, 0);
>  	return (m);
>  }
>  
> diff --git a/app/test/commands.c b/app/test/commands.c
> index b145036..c69544b 100644
> --- a/app/test/commands.c
> +++ b/app/test/commands.c
> @@ -262,7 +262,6 @@ dump_struct_sizes(void)
>  #define DUMP_SIZE(t) printf("sizeof(" #t ") = %u\n", (unsigned)sizeof(t));
>  	DUMP_SIZE(struct rte_mbuf);
>  	DUMP_SIZE(struct rte_pktmbuf);
> -	DUMP_SIZE(struct rte_ctrlmbuf);
>  	DUMP_SIZE(struct rte_mempool);
>  	DUMP_SIZE(struct rte_ring);
>  #undef DUMP_SIZE
> diff --git a/app/test/test_mbuf.c b/app/test/test_mbuf.c
> index fe0f4f6..07b5551 100644
> --- a/app/test/test_mbuf.c
> +++ b/app/test/test_mbuf.c
> @@ -80,7 +80,6 @@
>  #define MAKE_STRING(x)          # x
>  
>  static struct rte_mempool *pktmbuf_pool = NULL;
> -static struct rte_mempool *ctrlmbuf_pool = NULL;
>  
>  #if defined RTE_MBUF_REFCNT  && defined RTE_MBUF_REFCNT_ATOMIC
>  
> @@ -272,8 +271,8 @@ test_one_pktmbuf(void)
>  		GOTO_FAIL("Buffer should be continuous");
>  	memset(hdr, 0x55, MBUF_TEST_HDR2_LEN);
>  
> -	rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
> -	rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 0);
> +	rte_mbuf_sanity_check(m, 1);
> +	rte_mbuf_sanity_check(m, 0);
>  	rte_pktmbuf_dump(m, 0);
>  
>  	/* this prepend should fail */
> @@ -320,48 +319,6 @@ fail:
>  	return -1;
>  }
>  
> -/*
> - * test control mbuf
> - */
> -static int
> -test_one_ctrlmbuf(void)
> -{
> -	struct rte_mbuf *m = NULL;
> -	char message[] = "This is a message carried by a ctrlmbuf";
> -
> -	printf("Test ctrlmbuf API\n");
> -
> -	/* alloc a mbuf */
> -
> -	m = rte_ctrlmbuf_alloc(ctrlmbuf_pool);
> -	if (m == NULL)
> -		GOTO_FAIL("Cannot allocate mbuf");
> -	if (rte_ctrlmbuf_len(m) != 0)
> -		GOTO_FAIL("Bad length");
> -
> -	/* set data */
> -	rte_ctrlmbuf_data(m) = &message;
> -	rte_ctrlmbuf_len(m) = sizeof(message);
> -
> -	/* read data */
> -	if (rte_ctrlmbuf_data(m) != message)
> -		GOTO_FAIL("Invalid data pointer");
> -	if (rte_ctrlmbuf_len(m) != sizeof(message))
> -		GOTO_FAIL("Invalid len");
> -
> -	rte_mbuf_sanity_check(m, RTE_MBUF_CTRL, 0);
> -
> -	/* free mbuf */
> -	rte_ctrlmbuf_free(m);
> -	m = NULL;
> -	return 0;
> -
> -fail:
> -	if (m)
> -		rte_ctrlmbuf_free(m);
> -	return -1;
> -}
> -
>  static int
>  testclone_testupdate_testdetach(void)
>  {
> @@ -744,7 +701,7 @@ verify_mbuf_check_panics(struct rte_mbuf *buf)
>  	pid = fork();
>  
>  	if (pid == 0) {
> -		rte_mbuf_sanity_check(buf, RTE_MBUF_PKT, 1); /* should panic */
> +		rte_mbuf_sanity_check(buf, 1); /* should panic */
>  		exit(0);  /* return normally if it doesn't panic */
>  	} else if (pid < 0){
>  		printf("Fork Failed\n");
> @@ -781,13 +738,6 @@ test_failing_mbuf_sanity_check(void)
>  	}
>  
>  	badbuf = *buf;
> -	badbuf.type = (uint8_t)-1;
> -	if (verify_mbuf_check_panics(&badbuf)) {
> -		printf("Error with bad-type mbuf test\n");
> -		return -1;
> -	}
> -
> -	badbuf = *buf;
>  	badbuf.pool = NULL;
>  	if (verify_mbuf_check_panics(&badbuf)) {
>  		printf("Error with bad-pool mbuf test\n");
> @@ -889,22 +839,6 @@ test_mbuf(void)
>  		return -1;
>  	}
>  
> -	/* create ctrlmbuf pool if it does not exist */
> -	if (ctrlmbuf_pool == NULL) {
> -		ctrlmbuf_pool =
> -			rte_mempool_create("test_ctrlmbuf_pool", NB_MBUF,
> -					   sizeof(struct rte_mbuf), 32, 0,
> -					   NULL, NULL,
> -					   rte_ctrlmbuf_init, NULL,
> -					   SOCKET_ID_ANY, 0);
> -	}
> -
> -	/* test control mbuf */
> -	if (test_one_ctrlmbuf() < 0) {
> -		printf("test_one_ctrlmbuf() failed\n");
> -		return -1;
> -	}
> -
>  	/* test free pktmbuf segment one by one */
>  	if (test_pktmbuf_free_segment() < 0) {
>  		printf("test_pktmbuf_free_segment() failed.\n");
> diff --git a/examples/ipv4_multicast/main.c b/examples/ipv4_multicast/main.c
> index 3bd37e4..3967d7a 100644
> --- a/examples/ipv4_multicast/main.c
> +++ b/examples/ipv4_multicast/main.c
> @@ -343,7 +343,7 @@ mcast_out_pkt(struct rte_mbuf *pkt, int use_clone)
>  
>  	hdr->ol_flags = pkt->ol_flags;
>  
> -	__rte_mbuf_sanity_check(hdr, RTE_MBUF_PKT, 1);
> +	__rte_mbuf_sanity_check(hdr, 1);
>  	return (hdr);
>  }
>  
> diff --git a/lib/librte_mbuf/rte_mbuf.c b/lib/librte_mbuf/rte_mbuf.c
> index bffc2c4..b2e2f0f 100644
> --- a/lib/librte_mbuf/rte_mbuf.c
> +++ b/lib/librte_mbuf/rte_mbuf.c
> @@ -60,32 +60,6 @@
>  #include <rte_hexdump.h>
>  
>  /*
> - * ctrlmbuf constructor, given as a callback function to
> - * rte_mempool_create()
> - */
> -void
> -rte_ctrlmbuf_init(struct rte_mempool *mp,
> -		  __attribute__((unused)) void *opaque_arg,
> -		  void *_m,
> -		  __attribute__((unused)) unsigned i)
> -{
> -	struct rte_mbuf *m = _m;
> -
> -	memset(m, 0, mp->elt_size);
> -
> -	/* start of buffer is just after mbuf structure */
> -	m->buf_addr = (char *)m + sizeof(struct rte_mbuf);
> -	m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
> -			sizeof(struct rte_mbuf);
> -	m->buf_len = (uint16_t) (mp->elt_size - sizeof(struct rte_mbuf));
> -
> -	/* init some constant fields */
> -	m->type = RTE_MBUF_CTRL;
> -	m->ctrl.data = (char *)m->buf_addr;
> -	m->pool = (struct rte_mempool *)mp;
> -}
> -
> -/*
>   * pktmbuf pool constructor, given as a callback function to
>   * rte_mempool_create()
>   */
> @@ -133,7 +107,6 @@ rte_pktmbuf_init(struct rte_mempool *mp,
>  	m->pkt.data = (char*) m->buf_addr + RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>  
>  	/* init some constant fields */
> -	m->type = RTE_MBUF_PKT;
>  	m->pool = mp;
>  	m->pkt.nb_segs = 1;
>  	m->pkt.in_port = 0xff;
> @@ -141,16 +114,13 @@ rte_pktmbuf_init(struct rte_mempool *mp,
>  
>  /* do some sanity checks on a mbuf: panic if it fails */
>  void
> -rte_mbuf_sanity_check(const struct rte_mbuf *m, enum rte_mbuf_type t,
> -		      int is_header)
> +rte_mbuf_sanity_check(const struct rte_mbuf *m, int is_header)
>  {
>  	const struct rte_mbuf *m_seg;
>  	unsigned nb_segs;
>  
>  	if (m == NULL)
>  		rte_panic("mbuf is NULL\n");
> -	if (m->type != (uint8_t)t)
> -		rte_panic("bad mbuf type\n");
>  
>  	/* generic checks */
>  	if (m->pool == NULL)
> @@ -166,29 +136,18 @@ rte_mbuf_sanity_check(const struct rte_mbuf *m, enum rte_mbuf_type t,
>  		rte_panic("bad ref cnt\n");
>  #endif
>  
> -	/* nothing to check for ctrl messages */
> -	if (m->type == RTE_MBUF_CTRL)
> +	/* nothing to check for sub-segments */
> +	if (is_header == 0)
>  		return;
>  
> -	/* check pkt consistency */
> -	else if (m->type == RTE_MBUF_PKT) {
> -
> -		/* nothing to check for sub-segments */
> -		if (is_header == 0)
> -			return;
> -
> -		nb_segs = m->pkt.nb_segs;
> -		m_seg = m;
> -		while (m_seg && nb_segs != 0) {
> -			m_seg = m_seg->pkt.next;
> -			nb_segs --;
> -		}
> -		if (nb_segs != 0)
> -			rte_panic("bad nb_segs\n");
> -		return;
> +	nb_segs = m->pkt.nb_segs;
> +	m_seg = m;
> +	while (m_seg && nb_segs != 0) {
> +		m_seg = m_seg->pkt.next;
> +		nb_segs --;
>  	}
> -
> -	rte_panic("unknown mbuf type\n");
> +	if (nb_segs != 0)
> +		rte_panic("bad nb_segs\n");
>  }
>  
>  /* dump a mbuf on console */
> @@ -198,7 +157,7 @@ rte_pktmbuf_dump(const struct rte_mbuf *m, unsigned dump_len)
>  	unsigned int len;
>  	unsigned nb_segs;
>  
> -	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
> +	__rte_mbuf_sanity_check(m, 1);
>  
>  	printf("dump mbuf at 0x%p, phys=%"PRIx64", buf_len=%u\n",
>  	       m, (uint64_t)m->buf_physaddr, (unsigned)m->buf_len);
> @@ -208,7 +167,7 @@ rte_pktmbuf_dump(const struct rte_mbuf *m, unsigned dump_len)
>  	nb_segs = m->pkt.nb_segs;
>  
>  	while (m && nb_segs != 0) {
> -		__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 0);
> +		__rte_mbuf_sanity_check(m, 0);
>  
>  		printf("  segment at 0x%p, data=0x%p, data_len=%u\n",
>  		       m, m->pkt.data, (unsigned)m->pkt.data_len);
> diff --git a/lib/librte_mbuf/rte_mbuf.h b/lib/librte_mbuf/rte_mbuf.h
> index 1b1a84e..22e1ac1 100644
> --- a/lib/librte_mbuf/rte_mbuf.h
> +++ b/lib/librte_mbuf/rte_mbuf.h
> @@ -43,18 +43,13 @@
>   * buffers. The message buffers are stored in a mempool, using the
>   * RTE mempool library.
>   *
> - * This library provide an API to allocate/free mbufs, manipulate
> - * control message buffer (ctrlmbuf), which are generic message
> - * buffers, and packet buffers (pktmbuf), which are used to carry
> - * network packets.
> + * This library provide an API to allocate/free packet mbufs, which are
> + * used to carry network packets.
>   *
>   * To understand the concepts of packet buffers or mbufs, you
>   * should read "TCP/IP Illustrated, Volume 2: The Implementation,
>   * Addison-Wesley, 1995, ISBN 0-201-63354-X from Richard Stevens"
>   * http://www.kohala.com/start/tcpipiv2.html
> - *
> - * The main modification of this implementation is the use of mbuf for
> - * transports other than packets. mbufs can have other types.
>   */
>  
>  #include <stdint.h>
> @@ -70,15 +65,6 @@ extern "C" {
>  /* deprecated feature, renamed in RTE_MBUF_REFCNT */
>  #pragma GCC poison RTE_MBUF_SCATTER_GATHER
>  
> -/**
> - * A control message buffer.
> - */
> -struct rte_ctrlmbuf {
> -	void *data;        /**< Pointer to data. */
> -	uint32_t data_len; /**< Length of data. */
> -};
> -
> -
>  /*
>   * Packet Offload Features Flags. It also carry packet type information.
>   * Critical resources. Both rx/tx shared these bits. Be cautious on any change
> @@ -165,15 +151,7 @@ struct rte_pktmbuf {
>  };
>  
>  /**
> - * This enum indicates the mbuf type.
> - */
> -enum rte_mbuf_type {
> -	RTE_MBUF_CTRL,  /**< Control mbuf. */
> -	RTE_MBUF_PKT,   /**< Packet mbuf. */
> -};
> -
> -/**
> - * The generic rte_mbuf, containing a packet mbuf or a control mbuf.
> + * The generic rte_mbuf, containing a packet mbuf.
>   */
>  struct rte_mbuf {
>  	struct rte_mempool *pool; /**< Pool from which mbuf was allocated. */
> @@ -196,14 +174,10 @@ struct rte_mbuf {
>  #else
>  	uint16_t refcnt_reserved;     /**< Do not use this field */
>  #endif
> -	uint8_t type;                 /**< Type of mbuf. */
> -	uint8_t reserved;             /**< Unused field. Required for padding. */
> +	uint16_t reserved;             /**< Unused field. Required for padding. */
>  	uint16_t ol_flags;            /**< Offload features. */
>  
> -	union {
> -		struct rte_ctrlmbuf ctrl;
> -		struct rte_pktmbuf pkt;
> -	};
> +	struct rte_pktmbuf pkt;
>  } __rte_cache_aligned;
>  
>  /**
> @@ -241,12 +215,12 @@ struct rte_pktmbuf_pool_private {
>  #ifdef RTE_LIBRTE_MBUF_DEBUG
>  
>  /**  check mbuf type in debug mode */
> -#define __rte_mbuf_sanity_check(m, t, is_h) rte_mbuf_sanity_check(m, t, is_h)
> +#define __rte_mbuf_sanity_check(m, is_h) rte_mbuf_sanity_check(m, is_h)
>  
>  /**  check mbuf type in debug mode if mbuf pointer is not null */
> -#define __rte_mbuf_sanity_check_raw(m, t, is_h)	do {       \
> +#define __rte_mbuf_sanity_check_raw(m, is_h)	do {       \
>  	if ((m) != NULL)                                   \
> -		rte_mbuf_sanity_check(m, t, is_h);          \
> +		rte_mbuf_sanity_check(m, is_h);          \
>  } while (0)
>  
>  /**  MBUF asserts in debug mode */
> @@ -258,10 +232,10 @@ if (!(exp)) {                                                        \
>  #else /*  RTE_LIBRTE_MBUF_DEBUG */
>  
>  /**  check mbuf type in debug mode */
> -#define __rte_mbuf_sanity_check(m, t, is_h) do { } while(0)
> +#define __rte_mbuf_sanity_check(m, is_h) do { } while(0)
>  
>  /**  check mbuf type in debug mode if mbuf pointer is not null */
> -#define __rte_mbuf_sanity_check_raw(m, t, is_h) do { } while(0)
> +#define __rte_mbuf_sanity_check_raw(m, is_h) do { } while(0)
>  
>  /**  MBUF asserts in debug mode */
>  #define RTE_MBUF_ASSERT(exp)                do { } while(0)
> @@ -368,20 +342,17 @@ rte_mbuf_refcnt_set(struct rte_mbuf *m, uint16_t new_value)
>   *
>   * @param m
>   *   The mbuf to be checked.
> - * @param t
> - *   The expected type of the mbuf.
>   * @param is_header
>   *   True if the mbuf is a packet header, false if it is a sub-segment
>   *   of a packet (in this case, some fields like nb_segs are not checked)
>   */
>  void
> -rte_mbuf_sanity_check(const struct rte_mbuf *m, enum rte_mbuf_type t,
> -		      int is_header);
> +rte_mbuf_sanity_check(const struct rte_mbuf *m, int is_header);
>  
>  /**
>   * @internal Allocate a new mbuf from mempool *mp*.
>   * The use of that function is reserved for RTE internal needs.
> - * Please use either rte_ctrlmbuf_alloc() or rte_pktmbuf_alloc().
> + * Please use rte_pktmbuf_alloc().
>   *
>   * @param mp
>   *   The mempool from which mbuf is allocated.
> @@ -406,7 +377,7 @@ static inline struct rte_mbuf *__rte_mbuf_raw_alloc(struct rte_mempool *mp)
>  /**
>   * @internal Put mbuf back into its original mempool.
>   * The use of that function is reserved for RTE internal needs.
> - * Please use either rte_ctrlmbuf_free() or rte_pktmbuf_free().
> + * Please use rte_pktmbuf_free().
>   *
>   * @param m
>   *   The mbuf to be freed.
> @@ -420,95 +391,11 @@ __rte_mbuf_raw_free(struct rte_mbuf *m)
>  	rte_mempool_put(m->pool, m);
>  }
>  
> -/* Operations on ctrl mbuf */
> -
> -/**
> - * The control mbuf constructor.
> - *
> - * This function initializes some fields in an mbuf structure that are
> - * not modified by the user once created (mbuf type, origin pool, buffer
> - * start address, and so on). This function is given as a callback function
> - * to rte_mempool_create() at pool creation time.
> - *
> - * @param mp
> - *   The mempool from which the mbuf is allocated.
> - * @param opaque_arg
> - *   A pointer that can be used by the user to retrieve useful information
> - *   for mbuf initialization. This pointer comes from the ``init_arg``
> - *   parameter of rte_mempool_create().
> - * @param m
> - *   The mbuf to initialize.
> - * @param i
> - *   The index of the mbuf in the pool table.
> - */
> -void rte_ctrlmbuf_init(struct rte_mempool *mp, void *opaque_arg,
> -		       void *m, unsigned i);
> -
> -/**
> - * Allocate a new mbuf (type is ctrl) from mempool *mp*.
> - *
> - * This new mbuf is initialized with data pointing to the beginning of
> - * buffer, and with a length of zero.
> - *
> - * @param mp
> - *   The mempool from which the mbuf is allocated.
> - * @return
> - *   - The pointer to the new mbuf on success.
> - *   - NULL if allocation failed.
> - */
> -static inline struct rte_mbuf *rte_ctrlmbuf_alloc(struct rte_mempool *mp)
> -{
> -	struct rte_mbuf *m;
> -	if ((m = __rte_mbuf_raw_alloc(mp)) != NULL) {
> -		m->ctrl.data = m->buf_addr;
> -		m->ctrl.data_len = 0;
> -		__rte_mbuf_sanity_check(m, RTE_MBUF_CTRL, 0);
> -	}
> -	return (m);
> -}
> -
> -/**
> - * Free a control mbuf back into its original mempool.
> - *
> - * @param m
> - *   The control mbuf to be freed.
> - */
> -static inline void rte_ctrlmbuf_free(struct rte_mbuf *m)
> -{
> -	__rte_mbuf_sanity_check(m, RTE_MBUF_CTRL, 0);
> -#ifdef RTE_MBUF_SCATTER_GATHER
> -	if (rte_mbuf_refcnt_update(m, -1) == 0)
> -#endif /* RTE_MBUF_SCATTER_GATHER */
> -		__rte_mbuf_raw_free(m);
> -}
> -
> -/**
> - * A macro that returns the pointer to the carried data.
> - *
> - * The value that can be read or assigned.
> - *
> - * @param m
> - *   The control mbuf.
> - */
> -#define rte_ctrlmbuf_data(m) ((m)->ctrl.data)
> -
> -/**
> - * A macro that returns the length of the carried data.
> - *
> - * The value that can be read or assigned.
> - *
> - * @param m
> - *   The control mbuf.
> - */
> -#define rte_ctrlmbuf_len(m) ((m)->ctrl.data_len)
> -
> -/* Operations on pkt mbuf */
> -
>  /**
>   * The packet mbuf constructor.
>   *
> - * This function initializes some fields in the mbuf structure that are not
> - * modified by the user once created (mbuf type, origin pool, buffer start
> + * This function initializes some fields in the mbuf structure that are
> + * not modified by the user once created (origin pool, buffer start
>   * address, and so on). This function is given as a callback function to
>   * rte_mempool_create() at pool creation time.
>   *
> @@ -569,11 +456,11 @@ static inline void rte_pktmbuf_reset(struct rte_mbuf *m)
>  	m->pkt.data = (char*) m->buf_addr + buf_ofs;
>  
>  	m->pkt.data_len = 0;
> -	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
> +	__rte_mbuf_sanity_check(m, 1);
>  }
>  
>  /**
> - * Allocate a new mbuf (type is pkt) from a mempool.
> + * Allocate a new mbuf from a mempool.
>   *
>   * This new mbuf contains one segment, which has a length of 0. The pointer
>   * to data is initialized to have some bytes of headroom in the buffer
> @@ -629,8 +516,8 @@ static inline void rte_pktmbuf_attach(struct rte_mbuf *mi, struct rte_mbuf *md)
>  	mi->pkt.nb_segs = 1;
>  	mi->ol_flags = md->ol_flags;
>  
> -	__rte_mbuf_sanity_check(mi, RTE_MBUF_PKT, 1);
> -	__rte_mbuf_sanity_check(md, RTE_MBUF_PKT, 0);
> +	__rte_mbuf_sanity_check(mi, 1);
> +	__rte_mbuf_sanity_check(md, 0);
>  }
>  
>  /**
> @@ -667,7 +554,7 @@ static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
>  static inline struct rte_mbuf* __attribute__((always_inline))
>  __rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
>  {
> -	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 0);
> +	__rte_mbuf_sanity_check(m, 0);
>  
>  #ifdef RTE_MBUF_REFCNT
>  	if (likely (rte_mbuf_refcnt_read(m) == 1) ||
> @@ -722,7 +609,7 @@ static inline void rte_pktmbuf_free(struct rte_mbuf *m)
>  {
>  	struct rte_mbuf *m_next;
>  
> -	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
> +	__rte_mbuf_sanity_check(m, 1);
>  
>  	while (m != NULL) {
>  		m_next = m->pkt.next;
> @@ -783,7 +670,7 @@ static inline struct rte_mbuf *rte_pktmbuf_clone(struct rte_mbuf *md,
>  		return (NULL);
>  	}
>  
> -	__rte_mbuf_sanity_check(mc, RTE_MBUF_PKT, 1);
> +	__rte_mbuf_sanity_check(mc, 1);
>  	return (mc);
>  }
>  
> @@ -800,7 +687,7 @@ static inline struct rte_mbuf *rte_pktmbuf_clone(struct rte_mbuf *md,
>   */
>  static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
>  {
> -	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
> +	__rte_mbuf_sanity_check(m, 1);
>  
>  	do {
>  		rte_mbuf_refcnt_update(m, v);
> @@ -819,7 +706,7 @@ static inline void rte_pktmbuf_refcnt_update(struct rte_mbuf *m, int16_t v)
>   */
>  static inline uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m)
>  {
> -	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
> +	__rte_mbuf_sanity_check(m, 1);
>  	return (uint16_t) ((char*) m->pkt.data - (char*) m->buf_addr);
>  }
>  
> @@ -833,7 +720,7 @@ static inline uint16_t rte_pktmbuf_headroom(const struct rte_mbuf *m)
>   */
>  static inline uint16_t rte_pktmbuf_tailroom(const struct rte_mbuf *m)
>  {
> -	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
> +	__rte_mbuf_sanity_check(m, 1);
>  	return (uint16_t)(m->buf_len - rte_pktmbuf_headroom(m) -
>  			  m->pkt.data_len);
>  }
> @@ -850,7 +737,7 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
>  {
>  	struct rte_mbuf *m2 = (struct rte_mbuf *)m;
>  
> -	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
> +	__rte_mbuf_sanity_check(m, 1);
>  	while (m2->pkt.next != NULL)
>  		m2 = m2->pkt.next;
>  	return m2;
> @@ -908,7 +795,7 @@ static inline struct rte_mbuf *rte_pktmbuf_lastseg(struct rte_mbuf *m)
>  static inline char *rte_pktmbuf_prepend(struct rte_mbuf *m,
>  					uint16_t len)
>  {
> -	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
> +	__rte_mbuf_sanity_check(m, 1);
>  
>  	if (unlikely(len > rte_pktmbuf_headroom(m)))
>  		return NULL;
> @@ -940,7 +827,7 @@ static inline char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
>  	void *tail;
>  	struct rte_mbuf *m_last;
>  
> -	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
> +	__rte_mbuf_sanity_check(m, 1);
>  
>  	m_last = rte_pktmbuf_lastseg(m);
>  	if (unlikely(len > rte_pktmbuf_tailroom(m_last)))
> @@ -968,7 +855,7 @@ static inline char *rte_pktmbuf_append(struct rte_mbuf *m, uint16_t len)
>   */
>  static inline char *rte_pktmbuf_adj(struct rte_mbuf *m, uint16_t len)
>  {
> -	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
> +	__rte_mbuf_sanity_check(m, 1);
>  
>  	if (unlikely(len > m->pkt.data_len))
>  		return NULL;
> @@ -997,7 +884,7 @@ static inline int rte_pktmbuf_trim(struct rte_mbuf *m, uint16_t len)
>  {
>  	struct rte_mbuf *m_last;
>  
> -	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
> +	__rte_mbuf_sanity_check(m, 1);
>  
>  	m_last = rte_pktmbuf_lastseg(m);
>  	if (unlikely(len > m_last->pkt.data_len))
> @@ -1019,7 +906,7 @@ static inline int rte_pktmbuf_trim(struct rte_mbuf *m, uint16_t len)
>   */
>  static inline int rte_pktmbuf_is_contiguous(const struct rte_mbuf *m)
>  {
> -	__rte_mbuf_sanity_check(m, RTE_MBUF_PKT, 1);
> +	__rte_mbuf_sanity_check(m, 1);
>  	return !!(m->pkt.nb_segs == 1);
>  }
>  
> diff --git a/lib/librte_pmd_e1000/em_rxtx.c b/lib/librte_pmd_e1000/em_rxtx.c
> index 78c0c44..31f480a 100644
> --- a/lib/librte_pmd_e1000/em_rxtx.c
> +++ b/lib/librte_pmd_e1000/em_rxtx.c
> @@ -85,7 +85,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
>  	struct rte_mbuf *m;
>  
>  	m = __rte_mbuf_raw_alloc(mp);
> -	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
> +	__rte_mbuf_sanity_check_raw(m, 0);
>  	return (m);
>  }
>  
> diff --git a/lib/librte_pmd_e1000/igb_rxtx.c b/lib/librte_pmd_e1000/igb_rxtx.c
> index b3c8149..62ff7bc 100644
> --- a/lib/librte_pmd_e1000/igb_rxtx.c
> +++ b/lib/librte_pmd_e1000/igb_rxtx.c
> @@ -79,7 +79,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
>  	struct rte_mbuf *m;
>  
>  	m = __rte_mbuf_raw_alloc(mp);
> -	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
> +	__rte_mbuf_sanity_check_raw(m, 0);
>  	return (m);
>  }
>  
> diff --git a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> index 4e307c2..76448ab 100644
> --- a/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> +++ b/lib/librte_pmd_ixgbe/ixgbe_rxtx.c
> @@ -88,7 +88,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
>  	struct rte_mbuf *m;
>  
>  	m = __rte_mbuf_raw_alloc(mp);
> -	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
> +	__rte_mbuf_sanity_check_raw(m, 0);
>  	return (m);
>  }
>  
> @@ -987,7 +987,6 @@ ixgbe_rx_alloc_bufs(struct igb_rx_queue *rxq)
>  		/* populate the static rte mbuf fields */
>  		mb = rxep[i].mbuf;
>  		rte_mbuf_refcnt_set(mb, 1);
> -		mb->type = RTE_MBUF_PKT;
>  		mb->pkt.next = NULL;
>  		mb->pkt.data = (char *)mb->buf_addr + RTE_PKTMBUF_HEADROOM;
>  		mb->pkt.nb_segs = 1;
> @@ -3084,7 +3083,6 @@ ixgbe_alloc_rx_queue_mbufs(struct igb_rx_queue *rxq)
>  		}
>  
>  		rte_mbuf_refcnt_set(mbuf, 1);
> -		mbuf->type = RTE_MBUF_PKT;
>  		mbuf->pkt.next = NULL;
>  		mbuf->pkt.data = (char *)mbuf->buf_addr + RTE_PKTMBUF_HEADROOM;
>  		mbuf->pkt.nb_segs = 1;
> diff --git a/lib/librte_pmd_virtio/virtio_rxtx.c b/lib/librte_pmd_virtio/virtio_rxtx.c
> index fe94a3f..0db3ba0 100644
> --- a/lib/librte_pmd_virtio/virtio_rxtx.c
> +++ b/lib/librte_pmd_virtio/virtio_rxtx.c
> @@ -66,7 +66,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
>  	struct rte_mbuf *m;
>  
>  	m = __rte_mbuf_raw_alloc(mp);
> -	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
> +	__rte_mbuf_sanity_check_raw(m, 0);
>  
>  	return (m);
>  }
> diff --git a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
> index 9fdd441..d91404a 100644
> --- a/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
> +++ b/lib/librte_pmd_vmxnet3/vmxnet3_rxtx.c
> @@ -101,7 +101,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
>  	struct rte_mbuf *m;
>  
>  	m = __rte_mbuf_raw_alloc(mp);
> -	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
> +	__rte_mbuf_sanity_check_raw(m, 0);
>  	return (m);
>  }
>  
> diff --git a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
> index 533aa76..5cd1cdb 100644
> --- a/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
> +++ b/lib/librte_pmd_xenvirt/rte_eth_xenvirt.c
> @@ -80,7 +80,7 @@ rte_rxmbuf_alloc(struct rte_mempool *mp)
>  	struct rte_mbuf *m;
>  
>  	m = __rte_mbuf_raw_alloc(mp);
> -	__rte_mbuf_sanity_check_raw(m, RTE_MBUF_PKT, 0);
> +	__rte_mbuf_sanity_check_raw(m, 0);
>  
>  	return m;
>  }
> --
> 1.9.2
> 
> 


* Re: [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf
  2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf Olivier Matz
  2014-05-25 21:39   ` Gilmore, Walter E
@ 2014-05-27  0:17   ` Stephen Hemminger
  2014-05-28  9:45     ` Ananyev, Konstantin
  1 sibling, 1 reply; 51+ messages in thread
From: Stephen Hemminger @ 2014-05-27  0:17 UTC (permalink / raw)
  To: Olivier Matz; +Cc: dev

On Fri,  9 May 2014 16:50:30 +0200
Olivier Matz <olivier.matz@6wind.com> wrote:

> The initial role of rte_ctrlmbuf is to carry generic messages (data
> pointer + data length) but it's not used by the DPDK or it applications.
> Keeping it implies:
>   - loosing 1 byte in the rte_mbuf structure
>   - having some dead code rte_mbuf.[ch]
> 
> This patch removes this feature. Thanks to it, it is now possible to
> simplify the rte_mbuf structure by merging the rte_pktmbuf structure
> in it. This is done in next commit.
> 
> Signed-off-by: Olivier Matz <olivier.matz@6wind.com>

The only win from this is saving the byte used by the type field.
Yes, bits here are precious.

Do external applications mix control and data mbufs in the same ring?
The code in the tree only uses the type field for debug validation/sanity
checks.

Since it is only one bit, maybe you can find a spare bit to store it.
Since buffer and pool addresses are going to be at least 32-bit aligned,
maybe you can use the old GC trick of using the LSB as a flag.
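
For reference, the LSB trick mentioned above can be sketched as follows. This is only an illustration under stated assumptions: struct pool, struct tagged_mbuf and the tagged_mbuf_*() helpers are hypothetical names, not DPDK code, and the sketch assumes every pool address has bit 0 clear because of alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: none of these exist in DPDK. */
#define MBUF_CTRL_FLAG ((uintptr_t)1)

struct pool { int dummy; };      /* stand-in for struct rte_mempool */

struct tagged_mbuf {
	uintptr_t pool_and_type; /* pool pointer with the type flag in bit 0 */
};

/* Store the pool pointer and the one-bit type in a single word. */
static inline void
tagged_mbuf_set(struct tagged_mbuf *m, struct pool *p, int is_ctrl)
{
	/* The trick only works if bit 0 of the pointer is always zero. */
	assert(((uintptr_t)p & MBUF_CTRL_FLAG) == 0);
	m->pool_and_type = (uintptr_t)p | (is_ctrl ? MBUF_CTRL_FLAG : 0);
}

/* Mask the flag off before using the value as a pointer again. */
static inline struct pool *
tagged_mbuf_pool(const struct tagged_mbuf *m)
{
	return (struct pool *)(m->pool_and_type & ~MBUF_CTRL_FLAG);
}

/* Return 1 for a control mbuf, 0 for a packet mbuf. */
static inline int
tagged_mbuf_is_ctrl(const struct tagged_mbuf *m)
{
	return (int)(m->pool_and_type & MBUF_CTRL_FLAG);
}
```

The cost is that every access to the pool pointer has to go through the masking accessor, which adds an AND on the free path.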


* Re: [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf
  2014-05-27  0:17   ` Stephen Hemminger
@ 2014-05-28  9:45     ` Ananyev, Konstantin
  0 siblings, 0 replies; 51+ messages in thread
From: Ananyev, Konstantin @ 2014-05-28  9:45 UTC (permalink / raw)
  To: Stephen Hemminger, Olivier Matz; +Cc: dev

Hi,

>The only win from this is to save the byte for the type field.
>Yes bits here are precious.

>Does external application mix control and data mbuf's in the same ring?
>The stuff in the tree only uses type field for debug validation/sanity
>checks.

>Since it is only one bit, maybe you can find one bit to store that. 
>Since buffer and pool addresses are going to be at least 32 bit aligned
>maybe you can use the old GC trick of using the LSB as flag.

Or, as an alternative, we can move the mbuf type up into the mempool.
In most cases the user has to deal with only one particular type of mbuf and already knows which type it is.
For the rare cases when code needs to deal with a mix of mbuf types,
it is probably OK to read the mbuf type from the corresponding mempool.
Of course, it would mean that all elements in a mempool must have the same type,
but I don't think people are currently using mempools with a mix of pktmbuf/ctrlmbuf anyway.
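
A minimal sketch of that idea, assuming one type per mempool. The
structures and names below (ex_mempool, ex_mbuf, ex_mbuf_type and the
EX_MBUF_* values) are hypothetical and reduced to the fields needed for
the illustration; they are not the real DPDK definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mbuf types, one per pool. */
enum ex_mbuf_type {
    EX_MBUF_CTRL,
    EX_MBUF_PKT
};

/* Hypothetical mempool: the type is stored once for all its elements. */
struct ex_mempool {
    enum ex_mbuf_type elt_type;
};

/* Hypothetical mbuf: no per-buffer type field, only a pool back-pointer. */
struct ex_mbuf {
    struct ex_mempool *pool;
};

/* Code handling mixed types reads the type through the owning pool. */
static inline enum ex_mbuf_type ex_mbuf_type(const struct ex_mbuf *m)
{
    assert(m != NULL && m->pool != NULL);
    return m->pool->elt_type;
}
```

The lookup costs an extra pointer dereference, but only on the rare paths
that actually need to distinguish mbuf types.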

Konstantin 


end of thread, other threads:[~2014-05-28  9:46 UTC | newest]

Thread overview: 51+ messages
2014-05-09 14:50 [dpdk-dev] [PATCH RFC 00/11] ixgbe/mbuf: add TSO support Olivier Matz
2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 01/11] igb/ixgbe: fix IP checksum calculation Olivier Matz
2014-05-15 10:40   ` Ananyev, Konstantin
2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 02/11] mbuf: rename RTE_MBUF_SCATTER_GATHER into RTE_MBUF_REFCNT Olivier Matz
2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 03/11] mbuf: remove rte_ctrlmbuf Olivier Matz
2014-05-25 21:39   ` Gilmore, Walter E
2014-05-26 12:23     ` Olivier MATZ
2014-05-26 16:40     ` Dumitrescu, Cristian
2014-05-26 22:43     ` Neil Horman
2014-05-27  0:17   ` Stephen Hemminger
2014-05-28  9:45     ` Ananyev, Konstantin
2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 04/11] mbuf: remove the rte_pktmbuf structure Olivier Matz
2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 05/11] mbuf: merge physaddr and buf_len in a bitfield Olivier Matz
2014-05-09 15:39   ` Shaw, Jeffrey B
2014-05-09 16:06     ` Olivier MATZ
2014-05-09 16:11       ` Shaw, Jeffrey B
2014-05-14 14:07         ` Ananyev, Konstantin
2014-05-15  9:53           ` Olivier MATZ
2014-05-19  7:27         ` Olivier MATZ
2014-05-19  8:25           ` Richardson, Bruce
2014-05-19  9:30             ` Olivier MATZ
2014-05-19  9:57               ` Richardson, Bruce
2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 06/11] mbuf: replace data pointer by an offset Olivier Matz
2014-05-12 14:12   ` Thomas Monjalon
2014-05-12 14:36     ` Venkatesan, Venky
2014-05-12 14:41       ` Neil Horman
2014-05-12 15:07         ` Olivier MATZ
2014-05-12 15:59           ` Stephen Hemminger
2014-05-12 16:13             ` Olivier MATZ
2014-05-12 17:13               ` Stephen Hemminger
2014-05-13 13:29                 ` Olivier MATZ
2014-05-12 16:06           ` Venkatesan, Venky
2014-05-12 18:39             ` Neil Horman
2014-05-13 13:54               ` Venkatesan, Venky
2014-05-13 14:09                 ` Thomas Monjalon
2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 07/11] mbuf: add functions to get the name of an ol_flag Olivier Matz
2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 08/11] mbuf: change ol_flags to 32 bits Olivier Matz
2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 09/11] mbuf: rename vlan_macip_len in hw_offload and increase its size Olivier Matz
2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 10/11] testpmd: modify source address to validate checksum calculation Olivier Matz
2014-05-09 14:50 ` [dpdk-dev] [PATCH RFC 11/11] ixgbe/mbuf: add TSO support Olivier Matz
2014-05-12 14:30   ` Thomas Monjalon
2014-05-15 15:09   ` Ananyev, Konstantin
2014-05-15 15:39     ` Olivier MATZ
2014-05-15 16:30       ` Ananyev, Konstantin
2014-05-16 12:11         ` Olivier MATZ
2014-05-16 17:01           ` Ananyev, Konstantin
2014-05-19 12:32             ` Thomas Monjalon
2014-05-09 17:04 ` [dpdk-dev] [PATCH RFC 00/11] " Stephen Hemminger
2014-05-09 21:49   ` Olivier MATZ
2014-05-10  0:39     ` Stephen Hemminger
2014-05-19 12:47 ` Thomas Monjalon
