From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <konstantin.ananyev@intel.com>
Received: from mga04.intel.com (mga04.intel.com [192.55.52.120])
 by dpdk.org (Postfix) with ESMTP id 0EE4B29CA
 for <dev@dpdk.org>; Thu, 29 Jun 2017 01:56:14 +0200 (CEST)
Received: from orsmga004.jf.intel.com ([10.7.209.38])
 by fmsmga104.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;
 28 Jun 2017 16:56:13 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.40,278,1496127600"; d="scan'208";a="102502267"
Received: from irsmsx106.ger.corp.intel.com ([163.33.3.31])
 by orsmga004.jf.intel.com with ESMTP; 28 Jun 2017 16:56:11 -0700
Received: from irsmsx109.ger.corp.intel.com ([169.254.13.115]) by
 IRSMSX106.ger.corp.intel.com ([169.254.8.236]) with mapi id 14.03.0319.002;
 Thu, 29 Jun 2017 00:56:11 +0100
From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
To: "Hu, Jiayu" <jiayu.hu@intel.com>, "dev@dpdk.org" <dev@dpdk.org>
CC: "Tan, Jianfeng" <jianfeng.tan@intel.com>, "stephen@networkplumber.org"
 <stephen@networkplumber.org>, "yliu@fridaylinux.org" <yliu@fridaylinux.org>,
 "Wu, Jingjing" <jingjing.wu@intel.com>, "Yao, Lei A" <lei.a.yao@intel.com>,
 "Wiles, Keith" <keith.wiles@intel.com>, "Bie, Tiwei" <tiwei.bie@intel.com>
Thread-Topic: [PATCH v7 2/3] lib/gro: add TCP/IPv4 GRO support
Thread-Index: AQHS7kdlXNHGhRYN9Eipev8pdR3kGKI6kXCQ
Date: Wed, 28 Jun 2017 23:56:10 +0000
Message-ID: <2601191342CEEE43887BDE71AB9772583FB13208@IRSMSX109.ger.corp.intel.com>
References: <1498229000-94867-1-git-send-email-jiayu.hu@intel.com>
 <1498459430-116048-1-git-send-email-jiayu.hu@intel.com>
 <1498459430-116048-3-git-send-email-jiayu.hu@intel.com>
In-Reply-To: <1498459430-116048-3-git-send-email-jiayu.hu@intel.com>
Accept-Language: en-IE, en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
dlp-product: dlpe-windows
dlp-version: 10.0.102.7
dlp-reaction: no-action
x-originating-ip: [163.33.239.180]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: Re: [dpdk-dev] [PATCH v7 2/3] lib/gro: add TCP/IPv4 GRO support
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Wed, 28 Jun 2017 23:56:16 -0000

Hi Jiayu,

>=20
> In this patch, we introduce five APIs to support TCP/IPv4 GRO.
> - gro_tcp_tbl_create: create a TCP reassembly table, which is used to
>     merge packets.
> - gro_tcp_tbl_destroy: free memory space of a TCP reassembly table.
> - gro_tcp_tbl_flush: flush all packets from a TCP reassembly table.
> - gro_tcp_tbl_timeout_flush: flush timeout packets from a TCP
>     reassembly table.
> - gro_tcp4_reassemble: reassemble an inputted TCP/IPv4 packet.
>=20
> TCP/IPv4 GRO API assumes all inputted packets are with correct IPv4
> and TCP checksums. And TCP/IPv4 GRO API doesn't update IPv4 and TCP
> checksums for merged packets. If inputted packets are IP fragmented,
> TCP/IPv4 GRO API assumes they are complete packets (i.e. with L4
> headers).
>=20
> In TCP GRO, we use a table structure, called TCP reassembly table, to
> reassemble packets. Both TCP/IPv4 and TCP/IPv6 GRO use the same table
> structure. A TCP reassembly table includes a key array and a item array,
> where the key array keeps the criteria to merge packets and the item
> array keeps packet information.
>=20
> One key in the key array points to an item group, which consists of
> packets which have the same criteria value. If two packets are able to
> merge, they must be in the same item group. Each key in the key array
> includes two parts:
> - criteria: the criteria of merging packets. If two packets can be
>     merged, they must have the same criteria value.
> - start_index: the index of the first incoming packet of the item group.
>=20
> Each element in the item array keeps the information of one packet. It
> mainly includes two parts:
> - pkt: packet address
> - next_pkt_index: the index of the next packet in the same item group.
>     All packets in the same item group are chained by next_pkt_index.
>     With next_pkt_index, we can locate all packets in the same item
>     group one by one.
>=20
> To process an incoming packet needs three steps:
> a. check if the packet should be processed. Packets with the following
>     properties won't be processed:
> 	- packets without data (e.g. SYN, SYN-ACK)
> b. traverse the key array to find a key which has the same criteria
>     value with the incoming packet. If find, goto step c. Otherwise,
>     insert a new key and insert the packet into the item array.
> c. locate the first packet in the item group via the start_index in the
>     key. Then traverse all packets in the item group via next_pkt_index.
>     If find one packet which can merge with the incoming one, merge them
>     together. If can't find, insert the packet into this item group.
>=20
> Signed-off-by: Jiayu Hu <jiayu.hu@intel.com>
> ---
>  doc/guides/rel_notes/release_17_08.rst |   7 +
>  lib/librte_gro/Makefile                |   1 +
>  lib/librte_gro/rte_gro.c               | 123 ++++++++--
>  lib/librte_gro/rte_gro.h               |   6 +-
>  lib/librte_gro/rte_gro_tcp.c           | 394 +++++++++++++++++++++++++++++++++
>  lib/librte_gro/rte_gro_tcp.h           | 191 ++++++++++++++++
>  6 files changed, 706 insertions(+), 16 deletions(-)
>  create mode 100644 lib/librte_gro/rte_gro_tcp.c
>  create mode 100644 lib/librte_gro/rte_gro_tcp.h
>=20
> diff --git a/doc/guides/rel_notes/release_17_08.rst b/doc/guides/rel_notes/release_17_08.rst
> index 842f46f..f067247 100644
> --- a/doc/guides/rel_notes/release_17_08.rst
> +++ b/doc/guides/rel_notes/release_17_08.rst
> @@ -75,6 +75,13 @@ New Features
>=20
>    Added support for firmwares with multiple Ethernet ports per physical port.
>=20
> +* **Add Generic Receive Offload API support.**
> +
> +  Generic Receive Offload (GRO) API supports to reassemble TCP/IPv4
> +  packets. GRO API assumes all inputted packets are with correct
> +  checksums. GRO API doesn't update checksums for merged packets. If
> +  inputted packets are IP fragmented, GRO API assumes they are complete
> +  packets (i.e. with L4 headers).
>=20
>  Resolved Issues
>  ---------------
> diff --git a/lib/librte_gro/Makefile b/lib/librte_gro/Makefile
> index 7e0f128..e89344d 100644
> --- a/lib/librte_gro/Makefile
> +++ b/lib/librte_gro/Makefile
> @@ -43,6 +43,7 @@ LIBABIVER :=3D 1
>=20
>  # source files
>  SRCS-$(CONFIG_RTE_LIBRTE_GRO) +=3D rte_gro.c
> +SRCS-$(CONFIG_RTE_LIBRTE_GRO) +=3D rte_gro_tcp.c
>=20
>  # install this header file
>  SYMLINK-$(CONFIG_RTE_LIBRTE_GRO)-include +=3D rte_gro.h
> diff --git a/lib/librte_gro/rte_gro.c b/lib/librte_gro/rte_gro.c
> index 33275e8..5b89928 100644
> --- a/lib/librte_gro/rte_gro.c
> +++ b/lib/librte_gro/rte_gro.c
> @@ -32,11 +32,15 @@
>=20
>  #include <rte_malloc.h>
>  #include <rte_mbuf.h>
> +#include <rte_ethdev.h>
>=20
>  #include "rte_gro.h"
> +#include "rte_gro_tcp.h"
>=20
> -static gro_tbl_create_fn tbl_create_functions[GRO_TYPE_MAX_NUM];
> -static gro_tbl_destroy_fn tbl_destroy_functions[GRO_TYPE_MAX_NUM];
> +static gro_tbl_create_fn tbl_create_functions[GRO_TYPE_MAX_NUM] =3D {
> +	gro_tcp_tbl_create, NULL};
> +static gro_tbl_destroy_fn tbl_destroy_functions[GRO_TYPE_MAX_NUM] =3D {
> +	gro_tcp_tbl_destroy, NULL};
>=20
>  struct rte_gro_tbl *rte_gro_tbl_create(uint16_t socket_id,
>  		uint16_t max_flow_num,
> @@ -94,32 +98,121 @@ void rte_gro_tbl_destroy(struct rte_gro_tbl *gro_tbl)
>  }
>=20
>  uint16_t
> -rte_gro_reassemble_burst(struct rte_mbuf **pkts __rte_unused,
> +rte_gro_reassemble_burst(struct rte_mbuf **pkts,
>  		const uint16_t nb_pkts,
> -		const struct rte_gro_param param __rte_unused)
> +		const struct rte_gro_param param)
>  {
> -	return nb_pkts;
> +	uint16_t i;
> +	uint16_t nb_after_gro =3D nb_pkts;
> +	uint32_t item_num =3D RTE_MIN(nb_pkts, param.max_flow_num *
> +			param.max_item_per_flow);
> +
> +	/* allocate a reassembly table for TCP/IPv4 GRO */
> +	uint32_t tcp_item_num =3D RTE_MIN(item_num, GRO_MAX_BURST_ITEM_NUM);
> +	struct gro_tcp_tbl tcp_tbl;
> +	struct gro_tcp_key tcp_keys[tcp_item_num];
> +	struct gro_tcp_item tcp_items[tcp_item_num];
> +
> +	struct rte_mbuf *unprocess_pkts[nb_pkts];
> +	uint16_t unprocess_num =3D 0;
> +	int32_t ret;
> +
> +	memset(tcp_keys, 0, sizeof(struct gro_tcp_key) *
> +			tcp_item_num);
> +	memset(tcp_items, 0, sizeof(struct gro_tcp_item) *
> +			tcp_item_num);
> +	tcp_tbl.keys =3D tcp_keys;
> +	tcp_tbl.items =3D tcp_items;
> +	tcp_tbl.key_num =3D 0;
> +	tcp_tbl.item_num =3D 0;
> +	tcp_tbl.max_key_num =3D tcp_item_num;
> +	tcp_tbl.max_item_num =3D tcp_item_num;
> +
> +	for (i =3D 0; i < nb_pkts; i++) {
> +		if (RTE_ETH_IS_IPV4_HDR(pkts[i]->packet_type)) {

Why not just use && for these two conditions?

> +			if ((pkts[i]->packet_type & RTE_PTYPE_L4_TCP) &&
> +				(param.desired_gro_types &
> +					 GRO_TCP_IPV4)) {

No need to check param.desired_gro_types inside the loop.
You can do that before the loop.
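
Something along these lines is what I have in mind. This is only a sketch with stand-in types so it builds without DPDK: struct pkt, the PT_* bits and count_tcp4_candidates() are made-up names, and the real RTE_ETH_IS_IPV4_HDR()/RTE_PTYPE_L4_TCP tests are mask based rather than single bits.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the mbuf ptype bits and the GRO type flag. */
#define PT_L3_IPV4   0x1u
#define PT_L4_TCP    0x2u
#define GRO_TCP_IPV4 (1ULL << 0)

struct pkt {
	uint32_t packet_type;
};

/*
 * Count packets eligible for TCP/IPv4 GRO. The desired_gro_types test does
 * not depend on the packet, so it is done once, before the loop; the two
 * per-packet tests are joined with a single &&.
 */
static uint16_t
count_tcp4_candidates(const struct pkt *pkts, uint16_t nb_pkts,
		uint64_t desired_gro_types)
{
	uint16_t i, n = 0;

	if ((desired_gro_types & GRO_TCP_IPV4) == 0)
		return 0;

	for (i = 0; i < nb_pkts; i++) {
		if ((pkts[i].packet_type & PT_L3_IPV4) &&
				(pkts[i].packet_type & PT_L4_TCP))
			n++;
	}
	return n;
}
```

With that shape the three identical "unprocessed" branches in the loop should collapse into one else.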

> +				ret =3D gro_tcp4_reassemble(pkts[i],
> +						&tcp_tbl,
> +						param.max_packet_size);
> +				/* merge successfully */
> +				if (ret > 0)
> +					nb_after_gro--;
> +				else if (ret < 0)
> +					unprocess_pkts[unprocess_num++] =3D
> +						pkts[i];
> +			} else
> +				unprocess_pkts[unprocess_num++] =3D
> +					pkts[i];
> +		} else
> +			unprocess_pkts[unprocess_num++] =3D
> +				pkts[i];
> +	}
> +
> +	/* re-arrange GROed packets */
> +	if (nb_after_gro < nb_pkts) {
> +		if (param.desired_gro_types & GRO_TCP_IPV4)
> +			i =3D gro_tcp_tbl_flush(&tcp_tbl, pkts, nb_pkts);
> +		if (unprocess_num > 0) {
> +			memcpy(&pkts[i], unprocess_pkts,
> +					sizeof(struct rte_mbuf *) *
> +					unprocess_num);
> +			i +=3D unprocess_num;
> +		}
> +		if (nb_pkts > i)
> +			memset(&pkts[i], 0,
> +					sizeof(struct rte_mbuf *) *
> +					(nb_pkts - i));
> +	}

Why do you need to zero the remaining pkts[]?

> +	return nb_after_gro;
>  }
>=20
> -int rte_gro_reassemble(struct rte_mbuf *pkt __rte_unused,
> -		struct rte_gro_tbl *gro_tbl __rte_unused)
> +int rte_gro_reassemble(struct rte_mbuf *pkt,
> +		struct rte_gro_tbl *gro_tbl)
>  {
> +	if (unlikely(pkt =3D=3D NULL))
> +		return -1;
> +
> +	if (RTE_ETH_IS_IPV4_HDR(pkt->packet_type)) {
> +		if ((pkt->packet_type & RTE_PTYPE_L4_TCP) &&
> +				(gro_tbl->desired_gro_types &
> +				 GRO_TCP_IPV4))
> +			return gro_tcp4_reassemble(pkt,
> +					gro_tbl->tbls[GRO_TCP_IPV4_INDEX],
> +					gro_tbl->max_packet_size);
> +	}
> +
>  	return -1;
>  }
>=20
> -uint16_t rte_gro_flush(struct rte_gro_tbl *gro_tbl __rte_unused,
> -		uint64_t desired_gro_types __rte_unused,
> -		struct rte_mbuf **out __rte_unused,
> -		const uint16_t max_nb_out __rte_unused)
> +uint16_t rte_gro_flush(struct rte_gro_tbl *gro_tbl,
> +		uint64_t desired_gro_types,
> +		struct rte_mbuf **out,
> +		const uint16_t max_nb_out)
>  {
> +	desired_gro_types =3D desired_gro_types &
> +		gro_tbl->desired_gro_types;
> +	if (desired_gro_types & GRO_TCP_IPV4)
> +		return gro_tcp_tbl_flush(
> +				gro_tbl->tbls[GRO_TCP_IPV4_INDEX],
> +				out,
> +				max_nb_out);
>  	return 0;
>  }
>=20
>  uint16_t
> -rte_gro_timeout_flush(struct rte_gro_tbl *gro_tbl __rte_unused,
> -		uint64_t desired_gro_types __rte_unused,
> -		struct rte_mbuf **out __rte_unused,
> -		const uint16_t max_nb_out __rte_unused)
> +rte_gro_timeout_flush(struct rte_gro_tbl *gro_tbl,
> +		uint64_t desired_gro_types,
> +		struct rte_mbuf **out,
> +		const uint16_t max_nb_out)
>  {
> +	desired_gro_types =3D desired_gro_types &
> +		gro_tbl->desired_gro_types;
> +	if (desired_gro_types & GRO_TCP_IPV4)
> +		return gro_tcp_tbl_timeout_flush(
> +				gro_tbl->tbls[GRO_TCP_IPV4_INDEX],
> +				gro_tbl->max_timeout_cycles,
> +				out, max_nb_out);
>  	return 0;
>  }
> diff --git a/lib/librte_gro/rte_gro.h b/lib/librte_gro/rte_gro.h
> index f9d36e8..a30b1c6 100644
> --- a/lib/librte_gro/rte_gro.h
> +++ b/lib/librte_gro/rte_gro.h
> @@ -45,7 +45,11 @@ extern "C" {
>=20
>  /* max number of supported GRO types */
>  #define GRO_TYPE_MAX_NUM 64
> -#define GRO_TYPE_SUPPORT_NUM 0	/**< current supported GRO num */
> +#define GRO_TYPE_SUPPORT_NUM 1	/**< current supported GRO num */
> +
> +/* TCP/IPv4 GRO flag */
> +#define GRO_TCP_IPV4_INDEX 0
> +#define GRO_TCP_IPV4 (1ULL << GRO_TCP_IPV4_INDEX)
>=20
>  /**
>   * GRO table, which is used to merge packets. It keeps many reassembly
> diff --git a/lib/librte_gro/rte_gro_tcp.c b/lib/librte_gro/rte_gro_tcp.c
> new file mode 100644
> index 0000000..c0eaa45
> --- /dev/null
> +++ b/lib/librte_gro/rte_gro_tcp.c
> @@ -0,0 +1,394 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2017 Intel Corporation. All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <rte_malloc.h>
> +#include <rte_mbuf.h>
> +#include <rte_cycles.h>
> +
> +#include <rte_ethdev.h>
> +#include <rte_ip.h>
> +#include <rte_tcp.h>
> +
> +#include "rte_gro_tcp.h"
> +
> +void *gro_tcp_tbl_create(uint16_t socket_id,
> +		uint16_t max_flow_num,
> +		uint16_t max_item_per_flow)
> +{
> +	size_t size;
> +	uint32_t entries_num;
> +	struct gro_tcp_tbl *tbl;
> +
> +	entries_num =3D max_flow_num * max_item_per_flow;
> +	entries_num =3D entries_num > GRO_TCP_TBL_MAX_ITEM_NUM ?
> +		GRO_TCP_TBL_MAX_ITEM_NUM : entries_num;
> +
> +	if (entries_num =3D=3D 0)
> +		return NULL;
> +
> +	tbl =3D (struct gro_tcp_tbl *)rte_zmalloc_socket(
> +			__func__,
> +			sizeof(struct gro_tcp_tbl),
> +			RTE_CACHE_LINE_SIZE,
> +			socket_id);

Here and everywhere - rte_malloc() can fail.
Add proper error handling.
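
A minimal sketch of the kind of error handling I mean. It uses calloc()/free() as stand-ins for rte_zmalloc_socket()/rte_free() so that it builds without DPDK, and struct tbl, tbl_create() and tbl_destroy() are simplified, hypothetical versions of the ones in the patch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-ins for the table types in the patch. */
struct item { void *pkt; };
struct key { uint32_t start_index; };
struct tbl {
	struct item *items;
	struct key *keys;
	uint32_t max_item_num;
	uint32_t max_key_num;
};

/* Create a table, checking every allocation before it is used. */
static struct tbl *
tbl_create(uint32_t entries_num)
{
	struct tbl *t;

	if (entries_num == 0)
		return NULL;

	t = calloc(1, sizeof(*t));
	if (t == NULL)
		return NULL;

	t->items = calloc(entries_num, sizeof(*t->items));
	t->keys = calloc(entries_num, sizeof(*t->keys));
	if (t->items == NULL || t->keys == NULL) {
		/* free(NULL) is a no-op, so no per-pointer checks are needed */
		free(t->items);
		free(t->keys);
		free(t);
		return NULL;
	}
	t->max_item_num = entries_num;
	t->max_key_num = entries_num;
	return t;
}

static void
tbl_destroy(struct tbl *t)
{
	if (t == NULL)
		return;
	free(t->items);
	free(t->keys);
	free(t);
}
```

The caller then only has to check the returned pointer against NULL, which is what the doxygen comment for gro_tcp_tbl_create() already promises.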

> +
> +	size =3D sizeof(struct gro_tcp_item) * entries_num;
> +	tbl->items =3D (struct gro_tcp_item *)rte_zmalloc_socket(
> +			__func__,
> +			size,
> +			RTE_CACHE_LINE_SIZE,
> +			socket_id);
> +	tbl->max_item_num =3D entries_num;
> +
> +	size =3D sizeof(struct gro_tcp_key) * entries_num;
> +	tbl->keys =3D (struct gro_tcp_key *)rte_zmalloc_socket(
> +			__func__,
> +			size, RTE_CACHE_LINE_SIZE,
> +			socket_id);
> +	tbl->max_key_num =3D entries_num;
> +	return tbl;
> +}
> +
> +void gro_tcp_tbl_destroy(void *tbl)
> +{
> +	struct gro_tcp_tbl *tcp_tbl =3D (struct gro_tcp_tbl *)tbl;
> +
> +	if (tcp_tbl) {
> +		if (tcp_tbl->items)

No need for this check: rte_free(NULL) is valid.
Same below.

> +			rte_free(tcp_tbl->items);
> +		if (tcp_tbl->keys)
> +			rte_free(tcp_tbl->keys);
> +		rte_free(tcp_tbl);
> +	}
> +}
> +
> +/**
> + * merge two TCP/IPv4 packets without update checksums.
> + */
> +static int
> +merge_two_tcp4_packets(struct rte_mbuf *pkt_src,
> +		struct rte_mbuf *pkt,
> +		uint32_t max_packet_size)
> +{
> +	struct ipv4_hdr *ipv4_hdr1, *ipv4_hdr2;
> +	struct tcp_hdr *tcp_hdr1;
> +	uint16_t ipv4_ihl1, tcp_hl1, tcp_dl1;
> +	struct rte_mbuf *tail;
> +
> +	/* parse the given packet */
> +	ipv4_hdr1 =3D (struct ipv4_hdr *)(rte_pktmbuf_mtod(pkt,
> +				struct ether_hdr *) + 1);

You probably shouldn't assume that l2_len is always 14B long.
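
For illustration, a sketch of locating the headers through the lengths stored in the mbuf instead of through sizeof(struct ether_hdr). struct mini_mbuf, l3_hdr() and l4_hdr() are hypothetical stand-ins; with a real mbuf I would expect something like rte_pktmbuf_mtod_offset(pkt, struct ipv4_hdr *, pkt->l2_len), assuming the caller has filled l2_len/l3_len.

```c
#include <stdint.h>

/* Hypothetical, cut-down view of the mbuf fields used here. */
struct mini_mbuf {
	uint8_t *data;   /* start of the L2 header */
	uint16_t l2_len; /* set by the caller: 14, or 18 with one VLAN tag */
	uint16_t l3_len; /* set by the caller: IPv4 header length */
};

/* Start of the L3 header, wherever the L2 header happens to end. */
static inline uint8_t *
l3_hdr(const struct mini_mbuf *m)
{
	return m->data + m->l2_len;
}

/* Start of the L4 header. */
static inline uint8_t *
l4_hdr(const struct mini_mbuf *m)
{
	return m->data + m->l2_len + m->l3_len;
}
```

The same offsets would then be used for rte_pktmbuf_adj() when stripping the headers of the incoming packet.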

> +	ipv4_ihl1 =3D IPv4_HDR_LEN(ipv4_hdr1);
> +	tcp_hdr1 =3D (struct tcp_hdr *)((char *)ipv4_hdr1 + ipv4_ihl1);
> +	tcp_hl1 =3D TCP_HDR_LEN(tcp_hdr1);
> +	tcp_dl1 =3D rte_be_to_cpu_16(ipv4_hdr1->total_length) - ipv4_ihl1
> +		- tcp_hl1;
> +
> +	/* parse the original packet */
> +	ipv4_hdr2 =3D (struct ipv4_hdr *)(rte_pktmbuf_mtod(pkt_src,
> +				struct ether_hdr *) + 1);
> +
> +	if (pkt_src->pkt_len + tcp_dl1 > max_packet_size)
> +		return -1;
> +
> +	/* remove the header of the incoming packet */
> +	rte_pktmbuf_adj(pkt, sizeof(struct ether_hdr) +
> +			ipv4_ihl1 + tcp_hl1);
> +
> +	/* chain the two packet together */
> +	tail =3D rte_pktmbuf_lastseg(pkt_src);
> +	tail->next =3D pkt;

What I see as a problem here:
You have to reparse your packet and do lastseg for it for each new segment.
That seems like a big overhead.
It would be better to parse the packet once and then store that information
inside the mbuf: l2_len/l3_len/l4_len, etc.
You can probably even avoid parsing inside your library by making it a
prerequisite for the caller to fill these fields properly.

Similar thought about lastseg: it would be good to store it somewhere inside
your table.

> +
> +	/* update IP header */
> +	ipv4_hdr2->total_length =3D rte_cpu_to_be_16(
> +			rte_be_to_cpu_16(
> +				ipv4_hdr2->total_length)
> +			+ tcp_dl1);
> +
> +	/* update mbuf metadata for the merged packet */
> +	pkt_src->nb_segs++;

Why do you assume that the incoming packet always contains only one segment?
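
To make the last two comments concrete, here is a rough sketch of chaining with a cached tail pointer and a multi-segment incoming packet. struct seg, struct item, last_seg() and item_append() are hypothetical, cut-down stand-ins, not the real rte_mbuf or the table item from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down mbuf: only the fields needed for chaining. */
struct seg {
	struct seg *next;
	uint32_t pkt_len; /* meaningful in the first segment only */
	uint16_t nb_segs; /* meaningful in the first segment only */
};

/* Table item that remembers the tail, so no full walk is needed per merge. */
struct item {
	struct seg *first;
	struct seg *last;
};

static struct seg *
last_seg(struct seg *s)
{
	while (s->next != NULL)
		s = s->next;
	return s;
}

/* Append pkt (possibly multi-segment, headers already stripped) to it. */
static void
item_append(struct item *it, struct seg *pkt)
{
	it->last->next = pkt;
	it->last = last_seg(pkt);           /* walk only the new packet */
	it->first->nb_segs += pkt->nb_segs; /* not nb_segs++ */
	it->first->pkt_len += pkt->pkt_len;
}
```

Only the newly added packet is walked, so the cost per merge no longer grows with the length of the chain already stored.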

> +	pkt_src->pkt_len +=3D pkt->pkt_len;
> +	return 1;
> +}
> +
> +static int
> +check_seq_option(struct rte_mbuf *pkt,
> +		struct tcp_hdr *tcp_hdr,
> +		uint16_t tcp_hl)
> +{
> +	struct ipv4_hdr *ipv4_hdr1;
> +	struct tcp_hdr *tcp_hdr1;
> +	uint16_t ipv4_ihl1, tcp_hl1, tcp_dl1;
> +	uint32_t sent_seq1, sent_seq;
> +	int ret =3D -1;
> +
> +	ipv4_hdr1 =3D (struct ipv4_hdr *)(rte_pktmbuf_mtod(pkt,
> +				struct ether_hdr *) + 1);
> +	ipv4_ihl1 =3D IPv4_HDR_LEN(ipv4_hdr1);
> +	tcp_hdr1 =3D (struct tcp_hdr *)((char *)ipv4_hdr1 + ipv4_ihl1);
> +	tcp_hl1 =3D TCP_HDR_LEN(tcp_hdr1);
> +	tcp_dl1 =3D rte_be_to_cpu_16(ipv4_hdr1->total_length) - ipv4_ihl1
> +		- tcp_hl1;
> +	sent_seq1 =3D rte_be_to_cpu_32(tcp_hdr1->sent_seq) + tcp_dl1;
> +	sent_seq =3D rte_be_to_cpu_32(tcp_hdr->sent_seq);
> +
> +	/* check if the two packets are neighbor */
> +	if ((sent_seq ^ sent_seq1) =3D=3D 0) {

Why not just if (sent_seq == sent_seq1)?

> +		/* check if TCP option field equals */
> +		if (tcp_hl1 > sizeof(struct tcp_hdr)) {

And what if tcp_hl1 == sizeof(struct tcp_hdr), but tcp_hl > tcp_hl1?
I think you need to remove that check.
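
Roughly what I would expect the check to look like, as a sketch only. is_neighbor() and TCP_BASE_HDR_LEN are hypothetical names and the headers are passed as raw bytes so that this builds without DPDK. Note also that the quoted condition returns 1 when tcp_hl1 != tcp_hl, which looks unintended to me; the sketch treats different header lengths as "don't merge".

```c
#include <stdint.h>
#include <string.h>

#define TCP_BASE_HDR_LEN 20 /* TCP header without options, in bytes */

/*
 * Return 1 if a segment starting at seq, with header hdr/hdr_len, directly
 * follows a stored segment that ends at prev_end_seq and has header
 * prev_hdr/prev_hdr_len; otherwise return 0.
 */
static int
is_neighbor(uint32_t prev_end_seq, const uint8_t *prev_hdr,
		uint16_t prev_hdr_len, uint32_t seq, const uint8_t *hdr,
		uint16_t hdr_len)
{
	if (seq != prev_end_seq)
		return 0;
	/* Header lengths must match before the option bytes are compared. */
	if (hdr_len != prev_hdr_len || hdr_len < TCP_BASE_HDR_LEN)
		return 0;
	/* With no options the compared length is 0 and memcmp() returns 0. */
	return memcmp(prev_hdr + TCP_BASE_HDR_LEN, hdr + TCP_BASE_HDR_LEN,
			hdr_len - TCP_BASE_HDR_LEN) == 0;
}
```

No special case for "header has no options" is needed, because comparing zero option bytes always succeeds.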
 =20
> +			if ((tcp_hl1 !=3D tcp_hl) ||
> +					(memcmp(tcp_hdr1 + 1,
> +							tcp_hdr + 1,
> +							tcp_hl - sizeof
> +							(struct tcp_hdr))
> +					 =3D=3D 0))
> +				ret =3D 1;
> +		}
> +	}
> +	return ret;
> +}
> +
> +static uint32_t
> +find_an_empty_item(struct gro_tcp_tbl *tbl)
> +{
> +	uint32_t i;
> +
> +	for (i =3D 0; i < tbl->max_item_num; i++)
> +		if (tbl->items[i].is_valid =3D=3D 0)
> +			return i;
> +	return INVALID_ARRAY_INDEX;
> +}
> +
> +static uint32_t
> +find_an_empty_key(struct gro_tcp_tbl *tbl)
> +{
> +	uint32_t i;
> +
> +	for (i =3D 0; i < tbl->max_key_num; i++)
> +		if (tbl->keys[i].is_valid =3D=3D 0)
> +			return i;
> +	return INVALID_ARRAY_INDEX;
> +}
> +
> +int32_t
> +gro_tcp4_reassemble(struct rte_mbuf *pkt,
> +		struct gro_tcp_tbl *tbl,
> +		uint32_t max_packet_size)
> +{
> +	struct ether_hdr *eth_hdr;
> +	struct ipv4_hdr *ipv4_hdr;
> +	struct tcp_hdr *tcp_hdr;
> +	uint16_t ipv4_ihl, tcp_hl, tcp_dl;
> +
> +	struct tcp_key key;
> +	uint32_t cur_idx, prev_idx, item_idx;
> +	uint32_t i, key_idx;
> +
> +	eth_hdr =3D rte_pktmbuf_mtod(pkt, struct ether_hdr *);
> +	ipv4_hdr =3D (struct ipv4_hdr *)(eth_hdr + 1);
> +	ipv4_ihl =3D IPv4_HDR_LEN(ipv4_hdr);
> +
> +	/* check if the packet should be processed */
> +	if (ipv4_ihl < sizeof(struct ipv4_hdr))
> +		goto fail;
> +	tcp_hdr =3D (struct tcp_hdr *)((char *)ipv4_hdr + ipv4_ihl);
> +	tcp_hl =3D TCP_HDR_LEN(tcp_hdr);
> +	tcp_dl =3D rte_be_to_cpu_16(ipv4_hdr->total_length) - ipv4_ihl
> +		- tcp_hl;
> +	if (tcp_dl =3D=3D 0)
> +		goto fail;
> +
> +	/* find a key and traverse all packets in its item group */
> +	key.eth_saddr =3D eth_hdr->s_addr;
> +	key.eth_daddr =3D eth_hdr->d_addr;
> +	key.ip_src_addr[0] =3D rte_be_to_cpu_32(ipv4_hdr->src_addr);
> +	key.ip_dst_addr[0] =3D rte_be_to_cpu_32(ipv4_hdr->dst_addr);

Your key.ip_src_addr[1-3] still contains some junk.
How is the memcmp below supposed to work properly?
BTW, why do you need 4 elements here; why not just uint32_t ip_src_addr;?
Same for ip_dst_addr.
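
A sketch of what I mean, with a hypothetical IPv4-only key (struct tcp4_key and tcp4_key_init() are made-up names, and the Ethernet addresses are left out to keep it short). The point is the memset() before the fields are filled, which in practice also clears the padding bytes that memcmp() would otherwise compare.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical IPv4-only key; field names follow the patch loosely. */
struct tcp4_key {
	uint32_t ip_src_addr;
	uint32_t ip_dst_addr;
	uint32_t recv_ack;
	uint16_t src_port;
	uint16_t dst_port;
	uint8_t tcp_flags; /* typically followed by 3 bytes of padding */
};

/*
 * Build a key that can be compared with memcmp(): every byte is cleared
 * first, then the fields are set. recv_ack and tcp_flags stay 0 here.
 */
static void
tcp4_key_init(struct tcp4_key *k, uint32_t src, uint32_t dst,
		uint16_t sport, uint16_t dport)
{
	memset(k, 0, sizeof(*k));
	k->ip_src_addr = src;
	k->ip_dst_addr = dst;
	k->src_port = sport;
	k->dst_port = dport;
}
```

An alternative is to compare the key field by field and not rely on memcmp() over the whole structure at all.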

> +	key.src_port =3D rte_be_to_cpu_16(tcp_hdr->src_port);
> +	key.dst_port =3D rte_be_to_cpu_16(tcp_hdr->dst_port);
> +	key.recv_ack =3D rte_be_to_cpu_32(tcp_hdr->recv_ack);
> +	key.tcp_flags =3D tcp_hdr->tcp_flags;
> +
> +	for (i =3D 0; i < tbl->max_key_num; i++) {
> +		if (tbl->keys[i].is_valid &&
> +				(memcmp(&(tbl->keys[i].key), &key,
> +						sizeof(struct tcp_key))
> +				 =3D=3D 0)) {
> +			cur_idx =3D tbl->keys[i].start_index;
> +			prev_idx =3D cur_idx;
> +			while (cur_idx !=3D INVALID_ARRAY_INDEX) {
> +				if (check_seq_option(tbl->items[cur_idx].pkt,
> +							tcp_hdr,
> +							tcp_hl) > 0) {

As I remember, Linux GRO also checks the IPv4 packet_id: it should be consecutive.
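
If you add such a check, it could be as small as the sketch below. ip_id_is_consecutive() is a hypothetical helper, and both IDs are assumed to be already converted to host byte order.

```c
#include <stdint.h>

/*
 * Return 1 if cur_id directly follows prev_id. The cast back to uint16_t
 * makes 0xffff wrap around to 0.
 */
static int
ip_id_is_consecutive(uint16_t prev_id, uint16_t cur_id)
{
	return (uint16_t)(prev_id + 1) == cur_id;
}
```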

> +					if (merge_two_tcp4_packets(
> +								tbl->items[cur_idx].pkt,
> +								pkt,
> +								max_packet_size) > 0) {
> +						/* successfully merge two packets */
> +						tbl->items[cur_idx].is_groed =3D 1;
> +						return 1;
> +					}

If you allow more than one packet per flow to be stored in the table, then you
should be prepared for a new segment filling a gap between 2 packets.
Probably the easiest thing is not to allow more than one 'item' per flow.

> +					/**
> +					 * fail to merge two packets since
> +					 * it's beyond the max packet length.
> +					 * Insert it into the item group.
> +					 */
> +					goto insert_to_item_group;
> +				} else {
> +					prev_idx =3D cur_idx;
> +					cur_idx =3D tbl->items[cur_idx].next_pkt_idx;
> +				}
> +			}
> +			/**
> +			 * find a corresponding item group but fails to find
> +			 * one packet to merge. Insert it into this item group.
> +			 */
> +insert_to_item_group:
> +			item_idx =3D find_an_empty_item(tbl);
> +			/* the item number is greater than the max value */
> +			if (item_idx =3D=3D INVALID_ARRAY_INDEX)
> +				return -1;
> +			tbl->items[prev_idx].next_pkt_idx =3D item_idx;
> +			tbl->items[item_idx].pkt =3D pkt;
> +			tbl->items[item_idx].is_groed =3D 0;
> +			tbl->items[item_idx].next_pkt_idx =3D INVALID_ARRAY_INDEX;
> +			tbl->items[item_idx].is_valid =3D 1;
> +			tbl->items[item_idx].start_time =3D rte_rdtsc();
> +			tbl->item_num++;
> +			return 0;
> +		}
> +	}
> +
> +	/**
> +	 * merge fail as the given packet has a new key.
> +	 * So insert a new key.
> +	 */
> +	item_idx =3D find_an_empty_item(tbl);
> +	key_idx =3D find_an_empty_key(tbl);
> +	/**
> +	 * if current key or item number is greater than the max
> +	 * value, don't insert the packet into the table and return
> +	 * immediately.
> +	 */
> +	if (item_idx =3D=3D INVALID_ARRAY_INDEX ||
> +			key_idx =3D=3D INVALID_ARRAY_INDEX)
> +		return -1;
> +	tbl->items[item_idx].pkt =3D pkt;
> +	tbl->items[item_idx].next_pkt_idx =3D INVALID_ARRAY_INDEX;
> +	tbl->items[item_idx].is_groed =3D 0;
> +	tbl->items[item_idx].is_valid =3D 1;
> +	tbl->items[item_idx].start_time =3D rte_rdtsc();

You can pass start-time as a parameter instead.

> +	tbl->item_num++;
> +
> +	memcpy(&(tbl->keys[key_idx].key),
> +			&key, sizeof(struct tcp_key));
> +	tbl->keys[key_idx].start_index =3D item_idx;
> +	tbl->keys[key_idx].is_valid =3D 1;
> +	tbl->key_num++;
> +
> +	return 0;
> +fail:

Please try to avoid goto whenever possible.
Looks really ugly.

> +	return -1;
> +}
> +
> +uint16_t gro_tcp_tbl_flush(struct gro_tcp_tbl *tbl,
> +		struct rte_mbuf **out,
> +		const uint16_t nb_out)
> +{
> +	uint32_t i, num =3D 0;
> +
> +	if (nb_out < tbl->item_num)
> +		return 0;

And how would the user know how many items are currently in the table?
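
Two options, sketched below with hypothetical names (tbl_item_count(), tbl_flush(), and a cut-down struct tbl without the key array): either expose the current item count, or flush at most nb_out packets and leave the rest in the table. A real partial flush would also have to keep the keys and start_index values consistent, which this sketch ignores.

```c
#include <stddef.h>
#include <stdint.h>

struct item { void *pkt; }; /* pkt == NULL marks a free slot */
struct tbl {
	struct item *items;
	uint32_t item_num;
	uint32_t max_item_num;
};

/* Lets the caller size its output array before flushing. */
static uint32_t
tbl_item_count(const struct tbl *t)
{
	return t->item_num;
}

/* Flush at most nb_out packets; anything left stays in the table. */
static uint16_t
tbl_flush(struct tbl *t, void **out, uint16_t nb_out)
{
	uint32_t i;
	uint16_t n = 0;

	for (i = 0; i < t->max_item_num && n < nb_out; i++) {
		if (t->items[i].pkt != NULL) {
			out[n++] = t->items[i].pkt;
			t->items[i].pkt = NULL;
			t->item_num--;
		}
	}
	return n;
}
```

Either way, returning 0 for "array too small" is hard to tell apart from "table is empty".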

> +
> +	for (i =3D 0; i < tbl->max_item_num; i++) {
> +		if (tbl->items[i].is_valid) {
> +			out[num++] =3D tbl->items[i].pkt;
> +			tbl->items[i].is_valid =3D 0;
> +			tbl->item_num--;
> +		}
> +	}
> +	memset(tbl->keys, 0, sizeof(struct gro_tcp_key) *
> +			tbl->max_key_num);
> +	tbl->key_num =3D 0;
> +
> +	return num;
> +}
> +
> +uint16_t
> +gro_tcp_tbl_timeout_flush(struct gro_tcp_tbl *tbl,
> +		uint64_t timeout_cycles,
> +		struct rte_mbuf **out,
> +		const uint16_t nb_out)
> +{
> +	uint16_t k;
> +	uint32_t i, j;
> +	uint64_t current_time;
> +
> +	if (nb_out =3D=3D 0)
> +		return 0;
> +	k =3D 0;
> +	current_time =3D rte_rdtsc();
> +
> +	for (i =3D 0; i < tbl->max_key_num; i++) {
> +		if (tbl->keys[i].is_valid) {

Seems pretty expensive to traverse the whole table...
Would it be worth having some sort of LRU list?

> +			j =3D tbl->keys[i].start_index;
> +			while (j !=3D INVALID_ARRAY_INDEX) {
> +				if (current_time - tbl->items[j].start_time >=3D
> +						timeout_cycles) {
> +					out[k++] =3D tbl->items[j].pkt;
> +					tbl->items[j].is_valid =3D 0;
> +					tbl->item_num--;
> +					j =3D tbl->items[j].next_pkt_idx;
> +
> +					if (k =3D=3D nb_out &&
> +							j =3D=3D INVALID_ARRAY_INDEX) {
> +						/* delete the key */
> +						tbl->keys[i].is_valid =3D 0;
> +						tbl->key_num--;
> +						goto end;

Please rearrange the code to avoid gotos.
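
As an illustration, the same walk can be written with the exit conditions in the loop headers. This is a sketch on cut-down, hypothetical types (struct item/key/tbl, INVALID_IDX, timeout_flush()); it takes the current time as a parameter and assumes each chain is ordered oldest first, so it stops at the first item that has not timed out. As far as I can see, in the quoted loop j only advances when the item has timed out, so a non-expired item would make it spin.

```c
#include <stddef.h>
#include <stdint.h>

#define INVALID_IDX UINT32_MAX

struct item {
	void *pkt;
	uint64_t start_time;
	uint32_t next_idx;
};
struct key {
	uint32_t start_index;
	uint8_t is_valid;
};
struct tbl {
	struct item *items;
	struct key *keys;
	uint32_t max_key_num;
};

/*
 * Flush expired packets from the head of each flow's chain, stopping when
 * out[] is full. The loop conditions carry the exit logic, so no goto.
 */
static uint16_t
timeout_flush(struct tbl *t, uint64_t now, uint64_t timeout,
		void **out, uint16_t nb_out)
{
	uint16_t k = 0;
	uint32_t i, j;

	for (i = 0; i < t->max_key_num && k < nb_out; i++) {
		if (!t->keys[i].is_valid)
			continue;
		j = t->keys[i].start_index;
		while (j != INVALID_IDX && k < nb_out &&
				now - t->items[j].start_time >= timeout) {
			out[k++] = t->items[j].pkt;
			t->items[j].pkt = NULL;
			j = t->items[j].next_idx;
		}
		if (j == INVALID_IDX)
			t->keys[i].is_valid = 0;    /* whole flow flushed */
		else
			t->keys[i].start_index = j; /* new head of the chain */
	}
	return k;
}
```

The item_num/key_num bookkeeping is left out here; it would go next to the two places where an item or a key is released.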

> +					} else if (k =3D=3D nb_out &&
> +							j !=3D INVALID_ARRAY_INDEX) {
> +						/* update the first item index */
> +						tbl->keys[i].start_index =3D j;
> +						goto end;
> +					}
> +				}
> +			}
> +			/* delete the key, as all of its packets are flushed */
> +			tbl->keys[i].is_valid =3D 0;
> +			tbl->key_num--;
> +		}
> +		if (tbl->key_num =3D=3D 0)
> +			goto end;
> +	}
> +end:
> +	return k;
> +}
> diff --git a/lib/librte_gro/rte_gro_tcp.h b/lib/librte_gro/rte_gro_tcp.h
> new file mode 100644
> index 0000000..a9a7aca
> --- /dev/null
> +++ b/lib/librte_gro/rte_gro_tcp.h
> @@ -0,0 +1,191 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2017 Intel Corporation. All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_GRO_TCP_H_
> +#define _RTE_GRO_TCP_H_
> +
> +#if RTE_BYTE_ORDER =3D=3D RTE_LITTLE_ENDIAN
> +#define TCP_HDR_LEN(tcph) \
> +	((tcph->data_off >> 4) * 4)
> +#define IPv4_HDR_LEN(iph) \
> +	((iph->version_ihl & 0x0f) * 4)
> +#else
> +#define TCP_DATAOFF_MASK 0x0f
> +#define TCP_HDR_LEN(tcph) \
> +	((tcph->data_off & TCP_DATAOFF_MASK) * 4)
> +#define IPv4_HDR_LEN(iph) \
> +	((iph->version_ihl >> 4) * 4)
> +#endif
> +
> +#define INVALID_ARRAY_INDEX 0xffffffffUL
> +#define GRO_TCP_TBL_MAX_ITEM_NUM (UINT32_MAX - 1)
> +
> +/* criteria of mergeing packets */
> +struct tcp_key {
> +	struct ether_addr eth_saddr;
> +	struct ether_addr eth_daddr;
> +	uint32_t ip_src_addr[4];	/**< IPv4 uses the first 8B */
> +	uint32_t ip_dst_addr[4];
> +
> +	uint32_t recv_ack;	/**< acknowledgment sequence number. */
> +	uint16_t src_port;
> +	uint16_t dst_port;
> +	uint8_t tcp_flags;	/**< TCP flags. */
> +};
> +
> +struct gro_tcp_key {
> +	struct tcp_key key;
> +	uint32_t start_index;	/**< the first packet index of the flow */
> +	uint8_t is_valid;
> +};
> +
> +struct gro_tcp_item {
> +	struct rte_mbuf *pkt;	/**< packet address. */
> +	/* the time when the packet in added into the table */
> +	uint64_t start_time;
> +	uint32_t next_pkt_idx;	/**< next packet index. */
> +	/* flag to indicate if the packet is GROed */
> +	uint8_t is_groed;
> +	uint8_t is_valid;	/**< flag indicates if the item is valid */

Why do you need these 2 flags at all?
Why not just reset, let's say, pkt to NULL for an invalid item?

> +};
> +
> +/**
> + * TCP reassembly table. Both TCP/IPv4 and TCP/IPv6 use the same table
> + * structure.
> + */
> +struct gro_tcp_tbl {
> +	struct gro_tcp_item *items;	/**< item array */
> +	struct gro_tcp_key *keys;	/**< key array */
> +	uint32_t item_num;	/**< current item number */
> +	uint32_t key_num;	/**< current key num */
> +	uint32_t max_item_num;	/**< item array size */
> +	uint32_t max_key_num;	/**< key array size */
> +};
> +
> +/**
> + * This function creates a TCP reassembly table.
> + *
> + * @param socket_id
> + *  socket index where the Ethernet port connects to.
> + * @param max_flow_num
> + *  the maximum number of flows in the TCP GRO table
> + * @param max_item_per_flow
> + *  the maximum packet number per flow.
> + * @return
> + *  if create successfully, return a pointer which points to the
> + *  created TCP GRO table. Otherwise, return NULL.
> + */
> +void *gro_tcp_tbl_create(uint16_t socket_id,
> +		uint16_t max_flow_num,
> +		uint16_t max_item_per_flow);
> +
> +/**
> + * This function destroys a TCP reassembly table.
> + * @param tbl
> + *  a pointer points to the TCP reassembly table.
> + */
> +void gro_tcp_tbl_destroy(void *tbl);
> +
> +/**
> + * This function searches for a packet in the TCP reassembly table to
> + * merge with the inputted one. To merge two packets is to chain them
> + * together and update packet headers. If the packet is without data
> + * (e.g. SYN, SYN-ACK packet), this function returns immediately.
> + * Otherwise, the packet is either merged, or inserted into the table.
> + * Besides, if there is no available space to insert the packet, this
> + * function returns immediately too.
> + *
> + * This function assumes the inputted packet is with correct IPv4 and
> + * TCP checksums. And if two packets are merged, it won't re-calculate
> + * IPv4 and TCP checksums. Besides, if the inputted packet is IP
> + * fragmented, it assumes the packet is complete (with TCP header).
> + *
> + * @param pkt
> + *  packet to reassemble.
> + * @param tbl
> + *  a pointer that points to a TCP reassembly table.
> + * @param max_packet_size
> + *  max packet length after merged
> + * @return
> + *  if the packet doesn't have data, or there is no available space
> + *  in the table to insert a new item or a new key, return a negative
> + *  value. If the packet is merged successfully, return an positive
> + *  value. If the packet is inserted into the table, return 0.
> + */
> +int32_t
> +gro_tcp4_reassemble(struct rte_mbuf *pkt,
> +		struct gro_tcp_tbl *tbl,
> +		uint32_t max_packet_size);
> +
> +/**
> + * This function flushes all packets in a TCP reassembly table to
> + * applications, and without updating checksums for merged packets.
> + * If the array which is used to keep flushed packets is not large
> + * enough, error happens and this function returns immediately.
> + *
> + * @param tbl
> + *  a pointer that points to a TCP GRO table.
> + * @param out
> + *  pointer array which is used to keep flushed packets. Applications
> + *  should guarantee it's large enough to hold all packets in the table.
> + * @param nb_out
> + *  the element number of out.
> + * @return
> + *  the number of flushed packets. If out is not large enough to hold
> + *  all packets in the table, return 0.
> + */
> +uint16_t
> +gro_tcp_tbl_flush(struct gro_tcp_tbl *tbl,
> +		struct rte_mbuf **out,
> +		const uint16_t nb_out);
> +
> +/**
> + * This function flushes timeout packets in a TCP reassembly table to
> + * applications, and without updating checksums for merged packets.
> + *
> + * @param tbl
> + *  a pointer that points to a TCP GRO table.
> + * @param timeout_cycles
> + *  the maximum time that packets can stay in the table.
> + * @param out
> + *  pointer array which is used to keep flushed packets.
> + * @param nb_out
> + *  the element number of out.
> + * @return
> + *  the number of packets that are returned.
> + */
> +uint16_t
> +gro_tcp_tbl_timeout_flush(struct gro_tcp_tbl *tbl,
> +		uint64_t timeout_cycles,
> +		struct rte_mbuf **out,
> +		const uint16_t nb_out);
> +#endif
> --
> 2.7.4