From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <konstantin.ananyev@intel.com>
Received: from mga06.intel.com (mga06.intel.com [134.134.136.31])
 by dpdk.org (Postfix) with ESMTP id 78BC01E2F
 for <dev@dpdk.org>; Tue, 30 May 2017 13:56:32 +0200 (CEST)
Received: from fmsmga004.fm.intel.com ([10.253.24.48])
 by orsmga104.jf.intel.com with ESMTP; 30 May 2017 04:56:30 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.38,418,1491289200"; d="scan'208";a="267988365"
Received: from irsmsx107.ger.corp.intel.com ([163.33.3.99])
 by fmsmga004.fm.intel.com with ESMTP; 30 May 2017 04:56:29 -0700
Received: from irsmsx109.ger.corp.intel.com ([169.254.13.250]) by
 IRSMSX107.ger.corp.intel.com ([169.254.10.129]) with mapi id 14.03.0319.002;
 Tue, 30 May 2017 12:56:28 +0100
From: "Ananyev, Konstantin" <konstantin.ananyev@intel.com>
To: "Hu, Jiayu" <jiayu.hu@intel.com>
CC: "dev@dpdk.org" <dev@dpdk.org>, "Wiles, Keith" <keith.wiles@intel.com>,
 "yuanhan.liu@linux.intel.com" <yuanhan.liu@linux.intel.com>
Thread-Topic: [PATCH v3 1/3] lib: add Generic Receive Offload API framework
Thread-Index: AQHSvNIioTj7qilZNEunAvoVQTYKUKIAL6+AgAGlgACAAN/38IADoyeAgAEVmeCAAD37gIAAjg7QgAAhXoCAADsncIACqjiAgAAzONCAAQz+AIAAe9FQ
Date: Tue, 30 May 2017 11:56:27 +0000
Message-ID: <2601191342CEEE43887BDE71AB9772583FB033F4@IRSMSX109.ger.corp.intel.com>
References: <1491309106-94264-1-git-send-email-jiayu.hu@intel.com>
 <1493021398-115955-1-git-send-email-jiayu.hu@intel.com>
 <1493021398-115955-2-git-send-email-jiayu.hu@intel.com>
 <2601191342CEEE43887BDE71AB9772583FAF8C27@IRSMSX109.ger.corp.intel.com>
 <20170523103154.GA2033@localhost.localdomain>
 <2601191342CEEE43887BDE71AB9772583FAF9BAE@IRSMSX109.ger.corp.intel.com>
 <20170526072613.GA85810@localhost.localdomain>
 <2601191342CEEE43887BDE71AB9772583FAFB22A@IRSMSX109.ger.corp.intel.com>
 <20170527034137.GA22570@localhost.localdomain>
 <2601191342CEEE43887BDE71AB9772583FAFE69E@IRSMSX109.ger.corp.intel.com>
 <20170527140929.GA27976@localhost.localdomain>
 <2601191342CEEE43887BDE71AB9772583FB02195@IRSMSX109.ger.corp.intel.com>
 <ED946F0BEFE0A141B63BABBD629A2A9B3876017C@shsmsx102.ccr.corp.intel.com>
 <2601191342CEEE43887BDE71AB9772583FB02F85@IRSMSX109.ger.corp.intel.com>
 <ED946F0BEFE0A141B63BABBD629A2A9B387605BE@shsmsx102.ccr.corp.intel.com>
In-Reply-To: <ED946F0BEFE0A141B63BABBD629A2A9B387605BE@shsmsx102.ccr.corp.intel.com>
Accept-Language: en-IE, en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
dlp-product: dlpe-windows
dlp-version: 10.0.102.7
dlp-reaction: no-action
x-originating-ip: [163.33.239.180]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: Re: [dpdk-dev] [PATCH v3 1/3] lib: add Generic Receive Offload API
	framework
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Tue, 30 May 2017 11:56:33 -0000

Hi Jiayu,

> -----Original Message-----
> From: Hu, Jiayu
> Sent: Tuesday, May 30, 2017 6:29 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: dev@dpdk.org; Wiles, Keith <keith.wiles@intel.com>; yuanhan.liu@linux.intel.com
> Subject: RE: [PATCH v3 1/3] lib: add Generic Receive Offload API framework
>
> Hi Konstantin,
>
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Monday, May 29, 2017 8:52 PM
> > To: Hu, Jiayu <jiayu.hu@intel.com>
> > Cc: dev@dpdk.org; Wiles, Keith <keith.wiles@intel.com>;
> > yuanhan.liu@linux.intel.com
> > Subject: RE: [PATCH v3 1/3] lib: add Generic Receive Offload API framework
> >
> > Hi Jiayu,
> >
> > > -----Original Message-----
> > > From: Hu, Jiayu
> > > Sent: Monday, May 29, 2017 11:23 AM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > Cc: dev@dpdk.org; Wiles, Keith <keith.wiles@intel.com>;
> > yuanhan.liu@linux.intel.com
> > > Subject: RE: [PATCH v3 1/3] lib: add Generic Receive Offload API
> > framework
> > >
> > > Hi Konstantin,
> > >
> > > > -----Original Message-----
> > > > From: Ananyev, Konstantin
> > > > Sent: Sunday, May 28, 2017 12:51 AM
> > > > To: Hu, Jiayu <jiayu.hu@intel.com>
> > > > Cc: dev@dpdk.org; Wiles, Keith <keith.wiles@intel.com>;
> > > > yuanhan.liu@linux.intel.com
> > > > Subject: RE: [PATCH v3 1/3] lib: add Generic Receive Offload API
> > framework
> > > >
> > > >
> > > > Hi Jiayu,
> > > >
> > > > >
> > > > > Hi Konstantin,
> > > > >
> > > > > On Sat, May 27, 2017 at 07:12:16PM +0800, Ananyev, Konstantin wrote:
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Hu, Jiayu
> > > > > > > Sent: Saturday, May 27, 2017 4:42 AM
> > > > > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > > > > > Cc: dev@dpdk.org; Wiles, Keith <keith.wiles@intel.com>;
> > > > yuanhan.liu@linux.intel.com
> > > > > > > Subject: Re: [PATCH v3 1/3] lib: add Generic Receive Offload API
> > > > framework
> > > > > > >
> > > > > > > On Sat, May 27, 2017 at 07:10:21AM +0800, Ananyev, Konstantin
> > wrote:
> > > > > > > > Hi Jiayu,
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Hu, Jiayu
> > > > > > > > > Sent: Friday, May 26, 2017 8:26 AM
> > > > > > > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>
> > > > > > > > > Cc: dev@dpdk.org; Wiles, Keith <keith.wiles@intel.com>;
> > > > yuanhan.liu@linux.intel.com
> > > > > > > > > Subject: Re: [PATCH v3 1/3] lib: add Generic Receive Offload API
> > > > framework
> > > > > > > > >
> > > > > > > > > Hi Konstantin,
> > > > > > > > >
> > > > > > > > > On Wed, May 24, 2017 at 08:38:25PM +0800, Ananyev,
> > Konstantin
> > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Jiayu,
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Hi Konstantin,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for your comments. My replies/questions are below.
> > > > > > > > > > >
> > > > > > > > > > > BRs,
> > > > > > > > > > > Jiayu
> > > > > > > > > > >
> > > > > > > > > > > On Mon, May 22, 2017 at 05:19:19PM +0800, Ananyev,
> > > > Konstantin wrote:
> > > > > > > > > > > > Hi Jiayu,
> > > > > > > > > > > > My comments/questions below.
> > > > > > > > > > > > Konstantin
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > For applications, DPDK GRO provides three external
> > functions
> > > > to
> > > > > > > > > > > > > enable/disable GRO:
> > > > > > > > > > > > > - rte_gro_init: initialize GRO environment;
> > > > > > > > > > > > > - rte_gro_enable: enable GRO for a given port;
> > > > > > > > > > > > > - rte_gro_disable: disable GRO for a given port.
> > > > > > > > > > > > > Before using GRO, applications should explicitly call
> > > > rte_gro_init to
> > > > > > > > > > > > > initialize GRO environment. After that, applications can
> > call
> > > > > > > > > > > > > rte_gro_enable to enable GRO and call rte_gro_disable
> > to
> > > > disable GRO for
> > > > > > > > > > > > > specific ports.
> > > > > > > > > > > >
> > > > > > > > > > > > I think this is too restrictive and wouldn't meet various
> > users'
> > > > needs.
> > > > > > > > > > > > User might want to:
> > > > > > > > > > > > - enable/disable GRO for particular RX queue
> > > > > > > > > > > > - or even set up different GRO types for different RX queues,
> > > > > > > > > > > >    i.e., GRO over IPV4/TCP for queue 0, and GRO over
> > > > IPV6/TCP for queue 1, etc.
> > > > > > > > > > >
> > > > > > > > > > > The reason for enabling/disabling GRO per-port instead of
> > per-
> > > > queue is that LINUX
> > > > > > > > > > > controls GRO per-port. To control GRO per-queue indeed can
> > > > provide more flexibility
> > > > > > > > > > > to applications. But are there any scenarios that different
> > > > queues of a port may
> > > > > > > > > > > require different GRO control (i.e. GRO types and
> > enable/disable
> > > > GRO)?
> > > > > > > > > >
> > > > > > > > > > I think yes.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > - For various reasons, user might prefer not to use RX
> > callbacks,
> > > > > > > > > > > >   but invoke gro() manually at some point in his code.
> > > > > > > > > > >
> > > > > > > > > > > An application-used GRO library can enable more flexibility to
> > > > applications. Besides,
> > > > > > > > > > > when performing GRO in the ethdev layer or inside PMD drivers, it is
> > an
> > > > issue that
> > > > > > > > > > > rte_eth_rx_burst returns actually received packet number or
> > > > GROed packet number. And
> > > > > > > > > > > the same issue happens in GSO, and even more seriously.
> > This is
> > > > because applications
> > > > > > > > > > > , like VPP, always rely on the return value of rte_eth_tx_burst
> > to
> > > > decide further
> > > > > > > > > > > operations. If applications can directly call GRO/GSO libraries,
> > > > this issue won't exist.
> > > > > > > > > > > And DPDK is a library, which is not a holistic system like
> > LINUX.
> > > > We don't need to do
> > > > > > > > > > > the same as LINUX. Therefore, maybe it's a better idea to
> > > > directly provide SW
> > > > > > > > > > > segmentation/reassembling libraries to applications.
> > > > > > > > > > >
> > > > > > > > > > > > - Many users would like to control size (number of
> > flows/items
> > > > per flow),
> > > > > > > > > > > >   max allowed packet size, max timeout, etc., for different
> > GRO
> > > > tables.
> > > > > > > > > > > > - User would need a way to flush all or only timeout
> > packets
> > > > from particular GRO tables.
> > > > > > > > > > > >
> > > > > > > > > > > > So I think that API needs to be extended to become much
> > more
> > > > fine-grained.
> > > > > > > > > > > > Something like that:
> > > > > > > > > > > >
> > > > > > > > > > > > struct rte_gro_tbl_param {
> > > > > > > > > > > >    int32_t socket_id;
> > > > > > > > > > > >    size_t max_flows;
> > > > > > > > > > > >    size_t max_items_per_flow;
> > > > > > > > > > > >    size_t max_pkt_size;
> > > > > > > > > > > >    uint64_t packet_timeout_cycles;
> > > > > > > > > > > >    <desired GRO types (IPV4_TCP | IPV6_TCP, ...)>
> > > > > > > > > > > >   <probably type specific params>
> > > > > > > > > > > >   ...
> > > > > > > > > > > > };
> > > > > > > > > > > >
> > > > > > > > > > > > struct rte_gro_tbl;
> > > > > > > > > > > > struct rte_gro_tbl *rte_gro_tbl_create(const struct
> > > > rte_gro_tbl_param *param);
> > > > > > > > > > > >
> > > > > > > > > > > > void rte_gro_tbl_destroy(struct rte_gro_tbl *tbl);
> > > > > > > > > > >
> > > > > > > > > > > Yes, I agree with you. It's necessary to provide more fine-
> > > > grained control APIs to
> > > > > > > > > > > applications. But what's 'packet_timeout_cycles' used for? Is
> > it
> > > > for TCP packets?
> > > > > > > > > >
> > > > > > > > > > For any packet that sits in the gro_table for too long.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > /*
> > > > > > > > > > > >  * process packets, might store some packets inside the
> > GRO
> > > > table,
> > > > > > > > > > > >  * returns number of filled entries in pkt[]
> > > > > > > > > > > >  */
> > > > > > > > > > > > uint32_t rte_gro_tbl_process(struct rte_gro_tbl *tbl, struct
> > > > rte_mbuf *pkt[], uint32_t num);
> > > > > > > > > > > >
> > > > > > > > > > > > /*
> > > > > > > > > > > >   * retrieves up to num timed-out packets from the table.
> > > > > > > > > > > >   */
> > > > > > > > > > > > uint32_t rte_gro_tbl_timeout(struct rte_gro_tbl *tbl,
> > uint64_t
> > > > tmt, struct rte_mbuf *pkt[], uint32_t num);
> > > > > > > > > > >
> > > > > > > > > > > Currently, we implement GRO as RX callback, whose
> > processing
> > > > logic is simple:
> > > > > > > > > > > receive burst packets -> perform GRO -> return to application.
> > > > GRO stops after
> > > > > > > > > > > finishing processing received packets. If we provide
> > > > rte_gro_tbl_timeout, when
> > > > > > > > > > > and who will call it?
> > > > > > > > > >
> > > > > > > > > > I mean the following scenario:
> > > > > > > > > > We receive a packet, find it is eligible for GRO and put it into
> > > > gro_table
> > > > > > > > > > in expectation - there would be more packets for the same
> > flow.
> > > > > > > > > > But it could happen that we would never (or for some long
> > time)
> > > > receive
> > > > > > > > > > any new packets for that flow.
> > > > > > > > > > So the first packet would never be delivered to the upper layer,
> > > > > > > > > > or delivered too late.
> > > > > > > > > > So we need a mechanism to extract such not merged packets
> > > > > > > > > > and push them to the upper layer.
> > > > > > > > > > My thought is it would be the application's responsibility to call it from
> > > > time to time.
> > > > > > > > > > That's actually another reason why I think we shouldn't require the
> > > > application
> > > > > > > > > > to always use RX callbacks for GRO.
> > > > > > > > >
> > > > > > > > > Currently, we only provide one reassembly function,
> > > > rte_gro_reassemble_burst,
> > > > > > > > > which merges N inputted packets at a time. After finishing
> > > > processing these
> > > > > > > > > packets, it returns all of them and resets hashing tables.
> > Therefore,
> > > > there
> > > > > > > > > are no packets in hashing tables after rte_gro_reassemble_burst
> > > > returns.
> > > > > > > >
> > > > > > > > Ok, sorry I missed that part with rte_hash_reset().
> > > > > > > > So gro_reassemble_burst() performs only inline processing on
> > current
> > > > input packets
> > > > > > > > and doesn't try to save packets for future merging, correct?
> > > > > > >
> > > > > > > Yes.
> > > > > > >
> > > > > > > > Such an approach indeed is much more lightweight and doesn't require any
> > > > extra timeouts and flush().
> > > > > > > > So in my opinion, let's keep it like that - nice and simple.
> > > > > > > > BTW, I think in that case we don't need any hashtables (or any
> > other
> > > > persistent structures) at all.
> > > > > > > > What we need is just a set of GROs (tcp4, tcp6, etc.) we want to
> > > > perform on given array of packets.
> > > > > > >
> > > > > > > Besides the GRO types that are desired to perform, maybe it also needs
> > > > max_pkt_size and
> > > > > > > some GRO type specific information?
> > > > > >
> > > > > > Yes, but we don't need the actual hash-tables, etc. inside.
> > > > > > Passing something like struct gro_param seems enough.
> > > > >
> > > > > Yes, we can just pass gro_param and allocate hashing tables
> > > > > inside rte_gro_reassemble_burst. If so, hashing tables of
> > > > > desired GRO types are created and freed in each invocation
> > > > > of rte_gro_reassemble_burst. In GRO library, hashing tables
> > > > > are created by GRO type specific gro_tbl_create_fn. These
> > > > > gro_tbl_create_fn may allocate hashing table space via malloc
> > > > > (or rte_malloc). Therefore, we may frequently call malloc/free
> > > > > when using rte_gro_reassemble_burst. In my opinion, it will
> > > > > degrade GRO performance greatly.
> > > >
> > > > I don't understand why we need to put/extract each packet into/from
> > > > a hash table at all.
> > > > We have N input packets that need to be grouped/sorted by some
> > criteria.
> > > > Surely that can be done without any hash-table involved.
> > > > What is the need for a hash table and all the overhead it brings here?
> > >
> > > In current design, I assume all GRO types use hash tables to merge
> > > packets. The key of the hash table is the criteria to merge packets.
> > > So the main difference for different GRO types' hash tables is the
> > > key definition.
> > >
> > > And the reason for using hash tables is to speed up reassembly. Given
> > > there are N TCP packets inputted, the simplest way to process packets[i]
> > > is to traverse processed packets[0]~packets[i-1] and try to find one
> > > to merge. In the worst case, we need to check all of packets[0~i-1].
> > > In this case, the time complexity of processing N packets is O(N^2).
> > > If we use a hash table, whose key is the criteria to merge two packets,
> > > the time to find a packet that may be merged with packets[i] is O(1).
> >
> > I understand that, but add/search inside the hash table,
> > plus resetting it for every burst of packets doesn't come for free either.
> > I think that for most common burst sizes (< 100 packets)
> > hash overhead would significantly outweigh the price of
> > worst case O(N^2) scan.
>
> Yes, the packet burst size is always less than 100, which may amplify
> the cost of using hash tables. To avoid the high price, maybe a simpler
> structure, like rte_ip_frag_tbl in IP reassembly library, is better. And
> actually, I have never tried other designs. In next version, I will use
> a simpler structure to merge TCP packets and compare performance
> results. Thanks.
>
> > Also, if such worst case really worries you, you can pre-sort
> > input packets first before starting the actual reassemble for them.
>
> If the inputted N packets consist of N1 TCP packets and N2 UDP
> packets, and we pre-sort them, what should they look like after sorting?
>

My thought was something like that:
<tcp_flow0 pkts>...<tcp_flowX pkts> <udp_flow0 pkts>...<udp_flowY pkts>
|------- N1 ------------------------------|--------- N2 -------------------------------|

Konstantin

> > That might help to bring the price down.
> > Konstantin
>