From: "David P. Reed"
To: Yossi Barshishat
Cc: "dev@dpdk.org"
Subject: Re: [dpdk-dev] zero copy of received segmented IP packet
Date: Sun, 30 Mar 2014 14:17:32 +0000
Message-ID: <8E2A3E7E-3943-4A03-8BC0-4A385B94927E@tidalscale.com>
In-Reply-To: <02a701cf4be4$a8f3a440$fadaecc0$@imvisiontech.com>

Yossi -

You may already understand this, but the fragments of an IP datagram
("IP packet" is non-standard slang that confuses IP fragments, which
are packets, with the end-to-end data unit of IP) need to be
checksummed together with the fields of the "virtual header" (the TCP
pseudo-header) before delivery to TCP and then to userspace. Also, TCP
segments can overlap each other's sequence space and can be partially
"old". No rule says that a later IP datagram cannot retransmit part of
the sequence-number range carried by earlier received IP datagrams.
The bytes must be identical, of course.

So, for example, if a prior TCP segment had been received covering
sequence numbers 504-508, a subsequent TCP segment might cover sequence
numbers 500-535 (if the sender has not seen the ACK up to 508, which
can happen for many reasons). Bytes 504-508 would then be covered by
the later segment's TCP checksum (along with that segment's virtual
header).
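
The overlap in the example above can be sketched in C. This is only an
illustration, not DPDK code: trim_segment and struct trim_result are
hypothetical names, and the sketch assumes the receiver tracks a single
"next expected" sequence number (rcv_nxt) and simply holds back any
segment that starts beyond it.

```c
#include <stdint.h>

/* Hypothetical result type for illustration; not part of DPDK. */
struct trim_result {
    uint32_t skip;    /* leading bytes of the segment already delivered */
    uint32_t deliver; /* new bytes that may go to the user buffer */
};

/*
 * Given the next sequence number the receiver expects (rcv_nxt) and an
 * arriving segment covering [seq, seq + len), work out how much of the
 * segment is old and how much is new.
 */
static struct trim_result trim_segment(uint32_t rcv_nxt, uint32_t seq,
                                       uint32_t len)
{
    struct trim_result r = {0, 0};
    /* Signed difference so that 32-bit sequence wraparound is handled. */
    int32_t already = (int32_t)(rcv_nxt - seq);

    if (already < 0)
        return r;              /* starts past rcv_nxt: a gap, deliver nothing yet */
    if ((uint32_t)already >= len) {
        r.skip = len;          /* entirely old data */
        return r;
    }
    r.skip = (uint32_t)already;
    r.deliver = len - r.skip;  /* only the tail is new */
    return r;
}
```

With bytes up to 508 delivered (rcv_nxt = 509), a segment with seq = 500
and len = 36 (bytes 500-535) gives skip = 9 and deliver = 27, that is,
bytes 509-535. The checksum still has to be computed over all 36 bytes
before any of them are trusted.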

Whatever you do to implement zero-copy delivery directly into TCP
receive buffers must, for example, be able to deliver bytes 509-535
directly into the user buffer if bytes 504-508 have already been
delivered. Otherwise it is a non-standard implementation.

A simpler approach might work with certain sender stacks (those that
use the same "datagram boundaries" for retransmission), but hardly all,
since the standard does not require retransmission on such boundaries.
In the old days, terminal concentrators that used telnet over TCP would
retransmit segments larger than the original "single character"
segments in order to reduce the overhead of catching up after dropped
packets. It's dangerous to presume that one's "sending stack" and one's
"receiving stack" are the same version of the same OS, and especially
dangerous to promote a technique that fails on certain standard cases
as a performance win.

I suspect that a zero-copy TCP still requires actual copying at least
sometimes, given this "overlapping sequence number" issue, and
especially when fragmentation is involved.

So if you are talking about "almost always zero-copy with certain
senders", that might reduce the complexity a great deal. Zero-copy
fragment assembly in the IP layer alone is much more doable, but it
still requires a copy from the reassembled IP datagram into TCP
sequence-number space.

David P. Reed, Ph.D.
TidalScale, Inc.

On Mar 30, 2014, at 2:52 AM, Yossi Barshishat wrote:

> Hi,
>
> Assume I know ahead that all IP segments related to one single IP packet
> ID arrive consecutively and I need to forward the entire IP payload toward
> the application layer.
>
> One way to handle this is using a hash table for reassembly of the packet
> data (like the ipv4_reassembly example); another way would be to assume one
> single bucket (following the above assumption).
>
> However, none of the means DPDK provides enables a zero-copy mechanism
> (it is still necessary to copy the segments' payloads into one larger
> buffer).
>
> Does anybody have any idea regarding a method to control the place where
> each part of the packet will be written to?
>
> E.g. allocating the first segment regularly while the packet data buffer is
> set to the maximum packet length (rather than to MTU size), and then reading
> n bytes after the start of each following segment into the data buffer.
>
> That way I can forward the buffer to the app layer without copying it.
>
> Thanks,
>
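
For reference, the placement arithmetic in the quoted proposal can be
sketched in C. This is an illustration only: struct frag and
place_fragment are hypothetical names, not DPDK structures or calls, and
the sketch uses memcpy, so it shows where each fragment's payload
belongs rather than how to make the NIC write it there. The one fact it
relies on is that the IPv4 fragment offset field counts 8-byte units.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fragment descriptor for illustration; not a DPDK type. */
struct frag {
    uint16_t offset_units;  /* IPv4 fragment offset field, in 8-byte units */
    const uint8_t *payload; /* data following the fragment's IP header */
    uint16_t len;           /* payload length in bytes */
};

/*
 * Copy one fragment's payload to its position in a buffer sized for the
 * whole datagram payload. Returns 0 on success, or -1 if the fragment
 * would run past the end of the buffer.
 */
static int place_fragment(uint8_t *buf, size_t buf_len, const struct frag *f)
{
    size_t off = (size_t)f->offset_units * 8;

    if (off > buf_len || f->len > buf_len - off)
        return -1;
    memcpy(buf + off, f->payload, f->len);
    return 0;
}
```

Because each fragment carries its own offset, the order of arrival does
not matter for placement. Detecting that the datagram is complete, and
checking the TCP checksum over the assembled payload, are left out of
this sketch.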