From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by dpdk.org (Postfix) with ESMTP id 06D95235 for ; Tue, 27 Jun 2017 15:12:48 +0200 (CEST) Received: from fmsmga006.fm.intel.com ([10.253.24.20]) by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 27 Jun 2017 06:12:23 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.39,399,1493708400"; d="scan'208";a="119748798" Received: from dhunt5-mobl.ger.corp.intel.com (HELO [10.237.221.55]) ([10.237.221.55]) by fmsmga006.fm.intel.com with ESMTP; 27 Jun 2017 06:12:20 -0700 To: Jerin Jacob References: <1492768299-84016-1-git-send-email-harry.van.haaren@intel.com> <1492768299-84016-2-git-send-email-harry.van.haaren@intel.com> <20170510141202.GA8431@jerin> <6c53f05d-2dd2-5b83-2eab-bcecd93bea82@intel.com> <20170627093506.GB14276@jerin> Cc: Harry van Haaren , dev@dpdk.org, Gage Eads , Bruce Richardson From: "Hunt, David" Message-ID: <72d1d150-a119-2b23-642c-484ca658c4b3@intel.com> Date: Tue, 27 Jun 2017 14:12:20 +0100 User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 MIME-Version: 1.0 In-Reply-To: <20170627093506.GB14276@jerin> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [dpdk-dev] [PATCH 1/3] examples/eventdev_pipeline: added sample app X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 27 Jun 2017 13:12:50 -0000 Hi Jerin: On 27/6/2017 10:35 AM, Jerin Jacob wrote: > -----Original Message----- >> Date: Mon, 26 Jun 2017 15:46:47 +0100 >> From: "Hunt, David" >> To: Jerin Jacob , Harry van Haaren >> >> CC: dev@dpdk.org, Gage Eads , Bruce Richardson >> >> Subject: Re: [dpdk-dev] [PATCH 1/3] examples/eventdev_pipeline: added >> sample app >> User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 >> Thunderbird/45.8.0 >> >> Hi Jerin, > Hi David, > > Looks like you have sent the old version. The below mentioned comments > are not addressed in v2. Oops. Glitch in the Matrix. I've just pushed a V3 with the changes. >> I'm assisting Harry on the sample app, and have just pushed up a V2 patch >> based on your feedback. I've addressed most of your suggestions, comments >> below. There may still a couple of outstanding questions that need further >> discussion. > A few general comments: > 1) Nikhil/Gage's proposal on ethdev rx to eventdev adapter will change the major > portion(eventdev setup and producer()) of this application > 2) Producing one lcore worth of packets, really cant show as example > eventdev application as it will be pretty bad in low-end machine. > At least application infrastructure should not limit. > > Considering above points, Should we wait for rx adapter to complete > first? I would like to show this as real world application to use eventdev. > > Thoughts? > > On the same note: > Can we finalize on rx adapter proposal? I can work on v1 of patch and > common code if Nikhil or Gage don't have bandwidth. Let me know? > > last followup: > http://dpdk.org/ml/archives/dev/2017-June/068776.html I had a quick chat with Harry, and wonder if we'd be as well to merge the app as it is now, and as the new frameworks become available, the app can be updated to make use of them? I feel it would be better to have something out there for people to play with than waiting for 17.11. Also, if you have bandwidth to patch the app for your desired use cases, that would be a good contribution. I'd only be guessing for some of it :) >> Regards, >> Dave >> >> On 10/5/2017 3:12 PM, Jerin Jacob wrote: >>> -----Original Message----- >>>> Date: Fri, 21 Apr 2017 10:51:37 +0100 >>>> From: Harry van Haaren >>>> To: dev@dpdk.org >>>> CC: jerin.jacob@caviumnetworks.com, Harry van Haaren >>>> , Gage Eads , Bruce >>>> Richardson >>>> Subject: [PATCH 1/3] examples/eventdev_pipeline: added sample app >>>> X-Mailer: git-send-email 2.7.4 >>>> >>>> This commit adds a sample app for the eventdev library. >>>> The app has been tested with DPDK 17.05-rc2, hence this >>>> release (or later) is recommended. >>>> >>>> The sample app showcases a pipeline processing use-case, >>>> with event scheduling and processing defined per stage. >>>> The application recieves traffic as normal, with each >>>> packet traversing the pipeline. Once the packet has >>>> been processed by each of the pipeline stages, it is >>>> transmitted again. >>>> >>>> The app provides a framework to utilize cores for a single >>>> role or multiple roles. Examples of roles are the RX core, >>>> TX core, Scheduling core (in the case of the event/sw PMD), >>>> and worker cores. >>>> >>>> Various flags are available to configure numbers of stages, >>>> cycles of work at each stage, type of scheduling, number of >>>> worker cores, queue depths etc. For a full explaination, >>>> please refer to the documentation. >>>> >>>> Signed-off-by: Gage Eads >>>> Signed-off-by: Bruce Richardson >>>> Signed-off-by: Harry van Haaren >>> Thanks for the example application to share the SW view. >>> I could make it run on HW after some tweaking(not optimized though) >>> >>> [...] >>>> +#define MAX_NUM_STAGES 8 >>>> +#define BATCH_SIZE 16 >>>> +#define MAX_NUM_CORE 64 >>> How about RTE_MAX_LCORE? >> Core usage in the sample app is held in a uint64_t. Adding arrays would be >> possible, but I feel that the extra effort would not give that much benefit. >> I've left as is for the moment, unless you see any strong requirement to go >> beyond 64 cores? > I think, it is OK. Again with service core infrastructure this will change. > >>>> + >>>> +static unsigned int active_cores; >>>> +static unsigned int num_workers; >>>> +static unsigned long num_packets = (1L << 25); /* do ~32M packets */ >>>> +static unsigned int num_fids = 512; >>>> +static unsigned int num_priorities = 1; >>> looks like its not used. >> Yes, Removed. >> >>>> +static unsigned int num_stages = 1; >>>> +static unsigned int worker_cq_depth = 16; >>>> +static int queue_type = RTE_EVENT_QUEUE_CFG_ATOMIC_ONLY; >>>> +static int16_t next_qid[MAX_NUM_STAGES+1] = {-1}; >>>> +static int16_t qid[MAX_NUM_STAGES] = {-1}; >>> Moving all fastpath related variables under a structure with cache >>> aligned will help. >> I tried a few different combinations of this, and saw no gains, some losses. >> So will leave as is for the moment, if that's OK. > I think, the one are using in fastpath better to allocate from huge page > using rte_malloc() > >>>> +static int worker_cycles; >>>> +static int enable_queue_priorities; >>>> + >>>> +struct prod_data { >>>> + uint8_t dev_id; >>>> + uint8_t port_id; >>>> + int32_t qid; >>>> + unsigned num_nic_ports; >>>> +}; >> Yes, saw a percent or two gain when this plus following two data structs >> cache aligned. > looks like it not fixed in v2. Looks like you have sent the old > version. > >>>> + >>>> + return 0; >>>> +} >>>> + >>>> +static int >>>> +producer(void) >>>> +{ >>>> + static uint8_t eth_port; >>>> + struct rte_mbuf *mbufs[BATCH_SIZE]; >>>> + struct rte_event ev[BATCH_SIZE]; >>>> + uint32_t i, num_ports = prod_data.num_nic_ports; >>>> + int32_t qid = prod_data.qid; >>>> + uint8_t dev_id = prod_data.dev_id; >>>> + uint8_t port_id = prod_data.port_id; >>>> + uint32_t prio_idx = 0; >>>> + >>>> + const uint16_t nb_rx = rte_eth_rx_burst(eth_port, 0, mbufs, >> BATCH_SIZE); >>>> + if (++eth_port == num_ports) >>>> + eth_port = 0; >>>> + if (nb_rx == 0) { >>>> + rte_pause(); >>>> + return 0; >>>> + } >>>> + >>>> + for (i = 0; i < nb_rx; i++) { >>>> + ev[i].flow_id = mbufs[i]->hash.rss; >>> prefetching the buff[i+1] may help here? >> I tried, didn't make much difference. > OK. > >>>> + ev[i].op = RTE_EVENT_OP_NEW; >>>> + ev[i].sched_type = queue_type; >>> The value of RTE_EVENT_QUEUE_CFG_ORDERED_ONLY != RTE_SCHED_TYPE_ORDERED. >> So, we >>> cannot assign .sched_type as queue_type. >>> >>> I think, one option could be to avoid translation in application is to >>> - Remove RTE_EVENT_QUEUE_CFG_ALL_TYPES, RTE_EVENT_QUEUE_CFG_*_ONLY >>> - Introduce a new RTE_EVENT_DEV_CAP_ to denote >> RTE_EVENT_QUEUE_CFG_ALL_TYPES cap >>> ability >>> - add sched_type in struct rte_event_queue_conf. If capability flag is >>> not set then implementation takes sched_type value for the queue. >>> >>> Any thoughts? >> >> Not sure here, would it be ok for the moment, and we can work on a patch in >> the future? > OK > >>>> + >>>> + if (tx_core[lcore_id] && (tx_single || >>>> + rte_atomic32_cmpset(&tx_lock, 0, 1))) { >>>> + consumer(); >>> Should consumer() need to come in this pattern? I am thinking like >>> if events is from last stage then call consumer() in worker() >>> >>> I think, above scheme works better when the _same_ worker code need to run >> the >>> case where >>> 1) ethdev HW is capable to enqueuing the packets to same txq from >>> multiple thread >>> 2) ethdev is not capable to do so. >>> >>> So, The above cases can be addressed in configuration time where we link >>> the queues to port >>> case 1) Link all workers to last queue >>> case 2) Link only worker to last queue >>> >>> and keeping the common worker code. >>> >>> HW implementation has functional and performance issue if "two" ports are >>> assigned to one lcore for dequeue. The above scheme fixes that problem >> too. >> >> >> Can we have a bit more discussion on this item? Is this needed for this >> sample app, or can we perhaps work a patch for this later? Harry? > As explained above, Is there any issue in keeping consumer() for last > stage ? I would probably see this as a future enhancement as per my initial comments above. Any hardware or new framework additions are welcome as future patches to the app. >> >>>> + rte_atomic32_clear((rte_atomic32_t *)&tx_lock); >>>> + } >>>> +} >>>> + >>>> +static int >>>> +worker(void *arg) >>>> +{ >>>> + struct rte_event events[BATCH_SIZE]; >>>> + >>>> + struct worker_data *data = (struct worker_data *)arg; >>>> + uint8_t dev_id = data->dev_id; >>>> + uint8_t port_id = data->port_id; >>>> + size_t sent = 0, received = 0; >>>> + unsigned lcore_id = rte_lcore_id(); >>>> + >>>> + while (!done) { >>>> + uint16_t i; >>>> + >>>> + schedule_devices(dev_id, lcore_id); >>>> + >>>> + if (!worker_core[lcore_id]) { >>>> + rte_pause(); >>>> + continue; >>>> + } >>>> + >>>> + uint16_t nb_rx = rte_event_dequeue_burst(dev_id, port_id, >>>> + events, RTE_DIM(events), 0); >>>> + >>>> + if (nb_rx == 0) { >>>> + rte_pause(); >>>> + continue; >>>> + } >>>> + received += nb_rx; >>>> + >>>> + for (i = 0; i < nb_rx; i++) { >>>> + struct ether_hdr *eth; >>>> + struct ether_addr addr; >>>> + struct rte_mbuf *m = events[i].mbuf; >>>> + >>>> + /* The first worker stage does classification */ >>>> + if (events[i].queue_id == qid[0]) >>>> + events[i].flow_id = m->hash.rss % num_fids; >>> Not sure why we need do(shrinking the flows) this in worker() in queue >> based pipeline. >>> If an PMD has any specific requirement on num_fids,I think, we >>> can move this configuration stage or PMD can choose optimum fid internally >> to >>> avoid modulus operation tax in fastpath in all PMD. >>> >>> Does struct rte_event_queue_conf.nb_atomic_flows help here? >> In my tests the modulus makes very little difference in the throughput. And >> I think it's good to have a way of varying the number of flows for testing >> different scenarios, even if it's not the most performant. > Not sure. > >>>> + >>>> + events[i].queue_id = next_qid[events[i].queue_id]; >>>> + events[i].op = RTE_EVENT_OP_FORWARD; >>> missing events[i].sched_type.HW PMD does not work with this. >>> I think, we can use similar scheme like next_qid for next_sched_type. >> Done. added events[i].sched_type = queue_type. >> >>>> + >>>> + /* change mac addresses on packet (to use mbuf data) */ >>>> + eth = rte_pktmbuf_mtod(m, struct ether_hdr *); >>>> + ether_addr_copy(ð->d_addr, &addr); >>>> + ether_addr_copy(ð->s_addr, ð->d_addr); >>>> + ether_addr_copy(&addr, ð->s_addr); >>> IMO, We can make packet processing code code as "static inline function" >> so >>> different worker types can reuse. >> Done. moved out to a work() function. > I think, mac swap should do in last stage, not on each forward. > ie. With existing code, 2 stage forward makes in original order. > >>>> + >>>> + /* do a number of cycles of work per packet */ >>>> + volatile uint64_t start_tsc = rte_rdtsc(); >>>> + while (rte_rdtsc() < start_tsc + worker_cycles) >>>> + rte_pause(); >>> Ditto. >> Done. moved out to a work() function. >> >>> I think, All worker specific variables like "worker_cycles" can moved into >>> one structure and use. >>> >>>> + } >>>> + uint16_t nb_tx = rte_event_enqueue_burst(dev_id, port_id, >>>> + events, nb_rx); >>>> + while (nb_tx < nb_rx && !done) >>>> + nb_tx += rte_event_enqueue_burst(dev_id, port_id, >>>> + events + nb_tx, >>>> + nb_rx - nb_tx); >>>> + sent += nb_tx; >>>> + } >>>> + >>>> + if (!quiet) >>>> + printf(" worker %u thread done. RX=%zu TX=%zu\n", >>>> + rte_lcore_id(), received, sent); >>>> + >>>> + return 0; >>>> +} >>>> + >>>> +/* >>>> + * Parse the coremask given as argument (hexadecimal string) and fill >>>> + * the global configuration (core role and core count) with the parsed >>>> + * value. >>>> + */ >>>> +static int xdigit2val(unsigned char c) >>> multiple instance of "xdigit2val" in DPDK repo. May be we can push this >>> as common code. >> Sure, that's something we can look at in a separate patch, now that it's >> being used more and more. > make sense. > >>>> +{ >>>> + int val; >>>> + >>>> + if (isdigit(c)) >>>> + val = c - '0'; >>>> + else if (isupper(c)) >>>> + val = c - 'A' + 10; >>>> + else >>>> + val = c - 'a' + 10; >>>> + return val; >>>> +} >>>> + >>>> + >>>> +static void >>>> +usage(void) >>>> +{ >>>> + const char *usage_str = >>>> + " Usage: eventdev_demo [options]\n" >>>> + " Options:\n" >>>> + " -n, --packets=N Send N packets (default ~32M), 0 >> implies no limit\n" >>>> + " -f, --atomic-flows=N Use N random flows from 1 to N >> (default 16)\n" >>> I think, this parameter now, effects the application fast path code.I >> think, >>> it should eventdev configuration para-mater. >> See above comment on num_fids >> >>>> + " -s, --num_stages=N Use N atomic stages (default >> 1)\n" >>>> + " -r, --rx-mask=core mask Run NIC rx on CPUs in core >> mask\n" >>>> + " -w, --worker-mask=core mask Run worker on CPUs in core >> mask\n" >>>> + " -t, --tx-mask=core mask Run NIC tx on CPUs in core >> mask\n" >>>> + " -e --sched-mask=core mask Run scheduler on CPUs in core >> mask\n" >>>> + " -c --cq-depth=N Worker CQ depth (default 16)\n" >>>> + " -W --work-cycles=N Worker cycles (default 0)\n" >>>> + " -P --queue-priority Enable scheduler queue >> prioritization\n" >>>> + " -o, --ordered Use ordered scheduling\n" >>>> + " -p, --parallel Use parallel scheduling\n" >>> IMO, all stage being "parallel" or "ordered" or "atomic" is one mode of >>> operation. It is valid have to any combination. We need to express that in >>> command like >>> example: >>> 3 stage with >>> O->A->P >> How about we add an option that specifies the mode of operation for each >> stage in a string? Maybe have a '-m' option (modes) e.g. '-m appo' for 4 >> stages with atomic, parallel, paralled, ordered. Or maybe reuse your >> test-eventdev parameter style? > Any scheme is fine. > >>>> + " -q, --quiet Minimize printed output\n" >>>> + " -D, --dump Print detailed statistics before >> exit" >>>> + "\n"; >>>> + fprintf(stderr, "%s", usage_str); >>>> + exit(1); >>>> +} >>>> + >>> [...] >>> >>>> + rx_single = (popcnt == 1); >>>> + break; >>>> + case 't': >>>> + tx_lcore_mask = parse_coremask(optarg); >>>> + popcnt = __builtin_popcountll(tx_lcore_mask); >>>> + tx_single = (popcnt == 1); >>>> + break; >>>> + case 'e': >>>> + sched_lcore_mask = parse_coremask(optarg); >>>> + popcnt = __builtin_popcountll(sched_lcore_mask); >>>> + sched_single = (popcnt == 1); >>>> + break; >>>> + default: >>>> + usage(); >>>> + } >>>> + } >>>> + >>>> + if (worker_lcore_mask == 0 || rx_lcore_mask == 0 || >>>> + sched_lcore_mask == 0 || tx_lcore_mask == 0) { >>>> + >>>> + /* Q creation - one load balanced per pipeline stage*/ >>>> + >>>> + /* set up one port per worker, linking to all stage queues */ >>>> + for (i = 0; i < num_workers; i++) { >>>> + struct worker_data *w = &worker_data[i]; >>>> + w->dev_id = dev_id; >>>> + if (rte_event_port_setup(dev_id, i, &wkr_p_conf) < 0) { >>>> + printf("Error setting up port %d\n", i); >>>> + return -1; >>>> + } >>>> + >>>> + uint32_t s; >>>> + for (s = 0; s < num_stages; s++) { >>>> + if (rte_event_port_link(dev_id, i, >>>> + &worker_queues[s].queue_id, >>>> + &worker_queues[s].priority, >>>> + 1) != 1) { >>>> + printf("%d: error creating link for port %d\n", >>>> + __LINE__, i); >>>> + return -1; >>>> + } >>>> + } >>>> + w->port_id = i; >>>> + } >>>> + /* port for consumer, linked to TX queue */ >>>> + if (rte_event_port_setup(dev_id, i, &tx_p_conf) < 0) { >>> If ethdev supports MT txq queue support then this port can be linked to >>> worker too. something to consider for future. >>> >> Sure. No change for now. > OK Just to add a comment for any remaining comments above, we would hope that none of them are blockers for the merge of the current version, as they can be patched in the future as the infrastructure changes. Rgds, Dave.