From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail.lysator.liu.se (mail.lysator.liu.se [130.236.254.3]) by dpdk.org (Postfix) with ESMTP id 225564C91 for ; Tue, 11 Sep 2018 10:03:14 +0200 (CEST) Received: from mail.lysator.liu.se (localhost [127.0.0.1]) by mail.lysator.liu.se (Postfix) with ESMTP id D3B6940082 for ; Tue, 11 Sep 2018 10:03:13 +0200 (CEST) Received: by mail.lysator.liu.se (Postfix, from userid 1004) id AA46940022; Tue, 11 Sep 2018 10:03:13 +0200 (CEST) X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on bernadotte.lysator.liu.se X-Spam-Level: X-Spam-Status: No, score=-0.9 required=5.0 tests=ALL_TRUSTED,AWL autolearn=disabled version=3.4.1 X-Spam-Score: -0.9 Received: from isengard.friendlyfire.se (host-90-232-156-190.mobileonline.telia.com [90.232.156.190]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.lysator.liu.se (Postfix) with ESMTPSA id 7E6C64000F; Tue, 11 Sep 2018 10:03:09 +0200 (CEST) From: =?UTF-8?q?Mattias=20R=C3=B6nnblom?= To: jerin.jacob@caviumnetworks.com Cc: bruce.richardson@intel.com, dev@dpdk.org, =?UTF-8?q?Mattias=20R=C3=B6nnblom?= Date: Tue, 11 Sep 2018 10:02:06 +0200 Message-Id: <20180911080216.3017-1-mattias.ronnblom@ericsson.com> X-Mailer: git-send-email 2.17.1 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Virus-Scanned: ClamAV using ClamSMTP Subject: [dpdk-dev] [PATCH v3 00/10] A Distributed Software Event Device X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 11 Sep 2018 08:03:14 -0000 v3: * Fixed incorrect headline prefixes. * Removed dummy dsw_event_schedule() function. * Removed redundant output buffer flush. v2: * Added support for Meson builds. * Eventdev 'xstats' support is now mandatory. * Added check in dsw_probe() to allow secondary processes. * rte_event_dev_stop() now runs the flush callback. * Added documentation. * Fixed uninitialized-use warning in ‘dsw_port_consider_migration'. * Removed some dead (#if 0) debugging code. * Version .map file is bumped to 18.11. * Header files sorted in alphabetic order, newline after declarations and various other coding style-related improvements. This is the Distributed Software (DSW) event device, which distributes the task of scheduling events among all the eventdev ports and their lcore threads. DSW is primarily designed for atomic-only queues, but also supports single-link and parallel queues. (DSW would be more accurately described as 'parallel', but since that term is used to describe an eventdev queue type, it's referred to as 'distributed', to avoid suggesting it's somehow biased toward parallel queues.) Event Scheduling ================ Internally, DSW hashes an eventdev flow id to a 15-bit "flow hash". For each eventdev queue, there's a table mapping a flow hash to an eventdev port. That port is considered the owner of the flow. Owners are randomly picked at initialization time, among the ports serving (i.e. are linked to) that queue. The scheduling of an event to a port is done (by the sender port) at time of the enqueue operation, and in most cases simply consists of hashing the flow id and performing a lookup in the destination queue's table. Each port has an MP/SC event ring to which the events are enqueued. This means events go directly port-to-port, typically meaning core-to-core. Port Load Measurement ===================== DSW includes a concept of port load. The event device keeps track of transitions between "idle" and "busy" (or vice versa) on a per-port basis, compares this to the wall time passed, and computes to what extent the port was busy (for a certain interval). A port transitions to "busy" on a non-zero dequeue, and again back to "idle" at the point it performs a dequeue operation returning zero events. Flow Migration ============== Periodically, hidden to the API user and as a part of a normal enqueue/dequeue operations, a port updates its load estimate, and in case the load has reached a certain threshold, considers moving one of its flow to a different, more lightly loaded, port. This process is called migration. Migration Strategy ~~~~~~~~~~~~~~~~~~ The DSW migration strategy is to move a small, but yet active flow. To quickly find which are the active flows (w/o resorting to scanning through the tables and/or keeping per-event counters), each port maintains a list of the last 128 events it has dequeued. If there are lightly-loaded enough target ports, it will attempt to migrate one of those flows, starting with the smallest. The size is estimated by the number of events seen on that flow, in that small sample of events. A good migration strategy, based on reasonably good estimates of port and current flow event rates, is key for proper load balancing in a DSW-style event device. Migration Process ~~~~~~~~~~~~~~~~~ If the prerequisites are met, and a migration target flow and port is found, the owning (source) port will initiate the migration process. For parallel queues it's a very straightforward operation - simply a table update. For atomic queues, in order to maintain their semantics, it's a fair bit more elaborate a procedure. A high-level view the migration process is available[1] in the form a sequence diagram. Much simplified, it consist of the source port sending messages to all ports configured, asking them to "pause" the to-be-migrated flow. Such ports will flush their output buffers and provide a confirmation back to the source port. Each port holds a list of which flows are paused. Upon the enqueue of an event belonging to a paused flow, it will be accepted into the machinery, but kept in a paused-events buffer located on the sending port. After receiving confirmations from all ports, the source port will make sure its application-level user has finished processing of all events related to the migrating flow, update the relevant queue's table, and forward all unprocessed events (in its input event ring) to the new target port. The source port will then send out a request to "unpause" the flow to all ports. Upon receiving such a request, the port will flush any buffered (paused) events related to the paused flow, and provide a confirmation. All the signaling are done on regular DPDK rings (separate from the event-carrying rings), and are pulled as a part of normal enqueue/dequeue operation. The migrations can be made fairly rapidly (in the range of a couple hundred us, or even faster), but the algorithm, load measurement and migration interval parameters must be carefully chosen not to cause the system to oscillate or otherwise misbehave. The migration rate is primarily limited by eventdev enqueue/dequeue function call rate, which in turn in the typical application is limited by event burst sizes and event processing latency. Migration API Implications ~~~~~~~~~~~~~~~~~~~~~~~~~~ The migration process puts an addition requirement on the application beyond the regular eventdev API, which is to not leave ports 'unattended'. Unattended here means a port on that neither enqueue nor dequeue operations are performed within a reasonable time frame. What is 'reasonable' depends on the migration latency requirements, which in turns depends on the degree of variation in the workload. For enqueue-only ports, which might well come into situations where no events are enqueued for long duration of time, DSW includes an less-than-elegant solution, allowing zero-sized enqueue operations, which serve no other purpose that to drive the migration machinery. Workload Suitability ==================== DSW operates under the assumption that an active flow will remain so for a duration which is significantly longer than the migration latency. DSW should do well with a larger number of small flows, and also large flows that increase their rates at a pace which is low-enough for the migration process to move away smaller flows to make room on that port. TCP slow-start kind of traffic, with end-to-end latencies on the ms level, should be possible to handle, even though their exponential nature - but all of this is speculation. DSW won't be able to properly load balance workloads with few, highly bursty, and high intensity flows. Compared to the SW event device, DSW allows scaling to higher-core count machines, with its substantially higher throughput and avoiding a single bottleneck core, especially for long pipelines, or systems with very short pipeline stages. In addition, it also scales down to configurations with very few or even just a single core, avoiding the issue with SW has with running application work and event scheduling on the same core. The downsides is that DSW doesn't have SW's near-immediate load balancing flow-rerouting capability, but instead relies on flows changing their inter-event time at a pace which isn't too high for the migration process to handle. Backpressure ============ Like any event device, DSW provides a backpressure mechanism to prevent event producers flooding it. DSW employs a credit system, with a central pool equal to the configured max number of in-flight events. The ports are allowed to take loans from this central pool, and may also return credits, so that consumer-heavy ports don't end up draining the pool. A port will, at the time of enqueue, make sure it has enough credits (one per event) to allow the events into DSW. If not, the port will attempt to retrieve more from the central pool. If this fails, the enqueue operation fails. For efficiency reasons, at least 64 credits are take from the pool (even if fewer might be needed). A port will, at the time of dequeue, gain as many credits as the number of events it dequeued. A port will not return credits until they reach 128, and will always keep 64 credits. All this in a similar although not identical manner to the SW event device. Output Buffering ================ Upon a successful enqueue operation, DSW will not immediately send the events to their destination ports' event input rings. Events will however - unless paused - be assigned a destination port and enqueued on a buffer on the sending port. Such buffered events are considered accepted into the event device, and is so handled from a migration and in-flight credit system point of view. Upon reaching a certain threshold, buffered events will be flushed, and enqueued on the destination port's input ring. The output buffers make the DSW ports use longer bursts against the receiving port rings, much improving event ring efficiency. To avoid having buffered events lingering too long (or even endlessly) in these buffers, DSW has a schema where it only allows a certain number of enqueue/dequeue operations ('ops') to be performed, before the buffers are flushed. A side effect of how 'ops' are counted is that in case a port goes idle, it will likely perform many dequeue operations to pull new work, and thus quickly up the 'ops' to a level it's output buffers are flushed. That will cause lower ring efficiency, but this is of no major concern since the worker is idling anyways. This allows for single-event enqueue operations to be efficient, although credit system and statistics update overhead will still make them slower than burst enqueues. Output Buffering API Implications ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The output buffering schema puts the same requirement on the application as the migration process in that it disallows unattended ports. In addition, DSW also implement a schema (maybe more accurately described as a hack) where the application can force a buffer flush by doing a zero-sized enqueue. Alternative Approach ~~~~~~~~~~~~~~~~~~~~ An alternative to the DSW-internal output buffering is to have the application to use burst enqueues, preferably with very large bursts (the more cores, the larger bursts are preferred). Depending on how the application is structured, this might well lead to it having an internal buffer to which it does efficient, single-event enqueue operations to, and then flushes it on a regular basis. However, since the DSW output buffering happens after the scheduling is performed, the buffers can actually be flushed earlier than if buffering happens in the application, if a large fraction of the events are scheduled to a particular port (since the output buffer limit is on a per-destination port basis). In addition, since events in the output buffer are considered accepted into DSW, migration can happen somewhat more quickly, since those events can be flushed on migrations, as oppose to an application-controlled buffer. Statistics ========== DSW supports the eventdev 'xstats' interface. It provides a large, most port-based, set of counters, including things like port load, number of migrations and migration latency, number of events dequeued and enqueued, and on which queue, the average event processing latency and a timestamp to allow the detection of unattended ports. DSW xstats also allows reading the current global total and port credits, making it possible to give a rough estimate of how many events are in flight. Performance Indications ======================= The primary source of performance metrics comes from a test application implementing a simulate pipeline. With zero work in each pipe line stage, running on single socket x86_64 system, fourteen 2.4 GHz worker cores can sustain 300-400 million event/s. With a pipeline with 1000 clock cycles of work per stage, the average event device overhead is somewhere 50-150 clock cycles/event. The benchmark is run when the system is fully loaded (i.e. there are always events available on the pipeline ingress), and thus the event device will benefit from batching effects, which are crucial for performance. Also beneficial for DSW efficiency is the fact that the "dummy" application work cycles has a very small memory working set, leaving all the caches to DSW. The simulated load has flows with a fixed event rate, causing very few migrations - and more importantly - allows DSW to provide near-ideal load balancing. So inefficienes due to imperfect load balancing is also not accounted for. The flow-to-port/thread/core affinity of DSW should provide for some caching benefits for the application, for flow-related data structures, compared to an event device where the flows move around the ports in a more stochastic manner. [1] http://www.lysator.liu.se/~hofors/dsw/migration-sequence.svg Mattias Rönnblom (10): event/dsw: add DSW device registration and build system event/dsw: add DSW device and queue configuration event/dsw: add DSW port configuration event/dsw: add support in DSW for linking/unlinking ports event/dsw: add DSW event scheduling and device start/stop event/dsw: add DSW port load measurements event/dsw: add load balancing to the DSW event device event/dsw: let DSW event device sort events on dequeue event/dsw: implement eventdev 'xstats' counters in DSW event/dsw: include DSW event device documentation config/common_base | 5 + doc/guides/eventdevs/dsw.rst | 97 ++ doc/guides/eventdevs/index.rst | 1 + drivers/event/Makefile | 1 + drivers/event/dsw/Makefile | 27 + drivers/event/dsw/dsw_evdev.c | 435 ++++++ drivers/event/dsw/dsw_evdev.h | 279 ++++ drivers/event/dsw/dsw_event.c | 1253 +++++++++++++++++ drivers/event/dsw/dsw_sort.h | 48 + drivers/event/dsw/dsw_xstats.c | 288 ++++ drivers/event/dsw/meson.build | 6 + .../event/dsw/rte_pmd_dsw_event_version.map | 3 + drivers/event/meson.build | 2 +- mk/rte.app.mk | 1 + 14 files changed, 2445 insertions(+), 1 deletion(-) create mode 100644 doc/guides/eventdevs/dsw.rst create mode 100644 drivers/event/dsw/Makefile create mode 100644 drivers/event/dsw/dsw_evdev.c create mode 100644 drivers/event/dsw/dsw_evdev.h create mode 100644 drivers/event/dsw/dsw_event.c create mode 100644 drivers/event/dsw/dsw_sort.h create mode 100644 drivers/event/dsw/dsw_xstats.c create mode 100644 drivers/event/dsw/meson.build create mode 100644 drivers/event/dsw/rte_pmd_dsw_event_version.map -- 2.17.1