* [dpdk-dev] [PATCH] net/mlx5: poll completion queue once per a call

From: Yongseok Koh @ 2017-07-20 15:48 UTC
To: adrien.mazarguil, nelio.laranjeiro
Cc: dev, Yongseok Koh

mlx5_tx_complete() polls the completion queue multiple times until it
encounters an invalid entry. As Tx completions are suppressed by
MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple
completions in a single poll. And freeing too many buffers in one call
can cause high jitter. This patch improves throughput a little.

Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>
---
 drivers/net/mlx5/mlx5_rxtx.h | 32 ++++++++++----------------------
 1 file changed, 10 insertions(+), 22 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_rxtx.h b/drivers/net/mlx5/mlx5_rxtx.h
index 534aaeb46..7fd59a4b1 100644
--- a/drivers/net/mlx5/mlx5_rxtx.h
+++ b/drivers/net/mlx5/mlx5_rxtx.h
@@ -480,30 +480,18 @@ mlx5_tx_complete(struct txq *txq)
         struct rte_mempool *pool = NULL;
         unsigned int blk_n = 0;

-        do {
-                volatile struct mlx5_cqe *tmp;
-
-                tmp = &(*txq->cqes)[cq_ci & cqe_cnt];
-                if (check_cqe(tmp, cqe_n, cq_ci))
-                        break;
-                cqe = tmp;
+        cqe = &(*txq->cqes)[cq_ci & cqe_cnt];
+        if (unlikely(check_cqe(cqe, cqe_n, cq_ci)))
+                return;
 #ifndef NDEBUG
-                if (MLX5_CQE_FORMAT(cqe->op_own) == MLX5_COMPRESSED) {
-                        if (!check_cqe_seen(cqe))
-                                ERROR("unexpected compressed CQE, TX stopped");
-                        return;
-                }
-                if ((MLX5_CQE_OPCODE(cqe->op_own) == MLX5_CQE_RESP_ERR) ||
-                    (MLX5_CQE_OPCODE(cqe->op_own) == MLX5_CQE_REQ_ERR)) {
-                        if (!check_cqe_seen(cqe))
-                                ERROR("unexpected error CQE, TX stopped");
-                        return;
-                }
-#endif /* NDEBUG */
-                ++cq_ci;
-        } while (1);
-        if (unlikely(cqe == NULL))
+        if ((MLX5_CQE_OPCODE(cqe->op_own) == MLX5_CQE_RESP_ERR) ||
+            (MLX5_CQE_OPCODE(cqe->op_own) == MLX5_CQE_REQ_ERR)) {
+                if (!check_cqe_seen(cqe))
+                        ERROR("unexpected error CQE, TX stopped");
                 return;
+        }
+#endif /* NDEBUG */
+        ++cq_ci;
         txq->wqe_pi = ntohs(cqe->wqe_counter);
         ctrl = (volatile struct mlx5_wqe_ctrl *)
                 tx_mlx5_wqe(txq, txq->wqe_pi);
--
2.11.0

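The hunk above shows only the polling part of mlx5_tx_complete(); the
rest of the function (outside the diff context) walks the Tx ring and
frees every mbuf completed up to the index derived from the CQE's
wqe_counter. A simplified, self-contained sketch of that stage follows.
The struct and field names are invented for the illustration and do not
match the real struct txq layout; only the rte_mbuf call is actual DPDK
API.

#include <stdint.h>
#include <rte_mbuf.h>

/* Sketch only: the real code batches mbufs per mempool and returns
 * them with rte_mempool_put_bulk(); freeing one segment at a time
 * keeps the example short. */
struct sketch_txq {
        struct rte_mbuf **elts; /* mbufs still owned by the Tx ring */
        uint16_t elts_mask;     /* ring size - 1 (power of two)     */
        uint16_t tail;          /* oldest not-yet-freed entry       */
};

static inline void
sketch_free_completed(struct sketch_txq *q, uint16_t completed_tail)
{
        while (q->tail != completed_tail) {
                struct rte_mbuf *m = q->elts[q->tail & q->elts_mask];

                q->elts[q->tail & q->elts_mask] = NULL;
                if (m != NULL)
                        rte_pktmbuf_free_seg(m);
                ++q->tail;
        }
}
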
* Re: [dpdk-dev] [PATCH] net/mlx5: poll completion queue once per a call

From: Sagi Grimberg @ 2017-07-20 16:34 UTC
To: Yongseok Koh, adrien.mazarguil, nelio.laranjeiro
Cc: dev

> mlx5_tx_complete() polls the completion queue multiple times until it
> encounters an invalid entry. As Tx completions are suppressed by
> MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple
> completions in a single poll. And freeing too many buffers in one call
> can cause high jitter. This patch improves throughput a little.

What if the device generates a burst of completions? Holding these
completions un-reaped can theoretically cause resource stress on the
corresponding mempool(s).

I totally get the need for a stopping condition, but is "loop once" the
best stop condition? Perhaps an adaptive budget (based on online stats)
would perform better?

* Re: [dpdk-dev] [PATCH] net/mlx5: poll completion queue once per a call

From: Yongseok Koh @ 2017-07-21 15:10 UTC
To: Sagi Grimberg
Cc: adrien.mazarguil, nelio.laranjeiro, dev

On Thu, Jul 20, 2017 at 07:34:04PM +0300, Sagi Grimberg wrote:
>
> > mlx5_tx_complete() polls the completion queue multiple times until
> > it encounters an invalid entry. As Tx completions are suppressed by
> > MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple
> > completions in a single poll. And freeing too many buffers in one
> > call can cause high jitter. This patch improves throughput a little.
>
> What if the device generates a burst of completions?

The mlx5 PMD suppresses completions anyway. It requests a completion
per every MLX5_TX_COMP_THRESH Tx mbufs, not for every single mbuf, so
the completion queue stays comparatively small.

> Holding these completions un-reaped can theoretically cause resource
> stress on the corresponding mempool(s).

Can you make your point clearer? Do you think the "stress" can impact
performance? I think stress doesn't matter unless the pool is depleted.
And the app is responsible for supplying enough mbufs considering the
depth of all queues (the max # of outstanding mbufs).

> I totally get the need for a stopping condition, but is "loop once"
> the best stop condition?

Best for what?

> Perhaps an adaptive budget (based on online stats) would perform
> better?

Please bring up any suggestion or submit a patch if any. Does "budget"
mean the threshold? If so, calculating stats for an adaptive threshold
can impact single-core performance. With multiple cores, adjusting the
threshold doesn't affect much.

Thanks,
Yongseok

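To make the suppression concrete, here is a minimal, hypothetical
sketch of the idea; every name below is invented for the example, and
SKETCH_COMP_THRESH merely stands in for the driver's MLX5_TX_COMP_THRESH
constant. It is not the literal mlx5 tx_burst() code.

#include <stdint.h>

#define SKETCH_COMP_THRESH 32 /* stand-in for MLX5_TX_COMP_THRESH */

struct sketch_txq_state {
        uint16_t pkts_since_cqe; /* packets posted since last requested CQE */
};

/* Return 1 if the WQE that closes this burst should ask the NIC for a
 * CQE, 0 if the completion should be suppressed.  With this scheme at
 * most one unpolled completion is normally pending, which is what makes
 * a single poll in mlx5_tx_complete() sufficient. */
static inline int
sketch_want_completion(struct sketch_txq_state *q, uint16_t pkts_in_burst)
{
        q->pkts_since_cqe += pkts_in_burst;
        if (q->pkts_since_cqe >= SKETCH_COMP_THRESH) {
                q->pkts_since_cqe = 0;
                return 1;
        }
        return 0;
}

In such a scheme, tx_burst() would call this once per burst and set the
completion-request flag on the last WQE only when it returns 1.
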
* Re: [dpdk-dev] [PATCH] net/mlx5: poll completion queue once per a call

From: Sagi Grimberg @ 2017-07-23 9:49 UTC
To: Yongseok Koh
Cc: adrien.mazarguil, nelio.laranjeiro, dev

> > > mlx5_tx_complete() polls the completion queue multiple times until
> > > it encounters an invalid entry. As Tx completions are suppressed
> > > by MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple
> > > completions in a single poll. And freeing too many buffers in one
> > > call can cause high jitter. This patch improves throughput a
> > > little.
> >
> > What if the device generates a burst of completions?
>
> The mlx5 PMD suppresses completions anyway. It requests a completion
> per every MLX5_TX_COMP_THRESH Tx mbufs, not for every single mbuf, so
> the completion queue stays comparatively small.

Yes, I realize that, but can't the device still complete in a burst (of
unsuppressed completions)? I mean, it's not guaranteed that for every
txq_complete a signaled completion is pending, right? What happens if
the device has inconsistent completion pacing? Can't the SW grow a
batch of completions if txq_complete processes a single completion
unconditionally?

> > Holding these completions un-reaped can theoretically cause resource
> > stress on the corresponding mempool(s).
>
> Can you make your point clearer? Do you think the "stress" can impact
> performance? I think stress doesn't matter unless the pool is
> depleted. And the app is responsible for supplying enough mbufs
> considering the depth of all queues (the max # of outstanding mbufs).

I might be missing something, but the # of outstanding mbufs should be
relatively small as the PMD reaps every MLX5_TX_COMP_THRESH mbufs,
right? Why should the pool account for the entire TX queue depth (which
can be very large)?

Is there a hard requirement documented somewhere that the application
needs to account for the entire TX queue depth when sizing its mbuf
pool?

My question is: with the proposed change, doesn't this mean that the
application might need to allocate a bigger TX mbuf pool? Because the
PMD can theoretically consume completions slower (as in, over multiple
TX burst calls)?

> > I totally get the need for a stopping condition, but is "loop once"
> > the best stop condition?
>
> Best for what?

The best condition to stop consuming TX completions. As I said, I think
that leaving TX completions un-reaped can (at least in theory) slow
down mbuf reclamation, which impacts the application (unless I'm not
understanding something fundamental).

> > Perhaps an adaptive budget (based on online stats) would perform
> > better?
>
> Please bring up any suggestion or submit a patch if any.

I was simply providing a review of the patch. I don't have the time to
come up with a better patch unfortunately, but I still think it's fair
to raise the point.

> Does "budget" mean the threshold? If so, calculating stats for an
> adaptive threshold can impact single-core performance. With multiple
> cores, adjusting the threshold doesn't affect much.

If you look at the mlx5e driver in the kernel, it maintains online
stats on its RX and TX queues. It maintains these stats mostly for
adaptive interrupt moderation control (but not only).

I was suggesting maintaining per-TX-queue stats on the average number
of completions consumed per TX burst call, and adjusting the stopping
condition according to that calculated stat.

* Re: [dpdk-dev] [PATCH] net/mlx5: poll completion queue once per a call

From: Yongseok Koh @ 2017-07-25 7:43 UTC
To: Sagi Grimberg
Cc: adrien.mazarguil, nelio.laranjeiro, dev

On Sun, Jul 23, 2017 at 12:49:36PM +0300, Sagi Grimberg wrote:
> > > > mlx5_tx_complete() polls the completion queue multiple times
> > > > until it encounters an invalid entry. As Tx completions are
> > > > suppressed by MLX5_TX_COMP_THRESH, it is a waste of cycles to
> > > > expect multiple completions in a single poll. And freeing too
> > > > many buffers in one call can cause high jitter. This patch
> > > > improves throughput a little.
> > >
> > > What if the device generates a burst of completions?
> >
> > The mlx5 PMD suppresses completions anyway. It requests a completion
> > per every MLX5_TX_COMP_THRESH Tx mbufs, not for every single mbuf,
> > so the completion queue stays comparatively small.
>
> Yes, I realize that, but can't the device still complete in a burst
> (of unsuppressed completions)? I mean, it's not guaranteed that for
> every txq_complete a signaled completion is pending, right? What
> happens if the device has inconsistent completion pacing? Can't the
> SW grow a batch of completions if txq_complete processes a single
> completion unconditionally?

Speculation. First of all, the device doesn't delay completion
notifications for no reason. An ASIC is not SW running on top of an OS.
If a completion comes up late, it means the device really can't keep up
with the rate of posted descriptors. If so, tx_burst() should generate
back-pressure by returning a partial Tx, and then the app can decide
between drop and retry. Retry on Tx means back-pressuring the Rx side
if the app is forwarding packets.

A more serious problem I expected was the case where the THRESH is
smaller than the burst size. In that case, txq->elts[] would be short
of slots all the time. But fortunately, in the MLX PMD we request at
most one completion per burst, not one per THRESH packets.

If there's some SW jitter in Tx processing, the Tx CQ can certainly
grow. My question to myself was "when does it shrink?". It shrinks when
the Tx burst is light (burst size smaller than THRESH) because
mlx5_tx_complete() is called every time tx_burst() is called. What if
it keeps growing? Then drops are necessary and natural, like I
mentioned above. It doesn't make sense for SW to absorb every possible
SW jitter; the cost is high. That is usually done by increasing the
queue depth. Keeping a steady state is more important.

Rather, this patch helps reduce jitter. When I run a profiler, the most
cycle-consuming part on Tx is still freeing buffers. If we allowed
looping while checking for valid CQEs, many buffers could be freed in a
single call of mlx5_tx_complete() at some moment, which would cause a
long delay and aggravate jitter.

> > > Holding these completions un-reaped can theoretically cause
> > > resource stress on the corresponding mempool(s).
> >
> > Can you make your point clearer? Do you think the "stress" can
> > impact performance? I think stress doesn't matter unless the pool
> > is depleted. And the app is responsible for supplying enough mbufs
> > considering the depth of all queues (the max # of outstanding
> > mbufs).
>
> I might be missing something, but the # of outstanding mbufs should
> be relatively small as the PMD reaps every MLX5_TX_COMP_THRESH mbufs,
> right? Why should the pool account for the entire TX queue depth
> (which can be very large)?

The reason is simple for an Rx queue: if the number of mbufs in the
provisioned mempool is less than the Rx queue depth, the PMD can't even
successfully initialize the device. The PMD doesn't keep a private
mempool, so it is nonsensical to provision fewer mbufs than the queue
depth even if that isn't documented; it is obvious.

No mempool is assigned for Tx, and in this case the app isn't forced to
prepare enough mbufs to cover all the Tx queues. But the downside is
significant performance degradation. From the PMD's perspective, it
just needs to avoid any deadlock condition due to depletion. Even if
freeing mbufs in bulk causes some resource depletion on the app side,
it is a fair trade-off for higher performance as long as there's no
deadlock. And as far as I can tell, most PMDs free mbufs in bulk, not
one by one, which is also good for cache locality. Anyway, there are
many examples depending on the packet-processing mode
(fwd/rxonly/txonly), but I won't explain all of them one by one.

> Is there a hard requirement documented somewhere that the application
> needs to account for the entire TX queue depth when sizing its mbuf
> pool?

If needed, we should document it, and this would be a good start for
you to contribute to the DPDK community. But think about the definition
of Tx queue depth: doesn't it mean that a queue can hold that amount of
descriptors? Then the app should prepare more mbufs than the queue
depth it has configured. In my understanding, there's no point in
having fewer mbufs than the total number of queue entries. If resources
are scarce, what's the point of having a larger queue depth? It should
have a smaller queue.

> My question is: with the proposed change, doesn't this mean that the
> application might need to allocate a bigger TX mbuf pool? Because the
> PMD can theoretically consume completions slower (as in, over
> multiple TX burst calls)?

No. Explained above.

[...]

> > > Perhaps an adaptive budget (based on online stats) would perform
> > > better?
> >
> > Please bring up any suggestion or submit a patch if any.
>
> I was simply providing a review of the patch. I don't have the time
> to come up with a better patch unfortunately, but I still think it's
> fair to raise the point.

Of course. I appreciate your time for the review. And keep in mind that
nothing is impossible in an open source community; I always like to
discuss ideas with anyone. I was just asking for more details about
your suggestion, in case you wanted me to implement it, rather than
getting a one-sentence question :-)

> > Does "budget" mean the threshold? If so, calculating stats for an
> > adaptive threshold can impact single-core performance. With
> > multiple cores, adjusting the threshold doesn't affect much.
>
> If you look at the mlx5e driver in the kernel, it maintains online
> stats on its RX and TX queues. It maintains these stats mostly for
> adaptive interrupt moderation control (but not only).
>
> I was suggesting maintaining per-TX-queue stats on the average number
> of completions consumed per TX burst call, and adjusting the stopping
> condition according to that calculated stat.

In the case of interrupt mitigation it could be beneficial because
interrupt handling is costly. But the beauty of DPDK is polling, isn't
it?

And please remember to ack at the end of this discussion if you are
okay, so that this patch can get merged. One data point: single-core
performance (fwd) of the vectorized PMD improves by more than 6% with
this patch, and 6% is never small.

Thanks for your review again.
Yongseok

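As a rough illustration of the sizing rule discussed above (a common
heuristic, not a documented DPDK requirement), an application could
size its mbuf pool along these lines; all parameter names are invented
for the example:

/* Heuristic mbuf-pool sizing: account for every descriptor that can
 * hold an mbuf, the per-lcore mempool caches, and some slack for
 * bursts in flight.  Illustrative only. */
static unsigned int
estimate_mbuf_pool_size(unsigned int nb_rxq, unsigned int rxd_per_q,
                        unsigned int nb_txq, unsigned int txd_per_q,
                        unsigned int nb_lcores, unsigned int cache_size,
                        unsigned int burst_size)
{
        return nb_rxq * rxd_per_q +        /* mbufs held by Rx rings       */
               nb_txq * txd_per_q +        /* worst-case un-freed Tx mbufs */
               nb_lcores * cache_size +    /* per-lcore mempool caches     */
               nb_lcores * burst_size * 2; /* Rx/Tx bursts in flight       */
}

For example, four Rx and four Tx queues of 512 descriptors each, two
lcores, a cache of 256 and 32-packet bursts would come to roughly 4736
mbufs under this heuristic.
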
* Re: [dpdk-dev] [PATCH] net/mlx5: poll completion queue once per a call

From: Sagi Grimberg @ 2017-07-27 11:12 UTC
To: Yongseok Koh
Cc: adrien.mazarguil, nelio.laranjeiro, dev

> > Yes, I realize that, but can't the device still complete in a burst
> > (of unsuppressed completions)? I mean, it's not guaranteed that for
> > every txq_complete a signaled completion is pending, right? What
> > happens if the device has inconsistent completion pacing? Can't the
> > SW grow a batch of completions if txq_complete processes a single
> > completion unconditionally?
>
> Speculation. First of all, the device doesn't delay completion
> notifications for no reason. An ASIC is not SW running on top of an
> OS.

I'm sorry, but this statement is not correct. It might be correct in a
lab environment, but in practice there are lots of things that can
affect the device timing.

> If a completion comes up late, it means the device really can't keep
> up with the rate of posted descriptors. If so, tx_burst() should
> generate back-pressure by returning a partial Tx, and then the app
> can decide between drop and retry. Retry on Tx means back-pressuring
> the Rx side if the app is forwarding packets.

Not arguing with that; I was simply suggesting that better heuristics
could be applied than "process one completion unconditionally".

> A more serious problem I expected was the case where the THRESH is
> smaller than the burst size. In that case, txq->elts[] would be short
> of slots all the time. But fortunately, in the MLX PMD we request at
> most one completion per burst, not one per THRESH packets.
>
> If there's some SW jitter in Tx processing, the Tx CQ can certainly
> grow. My question to myself was "when does it shrink?". It shrinks
> when the Tx burst is light (burst size smaller than THRESH) because
> mlx5_tx_complete() is called every time tx_burst() is called. What if
> it keeps growing? Then drops are necessary and natural, like I
> mentioned above. It doesn't make sense for SW to absorb every
> possible SW jitter; the cost is high. That is usually done by
> increasing the queue depth. Keeping a steady state is more important.

Again, I agree jitter is bad, but with proper heuristics in place mlx5
can still keep low jitter _and_ consume completions faster than over
consecutive tx_burst invocations.

> Rather, this patch helps reduce jitter. When I run a profiler, the
> most cycle-consuming part on Tx is still freeing buffers. If we
> allowed looping while checking for valid CQEs, many buffers could be
> freed in a single call of mlx5_tx_complete() at some moment, which
> would cause a long delay and aggravate jitter.

I didn't argue the fact that this patch addresses an issue, but mlx5 is
a driver that is designed to run applications that can behave
differently from your test case.

> Of course. I appreciate your time for the review. And keep in mind
> that nothing is impossible in an open source community; I always like
> to discuss ideas with anyone. I was just asking for more details
> about your suggestion, in case you wanted me to implement it, rather
> than getting a one-sentence question :-)

Good to know.

> > > Does "budget" mean the threshold? If so, calculating stats for an
> > > adaptive threshold can impact single-core performance. With
> > > multiple cores, adjusting the threshold doesn't affect much.
> >
> > If you look at the mlx5e driver in the kernel, it maintains online
> > stats on its RX and TX queues. It maintains these stats mostly for
> > adaptive interrupt moderation control (but not only).
> >
> > I was suggesting maintaining per-TX-queue stats on the average
> > number of completions consumed per TX burst call, and adjusting the
> > stopping condition according to that calculated stat.
>
> In the case of interrupt mitigation it could be beneficial because
> interrupt handling is costly. But the beauty of DPDK is polling,
> isn't it?

If you read my comment again, I didn't suggest applying stats to
interrupt moderation; I just gave an example use case. I was suggesting
maintaining online stats to adjust a threshold of how many completions
to process in a tx burst call (instead of processing one
unconditionally).

> And please remember to ack at the end of this discussion if you are
> okay, so that this patch can get merged. One data point: single-core
> performance (fwd) of the vectorized PMD improves by more than 6% with
> this patch, and 6% is never small.

Yeah, I don't mind merging it, given that I don't have time to come up
with anything better (or worse :))

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

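For what it's worth, a minimal sketch of the adaptive budget idea
described above: keep a per-Tx-queue moving average of CQEs consumed
per tx_burst() call and use it as the stopping condition of the next
call. All names are invented for the illustration; nothing below is
existing mlx5 code.

#include <stdint.h>

struct sketch_comp_stats {
        uint32_t avg_cqes_x8; /* EWMA of CQEs consumed per burst, scaled by 8 */
};

/* Budget for the next completion-processing pass (never less than 1). */
static inline uint16_t
sketch_comp_budget(const struct sketch_comp_stats *s)
{
        uint16_t budget = s->avg_cqes_x8 >> 3;

        return budget ? budget : 1;
}

/* EWMA update after a burst: x += sample - x/8, so x settles around
 * 8 * (average CQEs per burst), i.e. alpha = 1/8. */
static inline void
sketch_comp_update(struct sketch_comp_stats *s, uint16_t cqes_this_burst)
{
        s->avg_cqes_x8 += cqes_this_burst - (s->avg_cqes_x8 >> 3);
}

The floor of one keeps the patched behavior as the lower bound, so a
queue that really does see a single completion per burst behaves
exactly like the "poll once" code above.
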
* Re: [dpdk-dev] [PATCH] net/mlx5: poll completion queue once per a call

From: Yongseok Koh @ 2017-07-28 0:26 UTC
To: Sagi Grimberg
Cc: Adrien Mazarguil, Nélio Laranjeiro, dev

> On Jul 27, 2017, at 4:12 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
>
> > > Yes, I realize that, but can't the device still complete in a
> > > burst (of unsuppressed completions)? I mean, it's not guaranteed
> > > that for every txq_complete a signaled completion is pending,
> > > right? What happens if the device has inconsistent completion
> > > pacing? Can't the SW grow a batch of completions if txq_complete
> > > processes a single completion unconditionally?
> >
> > Speculation. First of all, the device doesn't delay completion
> > notifications for no reason. An ASIC is not SW running on top of an
> > OS.
>
> I'm sorry, but this statement is not correct. It might be correct in
> a lab environment, but in practice there are lots of things that can
> affect the device timing.

Disagree.

[...]

> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

Thanks for the ack!
Yongseok

* Re: [dpdk-dev] [PATCH] net/mlx5: poll completion queue once per a call

From: Ferruh Yigit @ 2017-07-31 16:12 UTC
To: Yongseok Koh, adrien.mazarguil, nelio.laranjeiro
Cc: dev

On 7/20/2017 4:48 PM, Yongseok Koh wrote:
> mlx5_tx_complete() polls the completion queue multiple times until it
> encounters an invalid entry. As Tx completions are suppressed by
> MLX5_TX_COMP_THRESH, it is a waste of cycles to expect multiple
> completions in a single poll. And freeing too many buffers in one
> call can cause high jitter. This patch improves throughput a little.
>
> Signed-off-by: Yongseok Koh <yskoh@mellanox.com>
> Acked-by: Nelio Laranjeiro <nelio.laranjeiro@6wind.com>

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

Applied to dpdk-next-net/master, thanks.