From: "Morten Brørup" <mb@smartsharesystems.com>
To: "Konstantin Ananyev" <konstantin.ananyev@huawei.com>,
"Konstantin Ananyev" <konstantin.v.ananyev@yandex.ru>,
<dev@dpdk.org>, "Bruce Richardson" <bruce.richardson@intel.com>
Cc: "Jan Viktorin" <viktorin@rehivetech.com>,
"Ruifeng Wang" <ruifeng.wang@arm.com>,
"David Christensen" <drc@linux.vnet.ibm.com>,
"Stanislaw Kardach" <kda@semihalf.com>
Subject: RE: [RFC v2] non-temporal memcpy
Date: Sat, 30 Jul 2022 11:51:17 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D87214@smartserver.smartshare.dk>
In-Reply-To: <750f172a82014660b16e434a722f04d9@huawei.com>
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Saturday, 30 July 2022 00.00
>
> > > > > Actually, one question I have for such a small data transfer
> > > > > (16B per packet) - do you still see some noticeable performance
> > > > > improvement in such a scenario?
> > > >
> > > > Copying 16 bytes from each packet in a burst of 32 packets would
> > > > otherwise pollute 64 cache lines = 4 KB of cache. With a typical
> > > > 64 KB L1 cache, I think it makes a difference.
> > >
> > > I understand the intention behind it; my question was - is it really
> > > measurable?
> > > Something like: using pktmbuf_copy_nt(len=16) instead of
> > > pktmbuf_copy(len=16) on workload X gives Y% throughput improvement?
> >
> > If the application is complex enough, and needs some of those 4 KB of
> > cache otherwise wasted, there will be a significant throughput
> > improvement; otherwise probably not.
> >
> > I have a general problem with this type of question: I hate that
> > throughput is the only KPI (Key Performance Indicator) getting any
> > attention on the mailing list! Other KPIs, such as latency and
> > resource conservation, are just as important in many real-life use
> > cases.
>
> Well, I suppose that sort of question is expected for a patch that
> introduces a performance optimization:
> what is the benefit we expect to get, and is it worth the effort?
> A throughput or latency improvement seems like the obvious choice here.
> About resource conservation - if the patch aims to improve cache
> consumption, then on some cache-bound workloads it should result in a
> throughput improvement, correct?
The benefit is cache conservation.
Copying a burst of 32 packets of 1518 bytes each - using memcpy() - pollutes the entire L1 cache. Not trashing the entire L1 cache - using memcpy_nt() instead - should provide derived benefits in latency and/or throughput for most applications that copy entire packets.
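To make it concrete, here is a minimal sketch of the kind of non-temporal copy I have in mind, assuming x86 SSE2, a 16-byte aligned destination and a length that is a multiple of 16. The name copy_nt_sketch() is only illustrative, not the proposed API:

#include <emmintrin.h>
#include <stddef.h>

static inline void
copy_nt_sketch(void *dst, const void *src, size_t len)
{
        char *d = dst;
        const char *s = src;
        size_t i;

        for (i = 0; i < len; i += 16) {
                /* Regular load; the source data is often wanted in cache anyway. */
                __m128i block = _mm_loadu_si128((const __m128i *)(s + i));

                /* Streaming store: goes through the WC buffers, bypassing the cache. */
                _mm_stream_si128((__m128i *)(d + i), block);
        }

        /* Order the streaming stores before the copied data is consumed. */
        _mm_sfence();
}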
>
> >
> > Here's a number for you: 6.25 % reduction in L1 data cache
> > consumption. (Assuming 64 KB L1 cache with 64 byte cache lines and
> > application burst length of 32 packets.)
>
> I understand that it should reduce the cache eviction rate.
> The thing is that non-temporal stores are not free either: they consume
> WC buffers and some memory-bus bandwidth.
> AFAIK, for 16B non-consecutive NT stores, it means that only 25% of the
> WC buffer capacity will be used, and in theory it might lead to extra
> memory pressure and worse performance in general.
I'm not a CPU expert, so I wonder if it makes any difference whether the 16B non-consecutive store is non-temporal or normal... intuitively, the need to use a WC buffer and memory-bus bandwidth seems similar to me.
Also, my 16B example might be a bit silly... I used it to argue for the execution performance cost of omitting the alignment hints (the added compares and branches). I suppose most NT copies will be packets, so mostly 64 or 1518 byte copies.
And in our application, 16 bytes of metadata are also copied to the front of each packet. The copied packet follows immediately after the 16B metadata, so perhaps I should try to find a way to make these stores consecutive. Feature creep? ;-)
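For illustration only, the idea could look something like this hypothetical helper, where the 16 B metadata and the first 48 B of the packet are written back-to-back so the four streaming stores fill exactly one 64-byte cache line (dst assumed 64-byte aligned; same includes as the sketch above; name and layout are just an example, not a proposal):

static inline void
copy_meta_and_pkt_head_nt(void *dst, const void *meta, const void *pkt)
{
        char *d = dst;
        const char *p = pkt;

        _mm_stream_si128((__m128i *)d,
                _mm_loadu_si128((const __m128i *)meta));
        _mm_stream_si128((__m128i *)(d + 16),
                _mm_loadu_si128((const __m128i *)p));
        _mm_stream_si128((__m128i *)(d + 32),
                _mm_loadu_si128((const __m128i *)(p + 16)));
        _mm_stream_si128((__m128i *)(d + 48),
                _mm_loadu_si128((const __m128i *)(p + 32)));

        /* The rest of the packet continues at dst + 64, still cache-line
         * aligned, so the main copy loop can keep emitting full-line stores. */
}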
> In fact, the IA manuals explicitly recommend avoiding partial cache-line
> writes whenever possible.
> Now, I don't know what would be more expensive in that case: re-filling
> extra cache lines, or the extra partial-write memory transactions.
Good input, Konstantin. I will take this into consideration when optimizing the copy loops.
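Just to show the direction I am thinking of, here is a rough sketch of a copy loop that avoids partial cache-line NT writes: regular stores for the unaligned head and the tail, streaming stores only for whole 64-byte lines. (Assumes x86 SSE2; uses the includes from the first sketch plus <stdint.h> and <string.h>.)

static inline void
copy_nt_lines_sketch(void *dst, const void *src, size_t len)
{
        char *d = dst;
        const char *s = src;
        size_t head = (64 - ((uintptr_t)d & 63)) & 63;
        size_t i;

        /* Unaligned head: regular stores, so no partial-line NT writes. */
        if (head > len)
                head = len;
        memcpy(d, s, head);
        d += head;
        s += head;
        len -= head;

        /* Body: whole 64-byte cache lines with streaming stores only. */
        while (len >= 64) {
                for (i = 0; i < 64; i += 16)
                        _mm_stream_si128((__m128i *)(d + i),
                                _mm_loadu_si128((const __m128i *)(s + i)));
                d += 64;
                s += 64;
                len -= 64;
        }
        _mm_sfence();

        /* Tail: regular stores again. */
        memcpy(d, s, len);
}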
> That's why I asked for some performance numbers here.
>
Got it.
The benefit of the patch is to avoid data cache pollution, and we agree about this.
So, to consider the other side of the coin, i.e. the potentially degraded memory copy throughput, I will measure both the NT copy and the normal copy using our application's packet capture feature, and provide both performance numbers.
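The measurement itself will be something along these lines - TSC cycles around each variant of the capture path. capture_burst_nt() and capture_burst() are placeholders for our application's own capture functions (using memcpy_nt() and rte_memcpy() respectively), not existing APIs:

#include <rte_cycles.h>

uint64_t t0, cycles_nt, cycles_std;

t0 = rte_rdtsc();
capture_burst_nt(mbufs, nb_pkts);  /* capture path using memcpy_nt() */
cycles_nt = rte_rdtsc() - t0;

t0 = rte_rdtsc();
capture_burst(mbufs, nb_pkts);     /* capture path using rte_memcpy() */
cycles_std = rte_rdtsc() - t0;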