DPDK patches and discussions
* [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
@ 2022-06-15  7:16 bugzilla
  2022-06-15 14:40 ` Morten Brørup
  2022-10-10 10:40 ` bugzilla
  0 siblings, 2 replies; 74+ messages in thread
From: bugzilla @ 2022-06-15  7:16 UTC (permalink / raw)
  To: dev

https://bugs.dpdk.org/show_bug.cgi?id=1035

            Bug ID: 1035
           Summary: __rte_raw_cksum() crash with misaligned pointer
           Product: DPDK
           Version: 21.11
          Hardware: All
                OS: All
            Status: UNCONFIRMED
          Severity: normal
          Priority: Normal
         Component: ethdev
          Assignee: dev@dpdk.org
          Reporter: emil.berg@ericsson.com
  Target Milestone: ---

See rte_raw_cksum() in rte_ip.h, which is part of the public API. See also the
subfunction __rte_raw_cksum().

__rte_raw_cksum() assumes that the buffer over which the checksum is calculated
starts at an even address (divisible by two). See for example this Stack Overflow post:
https://stackoverflow.com/questions/46790550/c-undefined-behavior-strict-aliasing-rule-or-incorrect-alignment

The post explains that there is undefined behavior in C11 when "conversion
between two pointer types produces a result that is incorrectly aligned". When
the buf argument starts on an odd address we thus have undefined behavior,
since a pointer is cast from void* to uint16_t*.
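
The pattern in question can be sketched as follows. This is a minimal
illustration and not the DPDK source; the names load_u16_cast() and
load_u16_memcpy() are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Problematic pattern: the pointer conversion is undefined when p is odd. */
static inline uint16_t
load_u16_cast(const void *p)
{
	return *(const uint16_t *)p;
}

/* Alignment-safe alternative: memcpy() has no alignment requirement. */
static inline uint16_t
load_u16_memcpy(const void *p)
{
	uint16_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* Usage: read a 16-bit value that starts one byte into a byte array. */
static inline void
load_u16_example(void)
{
	const uint8_t bytes[3] = { 0xaa, 0x11, 0x22 };
	uint16_t aligned_copy;

	memcpy(&aligned_copy, &bytes[1], sizeof(aligned_copy));
	/* Same value either way, but only the memcpy() load may take &bytes[1]. */
	assert(load_u16_memcpy(&bytes[1]) == load_u16_cast(&aligned_copy));
}
```

Compilers commonly turn the fixed-size memcpy() into a plain load, so the safe
variant need not be slower, though that is worth checking per target.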

In most cases (at least on x86) that isn't a problem, but with higher
optimization levels it may break due to vector instructions. The newer version
of this function seems to be easier for the compiler to optimize, resulting in
a crash when the buf argument is odd. Please note that the undefined behavior
is present in earlier versions of DPDK as well.

Now you're probably thinking: "Just align your buffers". The problem is that we
have a packet buffer which is aligned. The checksum is calculated on a subset
of that aligned packet buffer, and that sometimes lies on odd addresses.

The question remains whether this is an issue with DPDK or not. Maybe you
assume that odd addresses are never passed in when calculating the checksum,
and that's all right. But in that case a comment should perhaps be added to the
public API. Or perhaps DPDK itself also passes in odd addresses, and in that
case something should be done about it.

Triggering this bug seems to require high optimization levels as well as odd
values of buf.

It may be of interest that in an older version of DPDK you added a comment: /*
workaround gcc strict-aliasing warning */ showing that you willfully ignored
the warning.

-- 
You are receiving this mail because:
You are the assignee for the bug.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
  2022-06-15  7:16 [Bug 1035] __rte_raw_cksum() crash with misaligned pointer bugzilla
@ 2022-06-15 14:40 ` Morten Brørup
  2022-06-16  5:44   ` Emil Berg
  2022-10-10 10:40 ` bugzilla
  1 sibling, 1 reply; 74+ messages in thread
From: Morten Brørup @ 2022-06-15 14:40 UTC (permalink / raw)
  To: emil.berg, bugzilla; +Cc: dev

> From: bugzilla@dpdk.org [mailto:bugzilla@dpdk.org]
> Sent: Wednesday, 15 June 2022 09.16
> 
> https://bugs.dpdk.org/show_bug.cgi?id=1035
> 
>             Bug ID: 1035
>            Summary: __rte_raw_cksum() crash with misaligned pointer
>            Product: DPDK
>            Version: 21.11
>           Hardware: All
>                 OS: All
>             Status: UNCONFIRMED
>           Severity: normal
>           Priority: Normal
>          Component: ethdev
>           Assignee: dev@dpdk.org
>           Reporter: emil.berg@ericsson.com
>   Target Milestone: ---
> 
> See rte_raw_cksum() in rte_ip.h, which is part of the public API. See
> also the
> subfunction __rte_raw_cksum().
> 
> _rte_raw_cksum assumes that the buffer over which the checksum is
> calculated is
> an even address (divisible by two). See for example this stack overflow
> post:
> https://stackoverflow.com/questions/46790550/c-undefined-behavior-strict-aliasing-rule-or-incorrect-alignment
> 
> The post explains that there is undefined behavior in C11 when
> "conversion
> between two pointer types produces a result that is incorrectly
> aligned". When
> the buf argument starts on an odd address we thus have undefined
> behavior,
> since a pointer is cast from void* to uint16_t*.
> 
> In most cases (at least on x86) that isn't a problem, but with higher
> optimization levels it may break due to vector instructions. This new
> function
> seems to be easier to optimize by the compiler, resulting in a crash
> when the
> buf argument is odd. Please note that the undefined behavior is present
> in
> earlier versions of dpdk as well.
> 
> Now you're probably thinking: "Just align your buffers". The problem is
> that we
> have a packet buffer which is aligned. The checksum is calculated on a
> subset
> of that aligned packet buffer, and that sometimes lies on odd
> addresses.
> 
> The question remains if this is an issue with dpdk or not.

I can imagine other systems doing what you describe too. So it needs to be addressed.

Off the top of my head, an easy fix would be updating __rte_raw_cksum() like this:

static inline uint32_t
__rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
{
	if (likely((buf & 1) == 0)) {
		/* The buffer is 16 bit aligned. */
		Keep the existing, optimized implementation here.
	} else {
		/* The buffer is not 16 bit aligned. */
		Add a new odd-buf tolerant implementation here.
	}
}

However, I'm not sure that it covers your scenario!

The checksum is 16 bit wide, so if you calculate the checksum of e.g. 4 bytes of memory starting at offset 1 in a 6 byte packet buffer, the memory block can be treated as either 4 or 6 bytes relative to the data covered by the checksum, i.e.:

A: XX [01 02] [03 04] XX --> cksum = [04 06]

B: [XX 01] [02 03] [04 XX] --> cksum = [06 04]

Which one do you need?
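
The two interpretations can be checked with a small sketch. It assumes the XX
bytes are zero and builds each 16-bit word explicitly in network (big-endian)
order so the printed values match the byte pairs above; sum16_be() is a
hypothetical helper, not a DPDK function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: sum a byte string as big-endian 16-bit words,
 * zero-padding an odd trailing byte, and fold the carries into 16 bits.
 */
static uint16_t
sum16_be(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (i < len)
		sum += (uint32_t)p[i] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

static void
option_a_vs_b_example(void)
{
	/* The 6 byte packet buffer, with the XX bytes set to zero. */
	const uint8_t pkt[6] = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x00 };

	/* A: words start at the first covered byte: [01 02] [03 04]. */
	assert(sum16_be(&pkt[1], 4) == 0x0406);
	/* B: words start at the packet start: [00 01] [02 03] [04 00]. */
	assert(sum16_be(&pkt[0], 6) == 0x0604);
}
```

The two results are byte swaps of each other, which is what makes the choice
between A and B matter when the partial sum is combined with other sums.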

Perhaps an additional function is required to support your use case, and the documentation for rte_raw_cksum() and __rte_raw_cksum() needs to reflect that the buffer must be 16 bit aligned.

Or the rte_raw_cksum() function can be modified to support an odd buffer pointer as outlined above, with documentation added about alignment of the running checksum.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
  2022-06-15 14:40 ` Morten Brørup
@ 2022-06-16  5:44   ` Emil Berg
  2022-06-16  6:27     ` Morten Brørup
  2022-06-16  6:32     ` Emil Berg
  0 siblings, 2 replies; 74+ messages in thread
From: Emil Berg @ 2022-06-16  5:44 UTC (permalink / raw)
  To: Morten Brørup, bugzilla; +Cc: dev

Hi!

We want the B option, i.e. the 6 bytes option. Perhaps adding alignment detection to __rte_raw_cksum() is a good idea.

A minor comment: I think buf & 1 won't compile, since buf is a pointer and not an integer type, but something along those lines should work.
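
One way to express the alignment test is to convert the pointer to uintptr_t
first. This is only a sketch; is_16bit_aligned() is a hypothetical name, and it
assumes the usual flat address space where the low bit of the converted
pointer reflects the byte address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when buf sits on an even address (assumption: flat addresses). */
static inline bool
is_16bit_aligned(const void *buf)
{
	return ((uintptr_t)buf & 1) == 0;
}

/* Usage: a uint16_t array is 16 bit aligned; one byte further in is not. */
static inline void
is_16bit_aligned_example(void)
{
	uint16_t words[2] = { 0, 0 };
	const uint8_t *bytes = (const uint8_t *)words;

	assert(is_16bit_aligned(bytes));
	assert(!is_16bit_aligned(bytes + 1));
}
```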

I'm starting to think about an efficient way to do this.

Thank you!


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
  2022-06-16  5:44   ` Emil Berg
@ 2022-06-16  6:27     ` Morten Brørup
  2022-06-16  6:32     ` Emil Berg
  1 sibling, 0 replies; 74+ messages in thread
From: Morten Brørup @ 2022-06-16  6:27 UTC (permalink / raw)
  To: Emil Berg, bugzilla; +Cc: dev, Olivier Matz

+CC Olivier Matz <olivier.matz@6wind.com>, Network Headers maintainer

> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: den 15 juni 2022 16:41
> 
> > From: bugzilla@dpdk.org [mailto:bugzilla@dpdk.org]
> > Sent: Wednesday, 15 June 2022 09.16
> >
> > https://bugs.dpdk.org/show_bug.cgi?id=1035
> >
> >             Bug ID: 1035
> >            Summary: __rte_raw_cksum() crash with misaligned pointer
> >            Product: DPDK
> >            Version: 21.11
> >           Hardware: All
> >                 OS: All
> >             Status: UNCONFIRMED
> >           Severity: normal
> >           Priority: Normal
> >          Component: ethdev
> >           Assignee: dev@dpdk.org
> >           Reporter: emil.berg@ericsson.com
> >   Target Milestone: ---
> >
> > See rte_raw_cksum() in rte_ip.h, which is part of the public API. See
> > also the subfunction __rte_raw_cksum().
> >
> > _rte_raw_cksum assumes that the buffer over which the checksum is
> > calculated is an even address (divisible by two). See for example
> this
> > stack overflow
> > post:
> > https://stackoverflow.com/questions/46790550/c-undefined-behavior-strict-aliasing-rule-or-incorrect-alignment
> >
> > The post explains that there is undefined behavior in C11 when
> > "conversion between two pointer types produces a result that is
> > incorrectly aligned". When the buf argument starts on an odd address
> > we thus have undefined behavior, since a pointer is cast from void*
> to
> > uint16_t*.
> >
> > In most cases (at least on x86) that isn't a problem, but with higher
> > optimization levels it may break due to vector instructions. This new
> > function seems to be easier to optimize by the compiler, resulting in
> > a crash when the buf argument is odd. Please note that the undefined
> > behavior is present in earlier versions of dpdk as well.
> >
> > Now you're probably thinking: "Just align your buffers". The problem
> > is that we have a packet buffer which is aligned. The checksum is
> > calculated on a subset of that aligned packet buffer, and that
> > sometimes lies on odd addresses.
> >
> > The question remains if this is an issue with dpdk or not.
> 
> I can imagine other systems doing what you describe too. So it needs to
> be addressed.
> 
> Off the top of my head, an easy fix would be updating __rte_raw_cksum()
> like this:
> 
> static inline uint32_t
> __rte_raw_cksum(const void *buf, size_t len, uint32_t sum) {
> 	if (likely((buf & 1) == 0)) {
> 		/* The buffer is 16 bit aligned. */
> 		Keep the existing, optimized implementation here.
> 	} else {
> 		/* The buffer is not 16 bit aligned. */
> 		Add a new odd-buf tolerant implementation here.
> 	}
> }
> 
> However, I'm not sure that it covers your scenario!
> 
> The checksum is 16 bit wide, so if you calculate the checksum of e.g. 4
> bytes of memory starting at offset 1 in a 6 byte packet buffer, the
> memory block can be treated as either 4 or 6 bytes relative to the data
> covered by the checksum, i.e.:
> 
> A: XX [01 02] [03 04] XX --> cksum = [04 06]
> 
> B: [XX 01] [02 03] [04 XX] --> cksum = [06 04]
> 
> Which one do you need?
> 
> Perhaps an additional function is required to support your use case,
> and the documentation for rte_raw_cksum() and __rte_raw_cksum() needs
> to reflect that the buffer must be 16 bit aligned.
> 
> Or the rte_raw_cksum() function can be modified to support an odd
> buffer pointer as outlined above, with documentation added about
> alignment of the running checksum.

> -----Original Message-----
> From: Emil Berg [mailto:emil.berg@ericsson.com]
> Sent: Thursday, 16 June 2022 07.45
> 
> Hi!
> 
> We want the B option, i.e. the 6 bytes option. Perhaps adding alignment
> detection to __rte_raw_cksum() is a good idea.

With option B, the invariant that the running checksum is being calculated on a 16 bit aligned packet buffer remains unchanged. So I think that support for option B is appropriate to add to __rte_raw_cksum(), rather than adding a separate function.
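
A rough sketch of how option B could be obtained, assuming the caller knows the
offset of the data relative to the 16 bit aligned packet start: compute the sum
with alignment-safe loads, then swap the two bytes of the folded sum when the
offset is odd (RFC 1071 describes this byte swap property of the one's
complement sum). The names cksum16() and cksum16_at_offset() are hypothetical,
and this is not a proposal for the final __rte_raw_cksum() code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One's complement sum of native-endian 16-bit words, folded to 16 bits. */
static uint16_t
cksum16(const uint8_t *p, size_t len)
{
	uint32_t sum = 0; /* assumes len is small enough not to overflow */
	uint16_t word;

	for (; len >= 2; p += 2, len -= 2) {
		memcpy(&word, p, sizeof(word)); /* no alignment requirement */
		sum += word;
	}
	if (len == 1) {
		word = 0;
		memcpy(&word, p, 1); /* trailing byte, zero padded in memory */
		sum += word;
	}
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Option B: 16-bit words are counted from pkt, not from pkt + off. */
static uint16_t
cksum16_at_offset(const uint8_t *pkt, size_t off, size_t len)
{
	uint16_t sum = cksum16(pkt + off, len);

	if (off & 1)
		sum = (uint16_t)((sum << 8) | (sum >> 8));
	return sum;
}

static void
cksum16_at_offset_example(void)
{
	/* The earlier example: 4 bytes at offset 1, surrounded by zeros. */
	const uint8_t pkt[6] = { 0x00, 0x01, 0x02, 0x03, 0x04, 0x00 };

	/* Same result as summing the whole zero-padded 6 byte buffer. */
	assert(cksum16_at_offset(pkt, 1, 4) == cksum16(pkt, 6));
}
```

The swap keeps the aligned fast path untouched; only the odd-offset case pays
for the extra step.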

> 
> A minor comment but I think buf & 1 won't work since buf isn't an
> integral type, but something along that way.
> 
> I'm starting to think about an efficient way to do this.
> 
> Thank you!

Sounds good. Please CC me on your patch, when ready for review. :-)


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
  2022-06-16  5:44   ` Emil Berg
  2022-06-16  6:27     ` Morten Brørup
@ 2022-06-16  6:32     ` Emil Berg
  2022-06-16  6:44       ` Morten Brørup
  2022-06-16 14:09       ` [Bug 1035] __rte_raw_cksum() crash with misaligned pointer Mattias Rönnblom
  1 sibling, 2 replies; 74+ messages in thread
From: Emil Berg @ 2022-06-16  6:32 UTC (permalink / raw)
  To: Morten Brørup, bugzilla; +Cc: dev

I've been sketching an efficient solution to this. What about something along the lines of the code below? I've run it with the following combinations:
even buf, even len
even buf, odd len
odd buf, even len
odd buf, odd len

and it seems to give the same results as the older version of __rte_raw_cksum(), from before 21.03. I ran it without optimizations to ensure the compiler didn't insert vector instructions, so that the results were comparable.

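static inline uint32_t
__rte_raw_cksum_newest(const void *buf, size_t len, uint32_t sum)
{
	/* Byte-wise access only, so buf has no alignment requirement. */
	const uint8_t *start = (const uint8_t *)buf;
	const uint8_t *end = start + len;

	/* Bytes at offsets 1, 3, 5, ... are weighted by 256. */
	uint32_t sum_even = 0;
	for (const uint8_t *p = start + 1; p < end; p += 2) {
		sum_even += *p;
	}
	sum += sum_even << 8;

	/* Bytes at offsets 0, 2, 4, ... are weighted by 1. */
	uint32_t sum_odd = 0;
	for (const uint8_t *p = start; p < end; p += 2) {
		sum_odd += *p;
	}
	sum += sum_odd;

	return sum;
}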

/Emil


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
  2022-06-16  6:32     ` Emil Berg
@ 2022-06-16  6:44       ` Morten Brørup
  2022-06-16 13:58         ` Mattias Rönnblom
  2022-06-16 14:09       ` [Bug 1035] __rte_raw_cksum() crash with misaligned pointer Mattias Rönnblom
  1 sibling, 1 reply; 74+ messages in thread
From: Morten Brørup @ 2022-06-16  6:44 UTC (permalink / raw)
  To: Emil Berg, bugzilla; +Cc: dev, Olivier Matz

+CC Olivier Matz <olivier.matz@6wind.com>, Network Headers maintainer

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: den 15 juni 2022 16:41
> To: Emil Berg <emil.berg@ericsson.com>; bugzilla@dpdk.org
> Cc: dev@dpdk.org
> Subject: RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
> 
> > From: bugzilla@dpdk.org [mailto:bugzilla@dpdk.org]
> > Sent: Wednesday, 15 June 2022 09.16
> >
> > https://bugs.dpdk.org/show_bug.cgi?id=1035
> >
> >             Bug ID: 1035
> >            Summary: __rte_raw_cksum() crash with misaligned pointer
> >            Product: DPDK
> >            Version: 21.11
> >           Hardware: All
> >                 OS: All
> >             Status: UNCONFIRMED
> >           Severity: normal
> >           Priority: Normal
> >          Component: ethdev
> >           Assignee: dev@dpdk.org
> >           Reporter: emil.berg@ericsson.com
> >   Target Milestone: ---
> >
> > See rte_raw_cksum() in rte_ip.h, which is part of the public API. See
> > also the subfunction __rte_raw_cksum().
> >
> > _rte_raw_cksum assumes that the buffer over which the checksum is
> > calculated is an even address (divisible by two). See for example
> this
> > stack overflow
> > post:
> > https://stackoverflow.com/questions/46790550/c-undefined-behavior-strict-aliasing-rule-or-incorrect-alignment
> >
> > The post explains that there is undefined behavior in C11 when
> > "conversion between two pointer types produces a result that is
> > incorrectly aligned". When the buf argument starts on an odd address
> > we thus have undefined behavior, since a pointer is cast from void*
> to
> > uint16_t*.
> >
> > In most cases (at least on x86) that isn't a problem, but with higher
> > optimization levels it may break due to vector instructions. This new
> > function seems to be easier to optimize by the compiler, resulting in
> > a crash when the buf argument is odd. Please note that the undefined
> > behavior is present in earlier versions of dpdk as well.
> >
> > Now you're probably thinking: "Just align your buffers". The problem
> > is that we have a packet buffer which is aligned. The checksum is
> > calculated on a subset of that aligned packet buffer, and that
> > sometimes lies on odd addresses.
> >
> > The question remains if this is an issue with dpdk or not.
> 
> I can imagine other systems doing what you describe too. So it needs to
> be addressed.
> 
> Off the top of my head, an easy fix would be updating __rte_raw_cksum()
> like this:
> 
> static inline uint32_t
> __rte_raw_cksum(const void *buf, size_t len, uint32_t sum) {
> 	if (likely((buf & 1) == 0)) {
> 		/* The buffer is 16 bit aligned. */
> 		Keep the existing, optimized implementation here.
> 	} else {
> 		/* The buffer is not 16 bit aligned. */
> 		Add a new odd-buf tolerant implementation here.
> 	}
> }
> 
> However, I'm not sure that it covers your scenario!
> 
> The checksum is 16 bit wide, so if you calculate the checksum of e.g. 4
> bytes of memory starting at offset 1 in a 6 byte packet buffer, the
> memory block can be treated as either 4 or 6 bytes relative to the data
> covered by the checksum, i.e.:
> 
> A: XX [01 02] [03 04] XX --> cksum = [04 06]
> 
> B: [XX 01] [02 03] [04 XX] --> cksum = [06 04]
> 
> Which one do you need?
> 
> Perhaps an additional function is required to support your use case,
> and the documentation for rte_raw_cksum() and __rte_raw_cksum() needs
> to reflect that the buffer must be 16 bit aligned.
> 
> Or the rte_raw_cksum() function can be modified to support an odd
> buffer pointer as outlined above, with documentation added about
> alignment of the running checksum.

> -----Original Message-----
> From: Emil Berg
> Sent: den 16 juni 2022 07:45
> To: Morten Brørup <mb@smartsharesystems.com>; bugzilla@dpdk.org
> Cc: dev@dpdk.org
> Subject: RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
> 
> Hi!
> 
> We want the B option, i.e. the 6 bytes option. Perhaps adding alignment
> detection to __rte_raw_cksum() is a good idea.
> 
> A minor comment but I think buf & 1 won't work since buf isn't an
> integral type, but something along that way.
> 
> I'm starting to think about an efficient way to do this.
> 
> Thank you!

> -----Original Message-----
> From: Emil Berg [mailto:emil.berg@ericsson.com]
> Sent: Thursday, 16 June 2022 08.32
> To: Morten Brørup; bugzilla@dpdk.org
> Cc: dev@dpdk.org
> Subject: RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
> 
> I've been sketching on an efficient solution to this. What about
> something along the way below? I've run it with the combinations of:
> even buf, even len
> even buf, odd len
> odd buf, even len
> odd buf, odd len
> 
> and it seems to give the same results as the older version of
> __rte_raw_cksum, before 21.03. I ran it without optimizations and such
> to ensure the compiler didn't insert vector instructions and such so
> the results were comparable.

The performance, when using an aligned buffer, needs to be comparable with full compiler optimization, or the patch will not be accepted.

> 
> static inline uint32_t
> __rte_raw_cksum_newest(const void *buf, size_t len, uint32_t sum)
> {
> 	const uint8_t *end = buf + len;
> 
> 	uint32_t sum_even = 0;
> 	for (const uint8_t *p = buf + 1; p < end; p += 2) {
> 		sum_even += *p;
> 	}
> 	sum += sum_even << 8;
> 
> 	uint32_t sum_odd = 0;
> 	for (const uint8_t *p = buf; p < end; p += 2) {
> 		sum_odd += *p;
> 	}
> 	sum += sum_odd;
> 
> 	return sum;
> }

This function does not work on both little- and big-endian CPUs when mixed with other checksum functions.

The checksum functions read the buffer in CPU native endian and store the checksum in CPU native endian; and magically it becomes correct in network endian. Please refer to RFC 1071 for the details. Perhaps you can also find some inspiration in RFC 1141.
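
The byte order property can be illustrated with a small sketch. It sums the
same four bytes once as native-endian words and once as explicitly built
network-order words, then compares the in-memory bytes of the native result
with the network-order value. The names fold(), sum_native() and sum_network()
are hypothetical and only serve this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold the carries of a 32-bit accumulator into 16 bits. */
static uint16_t
fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Sum 16-bit words exactly as the CPU loads them (native byte order). */
static uint16_t
sum_native(const uint16_t *words, size_t n)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += words[i];
	return fold(sum);
}

/* Reference: build each word explicitly in network (big-endian) order. */
static uint16_t
sum_network(const uint8_t *bytes, size_t n)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += ((uint32_t)bytes[2 * i] << 8) | bytes[2 * i + 1];
	return fold(sum);
}

static void
byte_order_example(void)
{
	const uint8_t bytes[4] = { 0x01, 0x02, 0x03, 0x04 };
	uint16_t words[2];
	uint16_t native;
	uint8_t stored[2];

	memcpy(words, bytes, sizeof(bytes));
	native = sum_native(words, 2);
	memcpy(stored, &native, sizeof(native));

	/* The natively stored sum reads [04 06] in memory on either endianness. */
	assert(stored[0] == (sum_network(bytes, 2) >> 8));
	assert(stored[1] == (sum_network(bytes, 2) & 0xff));
}
```

A byte-wise function that always weights the second byte of each pair by 256
matches sum_native() only on little-endian CPUs, which is why it cannot be
mixed freely with the native-endian functions.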

PS: Please don't top post.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
  2022-06-16  6:44       ` Morten Brørup
@ 2022-06-16 13:58         ` Mattias Rönnblom
  2022-06-16 14:36           ` Morten Brørup
  2022-06-17  7:32           ` Morten Brørup
  0 siblings, 2 replies; 74+ messages in thread
From: Mattias Rönnblom @ 2022-06-16 13:58 UTC (permalink / raw)
  To: Morten Brørup, Emil Berg, bugzilla; +Cc: dev, Olivier Matz

On 2022-06-16 08:44, Morten Brørup wrote:
> +CC Olivier Matz <olivier.matz@6wind.com>, Network Headers maintainer
> 
>> -----Original Message-----
>> From: Morten Brørup <mb@smartsharesystems.com>
>> Sent: den 15 juni 2022 16:41
>> To: Emil Berg <emil.berg@ericsson.com>; bugzilla@dpdk.org
>> Cc: dev@dpdk.org
>> Subject: RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
>>
>>> From: bugzilla@dpdk.org [mailto:bugzilla@dpdk.org]
>>> Sent: Wednesday, 15 June 2022 09.16
>>>
>>> https://bugs.dpdk.org/show_bug.cgi?id=1035
>>>
>>>              Bug ID: 1035
>>>             Summary: __rte_raw_cksum() crash with misaligned pointer
>>>             Product: DPDK
>>>             Version: 21.11
>>>            Hardware: All
>>>                  OS: All
>>>              Status: UNCONFIRMED
>>>            Severity: normal
>>>            Priority: Normal
>>>           Component: ethdev
>>>            Assignee: dev@dpdk.org
>>>            Reporter: emil.berg@ericsson.com
>>>    Target Milestone: ---
>>>
>>> See rte_raw_cksum() in rte_ip.h, which is part of the public API. See
>>> also the subfunction __rte_raw_cksum().
>>>
>>> _rte_raw_cksum assumes that the buffer over which the checksum is
>>> calculated is an even address (divisible by two). See for example
>> this
>>> stack overflow
>>> post:
>>> https://stackoverflow.com/questions/46790550/c-undefined-behavior-strict-aliasing-rule-or-incorrect-alignment
>>>
>>> The post explains that there is undefined behavior in C11 when
>>> "conversion between two pointer types produces a result that is
>>> incorrectly aligned". When the buf argument starts on an odd address
>>> we thus have undefined behavior, since a pointer is cast from void*
>> to
>>> uint16_t*.
>>>
>>> In most cases (at least on x86) that isn't a problem, but with higher
>>> optimization levels it may break due to vector instructions. This new
>>> function seems to be easier to optimize by the compiler, resulting in
>>> a crash when the buf argument is odd. Please note that the undefined
>>> behavior is present in earlier versions of dpdk as well.
>>>
>>> Now you're probably thinking: "Just align your buffers". The problem
>>> is that we have a packet buffer which is aligned. The checksum is
>>> calculated on a subset of that aligned packet buffer, and that
>>> sometimes lies on odd addresses.
>>>
>>> The question remains if this is an issue with dpdk or not.
>>
>> I can imagine other systems doing what you describe too. So it needs to
>> be addressed.
>>
>> Off the top of my head, an easy fix would be updating __rte_raw_cksum()
>> like this:
>>
>> static inline uint32_t
>> __rte_raw_cksum(const void *buf, size_t len, uint32_t sum) {
>> 	if (likely((buf & 1) == 0)) {
>> 		/* The buffer is 16 bit aligned. */
>> 		Keep the existing, optimized implementation here.
>> 	} else {
>> 		/* The buffer is not 16 bit aligned. */
>> 		Add a new odd-buf tolerant implementation here.
>> 	}
>> }
>>
>> However, I'm not sure that it covers your scenario!
>>
>> The checksum is 16 bit wide, so if you calculate the checksum of e.g. 4
>> bytes of memory starting at offset 1 in a 6 byte packet buffer, the
>> memory block can be treated as either 4 or 6 bytes relative to the data
>> covered by the checksum, i.e.:
>>
>> A: XX [01 02] [03 04] XX --> cksum = [04 06]
>>
>> B: [XX 01] [02 03] [04 XX] --> cksum = [06 04]
>>
>> Which one do you need?
>>
>> Perhaps an additional function is required to support your use case,
>> and the documentation for rte_raw_cksum() and __rte_raw_cksum() needs
>> to reflect that the buffer must be 16 bit aligned.
>>
>> Or the rte_raw_cksum() function can be modified to support an odd
>> buffer pointer as outlined above, with documentation added about
>> alignment of the running checksum.
> 
>> -----Original Message-----
>> From: Emil Berg
>> Sent: den 16 juni 2022 07:45
>> To: Morten Brørup <mb@smartsharesystems.com>; bugzilla@dpdk.org
>> Cc: dev@dpdk.org
>> Subject: RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
>>
>> Hi!
>>
>> We want the B option, i.e. the 6 bytes option. Perhaps adding alignment
>> detection to __rte_raw_cksum() is a good idea.
>>
>> A minor comment but I think buf & 1 won't work since buf isn't an
>> integral type, but something along that way.
>>
>> I'm starting to think about an efficient way to do this.
>>
>> Thank you!
> 
>> -----Original Message-----
>> From: Emil Berg [mailto:emil.berg@ericsson.com]
>> Sent: Thursday, 16 June 2022 08.32
>> To: Morten Brørup; bugzilla@dpdk.org
>> Cc: dev@dpdk.org
>> Subject: RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
>>
>> I've been sketching on an efficient solution to this. What about
>> something along the way below? I've run it with the combinations of:
>> even buf, even len
>> even buf, odd len
>> odd buf, even len
>> odd buf, odd len
>>
>> and it seems to give the same results as the older version of
>> __rte_raw_cksum, before 21.03. I ran it without optimizations and such
>> to ensure the compiler didn't insert vector instructions and such so
>> the results were comparable.
> 
> The performance, when using an aligned buffer, needs to be comparable with full compiler optimization, or the patch will not be accepted.
> 
I think the question is: does rte_raw_cksum() have any alignment 
requirements, from an API contract point of view? The documentation says 
nothing about any such. In that case, it seems reasonable to me to 
assume that there are none.

>>
>> static inline uint32_t
>> __rte_raw_cksum_newest(const void *buf, size_t len, uint32_t sum)
>> {
>> 	const uint8_t *end = buf + len;
>>
>> 	uint32_t sum_even = 0;
>> 	for (const uint8_t *p = buf + 1; p < end; p += 2) {
>> 		sum_even += *p;
>> 	}
>> 	sum += sum_even << 8;
>>
>> 	uint32_t sum_odd = 0;
>> 	for (const uint8_t *p = buf; p < end; p += 2) {
>> 		sum_odd += *p;
>> 	}
>> 	sum += sum_odd;
>>
>> 	return sum;
>> }
> 
> This function does not work on both little and big endian, when mixed with other checksum functions.
> 
> The checksum functions read the buffer in CPU native byte order and store the checksum in CPU native byte order; and magically it becomes correct in network byte order. Please refer to RFC 1071 for the details. Perhaps you can also find some inspiration in RFC 1141.
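
The byte order property from RFC 1071 mentioned above can be checked with a small sketch. All helper names here (fold16, sum_words_be, sum_words_le, swap16) are hypothetical and not part of DPDK, and the 32-bit accumulator assumes fewer than 65536 words. The sketch sums the same bytes once as big endian words and once as little endian words; after folding, the two results should be byte swaps of each other, which is why summing in native order works on both kinds of CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones'-complement sum. */
static uint16_t
fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Sum nwords 16-bit words, reading each one as big endian. */
static uint16_t
sum_words_be(const uint8_t *p, size_t nwords)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < nwords; i++)
		sum += ((uint32_t)p[2 * i] << 8) | p[2 * i + 1];
	return fold16(sum);
}

/* Sum nwords 16-bit words, reading each one as little endian. */
static uint16_t
sum_words_le(const uint8_t *p, size_t nwords)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < nwords; i++)
		sum += ((uint32_t)p[2 * i + 1] << 8) | p[2 * i];
	return fold16(sum);
}

/* Swap the two bytes of a 16-bit value. */
static uint16_t
swap16(uint16_t v)
{
	return (uint16_t)(((uint32_t)v << 8) | (v >> 8));
}
```

For any buffer, swap16(sum_words_le(p, n)) is expected to equal sum_words_be(p, n), because the end-around carry makes the ones'-complement sum insensitive to which byte of the word is treated as the high one.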
> 
> PS: Please don't top post.
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
  2022-06-16  6:32     ` Emil Berg
  2022-06-16  6:44       ` Morten Brørup
@ 2022-06-16 14:09       ` Mattias Rönnblom
  1 sibling, 0 replies; 74+ messages in thread
From: Mattias Rönnblom @ 2022-06-16 14:09 UTC (permalink / raw)
  To: Emil Berg, Morten Brørup, bugzilla; +Cc: dev

On 2022-06-16 08:32, Emil Berg wrote:
> I've been sketching on an efficient solution to this. What about something along the way below? I've run it with the combinations of:
> even buf, even len
> even buf, odd len
> odd buf, even len
> odd buf, odd len
> 
> and it seems to give the same results as the older version of __rte_raw_cksum, before 21.03. I ran it without optimizations and such to ensure the compiler didn't insert vector instructions and such so the results were comparable.
> 

...but you *want* the compiler to vectorize this code. There's much to 
gain, and it can likely be done also in the non-aligned case. What you 
don't want is for the compiler to assume the data is 16-bit aligned (and 
output SIMD load/store instructions which require alignment).

I don't see why you just can't take the current implementation, and 
replace the direct assignment ("*u16_buf") with a temporary variable, 
and a memcpy(). This also eliminates the need for the may_alias 
attribute (at least on the u16_buf pointer).
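
A rough sketch of the memcpy() idea described above, for illustration only. This is not the actual DPDK code: the name raw_cksum_memcpy is hypothetical, and the handling of an odd trailing byte (zero-padded into a 16-bit word) is an assumption. It only removes the misaligned uint16_t access; it does not decide whether an odd start address should follow example A or example B from earlier in the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; a sketch, not the DPDK implementation. */
static inline uint32_t
raw_cksum_memcpy(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;
	/* end of the last complete 16-bit word */
	const unsigned char *end = p + (len & ~(size_t)1);

	for (; p != end; p += sizeof(uint16_t)) {
		uint16_t v;

		/* memcpy() makes no alignment assumption about p */
		memcpy(&v, p, sizeof(v));
		sum += v;
	}

	/* odd length: the last byte becomes the first byte of a zero-padded word */
	if (len & 1) {
		uint16_t left = 0;

		memcpy(&left, end, 1);
		sum += left;
	}

	return sum;
}
```

Compilers typically turn a fixed-size two-byte memcpy() into a plain load, so the loop should remain a candidate for vectorization, but with unaligned vector loads. That would need to be confirmed by looking at the generated code for the targets of interest.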


> static inline uint32_t
> __rte_raw_cksum_newest(const void *buf, size_t len, uint32_t sum)
> {
> 	const uint8_t *end = buf + len;
> 
> 	uint32_t sum_even = 0;
> 	for (const uint8_t *p = buf + 1; p < end; p += 2) {
> 		sum_even += *p;
> 	}
> 	sum += sum_even << 8;
> 
> 	uint32_t sum_odd = 0;
> 	for (const uint8_t *p = buf; p < end; p += 2) {
> 		sum_odd += *p;
> 	}
> 	sum += sum_odd;
> 
> 	return sum;
> }
> 
> /Emil
> 
> -----Original Message-----
> From: Emil Berg
> Sent: den 16 juni 2022 07:45
> To: Morten Brørup <mb@smartsharesystems.com>; bugzilla@dpdk.org
> Cc: dev@dpdk.org
> Subject: RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
> 
> Hi!
> 
> We want the B option, i.e. the 6 bytes option. Perhaps adding alignment detection to __rte_raw_cksum() is a good idea.
> 
> A minor comment but I think buf & 1 won't work since buf isn't an integral type, but something along that way.
> 
> I'm starting to think about an efficient way to do this.
> 
> Thank you!
> 
> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: den 15 juni 2022 16:41
> To: Emil Berg <emil.berg@ericsson.com>; bugzilla@dpdk.org
> Cc: dev@dpdk.org
> Subject: RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
> 
>> From: bugzilla@dpdk.org [mailto:bugzilla@dpdk.org]
>> Sent: Wednesday, 15 June 2022 09.16
>>
>> https://bugs.dpdk.org/show_bug.cgi?id=1035
>>
>>              Bug ID: 1035
>>             Summary: __rte_raw_cksum() crash with misaligned pointer
>>             Product: DPDK
>>             Version: 21.11
>>            Hardware: All
>>                  OS: All
>>              Status: UNCONFIRMED
>>            Severity: normal
>>            Priority: Normal
>>           Component: ethdev
>>            Assignee: dev@dpdk.org
>>            Reporter: emil.berg@ericsson.com
>>    Target Milestone: ---
>>
>> See rte_raw_cksum() in rte_ip.h, which is part of the public API. See
>> also the subfunction __rte_raw_cksum().
>>
>> _rte_raw_cksum assumes that the buffer over which the checksum is
>> calculated is an even address (divisible by two). See for example this
>> stack overflow
>> post:
>> https://stackoverflow.com/questions/46790550/c-undefined-behavior-
>> strict-aliasing-rule-or-incorrect-alignment
>>
>> The post explains that there is undefined behavior in C11 when
>> "conversion between two pointer types produces a result that is
>> incorrectly aligned". When the buf argument starts on an odd address
>> we thus have undefined behavior, since a pointer is cast from void* to
>> uint16_t*.
>>
>> In most cases (at least on x86) that isn't a problem, but with higher
>> optimization levels it may break due to vector instructions. This new
>> function seems to be easier to optimize by the compiler, resulting in
>> a crash when the buf argument is odd. Please note that the undefined
>> behavior is present in earlier versions of dpdk as well.
>>
>> Now you're probably thinking: "Just align your buffers". The problem
>> is that we have a packet buffer which is aligned. The checksum is
>> calculated on a subset of that aligned packet buffer, and that
>> sometimes lies on odd addresses.
>>
>> The question remains if this is an issue with dpdk or not.
> 
> I can imagine other systems doing what you describe too. So it needs to be addressed.
> 
> Off the top of my head, an easy fix would be updating __rte_raw_cksum() like this:
> 
> static inline uint32_t
> __rte_raw_cksum(const void *buf, size_t len, uint32_t sum) {
> 	if (likely((buf & 1) == 0)) {
> 		/* The buffer is 16 bit aligned. */
> 		Keep the existing, optimized implementation here.
> 	} else {
> 		/* The buffer is not 16 bit aligned. */
> 		Add a new odd-buf tolerant implementation here.
> 	}
> }
> 
> However, I'm not sure that it covers your scenario!
> 
> The checksum is 16 bit wide, so if you calculate the checksum of e.g. 4 bytes of memory starting at offset 1 in a 6 byte packet buffer, the memory block can be treated as either 4 or 6 bytes relative to the data covered by the checksum, i.e.:
> 
> A: XX [01 02] [03 04] XX --> cksum = [04 06]
> 
> B: [XX 01] [02 03] [04 XX] --> cksum = [06 04]
> 
> Which one do you need?
> 
> Perhaps an additional function is required to support your use case, and the documentation for rte_raw_cksum() and __rte_raw_cksum() needs to reflect that the buffer must be 16 bit aligned.
> 
> Or the rte_raw_cksum() function can be modified to support an odd buffer pointer as outlined above, with documentation added about alignment of the running checksum.
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
  2022-06-16 13:58         ` Mattias Rönnblom
@ 2022-06-16 14:36           ` Morten Brørup
  2022-06-17  7:32           ` Morten Brørup
  1 sibling, 0 replies; 74+ messages in thread
From: Morten Brørup @ 2022-06-16 14:36 UTC (permalink / raw)
  To: Mattias Rönnblom, Emil Berg, bugzilla; +Cc: dev, Olivier Matz

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Thursday, 16 June 2022 15.58
> 
> On 2022-06-16 08:44, Morten Brørup wrote:
> > +CC Olivier Matz <olivier.matz@6wind.com>, Network Headers maintainer
> >
> >> -----Original Message-----
> >> From: Morten Brørup <mb@smartsharesystems.com>
> >> Sent: den 15 juni 2022 16:41
> >> To: Emil Berg <emil.berg@ericsson.com>; bugzilla@dpdk.org
> >> Cc: dev@dpdk.org
> >> Subject: RE: [Bug 1035] __rte_raw_cksum() crash with misaligned
> pointer
> >>
> >>> From: bugzilla@dpdk.org [mailto:bugzilla@dpdk.org]
> >>> Sent: Wednesday, 15 June 2022 09.16
> >>>
> >>> https://bugs.dpdk.org/show_bug.cgi?id=1035
> >>>
> >>>              Bug ID: 1035
> >>>             Summary: __rte_raw_cksum() crash with misaligned
> pointer
> >>>             Product: DPDK
> >>>             Version: 21.11
> >>>            Hardware: All
> >>>                  OS: All
> >>>              Status: UNCONFIRMED
> >>>            Severity: normal
> >>>            Priority: Normal
> >>>           Component: ethdev
> >>>            Assignee: dev@dpdk.org
> >>>            Reporter: emil.berg@ericsson.com
> >>>    Target Milestone: ---
> >>>
> >>> See rte_raw_cksum() in rte_ip.h, which is part of the public API.
> See
> >>> also the subfunction __rte_raw_cksum().
> >>>
> >>> _rte_raw_cksum assumes that the buffer over which the checksum is
> >>> calculated is an even address (divisible by two). See for example
> >> this
> >>> stack overflow
> >>> post:
> >>> https://stackoverflow.com/questions/46790550/c-undefined-behavior-
> >>> strict-aliasing-rule-or-incorrect-alignment
> >>>
> >>> The post explains that there is undefined behavior in C11 when
> >>> "conversion between two pointer types produces a result that is
> >>> incorrectly aligned". When the buf argument starts on an odd
> address
> >>> we thus have undefined behavior, since a pointer is cast from void*
> >> to
> >>> uint16_t*.
> >>>
> >>> In most cases (at least on x86) that isn't a problem, but with
> higher
> >>> optimization levels it may break due to vector instructions. This
> new
> >>> function seems to be easier to optimize by the compiler, resulting
> in
> >>> a crash when the buf argument is odd. Please note that the
> undefined
> >>> behavior is present in earlier versions of dpdk as well.
> >>>
> >>> Now you're probably thinking: "Just align your buffers". The
> problem
> >>> is that we have a packet buffer which is aligned. The checksum is
> >>> calculated on a subset of that aligned packet buffer, and that
> >>> sometimes lies on odd addresses.
> >>>
> >>> The question remains if this is an issue with dpdk or not.
> >>
> >> I can imagine other systems doing what you describe too. So it needs
> to
> >> be addressed.
> >>
> >> Off the top of my head, an easy fix would be updating
> __rte_raw_cksum()
> >> like this:
> >>
> >> static inline uint32_t
> >> __rte_raw_cksum(const void *buf, size_t len, uint32_t sum) {
> >> 	if (likely((buf & 1) == 0)) {
> >> 		/* The buffer is 16 bit aligned. */
> >> 		Keep the existing, optimized implementation here.
> >> 	} else {
> >> 		/* The buffer is not 16 bit aligned. */
> >> 		Add a new odd-buf tolerant implementation here.
> >> 	}
> >> }
> >>
> >> However, I'm not sure that it covers your scenario!
> >>
> >> The checksum is 16 bit wide, so if you calculate the checksum of
> e.g. 4
> >> bytes of memory starting at offset 1 in a 6 byte packet buffer, the
> >> memory block can be treated as either 4 or 6 bytes relative to the
> data
> >> covered by the checksum, i.e.:
> >>
> >> A: XX [01 02] [03 04] XX --> cksum = [04 06]
> >>
> >> B: [XX 01] [02 03] [04 XX] --> cksum = [06 04]
> >>
> >> Which one do you need?
> >>
> >> Perhaps an additional function is required to support your use case,
> >> and the documentation for rte_raw_cksum() and __rte_raw_cksum()
> needs
> >> to reflect that the buffer must be 16 bit aligned.
> >>
> >> Or the rte_raw_cksum() function can be modified to support an odd
> >> buffer pointer as outlined above, with documentation added about
> >> alignment of the running checksum.
> >
> >> -----Original Message-----
> >> From: Emil Berg
> >> Sent: den 16 juni 2022 07:45
> >> To: Morten Brørup <mb@smartsharesystems.com>; bugzilla@dpdk.org
> >> Cc: dev@dpdk.org
> >> Subject: RE: [Bug 1035] __rte_raw_cksum() crash with misaligned
> pointer
> >>
> >> Hi!
> >>
> >> We want the B option, i.e. the 6 bytes option. Perhaps adding
> alignment
> >> detection to __rte_raw_cksum() is a good idea.
> >>
> >> A minor comment but I think buf & 1 won't work since buf isn't an
> >> integral type, but something along that way.
> >>
> >> I'm starting to think about an efficient way to do this.
> >>
> >> Thank you!
> >
> >> -----Original Message-----
> >> From: Emil Berg [mailto:emil.berg@ericsson.com]
> >> Sent: Thursday, 16 June 2022 08.32
> >> To: Morten Brørup; bugzilla@dpdk.org
> >> Cc: dev@dpdk.org
> >> Subject: RE: [Bug 1035] __rte_raw_cksum() crash with misaligned
> pointer
> >>
> >> I've been sketching on an efficient solution to this. What about
> >> something along the way below? I've run it with the combinations of:
> >> even buf, even len
> >> even buf, odd len
> >> odd buf, even len
> >> odd buf, odd len
> >>
> >> and it seems to give the same results as the older version of
> >> __rte_raw_cksum, before 21.03. I ran it without optimizations and
> such
> >> to ensure the compiler didn't insert vector instructions and such so
> >> the results were comparable.
> >
> > The performance, when using an aligned buffer, needs to be comparable
> with full compiler optimization, or the patch will not be accepted.
> >
> I think the question is: does rte_raw_cksum() have any alignment
> requirements, from an API contract point of view? The documentation
> says
> nothing about any such. In that case, it seems reasonable to me to
> assume that there are none.

The packet buffer must be 16 bit aligned. Many structures in DPDK are designed with this invariant, e.g. the Ethernet header (struct rte_ether_hdr) [1].

I agree that the documentation could mention this invariant in more places.

When calculating the checksum of a part of a packet buffer, the function should allow that part to start at a non-16 bit aligned address. This is what Emil needs, and it is what I suggest adding support for in this function.

Regarding any implementation suggestions, please refer to my examples A and B above: The running checksum for a partial buffer of 4 bytes differs, depending on how you calculate it. The checksum calculation must see the buffer as part of the packet buffer, i.e. aligned with the (16 bit aligned) packet buffer and the (16 bit aligned) running checksum, as described by example B.
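
To make the difference between examples A and B concrete, here is a small byte-wise sketch. The function name cksum_at_offset and its offset parameter are hypothetical and only serve as illustration; this is not a proposed implementation. Each byte is added as the high or low byte of a 16-bit word depending on its position relative to the start of the (16 bit aligned) packet buffer, and the result is returned as a number whose high byte corresponds to the first byte of the bracketed checksum in the examples.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: ones'-complement sum of len bytes at data, where
 * offset is the position of data[0] relative to the packet buffer start.
 */
static uint16_t
cksum_at_offset(const uint8_t *data, size_t len, size_t offset)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		/* even packet position: first byte of a word; odd: second byte */
		if (((offset + i) & 1) == 0)
			sum += (uint64_t)data[i] << 8;
		else
			sum += data[i];
	}

	/* fold the carries back into 16 bits */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}
```

With the bytes 01 02 03 04, an offset of 0 gives 0x0406, matching example A, and an offset of 1 gives 0x0604, matching example B.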

[1] https://elixir.bootlin.com/dpdk/latest/source/lib/net/rte_ether.h#L273


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
  2022-06-16 13:58         ` Mattias Rönnblom
  2022-06-16 14:36           ` Morten Brørup
@ 2022-06-17  7:32           ` Morten Brørup
  2022-06-17  8:45             ` [PATCH] net: fix checksum with unaligned buffer Morten Brørup
                               ` (3 more replies)
  1 sibling, 4 replies; 74+ messages in thread
From: Morten Brørup @ 2022-06-17  7:32 UTC (permalink / raw)
  To: Mattias Rönnblom, Emil Berg, bugzilla; +Cc: dev, Olivier Matz

> From: Morten Brørup
> Sent: Thursday, 16 June 2022 16.36
> 
> > From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > Sent: Thursday, 16 June 2022 15.58
> >
> > On 2022-06-16 08:44, Morten Brørup wrote:
> > > +CC Olivier Matz <olivier.matz@6wind.com>, Network Headers
> maintainer
> > >
> > >> -----Original Message-----
> > >> From: Morten Brørup <mb@smartsharesystems.com>
> > >> Sent: den 15 juni 2022 16:41
> > >> To: Emil Berg <emil.berg@ericsson.com>; bugzilla@dpdk.org
> > >> Cc: dev@dpdk.org
> > >> Subject: RE: [Bug 1035] __rte_raw_cksum() crash with misaligned
> > pointer
> > >>
> > >>> From: bugzilla@dpdk.org [mailto:bugzilla@dpdk.org]
> > >>> Sent: Wednesday, 15 June 2022 09.16
> > >>>
> > >>> https://bugs.dpdk.org/show_bug.cgi?id=1035
> > >>>
> > >>>              Bug ID: 1035
> > >>>             Summary: __rte_raw_cksum() crash with misaligned
> > pointer
> > >>>             Product: DPDK
> > >>>             Version: 21.11
> > >>>            Hardware: All
> > >>>                  OS: All
> > >>>              Status: UNCONFIRMED
> > >>>            Severity: normal
> > >>>            Priority: Normal
> > >>>           Component: ethdev
> > >>>            Assignee: dev@dpdk.org
> > >>>            Reporter: emil.berg@ericsson.com
> > >>>    Target Milestone: ---
> > >>>
> > >>> See rte_raw_cksum() in rte_ip.h, which is part of the public API.
> > See
> > >>> also the subfunction __rte_raw_cksum().
> > >>>
> > >>> _rte_raw_cksum assumes that the buffer over which the checksum is
> > >>> calculated is an even address (divisible by two). See for example
> > >> this
> > >>> stack overflow
> > >>> post:
> > >>> https://stackoverflow.com/questions/46790550/c-undefined-
> behavior-
> > >>> strict-aliasing-rule-or-incorrect-alignment
> > >>>
> > >>> The post explains that there is undefined behavior in C11 when
> > >>> "conversion between two pointer types produces a result that is
> > >>> incorrectly aligned". When the buf argument starts on an odd
> > address
> > >>> we thus have undefined behavior, since a pointer is cast from
> void*
> > >> to
> > >>> uint16_t*.
> > >>>
> > >>> In most cases (at least on x86) that isn't a problem, but with
> > higher
> > >>> optimization levels it may break due to vector instructions. This
> > new
> > >>> function seems to be easier to optimize by the compiler,
> resulting
> > in
> > >>> a crash when the buf argument is odd. Please note that the
> > undefined
> > >>> behavior is present in earlier versions of dpdk as well.
> > >>>
> > >>> Now you're probably thinking: "Just align your buffers". The
> > problem
> > >>> is that we have a packet buffer which is aligned. The checksum is
> > >>> calculated on a subset of that aligned packet buffer, and that
> > >>> sometimes lies on odd addresses.
> > >>>
> > >>> The question remains if this is an issue with dpdk or not.
> > >>
> > >> I can imagine other systems doing what you describe too. So it
> needs
> > to
> > >> be addressed.
> > >>
> > >> Off the top of my head, an easy fix would be updating
> > __rte_raw_cksum()
> > >> like this:
> > >>
> > >> static inline uint32_t
> > >> __rte_raw_cksum(const void *buf, size_t len, uint32_t sum) {
> > >> 	if (likely((buf & 1) == 0)) {
> > >> 		/* The buffer is 16 bit aligned. */
> > >> 		Keep the existing, optimized implementation here.
> > >> 	} else {
> > >> 		/* The buffer is not 16 bit aligned. */
> > >> 		Add a new odd-buf tolerant implementation here.
> > >> 	}
> > >> }
> > >>
> > >> However, I'm not sure that it covers your scenario!
> > >>
> > >> The checksum is 16 bit wide, so if you calculate the checksum of
> > e.g. 4
> > >> bytes of memory starting at offset 1 in a 6 byte packet buffer,
> the
> > >> memory block can be treated as either 4 or 6 bytes relative to the
> > data
> > >> covered by the checksum, i.e.:
> > >>
> > >> A: XX [01 02] [03 04] XX --> cksum = [04 06]
> > >>
> > >> B: [XX 01] [02 03] [04 XX] --> cksum = [06 04]
> > >>
> > >> Which one do you need?
> > >>
> > >> Perhaps an additional function is required to support your use
> case,
> > >> and the documentation for rte_raw_cksum() and __rte_raw_cksum()
> > needs
> > >> to reflect that the buffer must be 16 bit aligned.
> > >>
> > >> Or the rte_raw_cksum() function can be modified to support an odd
> > >> buffer pointer as outlined above, with documentation added about
> > >> alignment of the running checksum.
> > >
> > >> -----Original Message-----
> > >> From: Emil Berg
> > >> Sent: den 16 juni 2022 07:45
> > >> To: Morten Brørup <mb@smartsharesystems.com>; bugzilla@dpdk.org
> > >> Cc: dev@dpdk.org
> > >> Subject: RE: [Bug 1035] __rte_raw_cksum() crash with misaligned
> > pointer
> > >>
> > >> Hi!
> > >>
> > >> We want the B option, i.e. the 6 bytes option. Perhaps adding
> > alignment
> > >> detection to __rte_raw_cksum() is a good idea.
> > >>
> > >> A minor comment but I think buf & 1 won't work since buf isn't an
> > >> integral type, but something along that way.
> > >>
> > >> I'm starting to think about an efficient way to do this.
> > >>
> > >> Thank you!
> > >
> > >> -----Original Message-----
> > >> From: Emil Berg [mailto:emil.berg@ericsson.com]
> > >> Sent: Thursday, 16 June 2022 08.32
> > >> To: Morten Brørup; bugzilla@dpdk.org
> > >> Cc: dev@dpdk.org
> > >> Subject: RE: [Bug 1035] __rte_raw_cksum() crash with misaligned
> > pointer
> > >>
> > >> I've been sketching on an efficient solution to this. What about
> > >> something along the way below? I've run it with the combinations
> of:
> > >> even buf, even len
> > >> even buf, odd len
> > >> odd buf, even len
> > >> odd buf, odd len
> > >>
> > >> and it seems to give the same results as the older version of
> > >> __rte_raw_cksum, before 21.03. I ran it without optimizations and
> > such
> > >> to ensure the compiler didn't insert vector instructions and such
> so
> > >> the results were comparable.
> > >
> > > The performance, when using an aligned buffer, needs to be
> comparable
> > with full compiler optimization, or the patch will not be accepted.
> > >
> > I think the question is: does rte_raw_cksum() have any alignment
> > requirements, from an API contract point of view? The documentation
> > says
> > nothing about any such. In that case, it seems reasonable to me to
> > assume that there are none.
> 
> The packet buffer must be 16 bit aligned. Many structures in DPDK are
> designed with this invariant, e.g. the Ethernet header (struct
> rte_ether_hdr) [1].
> 
> I agree that the documentation could mention this invariant in more
> places.
> 
> When calculating the checksum of a part of packet buffer, it should
> allow that part of the buffer to start at a non-16 bit aligned address,
> which is what Emil needs, and this is what I suggest adding support for
> in this function.
> 
> Regarding any implementation suggestions, please refer to my examples A
> and B above: The running checksum for a partial buffer of 4 bytes
> differs, depending on how you calculate it. The checksum calculation
> must see the buffer as part of the packet buffer, i.e. aligned with the
> (16 bit aligned) packet buffer and the (16 bit aligned) running
> checksum, as described by example B.
> 
> [1]
> https://elixir.bootlin.com/dpdk/latest/source/lib/net/rte_ether.h#L273

I tried a few things in Godbolt, and will submit a patch shortly.

Please hang on...


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH] net: fix checksum with unaligned buffer
  2022-06-17  7:32           ` Morten Brørup
@ 2022-06-17  8:45             ` Morten Brørup
  2022-06-17  9:06               ` Morten Brørup
  2022-06-22 13:44             ` [PATCH v2] " Morten Brørup
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 74+ messages in thread
From: Morten Brørup @ 2022-06-17  8:45 UTC (permalink / raw)
  To: emil.berg, dev; +Cc: stable, bugzilla, hofors, olivier.matz, Morten Brørup

With this patch, the checksum can be calculated on an unaligned part of
a packet buffer.
I.e. the buf parameter is no longer required to be 16 bit aligned.

The DPDK invariant that packet buffers must be 16 bit aligned remains
unchanged.
This invariant also defines how to calculate the 16 bit checksum on an
unaligned part of a packet buffer.

Bugzilla ID: 1035
Cc: stable@dpdk.org

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/net/rte_ip.h | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
index b502481670..8e301d9c26 100644
--- a/lib/net/rte_ip.h
+++ b/lib/net/rte_ip.h
@@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
 	/* extend strict-aliasing rules */
 	typedef uint16_t __attribute__((__may_alias__)) u16_p;
-	const u16_p *u16_buf = (const u16_p *)buf;
-	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
+	const u16_p *u16_buf;
+	const u16_p *end;
+
+	/* if buffer is unaligned, keeping it byte order independent */
+	if (unlikely((uintptr_t)buf & 1)) {
+		uint16_t first = 0;
+		if (unlikely(len == 0))
+			return 0;
+		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
+		sum += first;
+		buf = (const void *)((uintptr_t)buf + 1);
+		len--;
+	}
 
+	u16_buf = (const u16_p *)buf;
+	end = u16_buf + len / sizeof(*u16_buf);
 	for (; u16_buf != end; ++u16_buf)
 		sum += *u16_buf;
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-17  8:45             ` [PATCH] net: fix checksum with unaligned buffer Morten Brørup
@ 2022-06-17  9:06               ` Morten Brørup
  2022-06-17 12:17                 ` Emil Berg
  2022-06-20 10:37                 ` Emil Berg
  0 siblings, 2 replies; 74+ messages in thread
From: Morten Brørup @ 2022-06-17  9:06 UTC (permalink / raw)
  To: emil.berg; +Cc: stable, bugzilla, hofors, olivier.matz, dev

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Friday, 17 June 2022 10.45
> 
> With this patch, the checksum can be calculated on an unaligned part of
> a packet buffer.
> I.e. the buf parameter is no longer required to be 16 bit aligned.
> 
> The DPDK invariant that packet buffers must be 16 bit aligned remains
> unchanged.
> This invariant also defines how to calculate the 16 bit checksum on an
> unaligned part of a packet buffer.
> 
> Bugzilla ID: 1035
> Cc: stable@dpdk.org
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/net/rte_ip.h | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> index b502481670..8e301d9c26 100644
> --- a/lib/net/rte_ip.h
> +++ b/lib/net/rte_ip.h
> @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf, size_t len,
> uint32_t sum)
>  {
>  	/* extend strict-aliasing rules */
>  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> -	const u16_p *u16_buf = (const u16_p *)buf;
> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> +	const u16_p *u16_buf;
> +	const u16_p *end;
> +
> +	/* if buffer is unaligned, keeping it byte order independent */
> +	if (unlikely((uintptr_t)buf & 1)) {
> +		uint16_t first = 0;
> +		if (unlikely(len == 0))
> +			return 0;
> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> +		sum += first;
> +		buf = (const void *)((uintptr_t)buf + 1);
> +		len--;
> +	}
> 
> +	u16_buf = (const u16_p *)buf;
> +	end = u16_buf + len / sizeof(*u16_buf);
>  	for (; u16_buf != end; ++u16_buf)
>  		sum += *u16_buf;
> 
> --
> 2.17.1

@Emil, can you please test this patch with an unaligned buffer on your application to confirm that it produces the expected result.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-17  9:06               ` Morten Brørup
@ 2022-06-17 12:17                 ` Emil Berg
  2022-06-20 10:37                 ` Emil Berg
  1 sibling, 0 replies; 74+ messages in thread
From: Emil Berg @ 2022-06-17 12:17 UTC (permalink / raw)
  To: Morten Brørup; +Cc: stable, bugzilla, hofors, olivier.matz, dev

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: den 17 juni 2022 11:07
> To: Emil Berg <emil.berg@ericsson.com>
> Cc: stable@dpdk.org; bugzilla@dpdk.org; hofors@lysator.liu.se;
> olivier.matz@6wind.com; dev@dpdk.org
> Subject: RE: [PATCH] net: fix checksum with unaligned buffer
> 
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Friday, 17 June 2022 10.45
> >
> > With this patch, the checksum can be calculated on an unaligned part of
> > a packet buffer.
> > I.e. the buf parameter is no longer required to be 16 bit aligned.
> >
> > The DPDK invariant that packet buffers must be 16 bit aligned remains
> > unchanged.
> > This invariant also defines how to calculate the 16 bit checksum on an
> > unaligned part of a packet buffer.
> >
> > Bugzilla ID: 1035
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> >  lib/net/rte_ip.h | 17 +++++++++++++++--
> >  1 file changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > b502481670..8e301d9c26 100644
> > --- a/lib/net/rte_ip.h
> > +++ b/lib/net/rte_ip.h
> > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf, size_t len,
> > uint32_t sum)  {
> >  	/* extend strict-aliasing rules */
> >  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > -	const u16_p *u16_buf = (const u16_p *)buf;
> > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > +	const u16_p *u16_buf;
> > +	const u16_p *end;
> > +
> > +	/* if buffer is unaligned, keeping it byte order independent */
> > +	if (unlikely((uintptr_t)buf & 1)) {
> > +		uint16_t first = 0;
> > +		if (unlikely(len == 0))
> > +			return 0;
> > +		((unsigned char *)&first)[1] = *(const unsigned
> char *)buf;
> > +		sum += first;
> > +		buf = (const void *)((uintptr_t)buf + 1);
> > +		len--;
> > +	}
> >
> > +	u16_buf = (const u16_p *)buf;
> > +	end = u16_buf + len / sizeof(*u16_buf);
> >  	for (; u16_buf != end; ++u16_buf)
> >  		sum += *u16_buf;
> >
> > --
> > 2.17.1
> 
> @Emil, can you please test this patch with an unaligned buffer on your
> application to confirm that it produces the expected result.

All right. I'll test, perhaps on Monday.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-17  9:06               ` Morten Brørup
  2022-06-17 12:17                 ` Emil Berg
@ 2022-06-20 10:37                 ` Emil Berg
  2022-06-20 10:57                   ` Morten Brørup
  1 sibling, 1 reply; 74+ messages in thread
From: Emil Berg @ 2022-06-20 10:37 UTC (permalink / raw)
  To: Morten Brørup; +Cc: stable, bugzilla, hofors, olivier.matz, dev

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: den 17 juni 2022 11:07
> To: Emil Berg <emil.berg@ericsson.com>
> Cc: stable@dpdk.org; bugzilla@dpdk.org; hofors@lysator.liu.se;
> olivier.matz@6wind.com; dev@dpdk.org
> Subject: RE: [PATCH] net: fix checksum with unaligned buffer
> 
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Friday, 17 June 2022 10.45
> >
> > With this patch, the checksum can be calculated on an unligned part of
> > a packet buffer.
> > I.e. the buf parameter is no longer required to be 16 bit aligned.
> >
> > The DPDK invariant that packet buffers must be 16 bit aligned remains
> > unchanged.
> > This invariant also defines how to calculate the 16 bit checksum on an
> > unaligned part of a packet buffer.
> >
> > Bugzilla ID: 1035
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > ---
> >  lib/net/rte_ip.h | 17 +++++++++++++++--
> >  1 file changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > b502481670..8e301d9c26 100644
> > --- a/lib/net/rte_ip.h
> > +++ b/lib/net/rte_ip.h
> > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf, size_t len,
> > uint32_t sum)  {
> >  	/* extend strict-aliasing rules */
> >  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > -	const u16_p *u16_buf = (const u16_p *)buf;
> > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > +	const u16_p *u16_buf;
> > +	const u16_p *end;
> > +
> > +	/* if buffer is unaligned, keeping it byte order independent */
> > +	if (unlikely((uintptr_t)buf & 1)) {
> > +		uint16_t first = 0;
> > +		if (unlikely(len == 0))
> > +			return 0;
> > +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> > +		sum += first;
> > +		buf = (const void *)((uintptr_t)buf + 1);
> > +		len--;
> > +	}
> >
> > +	u16_buf = (const u16_p *)buf;
> > +	end = u16_buf + len / sizeof(*u16_buf);
> >  	for (; u16_buf != end; ++u16_buf)
> >  		sum += *u16_buf;
> >
> > --
> > 2.17.1
> 
> @Emil, can you please test this patch with an unaligned buffer on your
> application to confirm that it produces the expected result.

Hi!

I tested the patch. It doesn't seem to produce the same results. I think the problem is that it always starts summing from an even address; according to the checksum specification, the sum should always start from the first byte. Can I instead propose something Mattias Rönnblom sent me?
--------------------------------------------------------------------------------------------------------------
const void *end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) * sizeof(uint16_t));

for (; buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
    uint16_t v;
    memcpy(&v, buf, sizeof(uint16_t));
    sum += v;
}

/* if length is odd, keeping it byte order independent */
if (unlikely(len % 2)) {
    uint16_t left = 0;
    *(unsigned char *)&left = *(const unsigned char *)end;
    sum += left;
}
--------------------------------------------------------------------------------------------------------------
Note that the last block is the same as before. Amazingly, I see no measurable performance hit from this compared to the previous one (-O3, -march=native). Looking at the previous version, the loop body may compile to (x86):
--------------------------------------------------------------------------------------------------------------
vmovdqa (%rdx),%xmm1
vpmovzxwd %xmm1,%xmm0
vpsrldq $0x8,%xmm1,%xmm1
vpmovzxwd %xmm1,%xmm1
vpaddd %xmm1,%xmm0,%xmm0
cmp    $0xf,%rax
jbe    0x7ff7a0dfb1a9
--------------------------------------------------------------------------------------------------------------
while Mattias' memcpy solution:
--------------------------------------------------------------------------------------------------------------
vmovdqu (%rcx),%ymm0
add    $0x20,%rcx
vpmovzxwd %xmm0,%ymm1
vextracti128 $0x1,%ymm0,%xmm0
vpmovzxwd %xmm0,%ymm0
vpaddd %ymm0,%ymm1,%ymm0
vpaddd %ymm0,%ymm2,%ymm2
cmp    %r9,%rcx
jne    0x555555556380
--------------------------------------------------------------------------------------------------------------
Thus there are two extra instructions in the loop, but I suspect the loop is memory bound, which would explain the lack of a measurable performance difference.
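
For reference, the memcpy fragment above can be wrapped into a standalone function. This is only a minimal sketch, not the DPDK implementation: the names raw_cksum_memcpy(), cksum_fold() and cksum_selfcheck() are hypothetical, and plain pointer arithmetic stands in for RTE_PTR_ADD so that it builds without DPDK headers. The self-check assumes the intended behavior is that a nested packet at an odd address yields the same sum as an aligned copy of the same bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not the DPDK API: sum 16-bit words starting at the
 * first byte of buf, regardless of the alignment of buf. */
static uint32_t raw_cksum_memcpy(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;
	const unsigned char *end = p + (len / sizeof(uint16_t)) * sizeof(uint16_t);

	for (; p != end; p += sizeof(uint16_t)) {
		uint16_t v;

		/* memcpy makes the 16-bit load well defined at any address */
		memcpy(&v, p, sizeof(v));
		sum += v;
	}

	/* if length is odd, keep it byte order independent */
	if (len % 2) {
		uint16_t left = 0;

		memcpy(&left, end, 1);
		sum += left;
	}

	return sum;
}

/* Fold the 32-bit accumulator into 16 bits (ones' complement addition). */
static uint16_t cksum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Self-check: the same bytes at an odd and at an even address must agree. */
static void cksum_selfcheck(void)
{
	/* uint16_t storage guarantees an even base address */
	uint16_t storage[8] = {0};
	uint16_t aligned_copy[4] = {0};
	unsigned char *base = (unsigned char *)storage;
	static const unsigned char pkt[7] = {
		0x45, 0x00, 0x12, 0x34, 0xab, 0xcd, 0xef
	};

	memcpy(base + 1, pkt, sizeof(pkt));     /* nested packet, odd address */
	memcpy(aligned_copy, pkt, sizeof(pkt)); /* same bytes, even address */

	assert(cksum_fold(raw_cksum_memcpy(base + 1, sizeof(pkt), 0)) ==
	       cksum_fold(raw_cksum_memcpy(aligned_copy, sizeof(pkt), 0)));
}
```

Calling cksum_selfcheck() exercises both the unaligned start and the odd-length tail in one go.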

Any comments?

/Emil


* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-20 10:37                 ` Emil Berg
@ 2022-06-20 10:57                   ` Morten Brørup
  2022-06-21  7:16                     ` Emil Berg
  0 siblings, 1 reply; 74+ messages in thread
From: Morten Brørup @ 2022-06-20 10:57 UTC (permalink / raw)
  To: Emil Berg; +Cc: stable, bugzilla, hofors, olivier.matz, dev

> From: Emil Berg [mailto:emil.berg@ericsson.com]
> Sent: Monday, 20 June 2022 12.38
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: den 17 juni 2022 11:07
> >
> > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > Sent: Friday, 17 June 2022 10.45
> > >
> > > With this patch, the checksum can be calculated on an unligned part
> of
> > > a packet buffer.
> > > I.e. the buf parameter is no longer required to be 16 bit aligned.
> > >
> > > The DPDK invariant that packet buffers must be 16 bit aligned
> remains
> > > unchanged.
> > > This invariant also defines how to calculate the 16 bit checksum on
> an
> > > unaligned part of a packet buffer.
> > >
> > > Bugzilla ID: 1035
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > ---
> > >  lib/net/rte_ip.h | 17 +++++++++++++++--
> > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > > b502481670..8e301d9c26 100644
> > > --- a/lib/net/rte_ip.h
> > > +++ b/lib/net/rte_ip.h
> > > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf, size_t len,
> > > uint32_t sum)  {
> > >  	/* extend strict-aliasing rules */
> > >  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > > -	const u16_p *u16_buf = (const u16_p *)buf;
> > > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > +	const u16_p *u16_buf;
> > > +	const u16_p *end;
> > > +
> > > +	/* if buffer is unaligned, keeping it byte order independent */
> > > +	if (unlikely((uintptr_t)buf & 1)) {
> > > +		uint16_t first = 0;
> > > +		if (unlikely(len == 0))
> > > +			return 0;
> > > +		((unsigned char *)&first)[1] = *(const unsigned
> > char *)buf;
> > > +		sum += first;
> > > +		buf = (const void *)((uintptr_t)buf + 1);
> > > +		len--;
> > > +	}
> > >
> > > +	u16_buf = (const u16_p *)buf;
> > > +	end = u16_buf + len / sizeof(*u16_buf);
> > >  	for (; u16_buf != end; ++u16_buf)
> > >  		sum += *u16_buf;
> > >
> > > --
> > > 2.17.1
> >
> > @Emil, can you please test this patch with an unaligned buffer on
> your
> > application to confirm that it produces the expected result.
> 
> Hi!
> 
> I tested the patch. It doesn't seem to produce the same results. I
> think the problem is that it always starts summing from an even
> address, the sum should always start from the first byte according to
> the checksum specification. Can I instead propose something Mattias
> Rönnblom sent me?

I assume that it produces the same result when the "buf" parameter is aligned?

And when the "buf" parameter is unaligned, I don't expect it to produce the same results as the simple algorithm!

This was the whole point of the patch: I expect the overall packet buffer to be 16 bit aligned, and the checksum to be a partial checksum of such a 16 bit aligned packet buffer. When calling this function, I assume that the "buf" and "len" parameters point to a part of such a packet buffer. If these expectations are correct, the simple algorithm will produce incorrect results when "buf" is unaligned.

I was asking you to test if the checksum on the packet is correct when your application modifies an unaligned part of the packet and uses this function to update the checksum.
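
To make the difference between the two interpretations concrete, here is a minimal sketch. It is not DPDK code, and all names in it are hypothetical. It relies on the byte order independence of the ones' complement sum described in RFC 1071: when a region starts at an odd offset inside a 16 bit aligned packet, the sum computed from the region's first byte is the byte-swapped version of that region's contribution to the sum over the whole packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: folded ones' complement sum that treats the first
 * two bytes of buf as the first 16-bit word (the "simple algorithm"). */
static uint16_t sum_from_first_byte(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		uint16_t v;

		memcpy(&v, buf + i, sizeof(v));
		sum += v;
	}
	if (len % 2) {
		uint16_t left = 0;

		memcpy(&left, buf + len - 1, 1);
		sum += left;
	}
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones' complement addition of two folded 16-bit sums. */
static uint16_t add16(uint16_t a, uint16_t b)
{
	uint32_t sum = (uint32_t)a + b;

	return (uint16_t)((sum & 0xffff) + (sum >> 16));
}

static uint16_t swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* pkt stands for a 16 bit aligned packet; offsets are relative to its start. */
static void partial_cksum_demo(void)
{
	static const unsigned char pkt[8] = {
		0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08
	};
	uint16_t whole = sum_from_first_byte(pkt, 8);
	uint16_t head = sum_from_first_byte(pkt, 3);     /* offsets 0..2 */
	uint16_t tail = sum_from_first_byte(pkt + 3, 5); /* offsets 3..7 */

	/* adding the odd-offset part as-is does not give the packet sum */
	assert(add16(head, tail) != whole);
	/* its contribution to the aligned packet sum is the byte swap */
	assert(add16(head, swap16(tail)) == whole);
}
```

So a function that starts summing at the first byte is what a full checksum of a nested packet needs, while a partial checksum of an odd-offset region of an aligned packet needs the swapped value.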


> -----------------------------------------------------------------------
> ---------------------------------------
> const void *end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) *
> sizeof(uint16_t));
> 
> for (; buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
>     uint16_t v;
>     memcpy(&v, buf, sizeof(uint16_t));
>     sum += v;
> }
> 
> /* if length is odd, keeping it byte order independent */
> if (unlikely(len % 2)) {
>     uint16_t left = 0;
>     *(unsigned char *)&left = *(const unsigned char *)end;
>     sum += left;
> }
> -----------------------------------------------------------------------
> ---------------------------------------
> Note that the last block is the same as before. Amazingly I see no
> measurable performance hit from this compared to the previous one (-O3,
> march=native). Looking at the previous the loop body may compile to
> (x86):
> -----------------------------------------------------------------------
> ---------------------------------------
> vmovdqa (%rdx),%xmm1
> vpmovzxwd %xmm1,%xmm0
> vpsrldq $0x8,%xmm1,%xmm1
> vpmovzxwd %xmm1,%xmm1
> vpaddd %xmm1,%xmm0,%xmm0
> cmp    $0xf,%rax
> jbe    0x7ff7a0dfb1a9
> -----------------------------------------------------------------------
> ---------------------------------------
> while Mattias' memcpy solution:
> -----------------------------------------------------------------------
> ---------------------------------------
> vmovdqu (%rcx),%ymm0
> add    $0x20,%rcx
> vpmovzxwd %xmm0,%ymm1
> vextracti128 $0x1,%ymm0,%xmm0
> vpmovzxwd %xmm0,%ymm0
> vpaddd %ymm0,%ymm1,%ymm0
> vpaddd %ymm0,%ymm2,%ymm2
> cmp    %r9,%rcx
> jne    0x555555556380
> -----------------------------------------------------------------------
> ---------------------------------------
> Thus two extra instructions in the loop, but I suspect it may be memory
> bound, leading to no measurable performance difference.
> 
> Any comments?
> 
> /Emil


* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-20 10:57                   ` Morten Brørup
@ 2022-06-21  7:16                     ` Emil Berg
  2022-06-21  8:05                       ` Morten Brørup
  0 siblings, 1 reply; 74+ messages in thread
From: Emil Berg @ 2022-06-21  7:16 UTC (permalink / raw)
  To: Morten Brørup; +Cc: stable, bugzilla, hofors, olivier.matz, dev



> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: den 20 juni 2022 12:58
> To: Emil Berg <emil.berg@ericsson.com>
> Cc: stable@dpdk.org; bugzilla@dpdk.org; hofors@lysator.liu.se;
> olivier.matz@6wind.com; dev@dpdk.org
> Subject: RE: [PATCH] net: fix checksum with unaligned buffer
> 
> > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > Sent: Monday, 20 June 2022 12.38
> >
> > > From: Morten Brørup <mb@smartsharesystems.com>
> > > Sent: den 17 juni 2022 11:07
> > >
> > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > Sent: Friday, 17 June 2022 10.45
> > > >
> > > > With this patch, the checksum can be calculated on an unligned
> > > > part
> > of
> > > > a packet buffer.
> > > > I.e. the buf parameter is no longer required to be 16 bit aligned.
> > > >
> > > > The DPDK invariant that packet buffers must be 16 bit aligned
> > remains
> > > > unchanged.
> > > > This invariant also defines how to calculate the 16 bit checksum
> > > > on
> > an
> > > > unaligned part of a packet buffer.
> > > >
> > > > Bugzilla ID: 1035
> > > > Cc: stable@dpdk.org
> > > >
> > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > > ---
> > > >  lib/net/rte_ip.h | 17 +++++++++++++++--
> > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > > > b502481670..8e301d9c26 100644
> > > > --- a/lib/net/rte_ip.h
> > > > +++ b/lib/net/rte_ip.h
> > > > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf, size_t len,
> > > > uint32_t sum)  {
> > > >  	/* extend strict-aliasing rules */
> > > >  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > > > -	const u16_p *u16_buf = (const u16_p *)buf;
> > > > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > > +	const u16_p *u16_buf;
> > > > +	const u16_p *end;
> > > > +
> > > > +	/* if buffer is unaligned, keeping it byte order independent */
> > > > +	if (unlikely((uintptr_t)buf & 1)) {
> > > > +		uint16_t first = 0;
> > > > +		if (unlikely(len == 0))
> > > > +			return 0;
> > > > +		((unsigned char *)&first)[1] = *(const unsigned
> > > char *)buf;
> > > > +		sum += first;
> > > > +		buf = (const void *)((uintptr_t)buf + 1);
> > > > +		len--;
> > > > +	}
> > > >
> > > > +	u16_buf = (const u16_p *)buf;
> > > > +	end = u16_buf + len / sizeof(*u16_buf);
> > > >  	for (; u16_buf != end; ++u16_buf)
> > > >  		sum += *u16_buf;
> > > >
> > > > --
> > > > 2.17.1
> > >
> > > @Emil, can you please test this patch with an unaligned buffer on
> > your
> > > application to confirm that it produces the expected result.
> >
> > Hi!
> >
> > I tested the patch. It doesn't seem to produce the same results. I
> > think the problem is that it always starts summing from an even
> > address, the sum should always start from the first byte according to
> > the checksum specification. Can I instead propose something Mattias
> > Rönnblom sent me?
> 
> I assume that it produces the same result when the "buf" parameter is
> aligned?
> 
> And when the "buf" parameter is unaligned, I don't expect it to produce the
> same results as the simple algorithm!
> 
> This was the whole point of the patch: I expect the overall packet buffer to
> be 16 bit aligned, and the checksum to be a partial checksum of such a 16 bit
> aligned packet buffer. When calling this function, I assume that the "buf" and
> "len" parameters point to a part of such a packet buffer. If these
> expectations are correct, the simple algorithm will produce incorrect results
> when "buf" is unaligned.
> 
> I was asking you to test if the checksum on the packet is correct when your
> application modifies an unaligned part of the packet and uses this function to
> update the checksum.
> 

Now I understand your use case. It seems to be about partial checksums, some of which may start at unaligned addresses in an otherwise aligned packet.

Our use case is about calculating the full checksum on a nested packet. That nested packet may start at an unaligned address.

The difference is basically whether we want to sum over aligned addresses or not, handling the leading and trailing bytes appropriately.

Your method does not work in our case, since we want to treat the first two bytes as the first word. But I do understand that both methods are useful.

Note that your method breaks the API. Previously (assuming it did not crash, thanks to low optimization levels, more forgiving hardware, or a different compiler (version)), the current method would calculate the checksum assuming the first two bytes are the first word.

> 
> > ----------------------------------------------------------------------
> > -
> > ---------------------------------------
> > const void *end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) *
> > sizeof(uint16_t));
> >
> > for (; buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
> >     uint16_t v;
> >     memcpy(&v, buf, sizeof(uint16_t));
> >     sum += v;
> > }
> >
> > /* if length is odd, keeping it byte order independent */ if
> > (unlikely(len % 2)) {
> >     uint16_t left = 0;
> >     *(unsigned char *)&left = *(const unsigned char *)end;
> >     sum += left;
> > }
> > ----------------------------------------------------------------------
> > -
> > ---------------------------------------
> > Note that the last block is the same as before. Amazingly I see no
> > measurable performance hit from this compared to the previous one
> > (-O3, march=native). Looking at the previous the loop body may compile
> > to
> > (x86):
> > ----------------------------------------------------------------------
> > -
> > ---------------------------------------
> > vmovdqa (%rdx),%xmm1
> > vpmovzxwd %xmm1,%xmm0
> > vpsrldq $0x8,%xmm1,%xmm1
> > vpmovzxwd %xmm1,%xmm1
> > vpaddd %xmm1,%xmm0,%xmm0
> > cmp    $0xf,%rax
> > jbe    0x7ff7a0dfb1a9
> > ----------------------------------------------------------------------
> > -
> > ---------------------------------------
> > while Mattias' memcpy solution:
> > ----------------------------------------------------------------------
> > -
> > ---------------------------------------
> > vmovdqu (%rcx),%ymm0
> > add    $0x20,%rcx
> > vpmovzxwd %xmm0,%ymm1
> > vextracti128 $0x1,%ymm0,%xmm0
> > vpmovzxwd %xmm0,%ymm0
> > vpaddd %ymm0,%ymm1,%ymm0
> > vpaddd %ymm0,%ymm2,%ymm2
> > cmp    %r9,%rcx
> > jne    0x555555556380
> > ----------------------------------------------------------------------
> > -
> > ---------------------------------------
> > Thus two extra instructions in the loop, but I suspect it may be
> > memory bound, leading to no measurable performance difference.
> >
> > Any comments?
> >
> > /Emil


* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-21  7:16                     ` Emil Berg
@ 2022-06-21  8:05                       ` Morten Brørup
  2022-06-21  8:23                         ` Bruce Richardson
  0 siblings, 1 reply; 74+ messages in thread
From: Morten Brørup @ 2022-06-21  8:05 UTC (permalink / raw)
  To: Emil Berg, Bruce Richardson, Stephen Hemminger
  Cc: stable, bugzilla, hofors, olivier.matz, dev

+TO: @Bruce and @Stephen: You signed off on the 16 bit alignment requirement. We need background info on this.

> From: Emil Berg [mailto:emil.berg@ericsson.com]
> Sent: Tuesday, 21 June 2022 09.17
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: den 20 juni 2022 12:58
> >
> > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > Sent: Monday, 20 June 2022 12.38
> > >
> > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > Sent: den 17 juni 2022 11:07
> > > >
> > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > Sent: Friday, 17 June 2022 10.45
> > > > >
> > > > > With this patch, the checksum can be calculated on an unligned
> > > > > part
> > > of
> > > > > a packet buffer.
> > > > > I.e. the buf parameter is no longer required to be 16 bit
> aligned.
> > > > >
> > > > > The DPDK invariant that packet buffers must be 16 bit aligned
> > > remains
> > > > > unchanged.
> > > > > This invariant also defines how to calculate the 16 bit
> checksum
> > > > > on
> > > an
> > > > > unaligned part of a packet buffer.
> > > > >
> > > > > Bugzilla ID: 1035
> > > > > Cc: stable@dpdk.org
> > > > >
> > > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > > > ---
> > > > >  lib/net/rte_ip.h | 17 +++++++++++++++--
> > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > > > > b502481670..8e301d9c26 100644
> > > > > --- a/lib/net/rte_ip.h
> > > > > +++ b/lib/net/rte_ip.h
> > > > > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf, size_t
> len,
> > > > > uint32_t sum)  {
> > > > >  	/* extend strict-aliasing rules */
> > > > >  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > > > > -	const u16_p *u16_buf = (const u16_p *)buf;
> > > > > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > > > +	const u16_p *u16_buf;
> > > > > +	const u16_p *end;
> > > > > +
> > > > > +	/* if buffer is unaligned, keeping it byte order
> independent */
> > > > > +	if (unlikely((uintptr_t)buf & 1)) {
> > > > > +		uint16_t first = 0;
> > > > > +		if (unlikely(len == 0))
> > > > > +			return 0;
> > > > > +		((unsigned char *)&first)[1] = *(const unsigned
> > > > char *)buf;
> > > > > +		sum += first;
> > > > > +		buf = (const void *)((uintptr_t)buf + 1);
> > > > > +		len--;
> > > > > +	}
> > > > >
> > > > > +	u16_buf = (const u16_p *)buf;
> > > > > +	end = u16_buf + len / sizeof(*u16_buf);
> > > > >  	for (; u16_buf != end; ++u16_buf)
> > > > >  		sum += *u16_buf;
> > > > >
> > > > > --
> > > > > 2.17.1
> > > >
> > > > @Emil, can you please test this patch with an unaligned buffer on
> > > your
> > > > application to confirm that it produces the expected result.
> > >
> > > Hi!
> > >
> > > I tested the patch. It doesn't seem to produce the same results. I
> > > think the problem is that it always starts summing from an even
> > > address, the sum should always start from the first byte according
> to
> > > the checksum specification. Can I instead propose something Mattias
> > > Rönnblom sent me?
> >
> > I assume that it produces the same result when the "buf" parameter is
> > aligned?
> >
> > And when the "buf" parameter is unaligned, I don't expect it to
> produce the
> > same results as the simple algorithm!
> >
> > This was the whole point of the patch: I expect the overall packet
> buffer to
> > be 16 bit aligned, and the checksum to be a partial checksum of such
> a 16 bit
> > aligned packet buffer. When calling this function, I assume that the
> "buf" and
> > "len" parameters point to a part of such a packet buffer. If these
> > expectations are correct, the simple algorithm will produce incorrect
> results
> > when "buf" is unaligned.
> >
> > I was asking you to test if the checksum on the packet is correct
> when your
> > application modifies an unaligned part of the packet and uses this
> function to
> > update the checksum.
> >
> 
> Now I understand your use case. Your use case seems to be about partial
> checksums, of which some partial checksums may start on unaligned
> addresses in an otherwise aligned packet.
> 
> Our use case is about calculating the full checksum on a nested packet.
> That nested packet may start on unaligned addresses.
> 
> The difference is basically if we want to sum over aligned addresses or
> not, handling the heading and trailing bytes appropriately.
> 
> Your method does not work in our case since we want to treat the first
> two bytes as the first word in our case. But I do understand that both
> methods are useful.

Yes, those certainly are two different use cases, requiring two different ways of calculating the 16 bit checksum.

> 
> Note that your method breaks the API. Previously (assuming no crashing
> due to low optimization levels, more accepting hardware, or a different
> compiler (version)) the current method would calculate the checksum
> assuming the first two bytes is the first word.
> 

Depending on the point of view, my patch either fixes a bug (where the checksum was calculated incorrectly when the buf pointer was unaligned) or breaks the API (by calculating the checksum differently when the buffer is unaligned).

I cannot say with certainty which one is correct, but perhaps some of the people with a deeper DPDK track record can...

@Bruce and @Stephen, in 2019 you signed off on a patch [1] introducing a 16 bit alignment requirement to the Ethernet address structure.

It is my understanding that DPDK has an invariant requiring packets to be 16 bit aligned, which that patch supports. Is this invariant documented anywhere, or am I completely wrong? If I'm wrong, then the alignment requirement introduced in that patch needs to be removed, as well as any similar alignment requirements elsewhere in DPDK.

[1] http://git.dpdk.org/dpdk/commit/lib/librte_net/rte_ether.h?id=da5350ef29afd35c1adabe76f60832f3092269ad

@Emil, we should wait for a conclusion about the alignment invariant before we proceed.

If there is no such invariant, my patch is wrong, and we need to provide a v2 of the patch, which will then fit your use case.
If there is such an invariant, my patch is correct, and another function must be added for your use case.



* Re: [PATCH] net: fix checksum with unaligned buffer
  2022-06-21  8:05                       ` Morten Brørup
@ 2022-06-21  8:23                         ` Bruce Richardson
  2022-06-21  9:35                           ` Morten Brørup
  0 siblings, 1 reply; 74+ messages in thread
From: Bruce Richardson @ 2022-06-21  8:23 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Emil Berg, Stephen Hemminger, stable, bugzilla, hofors,
	olivier.matz, dev

On Tue, Jun 21, 2022 at 10:05:15AM +0200, Morten Brørup wrote:
> +TO: @Bruce and @Stephen: You signed off on the 16 bit alignment requirement. We need background info on this.
> 
> > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > Sent: Tuesday, 21 June 2022 09.17
> > 
> > > From: Morten Brørup <mb@smartsharesystems.com>
> > > Sent: den 20 juni 2022 12:58
> > >
> > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > Sent: Monday, 20 June 2022 12.38
> > > >
> > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > Sent: den 17 juni 2022 11:07
> > > > >
> > > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > > Sent: Friday, 17 June 2022 10.45
> > > > > >
> > > > > > With this patch, the checksum can be calculated on an unligned
> > > > > > part
> > > > of
> > > > > > a packet buffer.
> > > > > > I.e. the buf parameter is no longer required to be 16 bit
> > aligned.
> > > > > >
> > > > > > The DPDK invariant that packet buffers must be 16 bit aligned
> > > > remains
> > > > > > unchanged.
> > > > > > This invariant also defines how to calculate the 16 bit
> > checksum
> > > > > > on
> > > > an
> > > > > > unaligned part of a packet buffer.
> > > > > >
> > > > > > Bugzilla ID: 1035
> > > > > > Cc: stable@dpdk.org
> > > > > >
> > > > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > > > > ---
> > > > > >  lib/net/rte_ip.h | 17 +++++++++++++++--
> > > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > > > > > b502481670..8e301d9c26 100644
> > > > > > --- a/lib/net/rte_ip.h
> > > > > > +++ b/lib/net/rte_ip.h
> > > > > > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf, size_t
> > len,
> > > > > > uint32_t sum)  {
> > > > > >  	/* extend strict-aliasing rules */
> > > > > >  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > > > > > -	const u16_p *u16_buf = (const u16_p *)buf;
> > > > > > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > > > > +	const u16_p *u16_buf;
> > > > > > +	const u16_p *end;
> > > > > > +
> > > > > > +	/* if buffer is unaligned, keeping it byte order
> > independent */
> > > > > > +	if (unlikely((uintptr_t)buf & 1)) {
> > > > > > +		uint16_t first = 0;
> > > > > > +		if (unlikely(len == 0))
> > > > > > +			return 0;
> > > > > > +		((unsigned char *)&first)[1] = *(const unsigned
> > > > > char *)buf;
> > > > > > +		sum += first;
> > > > > > +		buf = (const void *)((uintptr_t)buf + 1);
> > > > > > +		len--;
> > > > > > +	}
> > > > > >
> > > > > > +	u16_buf = (const u16_p *)buf;
> > > > > > +	end = u16_buf + len / sizeof(*u16_buf);
> > > > > >  	for (; u16_buf != end; ++u16_buf)
> > > > > >  		sum += *u16_buf;
> > > > > >
> > > > > > --
> > > > > > 2.17.1
> > > > >
> > > > > @Emil, can you please test this patch with an unaligned buffer on
> > > > your
> > > > > application to confirm that it produces the expected result.
> > > >
> > > > Hi!
> > > >
> > > > I tested the patch. It doesn't seem to produce the same results. I
> > > > think the problem is that it always starts summing from an even
> > > > address, the sum should always start from the first byte according
> > to
> > > > the checksum specification. Can I instead propose something Mattias
> > > > Rönnblom sent me?
> > >
> > > I assume that it produces the same result when the "buf" parameter is
> > > aligned?
> > >
> > > And when the "buf" parameter is unaligned, I don't expect it to
> > produce the
> > > same results as the simple algorithm!
> > >
> > > This was the whole point of the patch: I expect the overall packet
> > buffer to
> > > be 16 bit aligned, and the checksum to be a partial checksum of such
> > a 16 bit
> > > aligned packet buffer. When calling this function, I assume that the
> > "buf" and
> > > "len" parameters point to a part of such a packet buffer. If these
> > > expectations are correct, the simple algorithm will produce incorrect
> > results
> > > when "buf" is unaligned.
> > >
> > > I was asking you to test if the checksum on the packet is correct
> > when your
> > > application modifies an unaligned part of the packet and uses this
> > function to
> > > update the checksum.
> > >
> > 
> > Now I understand your use case. Your use case seems to be about partial
> > checksums, of which some partial checksums may start on unaligned
> > addresses in an otherwise aligned packet.
> > 
> > Our use case is about calculating the full checksum on a nested packet.
> > That nested packet may start on unaligned addresses.
> > 
> > The difference is basically if we want to sum over aligned addresses or
> > not, handling the heading and trailing bytes appropriately.
> > 
> > Your method does not work in our case since we want to treat the first
> > two bytes as the first word in our case. But I do understand that both
> > methods are useful.
> 
> Yes, that certainly are two different use cases, requiring two different ways of calculating the 16 bit checksum.
> 
> > 
> > Note that your method breaks the API. Previously (assuming no crashing
> > due to low optimization levels, more accepting hardware, or a different
> > compiler (version)) the current method would calculate the checksum
> > assuming the first two bytes is the first word.
> > 
> 
> Depending on the point of view, my patch either fixes a bug (where the checksum was calculated incorrectly when the buf pointer was unaligned) or breaks the API (by calculating the differently when the buffer is unaligned).
> 
> I cannot say with certainty which one is correct, but perhaps some of the people with a deeper DPDK track record can...
> 
> @Bruce and @Stephen, in 2019 you signed off on a patch [1] introducing a 16 bit alignment requirement to the Ethernet address structure.
> 
> It is my understanding that DPDK has an invariant requiring packets to be 16 bit aligned, which that patch supports. Is this invariant documented anywhere, or am I completely wrong? If I'm wrong, then the alignment requirement introduced in that patch needs to be removed, as well as any similar alignment requirements elsewhere in DPDK.

I don't believe it is explicitly documented as a global invariant, but I
think it should be unless there is a definite case where we need to allow
packets to be completely unaligned. Across all packet headers we looked at,
there was no tunneling protocol where the resulting packet was left
unaligned.

That said, if there are real use cases where we need to allow packets to
start at an unaligned address, then I agree with you that we need to roll
back the patch and work to ensure everything works with unaligned
addresses.

/Bruce

> 
> [1] http://git.dpdk.org/dpdk/commit/lib/librte_net/rte_ether.h?id=da5350ef29afd35c1adabe76f60832f3092269ad
> 
> @Emil, we should wait for a conclusion about the alignment invariant before we proceed.
> 
> If there is no such invariant, my patch is wrong, and we need to provide a v2 of the patch, which will then fit your use case.
> If there is such an invariant, my patch is correct, and another function must be added for your use case.
> 


* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-21  8:23                         ` Bruce Richardson
@ 2022-06-21  9:35                           ` Morten Brørup
  2022-06-22  6:26                             ` Emil Berg
  0 siblings, 1 reply; 74+ messages in thread
From: Morten Brørup @ 2022-06-21  9:35 UTC (permalink / raw)
  To: Emil Berg
  Cc: Bruce Richardson, Stephen Hemminger, stable, bugzilla, hofors,
	olivier.matz, dev

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Tuesday, 21 June 2022 10.23
> 
> On Tue, Jun 21, 2022 at 10:05:15AM +0200, Morten Brørup wrote:
> > +TO: @Bruce and @Stephen: You signed off on the 16 bit alignment
> requirement. We need background info on this.
> >
> > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > Sent: Tuesday, 21 June 2022 09.17
> > >
> > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > Sent: den 20 juni 2022 12:58
> > > >
> > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > Sent: Monday, 20 June 2022 12.38
> > > > >
> > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > Sent: den 17 juni 2022 11:07
> > > > > >
> > > > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > > > Sent: Friday, 17 June 2022 10.45
> > > > > > >
> > > > > > > With this patch, the checksum can be calculated on an
> unaligned
> > > > > > > part
> > > > > of
> > > > > > > a packet buffer.
> > > > > > > I.e. the buf parameter is no longer required to be 16 bit
> > > aligned.
> > > > > > >
> > > > > > > The DPDK invariant that packet buffers must be 16 bit
> aligned
> > > > > remains
> > > > > > > unchanged.
> > > > > > > This invariant also defines how to calculate the 16 bit
> > > checksum
> > > > > > > on
> > > > > an
> > > > > > > unaligned part of a packet buffer.
> > > > > > >
> > > > > > > Bugzilla ID: 1035
> > > > > > > Cc: stable@dpdk.org
> > > > > > >
> > > > > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > ---
> > > > > > >  lib/net/rte_ip.h | 17 +++++++++++++++--
> > > > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > > > > > > b502481670..8e301d9c26 100644
> > > > > > > --- a/lib/net/rte_ip.h
> > > > > > > +++ b/lib/net/rte_ip.h
> > > > > > > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf,
> size_t
> > > len,
> > > > > > > uint32_t sum)  {
> > > > > > >  	/* extend strict-aliasing rules */
> > > > > > >  	typedef uint16_t __attribute__((__may_alias__))
> u16_p;
> > > > > > > -	const u16_p *u16_buf = (const u16_p *)buf;
> > > > > > > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > > > > > +	const u16_p *u16_buf;
> > > > > > > +	const u16_p *end;
> > > > > > > +
> > > > > > > +	/* if buffer is unaligned, keeping it byte order
> > > independent */
> > > > > > > +	if (unlikely((uintptr_t)buf & 1)) {
> > > > > > > +		uint16_t first = 0;
> > > > > > > +		if (unlikely(len == 0))
> > > > > > > +			return 0;
> > > > > > > +		((unsigned char *)&first)[1] = *(const unsigned
> > > > > > char *)buf;
> > > > > > > +		sum += first;
> > > > > > > +		buf = (const void *)((uintptr_t)buf + 1);
> > > > > > > +		len--;
> > > > > > > +	}
> > > > > > >
> > > > > > > +	u16_buf = (const u16_p *)buf;
> > > > > > > +	end = u16_buf + len / sizeof(*u16_buf);
> > > > > > >  	for (; u16_buf != end; ++u16_buf)
> > > > > > >  		sum += *u16_buf;
> > > > > > >
> > > > > > > --
> > > > > > > 2.17.1
> > > > > >
> > > > > > @Emil, can you please test this patch with an unaligned
> buffer on
> > > > > your
> > > > > > application to confirm that it produces the expected result.
> > > > >
> > > > > Hi!
> > > > >
> > > > > I tested the patch. It doesn't seem to produce the same
> results. I
> > > > > think the problem is that it always starts summing from an even
> > > > > address, the sum should always start from the first byte
> according
> > > to
> > > > > the checksum specification. Can I instead propose something
> Mattias
> > > > > Rönnblom sent me?
> > > >
> > > > I assume that it produces the same result when the "buf"
> parameter is
> > > > aligned?
> > > >
> > > > And when the "buf" parameter is unaligned, I don't expect it to
> > > produce the
> > > > same results as the simple algorithm!
> > > >
> > > > This was the whole point of the patch: I expect the overall
> packet
> > > buffer to
> > > > be 16 bit aligned, and the checksum to be a partial checksum of
> such
> > > a 16 bit
> > > > aligned packet buffer. When calling this function, I assume that
> the
> > > "buf" and
> > > > "len" parameters point to a part of such a packet buffer. If
> these
> > > > expectations are correct, the simple algorithm will produce
> incorrect
> > > results
> > > > when "buf" is unaligned.
> > > >
> > > > I was asking you to test if the checksum on the packet is correct
> > > when your
> > > > application modifies an unaligned part of the packet and uses
> this
> > > function to
> > > > update the checksum.
> > > >
> > >
> > > Now I understand your use case. Your use case seems to be about
> partial
> > > checksums, of which some partial checksums may start on unaligned
> > > addresses in an otherwise aligned packet.
> > >
> > > Our use case is about calculating the full checksum on a nested
> packet.
> > > That nested packet may start on unaligned addresses.
> > >
> > > The difference is basically if we want to sum over aligned
> addresses or
> > > not, handling the heading and trailing bytes appropriately.
> > >
> > > Your method does not work in our case since we want to treat the
> first
> > > two bytes as the first word in our case. But I do understand that
> both
> > > methods are useful.
> >
> > Yes, that certainly are two different use cases, requiring two
> different ways of calculating the 16 bit checksum.
> >
> > >
> > > Note that your method breaks the API. Previously (assuming no
> crashing
> > > due to low optimization levels, more accepting hardware, or a
> different
> > > compiler (version)) the current method would calculate the checksum
> > > assuming the first two bytes is the first word.
> > >
> >
> > Depending on the point of view, my patch either fixes a bug (where
> the checksum was calculated incorrectly when the buf pointer was
> unaligned) or breaks the API (by calculating the checksum differently when the
> buffer is unaligned).
> >
> > I cannot say with certainty which one is correct, but perhaps some of
> the people with a deeper DPDK track record can...
> >
> > @Bruce and @Stephen, in 2019 you signed off on a patch [1]
> introducing a 16 bit alignment requirement to the Ethernet address
> structure.
> >
> > It is my understanding that DPDK has an invariant requiring packets
> to be 16 bit aligned, which that patch supports. Is this invariant
> documented anywhere, or am I completely wrong? If I'm wrong, then the
> alignment requirement introduced in that patch needs to be removed, as
> well as any similar alignment requirements elsewhere in DPDK.
> 
> I don't believe it is explicitly documented as a global invariant, but
> I
> think it should be unless there is a definite case where we need to
> allow
> packets to be completely unaligned. Across all packet headers we looked
> at,
> there was no tunneling protocol where the resulting packet was left
> unaligned.
> 
> That said, if there are real use cases where we need to allow packets
> to
> start at an unaligned address, then I agree with you that we need to
> roll
> back the patch and work to ensure everything works with unaligned
> addresses.
> 
> /Bruce
>

@Emil, can you please describe or refer to which tunneling protocol you are using, where the nested packet can be unaligned?

I am asking to determine if your use case is exotic (maybe some Ericsson proprietary protocol), or more generic (rooted in some standard protocol). This information affects the DPDK community's opinion about how it should be supported by DPDK.

If possible, please provide more details about the tunneling protocol and nested packets... E.g. do the nested packets also contain Layer 2 (Ethernet, VLAN, etc.) headers, or only Layer 3 (IP) or Layer 4 (TCP, UDP, etc.)? And how about ARP packets and Layer 2 control protocol packets (STP, LACP, etc.)?

> >
> > [1]
> http://git.dpdk.org/dpdk/commit/lib/librte_net/rte_ether.h?id=da5350ef2
> 9afd35c1adabe76f60832f3092269ad
> >
> > @Emil, we should wait for a conclusion about the alignment invariant
> before we proceed.
> >
> > If there is no such invariant, my patch is wrong, and we need to
> provide a v2 of the patch, which will then fit your use case.
> > If there is such an invariant, my patch is correct, and another
> function must be added for your use case.
> >



* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-21  9:35                           ` Morten Brørup
@ 2022-06-22  6:26                             ` Emil Berg
  2022-06-22  9:18                               ` Bruce Richardson
  0 siblings, 1 reply; 74+ messages in thread
From: Emil Berg @ 2022-06-22  6:26 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Bruce Richardson, Stephen Hemminger, stable, bugzilla, hofors,
	olivier.matz, dev



> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: den 21 juni 2022 11:35
> To: Emil Berg <emil.berg@ericsson.com>
> Cc: Bruce Richardson <bruce.richardson@intel.com>; Stephen Hemminger
> <stephen@networkplumber.org>; stable@dpdk.org; bugzilla@dpdk.org;
> hofors@lysator.liu.se; olivier.matz@6wind.com; dev@dpdk.org
> Subject: RE: [PATCH] net: fix checksum with unaligned buffer
> 
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Tuesday, 21 June 2022 10.23
> >
> > On Tue, Jun 21, 2022 at 10:05:15AM +0200, Morten Brørup wrote:
> > > +TO: @Bruce and @Stephen: You signed off on the 16 bit alignment
> > requirement. We need background info on this.
> > >
> > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > Sent: Tuesday, 21 June 2022 09.17
> > > >
> > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > Sent: den 20 juni 2022 12:58
> > > > >
> > > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > > Sent: Monday, 20 June 2022 12.38
> > > > > >
> > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > Sent: den 17 juni 2022 11:07
> > > > > > >
> > > > > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > > > > Sent: Friday, 17 June 2022 10.45
> > > > > > > >
> > > > > > > > With this patch, the checksum can be calculated on an
> > unaligned
> > > > > > > > part
> > > > > > of
> > > > > > > > a packet buffer.
> > > > > > > > I.e. the buf parameter is no longer required to be 16 bit
> > > > aligned.
> > > > > > > >
> > > > > > > > The DPDK invariant that packet buffers must be 16 bit
> > aligned
> > > > > > remains
> > > > > > > > unchanged.
> > > > > > > > This invariant also defines how to calculate the 16 bit
> > > > checksum
> > > > > > > > on
> > > > > > an
> > > > > > > > unaligned part of a packet buffer.
> > > > > > > >
> > > > > > > > Bugzilla ID: 1035
> > > > > > > > Cc: stable@dpdk.org
> > > > > > > >
> > > > > > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > ---
> > > > > > > >  lib/net/rte_ip.h | 17 +++++++++++++++--
> > > > > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > > > > > > > b502481670..8e301d9c26 100644
> > > > > > > > --- a/lib/net/rte_ip.h
> > > > > > > > +++ b/lib/net/rte_ip.h
> > > > > > > > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf,
> > size_t
> > > > len,
> > > > > > > > uint32_t sum)  {
> > > > > > > >  	/* extend strict-aliasing rules */
> > > > > > > >  	typedef uint16_t __attribute__((__may_alias__))
> > u16_p;
> > > > > > > > -	const u16_p *u16_buf = (const u16_p *)buf;
> > > > > > > > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > > > > > > +	const u16_p *u16_buf;
> > > > > > > > +	const u16_p *end;
> > > > > > > > +
> > > > > > > > +	/* if buffer is unaligned, keeping it byte order
> > > > independent */
> > > > > > > > +	if (unlikely((uintptr_t)buf & 1)) {
> > > > > > > > +		uint16_t first = 0;
> > > > > > > > +		if (unlikely(len == 0))
> > > > > > > > +			return 0;
> > > > > > > > +		((unsigned char *)&first)[1] = *(const unsigned
> > > > > > > char *)buf;
> > > > > > > > +		sum += first;
> > > > > > > > +		buf = (const void *)((uintptr_t)buf + 1);
> > > > > > > > +		len--;
> > > > > > > > +	}
> > > > > > > >
> > > > > > > > +	u16_buf = (const u16_p *)buf;
> > > > > > > > +	end = u16_buf + len / sizeof(*u16_buf);
> > > > > > > >  	for (; u16_buf != end; ++u16_buf)
> > > > > > > >  		sum += *u16_buf;
> > > > > > > >
> > > > > > > > --
> > > > > > > > 2.17.1
> > > > > > >
> > > > > > > @Emil, can you please test this patch with an unaligned
> > buffer on
> > > > > > your
> > > > > > > application to confirm that it produces the expected result.
> > > > > >
> > > > > > Hi!
> > > > > >
> > > > > > I tested the patch. It doesn't seem to produce the same
> > results. I
> > > > > > think the problem is that it always starts summing from an
> > > > > > even address, the sum should always start from the first byte
> > according
> > > > to
> > > > > > the checksum specification. Can I instead propose something
> > Mattias
> > > > > > Rönnblom sent me?
> > > > >
> > > > > I assume that it produces the same result when the "buf"
> > parameter is
> > > > > aligned?
> > > > >
> > > > > And when the "buf" parameter is unaligned, I don't expect it to
> > > > produce the
> > > > > same results as the simple algorithm!
> > > > >
> > > > > This was the whole point of the patch: I expect the overall
> > packet
> > > > buffer to
> > > > > be 16 bit aligned, and the checksum to be a partial checksum of
> > such
> > > > a 16 bit
> > > > > aligned packet buffer. When calling this function, I assume that
> > the
> > > > "buf" and
> > > > > "len" parameters point to a part of such a packet buffer. If
> > these
> > > > > expectations are correct, the simple algorithm will produce
> > incorrect
> > > > results
> > > > > when "buf" is unaligned.
> > > > >
> > > > > I was asking you to test if the checksum on the packet is
> > > > > correct
> > > > when your
> > > > > application modifies an unaligned part of the packet and uses
> > this
> > > > function to
> > > > > update the checksum.
> > > > >
> > > >
> > > > Now I understand your use case. Your use case seems to be about
> > partial
> > > > checksums, of which some partial checksums may start on unaligned
> > > > addresses in an otherwise aligned packet.
> > > >
> > > > Our use case is about calculating the full checksum on a nested
> > packet.
> > > > That nested packet may start on unaligned addresses.
> > > >
> > > > The difference is basically if we want to sum over aligned
> > addresses or
> > > > not, handling the heading and trailing bytes appropriately.
> > > >
> > > > Your method does not work in our case since we want to treat the
> > first
> > > > two bytes as the first word in our case. But I do understand that
> > both
> > > > methods are useful.
> > >
> > > Yes, that certainly are two different use cases, requiring two
> > different ways of calculating the 16 bit checksum.
> > >
> > > >
> > > > Note that your method breaks the API. Previously (assuming no
> > crashing
> > > > due to low optimization levels, more accepting hardware, or a
> > different
> > > > compiler (version)) the current method would calculate the
> > > > checksum assuming the first two bytes is the first word.
> > > >
> > >
> > > Depending on the point of view, my patch either fixes a bug (where
> > the checksum was calculated incorrectly when the buf pointer was
> > unaligned) or breaks the API (by calculating the checksum differently when the
> > buffer is unaligned).
> > >
> > > I cannot say with certainty which one is correct, but perhaps some
> > > of
> > the people with a deeper DPDK track record can...
> > >
> > > @Bruce and @Stephen, in 2019 you signed off on a patch [1]
> > introducing a 16 bit alignment requirement to the Ethernet address
> > structure.
> > >
> > > It is my understanding that DPDK has an invariant requiring packets
> > to be 16 bit aligned, which that patch supports. Is this invariant
> > documented anywhere, or am I completely wrong? If I'm wrong, then the
> > alignment requirement introduced in that patch needs to be removed, as
> > well as any similar alignment requirements elsewhere in DPDK.
> >
> > I don't believe it is explicitly documented as a global invariant, but
> > I think it should be unless there is a definite case where we need to
> > allow packets to be completely unaligned. Across all packet headers we
> > looked at, there was no tunneling protocol where the resulting packet
> > was left unaligned.
> >
> > That said, if there are real use cases where we need to allow packets
> > to start at an unaligned address, then I agree with you that we need
> > to roll back the patch and work to ensure everything works with
> > unaligned addresses.
> >
> > /Bruce
> >
> 
> @Emil, can you please describe or refer to which tunneling protocol you are
> using, where the nested packet can be unaligned?
> 
> I am asking to determine if your use case is exotic (maybe some Ericsson
> proprietary protocol), or more generic (rooted in some standard protocol).
> This information affects the DPDK community's opinion about how it should
> be supported by DPDK.
> 
> If possible, please provide more details about the tunneling protocol and
> nested packets... E.g. do the nested packets also contain Layer 2 (Ethernet,
> VLAN, etc.) headers, or only Layer 3 (IP) or Layer 4 (TCP, UDP, etc.)? And how
> about ARP packets and Layer 2 control protocol packets (STP, LACP, etc.)?
> 

Well, if you append or adjust an odd number of bytes (e.g. a PDCP header) from a previously aligned payload, the entire packet will then be unaligned.
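
For illustration only, here is a minimal sketch of a checksum loop that tolerates such an odd start address. It is not the patch under discussion and not the DPDK implementation; the names raw_cksum_unaligned and fold_cksum are hypothetical. It treats the first two bytes at buf as the first 16-bit word, regardless of alignment, and uses memcpy so that no uint16_t pointer is ever dereferenced at an odd address.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not part of DPDK): sum 16-bit words starting at
 * buf, whatever its alignment. The first two bytes form the first word.
 */
static uint32_t
raw_cksum_unaligned(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;
	uint16_t word;

	while (len >= sizeof(word)) {
		/* memcpy avoids a misaligned uint16_t load */
		memcpy(&word, p, sizeof(word));
		sum += word;
		p += sizeof(word);
		len -= sizeof(word);
	}

	if (len == 1) {
		/* trailing byte: occupies the first byte of a zeroed word */
		word = 0;
		memcpy(&word, p, 1);
		sum += word;
	}

	return sum;
}

/* Fold a 32-bit running sum to 16 bits using end-around carry. */
static uint16_t
fold_cksum(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

With this approach the same bytes give the same sum whether they sit at an even or an odd address, which matches the "full checksum on a nested packet" use case. It does not match the "partial checksum of an aligned packet" use case, where the position of each byte within the enclosing packet's 16-bit words matters.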

> > >
> > > [1]
> > http://git.dpdk.org/dpdk/commit/lib/librte_net/rte_ether.h?id=da5350ef29afd35c1adabe76f60832f3092269ad
> > >
> > > @Emil, we should wait for a conclusion about the alignment invariant
> > before we proceed.
> > >
> > > If there is no such invariant, my patch is wrong, and we need to
> > provide a v2 of the patch, which will then fit your use case.
> > > If there is such an invariant, my patch is correct, and another
> > function must be added for your use case.
> > >


* Re: [PATCH] net: fix checksum with unaligned buffer
  2022-06-22  6:26                             ` Emil Berg
@ 2022-06-22  9:18                               ` Bruce Richardson
  2022-06-22 11:26                                 ` Morten Brørup
  0 siblings, 1 reply; 74+ messages in thread
From: Bruce Richardson @ 2022-06-22  9:18 UTC (permalink / raw)
  To: Emil Berg
  Cc: Morten Brørup, Stephen Hemminger, stable, bugzilla, hofors,
	olivier.matz, dev

On Wed, Jun 22, 2022 at 06:26:07AM +0000, Emil Berg wrote:
> 
> 
> > -----Original Message-----
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: den 21 juni 2022 11:35
> > To: Emil Berg <emil.berg@ericsson.com>
> > Cc: Bruce Richardson <bruce.richardson@intel.com>; Stephen Hemminger
> > <stephen@networkplumber.org>; stable@dpdk.org; bugzilla@dpdk.org;
> > hofors@lysator.liu.se; olivier.matz@6wind.com; dev@dpdk.org
> > Subject: RE: [PATCH] net: fix checksum with unaligned buffer
> > 
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Tuesday, 21 June 2022 10.23
> > >
> > > On Tue, Jun 21, 2022 at 10:05:15AM +0200, Morten Brørup wrote:
> > > > +TO: @Bruce and @Stephen: You signed off on the 16 bit alignment
> > > requirement. We need background info on this.
> > > >
> > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > Sent: Tuesday, 21 June 2022 09.17
> > > > >
> > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > Sent: den 20 juni 2022 12:58
> > > > > >
> > > > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > > > Sent: Monday, 20 June 2022 12.38
> > > > > > >
> > > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > Sent: den 17 juni 2022 11:07
> > > > > > > >
> > > > > > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > > > > > Sent: Friday, 17 June 2022 10.45
> > > > > > > > >
> > > > > > > > > With this patch, the checksum can be calculated on an
> > > unaligned
> > > > > > > > > part
> > > > > > > of
> > > > > > > > > a packet buffer.
> > > > > > > > > I.e. the buf parameter is no longer required to be 16 bit
> > > > > aligned.
> > > > > > > > >
> > > > > > > > > The DPDK invariant that packet buffers must be 16 bit
> > > aligned
> > > > > > > remains
> > > > > > > > > unchanged.
> > > > > > > > > This invariant also defines how to calculate the 16 bit
> > > > > checksum
> > > > > > > > > on
> > > > > > > an
> > > > > > > > > unaligned part of a packet buffer.
> > > > > > > > >
> > > > > > > > > Bugzilla ID: 1035
> > > > > > > > > Cc: stable@dpdk.org
> > > > > > > > >
> > > > > > > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > > ---
> > > > > > > > >  lib/net/rte_ip.h | 17 +++++++++++++++--
> > > > > > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > > > > > > > > b502481670..8e301d9c26 100644
> > > > > > > > > --- a/lib/net/rte_ip.h
> > > > > > > > > +++ b/lib/net/rte_ip.h
> > > > > > > > > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf,
> > > size_t
> > > > > len,
> > > > > > > > > uint32_t sum)  {
> > > > > > > > >  	/* extend strict-aliasing rules */
> > > > > > > > >  	typedef uint16_t __attribute__((__may_alias__))
> > > u16_p;
> > > > > > > > > -	const u16_p *u16_buf = (const u16_p *)buf;
> > > > > > > > > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > > > > > > > +	const u16_p *u16_buf;
> > > > > > > > > +	const u16_p *end;
> > > > > > > > > +
> > > > > > > > > +	/* if buffer is unaligned, keeping it byte order
> > > > > independent */
> > > > > > > > > +	if (unlikely((uintptr_t)buf & 1)) {
> > > > > > > > > +		uint16_t first = 0;
> > > > > > > > > +		if (unlikely(len == 0))
> > > > > > > > > +			return 0;
> > > > > > > > > +		((unsigned char *)&first)[1] = *(const unsigned
> > > > > > > > char *)buf;
> > > > > > > > > +		sum += first;
> > > > > > > > > +		buf = (const void *)((uintptr_t)buf + 1);
> > > > > > > > > +		len--;
> > > > > > > > > +	}
> > > > > > > > >
> > > > > > > > > +	u16_buf = (const u16_p *)buf;
> > > > > > > > > +	end = u16_buf + len / sizeof(*u16_buf);
> > > > > > > > >  	for (; u16_buf != end; ++u16_buf)
> > > > > > > > >  		sum += *u16_buf;
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > 2.17.1
> > > > > > > >
> > > > > > > > @Emil, can you please test this patch with an unaligned
> > > buffer on
> > > > > > > your
> > > > > > > > application to confirm that it produces the expected result.
> > > > > > >
> > > > > > > Hi!
> > > > > > >
> > > > > > > I tested the patch. It doesn't seem to produce the same
> > > results. I
> > > > > > > think the problem is that it always starts summing from an
> > > > > > > even address, the sum should always start from the first byte
> > > according
> > > > > to
> > > > > > > the checksum specification. Can I instead propose something
> > > Mattias
> > > > > > > Rönnblom sent me?
> > > > > >
> > > > > > I assume that it produces the same result when the "buf"
> > > parameter is
> > > > > > aligned?
> > > > > >
> > > > > > And when the "buf" parameter is unaligned, I don't expect it to
> > > > > produce the
> > > > > > same results as the simple algorithm!
> > > > > >
> > > > > > This was the whole point of the patch: I expect the overall
> > > packet
> > > > > buffer to
> > > > > > be 16 bit aligned, and the checksum to be a partial checksum of
> > > such
> > > > > a 16 bit
> > > > > > aligned packet buffer. When calling this function, I assume that
> > > the
> > > > > "buf" and
> > > > > > "len" parameters point to a part of such a packet buffer. If
> > > these
> > > > > > expectations are correct, the simple algorithm will produce
> > > incorrect
> > > > > results
> > > > > > when "buf" is unaligned.
> > > > > >
> > > > > > I was asking you to test if the checksum on the packet is
> > > > > > correct
> > > > > when your
> > > > > > application modifies an unaligned part of the packet and uses
> > > this
> > > > > function to
> > > > > > update the checksum.
> > > > > >
> > > > >
> > > > > Now I understand your use case. Your use case seems to be about
> > > partial
> > > > > checksums, of which some partial checksums may start on unaligned
> > > > > addresses in an otherwise aligned packet.
> > > > >
> > > > > Our use case is about calculating the full checksum on a nested
> > > packet.
> > > > > That nested packet may start on unaligned addresses.
> > > > >
> > > > > The difference is basically if we want to sum over aligned
> > > addresses or
> > > > > not, handling the heading and trailing bytes appropriately.
> > > > >
> > > > > Your method does not work in our case since we want to treat the
> > > first
> > > > > two bytes as the first word in our case. But I do understand that
> > > both
> > > > > methods are useful.
> > > >
> > > > Yes, that certainly are two different use cases, requiring two
> > > different ways of calculating the 16 bit checksum.
> > > >
> > > > >
> > > > > Note that your method breaks the API. Previously (assuming no
> > > crashing
> > > > > due to low optimization levels, more accepting hardware, or a
> > > different
> > > > > compiler (version)) the current method would calculate the
> > > > > checksum assuming the first two bytes is the first word.
> > > > >
> > > >
> > > > Depending on the point of view, my patch either fixes a bug (where
> > > the checksum was calculated incorrectly when the buf pointer was
> > > unaligned) or breaks the API (by calculating the checksum differently when the
> > > buffer is unaligned).
> > > >
> > > > I cannot say with certainty which one is correct, but perhaps some
> > > > of
> > > the people with a deeper DPDK track record can...
> > > >
> > > > @Bruce and @Stephen, in 2019 you signed off on a patch [1]
> > > introducing a 16 bit alignment requirement to the Ethernet address
> > > structure.
> > > >
> > > > It is my understanding that DPDK has an invariant requiring packets
> > > to be 16 bit aligned, which that patch supports. Is this invariant
> > > documented anywhere, or am I completely wrong? If I'm wrong, then the
> > > alignment requirement introduced in that patch needs to be removed, as
> > > well as any similar alignment requirements elsewhere in DPDK.
> > >
> > > I don't believe it is explicitly documented as a global invariant, but
> > > I think it should be unless there is a definite case where we need to
> > > allow packets to be completely unaligned. Across all packet headers we
> > > looked at, there was no tunneling protocol where the resulting packet
> > > was left unaligned.
> > >
> > > That said, if there are real use cases where we need to allow packets
> > > to start at an unaligned address, then I agree with you that we need
> > > to roll back the patch and work to ensure everything works with
> > > unaligned addresses.
> > >
> > > /Bruce
> > >
> > 
> > @Emil, can you please describe or refer to which tunneling protocol you are
> > using, where the nested packet can be unaligned?
> > 
> > I am asking to determine if your use case is exotic (maybe some Ericsson
> > proprietary protocol), or more generic (rooted in some standard protocol).
> > This information affects the DPDK community's opinion about how it should
> > be supported by DPDK.
> > 
> > If possible, please provide more details about the tunneling protocol and
> > nested packets... E.g. do the nested packets also contain Layer 2 (Ethernet,
> > VLAN, etc.) headers, or only Layer 3 (IP) or Layer 4 (TCP, UDP, etc.)? And how
> > about ARP packets and Layer 2 control protocol packets (STP, LACP, etc.)?
> > 
> 
> Well, if you append or adjust an odd number of bytes (e.g. a PDCP header) from a previously aligned payload the entire packet will then be unaligned.
> 

If PDCP headers can leave the rest of the packet field unaligned, then we
had better remove the alignment restrictions throughout DPDK.

/Bruce
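
As a rough sketch of what alignment-tolerant header access could look like once such restrictions are gone: the helper below reads a big-endian 16-bit field byte by byte, so it works at any address. The name read_ether_type and the offset constant are hypothetical and chosen only for illustration; this is not DPDK code.

```c
#include <stdint.h>

/* Hypothetical offset of the 16-bit type field in a 14-byte Ethernet header. */
enum { HDR_ETHER_TYPE_OFFSET = 12 };

/*
 * Hypothetical helper: read the big-endian type field from a header that
 * may start at any address, even or odd. Byte-wise access never performs
 * a misaligned 16-bit load, and the result is in host byte order.
 */
static uint16_t
read_ether_type(const void *hdr)
{
	const unsigned char *p =
		(const unsigned char *)hdr + HDR_ETHER_TYPE_OFFSET;

	return (uint16_t)((p[0] << 8) | p[1]);
}
```

The trade-off is that the compiler can no longer assume 2-byte alignment for these accesses, which is the assumption the 2019 alignment patch was meant to provide.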


* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-22  9:18                               ` Bruce Richardson
@ 2022-06-22 11:26                                 ` Morten Brørup
  2022-06-22 12:25                                   ` Emil Berg
  0 siblings, 1 reply; 74+ messages in thread
From: Morten Brørup @ 2022-06-22 11:26 UTC (permalink / raw)
  To: Bruce Richardson, Emil Berg
  Cc: Stephen Hemminger, stable, bugzilla, hofors, olivier.matz, dev

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Wednesday, 22 June 2022 11.18
> 
> On Wed, Jun 22, 2022 at 06:26:07AM +0000, Emil Berg wrote:
> >
> > > From: Morten Brørup <mb@smartsharesystems.com>
> > > Sent: den 21 juni 2022 11:35
> > >
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Tuesday, 21 June 2022 10.23
> > > >
> > > > On Tue, Jun 21, 2022 at 10:05:15AM +0200, Morten Brørup wrote:
> > > > > +TO: @Bruce and @Stephen: You signed off on the 16 bit
> alignment
> > > > requirement. We need background info on this.
> > > > >
> > > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > > Sent: Tuesday, 21 June 2022 09.17
> > > > > >
> > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > Sent: den 20 juni 2022 12:58
> > > > > > >
> > > > > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > > > > Sent: Monday, 20 June 2022 12.38
> > > > > > > >
> > > > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > > Sent: den 17 juni 2022 11:07
> > > > > > > > >
> > > > > > > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > > > > > > Sent: Friday, 17 June 2022 10.45
> > > > > > > > > >
> > > > > > > > > > With this patch, the checksum can be calculated on an
> > > > unaligned
> > > > > > > > > > part
> > > > > > > > of
> > > > > > > > > > a packet buffer.
> > > > > > > > > > I.e. the buf parameter is no longer required to be 16
> bit
> > > > > > aligned.
> > > > > > > > > >
> > > > > > > > > > The DPDK invariant that packet buffers must be 16 bit
> > > > aligned
> > > > > > > > remains
> > > > > > > > > > unchanged.
> > > > > > > > > > This invariant also defines how to calculate the 16
> bit
> > > > > > checksum
> > > > > > > > > > on
> > > > > > > > an
> > > > > > > > > > unaligned part of a packet buffer.
> > > > > > > > > >
> > > > > > > > > > Bugzilla ID: 1035
> > > > > > > > > > Cc: stable@dpdk.org
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Morten Brørup
> <mb@smartsharesystems.com>
> > > > > > > > > > ---
> > > > > > > > > >  lib/net/rte_ip.h | 17 +++++++++++++++--
> > > > > > > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> index
> > > > > > > > > > b502481670..8e301d9c26 100644
> > > > > > > > > > --- a/lib/net/rte_ip.h
> > > > > > > > > > +++ b/lib/net/rte_ip.h
> > > > > > > > > > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf,
> > > > size_t
> > > > > > len,
> > > > > > > > > > uint32_t sum)  {
> > > > > > > > > >  	/* extend strict-aliasing rules */
> > > > > > > > > >  	typedef uint16_t __attribute__((__may_alias__))
> > > > u16_p;
> > > > > > > > > > -	const u16_p *u16_buf = (const u16_p *)buf;
> > > > > > > > > > -	const u16_p *end = u16_buf + len /
> sizeof(*u16_buf);
> > > > > > > > > > +	const u16_p *u16_buf;
> > > > > > > > > > +	const u16_p *end;
> > > > > > > > > > +
> > > > > > > > > > +	/* if buffer is unaligned, keeping it byte
> order
> > > > > > independent */
> > > > > > > > > > +	if (unlikely((uintptr_t)buf & 1)) {
> > > > > > > > > > +		uint16_t first = 0;
> > > > > > > > > > +		if (unlikely(len == 0))
> > > > > > > > > > +			return 0;
> > > > > > > > > > +		((unsigned char *)&first)[1] = *(const
> unsigned
> > > > > > > > > char *)buf;
> > > > > > > > > > +		sum += first;
> > > > > > > > > > +		buf = (const void *)((uintptr_t)buf + 1);
> > > > > > > > > > +		len--;
> > > > > > > > > > +	}
> > > > > > > > > >
> > > > > > > > > > +	u16_buf = (const u16_p *)buf;
> > > > > > > > > > +	end = u16_buf + len / sizeof(*u16_buf);
> > > > > > > > > >  	for (; u16_buf != end; ++u16_buf)
> > > > > > > > > >  		sum += *u16_buf;
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > 2.17.1
> > > > > > > > >
> > > > > > > > > @Emil, can you please test this patch with an unaligned buffer on
> > > > > > > > > your application to confirm that it produces the expected result.
> > > > > > > >
> > > > > > > > Hi!
> > > > > > > >
> > > > > > > > I tested the patch. It doesn't seem to produce the same results. I
> > > > > > > > think the problem is that it always starts summing from an even
> > > > > > > > address, the sum should always start from the first byte according
> > > > > > > > to the checksum specification. Can I instead propose something
> > > > > > > > Mattias Rönnblom sent me?
> > > > > > >
> > > > > > > I assume that it produces the same result when the "buf" parameter
> > > > > > > is aligned?
> > > > > > >
> > > > > > > And when the "buf" parameter is unaligned, I don't expect it to
> > > > > > > produce the same results as the simple algorithm!
> > > > > > >
> > > > > > > This was the whole point of the patch: I expect the overall packet
> > > > > > > buffer to be 16 bit aligned, and the checksum to be a partial
> > > > > > > checksum of such a 16 bit aligned packet buffer. When calling this
> > > > > > > function, I assume that the "buf" and "len" parameters point to a
> > > > > > > part of such a packet buffer. If these expectations are correct, the
> > > > > > > simple algorithm will produce incorrect results when "buf" is
> > > > > > > unaligned.
> > > > > > >
> > > > > > > I was asking you to test if the checksum on the packet is correct
> > > > > > > when your application modifies an unaligned part of the packet and
> > > > > > > uses this function to update the checksum.
> > > > > > >
> > > > > >
> > > > > > Now I understand your use case. Your use case seems to be about
> > > > > > partial checksums, of which some partial checksums may start on
> > > > > > unaligned addresses in an otherwise aligned packet.
> > > > > >
> > > > > > Our use case is about calculating the full checksum on a nested
> > > > > > packet. That nested packet may start on unaligned addresses.
> > > > > >
> > > > > > The difference is basically if we want to sum over aligned addresses
> > > > > > or not, handling the heading and trailing bytes appropriately.
> > > > > >
> > > > > > Your method does not work in our case since we want to treat the
> > > > > > first two bytes as the first word in our case. But I do understand
> > > > > > that both methods are useful.
> > > > >
> > > > > Yes, those certainly are two different use cases, requiring two
> > > > > different ways of calculating the 16 bit checksum.
> > > > >
> > > > > >
> > > > > > Note that your method breaks the API. Previously (assuming no
> > > > > > crashing due to low optimization levels, more accepting hardware, or
> > > > > > a different compiler (version)) the current method would calculate
> > > > > > the checksum assuming the first two bytes are the first word.
> > > > > >
> > > > >
> > > > > Depending on the point of view, my patch either fixes a bug (where
> > > > > the checksum was calculated incorrectly when the buf pointer was
> > > > > unaligned) or breaks the API (by calculating the checksum differently
> > > > > when the buffer is unaligned).
> > > > >
> > > > > I cannot say with certainty which one is correct, but perhaps some
> > > > > of the people with a deeper DPDK track record can...
> > > > >
> > > > > @Bruce and @Stephen, in 2019 you signed off on a patch [1]
> > > > > introducing a 16 bit alignment requirement to the Ethernet address
> > > > > structure.
> > > > >
> > > > > It is my understanding that DPDK has an invariant requiring packets
> > > > > to be 16 bit aligned, which that patch supports. Is this invariant
> > > > > documented anywhere, or am I completely wrong? If I'm wrong, then the
> > > > > alignment requirement introduced in that patch needs to be removed,
> > > > > as well as any similar alignment requirements elsewhere in DPDK.
> > > >
> > > > I don't believe it is explicitly documented as a global invariant, but
> > > > I think it should be unless there is a definite case where we need to
> > > > allow packets to be completely unaligned. Across all packet headers we
> > > > looked at, there was no tunneling protocol where the resulting packet
> > > > was left unaligned.
> > > >
> > > > That said, if there are real use cases where we need to allow packets
> > > > to start at an unaligned address, then I agree with you that we need
> > > > to roll back the patch and work to ensure everything works with
> > > > unaligned addresses.
> > > >
> > > > /Bruce
> > > >
> > >
> > > @Emil, can you please describe or refer to which tunneling protocol
> > > you are using, where the nested packet can be unaligned?
> > >
> > > I am asking to determine if your use case is exotic (maybe some
> > > Ericsson proprietary protocol), or more generic (rooted in some
> > > standard protocol). This information affects the DPDK community's
> > > opinion about how it should be supported by DPDK.
> > >
> > > If possible, please provide more details about the tunneling protocol
> > > and nested packets... E.g. do the nested packets also contain Layer 2
> > > (Ethernet, VLAN, etc.) headers, or only Layer 3 (IP) or Layer 4 (TCP,
> > > UDP, etc.)? And how about ARP packets and Layer 2 control protocol
> > > packets (STP, LACP, etc.)?
> > >
> >
> > Well, if you append or adjust an odd number of bytes (e.g. a PDCP
> > header) from a previously aligned payload the entire packet will then
> > be unaligned.
> >
> 
> If PDCP headers can leave the rest of the packet field unaligned, then
> we had better remove the alignment restrictions through all of DPDK.
> 
> /Bruce

Re-reading the details regarding unaligned pointers in C11, as posted by Emil in Bugzilla [2], I interpret it as follows: A pointer to any 16 bit or wider type must point to data aligned for that type, i.e. a pointer of the type "uint16_t *" must point to 16 bit aligned data, and a pointer of the type "uint64_t *" must point to 64 bit aligned data. Please, someone tell me I got this wrong, and wake me up from my nightmare!

Updating DPDK's packet structures to fully support this C11 limitation with unaligned access would be a nightmare, as we would need to use byte arrays for all structure fields. Functions would also be unable to use other pointer types than "void *" and "char *", which seems to be the actual problem in the __rte_raw_cksum() function. I guess that it also would prevent the compiler from auto-vectorizing the functions.
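
For the checksum function specifically, one way to stay within the C11 rules while still reading 16 bits at a time is to copy each word out of the buffer with memcpy(), which has no alignment requirement. The following is only a minimal sketch under that assumption, not the DPDK implementation; the name raw_cksum_memcpy() is hypothetical. Compilers commonly turn a fixed-size memcpy() into a plain load, but whether the loop still auto-vectorizes would have to be checked per compiler and target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a DPDK function): sums the buffer as 16 bit
 * words in memory order, starting at the first byte, for any alignment. */
static uint32_t
raw_cksum_memcpy(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		uint16_t v;

		/* memcpy() may read from any address, aligned or not */
		memcpy(&v, p + i, sizeof(v));
		sum += v;
	}

	/* if length is odd, the last byte becomes the first byte of a word */
	if (len & 1) {
		uint16_t left = 0;

		memcpy(&left, p + len - 1, 1);
		sum += left;
	}

	return sum;
}
```

Called on the same bytes at an odd and at an even address, this sketch returns the same value, which is the behaviour Emil asked for.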

I am usually a big proponent of academically correct solutions, but such a change would be too wide ranging, so I would like to narrow it down to the actual use case, and perhaps extrapolate a bit from there.

@Emil: Do you only need to calculate the checksum of the (potentially unaligned) embedded packet? Or do you also need to use other DPDK functions with the embedded packet, potentially accessing it at an unaligned address?

I'm trying to determine the scope of this C11 pointer alignment limitation for your use case, i.e. whether or not other DPDK functions need to be updated to support unaligned packet access too.

[2] https://bugs.dpdk.org/show_bug.cgi?id=1035


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-22 11:26                                 ` Morten Brørup
@ 2022-06-22 12:25                                   ` Emil Berg
  2022-06-22 14:01                                     ` Morten Brørup
  0 siblings, 1 reply; 74+ messages in thread
From: Emil Berg @ 2022-06-22 12:25 UTC (permalink / raw)
  To: Morten Brørup, Bruce Richardson
  Cc: Stephen Hemminger, stable, bugzilla, hofors, olivier.matz, dev



> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: den 22 juni 2022 13:26
> To: Bruce Richardson <bruce.richardson@intel.com>; Emil Berg
> <emil.berg@ericsson.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>;
> stable@dpdk.org; bugzilla@dpdk.org; hofors@lysator.liu.se;
> olivier.matz@6wind.com; dev@dpdk.org
> Subject: RE: [PATCH] net: fix checksum with unaligned buffer
> 
> Re-reading the details regarding unaligned pointers in C11, as posted by Emil
> in Bugzilla [2], I interpret it as follows: Any 16 bit or wider pointer type must
> point to data aligned with that type, i.e. a pointer of the type "uint16_t *"
> must point to 16 bit aligned data, and a pointer of the type "uint64_t *" must
> point to 64 bit aligned data. Please, someone tell me I got this wrong, and
> wake me up from my nightmare!
> 
> Updating DPDK's packet structures to fully support this C11 limitation with
> unaligned access would be a nightmare, as we would need to use byte arrays
> for all structure fields. Functions would also be unable to use other pointer
> types than "void *" and "char *", which seems to be the actual problem in
> the __rte_raw_cksum() function. I guess that it also would prevent the
> compiler from auto-vectorizing the functions.
> 
> I am usually a big proponent of academically correct solutions, but such a
> change would be too wide ranging, so I would like to narrow it down to the
> actual use case, and perhaps extrapolate a bit from there.
> 
> @Emil: Do you only need to calculate the checksum of the (potentially
> unaligned) embedded packet? Or do you also need to use other DPDK
> functions with the embedded packet, potentially accessing it at an unaligned
> address?
> 
> I'm trying to determine the scope of this C11 pointer alignment limitation for
> your use case, i.e. whether or not other DPDK functions need to be updated
> to support unaligned packet access too.
> 
> [2] https://bugs.dpdk.org/show_bug.cgi?id=1035

That's my interpretation of the standard as well. For example, a uint16_t * must point to an even address; otherwise the behavior is undefined. I think this is a bigger problem on ARM, for instance.
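
To illustrate the rule, a function that intends to cast its buf argument to uint16_t * can first test whether the address satisfies the alignment of that type. This is a minimal C11 sketch; is_u16_aligned() is a hypothetical helper, not a DPDK API, and it assumes a platform where converting a pointer to uintptr_t yields the flat address (which is implementation-defined in the standard).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of DPDK): true if p may be converted to
 * "uint16_t *" without violating the alignment requirement of uint16_t. */
static bool
is_u16_aligned(const void *p)
{
	/* assumption: uintptr_t holds the flat address of the object */
	return ((uintptr_t)p % _Alignof(uint16_t)) == 0;
}
```

A caller could use such a test to choose between a fast aligned path and a byte-wise fallback.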

Without being that invested in DPDK, adding unaligned support for everything seems like a big step, but I'm not sure what it entails in practice.

We are actually only interested in the checksumming.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v2] net: fix checksum with unaligned buffer
  2022-06-17  7:32           ` Morten Brørup
  2022-06-17  8:45             ` [PATCH] net: fix checksum with unaligned buffer Morten Brørup
@ 2022-06-22 13:44             ` Morten Brørup
  2022-06-22 13:54             ` [PATCH v3] " Morten Brørup
  2022-06-23 12:39             ` [PATCH v4] " Morten Brørup
  3 siblings, 0 replies; 74+ messages in thread
From: Morten Brørup @ 2022-06-22 13:44 UTC (permalink / raw)
  To: emil.berg, bruce.richardson, dev
  Cc: stephen, stable, bugzilla, hofors, olivier.matz, Morten Brørup

With this patch, the checksum can be calculated on an unaligned buffer.
I.e. the buf parameter is no longer required to be 16 bit aligned.

The checksum is still calculated using a 16 bit aligned pointer, so the
compiler can auto-vectorize the function's inner loop.

When the buffer is unaligned, the first byte of the buffer is handled
separately. Furthermore, the calculated checksum of the buffer is byte
shifted before being added to the initial checksum, to compensate for the
checksum having been calculated on the buffer shifted by one byte.

v2:
* Do not assume that the buffer is part of an aligned packet buffer.

Bugzilla ID: 1035
Cc: stable@dpdk.org

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/net/rte_ip.h | 31 ++++++++++++++++++++++++++-----
 1 file changed, 26 insertions(+), 5 deletions(-)

diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
index b502481670..ad46bdb443 100644
--- a/lib/net/rte_ip.h
+++ b/lib/net/rte_ip.h
@@ -162,20 +162,41 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
 	/* extend strict-aliasing rules */
 	typedef uint16_t __attribute__((__may_alias__)) u16_p;
-	const u16_p *u16_buf = (const u16_p *)buf;
-	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
+	const u16_p *u16_buf;
+	const u16_p *end;
+	uint32_t bsum = 0;
+	const bool unaligned = (uintptr_t)buf & 1;
+
+	/* if buffer is unaligned, keeping it byte order independent */
+	if (unlikely(unaligned)) {
+		uint16_t first = 0;
+		if (unlikely(len == 0))
+			return 0;
+		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
+		bsum += first;
+		buf = (const void *)((uintptr_t)buf + 1);
+		len--;
+	}
 
+	/* aligned access for compiler auto-vectorization */
+	u16_buf = (const u16_p *)buf;
+	end = u16_buf + len / sizeof(*u16_buf);
 	for (; u16_buf != end; ++u16_buf)
-		sum += *u16_buf;
+		bsum += *u16_buf;
 
 	/* if length is odd, keeping it byte order independent */
 	if (unlikely(len % 2)) {
 		uint16_t left = 0;
 		*(unsigned char *)&left = *(const unsigned char *)end;
-		sum += left;
+		bsum += left;
+	}
+
+	/* if buffer is unaligned, swap the checksum bytes */
+	if (unlikely(unaligned)) {
+		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
 	}
 
-	return sum;
+	return sum + bsum;
 }
 
 /**
-- 
2.17.1
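
The byte swap described in the commit message can be checked in isolation. The sketch below is not the patch itself: it uses memcpy() loads and hypothetical function names so that it is valid for any alignment, and it compares (a) a sum that starts a new word at the first byte with (b) a sum that treats the first byte as the second half of a word, sums the rest as words, and then swaps the bytes of each 16 bit half as the patch does. After folding to 16 bits the two should agree, because multiplying by 256 modulo 0xFFFF is a byte swap.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 32 bit one's complement accumulator to 16 bits. */
static uint16_t
fold32(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* (a) Reference: words start at the first byte of the buffer. */
static uint32_t
sum_direct(const unsigned char *p, size_t len)
{
	uint32_t sum = 0;
	uint16_t w;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		memcpy(&w, p + i, 2);
		sum += w;
	}
	if (len & 1) {
		w = 0;
		memcpy(&w, p + len - 1, 1);
		sum += w;
	}
	return sum;
}

/* (b) Same steps as the unaligned path of the patch: first byte goes into
 * the second byte of a word, the rest is summed as words, then swapped. */
static uint32_t
sum_shifted_swapped(const unsigned char *p, size_t len)
{
	uint32_t bsum = 0;
	uint16_t w = 0;
	size_t i;

	if (len == 0)
		return 0;
	memcpy((unsigned char *)&w + 1, p, 1);
	bsum += w;
	p++;
	len--;
	for (i = 0; i + 1 < len; i += 2) {
		memcpy(&w, p + i, 2);
		bsum += w;
	}
	if (len & 1) {
		w = 0;
		memcpy(&w, p + len - 1, 1);
		bsum += w;
	}
	/* swap the bytes within each 16 bit half of the accumulator */
	return (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
}
```

Comparing fold32(sum_direct(p, len)) with fold32(sum_shifted_swapped(p, len)) for a few buffers and lengths is a quick sanity check of the approach.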


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v3] net: fix checksum with unaligned buffer
  2022-06-17  7:32           ` Morten Brørup
  2022-06-17  8:45             ` [PATCH] net: fix checksum with unaligned buffer Morten Brørup
  2022-06-22 13:44             ` [PATCH v2] " Morten Brørup
@ 2022-06-22 13:54             ` Morten Brørup
  2022-06-23 12:39             ` [PATCH v4] " Morten Brørup
  3 siblings, 0 replies; 74+ messages in thread
From: Morten Brørup @ 2022-06-22 13:54 UTC (permalink / raw)
  To: emil.berg, bruce.richardson, dev
  Cc: stephen, stable, bugzilla, hofors, olivier.matz, Morten Brørup

With this patch, the checksum can be calculated on an unaligned buffer.
I.e. the buf parameter is no longer required to be 16 bit aligned.

The checksum is still calculated using a 16 bit aligned pointer, so the
compiler can auto-vectorize the function's inner loop.

When the buffer is unaligned, the first byte of the buffer is handled
separately. Furthermore, the calculated checksum of the buffer is byte
shifted before being added to the initial checksum, to compensate for the
checksum having been calculated on the buffer shifted by one byte.

v3:
* Remove braces from single statement block.
* Fix typo in commit message.
v2:
* Do not assume that the buffer is part of an aligned packet buffer.

Bugzilla ID: 1035
Cc: stable@dpdk.org

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/net/rte_ip.h | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
index b502481670..3fad448085 100644
--- a/lib/net/rte_ip.h
+++ b/lib/net/rte_ip.h
@@ -162,20 +162,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
 	/* extend strict-aliasing rules */
 	typedef uint16_t __attribute__((__may_alias__)) u16_p;
-	const u16_p *u16_buf = (const u16_p *)buf;
-	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
+	const u16_p *u16_buf;
+	const u16_p *end;
+	uint32_t bsum = 0;
+	const bool unaligned = (uintptr_t)buf & 1;
+
+	/* if buffer is unaligned, keeping it byte order independent */
+	if (unlikely(unaligned)) {
+		uint16_t first = 0;
+		if (unlikely(len == 0))
+			return 0;
+		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
+		bsum += first;
+		buf = (const void *)((uintptr_t)buf + 1);
+		len--;
+	}
 
+	/* aligned access for compiler auto-vectorization */
+	u16_buf = (const u16_p *)buf;
+	end = u16_buf + len / sizeof(*u16_buf);
 	for (; u16_buf != end; ++u16_buf)
-		sum += *u16_buf;
+		bsum += *u16_buf;
 
 	/* if length is odd, keeping it byte order independent */
 	if (unlikely(len % 2)) {
 		uint16_t left = 0;
 		*(unsigned char *)&left = *(const unsigned char *)end;
-		sum += left;
+		bsum += left;
 	}
 
-	return sum;
+	/* if buffer is unaligned, swap the checksum bytes */
+	if (unlikely(unaligned))
+		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
+
+	return sum + bsum;
 }
 
 /**
-- 
2.17.1
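
Since v2 the function treats its buf argument as the start of a fresh 16 bit word, which matches the nested-packet use case. A caller with the other use case from this thread (summing one packet in pieces) would then have to byte-swap a partial sum that begins at an odd offset before adding it. The sketch below illustrates that one's complement property only; all names are hypothetical and none of this is DPDK API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 32 bit one's complement accumulator to 16 bits. */
static uint16_t
fold32(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Sum a buffer as 16 bit words starting at its first byte. */
static uint32_t
cksum_bytes(const unsigned char *p, size_t len)
{
	uint32_t sum = 0;
	uint16_t w;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		memcpy(&w, p + i, 2);
		sum += w;
	}
	if (len & 1) {
		w = 0;
		memcpy(&w, p + len - 1, 1);
		sum += w;
	}
	return sum;
}

static uint16_t
swap16(uint16_t v)
{
	return (uint16_t)((v << 8) | (v >> 8));
}

/* Combine the folded sum of part A (len_a bytes) with the folded sum of
 * part B, which directly follows A in the packet. If A has an odd length,
 * B started at an odd offset and its sum must be byte-swapped first. */
static uint16_t
combine_partial(uint16_t sum_a, size_t len_a, uint16_t sum_b)
{
	uint32_t total = sum_a;

	total += (len_a & 1) ? swap16(sum_b) : sum_b;
	return fold32(total);
}
```

For example, splitting a 7 byte buffer after 3 bytes and combining the two partial sums this way gives the same folded value as summing all 7 bytes in one call.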


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-22 12:25                                   ` Emil Berg
@ 2022-06-22 14:01                                     ` Morten Brørup
  2022-06-22 14:03                                       ` Emil Berg
  2022-06-23  5:21                                       ` Emil Berg
  0 siblings, 2 replies; 74+ messages in thread
From: Morten Brørup @ 2022-06-22 14:01 UTC (permalink / raw)
  To: Emil Berg, Bruce Richardson
  Cc: Stephen Hemminger, stable, bugzilla, hofors, olivier.matz, dev

> From: Emil Berg [mailto:emil.berg@ericsson.com]
> Sent: Wednesday, 22 June 2022 14.25
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: den 22 juni 2022 13:26
> >
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Wednesday, 22 June 2022 11.18
> > >
> > > On Wed, Jun 22, 2022 at 06:26:07AM +0000, Emil Berg wrote:
> > > >
> > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > Sent: den 21 juni 2022 11:35
> > > > >
> > > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > > Sent: Tuesday, 21 June 2022 10.23
> > > > > >
> > > > > > On Tue, Jun 21, 2022 at 10:05:15AM +0200, Morten Brørup
> wrote:
> > > > > > > +TO: @Bruce and @Stephen: You signed off on the 16 bit
> > > alignment
> > > > > > requirement. We need background info on this.
> > > > > > >
> > > > > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > > > > Sent: Tuesday, 21 June 2022 09.17
> > > > > > > >
> > > > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > > Sent: den 20 juni 2022 12:58
> > > > > > > > >
> > > > > > > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > > > > > > Sent: Monday, 20 June 2022 12.38
> > > > > > > > > >
> > > > > > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > > > > Sent: den 17 juni 2022 11:07
> > > > > > > > > > >
> > > > > > > > > > > > From: Morten Brørup
> > > > > > > > > > > > [mailto:mb@smartsharesystems.com]
> > > > > > > > > > > > Sent: Friday, 17 June 2022 10.45
> > > > > > > > > > > >
> > > > > > > > > > > > With this patch, the checksum can be calculated
> on
> > > > > > > > > > > > an
> > > > > > unligned
> > > > > > > > > > > > part
> > > > > > > > > > of
> > > > > > > > > > > > a packet buffer.
> > > > > > > > > > > > I.e. the buf parameter is no longer required to
> be
> > > > > > > > > > > > 16
> > > bit
> > > > > > > > aligned.
> > > > > > > > > > > >
> > > > > > > > > > > > The DPDK invariant that packet buffers must be 16
> > > > > > > > > > > > bit
> > > > > > aligned
> > > > > > > > > > remains
> > > > > > > > > > > > unchanged.
> > > > > > > > > > > > This invariant also defines how to calculate the
> 16
> > > bit
> > > > > > > > checksum
> > > > > > > > > > > > on
> > > > > > > > > > an
> > > > > > > > > > > > unaligned part of a packet buffer.
> > > > > > > > > > > >
> > > > > > > > > > > > Bugzilla ID: 1035
> > > > > > > > > > > > Cc: stable@dpdk.org
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  lib/net/rte_ip.h | 17 +++++++++++++++--
> > > > > > > > > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> > > > > > > > > > > > index b502481670..8e301d9c26 100644
> > > > > > > > > > > > --- a/lib/net/rte_ip.h
> > > > > > > > > > > > +++ b/lib/net/rte_ip.h
> > > > > > > > > > > > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> > > > > > > > > > > >  {
> > > > > > > > > > > >  	/* extend strict-aliasing rules */
> > > > > > > > > > > >  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > > > > > > > > > > > -	const u16_p *u16_buf = (const u16_p *)buf;
> > > > > > > > > > > > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > > > > > > > > > > +	const u16_p *u16_buf;
> > > > > > > > > > > > +	const u16_p *end;
> > > > > > > > > > > > +
> > > > > > > > > > > > +	/* if buffer is unaligned, keeping it byte order independent */
> > > > > > > > > > > > +	if (unlikely((uintptr_t)buf & 1)) {
> > > > > > > > > > > > +		uint16_t first = 0;
> > > > > > > > > > > > +		if (unlikely(len == 0))
> > > > > > > > > > > > +			return 0;
> > > > > > > > > > > > +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> > > > > > > > > > > > +		sum += first;
> > > > > > > > > > > > +		buf = (const void *)((uintptr_t)buf + 1);
> > > > > > > > > > > > +		len--;
> > > > > > > > > > > > +	}
> > > > > > > > > > > >
> > > > > > > > > > > > +	u16_buf = (const u16_p *)buf;
> > > > > > > > > > > > +	end = u16_buf + len / sizeof(*u16_buf);
> > > > > > > > > > > >  	for (; u16_buf != end; ++u16_buf)
> > > > > > > > > > > >  		sum += *u16_buf;
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > 2.17.1
> > > > > > > > > > >
> > > > > > > > > > > @Emil, can you please test this patch with an unaligned buffer on your
> > > > > > > > > > > application to confirm that it produces the expected result.
> > > > > > > > > >
> > > > > > > > > > Hi!
> > > > > > > > > >
> > > > > > > > > > I tested the patch. It doesn't seem to produce the same results. I think
> > > > > > > > > > the problem is that it always starts summing from an even address; the sum
> > > > > > > > > > should always start from the first byte according to the checksum
> > > > > > > > > > specification. Can I instead propose something Mattias Rönnblom sent me?
> > > > > > > > >
> > > > > > > > > I assume that it produces the same result when the "buf" parameter is
> > > > > > > > > aligned?
> > > > > > > > >
> > > > > > > > > And when the "buf" parameter is unaligned, I don't expect it to produce the
> > > > > > > > > same results as the simple algorithm!
> > > > > > > > >
> > > > > > > > > This was the whole point of the patch: I expect the overall packet buffer to
> > > > > > > > > be 16 bit aligned, and the checksum to be a partial checksum of such a 16 bit
> > > > > > > > > aligned packet buffer. When calling this function, I assume that the "buf" and
> > > > > > > > > "len" parameters point to a part of such a packet buffer. If these
> > > > > > > > > expectations are correct, the simple algorithm will produce incorrect results
> > > > > > > > > when "buf" is unaligned.
> > > > > > > > >
> > > > > > > > > I was asking you to test if the checksum on the packet is correct when your
> > > > > > > > > application modifies an unaligned part of the packet and uses this function to
> > > > > > > > > update the checksum.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Now I understand your use case. Your use case seems to be about partial
> > > > > > > > checksums, of which some partial checksums may start on unaligned
> > > > > > > > addresses in an otherwise aligned packet.
> > > > > > > >
> > > > > > > > Our use case is about calculating the full checksum on a nested packet.
> > > > > > > > That nested packet may start on unaligned addresses.
> > > > > > > >
> > > > > > > > The difference is basically if we want to sum over aligned addresses or
> > > > > > > > not, handling the heading and trailing bytes appropriately.
> > > > > > > >
> > > > > > > > Your method does not work in our case since we want to treat the first
> > > > > > > > two bytes as the first word in our case. But I do understand that both
> > > > > > > > methods are useful.
> > > > > > >
> > > > > > > Yes, those certainly are two different use cases, requiring two different
> > > > > > > ways of calculating the 16 bit checksum.
> > > > > > >
> > > > > > > >
> > > > > > > > Note that your method breaks the API. Previously (assuming no crashing
> > > > > > > > due to low optimization levels, more accepting hardware, or a different
> > > > > > > > compiler (version)) the current method would calculate the checksum
> > > > > > > > assuming the first two bytes are the first word.
> > > > > > > >
> > > > > > >
> > > > > > > Depending on the point of view, my patch either fixes a bug (where the
> > > > > > > checksum was calculated incorrectly when the buf pointer was unaligned) or
> > > > > > > breaks the API (by calculating the checksum differently when the buffer is
> > > > > > > unaligned).
> > > > > > >
> > > > > > > I cannot say with certainty which one is correct, but perhaps some of the
> > > > > > > people with a deeper DPDK track record can...
> > > > > > >
> > > > > > > @Bruce and @Stephen, in 2019 you signed off on a patch [1] introducing a
> > > > > > > 16 bit alignment requirement to the Ethernet address structure.
> > > > > > >
> > > > > > > It is my understanding that DPDK has an invariant requiring packets to be
> > > > > > > 16 bit aligned, which that patch supports. Is this invariant documented
> > > > > > > anywhere, or am I completely wrong? If I'm wrong, then the alignment
> > > > > > > requirement introduced in that patch needs to be removed, as well as any
> > > > > > > similar alignment requirements elsewhere in DPDK.
> > > > > >
> > > > > > I don't believe it is explicitly documented as a global invariant, but I
> > > > > > think it should be unless there is a definite case where we need to allow
> > > > > > packets to be completely unaligned. Across all packet headers we looked
> > > > > > at, there was no tunneling protocol where the resulting packet was left
> > > > > > unaligned.
> > > > > >
> > > > > > That said, if there are real use cases where we need to allow packets to
> > > > > > start at an unaligned address, then I agree with you that we need to roll
> > > > > > back the patch and work to ensure everything works with unaligned
> > > > > > addresses.
> > > > > >
> > > > > > /Bruce
> > > > > >
> > > > >
> > > > > @Emil, can you please describe or refer to which tunneling protocol you
> > > > > are using, where the nested packet can be unaligned?
> > > > >
> > > > > I am asking to determine if your use case is exotic (maybe some Ericsson
> > > > > proprietary protocol), or more generic (rooted in some standard protocol).
> > > > > This information affects the DPDK community's opinion about how it should
> > > > > be supported by DPDK.
> > > > >
> > > > > If possible, please provide more details about the tunneling protocol and
> > > > > nested packets... E.g. do the nested packets also contain Layer 2
> > > > > (Ethernet, VLAN, etc.) headers, or only Layer 3 (IP) or Layer 4 (TCP, UDP,
> > > > > etc.)? And how about ARP packets and Layer 2 control protocol packets
> > > > > (STP, LACP, etc.)?
> > > > >
> > > >
> > > > Well, if you append or adjust an odd number of bytes (e.g. a PDCP header)
> > > > from a previously aligned payload the entire packet will then be
> > > > unaligned.
> > > >
> > >
> > > If PDCP headers can leave the rest of the packet field unaligned, then we
> > > had better remove the alignment restrictions through all of DPDK.
> > >
> > > /Bruce
> >
> > Re-reading the details regarding unaligned pointers in C11, as posted by
> > Emil in Bugzilla [2], I interpret it as follows: Any pointer to a 16 bit or
> > wider type must point to data aligned with that type, i.e. a pointer of
> > the type "uint16_t *" must point to 16 bit aligned data, and a pointer of
> > the type "uint64_t *" must point to 64 bit aligned data. Please, someone
> > tell me I got this wrong, and wake me up from my nightmare!
> >
> > Updating DPDK's packet structures to fully support this C11 limitation
> > with unaligned access would be a nightmare, as we would need to use byte
> > arrays for all structure fields. Functions would also be unable to use
> > other pointer types than "void *" and "char *", which seems to be the
> > actual problem in the __rte_raw_cksum() function. I guess that it also
> > would prevent the compiler from auto-vectorizing the functions.
> >
> > I am usually a big proponent of academically correct solutions, but such a
> > change would be too wide ranging, so I would like to narrow it down to the
> > actual use case, and perhaps extrapolate a bit from there.
> >
> > @Emil: Do you only need to calculate the checksum of the (potentially
> > unaligned) embedded packet? Or do you also need to use other DPDK
> > functions with the embedded packet, potentially accessing it at an
> > unaligned address?
> >
> > I'm trying to determine the scope of this C11 pointer alignment limitation
> > for your use case, i.e. whether or not other DPDK functions need to be
> > updated to support unaligned packet access too.
> >
> > [2] https://bugs.dpdk.org/show_bug.cgi?id=1035
> 
> That's my interpretation of the standard as well; for example, a
> uint16_t* must be on even addresses. If not, it is undefined behavior. I
> think this is a bigger problem on ARM, for example.
> 
> Without being that invested in dpdk, adding unaligned support for
> everything seems like a steep step, but I'm not sure what it entails in
> practice.
> 
> We are actually only interested in the checksumming.

Great! Then we can cancel the panic about rewriting DPDK Core completely. Although it might still need some review for similar alignment bugs, where we have been forcing the compiler to shut up when it tried to warn us. :-)

I have provided v3 of the patch, which should do as requested - and still allow the compiler to auto-vectorize.

@Emil, will you please test v3 of the patch?
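
For reference, a minimal sketch of one alignment-safe way to sum 16 bit words starting at the first byte of the buffer, which is what Emil's use case asks for. This is an illustration only, not the v3 patch itself: the names raw_cksum_unaligned() and cksum_fold() are hypothetical and not part of the DPDK API. Each word is read with memcpy() from a byte pointer, so no misaligned "uint16_t *" is ever formed. Whether a given compiler still auto-vectorizes this loop is an assumption that has not been verified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not DPDK API): add the 16 bit words of buf to sum,
 * treating the first two bytes as the first word regardless of alignment.
 * The caller keeps len packet-sized so the 32 bit accumulator cannot overflow.
 */
static inline uint32_t
raw_cksum_unaligned(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = (const unsigned char *)buf;
	const unsigned char *end;

	assert(buf != NULL);
	end = p + (len & ~(size_t)1); /* end of the last complete word */

	for (; p != end; p += sizeof(uint16_t)) {
		uint16_t v;

		/* memcpy from a byte pointer has no alignment requirement */
		memcpy(&v, p, sizeof(v));
		sum += v;
	}

	/* trailing odd byte: first byte of a zero-padded word */
	if (len & 1) {
		uint16_t left = 0;

		memcpy(&left, end, 1);
		sum += left;
	}

	return sum;
}

/* Hypothetical helper: fold the 32 bit sum into 16 bits (no complement). */
static inline uint16_t
cksum_fold(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

Called on an odd address, e.g. raw_cksum_unaligned(pkt + 1, len, 0), this treats the two bytes at pkt + 1 as the first word, so the same bytes give the same sum wherever they sit in memory.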



* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-22 14:01                                     ` Morten Brørup
@ 2022-06-22 14:03                                       ` Emil Berg
  2022-06-23  5:21                                       ` Emil Berg
  1 sibling, 0 replies; 74+ messages in thread
From: Emil Berg @ 2022-06-22 14:03 UTC (permalink / raw)
  To: Morten Brørup, Bruce Richardson
  Cc: Stephen Hemminger, stable, bugzilla, hofors, olivier.matz, dev

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: den 22 juni 2022 16:02
> To: Emil Berg <emil.berg@ericsson.com>; Bruce Richardson
> <bruce.richardson@intel.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>;
> stable@dpdk.org; bugzilla@dpdk.org; hofors@lysator.liu.se;
> olivier.matz@6wind.com; dev@dpdk.org
> Subject: RE: [PATCH] net: fix checksum with unaligned buffer
> 
> Great! Then we can cancel the panic about rewriting DPDK Core completely.
> Although it might still need some review for similar alignment bugs, where
> we have been forcing the compiler to shut up when it tried to warn us. :-)
> 
> I have provided v3 of the patch, which should do as requested - and still allow
> the compiler to auto-vectorize.
> 
> @Emil, will you please test v3 of the patch?

I will test the patch tomorrow.


* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-22 14:01                                     ` Morten Brørup
  2022-06-22 14:03                                       ` Emil Berg
@ 2022-06-23  5:21                                       ` Emil Berg
  2022-06-23  7:01                                         ` Morten Brørup
  1 sibling, 1 reply; 74+ messages in thread
From: Emil Berg @ 2022-06-23  5:21 UTC (permalink / raw)
  To: Morten Brørup, Bruce Richardson
  Cc: Stephen Hemminger, stable, bugzilla, hofors, olivier.matz, dev



> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: den 22 juni 2022 16:02
> To: Emil Berg <emil.berg@ericsson.com>; Bruce Richardson
> <bruce.richardson@intel.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>;
> stable@dpdk.org; bugzilla@dpdk.org; hofors@lysator.liu.se;
> olivier.matz@6wind.com; dev@dpdk.org
> Subject: RE: [PATCH] net: fix checksum with unaligned buffer
> 
> > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > Sent: Wednesday, 22 June 2022 14.25
> >
> > > From: Morten Brørup <mb@smartsharesystems.com>
> > > Sent: den 22 juni 2022 13:26
> > >
> > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > Sent: Wednesday, 22 June 2022 11.18
> > > >
> > > > On Wed, Jun 22, 2022 at 06:26:07AM +0000, Emil Berg wrote:
> > > > >
> > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > Sent: den 21 juni 2022 11:35
> > > > > >
> > > > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > > > Sent: Tuesday, 21 June 2022 10.23
> > > > > > >
> > > > > > > On Tue, Jun 21, 2022 at 10:05:15AM +0200, Morten Brørup wrote:
> > > > > > > > +TO: @Bruce and @Stephen: You signed off on the 16 bit alignment
> > > > > > > > requirement. We need background info on this.
> > > > > > > >
> > > > > > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > > > > > Sent: Tuesday, 21 June 2022 09.17
> > > > > > > > >
> > > > > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > > > Sent: den 20 juni 2022 12:58
> > > > > > > > > >
> > > > > > > > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > > > > > > > Sent: Monday, 20 June 2022 12.38
> > > > > > > > > > >
> > > > > > > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > > > > > Sent: den 17 juni 2022 11:07
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > > > > > > > > > > > Sent: Friday, 17 June 2022 10.45
> > > > > > > > > > > > >
> > > > > > > > > > > > > With this patch, the checksum can be calculated on an unaligned part of
> > > > > > > > > > > > > a packet buffer.
> > > > > > > > > > > > > I.e. the buf parameter is no longer required to be 16 bit aligned.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The DPDK invariant that packet buffers must be 16 bit aligned remains
> > > > > > > > > > > > > unchanged.
> > > > > > > > > > > > > This invariant also defines how to calculate the 16 bit checksum on an
> > > > > > > > > > > > > unaligned part of a packet buffer.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Bugzilla ID: 1035
> > > > > > > > > > > > > Cc: stable@dpdk.org
> > > > > > > > > > > > >
> > > > > > > > > > > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >  lib/net/rte_ip.h | 17 +++++++++++++++--
> > > > > > > > > > > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > > > > > > > > > > >
> > > > > > > > > > > > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> > > > > > > > > > > > > index b502481670..8e301d9c26 100644
> > > > > > > > > > > > > --- a/lib/net/rte_ip.h
> > > > > > > > > > > > > +++ b/lib/net/rte_ip.h
> > > > > > > > > > > > > @@ -162,9 +162,22 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> > > > > > > > > > > > >  {
> > > > > > > > > > > > >  	/* extend strict-aliasing rules */
> > > > > > > > > > > > >  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > > > > > > > > > > > > -	const u16_p *u16_buf = (const u16_p *)buf;
> > > > > > > > > > > > > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > > > > > > > > > > > +	const u16_p *u16_buf;
> > > > > > > > > > > > > +	const u16_p *end;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +	/* if buffer is unaligned, keeping it byte order independent */
> > > > > > > > > > > > > +	if (unlikely((uintptr_t)buf & 1)) {
> > > > > > > > > > > > > +		uint16_t first = 0;
> > > > > > > > > > > > > +		if (unlikely(len == 0))
> > > > > > > > > > > > > +			return 0;
> > > > > > > > > > > > > +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> > > > > > > > > > > > > +		sum += first;
> > > > > > > > > > > > > +		buf = (const void *)((uintptr_t)buf + 1);
> > > > > > > > > > > > > +		len--;
> > > > > > > > > > > > > +	}
> > > > > > > > > > > > >
> > > > > > > > > > > > > +	u16_buf = (const u16_p *)buf;
> > > > > > > > > > > > > +	end = u16_buf + len / sizeof(*u16_buf);
> > > > > > > > > > > > >  	for (; u16_buf != end; ++u16_buf)
> > > > > > > > > > > > >  		sum += *u16_buf;
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > 2.17.1
> > > > > > > > > > > >
> > > > > > > > > > > > @Emil, can you please test this patch with an unaligned buffer on your
> > > > > > > > > > > > application to confirm that it produces the expected result.
> > > > > > > > > > >
> > > > > > > > > > > Hi!
> > > > > > > > > > >
> > > > > > > > > > > I tested the patch. It doesn't seem to produce the same results. I think
> > > > > > > > > > > the problem is that it always starts summing from an even address; the sum
> > > > > > > > > > > should always start from the first byte according to the checksum
> > > > > > > > > > > specification. Can I instead propose something Mattias Rönnblom sent me?
> > > > > > > > > >
> > > > > > > > > > I assume that it produces the same result when the
> > "buf"
> > > > > > > parameter is
> > > > > > > > > > aligned?
> > > > > > > > > >
> > > > > > > > > > And when the "buf" parameter is unaligned, I don't
> > expect
> > > > it to
> > > > > > > > > produce the
> > > > > > > > > > same results as the simple algorithm!
> > > > > > > > > >
> > > > > > > > > > This was the whole point of the patch: I expect the
> > > > > > > > > > overall
> > > > > > > packet
> > > > > > > > > buffer to
> > > > > > > > > > be 16 bit aligned, and the checksum to be a partial
> > > > checksum of
> > > > > > > such
> > > > > > > > > a 16 bit
> > > > > > > > > > aligned packet buffer. When calling this function, I
> > > > > > > > > > assume
> > > > that
> > > > > > > the
> > > > > > > > > "buf" and
> > > > > > > > > > "len" parameters point to a part of such a packet
> > buffer.
> > > > If
> > > > > > > these
> > > > > > > > > > expectations are correct, the simple algorithm will
> > > > > > > > > > produce
> > > > > > > incorrect
> > > > > > > > > results
> > > > > > > > > > when "buf" is unaligned.
> > > > > > > > > >
> > > > > > > > > > I was asking you to test if the checksum on the packet
> > is
> > > > > > > > > > correct
> > > > > > > > > when your
> > > > > > > > > > application modifies an unaligned part of the packet
> > and
> > > > uses
> > > > > > > this
> > > > > > > > > function to
> > > > > > > > > > update the checksum.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Now I understand your use case. Your use case seems to
> > > > > > > > > be
> > > > about
> > > > > > > partial
> > > > > > > > > checksums, of which some partial checksums may start on
> > > > unaligned
> > > > > > > > > addresses in an otherwise aligned packet.
> > > > > > > > >
> > > > > > > > > Our use case is about calculating the full checksum on a
> > > > nested
> > > > > > > packet.
> > > > > > > > > That nested packet may start on unaligned addresses.
> > > > > > > > >
> > > > > > > > > The difference is basically if we want to sum over
> > aligned
> > > > > > > addresses or
> > > > > > > > > not, handling the heading and trailing bytes
> > appropriately.
> > > > > > > > >
> > > > > > > > > Your method does not work in our case since we want to
> > treat
> > > > the
> > > > > > > first
> > > > > > > > > two bytes as the first word in our case. But I do
> > understand
> > > > that
> > > > > > > both
> > > > > > > > > methods are useful.
> > > > > > > >
> > > > > > > > Yes, that certainly are two different use cases, requiring
> > two
> > > > > > > different ways of calculating the 16 bit checksum.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Note that your method breaks the API. Previously
> > (assuming
> > > > > > > > > no
> > > > > > > crashing
> > > > > > > > > due to low optimization levels, more accepting hardware,
> > or
> > > > > > > > > a
> > > > > > > different
> > > > > > > > > compiler (version)) the current method would calculate
> > the
> > > > > > > > > checksum assuming the first two bytes is the first word.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Depending on the point of view, my patch either fixes a
> > > > > > > > bug
> > > > (where
> > > > > > > the checksum was calculated incorrectly when the buf pointer
> > was
> > > > > > > unaligned) or breaks the API (by calculating the differently
> > > > > > > when
> > > > the
> > > > > > > buffer is unaligned).
> > > > > > > >
> > > > > > > > I cannot say with certainty which one is correct, but
> > perhaps
> > > > some
> > > > > > > > of
> > > > > > > the people with a deeper DPDK track record can...
> > > > > > > >
> > > > > > > > @Bruce and @Stephen, in 2019 you signed off on a patch [1]
> > > > > > > introducing a 16 bit alignment requirement to the Ethernet
> > > > address
> > > > > > > structure.
> > > > > > > >
> > > > > > > > It is my understanding that DPDK has an invariant
> > > > > > > > requiring
> > > > packets
> > > > > > > to be 16 bit aligned, which that patch supports. Is this
> > > > invariant
> > > > > > > documented anywhere, or am I completely wrong? If I'm wrong,
> > > > > > > then
> > > > the
> > > > > > > alignment requirement introduced in that patch needs to be
> > > > removed, as
> > > > > > > well as any similar alignment requirements elsewhere in DPDK.
> > > > > > >
> > > > > > > I don't believe it is explicitly documented as a global
> > > > invariant, but
> > > > > > > I think it should be unless there is a definite case where
> > > > > > > we
> > > > need to
> > > > > > > allow packets to be completely unaligned. Across all packet
> > > > headers we
> > > > > > > looked at, there was no tunneling protocol where the
> > resulting
> > > > packet
> > > > > > > was left unaligned.
> > > > > > >
> > > > > > > That said, if there are real use cases where we need to
> > > > > > > allow
> > > > packets
> > > > > > > to start at an unaligned address, then I agree with you that
> > we
> > > > need
> > > > > > > to roll back the patch and work to ensure everything works
> > with
> > > > > > > unaligned addresses.
> > > > > > >
> > > > > > > /Bruce
> > > > > > >
> > > > > >
> > > > > > @Emil, can you please describe or refer to which tunneling
> > > > > > protocol
> > > > you are
> > > > > > using, where the nested packet can be unaligned?
> > > > > >
> > > > > > I am asking to determine if your use case is exotic (maybe
> > > > > > some
> > > > Ericsson
> > > > > > proprietary protocol), or more generic (rooted in some
> > > > > > standard
> > > > protocol).
> > > > > > This information affects the DPDK community's opinion about
> > > > > > how
> > it
> > > > should
> > > > > > be supported by DPDK.
> > > > > >
> > > > > > If possible, please provide more details about the tunneling
> > > > protocol and
> > > > > > nested packets... E.g. do the nested packets also contain
> > > > > > Layer
> > 2
> > > > (Ethernet,
> > > > > > VLAN, etc.) headers, or only Layer 3 (IP) or Layer 4 (TCP,
> > > > > > UDP,
> > > > etc.)? And how
> > > > > > about ARP packets and Layer 2 control protocol packets (STP,
> > LACP,
> > > > etc.)?
> > > > > >
> > > > >
> > > > > Well, if you append or adjust an odd number of bytes (e.g. a
> > > > > PDCP
> > > > header) from a previously aligned payload the entire packet will
> > then
> > > > be unaligned.
> > > > >
> > > >
> > > > If PDCP headers can leave the rest of the packet field unaligned,
> > then
> > > > we had better remove the alignment restrictions through all of
> > DPDK.
> > > >
> > > > /Bruce
> > >
> > > Re-reading the details regarding unaligned pointers in C11, as
> > > posted
> > by Emil
> > > in Bugzilla [2], I interpret it as follows: Any 16 bit or wider
> > pointer type a must
> > > point to data aligned with that type, i.e. a pointer of the type
> > "uint16_t *"
> > > must point to 16 bit aligned data, and a pointer of the type
> > "uint64_t *" must
> > > point to 64 bit aligned data. Please, someone tell me I got this
> > wrong, and
> > > wake me up from my nightmare!
> > >
> > > Updating DPDK's packet structures to fully support this C11
> > limitation with
> > > unaligned access would be a nightmare, as we would need to use byte
> > arrays
> > > for all structure fields. Functions would also be unable to use
> > > other
> > pointer
> > > types than "void *" and "char *", which seems to be the actual
> > problem in
> > > the __rte_raw_cksum() function. I guess that it also would prevent
> > the
> > > compiler from auto-vectorizing the functions.
> > >
> > > I am usually a big proponent of academically correct solutions, but
> > such a
> > > change would be too wide ranging, so I would like to narrow it down
> > to the
> > > actual use case, and perhaps extrapolate a bit from there.
> > >
> > > @Emil: Do you only need to calculate the checksum of the
> > > (potentially
> > > unaligned) embedded packet? Or do you also need to use other DPDK
> > > functions with the embedded packet, potentially accessing it at an
> > unaligned
> > > address?
> > >
> > > I'm trying to determine the scope of this C11 pointer alignment
> > limitation for
> > > your use case, i.e. whether or not other DPDK functions need to be
> > updated
> > > to support unaligned packet access too.
> > >
> > > [2]
> > > https://protect2.fireeye.com/v1/url?k=31323334-501cfaf3-313273af-454445554331-2ffe58e5caaeb74e&q=1&e=3f0544d3-8a71-4676-b4f9-27e0952f7de0&u=https%3A%2F%2Fbugs.dpdk.org%2Fshow_bug.cgi%3Fid%3D1035
> >
> > That's my interpretation of the standard as well; For example an
> > uint16_t* must be on even addresses. If not it is undefined behavior.
> > I think this is a bigger problem on ARM for example.
> >
> > Without being that invested in dpdk, adding unaligned support for
> > everything seems like a steep step, but I'm not sure what it entails
> > in practice.
> >
> > We are actually only interested in the checksumming.
> 
> Great! Then we can cancel the panic about rewriting DPDK Core completely.
> Although it might still need some review for similar alignment bugs, where
> we have been forcing the compiler shut up when trying to warn us. :-)
> 
> I have provided v3 of the patch, which should do as requested - and still allow
> the compiler to auto-vectorize.
> 
> @Emil, will you please test v3 of the patch?

It seems to work in these two cases:
* Even address, even length
* Even address, odd length
But it breaks in these two cases:
* Odd address, even length (although it works for small buffers, probably when the sum fits inside a uint16_t integer or something)
* Odd address, odd length
I get (and like) the main idea of the algorithm but haven't yet figured out what the problem is with odd addresses.

/Emil
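
The four address/length cases listed above can be compared against a reference that is safe at any address. What follows is a minimal sketch, not the v3 patch: ref_raw_cksum(), ref_cksum_reduce() and cksum_at_offset() are hypothetical names introduced for illustration. Each 16-bit word is read with memcpy(), so no misaligned uint16_t pointer is ever dereferenced, and bytes are paired starting from the first byte of buf, which is what the nested-packet use case expects.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reference: sum 16-bit words starting at buf[0], any alignment. */
static uint32_t ref_raw_cksum(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;
	uint16_t word;

	for (; len >= sizeof(word); p += sizeof(word), len -= sizeof(word)) {
		memcpy(&word, p, sizeof(word));	/* alignment-safe load */
		sum += word;
	}
	if (len != 0) {
		/* odd length: pad the last byte with a zero byte */
		word = 0;
		memcpy(&word, p, 1);
		sum += word;
	}
	return sum;
}

/* Fold the 32-bit accumulator into 16 bits; two rounds absorb the carry. */
static uint16_t ref_cksum_reduce(uint32_t sum)
{
	sum = (sum >> 16) + (sum & 0xffff);
	sum = (sum >> 16) + (sum & 0xffff);
	return (uint16_t)sum;
}

/* Checksum of len bytes of data after copying them to scratch + offset. */
static uint16_t cksum_at_offset(unsigned char *scratch, size_t offset,
				const unsigned char *data, size_t len)
{
	memcpy(scratch + offset, data, len);
	return ref_cksum_reduce(ref_raw_cksum(scratch + offset, len, 0));
}
```

Calling cksum_at_offset() with offset 0 and offset 1 on the same data, for both an even and an odd len, covers all four cases. A candidate implementation that targets this use case should return the same value at both offsets.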

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-23  5:21                                       ` Emil Berg
@ 2022-06-23  7:01                                         ` Morten Brørup
  2022-06-23 11:39                                           ` Emil Berg
  0 siblings, 1 reply; 74+ messages in thread
From: Morten Brørup @ 2022-06-23  7:01 UTC (permalink / raw)
  To: Emil Berg, Bruce Richardson
  Cc: Stephen Hemminger, stable, bugzilla, hofors, olivier.matz, dev

> From: Emil Berg [mailto:emil.berg@ericsson.com]
> Sent: Thursday, 23 June 2022 07.22
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: den 22 juni 2022 16:02
> >
> > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > Sent: Wednesday, 22 June 2022 14.25
> > >
> > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > Sent: den 22 juni 2022 13:26
> > > >
> > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > Sent: Wednesday, 22 June 2022 11.18
> > > > >
> > > > > On Wed, Jun 22, 2022 at 06:26:07AM +0000, Emil Berg wrote:
> > > > > >
> > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > Sent: den 21 juni 2022 11:35
> > > > > > >
> > > > > > > > From: Bruce Richardson
> [mailto:bruce.richardson@intel.com]
> > > > > > > > Sent: Tuesday, 21 June 2022 10.23
> > > > > > > >
> > > > > > > > On Tue, Jun 21, 2022 at 10:05:15AM +0200, Morten Brørup
> > > wrote:
> > > > > > > > > +TO: @Bruce and @Stephen: You signed off on the 16 bit
> > > > > alignment
> > > > > > > > requirement. We need background info on this.
> > > > > > > > >
> > > > > > > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > > > > > > Sent: Tuesday, 21 June 2022 09.17
> > > > > > > > > >
> > > > > > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > > > > Sent: den 20 juni 2022 12:58
> > > > > > > > > > >
> > > > > > > > > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > > > > > > > > Sent: Monday, 20 June 2022 12.38
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > > > > > > Sent: den 17 juni 2022 11:07
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: Morten Brørup
> > > > > > > > > > > > > > [mailto:mb@smartsharesystems.com]
> > > > > > > > > > > > > > Sent: Friday, 17 June 2022 10.45
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > With this patch, the checksum can be
> calculated
> > > on
> > > > > > > > > > > > > > an
> > > > > > > > unligned
> > > > > > > > > > > > > > part
> > > > > > > > > > > > of
> > > > > > > > > > > > > > a packet buffer.
> > > > > > > > > > > > > > I.e. the buf parameter is no longer required
> to
> > > be
> > > > > > > > > > > > > > 16
> > > > > bit
> > > > > > > > > > aligned.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The DPDK invariant that packet buffers must
> be
> > > > > > > > > > > > > > 16 bit
> > > > > > > > aligned
> > > > > > > > > > > > remains
> > > > > > > > > > > > > > unchanged.
> > > > > > > > > > > > > > This invariant also defines how to calculate
> the
> > > 16
> > > > > bit
> > > > > > > > > > checksum
> > > > > > > > > > > > > > on
> > > > > > > > > > > > an
> > > > > > > > > > > > > > unaligned part of a packet buffer.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Bugzilla ID: 1035
> > > > > > > > > > > > > > Cc: stable@dpdk.org
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Signed-off-by: Morten Brørup
> > > > > <mb@smartsharesystems.com>
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > >  lib/net/rte_ip.h | 17 +++++++++++++++--
> > > > > > > > > > > > > >  1 file changed, 15 insertions(+), 2
> > > > > > > > > > > > > > deletions(-)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/lib/net/rte_ip.h
> b/lib/net/rte_ip.h
> > > > > index
> > > > > > > > > > > > > > b502481670..8e301d9c26 100644
> > > > > > > > > > > > > > --- a/lib/net/rte_ip.h
> > > > > > > > > > > > > > +++ b/lib/net/rte_ip.h
> > > > > > > > > > > > > > @@ -162,9 +162,22 @@ __rte_raw_cksum(const
> void
> > > > > > > > > > > > > > *buf,
> > > > > > > > size_t
> > > > > > > > > > len,
> > > > > > > > > > > > > > uint32_t sum)  {
> > > > > > > > > > > > > >  	/* extend strict-aliasing rules */
> > > > > > > > > > > > > >  	typedef uint16_t
> > > > __attribute__((__may_alias__))
> > > > > > > > u16_p;
> > > > > > > > > > > > > > -	const u16_p *u16_buf = (const u16_p
> *)buf;
> > > > > > > > > > > > > > -	const u16_p *end = u16_buf + len /
> > > > > sizeof(*u16_buf);
> > > > > > > > > > > > > > +	const u16_p *u16_buf;
> > > > > > > > > > > > > > +	const u16_p *end;
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +	/* if buffer is unaligned, keeping it
> byte
> > > > > order
> > > > > > > > > > independent */
> > > > > > > > > > > > > > +	if (unlikely((uintptr_t)buf & 1)) {
> > > > > > > > > > > > > > +		uint16_t first = 0;
> > > > > > > > > > > > > > +		if (unlikely(len == 0))
> > > > > > > > > > > > > > +			return 0;
> > > > > > > > > > > > > > +		((unsigned char *)&first)[1] =
> > > > *(const
> > > > > unsigned
> > > > > > > > > > > > > char *)buf;
> > > > > > > > > > > > > > +		sum += first;
> > > > > > > > > > > > > > +		buf = (const void *)((uintptr_t)buf
> > > > + 1);
> > > > > > > > > > > > > > +		len--;
> > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +	u16_buf = (const u16_p *)buf;
> > > > > > > > > > > > > > +	end = u16_buf + len / sizeof(*u16_buf);
> > > > > > > > > > > > > >  	for (; u16_buf != end; ++u16_buf)
> > > > > > > > > > > > > >  		sum += *u16_buf;
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > 2.17.1
> > > > > > > > > > > > >
> > > > > > > > > > > > > @Emil, can you please test this patch with an
> > > > > > > > > > > > > unaligned
> > > > > > > > buffer on
> > > > > > > > > > > > your
> > > > > > > > > > > > > application to confirm that it produces the
> > > expected
> > > > > result.
> > > > > > > > > > > >
> > > > > > > > > > > > Hi!
> > > > > > > > > > > >
> > > > > > > > > > > > I tested the patch. It doesn't seem to produce
> the
> > > same
> > > > > > > > results. I
> > > > > > > > > > > > think the problem is that it always starts
> summing
> > > from
> > > > > an
> > > > > > > > > > > > even address, the sum should always start from
> the
> > > first
> > > > > byte
> > > > > > > > according
> > > > > > > > > > to
> > > > > > > > > > > > the checksum specification. Can I instead propose
> > > > > something
> > > > > > > > Mattias
> > > > > > > > > > > > Rönnblom sent me?
> > > > > > > > > > >
> > > > > > > > > > > I assume that it produces the same result when the
> > > "buf"
> > > > > > > > parameter is
> > > > > > > > > > > aligned?
> > > > > > > > > > >
> > > > > > > > > > > And when the "buf" parameter is unaligned, I don't
> > > expect
> > > > > it to
> > > > > > > > > > produce the
> > > > > > > > > > > same results as the simple algorithm!
> > > > > > > > > > >
> > > > > > > > > > > This was the whole point of the patch: I expect the
> > > > > > > > > > > overall
> > > > > > > > packet
> > > > > > > > > > buffer to
> > > > > > > > > > > be 16 bit aligned, and the checksum to be a partial
> > > > > checksum of
> > > > > > > > such
> > > > > > > > > > a 16 bit
> > > > > > > > > > > aligned packet buffer. When calling this function,
> I
> > > > > > > > > > > assume
> > > > > that
> > > > > > > > the
> > > > > > > > > > "buf" and
> > > > > > > > > > > "len" parameters point to a part of such a packet
> > > buffer.
> > > > > If
> > > > > > > > these
> > > > > > > > > > > expectations are correct, the simple algorithm will
> > > > > > > > > > > produce
> > > > > > > > incorrect
> > > > > > > > > > results
> > > > > > > > > > > when "buf" is unaligned.
> > > > > > > > > > >
> > > > > > > > > > > I was asking you to test if the checksum on the
> packet
> > > is
> > > > > > > > > > > correct
> > > > > > > > > > when your
> > > > > > > > > > > application modifies an unaligned part of the
> packet
> > > and
> > > > > uses
> > > > > > > > this
> > > > > > > > > > function to
> > > > > > > > > > > update the checksum.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Now I understand your use case. Your use case seems
> to
> > > > > > > > > > be
> > > > > about
> > > > > > > > partial
> > > > > > > > > > checksums, of which some partial checksums may start
> on
> > > > > unaligned
> > > > > > > > > > addresses in an otherwise aligned packet.
> > > > > > > > > >
> > > > > > > > > > Our use case is about calculating the full checksum
> on a
> > > > > nested
> > > > > > > > packet.
> > > > > > > > > > That nested packet may start on unaligned addresses.
> > > > > > > > > >
> > > > > > > > > > The difference is basically if we want to sum over
> > > aligned
> > > > > > > > addresses or
> > > > > > > > > > not, handling the heading and trailing bytes
> > > appropriately.
> > > > > > > > > >
> > > > > > > > > > Your method does not work in our case since we want
> to
> > > treat
> > > > > the
> > > > > > > > first
> > > > > > > > > > two bytes as the first word in our case. But I do
> > > understand
> > > > > that
> > > > > > > > both
> > > > > > > > > > methods are useful.
> > > > > > > > >
> > > > > > > > > Yes, that certainly are two different use cases,
> requiring
> > > two
> > > > > > > > different ways of calculating the 16 bit checksum.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Note that your method breaks the API. Previously
> > > (assuming
> > > > > > > > > > no
> > > > > > > > crashing
> > > > > > > > > > due to low optimization levels, more accepting
> hardware,
> > > or
> > > > > > > > > > a
> > > > > > > > different
> > > > > > > > > > compiler (version)) the current method would
> calculate
> > > the
> > > > > > > > > > checksum assuming the first two bytes is the first
> word.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Depending on the point of view, my patch either fixes a
> > > > > > > > > bug
> > > > > (where
> > > > > > > > the checksum was calculated incorrectly when the buf
> pointer
> > > was
> > > > > > > > unaligned) or breaks the API (by calculating the
> differently
> > > > > > > > when
> > > > > the
> > > > > > > > buffer is unaligned).
> > > > > > > > >
> > > > > > > > > I cannot say with certainty which one is correct, but
> > > perhaps
> > > > > some
> > > > > > > > > of
> > > > > > > > the people with a deeper DPDK track record can...
> > > > > > > > >
> > > > > > > > > @Bruce and @Stephen, in 2019 you signed off on a patch
> [1]
> > > > > > > > introducing a 16 bit alignment requirement to the
> Ethernet
> > > > > address
> > > > > > > > structure.
> > > > > > > > >
> > > > > > > > > It is my understanding that DPDK has an invariant
> > > > > > > > > requiring
> > > > > packets
> > > > > > > > to be 16 bit aligned, which that patch supports. Is this
> > > > > invariant
> > > > > > > > documented anywhere, or am I completely wrong? If I'm
> wrong,
> > > > > > > > then
> > > > > the
> > > > > > > > alignment requirement introduced in that patch needs to
> be
> > > > > removed, as
> > > > > > > > well as any similar alignment requirements elsewhere in
> DPDK.
> > > > > > > >
> > > > > > > > I don't believe it is explicitly documented as a global
> > > > > invariant, but
> > > > > > > > I think it should be unless there is a definite case
> where
> > > > > > > > we
> > > > > need to
> > > > > > > > allow packets to be completely unaligned. Across all
> packet
> > > > > headers we
> > > > > > > > looked at, there was no tunneling protocol where the
> > > resulting
> > > > > packet
> > > > > > > > was left unaligned.
> > > > > > > >
> > > > > > > > That said, if there are real use cases where we need to
> > > > > > > > allow
> > > > > packets
> > > > > > > > to start at an unaligned address, then I agree with you
> that
> > > we
> > > > > need
> > > > > > > > to roll back the patch and work to ensure everything
> works
> > > with
> > > > > > > > unaligned addresses.
> > > > > > > >
> > > > > > > > /Bruce
> > > > > > > >
> > > > > > >
> > > > > > > @Emil, can you please describe or refer to which tunneling
> > > > > > > protocol
> > > > > you are
> > > > > > > using, where the nested packet can be unaligned?
> > > > > > >
> > > > > > > I am asking to determine if your use case is exotic (maybe
> > > > > > > some
> > > > > Ericsson
> > > > > > > proprietary protocol), or more generic (rooted in some
> > > > > > > standard
> > > > > protocol).
> > > > > > > This information affects the DPDK community's opinion about
> > > > > > > how
> > > it
> > > > > should
> > > > > > > be supported by DPDK.
> > > > > > >
> > > > > > > If possible, please provide more details about the
> tunneling
> > > > > protocol and
> > > > > > > nested packets... E.g. do the nested packets also contain
> > > > > > > Layer
> > > 2
> > > > > (Ethernet,
> > > > > > > VLAN, etc.) headers, or only Layer 3 (IP) or Layer 4 (TCP,
> > > > > > > UDP,
> > > > > etc.)? And how
> > > > > > > about ARP packets and Layer 2 control protocol packets
> (STP,
> > > LACP,
> > > > > etc.)?
> > > > > > >
> > > > > >
> > > > > > Well, if you append or adjust an odd number of bytes (e.g. a
> > > > > > PDCP
> > > > > header) from a previously aligned payload the entire packet
> will
> > > then
> > > > > be unaligned.
> > > > > >
> > > > >
> > > > > If PDCP headers can leave the rest of the packet field
> unaligned,
> > > then
> > > > > we had better remove the alignment restrictions through all of
> > > DPDK.
> > > > >
> > > > > /Bruce
> > > >
> > > > Re-reading the details regarding unaligned pointers in C11, as
> > > > posted
> > > by Emil
> > > > in Bugzilla [2], I interpret it as follows: Any 16 bit or wider
> > > pointer type a must
> > > > point to data aligned with that type, i.e. a pointer of the type
> > > "uint16_t *"
> > > > must point to 16 bit aligned data, and a pointer of the type
> > > "uint64_t *" must
> > > > point to 64 bit aligned data. Please, someone tell me I got this
> > > wrong, and
> > > > wake me up from my nightmare!
> > > >
> > > > Updating DPDK's packet structures to fully support this C11
> > > limitation with
> > > > unaligned access would be a nightmare, as we would need to use
> byte
> > > arrays
> > > > for all structure fields. Functions would also be unable to use
> > > > other
> > > pointer
> > > > types than "void *" and "char *", which seems to be the actual
> > > problem in
> > > > the __rte_raw_cksum() function. I guess that it also would
> prevent
> > > the
> > > > compiler from auto-vectorizing the functions.
> > > >
> > > > I am usually a big proponent of academically correct solutions,
> but
> > > such a
> > > > change would be too wide ranging, so I would like to narrow it
> down
> > > to the
> > > > actual use case, and perhaps extrapolate a bit from there.
> > > >
> > > > @Emil: Do you only need to calculate the checksum of the
> > > > (potentially
> > > > unaligned) embedded packet? Or do you also need to use other DPDK
> > > > functions with the embedded packet, potentially accessing it at
> an
> > > unaligned
> > > > address?
> > > >
> > > > I'm trying to determine the scope of this C11 pointer alignment
> > > limitation for
> > > > your use case, i.e. whether or not other DPDK functions need to
> be
> > > updated
> > > > to support unaligned packet access too.
> > > >
> > > > [2]
> > > > https://protect2.fireeye.com/v1/url?k=31323334-501cfaf3-313273af-454445554331-2ffe58e5caaeb74e&q=1&e=3f0544d3-8a71-4676-b4f9-27e0952f7de0&u=https%3A%2F%2Fbugs.dpdk.org%2Fshow_bug.cgi%3Fid%3D1035
> > >
> > > That's my interpretation of the standard as well; For example an
> > > uint16_t* must be on even addresses. If not it is undefined
> behavior.
> > > I think this is a bigger problem on ARM for example.
> > >
> > > Without being that invested in dpdk, adding unaligned support for
> > > everything seems like a steep step, but I'm not sure what it
> entails
> > > in practice.
> > >
> > > We are actually only interested in the checksumming.
> >
> > Great! Then we can cancel the panic about rewriting DPDK Core
> completely.
> > Although it might still need some review for similar alignment bugs,
> where
> > we have been forcing the compiler shut up when trying to warn us. :-)
> >
> > I have provided v3 of the patch, which should do as requested - and
> still allow
> > the compiler to auto-vectorize.
> >
> > @Emil, will you please test v3 of the patch?
> 
> It seems to work in these two cases:
> * Even address, even length
> * Even address, odd length
> But it breaks in these two cases:
> * Odd address, even length (although it works for small buffers,
> probably when the sum fits inside a uint16_t integer or something)

Interesting observation, good analysis.

> * Odd address, odd length

Does this also work for small buffers?

> I get (and like) the main idea of the algorithm but haven't yet figured
> out what the problem is with odd addresses.

I wonder if I messed up the algorithm for swapping back the bytes in bsum after the calculation... Is the checksum also wrong when compiling without optimization?
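
The byte-swap idea rests on the byte order independence of the one's complement sum described in RFC 1071: swapping the bytes of every 16-bit word swaps the bytes of the folded sum. Here is a minimal sketch under stated assumptions. It is not the v3 patch, and fold16(), cksum_from_first_byte() and cksum_shifted_swapped() are hypothetical names. It shows that a sum whose byte pairing is shifted by one byte gives the expected result once it has been folded to 16 bits and then swapped.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold an accumulator into 16 bits with end-around carry. */
static uint16_t fold16(uint64_t sum)
{
	while (sum >> 16)
		sum = (sum >> 16) + (sum & 0xffff);
	return (uint16_t)sum;
}

/* Pairs bytes as (b0,b1), (b2,b3), ...: the result the caller expects. */
static uint16_t cksum_from_first_byte(const unsigned char *p, size_t len)
{
	uint64_t sum = 0;
	uint16_t w;

	for (; len >= 2; p += 2, len -= 2) {
		memcpy(&w, p, 2);
		sum += w;
	}
	if (len) {
		w = 0;
		memcpy(&w, p, 1);	/* trailing byte padded with zero */
		sum += w;
	}
	return fold16(sum);
}

/* Pairs bytes as (0,b0), (b1,b2), ..., then folds and swaps the result. */
static uint16_t cksum_shifted_swapped(const unsigned char *p, size_t len)
{
	uint64_t sum = 0;
	uint16_t w = 0, folded;

	if (len == 0)
		return 0;
	memcpy((unsigned char *)&w + 1, p, 1);	/* b0 becomes the second byte */
	sum += w;
	/* the remaining len - 1 bytes pair up exactly as in the plain sum */
	folded = fold16(sum + cksum_from_first_byte(p + 1, len - 1));
	return (uint16_t)((folded << 8) | (folded >> 8));
}
```

In this sketch the swap is applied to the folded 16-bit value rather than to the 32-bit accumulator, so both functions agree for every length, including sums that overflow 16 bits.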

And just to be sure: The algorithm requires that __rte_raw_cksum_reduce() is also applied to the sum. Please confirm that you call rte_raw_cksum() (or __rte_raw_cksum() followed by __rte_raw_cksum_reduce())?
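
To illustrate why the reduce step matters, here is a minimal sketch of folding a 32-bit accumulator into 16 bits. sketch_cksum_reduce() is a hypothetical stand-in written for this example, not a copy of __rte_raw_cksum_reduce().

```c
#include <stdint.h>

/* Fold a 32-bit one's complement accumulator into 16 bits. */
static uint16_t sketch_cksum_reduce(uint32_t sum)
{
	sum = (sum >> 16) + (sum & 0xffff);	/* first fold; may carry into bit 16 */
	sum = (sum >> 16) + (sum & 0xffff);	/* second fold absorbs that carry */
	return (uint16_t)sum;
}
```

For example, an accumulator of 0x0001ffff folds to 0x0001, whereas plain truncation to 16 bits would give 0xffff. A raw sum that is compared without this step can therefore look wrong as soon as it no longer fits in 16 bits.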

> 
> /Emil


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-23  7:01                                         ` Morten Brørup
@ 2022-06-23 11:39                                           ` Emil Berg
  2022-06-23 12:18                                             ` Morten Brørup
  0 siblings, 1 reply; 74+ messages in thread
From: Emil Berg @ 2022-06-23 11:39 UTC (permalink / raw)
  To: Morten Brørup, Bruce Richardson
  Cc: Stephen Hemminger, stable, bugzilla, hofors, olivier.matz, dev



> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: den 23 juni 2022 09:01
> To: Emil Berg <emil.berg@ericsson.com>; Bruce Richardson
> <bruce.richardson@intel.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>;
> stable@dpdk.org; bugzilla@dpdk.org; hofors@lysator.liu.se;
> olivier.matz@6wind.com; dev@dpdk.org
> Subject: RE: [PATCH] net: fix checksum with unaligned buffer
> 
> > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > Sent: Thursday, 23 June 2022 07.22
> >
> > > From: Morten Brørup <mb@smartsharesystems.com>
> > > Sent: den 22 juni 2022 16:02
> > >
> > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > Sent: Wednesday, 22 June 2022 14.25
> > > >
> > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > Sent: den 22 juni 2022 13:26
> > > > >
> > > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > > Sent: Wednesday, 22 June 2022 11.18
> > > > > >
> > > > > > On Wed, Jun 22, 2022 at 06:26:07AM +0000, Emil Berg wrote:
> > > > > > >
> > > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > Sent: den 21 juni 2022 11:35
> > > > > > > >
> > > > > > > > > From: Bruce Richardson
> > [mailto:bruce.richardson@intel.com]
> > > > > > > > > Sent: Tuesday, 21 June 2022 10.23
> > > > > > > > >
> > > > > > > > > On Tue, Jun 21, 2022 at 10:05:15AM +0200, Morten Brørup
> > > > wrote:
> > > > > > > > > > +TO: @Bruce and @Stephen: You signed off on the 16 bit
> > > > > > alignment
> > > > > > > > > requirement. We need background info on this.
> > > > > > > > > >
> > > > > > > > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > > > > > > > Sent: Tuesday, 21 June 2022 09.17
> > > > > > > > > > >
> > > > > > > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > > > > > Sent: den 20 juni 2022 12:58
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > > > > > > > > > Sent: Monday, 20 June 2022 12.38
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: Morten Brørup
> <mb@smartsharesystems.com>
> > > > > > > > > > > > > > Sent: den 17 juni 2022 11:07
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From: Morten Brørup
> > > > > > > > > > > > > > > [mailto:mb@smartsharesystems.com]
> > > > > > > > > > > > > > > Sent: Friday, 17 June 2022 10.45
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > With this patch, the checksum can be
> > calculated
> > > > on
> > > > > > > > > > > > > > > an
> > > > > > > > > unligned
> > > > > > > > > > > > > > > part
> > > > > > > > > > > > > of
> > > > > > > > > > > > > > > a packet buffer.
> > > > > > > > > > > > > > > I.e. the buf parameter is no longer required
> > to
> > > > be
> > > > > > > > > > > > > > > 16
> > > > > > bit
> > > > > > > > > > > aligned.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The DPDK invariant that packet buffers must
> > be
> > > > > > > > > > > > > > > 16 bit
> > > > > > > > > aligned
> > > > > > > > > > > > > remains
> > > > > > > > > > > > > > > unchanged.
> > > > > > > > > > > > > > > This invariant also defines how to calculate
> > the
> > > > 16
> > > > > > bit
> > > > > > > > > > > checksum
> > > > > > > > > > > > > > > on
> > > > > > > > > > > > > an
> > > > > > > > > > > > > > > unaligned part of a packet buffer.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Bugzilla ID: 1035
> > > > > > > > > > > > > > > Cc: stable@dpdk.org
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Signed-off-by: Morten Brørup
> > > > > > <mb@smartsharesystems.com>
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > >  lib/net/rte_ip.h | 17 +++++++++++++++--
> > > > > > > > > > > > > > >  1 file changed, 15 insertions(+), 2
> > > > > > > > > > > > > > > deletions(-)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > diff --git a/lib/net/rte_ip.h
> > b/lib/net/rte_ip.h
> > > > > > index
> > > > > > > > > > > > > > > b502481670..8e301d9c26 100644
> > > > > > > > > > > > > > > --- a/lib/net/rte_ip.h
> > > > > > > > > > > > > > > +++ b/lib/net/rte_ip.h
> > > > > > > > > > > > > > > @@ -162,9 +162,22 @@ __rte_raw_cksum(const
> > void
> > > > > > > > > > > > > > > *buf,
> > > > > > > > > size_t
> > > > > > > > > > > len,
> > > > > > > > > > > > > > > uint32_t sum)  {
> > > > > > > > > > > > > > >  	/* extend strict-aliasing rules */
> > > > > > > > > > > > > > >  	typedef uint16_t
> > > > > __attribute__((__may_alias__))
> > > > > > > > > u16_p;
> > > > > > > > > > > > > > > -	const u16_p *u16_buf = (const u16_p
> > *)buf;
> > > > > > > > > > > > > > > -	const u16_p *end = u16_buf + len /
> > > > > > sizeof(*u16_buf);
> > > > > > > > > > > > > > > +	const u16_p *u16_buf;
> > > > > > > > > > > > > > > +	const u16_p *end;
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +	/* if buffer is unaligned, keeping it
> > byte
> > > > > > order
> > > > > > > > > > > independent */
> > > > > > > > > > > > > > > +	if (unlikely((uintptr_t)buf & 1)) {
> > > > > > > > > > > > > > > +		uint16_t first = 0;
> > > > > > > > > > > > > > > +		if (unlikely(len == 0))
> > > > > > > > > > > > > > > +			return 0;
> > > > > > > > > > > > > > > +		((unsigned char *)&first)[1] =
> > > > > *(const
> > > > > > unsigned
> > > > > > > > > > > > > > char *)buf;
> > > > > > > > > > > > > > > +		sum += first;
> > > > > > > > > > > > > > > +		buf = (const void *)((uintptr_t)buf
> > > > > + 1);
> > > > > > > > > > > > > > > +		len--;
> > > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > +	u16_buf = (const u16_p *)buf;
> > > > > > > > > > > > > > > +	end = u16_buf + len / sizeof(*u16_buf);
> > > > > > > > > > > > > > >  	for (; u16_buf != end; ++u16_buf)
> > > > > > > > > > > > > > >  		sum += *u16_buf;
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > 2.17.1
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > @Emil, can you please test this patch with an
> > > > > > > > > > > > > > unaligned
> > > > > > > > > buffer on
> > > > > > > > > > > > > your
> > > > > > > > > > > > > > application to confirm that it produces the
> > > > expected
> > > > > > result.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi!
> > > > > > > > > > > > >
> > > > > > > > > > > > > I tested the patch. It doesn't seem to produce
> > the
> > > > same
> > > > > > > > > results. I
> > > > > > > > > > > > > think the problem is that it always starts
> > summing
> > > > from
> > > > > > an
> > > > > > > > > > > > > even address, the sum should always start from
> > the
> > > > first
> > > > > > byte
> > > > > > > > > according
> > > > > > > > > > > to
> > > > > > > > > > > > > the checksum specification. Can I instead
> > > > > > > > > > > > > propose
> > > > > > something
> > > > > > > > > Mattias
> > > > > > > > > > > > > Rönnblom sent me?
> > > > > > > > > > > >
> > > > > > > > > > > > I assume that it produces the same result when the
> > > > "buf"
> > > > > > > > > parameter is
> > > > > > > > > > > > aligned?
> > > > > > > > > > > >
> > > > > > > > > > > > And when the "buf" parameter is unaligned, I don't
> > > > expect
> > > > > > it to
> > > > > > > > > > > produce the
> > > > > > > > > > > > same results as the simple algorithm!
> > > > > > > > > > > >
> > > > > > > > > > > > This was the whole point of the patch: I expect
> > > > > > > > > > > > the overall
> > > > > > > > > packet
> > > > > > > > > > > buffer to
> > > > > > > > > > > > be 16 bit aligned, and the checksum to be a
> > > > > > > > > > > > partial
> > > > > > checksum of
> > > > > > > > > such
> > > > > > > > > > > a 16 bit
> > > > > > > > > > > > aligned packet buffer. When calling this function,
> > I
> > > > > > > > > > > > assume
> > > > > > that
> > > > > > > > > the
> > > > > > > > > > > "buf" and
> > > > > > > > > > > > "len" parameters point to a part of such a packet
> > > > buffer.
> > > > > > If
> > > > > > > > > these
> > > > > > > > > > > > expectations are correct, the simple algorithm
> > > > > > > > > > > > will produce
> > > > > > > > > incorrect
> > > > > > > > > > > results
> > > > > > > > > > > > when "buf" is unaligned.
> > > > > > > > > > > >
> > > > > > > > > > > > I was asking you to test if the checksum on the
> > packet
> > > > is
> > > > > > > > > > > > correct
> > > > > > > > > > > when your
> > > > > > > > > > > > application modifies an unaligned part of the
> > packet
> > > > and
> > > > > > uses
> > > > > > > > > this
> > > > > > > > > > > function to
> > > > > > > > > > > > update the checksum.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Now I understand your use case. Your use case seems
> > to
> > > > > > > > > > > be
> > > > > > about
> > > > > > > > > partial
> > > > > > > > > > > checksums, of which some partial checksums may start
> > on
> > > > > > unaligned
> > > > > > > > > > > addresses in an otherwise aligned packet.
> > > > > > > > > > >
> > > > > > > > > > > Our use case is about calculating the full checksum
> > on a
> > > > > > nested
> > > > > > > > > packet.
> > > > > > > > > > > That nested packet may start on unaligned addresses.
> > > > > > > > > > >
> > > > > > > > > > > The difference is basically if we want to sum over
> > > > aligned
> > > > > > > > > addresses or
> > > > > > > > > > > not, handling the heading and trailing bytes
> > > > appropriately.
> > > > > > > > > > >
> > > > > > > > > > > Your method does not work in our case since we want
> > to
> > > > treat
> > > > > > the
> > > > > > > > > first
> > > > > > > > > > > two bytes as the first word in our case. But I do
> > > > understand
> > > > > > that
> > > > > > > > > both
> > > > > > > > > > > methods are useful.
> > > > > > > > > >
> > > > > > > > > > > Yes, those certainly are two different use cases,
> > requiring
> > > > two
> > > > > > > > > different ways of calculating the 16 bit checksum.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Note that your method breaks the API. Previously
> > > > (assuming
> > > > > > > > > > > no
> > > > > > > > > crashing
> > > > > > > > > > > due to low optimization levels, more accepting
> > hardware,
> > > > or
> > > > > > > > > > > a
> > > > > > > > > different
> > > > > > > > > > > compiler (version)) the current method would
> > calculate
> > > > the
> > > > > > > > > > > > checksum assuming the first two bytes are the first
> > word.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Depending on the point of view, my patch either fixes
> > > > > > > > > > a bug
> > > > > > (where
> > > > > > > > > the checksum was calculated incorrectly when the buf
> > pointer
> > > > was
> > > > > > > > > > unaligned) or breaks the API (by calculating the checksum
> > differently
> > > > > > > > > when
> > > > > > the
> > > > > > > > > buffer is unaligned).
> > > > > > > > > >
> > > > > > > > > > I cannot say with certainty which one is correct, but
> > > > perhaps
> > > > > > some
> > > > > > > > > > of
> > > > > > > > > the people with a deeper DPDK track record can...
> > > > > > > > > >
> > > > > > > > > > @Bruce and @Stephen, in 2019 you signed off on a patch
> > [1]
> > > > > > > > > introducing a 16 bit alignment requirement to the
> > Ethernet
> > > > > > address
> > > > > > > > > structure.
> > > > > > > > > >
> > > > > > > > > > It is my understanding that DPDK has an invariant
> > > > > > > > > > requiring
> > > > > > packets
> > > > > > > > > to be 16 bit aligned, which that patch supports. Is this
> > > > > > invariant
> > > > > > > > > documented anywhere, or am I completely wrong? If I'm
> > wrong,
> > > > > > > > > then
> > > > > > the
> > > > > > > > > alignment requirement introduced in that patch needs to
> > be
> > > > > > removed, as
> > > > > > > > > well as any similar alignment requirements elsewhere in
> > DPDK.
> > > > > > > > >
> > > > > > > > > I don't believe it is explicitly documented as a global
> > > > > > invariant, but
> > > > > > > > > I think it should be unless there is a definite case
> > where
> > > > > > > > > we
> > > > > > need to
> > > > > > > > > allow packets to be completely unaligned. Across all
> > packet
> > > > > > headers we
> > > > > > > > > looked at, there was no tunneling protocol where the
> > > > resulting
> > > > > > packet
> > > > > > > > > was left unaligned.
> > > > > > > > >
> > > > > > > > > That said, if there are real use cases where we need to
> > > > > > > > > allow
> > > > > > packets
> > > > > > > > > to start at an unaligned address, then I agree with you
> > that
> > > > we
> > > > > > need
> > > > > > > > > to roll back the patch and work to ensure everything
> > works
> > > > with
> > > > > > > > > unaligned addresses.
> > > > > > > > >
> > > > > > > > > /Bruce
> > > > > > > > >
> > > > > > > >
> > > > > > > > @Emil, can you please describe or refer to which tunneling
> > > > > > > > protocol
> > > > > > you are
> > > > > > > > using, where the nested packet can be unaligned?
> > > > > > > >
> > > > > > > > I am asking to determine if your use case is exotic (maybe
> > > > > > > > some
> > > > > > Ericsson
> > > > > > > > proprietary protocol), or more generic (rooted in some
> > > > > > > > standard
> > > > > > protocol).
> > > > > > > > This information affects the DPDK community's opinion
> > > > > > > > about how
> > > > it
> > > > > > should
> > > > > > > > be supported by DPDK.
> > > > > > > >
> > > > > > > > If possible, please provide more details about the
> > tunneling
> > > > > > protocol and
> > > > > > > > nested packets... E.g. do the nested packets also contain
> > > > > > > > Layer
> > > > 2
> > > > > > (Ethernet,
> > > > > > > > VLAN, etc.) headers, or only Layer 3 (IP) or Layer 4 (TCP,
> > > > > > > > UDP,
> > > > > > etc.)? And how
> > > > > > > > about ARP packets and Layer 2 control protocol packets
> > (STP,
> > > > LACP,
> > > > > > etc.)?
> > > > > > > >
> > > > > > >
> > > > > > > Well, if you append or adjust an odd number of bytes (e.g. a
> > > > > > > PDCP
> > > > > > header) from a previously aligned payload the entire packet
> > will
> > > > then
> > > > > > be unaligned.
> > > > > > >
> > > > > >
> > > > > > If PDCP headers can leave the rest of the packet field
> > unaligned,
> > > > then
> > > > > > we had better remove the alignment restrictions through all of
> > > > DPDK.
> > > > > >
> > > > > > /Bruce
> > > > >
> > > > > Re-reading the details regarding unaligned pointers in C11, as
> > > > > posted
> > > > by Emil
> > > > > in Bugzilla [2], I interpret it as follows: Any 16 bit or wider
> > > > pointer type must
> > > > > point to data aligned with that type, i.e. a pointer of the type
> > > > "uint16_t *"
> > > > > must point to 16 bit aligned data, and a pointer of the type
> > > > "uint64_t *" must
> > > > > point to 64 bit aligned data. Please, someone tell me I got this
> > > > wrong, and
> > > > > wake me up from my nightmare!
> > > > >
> > > > > Updating DPDK's packet structures to fully support this C11
> > > > limitation with
> > > > > unaligned access would be a nightmare, as we would need to use
> > byte
> > > > arrays
> > > > > for all structure fields. Functions would also be unable to use
> > > > > other
> > > > pointer
> > > > > types than "void *" and "char *", which seems to be the actual
> > > > problem in
> > > > > the __rte_raw_cksum() function. I guess that it also would
> > prevent
> > > > the
> > > > > compiler from auto-vectorizing the functions.
> > > > >
> > > > > I am usually a big proponent of academically correct solutions,
> > but
> > > > such a
> > > > > change would be too wide ranging, so I would like to narrow it
> > down
> > > > to the
> > > > > actual use case, and perhaps extrapolate a bit from there.
> > > > >
> > > > > @Emil: Do you only need to calculate the checksum of the
> > > > > (potentially
> > > > > unaligned) embedded packet? Or do you also need to use other
> > > > > DPDK functions with the embedded packet, potentially accessing
> > > > > it at
> > an
> > > > unaligned
> > > > > address?
> > > > >
> > > > > I'm trying to determine the scope of this C11 pointer alignment
> > > > limitation for
> > > > > your use case, i.e. whether or not other DPDK functions need to
> > be
> > > > updated
> > > > > to support unaligned packet access too.
> > > > >
> > > > > [2]
> > > > > > https://bugs.dpdk.org/show_bug.cgi?id=1035
> > > >
> > > > That's my interpretation of the standard as well; For example an
> > > > uint16_t* must be on even addresses. If not it is undefined
> > behavior.
> > > > I think this is a bigger problem on ARM for example.
> > > >
> > > > Without being that invested in dpdk, adding unaligned support for
> > > > everything seems like a steep step, but I'm not sure what it
> > entails
> > > > in practice.
> > > >
> > > > We are actually only interested in the checksumming.
> > >
> > > Great! Then we can cancel the panic about rewriting DPDK Core
> > completely.
> > > Although it might still need some review for similar alignment bugs,
> > where
> > > we have been forcing the compiler to shut up when it tries to warn us.
> > > :-)
> > >
> > > I have provided v3 of the patch, which should do as requested - and
> > still allow
> > > the compiler to auto-vectorize.
> > >
> > > @Emil, will you please test v3 of the patch?
> >
> > It seems to work in these two cases:
> > * Even address, even length
> > * Even address, odd length
> > But it breaks in these two cases:
> > * Odd address, even length (although it works for small buffers,
> > probably when the sum fits inside a uint16_t integer or something)
> 
> Interesting observation, good analysis.
> 
> > * Odd address, odd length
> 
> Does this also work for small buffers?
> 
> > I get (and like) the main idea of the algorithm but haven't yet
> > figured out what the problem is with odd addresses.
> 
> I wonder if I messed up the algorithm for swapping back the bytes in bsum
> after the calculation... Is the checksum also wrong when compiling without
> optimization?
> 
> And just to be sure: The algorithm requires that __rte_raw_cksum_reduce()
> is also applied to the sum. Please confirm that you call rte_raw_cksum() (or
> __rte_raw_cksum() followed by __rte_raw_cksum_reduce())?
> 

Yes, I messed up. I didn't run the reduction part. When I do, the output seems to be the same.

It seems to be about as fast as the previous algorithm, obviously. Both valgrind and -fsanitize=undefined are happy.

Some minor improvements:
* #include <stdbool.h>?
* Use RTE_PTR_ADD to make the casts cleaner?
* I guess you could skip using 'bsum' and add to 'sum' instead, but that's a matter of preference
* Can't you just do bsum += *(const unsigned char *)buf; to avoid 'first', making it a bit more readable?
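
For reference, here is a minimal sketch of the "sum from the first byte" variant discussed in this thread, followed by the reduction step. It is not the v3 patch and not DPDK code: the names sketch_raw_cksum() and sketch_cksum_reduce() are made up for illustration, and the memcpy() loads are an assumption about one way to avoid casting a possibly odd address to uint16_t *.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not part of DPDK): add the 16-bit words of buf to sum,
 * treating the first two bytes as the first word regardless of alignment.
 * Each word is loaded with memcpy(), so no misaligned uint16_t pointer is
 * ever formed. Like the original, the 32-bit sum can wrap for very large len.
 */
static uint32_t
sketch_raw_cksum(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;

	assert(buf != NULL || len == 0);

	while (len >= sizeof(uint16_t)) {
		uint16_t word;

		memcpy(&word, p, sizeof(word));
		sum += word;
		p += sizeof(word);
		len -= sizeof(word);
	}

	/* A trailing byte is padded with a zero byte after it in memory. */
	if (len == 1) {
		uint16_t last = 0;

		memcpy(&last, p, 1);
		sum += last;
	}

	return sum;
}

/* Hypothetical helper: fold the carries of a 32-bit sum into 16 bits. */
static uint16_t
sketch_cksum_reduce(uint32_t sum)
{
	sum = (sum >> 16) + (sum & 0xffff);
	sum = (sum >> 16) + (sum & 0xffff);
	return (uint16_t)sum;
}
```

The unreduced 32-bit sum and the reduced 16-bit result only coincide while no carry has occurred, which may be why a comparison without the reduction step looks fine for small buffers and then diverges for larger ones.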

> >
> > /Emil


* RE: [PATCH] net: fix checksum with unaligned buffer
  2022-06-23 11:39                                           ` Emil Berg
@ 2022-06-23 12:18                                             ` Morten Brørup
  0 siblings, 0 replies; 74+ messages in thread
From: Morten Brørup @ 2022-06-23 12:18 UTC (permalink / raw)
  To: Emil Berg, Bruce Richardson
  Cc: Stephen Hemminger, stable, bugzilla, hofors, olivier.matz, dev

> From: Emil Berg [mailto:emil.berg@ericsson.com]
> Sent: Thursday, 23 June 2022 13.39
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: den 23 juni 2022 09:01
> >
> > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > Sent: Thursday, 23 June 2022 07.22
> > >
> > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > Sent: den 22 juni 2022 16:02
> > > >
> > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > Sent: Wednesday, 22 June 2022 14.25
> > > > >
> > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > Sent: den 22 juni 2022 13:26
> > > > > >
> > > > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > > > Sent: Wednesday, 22 June 2022 11.18
> > > > > > >
> > > > > > > On Wed, Jun 22, 2022 at 06:26:07AM +0000, Emil Berg wrote:
> > > > > > > >
> > > > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > > Sent: den 21 juni 2022 11:35
> > > > > > > > >
> > > > > > > > > > From: Bruce Richardson
> > > [mailto:bruce.richardson@intel.com]
> > > > > > > > > > Sent: Tuesday, 21 June 2022 10.23
> > > > > > > > > >
> > > > > > > > > > On Tue, Jun 21, 2022 at 10:05:15AM +0200, Morten
> Brørup
> > > > > wrote:
> > > > > > > > > > > +TO: @Bruce and @Stephen: You signed off on the 16
> bit
> > > > > > > alignment
> > > > > > > > > > requirement. We need background info on this.
> > > > > > > > > > >
> > > > > > > > > > > > From: Emil Berg [mailto:emil.berg@ericsson.com]
> > > > > > > > > > > > Sent: Tuesday, 21 June 2022 09.17
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > > > > > > > > > > Sent: den 20 juni 2022 12:58
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: Emil Berg
> [mailto:emil.berg@ericsson.com]
> > > > > > > > > > > > > > Sent: Monday, 20 June 2022 12.38
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From: Morten Brørup
> > <mb@smartsharesystems.com>
> > > > > > > > > > > > > > > Sent: den 17 juni 2022 11:07
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > From: Morten Brørup
> > > > > > > > > > > > > > > > [mailto:mb@smartsharesystems.com]
> > > > > > > > > > > > > > > > Sent: Friday, 17 June 2022 10.45
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > With this patch, the checksum can be
> > > calculated
> > > > > on
> > > > > > > > > > > > > > > > an
> > > > > > > > > > > unaligned
> > > > > > > > > > > > > > > > part
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > a packet buffer.
> > > > > > > > > > > > > > > > I.e. the buf parameter is no longer
> required
> > > to
> > > > > be
> > > > > > > > > > > > > > > > 16
> > > > > > > bit
> > > > > > > > > > > > aligned.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The DPDK invariant that packet buffers
> must
> > > be
> > > > > > > > > > > > > > > > 16 bit
> > > > > > > > > > aligned
> > > > > > > > > > > > > > remains
> > > > > > > > > > > > > > > > unchanged.
> > > > > > > > > > > > > > > > This invariant also defines how to
> calculate
> > > the
> > > > > 16
> > > > > > > bit
> > > > > > > > > > > > checksum
> > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > unaligned part of a packet buffer.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Bugzilla ID: 1035
> > > > > > > > > > > > > > > > Cc: stable@dpdk.org
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Signed-off-by: Morten Brørup
> > > > > > > <mb@smartsharesystems.com>
> > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > >  lib/net/rte_ip.h | 17 +++++++++++++++--
> > > > > > > > > > > > > > > >  1 file changed, 15 insertions(+), 2
> > > > > > > > > > > > > > > > deletions(-)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > diff --git a/lib/net/rte_ip.h
> > > b/lib/net/rte_ip.h
> > > > > > > index
> > > > > > > > > > > > > > > > b502481670..8e301d9c26 100644
> > > > > > > > > > > > > > > > --- a/lib/net/rte_ip.h
> > > > > > > > > > > > > > > > +++ b/lib/net/rte_ip.h
> > > > > > > > > > > > > > > > @@ -162,9 +162,22 @@
> __rte_raw_cksum(const
> > > void
> > > > > > > > > > > > > > > > *buf,
> > > > > > > > > > size_t
> > > > > > > > > > > > len,
> > > > > > > > > > > > > > > > uint32_t sum)  {
> > > > > > > > > > > > > > > >  	/* extend strict-aliasing rules */
> > > > > > > > > > > > > > > >  	typedef uint16_t
> > > > > > __attribute__((__may_alias__))
> > > > > > > > > > u16_p;
> > > > > > > > > > > > > > > > -	const u16_p *u16_buf = (const u16_p
> > > *)buf;
> > > > > > > > > > > > > > > > -	const u16_p *end = u16_buf + len /
> > > > > > > sizeof(*u16_buf);
> > > > > > > > > > > > > > > > +	const u16_p *u16_buf;
> > > > > > > > > > > > > > > > +	const u16_p *end;
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +	/* if buffer is unaligned, keeping
> it
> > > byte
> > > > > > > order
> > > > > > > > > > > > independent */
> > > > > > > > > > > > > > > > +	if (unlikely((uintptr_t)buf & 1)) {
> > > > > > > > > > > > > > > > +		uint16_t first = 0;
> > > > > > > > > > > > > > > > +		if (unlikely(len == 0))
> > > > > > > > > > > > > > > > +			return 0;
> > > > > > > > > > > > > > > > +		((unsigned char *)&first)[1]
> =
> > > > > > *(const
> > > > > > > unsigned
> > > > > > > > > > > > > > > char *)buf;
> > > > > > > > > > > > > > > > +		sum += first;
> > > > > > > > > > > > > > > > +		buf = (const void
> *)((uintptr_t)buf
> > > > > > + 1);
> > > > > > > > > > > > > > > > +		len--;
> > > > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +	u16_buf = (const u16_p *)buf;
> > > > > > > > > > > > > > > > +	end = u16_buf + len /
> sizeof(*u16_buf);
> > > > > > > > > > > > > > > >  	for (; u16_buf != end; ++u16_buf)
> > > > > > > > > > > > > > > >  		sum += *u16_buf;
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > 2.17.1
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > @Emil, can you please test this patch with
> an
> > > > > > > > > > > > > > > unaligned
> > > > > > > > > > buffer on
> > > > > > > > > > > > > > your
> > > > > > > > > > > > > > > application to confirm that it produces the
> > > > > expected
> > > > > > > result.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I tested the patch. It doesn't seem to
> produce
> > > the
> > > > > same
> > > > > > > > > > results. I
> > > > > > > > > > > > > > think the problem is that it always starts
> > > summing
> > > > > from
> > > > > > > an
> > > > > > > > > > > > > > even address, the sum should always start
> from
> > > the
> > > > > first
> > > > > > > byte
> > > > > > > > > > according
> > > > > > > > > > > > to
> > > > > > > > > > > > > > the checksum specification. Can I instead
> > > > > > > > > > > > > > propose
> > > > > > > something
> > > > > > > > > > Mattias
> > > > > > > > > > > > > > Rönnblom sent me?
> > > > > > > > > > > > >
> > > > > > > > > > > > > I assume that it produces the same result when
> the
> > > > > "buf"
> > > > > > > > > > parameter is
> > > > > > > > > > > > > aligned?
> > > > > > > > > > > > >
> > > > > > > > > > > > > And when the "buf" parameter is unaligned, I
> don't
> > > > > expect
> > > > > > > it to
> > > > > > > > > > > > produce the
> > > > > > > > > > > > > same results as the simple algorithm!
> > > > > > > > > > > > >
> > > > > > > > > > > > > This was the whole point of the patch: I expect
> > > > > > > > > > > > > the overall
> > > > > > > > > > packet
> > > > > > > > > > > > buffer to
> > > > > > > > > > > > > be 16 bit aligned, and the checksum to be a
> > > > > > > > > > > > > partial
> > > > > > > checksum of
> > > > > > > > > > such
> > > > > > > > > > > > a 16 bit
> > > > > > > > > > > > > aligned packet buffer. When calling this
> function,
> > > I
> > > > > > > > > > > > > assume
> > > > > > > that
> > > > > > > > > > the
> > > > > > > > > > > > "buf" and
> > > > > > > > > > > > > "len" parameters point to a part of such a
> packet
> > > > > buffer.
> > > > > > > If
> > > > > > > > > > these
> > > > > > > > > > > > > expectations are correct, the simple algorithm
> > > > > > > > > > > > > will produce
> > > > > > > > > > incorrect
> > > > > > > > > > > > results
> > > > > > > > > > > > > when "buf" is unaligned.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I was asking you to test if the checksum on the
> > > packet
> > > > > is
> > > > > > > > > > > > > correct
> > > > > > > > > > > > when your
> > > > > > > > > > > > > application modifies an unaligned part of the
> > > packet
> > > > > and
> > > > > > > uses
> > > > > > > > > > this
> > > > > > > > > > > > function to
> > > > > > > > > > > > > update the checksum.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Now I understand your use case. Your use case
> seems
> > > to
> > > > > > > > > > > > be
> > > > > > > about
> > > > > > > > > > partial
> > > > > > > > > > > > checksums, of which some partial checksums may
> start
> > > on
> > > > > > > unaligned
> > > > > > > > > > > > addresses in an otherwise aligned packet.
> > > > > > > > > > > >
> > > > > > > > > > > > Our use case is about calculating the full
> checksum
> > > on a
> > > > > > > nested
> > > > > > > > > > packet.
> > > > > > > > > > > > That nested packet may start on unaligned
> addresses.
> > > > > > > > > > > >
> > > > > > > > > > > > The difference is basically if we want to sum
> over
> > > > > aligned
> > > > > > > > > > addresses or
> > > > > > > > > > > > not, handling the heading and trailing bytes
> > > > > appropriately.
> > > > > > > > > > > >
> > > > > > > > > > > > Your method does not work in our case since we
> want
> > > to
> > > > > treat
> > > > > > > the
> > > > > > > > > > first
> > > > > > > > > > > > two bytes as the first word in our case. But I do
> > > > > understand
> > > > > > > that
> > > > > > > > > > both
> > > > > > > > > > > > methods are useful.
> > > > > > > > > > >
> > > > > > > > > > > > Yes, those certainly are two different use cases,
> > > requiring
> > > > > two
> > > > > > > > > > different ways of calculating the 16 bit checksum.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Note that your method breaks the API. Previously
> > > > > (assuming
> > > > > > > > > > > > no
> > > > > > > > > > crashing
> > > > > > > > > > > > due to low optimization levels, more accepting
> > > hardware,
> > > > > or
> > > > > > > > > > > > a
> > > > > > > > > > different
> > > > > > > > > > > > compiler (version)) the current method would
> > > calculate
> > > > > the
> > > > > > > > > > > > > checksum assuming the first two bytes are the
> first
> > > word.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Depending on the point of view, my patch either
> fixes
> > > > > > > > > > > a bug
> > > > > > > (where
> > > > > > > > > > the checksum was calculated incorrectly when the buf
> > > pointer
> > > > > was
> > > > > > > > > > > unaligned) or breaks the API (by calculating the checksum
> > > differently
> > > > > > > > > > when
> > > > > > > the
> > > > > > > > > > buffer is unaligned).
> > > > > > > > > > >
> > > > > > > > > > > I cannot say with certainty which one is correct,
> but
> > > > > perhaps
> > > > > > > some
> > > > > > > > > > > of
> > > > > > > > > > the people with a deeper DPDK track record can...
> > > > > > > > > > >
> > > > > > > > > > > @Bruce and @Stephen, in 2019 you signed off on a
> patch
> > > [1]
> > > > > > > > > > introducing a 16 bit alignment requirement to the
> > > Ethernet
> > > > > > > address
> > > > > > > > > > structure.
> > > > > > > > > > >
> > > > > > > > > > > It is my understanding that DPDK has an invariant
> > > > > > > > > > > requiring
> > > > > > > packets
> > > > > > > > > > to be 16 bit aligned, which that patch supports. Is
> this
> > > > > > > invariant
> > > > > > > > > > documented anywhere, or am I completely wrong? If I'm
> > > wrong,
> > > > > > > > > > then
> > > > > > > the
> > > > > > > > > > alignment requirement introduced in that patch needs
> to
> > > be
> > > > > > > removed, as
> > > > > > > > > > well as any similar alignment requirements elsewhere
> in
> > > DPDK.
> > > > > > > > > >
> > > > > > > > > > I don't believe it is explicitly documented as a
> global
> > > > > > > invariant, but
> > > > > > > > > > I think it should be unless there is a definite case
> > > where
> > > > > > > > > > we
> > > > > > > need to
> > > > > > > > > > allow packets to be completely unaligned. Across all
> > > packet
> > > > > > > headers we
> > > > > > > > > > looked at, there was no tunneling protocol where the
> > > > > resulting
> > > > > > > packet
> > > > > > > > > > was left unaligned.
> > > > > > > > > >
> > > > > > > > > > That said, if there are real use cases where we need
> to
> > > > > > > > > > allow
> > > > > > > packets
> > > > > > > > > > to start at an unaligned address, then I agree with
> you
> > > that
> > > > > we
> > > > > > > need
> > > > > > > > > > to roll back the patch and work to ensure everything
> > > works
> > > > > with
> > > > > > > > > > unaligned addresses.
> > > > > > > > > >
> > > > > > > > > > /Bruce
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > @Emil, can you please describe or refer to which
> tunneling
> > > > > > > > > protocol
> > > > > > > you are
> > > > > > > > > using, where the nested packet can be unaligned?
> > > > > > > > >
> > > > > > > > > I am asking to determine if your use case is exotic
> (maybe
> > > > > > > > > some
> > > > > > > Ericsson
> > > > > > > > > proprietary protocol), or more generic (rooted in some
> > > > > > > > > standard
> > > > > > > protocol).
> > > > > > > > > This information affects the DPDK community's opinion
> > > > > > > > > about how
> > > > > it
> > > > > > > should
> > > > > > > > > be supported by DPDK.
> > > > > > > > >
> > > > > > > > > If possible, please provide more details about the
> > > tunneling
> > > > > > > protocol and
> > > > > > > > > nested packets... E.g. do the nested packets also
> contain
> > > > > > > > > Layer
> > > > > 2
> > > > > > > (Ethernet,
> > > > > > > > > VLAN, etc.) headers, or only Layer 3 (IP) or Layer 4
> (TCP,
> > > > > > > > > UDP,
> > > > > > > etc.)? And how
> > > > > > > > > about ARP packets and Layer 2 control protocol packets
> > > (STP,
> > > > > LACP,
> > > > > > > etc.)?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Well, if you append or adjust an odd number of bytes
> (e.g. a
> > > > > > > > PDCP
> > > > > > > header) from a previously aligned payload the entire packet
> > > will
> > > > > then
> > > > > > > be unaligned.
> > > > > > > >
> > > > > > >
> > > > > > > If PDCP headers can leave the rest of the packet field
> > > unaligned,
> > > > > then
> > > > > > > we had better remove the alignment restrictions through all
> of
> > > > > DPDK.
> > > > > > >
> > > > > > > /Bruce
> > > > > >
> > > > > > Re-reading the details regarding unaligned pointers in C11,
> as
> > > > > > posted
> > > > > by Emil
> > > > > > in Bugzilla [2], I interpret it as follows: Any 16 bit or
> wider
> > > > > pointer type must
> > > > > > point to data aligned with that type, i.e. a pointer of the
> type
> > > > > "uint16_t *"
> > > > > > must point to 16 bit aligned data, and a pointer of the type
> > > > > "uint64_t *" must
> > > > > > point to 64 bit aligned data. Please, someone tell me I got
> this
> > > > > wrong, and
> > > > > > wake me up from my nightmare!
> > > > > >
> > > > > > Updating DPDK's packet structures to fully support this C11
> > > > > limitation with
> > > > > > unaligned access would be a nightmare, as we would need to
> use
> > > byte
> > > > > arrays
> > > > > > for all structure fields. Functions would also be unable to
> use
> > > > > > other
> > > > > pointer
> > > > > > types than "void *" and "char *", which seems to be the
> actual
> > > > > problem in
> > > > > > the __rte_raw_cksum() function. I guess that it also would
> > > prevent
> > > > > the
> > > > > > compiler from auto-vectorizing the functions.
> > > > > >
> > > > > > I am usually a big proponent of academically correct
> solutions,
> > > but
> > > > > such a
> > > > > > change would be too wide ranging, so I would like to narrow
> it
> > > down
> > > > > to the
> > > > > > actual use case, and perhaps extrapolate a bit from there.
> > > > > >
> > > > > > @Emil: Do you only need to calculate the checksum of the
> > > > > > (potentially unaligned) embedded packet? Or do you also need to
> > > > > > use other DPDK functions with the embedded packet, potentially
> > > > > > accessing it at an unaligned address?
> > > > > >
> > > > > > I'm trying to determine the scope of this C11 pointer alignment
> > > > > > limitation for your use case, i.e. whether or not other DPDK
> > > > > > functions need to be updated to support unaligned packet access
> > > > > > too.
> > > > > >
> > > > > > [2] https://bugs.dpdk.org/show_bug.cgi?id=1035
> > > > >
> > > > > That's my interpretation of the standard as well; for example, a
> > > > > uint16_t* must be at an even address. If not, it is undefined
> > > > > behavior. I think this is a bigger problem on ARM for example.
> > > > >
> > > > > Without being that invested in dpdk, adding unaligned support for
> > > > > everything seems like a steep step, but I'm not sure what it entails
> > > > > in practice.
> > > > >
> > > > > We are actually only interested in the checksumming.
> > > >
> > > > Great! Then we can cancel the panic about rewriting DPDK Core
> > > > completely. Although it might still need some review for similar
> > > > alignment bugs, where we have been forcing the compiler to shut up
> > > > when trying to warn us. :-)
> > > >
> > > > I have provided v3 of the patch, which should do as requested - and
> > > > still allow the compiler to auto-vectorize.
> > > >
> > > > @Emil, will you please test v3 of the patch?
> > >
> > > It seems to work in these two cases:
> > > * Even address, even length
> > > * Even address, odd length
> > > But it breaks in these two cases:
> > > * Odd address, even length (although it works for small buffers,
> > > probably when the sum fits inside a uint16_t integer or something)
> >
> > Interesting observation, good analysis.
> >
> > > * Odd address, odd length
> >
> > Does this also work for small buffers?
> >
> > > I get (and like) the main idea of the algorithm but haven't yet
> > > figured out what the problem is with odd addresses.
> >
> > I wonder if I messed up the algorithm for swapping back the bytes in
> > bsum after the calculation... Is the checksum also wrong when compiling
> > without optimization?
> >
> > And just to be sure: The algorithm requires that
> > __rte_raw_cksum_reduce() is also applied to the sum. Please confirm that
> > you call rte_raw_cksum() (or __rte_raw_cksum() followed by
> > __rte_raw_cksum_reduce())?
> >
> 
> Yes, I messed up. I didn't run the reduction part. When I do the output
> seems to be the same.

I'm really happy to hear that! Thank you for informing me quickly.
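
For anyone following along, here is a minimal sketch of why the reduction step matters. The helper name cksum_fold is hypothetical; it only mirrors what __rte_raw_cksum_reduce() is described as doing in this thread (folding the 32 bit partial sum into 16 bits) and is not copied from DPDK:

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the reduction step: fold the carries that
 * accumulated in the upper 16 bits back into the lower 16 bits
 * (one's complement "end-around carry").
 */
static inline uint16_t
cksum_fold(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	/* the first fold can itself carry, so fold once more */
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

Two unreduced sums can differ as 32 bit values and still describe the same checksum: cksum_fold(0x00010000) and cksum_fold(0x00000001) both give 1. Comparing unreduced sums therefore only matches as long as no carry has occurred, which would fit the small-buffer observation quoted above.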

> 
> It seems to be about as fast as the previous algorithm, obviously. Both
> valgrind and fsanitize=undefined are happy.
> 
> Some minor improvements:
> * #include <stdbool.h>?

Some include it, some don't.
Looking at other header files, I can't see any pattern, but I agree that it should be included here.

> * Use RTE_PTR_ADD to make the casts cleaner?

Yes. Will do.

> * I guess you could skip using 'bsum' and add to 'sum' instead, but
> that's a matter of preference

No, because we want to keep the byte order of the initial sum intact, and only swap the bytes of the buffer's sum (when the buffer is unaligned).
	
> * Can't you just do bsum += *(const unsigned char *)buf; to avoid
> 'first', making it a bit more readable?

No, because on little endian CPUs, the second byte is the MSB of the uint16_t.
It would only work on big endian CPUs.

I considered doing it differently. But for consistency, I reused the 'left' block's method; only I'm putting the spare byte in the second byte of the 16 bit word, whereas the 'left' block puts the spare byte in the first byte of the 16 bit word.
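
To illustrate the point about byte position, a small sketch (the helper name word_with_byte_at is made up for this example, it is not part of the patch) that places a single byte at a given memory position of an otherwise zero 16 bit word, which is what the spare-byte blocks do with casts:

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: build a 16 bit word whose memory byte at
 * position idx (0 or 1) is b and whose other byte is zero.
 */
static inline uint16_t
word_with_byte_at(unsigned char b, int idx)
{
	unsigned char bytes[2] = { 0, 0 };
	uint16_t w;

	bytes[idx & 1] = b;
	memcpy(&w, bytes, sizeof(w));
	return w;
}
```

On a little endian CPU, word_with_byte_at(b, 1) has the value b << 8, while a plain "bsum += b" adds b at memory position 0. On a big endian CPU it is the other way around, which is why the plain addition would only work there.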


PS: A lot of smart tricks are available for IP checksumming... RFC 1071 is worth a read. :-)
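
As a sketch of the byte order property from RFC 1071 that the byte swap relies on: summing a buffer with the 16 bit pairing shifted by one byte, and then swapping the bytes of the sum, gives the same folded checksum as summing it with the normal pairing. All function names below are hypothetical and the code is a standalone illustration, not the patch itself:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a 32 bit partial sum into 16 bits (end-around carry). */
static uint16_t
fold16(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Sum the buffer as native 16 bit words, pairing bytes (0,1), (2,3), ... */
static uint32_t
sum_pairs(const unsigned char *p, size_t len)
{
	uint32_t sum = 0;
	uint16_t w;

	for (; len >= 2; p += 2, len -= 2) {
		memcpy(&w, p, sizeof(w));
		sum += w;
	}
	if (len == 1) {
		/* odd trailing byte goes into the first byte of a zero word */
		unsigned char last[2] = { p[0], 0 };

		memcpy(&w, last, sizeof(w));
		sum += w;
	}
	return sum;
}

/*
 * Sum the same buffer with the pairing shifted by one byte, i.e.
 * (pad,0), (1,2), (3,4), ..., then swap the bytes of each 16 bit half.
 */
static uint32_t
sum_pairs_shifted(const unsigned char *p, size_t len)
{
	unsigned char first[2] = { 0, 0 };
	uint32_t bsum;
	uint16_t w;

	if (len == 0)
		return 0;
	first[1] = p[0]; /* first byte goes into the second byte of a word */
	memcpy(&w, first, sizeof(w));
	bsum = w + sum_pairs(p + 1, len - 1);
	return (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
}

/* Returns 1 when both ways of summing give the same folded checksum. */
static int
shifted_sum_matches(const void *buf, size_t len)
{
	const unsigned char *p = buf;

	return fold16(sum_pairs(p, len)) == fold16(sum_pairs_shifted(p, len));
}
```

The two unfolded 32 bit sums are generally not equal once carries are involved; they only agree after folding, so the comparison is done on the folded values.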


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-17  7:32           ` Morten Brørup
                               ` (2 preceding siblings ...)
  2022-06-22 13:54             ` [PATCH v3] " Morten Brørup
@ 2022-06-23 12:39             ` Morten Brørup
  2022-06-23 12:51               ` Morten Brørup
                                 ` (2 more replies)
  3 siblings, 3 replies; 74+ messages in thread
From: Morten Brørup @ 2022-06-23 12:39 UTC (permalink / raw)
  To: emil.berg, bruce.richardson, dev
  Cc: stephen, stable, bugzilla, hofors, olivier.matz, Morten Brørup

With this patch, the checksum can be calculated on an unaligned buffer.
I.e. the buf parameter is no longer required to be 16 bit aligned.

The checksum is still calculated using a 16 bit aligned pointer, so the
compiler can auto-vectorize the function's inner loop.

When the buffer is unaligned, the first byte of the buffer is handled
separately. Furthermore, the calculated checksum of the buffer is byte
shifted before being added to the initial checksum, to compensate for the
checksum having been calculated on the buffer shifted by one byte.

v4:
* Add copyright notice.
* Include stdbool.h (Emil Berg).
* Use RTE_PTR_ADD (Emil Berg).
* Fix one more typo in commit message. Is 'unligned' even a word?
v3:
* Remove braces from single statement block.
* Fix typo in commit message.
v2:
* Do not assume that the buffer is part of an aligned packet buffer.

Bugzilla ID: 1035
Cc: stable@dpdk.org

Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
 lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
index b502481670..738d643da0 100644
--- a/lib/net/rte_ip.h
+++ b/lib/net/rte_ip.h
@@ -3,6 +3,7 @@
  *      The Regents of the University of California.
  * Copyright(c) 2010-2014 Intel Corporation.
  * Copyright(c) 2014 6WIND S.A.
+ * Copyright(c) 2022 SmartShare Systems.
  * All rights reserved.
  */
 
@@ -15,6 +16,7 @@
  * IP-related defines
  */
 
+#include <stdbool.h>
 #include <stdint.h>
 
 #ifdef RTE_EXEC_ENV_WINDOWS
@@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
 	/* extend strict-aliasing rules */
 	typedef uint16_t __attribute__((__may_alias__)) u16_p;
-	const u16_p *u16_buf = (const u16_p *)buf;
-	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
+	const u16_p *u16_buf;
+	const u16_p *end;
+	uint32_t bsum = 0;
+	const bool unaligned = (uintptr_t)buf & 1;
+
+	/* if buffer is unaligned, keeping it byte order independent */
+	if (unlikely(unaligned)) {
+		uint16_t first = 0;
+		if (unlikely(len == 0))
+			return 0;
+		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
+		bsum += first;
+		buf = RTE_PTR_ADD(buf, 1);
+		len--;
+	}
 
+	/* aligned access for compiler auto-vectorization */
+	u16_buf = (const u16_p *)buf;
+	end = u16_buf + len / sizeof(*u16_buf);
 	for (; u16_buf != end; ++u16_buf)
-		sum += *u16_buf;
+		bsum += *u16_buf;
 
 	/* if length is odd, keeping it byte order independent */
 	if (unlikely(len % 2)) {
 		uint16_t left = 0;
 		*(unsigned char *)&left = *(const unsigned char *)end;
-		sum += left;
+		bsum += left;
 	}
 
-	return sum;
+	/* if buffer is unaligned, swap the checksum bytes */
+	if (unlikely(unaligned))
+		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
+
+	return sum + bsum;
 }
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-23 12:39             ` [PATCH v4] " Morten Brørup
@ 2022-06-23 12:51               ` Morten Brørup
  2022-06-27  7:56                 ` Emil Berg
  2022-06-27 12:28                 ` Mattias Rönnblom
  2022-06-30 17:41               ` [PATCH v4] net: fix checksum with unaligned buffer Stephen Hemminger
  2022-06-30 17:45               ` Stephen Hemminger
  2 siblings, 2 replies; 74+ messages in thread
From: Morten Brørup @ 2022-06-23 12:51 UTC (permalink / raw)
  To: emil.berg, bruce.richardson, dev
  Cc: stephen, stable, bugzilla, hofors, olivier.matz

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Thursday, 23 June 2022 14.39
> 
> With this patch, the checksum can be calculated on an unaligned buffer.
> I.e. the buf parameter is no longer required to be 16 bit aligned.
> 
> The checksum is still calculated using a 16 bit aligned pointer, so the
> compiler can auto-vectorize the function's inner loop.
> 
> When the buffer is unaligned, the first byte of the buffer is handled
> separately. Furthermore, the calculated checksum of the buffer is byte
> shifted before being added to the initial checksum, to compensate for
> the checksum having been calculated on the buffer shifted by one byte.
> 
> v4:
> * Add copyright notice.
> * Include stdbool.h (Emil Berg).
> * Use RTE_PTR_ADD (Emil Berg).
> * Fix one more typo in commit message. Is 'unligned' even a word?
> v3:
> * Remove braces from single statement block.
> * Fix typo in commit message.
> v2:
> * Do not assume that the buffer is part of an aligned packet buffer.
> 
> Bugzilla ID: 1035
> Cc: stable@dpdk.org
> 
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
>  1 file changed, 27 insertions(+), 5 deletions(-)
> 
> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> index b502481670..738d643da0 100644
> --- a/lib/net/rte_ip.h
> +++ b/lib/net/rte_ip.h
> @@ -3,6 +3,7 @@
>   *      The Regents of the University of California.
>   * Copyright(c) 2010-2014 Intel Corporation.
>   * Copyright(c) 2014 6WIND S.A.
> + * Copyright(c) 2022 SmartShare Systems.
>   * All rights reserved.
>   */
> 
> @@ -15,6 +16,7 @@
>   * IP-related defines
>   */
> 
> +#include <stdbool.h>
>  #include <stdint.h>
> 
>  #ifdef RTE_EXEC_ENV_WINDOWS
> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>  {
>  	/* extend strict-aliasing rules */
>  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> -	const u16_p *u16_buf = (const u16_p *)buf;
> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> +	const u16_p *u16_buf;
> +	const u16_p *end;
> +	uint32_t bsum = 0;
> +	const bool unaligned = (uintptr_t)buf & 1;
> +
> +	/* if buffer is unaligned, keeping it byte order independent */
> +	if (unlikely(unaligned)) {
> +		uint16_t first = 0;
> +		if (unlikely(len == 0))
> +			return 0;
> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> +		bsum += first;
> +		buf = RTE_PTR_ADD(buf, 1);
> +		len--;
> +	}
> 
> +	/* aligned access for compiler auto-vectorization */
> +	u16_buf = (const u16_p *)buf;
> +	end = u16_buf + len / sizeof(*u16_buf);
>  	for (; u16_buf != end; ++u16_buf)
> -		sum += *u16_buf;
> +		bsum += *u16_buf;
> 
>  	/* if length is odd, keeping it byte order independent */
>  	if (unlikely(len % 2)) {
>  		uint16_t left = 0;
>  		*(unsigned char *)&left = *(const unsigned char *)end;
> -		sum += left;
> +		bsum += left;
>  	}
> 
> -	return sum;
> +	/* if buffer is unaligned, swap the checksum bytes */
> +	if (unlikely(unaligned))
> +		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
> +
> +	return sum + bsum;
>  }
> 
>  /**
> --
> 2.17.1

@Emil, thank you for thoroughly reviewing the previous versions.

If your test succeeds and you are satisfied with the patch, remember to reply with a "Tested-by" tag for patchwork.


^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-23 12:51               ` Morten Brørup
@ 2022-06-27  7:56                 ` Emil Berg
  2022-06-27 10:54                   ` Morten Brørup
  2022-06-27 12:28                 ` Mattias Rönnblom
  1 sibling, 1 reply; 74+ messages in thread
From: Emil Berg @ 2022-06-27  7:56 UTC (permalink / raw)
  To: Morten Brørup, bruce.richardson, dev
  Cc: stephen, stable, bugzilla, hofors, olivier.matz



> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: den 23 juni 2022 14:51
> To: Emil Berg <emil.berg@ericsson.com>; bruce.richardson@intel.com;
> dev@dpdk.org
> Cc: stephen@networkplumber.org; stable@dpdk.org; bugzilla@dpdk.org;
> hofors@lysator.liu.se; olivier.matz@6wind.com
> Subject: RE: [PATCH v4] net: fix checksum with unaligned buffer
> 
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Thursday, 23 June 2022 14.39
> >
> > With this patch, the checksum can be calculated on an unaligned buffer.
> > I.e. the buf parameter is no longer required to be 16 bit aligned.
> >
> > The checksum is still calculated using a 16 bit aligned pointer, so
> > the compiler can auto-vectorize the function's inner loop.
> >
> > When the buffer is unaligned, the first byte of the buffer is handled
> > separately. Furthermore, the calculated checksum of the buffer is byte
> > shifted before being added to the initial checksum, to compensate for
> > the checksum having been calculated on the buffer shifted by one byte.
> >
> > v4:
> > * Add copyright notice.
> > * Include stdbool.h (Emil Berg).
> > * Use RTE_PTR_ADD (Emil Berg).
> > * Fix one more typo in commit message. Is 'unligned' even a word?
> > v3:
> > * Remove braces from single statement block.
> > * Fix typo in commit message.
> > v2:
> > * Do not assume that the buffer is part of an aligned packet buffer.
> >
> > Bugzilla ID: 1035
> > Cc: stable@dpdk.org
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > Tested-by: Emil Berg <emil.berg@ericsson.com>
> > ---
> >  lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
> >  1 file changed, 27 insertions(+), 5 deletions(-)
> >
> > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> > index b502481670..738d643da0 100644
> > --- a/lib/net/rte_ip.h
> > +++ b/lib/net/rte_ip.h
> > @@ -3,6 +3,7 @@
> >   *      The Regents of the University of California.
> >   * Copyright(c) 2010-2014 Intel Corporation.
> >   * Copyright(c) 2014 6WIND S.A.
> > + * Copyright(c) 2022 SmartShare Systems.
> >   * All rights reserved.
> >   */
> >
> > @@ -15,6 +16,7 @@
> >   * IP-related defines
> >   */
> >
> > +#include <stdbool.h>
> >  #include <stdint.h>
> >
> >  #ifdef RTE_EXEC_ENV_WINDOWS
> > @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> >  {
> >  	/* extend strict-aliasing rules */
> >  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > -	const u16_p *u16_buf = (const u16_p *)buf;
> > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > +	const u16_p *u16_buf;
> > +	const u16_p *end;
> > +	uint32_t bsum = 0;
> > +	const bool unaligned = (uintptr_t)buf & 1;
> > +
> > +	/* if buffer is unaligned, keeping it byte order independent */
> > +	if (unlikely(unaligned)) {
> > +		uint16_t first = 0;
> > +		if (unlikely(len == 0))
> > +			return 0;
> > +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> > +		bsum += first;
> > +		buf = RTE_PTR_ADD(buf, 1);
> > +		len--;
> > +	}
> >
> > +	/* aligned access for compiler auto-vectorization */
> > +	u16_buf = (const u16_p *)buf;
> > +	end = u16_buf + len / sizeof(*u16_buf);
> >  	for (; u16_buf != end; ++u16_buf)
> > -		sum += *u16_buf;
> > +		bsum += *u16_buf;
> >
> >  	/* if length is odd, keeping it byte order independent */
> >  	if (unlikely(len % 2)) {
> >  		uint16_t left = 0;
> >  		*(unsigned char *)&left = *(const unsigned char *)end;
> > -		sum += left;
> > +		bsum += left;
> >  	}
> >
> > -	return sum;
> > +	/* if buffer is unaligned, swap the checksum bytes */
> > +	if (unlikely(unaligned))
> > +		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
> > +
> > +	return sum + bsum;
> >  }
> >
> >  /**
> > --
> > 2.17.1
> 
> @Emil, thank you for thoroughly reviewing the previous versions.
> 
> If your test succeeds and you are satisfied with the patch, remember to reply
> with a "Tested-by" tag for patchwork.

The test succeeded and I'm satisfied with the patch. I added 'Tested-by: Emil Berg <emil.berg@ericsson.com>' to the patch above, hopefully as you intended.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-27  7:56                 ` Emil Berg
@ 2022-06-27 10:54                   ` Morten Brørup
  0 siblings, 0 replies; 74+ messages in thread
From: Morten Brørup @ 2022-06-27 10:54 UTC (permalink / raw)
  To: Emil Berg, bruce.richardson, dev
  Cc: stephen, stable, bugzilla, hofors, olivier.matz

> From: Emil Berg [mailto:emil.berg@ericsson.com]
> Sent: Monday, 27 June 2022 09.57
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: den 23 juni 2022 14:51
> >
> > > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > Sent: Thursday, 23 June 2022 14.39
> > >
> > > With this patch, the checksum can be calculated on an unaligned
> > > buffer.
> > > I.e. the buf parameter is no longer required to be 16 bit aligned.
> > >
> > > The checksum is still calculated using a 16 bit aligned pointer, so
> > > the compiler can auto-vectorize the function's inner loop.
> > >
> > > When the buffer is unaligned, the first byte of the buffer is
> > > handled separately. Furthermore, the calculated checksum of the
> > > buffer is byte shifted before being added to the initial checksum, to
> > > compensate for the checksum having been calculated on the buffer
> > > shifted by one byte.
> > >
> > > v4:
> > > * Add copyright notice.
> > > * Include stdbool.h (Emil Berg).
> > > * Use RTE_PTR_ADD (Emil Berg).
> > > * Fix one more typo in commit message. Is 'unligned' even a word?
> > > v3:
> > > * Remove braces from single statement block.
> > > * Fix typo in commit message.
> > > v2:
> > > * Do not assume that the buffer is part of an aligned packet buffer.
> > >
> > > Bugzilla ID: 1035
> > > Cc: stable@dpdk.org
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > Tested-by: Emil Berg <emil.berg@ericsson.com>
> > > ---
> > >  lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
> > >  1 file changed, 27 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> > > index b502481670..738d643da0 100644
> > > --- a/lib/net/rte_ip.h
> > > +++ b/lib/net/rte_ip.h
> > > @@ -3,6 +3,7 @@
> > >   *      The Regents of the University of California.
> > >   * Copyright(c) 2010-2014 Intel Corporation.
> > >   * Copyright(c) 2014 6WIND S.A.
> > > + * Copyright(c) 2022 SmartShare Systems.
> > >   * All rights reserved.
> > >   */
> > >
> > > @@ -15,6 +16,7 @@
> > >   * IP-related defines
> > >   */
> > >
> > > +#include <stdbool.h>
> > >  #include <stdint.h>
> > >
> > >  #ifdef RTE_EXEC_ENV_WINDOWS
> > > @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> > >  {
> > >  	/* extend strict-aliasing rules */
> > >  	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > > -	const u16_p *u16_buf = (const u16_p *)buf;
> > > -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > +	const u16_p *u16_buf;
> > > +	const u16_p *end;
> > > +	uint32_t bsum = 0;
> > > +	const bool unaligned = (uintptr_t)buf & 1;
> > > +
> > > +	/* if buffer is unaligned, keeping it byte order independent */
> > > +	if (unlikely(unaligned)) {
> > > +		uint16_t first = 0;
> > > +		if (unlikely(len == 0))
> > > +			return 0;
> > > +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> > > +		bsum += first;
> > > +		buf = RTE_PTR_ADD(buf, 1);
> > > +		len--;
> > > +	}
> > >
> > > +	/* aligned access for compiler auto-vectorization */
> > > +	u16_buf = (const u16_p *)buf;
> > > +	end = u16_buf + len / sizeof(*u16_buf);
> > >  	for (; u16_buf != end; ++u16_buf)
> > > -		sum += *u16_buf;
> > > +		bsum += *u16_buf;
> > >
> > >  	/* if length is odd, keeping it byte order independent */
> > >  	if (unlikely(len % 2)) {
> > >  		uint16_t left = 0;
> > >  		*(unsigned char *)&left = *(const unsigned char *)end;
> > > -		sum += left;
> > > +		bsum += left;
> > >  	}
> > >
> > > -	return sum;
> > > +	/* if buffer is unaligned, swap the checksum bytes */
> > > +	if (unlikely(unaligned))
> > > +		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
> > > +
> > > +	return sum + bsum;
> > >  }
> > >
> > >  /**
> > > --
> > > 2.17.1
> >
> > @Emil, thank you for thoroughly reviewing the previous versions.
> >
> > If your test succeeds and you are satisfied with the patch, remember
> > to reply with a "Tested-by" tag for patchwork.
> 
> The test succeeded and I'm satisfied with the patch. I added
> 'Tested-by: Emil Berg <emil.berg@ericsson.com>' to the patch above,
> hopefully as you intended.

Thank you for testing! You don't need to put the tag inside the patch. Next time, just put the tag in your reply, like here, and Patchwork will catch it.

Tested-by: Emil Berg <emil.berg@ericsson.com>

The same goes for other tags, like Acked-by, Reviewed-by, etc.

-Morten

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-23 12:51               ` Morten Brørup
  2022-06-27  7:56                 ` Emil Berg
@ 2022-06-27 12:28                 ` Mattias Rönnblom
  2022-06-27 12:46                   ` Emil Berg
  1 sibling, 1 reply; 74+ messages in thread
From: Mattias Rönnblom @ 2022-06-27 12:28 UTC (permalink / raw)
  To: Morten Brørup, emil.berg, bruce.richardson, dev
  Cc: stephen, stable, bugzilla, olivier.matz

On 2022-06-23 14:51, Morten Brørup wrote:
>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
>> Sent: Thursday, 23 June 2022 14.39
>>
>> With this patch, the checksum can be calculated on an unaligned buffer.
>> I.e. the buf parameter is no longer required to be 16 bit aligned.
>>
>> The checksum is still calculated using a 16 bit aligned pointer, so the
>> compiler can auto-vectorize the function's inner loop.
>>
>> When the buffer is unaligned, the first byte of the buffer is handled
>> separately. Furthermore, the calculated checksum of the buffer is byte
>> shifted before being added to the initial checksum, to compensate for
>> the checksum having been calculated on the buffer shifted by one byte.
>>
>> v4:
>> * Add copyright notice.
>> * Include stdbool.h (Emil Berg).
>> * Use RTE_PTR_ADD (Emil Berg).
>> * Fix one more typo in commit message. Is 'unligned' even a word?
>> v3:
>> * Remove braces from single statement block.
>> * Fix typo in commit message.
>> v2:
>> * Do not assume that the buffer is part of an aligned packet buffer.
>>
>> Bugzilla ID: 1035
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
>> ---
>>   lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
>>   1 file changed, 27 insertions(+), 5 deletions(-)
>>
>> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
>> index b502481670..738d643da0 100644
>> --- a/lib/net/rte_ip.h
>> +++ b/lib/net/rte_ip.h
>> @@ -3,6 +3,7 @@
>>    *      The Regents of the University of California.
>>    * Copyright(c) 2010-2014 Intel Corporation.
>>    * Copyright(c) 2014 6WIND S.A.
>> + * Copyright(c) 2022 SmartShare Systems.
>>    * All rights reserved.
>>    */
>>
>> @@ -15,6 +16,7 @@
>>    * IP-related defines
>>    */
>>
>> +#include <stdbool.h>
>>   #include <stdint.h>
>>
>>   #ifdef RTE_EXEC_ENV_WINDOWS
>> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>>   {
>>   	/* extend strict-aliasing rules */
>>   	typedef uint16_t __attribute__((__may_alias__)) u16_p;
>> -	const u16_p *u16_buf = (const u16_p *)buf;
>> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
>> +	const u16_p *u16_buf;
>> +	const u16_p *end;
>> +	uint32_t bsum = 0;
>> +	const bool unaligned = (uintptr_t)buf & 1;
>> +
>> +	/* if buffer is unaligned, keeping it byte order independent */
>> +	if (unlikely(unaligned)) {
>> +		uint16_t first = 0;
>> +		if (unlikely(len == 0))
>> +			return 0;
>> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
>> +		bsum += first;
>> +		buf = RTE_PTR_ADD(buf, 1);
>> +		len--;
>> +	}
>>
>> +	/* aligned access for compiler auto-vectorization */

The compiler will be able to auto vectorize even unaligned accesses, 
just with different instructions. From what I can tell, there's no 
performance impact, at least not on the x86_64 systems I tried on.

I think you should remove the first special case conditional and use 
memcpy() instead of the cumbersome __may_alias__ construct to retrieve 
the data.
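
A minimal sketch of what the memcpy() suggestion could look like. The function name raw_cksum_memcpy is hypothetical and the code is an illustration written for this discussion, not taken from DPDK:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical memcpy() based variant: every 16 bit word is loaded with
 * memcpy(), so buf may have any alignment and neither an aliasing
 * attribute nor an odd-address special case is needed.
 */
static inline uint32_t
raw_cksum_memcpy(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;
	const unsigned char *end = p + (len & ~(size_t)1);

	for (; p != end; p += sizeof(uint16_t)) {
		uint16_t v;

		memcpy(&v, p, sizeof(v));
		sum += v;
	}

	/* if length is odd, keeping it byte order independent */
	if (len & 1) {
		uint16_t left = 0;

		memcpy(&left, end, 1);
		sum += left;
	}

	return sum;
}
```

The result is the same for a buffer at an even or an odd address, and the initial sum is passed through unchanged for a zero length buffer. As with the original function, the returned value is still an unreduced 32 bit sum.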

>> +	u16_buf = (const u16_p *)buf;
>> +	end = u16_buf + len / sizeof(*u16_buf);
>>   	for (; u16_buf != end; ++u16_buf)
>> -		sum += *u16_buf;
>> +		bsum += *u16_buf;
>>
>>   	/* if length is odd, keeping it byte order independent */
>>   	if (unlikely(len % 2)) {
>>   		uint16_t left = 0;
>>   		*(unsigned char *)&left = *(const unsigned char *)end;
>> -		sum += left;
>> +		bsum += left;
>>   	}
>>
>> -	return sum;
>> +	/* if buffer is unaligned, swap the checksum bytes */
>> +	if (unlikely(unaligned))
>> +		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
>> +
>> +	return sum + bsum;
>>   }
>>
>>   /**
>> --
>> 2.17.1
> 
> @Emil, thank you for thoroughly reviewing the previous versions.
> 
> If your test succeeds and you are satisfied with the patch, remember to reply with a "Tested-by" tag for patchwork.
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-27 12:28                 ` Mattias Rönnblom
@ 2022-06-27 12:46                   ` Emil Berg
  2022-06-27 12:50                     ` Emil Berg
  0 siblings, 1 reply; 74+ messages in thread
From: Emil Berg @ 2022-06-27 12:46 UTC (permalink / raw)
  To: Mattias Rönnblom, Morten Brørup, bruce.richardson, dev
  Cc: stephen, stable, bugzilla, olivier.matz



> -----Original Message-----
> From: Mattias Rönnblom <hofors@lysator.liu.se>
> Sent: den 27 juni 2022 14:28
> To: Morten Brørup <mb@smartsharesystems.com>; Emil Berg
> <emil.berg@ericsson.com>; bruce.richardson@intel.com; dev@dpdk.org
> Cc: stephen@networkplumber.org; stable@dpdk.org; bugzilla@dpdk.org;
> olivier.matz@6wind.com
> Subject: Re: [PATCH v4] net: fix checksum with unaligned buffer
> 
> On 2022-06-23 14:51, Morten Brørup wrote:
> >> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> >> Sent: Thursday, 23 June 2022 14.39
> >>
> >> With this patch, the checksum can be calculated on an unaligned buffer.
> >> I.e. the buf parameter is no longer required to be 16 bit aligned.
> >>
> >> The checksum is still calculated using a 16 bit aligned pointer, so
> >> the compiler can auto-vectorize the function's inner loop.
> >>
> >> When the buffer is unaligned, the first byte of the buffer is handled
> >> separately. Furthermore, the calculated checksum of the buffer is
> >> byte shifted before being added to the initial checksum, to
> >> compensate for the checksum having been calculated on the buffer
> >> shifted by one byte.
> >>
> >> v4:
> >> * Add copyright notice.
> >> * Include stdbool.h (Emil Berg).
> >> * Use RTE_PTR_ADD (Emil Berg).
> >> * Fix one more typo in commit message. Is 'unligned' even a word?
> >> v3:
> >> * Remove braces from single statement block.
> >> * Fix typo in commit message.
> >> v2:
> >> * Do not assume that the buffer is part of an aligned packet buffer.
> >>
> >> Bugzilla ID: 1035
> >> Cc: stable@dpdk.org
> >>
> >> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> >> ---
> >>   lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
> >>   1 file changed, 27 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> >> index b502481670..738d643da0 100644
> >> --- a/lib/net/rte_ip.h
> >> +++ b/lib/net/rte_ip.h
> >> @@ -3,6 +3,7 @@
> >>    *      The Regents of the University of California.
> >>    * Copyright(c) 2010-2014 Intel Corporation.
> >>    * Copyright(c) 2014 6WIND S.A.
> >> + * Copyright(c) 2022 SmartShare Systems.
> >>    * All rights reserved.
> >>    */
> >>
> >> @@ -15,6 +16,7 @@
> >>    * IP-related defines
> >>    */
> >>
> >> +#include <stdbool.h>
> >>   #include <stdint.h>
> >>
> >>   #ifdef RTE_EXEC_ENV_WINDOWS
> >> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> >>   {
> >>   	/* extend strict-aliasing rules */
> >>   	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> >> -	const u16_p *u16_buf = (const u16_p *)buf;
> >> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> >> +	const u16_p *u16_buf;
> >> +	const u16_p *end;
> >> +	uint32_t bsum = 0;
> >> +	const bool unaligned = (uintptr_t)buf & 1;
> >> +
> >> +	/* if buffer is unaligned, keeping it byte order independent */
> >> +	if (unlikely(unaligned)) {
> >> +		uint16_t first = 0;
> >> +		if (unlikely(len == 0))
> >> +			return 0;
> >> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> >> +		bsum += first;
> >> +		buf = RTE_PTR_ADD(buf, 1);
> >> +		len--;
> >> +	}
> >>
> >> +	/* aligned access for compiler auto-vectorization */
> 
> The compiler will be able to auto vectorize even unaligned accesses, just with
> different instructions. From what I can tell, there's no performance impact, at
> least not on the x86_64 systems I tried on.
> 
> I think you should remove the first special case conditional and use
> memcpy() instead of the cumbersome __may_alias__ construct to retrieve
> the data.
> 

Here:
https://www.agner.org/optimize/instruction_tables.pdf
it lists the latency of vmovdqa (aligned) as 6 cycles and the latency for vmovdqu (unaligned) as 7 cycles. So I guess there can be some difference.
Although in practice I'm not sure what difference it makes. I've not seen any difference in runtime between the two versions.

> >> +	u16_buf = (const u16_p *)buf;
> >> +	end = u16_buf + len / sizeof(*u16_buf);
> >>   	for (; u16_buf != end; ++u16_buf)
> >> -		sum += *u16_buf;
> >> +		bsum += *u16_buf;
> >>
> >>   	/* if length is odd, keeping it byte order independent */
> >>   	if (unlikely(len % 2)) {
> >>   		uint16_t left = 0;
> >>   		*(unsigned char *)&left = *(const unsigned char *)end;
> >> -		sum += left;
> >> +		bsum += left;
> >>   	}
> >>
> >> -	return sum;
> >> +	/* if buffer is unaligned, swap the checksum bytes */
> >> +	if (unlikely(unaligned))
> >> +		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
> >> +
> >> +	return sum + bsum;
> >>   }
> >>
> >>   /**
> >> --
> >> 2.17.1
> >
> > @Emil, thank you for thoroughly reviewing the previous versions.
> >
> > If your test succeeds and you are satisfied with the patch, remember to
> > reply with a "Tested-by" tag for patchwork.
> >

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-27 12:46                   ` Emil Berg
@ 2022-06-27 12:50                     ` Emil Berg
  2022-06-27 13:22                       ` Morten Brørup
  0 siblings, 1 reply; 74+ messages in thread
From: Emil Berg @ 2022-06-27 12:50 UTC (permalink / raw)
  To: Mattias Rönnblom, Morten Brørup, bruce.richardson, dev
  Cc: stephen, stable, bugzilla, olivier.matz



> -----Original Message-----
> From: Emil Berg
> Sent: den 27 juni 2022 14:46
> To: Mattias Rönnblom <hofors@lysator.liu.se>; Morten Brørup
> <mb@smartsharesystems.com>; bruce.richardson@intel.com;
> dev@dpdk.org
> Cc: stephen@networkplumber.org; stable@dpdk.org; bugzilla@dpdk.org;
> olivier.matz@6wind.com
> Subject: RE: [PATCH v4] net: fix checksum with unaligned buffer
> 
> 
> 
> > -----Original Message-----
> > From: Mattias Rönnblom <hofors@lysator.liu.se>
> > Sent: den 27 juni 2022 14:28
> > To: Morten Brørup <mb@smartsharesystems.com>; Emil Berg
> > <emil.berg@ericsson.com>; bruce.richardson@intel.com; dev@dpdk.org
> > Cc: stephen@networkplumber.org; stable@dpdk.org; bugzilla@dpdk.org;
> > olivier.matz@6wind.com
> > Subject: Re: [PATCH v4] net: fix checksum with unaligned buffer
> >
> > On 2022-06-23 14:51, Morten Brørup wrote:
> > >> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > >> Sent: Thursday, 23 June 2022 14.39
> > >>
> > >> With this patch, the checksum can be calculated on an unaligned buffer.
> > >> I.e. the buf parameter is no longer required to be 16 bit aligned.
> > >>
> > >> The checksum is still calculated using a 16 bit aligned pointer, so
> > >> the compiler can auto-vectorize the function's inner loop.
> > >>
> > >> When the buffer is unaligned, the first byte of the buffer is
> > >> handled separately. Furthermore, the calculated checksum of the
> > >> buffer is byte shifted before being added to the initial checksum,
> > >> to compensate for the checksum having been calculated on the buffer
> > >> shifted by one byte.
> > >>
> > >> v4:
> > >> * Add copyright notice.
> > >> * Include stdbool.h (Emil Berg).
> > >> * Use RTE_PTR_ADD (Emil Berg).
> > >> * Fix one more typo in commit message. Is 'unligned' even a word?
> > >> v3:
> > >> * Remove braces from single statement block.
> > >> * Fix typo in commit message.
> > >> v2:
> > >> * Do not assume that the buffer is part of an aligned packet buffer.
> > >>
> > >> Bugzilla ID: 1035
> > >> Cc: stable@dpdk.org
> > >>
> > >> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > >> ---
> > >>   lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
> > >>   1 file changed, 27 insertions(+), 5 deletions(-)
> > >>
> > >> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > >> b502481670..738d643da0 100644
> > >> --- a/lib/net/rte_ip.h
> > >> +++ b/lib/net/rte_ip.h
> > >> @@ -3,6 +3,7 @@
> > >>    *      The Regents of the University of California.
> > >>    * Copyright(c) 2010-2014 Intel Corporation.
> > >>    * Copyright(c) 2014 6WIND S.A.
> > >> + * Copyright(c) 2022 SmartShare Systems.
> > >>    * All rights reserved.
> > >>    */
> > >>
> > >> @@ -15,6 +16,7 @@
> > >>    * IP-related defines
> > >>    */
> > >>
> > >> +#include <stdbool.h>
> > >>   #include <stdint.h>
> > >>
> > >>   #ifdef RTE_EXEC_ENV_WINDOWS
> > >> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t len,
> > >> uint32_t sum)
> > >>   {
> > >>   	/* extend strict-aliasing rules */
> > >>   	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > >> -	const u16_p *u16_buf = (const u16_p *)buf;
> > >> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > >> +	const u16_p *u16_buf;
> > >> +	const u16_p *end;
> > >> +	uint32_t bsum = 0;
> > >> +	const bool unaligned = (uintptr_t)buf & 1;
> > >> +
> > >> +	/* if buffer is unaligned, keeping it byte order independent */
> > >> +	if (unlikely(unaligned)) {
> > >> +		uint16_t first = 0;
> > >> +		if (unlikely(len == 0))
> > >> +			return 0;
> > >> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;
> > >> +		bsum += first;
> > >> +		buf = RTE_PTR_ADD(buf, 1);
> > >> +		len--;
> > >> +	}
> > >>
> > >> +	/* aligned access for compiler auto-vectorization */
> >
> > The compiler will be able to auto vectorize even unaligned accesses,
> > just with different instructions. From what I can tell, there's no
> > performance impact, at least not on the x86_64 systems I tried on.
> >
> > I think you should remove the first special case conditional and use
> > memcpy() instead of the cumbersome __may_alias__ construct to retrieve
> > the data.
> >
> 
> Here:
> https://www.agner.org/optimize/instruction_tables.pdf
> it lists the latency of vmovdqa (aligned) as 6 cycles and the latency for
> vmovdqu (unaligned) as 7 cycles. So I guess there can be some difference.
> Although in practice I'm not sure what difference it makes. I've not seen any
> difference in runtime between the two versions.
> 

Correction to my comment:
Those figures are for an older CPU. For some newer CPUs, such as Tiger Lake, the figures seem to be the same for aligned and unaligned loads.

> > >> +	u16_buf = (const u16_p *)buf;
> > >> +	end = u16_buf + len / sizeof(*u16_buf);
> > >>   	for (; u16_buf != end; ++u16_buf)
> > >> -		sum += *u16_buf;
> > >> +		bsum += *u16_buf;
> > >>
> > >>   	/* if length is odd, keeping it byte order independent */
> > >>   	if (unlikely(len % 2)) {
> > >>   		uint16_t left = 0;
> > >>   		*(unsigned char *)&left = *(const unsigned char *)end;
> > >> -		sum += left;
> > >> +		bsum += left;
> > >>   	}
> > >>
> > >> -	return sum;
> > >> +	/* if buffer is unaligned, swap the checksum bytes */
> > >> +	if (unlikely(unaligned))
> > >> +		bsum = (bsum & 0xFF00FF00) >> 8 | (bsum & 0x00FF00FF) << 8;
> > >> +
> > >> +	return sum + bsum;
> > >>   }
> > >>
> > >>   /**
> > >> --
> > >> 2.17.1
> > >
> > > @Emil, thank you for thoroughly reviewing the previous versions.
> > >
> > > If your test succeeds and you are satisfied with the patch, remember
> > > to reply with a "Tested-by" tag for patchwork.
> > >

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-27 12:50                     ` Emil Berg
@ 2022-06-27 13:22                       ` Morten Brørup
  2022-06-27 17:22                         ` Mattias Rönnblom
  0 siblings, 1 reply; 74+ messages in thread
From: Morten Brørup @ 2022-06-27 13:22 UTC (permalink / raw)
  To: Emil Berg, Mattias Rönnblom, bruce.richardson, dev
  Cc: stephen, stable, bugzilla, olivier.matz

> From: Emil Berg [mailto:emil.berg@ericsson.com]
> Sent: Monday, 27 June 2022 14.51
> 
> > From: Emil Berg
> > Sent: den 27 juni 2022 14:46
> >
> > > From: Mattias Rönnblom <hofors@lysator.liu.se>
> > > Sent: den 27 juni 2022 14:28
> > >
> > > On 2022-06-23 14:51, Morten Brørup wrote:
> > > >> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > > >> Sent: Thursday, 23 June 2022 14.39
> > > >>
> > > >> With this patch, the checksum can be calculated on an unaligned
> buffer.
> > > >> I.e. the buf parameter is no longer required to be 16 bit
> aligned.
> > > >>
> > > >> The checksum is still calculated using a 16 bit aligned pointer,
> so
> > > >> the compiler can auto-vectorize the function's inner loop.
> > > >>
> > > >> When the buffer is unaligned, the first byte of the buffer is
> > > >> handled separately. Furthermore, the calculated checksum of the
> > > >> buffer is byte shifted before being added to the initial
> checksum,
> > > >> to compensate for the checksum having been calculated on the
> buffer
> > > >> shifted by one byte.
> > > >>
> > > >> v4:
> > > >> * Add copyright notice.
> > > >> * Include stdbool.h (Emil Berg).
> > > >> * Use RTE_PTR_ADD (Emil Berg).
> > > >> * Fix one more typo in commit message. Is 'unligned' even a
> word?
> > > >> v3:
> > > >> * Remove braces from single statement block.
> > > >> * Fix typo in commit message.
> > > >> v2:
> > > >> * Do not assume that the buffer is part of an aligned packet
> buffer.
> > > >>
> > > >> Bugzilla ID: 1035
> > > >> Cc: stable@dpdk.org
> > > >>
> > > >> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > > >> ---
> > > >>   lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
> > > >>   1 file changed, 27 insertions(+), 5 deletions(-)
> > > >>
> > > >> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
> > > >> b502481670..738d643da0 100644
> > > >> --- a/lib/net/rte_ip.h
> > > >> +++ b/lib/net/rte_ip.h
> > > >> @@ -3,6 +3,7 @@
> > > >>    *      The Regents of the University of California.
> > > >>    * Copyright(c) 2010-2014 Intel Corporation.
> > > >>    * Copyright(c) 2014 6WIND S.A.
> > > >> + * Copyright(c) 2022 SmartShare Systems.
> > > >>    * All rights reserved.
> > > >>    */
> > > >>
> > > >> @@ -15,6 +16,7 @@
> > > >>    * IP-related defines
> > > >>    */
> > > >>
> > > >> +#include <stdbool.h>
> > > >>   #include <stdint.h>
> > > >>
> > > >>   #ifdef RTE_EXEC_ENV_WINDOWS
> > > >> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t
> len,
> > > >> uint32_t sum)
> > > >>   {
> > > >>   	/* extend strict-aliasing rules */
> > > >>   	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > > >> -	const u16_p *u16_buf = (const u16_p *)buf;
> > > >> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > > >> +	const u16_p *u16_buf;
> > > >> +	const u16_p *end;
> > > >> +	uint32_t bsum = 0;
> > > >> +	const bool unaligned = (uintptr_t)buf & 1;
> > > >> +
> > > >> +	/* if buffer is unaligned, keeping it byte order
> independent */
> > > >> +	if (unlikely(unaligned)) {
> > > >> +		uint16_t first = 0;
> > > >> +		if (unlikely(len == 0))
> > > >> +			return 0;
> > > >> +		((unsigned char *)&first)[1] = *(const unsigned
> > > char *)buf;
> > > >> +		bsum += first;
> > > >> +		buf = RTE_PTR_ADD(buf, 1);
> > > >> +		len--;
> > > >> +	}
> > > >>
> > > >> +	/* aligned access for compiler auto-vectorization */
> > >
> > > The compiler will be able to auto vectorize even unaligned
> accesses,
> > > just with different instructions. From what I can tell, there's no
> > > performance impact, at least not on the x86_64 systems I tried on.
> > >
> > > I think you should remove the first special case conditional and
> use
> > > memcpy() instead of the cumbersome __may_alias__ construct to
> retrieve
> > > the data.
> > >
> >
> > Here:
> > https://www.agner.org/optimize/instruction_tables.pdf
> > it lists the latency of vmovdqa (aligned) as 6 cycles and the latency
> for
> > vmovdqu (unaligned) as 7 cycles. So I guess there can be some
> difference.
> > Although in practice I'm not sure what difference it makes. I've not
> seen any
> > difference in runtime between the two versions.
> >
> 
> Correction to my comment:
> Those stats are for some older CPU. For some newer CPUs such as Tiger
> Lake the stats seem to be the same regardless of aligned or unaligned.
> 

I agree that the memcpy method is more elegant and easier to read.

However, we would need to performance test the modified checksum function with a large number of CPUs to prove that we don't introduce a performance regression on any CPU architecture still supported by DPDK. And Emil already found a CPU where it costs 1 extra cycle per 16 bytes, which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP packet.

So I opted for a solution with zero changes to the inner loop, so no performance retesting is required (for the previously supported use cases, where the buffer is aligned).
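
For readers following along, here is a minimal standalone sketch of the peel-and-swap idea, not the patch itself. The names cksum_ref(), cksum_peel() and fold16() are hypothetical and exist only for this illustration. The sketch loads words with memcpy() for portability, whereas the patch reads through an aligned __may_alias__ pointer, and it returns the incoming sum for an empty unaligned buffer, whereas the quoted v4 code returns 0 there. After folding to 16 bits, both functions should agree for any non-empty buffer, whatever its alignment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference: sum native-endian 16-bit words, loading each via memcpy(). */
static uint32_t
cksum_ref(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		uint16_t v;

		memcpy(&v, p + i, sizeof(v));
		sum += v;
	}
	if (len & 1) {
		uint16_t left = 0;

		/* odd trailing byte goes into the first byte of a word */
		memcpy(&left, p + len - 1, 1);
		sum += left;
	}
	return sum;
}

/* Sketch of the patch's idea: peel the first byte, sum, then swap bytes. */
static uint32_t
cksum_peel(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;
	const int unaligned = (int)((uintptr_t)buf & 1);
	uint32_t bsum = 0;
	size_t i;

	if (unaligned) {
		uint16_t first = 0;

		if (len == 0)
			return sum; /* assumption: keep the incoming sum */
		/* the first byte lands in the second byte slot of a word */
		((unsigned char *)&first)[1] = p[0];
		bsum += first;
		p++;
		len--;
	}

	/* p is now 2-byte aligned; this loop matches the aligned case */
	for (i = 0; i + 1 < len; i += 2) {
		uint16_t v;

		memcpy(&v, p + i, sizeof(v));
		bsum += v;
	}
	if (len & 1) {
		uint16_t left = 0;

		memcpy(&left, p + len - 1, 1);
		bsum += left;
	}

	/* every byte sat in the wrong slot, so swap the two slots back */
	if (unaligned)
		bsum = (bsum & 0xFF00FF00U) >> 8 | (bsum & 0x00FF00FFU) << 8;

	return sum + bsum;
}

/* Fold a 32-bit partial sum into 16 bits (one's complement addition). */
static uint16_t
fold16(uint32_t sum)
{
	sum = (sum & 0xFFFF) + (sum >> 16);
	sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t)sum;
}
```

The swap works because exchanging the two bytes of a 16-bit word corresponds to multiplying by 256 in one's complement arithmetic, so it can be applied once to the accumulated sum instead of to every word. The raw 32-bit values of the two functions may differ; only the folded results are expected to match.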

I have previously submitted a couple of patches to fix some minor bugs in the mempool cache functions [1] and [2], while also refactoring the functions for readability. But after having incorporated various feedback, nobody wants to proceed with the patches, probably due to fear of performance regressions. I didn't want to risk the same with this patch.

[1] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D8712B@smartserver.smartshare.dk/
[2] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D86FBB@smartserver.smartshare.dk/


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-27 13:22                       ` Morten Brørup
@ 2022-06-27 17:22                         ` Mattias Rönnblom
  2022-06-27 20:21                           ` Morten Brørup
  0 siblings, 1 reply; 74+ messages in thread
From: Mattias Rönnblom @ 2022-06-27 17:22 UTC (permalink / raw)
  To: Morten Brørup, Emil Berg, bruce.richardson, dev
  Cc: stephen, stable, bugzilla, olivier.matz

On 2022-06-27 15:22, Morten Brørup wrote:
>> From: Emil Berg [mailto:emil.berg@ericsson.com]
>> Sent: Monday, 27 June 2022 14.51
>>
>>> From: Emil Berg
>>> Sent: den 27 juni 2022 14:46
>>>
>>>> From: Mattias Rönnblom <hofors@lysator.liu.se>
>>>> Sent: den 27 juni 2022 14:28
>>>>
>>>> On 2022-06-23 14:51, Morten Brørup wrote:
>>>>>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
>>>>>> Sent: Thursday, 23 June 2022 14.39
>>>>>>
>>>>>> With this patch, the checksum can be calculated on an unaligned
>> buffer.
>>>>>> I.e. the buf parameter is no longer required to be 16 bit
>> aligned.
>>>>>>
>>>>>> The checksum is still calculated using a 16 bit aligned pointer,
>> so
>>>>>> the compiler can auto-vectorize the function's inner loop.
>>>>>>
>>>>>> When the buffer is unaligned, the first byte of the buffer is
>>>>>> handled separately. Furthermore, the calculated checksum of the
>>>>>> buffer is byte shifted before being added to the initial
>> checksum,
>>>>>> to compensate for the checksum having been calculated on the
>> buffer
>>>>>> shifted by one byte.
>>>>>>
>>>>>> v4:
>>>>>> * Add copyright notice.
>>>>>> * Include stdbool.h (Emil Berg).
>>>>>> * Use RTE_PTR_ADD (Emil Berg).
>>>>>> * Fix one more typo in commit message. Is 'unligned' even a
>> word?
>>>>>> v3:
>>>>>> * Remove braces from single statement block.
>>>>>> * Fix typo in commit message.
>>>>>> v2:
>>>>>> * Do not assume that the buffer is part of an aligned packet
>> buffer.
>>>>>>
>>>>>> Bugzilla ID: 1035
>>>>>> Cc: stable@dpdk.org
>>>>>>
>>>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
>>>>>> ---
>>>>>>    lib/net/rte_ip.h | 32 +++++++++++++++++++++++++++-----
>>>>>>    1 file changed, 27 insertions(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h index
>>>>>> b502481670..738d643da0 100644
>>>>>> --- a/lib/net/rte_ip.h
>>>>>> +++ b/lib/net/rte_ip.h
>>>>>> @@ -3,6 +3,7 @@
>>>>>>     *      The Regents of the University of California.
>>>>>>     * Copyright(c) 2010-2014 Intel Corporation.
>>>>>>     * Copyright(c) 2014 6WIND S.A.
>>>>>> + * Copyright(c) 2022 SmartShare Systems.
>>>>>>     * All rights reserved.
>>>>>>     */
>>>>>>
>>>>>> @@ -15,6 +16,7 @@
>>>>>>     * IP-related defines
>>>>>>     */
>>>>>>
>>>>>> +#include <stdbool.h>
>>>>>>    #include <stdint.h>
>>>>>>
>>>>>>    #ifdef RTE_EXEC_ENV_WINDOWS
>>>>>> @@ -162,20 +164,40 @@ __rte_raw_cksum(const void *buf, size_t
>> len,
>>>>>> uint32_t sum)
>>>>>>    {
>>>>>>    	/* extend strict-aliasing rules */
>>>>>>    	typedef uint16_t __attribute__((__may_alias__)) u16_p;
>>>>>> -	const u16_p *u16_buf = (const u16_p *)buf;
>>>>>> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
>>>>>> +	const u16_p *u16_buf;
>>>>>> +	const u16_p *end;
>>>>>> +	uint32_t bsum = 0;
>>>>>> +	const bool unaligned = (uintptr_t)buf & 1;
>>>>>> +
>>>>>> +	/* if buffer is unaligned, keeping it byte order
>> independent */
>>>>>> +	if (unlikely(unaligned)) {
>>>>>> +		uint16_t first = 0;
>>>>>> +		if (unlikely(len == 0))
>>>>>> +			return 0;
>>>>>> +		((unsigned char *)&first)[1] = *(const unsigned
>>>> char *)buf;
>>>>>> +		bsum += first;
>>>>>> +		buf = RTE_PTR_ADD(buf, 1);
>>>>>> +		len--;
>>>>>> +	}
>>>>>>
>>>>>> +	/* aligned access for compiler auto-vectorization */
>>>>
>>>> The compiler will be able to auto vectorize even unaligned
>> accesses,
>>>> just with different instructions. From what I can tell, there's no
>>>> performance impact, at least not on the x86_64 systems I tried on.
>>>>
>>>> I think you should remove the first special case conditional and
>> use
>>>> memcpy() instead of the cumbersome __may_alias__ construct to
>> retrieve
>>>> the data.
>>>>
>>>
>>> Here:
>>> https://www.agner.org/optimize/instruction_tables.pdf
>>> it lists the latency of vmovdqa (aligned) as 6 cycles and the latency
>> for
>>> vmovdqu (unaligned) as 7 cycles. So I guess there can be some
>> difference.
>>> Although in practice I'm not sure what difference it makes. I've not
>> seen any
>>> difference in runtime between the two versions.
>>>
>>
>> Correction to my comment:
>> Those stats are for some older CPU. For some newer CPUs such as Tiger
>> Lake the stats seem to be the same regardless of aligned or unaligned.
>>
> 
> I agree that the memcpy method is more elegant and easy to read.
> 
> However, we would need to performance test the modified checksum function with a large number of CPUs to prove that we don't introduce a performance regression on any CPU architecture still supported by DPDK. And Emil already found a CPU where it costs 1 extra cycle per 16 bytes, which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP packet.
> 

I think you've misunderstood what latency means in such tables. It's a
data dependency thing, not a measure of throughput. The throughput is
*much* higher. My guess would be two such instructions per clock.

For your 1460-byte example, my AMD Zen 3 performs identically with
the current DPDK implementation, your patch, and a memcpy()-ified
version of the current implementation. They all need ~130 clock
cycles/packet, with warm caches. IPC is 3 instructions per cycle, but
obviously not all instructions are SIMD.

The main issue with checksumming on the CPU is, in my experience, not 
that you don't have enough compute, but that you trash the caches.

> So I opted for a solution with zero changes to the inner loop, so no performance retesting is required (for the previously supported use cases, where the buffer is aligned).
> 

You will see performance degradation with this solution as well, under
certain conditions. For 100 bytes of unaligned data, the current DPDK
implementation and the memcpy()-ified version need ~21 cc/packet. Your
patch needs 54 cc/packet.

But the old version didn't support unaligned accesses? In many compiler 
flag/machine combinations it did.

> I have previously submitted a couple of patches to fix some minor bugs in the mempool cache functions [1] and [2], while also refactoring the functions for readability. But after having incorporated various feedback, nobody wants to proceed with the patches, probably due to fear of performance regressions. I didn't want to risk the same with this patch.
> 
> [1] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D8712B@smartserver.smartshare.dk/
> [2] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D86FBB@smartserver.smartshare.dk/
> 

^ permalink raw reply	[flat|nested] 74+ messages in thread

* RE: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-27 17:22                         ` Mattias Rönnblom
@ 2022-06-27 20:21                           ` Morten Brørup
  2022-06-28  6:28                             ` Mattias Rönnblom
  2022-07-07 18:34                             ` [PATCH 1/2] app/test: add cksum performance test Mattias Rönnblom
  0 siblings, 2 replies; 74+ messages in thread
From: Morten Brørup @ 2022-06-27 20:21 UTC (permalink / raw)
  To: Mattias Rönnblom, Emil Berg, bruce.richardson, dev
  Cc: stephen, stable, bugzilla, olivier.matz

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 27 June 2022 19.23
> 
> On 2022-06-27 15:22, Morten Brørup wrote:
> >> From: Emil Berg [mailto:emil.berg@ericsson.com]
> >> Sent: Monday, 27 June 2022 14.51
> >>
> >>> From: Emil Berg
> >>> Sent: den 27 juni 2022 14:46
> >>>
> >>>> From: Mattias Rönnblom <hofors@lysator.liu.se>
> >>>> Sent: den 27 juni 2022 14:28
> >>>>
> >>>> On 2022-06-23 14:51, Morten Brørup wrote:
> >>>>>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> >>>>>> Sent: Thursday, 23 June 2022 14.39
> >>>>>>
> >>>>>> With this patch, the checksum can be calculated on an unaligned
> >> buffer.
> >>>>>> I.e. the buf parameter is no longer required to be 16 bit
> >> aligned.
> >>>>>>
> >>>>>> The checksum is still calculated using a 16 bit aligned pointer,
> >> so
> >>>>>> the compiler can auto-vectorize the function's inner loop.
> >>>>>>
> >>>>>> When the buffer is unaligned, the first byte of the buffer is
> >>>>>> handled separately. Furthermore, the calculated checksum of the
> >>>>>> buffer is byte shifted before being added to the initial
> >> checksum,
> >>>>>> to compensate for the checksum having been calculated on the
> >> buffer
> >>>>>> shifted by one byte.
> >>>>>>
> >>>>>> v4:
> >>>>>> * Add copyright notice.
> >>>>>> * Include stdbool.h (Emil Berg).
> >>>>>> * Use RTE_PTR_ADD (Emil Berg).
> >>>>>> * Fix one more typo in commit message. Is 'unligned' even a
> >> word?
> >>>>>> v3:
> >>>>>> * Remove braces from single statement block.
> >>>>>> * Fix typo in commit message.
> >>>>>> v2:
> >>>>>> * Do not assume that the buffer is part of an aligned packet
> >> buffer.
> >>>>>>
> >>>>>> Bugzilla ID: 1035
> >>>>>> Cc: stable@dpdk.org
> >>>>>>
> >>>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>

[...]

> >>>>
> >>>> The compiler will be able to auto vectorize even unaligned
> >> accesses,
> >>>> just with different instructions. From what I can tell, there's no
> >>>> performance impact, at least not on the x86_64 systems I tried on.
> >>>>
> >>>> I think you should remove the first special case conditional and
> >> use
> >>>> memcpy() instead of the cumbersome __may_alias__ construct to
> >> retrieve
> >>>> the data.
> >>>>
> >>>
> >>> Here:
> >>> https://www.agner.org/optimize/instruction_tables.pdf
> >>> it lists the latency of vmovdqa (aligned) as 6 cycles and the
> latency
> >> for
> >>> vmovdqu (unaligned) as 7 cycles. So I guess there can be some
> >> difference.
> >>> Although in practice I'm not sure what difference it makes. I've
> not
> >> seen any
> >>> difference in runtime between the two versions.
> >>>
> >>
> >> Correction to my comment:
> >> Those stats are for some older CPU. For some newer CPUs such as
> Tiger
> >> Lake the stats seem to be the same regardless of aligned or
> unaligned.
> >>
> >
> > I agree that the memcpy method is more elegant and easy to read.
> >
> > However, we would need to performance test the modified checksum
> function with a large number of CPUs to prove that we don't introduce a
> performance regression on any CPU architecture still supported by DPDK.
> And Emil already found a CPU where it costs 1 extra cycle per 16 bytes,
> which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP
> packet.
> >
> 
> I think you've misunderstood what latency means in such tables. It's a
> data dependency thing, not a measure of throughput. The throughput is
> *much* higher. My guess would be two such instruction per clock.
> 
> For your 1460 bytes example, my Zen3 AMD needs performs identical with
> both the current DPDK implementation, your patch, and a memcpy()-ified
> version of the current implementation. They all need ~130 clock
> cycles/packet, with warm caches. IPC is 3 instructions per cycle, but
> obvious not all instructions are SIMD.

You're right, I didn't think it through before extrapolating.

Great to see some real numbers! I wish someone would do the same testing on an old ARM CPU, so we could also see the other end of the scale.

> The main issue with checksumming on the CPU is, in my experience, not
> that you don't have enough compute, but that you trash the caches.

Agree. I have noticed that x86 has "non-temporal" instruction variants to load/store data without trashing the cache entirely.

A variant of the checksum function using such instructions might be handy.

Variants of the memcpy function using such instructions might also be handy for some purposes, e.g. copying the contents of packets, where the original and/or the copy will not be accessed shortly thereafter.

> > So I opted for a solution with zero changes to the inner loop, so no
> performance retesting is required (for the previously supported use
> cases, where the buffer is aligned).
> >
> 
> You will see performance degradation with this solution as well, under
> certain conditions. For unaligned 100 bytes of data, the current DPDK
> implementation and the memcpy()-fied version needs ~21 cc/packet. Your
> patch needs 54 cc/packet.

Yes, it's a tradeoff. I exclusively aimed at maintaining performance for the case with aligned buffers (under all circumstances, with all CPUs etc.), and ignored how it affects the performance for the case with unaligned buffers.

Unlike this patch, the memcpy() variant has no additional branches for the unaligned case, so its performance should be generally unaffected by the buffer being aligned or not. However, I don't have sufficient in-depth CPU knowledge to say if this also applies to RISCV and older ARM CPUs still supported by DPDK.

I don't have strong feelings for this patch, so you can provide a memcpy() based alternative patch, and we will let the community decide which one they prefer.

> But the old version didn't support unaligned accesses? In many compiler
> flag/machine combinations it did.

It crashed in some cases. That was the point of the bug report [1], and the reason to provide a patch.

[1] https://bugs.dpdk.org/show_bug.cgi?id=1035


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-27 20:21                           ` Morten Brørup
@ 2022-06-28  6:28                             ` Mattias Rönnblom
  2022-06-30 16:28                               ` Morten Brørup
  2022-07-07 18:34                             ` [PATCH 1/2] app/test: add cksum performance test Mattias Rönnblom
  1 sibling, 1 reply; 74+ messages in thread
From: Mattias Rönnblom @ 2022-06-28  6:28 UTC (permalink / raw)
  To: Morten Brørup, Emil Berg, bruce.richardson, dev
  Cc: stephen, stable, bugzilla, olivier.matz

On 2022-06-27 22:21, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Monday, 27 June 2022 19.23
>>
>> On 2022-06-27 15:22, Morten Brørup wrote:
>>>> From: Emil Berg [mailto:emil.berg@ericsson.com]
>>>> Sent: Monday, 27 June 2022 14.51
>>>>
>>>>> From: Emil Berg
>>>>> Sent: den 27 juni 2022 14:46
>>>>>
>>>>>> From: Mattias Rönnblom <hofors@lysator.liu.se>
>>>>>> Sent: den 27 juni 2022 14:28
>>>>>>
>>>>>> On 2022-06-23 14:51, Morten Brørup wrote:
>>>>>>>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
>>>>>>>> Sent: Thursday, 23 June 2022 14.39
>>>>>>>>
>>>>>>>> With this patch, the checksum can be calculated on an unaligned
>>>> buffer.
>>>>>>>> I.e. the buf parameter is no longer required to be 16 bit
>>>> aligned.
>>>>>>>>
>>>>>>>> The checksum is still calculated using a 16 bit aligned pointer,
>>>> so
>>>>>>>> the compiler can auto-vectorize the function's inner loop.
>>>>>>>>
>>>>>>>> When the buffer is unaligned, the first byte of the buffer is
>>>>>>>> handled separately. Furthermore, the calculated checksum of the
>>>>>>>> buffer is byte shifted before being added to the initial
>>>> checksum,
>>>>>>>> to compensate for the checksum having been calculated on the
>>>> buffer
>>>>>>>> shifted by one byte.
>>>>>>>>
>>>>>>>> v4:
>>>>>>>> * Add copyright notice.
>>>>>>>> * Include stdbool.h (Emil Berg).
>>>>>>>> * Use RTE_PTR_ADD (Emil Berg).
>>>>>>>> * Fix one more typo in commit message. Is 'unligned' even a
>>>> word?
>>>>>>>> v3:
>>>>>>>> * Remove braces from single statement block.
>>>>>>>> * Fix typo in commit message.
>>>>>>>> v2:
>>>>>>>> * Do not assume that the buffer is part of an aligned packet
>>>> buffer.
>>>>>>>>
>>>>>>>> Bugzilla ID: 1035
>>>>>>>> Cc: stable@dpdk.org
>>>>>>>>
>>>>>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> 
> [...]
> 
>>>>>>
>>>>>> The compiler will be able to auto vectorize even unaligned
>>>> accesses,
>>>>>> just with different instructions. From what I can tell, there's no
>>>>>> performance impact, at least not on the x86_64 systems I tried on.
>>>>>>
>>>>>> I think you should remove the first special case conditional and
>>>> use
>>>>>> memcpy() instead of the cumbersome __may_alias__ construct to
>>>> retrieve
>>>>>> the data.
>>>>>>
>>>>>
>>>>> Here:
>>>>> https://www.agner.org/optimize/instruction_tables.pdf
>>>>> it lists the latency of vmovdqa (aligned) as 6 cycles and the
>> latency
>>>> for
>>>>> vmovdqu (unaligned) as 7 cycles. So I guess there can be some
>>>> difference.
>>>>> Although in practice I'm not sure what difference it makes. I've
>> not
>>>> seen any
>>>>> difference in runtime between the two versions.
>>>>>
>>>>
>>>> Correction to my comment:
>>>> Those stats are for some older CPU. For some newer CPUs such as
>> Tiger
>>>> Lake the stats seem to be the same regardless of aligned or
>> unaligned.
>>>>
>>>
>>> I agree that the memcpy method is more elegant and easy to read.
>>>
>>> However, we would need to performance test the modified checksum
>> function with a large number of CPUs to prove that we don't introduce a
>> performance regression on any CPU architecture still supported by DPDK.
>> And Emil already found a CPU where it costs 1 extra cycle per 16 bytes,
>> which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP
>> packet.
>>>
>>
>> I think you've misunderstood what latency means in such tables. It's a
>> data dependency thing, not a measure of throughput. The throughput is
>> *much* higher. My guess would be two such instruction per clock.
>>
>> For your 1460 bytes example, my Zen3 AMD needs performs identical with
>> both the current DPDK implementation, your patch, and a memcpy()-ified
>> version of the current implementation. They all need ~130 clock
>> cycles/packet, with warm caches. IPC is 3 instructions per cycle, but
>> obvious not all instructions are SIMD.
> 
> You're right, I wasn't thinking deeper about it before extrapolating.
> 
> Great to see some real numbers! I wish someone would do the same testing on an old ARM CPU, so we could also see the other end of the scale.
> 

I've run it on an ARM A72. For the aligned 1460 bytes case I got:
Current DPDK ~572 cc. Your patch: ~578 cc. Memcpy-fied: ~573 cc. They
performed about the same for all alignments and sizes I tested.
This platform (or it could be the GCC version) doesn't suffer from the
unaligned performance degradation your patch showed on my AMD machine.

>> The main issue with checksumming on the CPU is, in my experience, not
>> that you don't have enough compute, but that you trash the caches.
> 
> Agree. I have noticed that x86 has "non-temporal" instruction variants to load/store data without trashing the cache entirely.
> 
> A variant of the checksum function using such instructions might be handy.
> 

Yes, although you may need to prefetch the payload for good performance.

> Variants of the memcpy function using such instructions might also be handy for some purposes, e.g. copying the contents of packets, where the original and/or copy will not accessed shortly thereafter.
> 

Indeed, and I think it's been discussed on the list. There's some work to
get it right, since the alignment requirements and the different
memory model used by those SIMD instructions cause trouble for a
generic implementation. (For x86_64.)

>>> So I opted for a solution with zero changes to the inner loop, so no
>> performance retesting is required (for the previously supported use
>> cases, where the buffer is aligned).
>>>
>>
>> You will see performance degradation with this solution as well, under
>> certain conditions. For unaligned 100 bytes of data, the current DPDK
>> implementation and the memcpy()-fied version needs ~21 cc/packet. Your
>> patch needs 54 cc/packet.
> 
> Yes, it's a tradeoff. I exclusively aimed at maintaining performance for the case with aligned buffers (under all circumstances, with all CPUs etc.), and ignored how it affects the performance for the case with unaligned buffers.
> 
> Unlike this patch, the memcpy() variant has no additional branches for the unaligned case, so its performance should be generally unaffected by the buffer being aligned or not. However, I don't have sufficient in-depth CPU knowledge to say if this also applies to RISCV and older ARM CPUs still supported by DPDK.
> 

I don't think avoiding non-catastrophic RISC-V regressions trumps
improving performance on mainstream CPUs and avoiding code quality
regressions.

> I don't have strong feelings for this patch, so you can provide a memcpy() based alternative patch, and we will let the community decide which one they prefer.
> 

This is what my conversion looks like:

static inline uint32_t
__rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
{
	const void *end;

	for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) * sizeof(uint16_t));
	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
		uint16_t v;

		memcpy(&v, buf, sizeof(uint16_t));
		sum += v;
	}

	/* if length is odd, keeping it byte order independent */
	if (unlikely(len % 2)) {
		uint8_t last;
		uint16_t left = 0;

		memcpy(&last, end, 1);
		*(unsigned char *)&left = last;
		sum += left;
	}

	return sum;
}

Maybe the maintainer can have an opinion, before I provide a real patch, 
to save me some work. I would really like to contribute some performance 
autotests in that case. I mean, you could be right and the performance 
could be totally off on some platform.
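
For anyone who wants to try the idea outside of DPDK, here is a minimal
standalone sketch of the same loop. It assumes plain C with no DPDK
headers, so RTE_PTR_ADD and unlikely() are replaced by ordinary pointer
arithmetic, and the name raw_cksum_memcpy is hypothetical (it is not an
existing DPDK function). It is meant as an illustration, not as the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical standalone variant of the conversion above.
 * Every 16-bit word is fetched with memcpy(), so buf may start
 * on any address, odd or even.
 */
static inline uint32_t
raw_cksum_memcpy(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;
	const unsigned char *end =
		p + (len / sizeof(uint16_t)) * sizeof(uint16_t);

	for (; p != end; p += sizeof(uint16_t)) {
		uint16_t v;

		/* no pointer cast, hence no alignment requirement */
		memcpy(&v, p, sizeof(v));
		sum += v;
	}

	/* odd length: the last byte goes into the first byte of a zeroed word */
	if (len % 2) {
		uint16_t left = 0;

		memcpy(&left, end, 1);
		sum += left;
	}

	return sum;
}
```

The same bytes should give the same 32-bit sum whether they start on an
even or an odd address, which is the property the bug report asks for.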

>> But the old version didn't support unaligned accesses? In many compiler
>> flag/machine combinations it did.
> 
> It crashed in some cases. That was the point of the bug report [1], and the reason to provide a patch.
> 
> [1] https://bugs.dpdk.org/show_bug.cgi?id=1035
> 


* RE: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-28  6:28                             ` Mattias Rönnblom
@ 2022-06-30 16:28                               ` Morten Brørup
  2022-07-07 15:21                                 ` Stanisław Kardach
  0 siblings, 1 reply; 74+ messages in thread
From: Morten Brørup @ 2022-06-30 16:28 UTC (permalink / raw)
  To: Mattias Rönnblom, Emil Berg, bruce.richardson, dev
  Cc: stephen, stable, bugzilla, olivier.matz

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Tuesday, 28 June 2022 08.28
> 
> On 2022-06-27 22:21, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Monday, 27 June 2022 19.23
> >>
> >> On 2022-06-27 15:22, Morten Brørup wrote:
> >>>> From: Emil Berg [mailto:emil.berg@ericsson.com]
> >>>> Sent: Monday, 27 June 2022 14.51
> >>>>
> >>>>> From: Emil Berg
> >>>>> Sent: den 27 juni 2022 14:46
> >>>>>
> >>>>>> From: Mattias Rönnblom <hofors@lysator.liu.se>
> >>>>>> Sent: den 27 juni 2022 14:28
> >>>>>>
> >>>>>> On 2022-06-23 14:51, Morten Brørup wrote:
> >>>>>>>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> >>>>>>>> Sent: Thursday, 23 June 2022 14.39
> >>>>>>>>
> >>>>>>>> With this patch, the checksum can be calculated on an
> unaligned
> >>>> buffer.
> >>>>>>>> I.e. the buf parameter is no longer required to be 16 bit
> >>>> aligned.
> >>>>>>>>
> >>>>>>>> The checksum is still calculated using a 16 bit aligned
> pointer,
> >>>> so
> >>>>>>>> the compiler can auto-vectorize the function's inner loop.
> >>>>>>>>
> >>>>>>>> When the buffer is unaligned, the first byte of the buffer is
> >>>>>>>> handled separately. Furthermore, the calculated checksum of
> the
> >>>>>>>> buffer is byte shifted before being added to the initial
> >>>> checksum,
> >>>>>>>> to compensate for the checksum having been calculated on the
> >>>> buffer
> >>>>>>>> shifted by one byte.
> >>>>>>>>
> >>>>>>>> v4:
> >>>>>>>> * Add copyright notice.
> >>>>>>>> * Include stdbool.h (Emil Berg).
> >>>>>>>> * Use RTE_PTR_ADD (Emil Berg).
> >>>>>>>> * Fix one more typo in commit message. Is 'unligned' even a
> >>>> word?
> >>>>>>>> v3:
> >>>>>>>> * Remove braces from single statement block.
> >>>>>>>> * Fix typo in commit message.
> >>>>>>>> v2:
> >>>>>>>> * Do not assume that the buffer is part of an aligned packet
> >>>> buffer.
> >>>>>>>>
> >>>>>>>> Bugzilla ID: 1035
> >>>>>>>> Cc: stable@dpdk.org
> >>>>>>>>
> >>>>>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> >
> > [...]
> >
> >>>>>>
> >>>>>> The compiler will be able to auto vectorize even unaligned
> >>>> accesses,
> >>>>>> just with different instructions. From what I can tell, there's
> no
> >>>>>> performance impact, at least not on the x86_64 systems I tried
> on.
> >>>>>>
> >>>>>> I think you should remove the first special case conditional and
> >>>> use
> >>>>>> memcpy() instead of the cumbersome __may_alias__ construct to
> >>>> retrieve
> >>>>>> the data.
> >>>>>>
> >>>>>
> >>>>> Here:
> >>>>> https://www.agner.org/optimize/instruction_tables.pdf
> >>>>> it lists the latency of vmovdqa (aligned) as 6 cycles and the
> >> latency
> >>>> for
> >>>>> vmovdqu (unaligned) as 7 cycles. So I guess there can be some
> >>>> difference.
> >>>>> Although in practice I'm not sure what difference it makes. I've
> >> not
> >>>> seen any
> >>>>> difference in runtime between the two versions.
> >>>>>
> >>>>
> >>>> Correction to my comment:
> >>>> Those stats are for some older CPU. For some newer CPUs such as
> >> Tiger
> >>>> Lake the stats seem to be the same regardless of aligned or
> >> unaligned.
> >>>>
> >>>
> >>> I agree that the memcpy method is more elegant and easy to read.
> >>>
> >>> However, we would need to performance test the modified checksum
> >> function with a large number of CPUs to prove that we don't
> introduce a
> >> performance regression on any CPU architecture still supported by
> DPDK.
> >> And Emil already found a CPU where it costs 1 extra cycle per 16
> bytes,
> >> which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP
> >> packet.
> >>>
> >>
> >> I think you've misunderstood what latency means in such tables. It's
> a
> >> data dependency thing, not a measure of throughput. The throughput
> is
> >> *much* higher. My guess would be two such instruction per clock.
> >>
> >> For your 1460 bytes example, my Zen3 AMD performs identically
> with
> >> both the current DPDK implementation, your patch, and a memcpy()-
> ified
> >> version of the current implementation. They all need ~130 clock
> >> cycles/packet, with warm caches. IPC is 3 instructions per cycle,
> but
> >> obviously not all instructions are SIMD.
> >
> > You're right, I wasn't thinking deeper about it before extrapolating.
> >
> > Great to see some real numbers! I wish someone would do the same
> testing on an old ARM CPU, so we could also see the other end of the
> scale.
> >
> 
> I've run it on an ARM A72. For the aligned 1460 bytes case I got:
> Current DPDK ~572 cc. Your patch: ~578 cc. Memcpy-fied: ~573 cc. They
> performed about the same for all unaligned/aligned and sizes I tested.
> This platform (or could be GCC version as well) doesn't suffer from the
> unaligned performance degradation your patch showed on my AMD machine.
> 
> >> The main issue with checksumming on the CPU is, in my experience,
> not
> >> that you don't have enough compute, but that you trash the caches.
> >
> > Agree. I have noticed that x86 has "non-temporal" instruction
> variants to load/store data without trashing the cache entirely.
> >
> > A variant of the checksum function using such instructions might be
> handy.
> >
> 
> Yes, although you may need to prefetch the payload for good
> performance.
> 
> > Variants of the memcpy function using such instructions might also be
> handy for some purposes, e.g. copying the contents of packets, where
> the original and/or copy will not be accessed shortly thereafter.
> >
> 
> Indeed and I think it's been discussed on the list. There's some work
> to
> get it right, since alignment requirement and the fact a different
> memory model is used for those SIMD instructions causes trouble for a
> generic implementation. (For x86_64.)

I just posted an RFC [1] for such memcpy() and memset() functions,
so let's see how it pans out.

[1] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87195@smartserver.smartshare.dk/T/#u

> 
> >>> So I opted for a solution with zero changes to the inner loop, so
> no
> >> performance retesting is required (for the previously supported use
> >> cases, where the buffer is aligned).
> >>>
> >>
> >> You will see performance degradation with this solution as well,
> under
> >> certain conditions. For unaligned 100 bytes of data, the current
> DPDK
> >> implementation and the memcpy()-fied version needs ~21 cc/packet.
> Your
> >> patch needs 54 cc/packet.
> >
> > Yes, it's a tradeoff. I exclusively aimed at maintaining performance
> for the case with aligned buffers (under all circumstances, with all
> CPUs etc.), and ignored how it affects the performance for the case
> with unaligned buffers.
> >
> > Unlike this patch, the memcpy() variant has no additional branches
> for the unaligned case, so its performance should be generally
> unaffected by the buffer being aligned or not. However, I don't have
> sufficient in-depth CPU knowledge to say if this also applies to RISCV
> and older ARM CPUs still supported by DPDK.
> >
> 
> I don't think avoiding RISCV non-catastrophic regressions trumps
> improving performance on mainstream CPUs and avoiding code quality
> regressions.
+1

> 
> > I don't have strong feelings for this patch, so you can provide a
> memcpy() based alternative patch, and we will let the community decide
> which one they prefer.
> >
> 
> This is what my conversion looks like:
> 
> static inline uint32_t
> __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> {
> 	const void *end;
> 
> 	for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) *
> sizeof(uint16_t));
> 	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
> 		uint16_t v;
> 
> 		memcpy(&v, buf, sizeof(uint16_t));
> 		sum += v;
> 	}
> 
> 	/* if length is odd, keeping it byte order independent */
> 	if (unlikely(len % 2)) {
> 		uint8_t last;
> 		uint16_t left = 0;
> 
> 		memcpy(&last, end, 1);
> 		*(unsigned char *)&left = last;
> 		sum += left;
> 	}
> 
> 	return sum;
> }
> 
> Maybe the maintainer can have an opinion, before I provide a real
> patch,
> to save me some work. I would really like to contribute some
> performance
> autotests in that case. I mean, you could be right and the performance
> could be totally off on some platform.

Looking at the meson files, A72 seems to be the weakest type of CPU actively supported by DPDK.
So your testing shows better performance for your solution than mine under all conditions.

If you post a patch, I'll close this one as superseded.

> 
> >> But the old version didn't support unaligned accesses? In many
> compiler
> >> flag/machine combinations it did.
> >
> > It crashed in some cases. That was the point of the bug report [1],
> and the reason to provide a patch.
> >
> > [1] https://bugs.dpdk.org/show_bug.cgi?id=1035
> >



* Re: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-23 12:39             ` [PATCH v4] " Morten Brørup
  2022-06-23 12:51               ` Morten Brørup
@ 2022-06-30 17:41               ` Stephen Hemminger
  2022-06-30 17:45               ` Stephen Hemminger
  2 siblings, 0 replies; 74+ messages in thread
From: Stephen Hemminger @ 2022-06-30 17:41 UTC (permalink / raw)
  To: Morten Brørup
  Cc: emil.berg, bruce.richardson, dev, stable, bugzilla, hofors, olivier.matz

On Thu, 23 Jun 2022 14:39:00 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> index b502481670..738d643da0 100644
> --- a/lib/net/rte_ip.h
> +++ b/lib/net/rte_ip.h
> @@ -3,6 +3,7 @@
>   *      The Regents of the University of California.
>   * Copyright(c) 2010-2014 Intel Corporation.
>   * Copyright(c) 2014 6WIND S.A.
> + * Copyright(c) 2022 SmartShare Systems.
>   * All rights reserved.
>   */

NAK
Doing a small incremental fix should not change copyright.


* Re: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-23 12:39             ` [PATCH v4] " Morten Brørup
  2022-06-23 12:51               ` Morten Brørup
  2022-06-30 17:41               ` [PATCH v4] net: fix checksum with unaligned buffer Stephen Hemminger
@ 2022-06-30 17:45               ` Stephen Hemminger
  2022-07-01  4:11                 ` Emil Berg
  2 siblings, 1 reply; 74+ messages in thread
From: Stephen Hemminger @ 2022-06-30 17:45 UTC (permalink / raw)
  To: Morten Brørup
  Cc: emil.berg, bruce.richardson, dev, stable, bugzilla, hofors, olivier.matz

On Thu, 23 Jun 2022 14:39:00 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> +	/* if buffer is unaligned, keeping it byte order independent */
> +	if (unlikely(unaligned)) {
> +		uint16_t first = 0;
> +		if (unlikely(len == 0))
> +			return 0;

Why is length == 0 unique to unaligned case?

> +		((unsigned char *)&first)[1] = *(const unsigned char *)buf;

Use a proper union instead of casting to avoid aliasing warnings.

> +		bsum += first;
> +		buf = RTE_PTR_ADD(buf, 1);
> +		len--;
> +	}

Many CPUs (such as x86) won't care about alignment and therefore the extra
code to handle this is not worth doing.

Perhaps DPDK needs a macro (like Linux kernel) for efficient unaligned access.

In Linux kernel it is CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS


* RE: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-30 17:45               ` Stephen Hemminger
@ 2022-07-01  4:11                 ` Emil Berg
  2022-07-01 16:50                   ` Morten Brørup
  0 siblings, 1 reply; 74+ messages in thread
From: Emil Berg @ 2022-07-01  4:11 UTC (permalink / raw)
  To: Stephen Hemminger, Morten Brørup
  Cc: bruce.richardson, dev, stable, bugzilla, hofors, olivier.matz



> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: den 30 juni 2022 19:46
> To: Morten Brørup <mb@smartsharesystems.com>
> Cc: Emil Berg <emil.berg@ericsson.com>; bruce.richardson@intel.com;
> dev@dpdk.org; stable@dpdk.org; bugzilla@dpdk.org; hofors@lysator.liu.se;
> olivier.matz@6wind.com
> Subject: Re: [PATCH v4] net: fix checksum with unaligned buffer
> 
> On Thu, 23 Jun 2022 14:39:00 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > +	/* if buffer is unaligned, keeping it byte order independent */
> > +	if (unlikely(unaligned)) {
> > +		uint16_t first = 0;
> > +		if (unlikely(len == 0))
> > +			return 0;
> 
> Why is length == 0 unique to unaligned case?
> 
> > +		((unsigned char *)&first)[1] = *(const unsigned
> char *)buf;
> 
> Use a proper union instead of casting to avoid aliasing warnings.
> 
> > +		bsum += first;
> > +		buf = RTE_PTR_ADD(buf, 1);
> > +		len--;
> > +	}
> 
> Many CPUs (such as x86) won't care about alignment and therefore the
> extra code to handle this is not worth doing.
> 

x86 does care about alignment. An example is the vmovdqa instruction, where 'a' stands for 'aligned'. The description in the link below says: "When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte boundary or a general-protection exception (#GP) will be generated."

https://www.felixcloutier.com/x86/movdqa:vmovdqa32:vmovdqa64

> Perhaps DPDK needs a macro (like Linux kernel) for efficient unaligned
> access.
> 
> In Linux kernel it is CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS


* RE: [PATCH v4] net: fix checksum with unaligned buffer
  2022-07-01  4:11                 ` Emil Berg
@ 2022-07-01 16:50                   ` Morten Brørup
  2022-07-01 17:04                     ` Stephen Hemminger
  0 siblings, 1 reply; 74+ messages in thread
From: Morten Brørup @ 2022-07-01 16:50 UTC (permalink / raw)
  To: Emil Berg, Stephen Hemminger
  Cc: bruce.richardson, dev, stable, bugzilla, hofors, olivier.matz

> From: Emil Berg [mailto:emil.berg@ericsson.com]
> Sent: Friday, 1 July 2022 06.11
> 
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > Sent: den 30 juni 2022 19:46
> >
> > On Thu, 23 Jun 2022 14:39:00 +0200
> > Morten Brørup <mb@smartsharesystems.com> wrote:
> >
> > > +	/* if buffer is unaligned, keeping it byte order independent */
> > > +	if (unlikely(unaligned)) {
> > > +		uint16_t first = 0;
> > > +		if (unlikely(len == 0))
> > > +			return 0;
> >
> > Why is length == 0 unique to unaligned case?

Because the aligned case handles it gracefully. The unaligned case subtracts one from 'len', which (being unsigned) would become a very large number, causing 'end' to become way off.
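
To make the wraparound concrete, here is a small sketch in plain C. The
helper names are hypothetical and only model the "consume one leading
byte, then count 16-bit words" step of the unaligned path; they are not
taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the unaligned path WITHOUT the len == 0 check:
 * one leading byte is consumed, then the remaining 16-bit words are counted.
 */
static inline size_t
words_after_first_byte(size_t len)
{
	len--;	/* size_t is unsigned: 0 - 1 wraps around to SIZE_MAX */
	return len / sizeof(uint16_t);
}

/* Same model WITH the early return, as in the patch. */
static inline size_t
words_after_first_byte_checked(size_t len)
{
	if (len == 0)
		return 0;
	return (len - 1) / sizeof(uint16_t);
}
```

With len == 0 the unchecked variant reports SIZE_MAX / 2 words, so an
'end' pointer derived from it would point far outside the buffer. The
checked variant returns 0.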

> >
> > > +		((unsigned char *)&first)[1] = *(const unsigned
> > char *)buf;
> >
> > Use a proper union instead of casting to avoid aliasing warnings.

I copied what is done by 'left', so it resembles the existing code in the function, making it easier to review.

It is part of the endian neutral checksum handling. Tricky stuff! :-)
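
As an illustration of why the existing 'left' construct is byte order
independent, here is a sketch in plain C with hypothetical helper names
(nothing here is DPDK API). Writing the trailing byte into the first
byte in memory of a zeroed uint16_t gives the same word as reading a
two-byte buffer { last, 0 } in native order. In other words, the odd
byte is treated as if the data were padded with one zero byte,
regardless of endianness.

```c
#include <stdint.h>
#include <string.h>

/* The construct used for 'left' in the existing function. */
static inline uint16_t
odd_byte_as_word(unsigned char last)
{
	uint16_t left = 0;

	/* access through unsigned char is always permitted */
	*(unsigned char *)&left = last;
	return left;
}

/* Reference: zero-pad to two bytes and load them as one native word. */
static inline uint16_t
padded_word(unsigned char last)
{
	const unsigned char padded[2] = { last, 0 };
	uint16_t v;

	memcpy(&v, padded, sizeof(v));
	return v;
}
```

Both helpers return the same value for every input byte on
little-endian and big-endian machines alike.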

> >
> > > +		bsum += first;
> > > +		buf = RTE_PTR_ADD(buf, 1);
> > > +		len--;
> > > +	}
> >
> > Many CPUs (such as x86) won't care about alignment and therefore the
> > extra code to handle this is not worth doing.
> >
> 
> x86 does care about alignment. An example is the vmovdqa instruction,
> where 'a' stands for 'aligned'. The description in the link below says:
> "When the source or destination operand is a memory operand, the
> operand must be aligned on a 16-byte boundary or a general-protection
> exception (#GP) will be generated. "
> 
> https://www.felixcloutier.com/x86/movdqa:vmovdqa32:vmovdqa64
> 

Also, this misconception is exactly the bug [1] this patch fixes.

[1] https://bugs.dpdk.org/show_bug.cgi?id=1035

> > Perhaps DPDK needs a macro (like Linux kernel) for efficient
> unaligned
> > access.
> >
> > In Linux kernel it is CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

I recently stumbled across RTE_ARCH_STRICT_ALIGN in /lib/eal/include/rte_common.h.

But I guess it is something else.

Anyway, this function has ugly alignment problems (also before the patch), and has gone through a couple of iterations to silence warnings from the compiler. These warnings should have been addressed instead of silenced. Mattias has suggested a far better solution [2] than mine, which also correctly addresses the compiler alignment warnings, so we will probably end up with his solution instead.

[2] http://inbox.dpdk.org/dev/AM8PR07MB7666AD7BF7B780CC5062C14598BD9@AM8PR07MB7666.eurprd07.prod.outlook.com/T/#m1a76490541fce4a85b12d9390f2f4fac5a9f4660



* Re: [PATCH v4] net: fix checksum with unaligned buffer
  2022-07-01 16:50                   ` Morten Brørup
@ 2022-07-01 17:04                     ` Stephen Hemminger
  2022-07-01 20:46                       ` Morten Brørup
  0 siblings, 1 reply; 74+ messages in thread
From: Stephen Hemminger @ 2022-07-01 17:04 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Emil Berg, bruce.richardson, dev, stable, bugzilla, hofors, olivier.matz

On Fri, 1 Jul 2022 18:50:34 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:

> But I guess it is something else.
> 
> Anyway, this function has ugly alignment problems (also before the patch), and has gone through a couple of iterations to silence warnings from the compiler. These warnings should have been addressed instead of silenced. Mattias has suggested a far better solution [2] than mine, which also correctly addresses the compiler alignment warnings, so we will probably end up with his solution instead.
> 
> [2] http://inbox.dpdk.org/dev/AM8PR07MB7666AD7BF7B780CC5062C14598BD9@AM8PR07MB7666.eurprd07.prod.outlook.com/T/#m1a76490541fce4a85b12d9390f2f4fac5a9f4660
> 


Maybe some mix: memcpy for the unaligned and odd-length cases, and a faster (unrolled?) version for the aligned, exact-multiple case?
Or just take the code from FreeBSD?


* RE: [PATCH v4] net: fix checksum with unaligned buffer
  2022-07-01 17:04                     ` Stephen Hemminger
@ 2022-07-01 20:46                       ` Morten Brørup
  0 siblings, 0 replies; 74+ messages in thread
From: Morten Brørup @ 2022-07-01 20:46 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Emil Berg, bruce.richardson, dev, stable, bugzilla, hofors, olivier.matz

> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, 1 July 2022 19.05
> 
> On Fri, 1 Jul 2022 18:50:34 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
> 
> > But I guess it is something else.
> >
> > Anyway, this function has ugly alignment problems (also before the
> patch), and has gone through a couple of iterations to silence warnings
> from the compiler. These warnings should have been addressed instead of
> silenced. Mattias has suggested a far better solution [2] than mine,
> which also correctly addresses the compiler alignment warnings, so we
> will probably end up with his solution instead.
> >
> > [2]
> http://inbox.dpdk.org/dev/AM8PR07MB7666AD7BF7B780CC5062C14598BD9@AM8PR0
> 7MB7666.eurprd07.prod.outlook.com/T/#m1a76490541fce4a85b12d9390f2f4fac5
> a9f4660
> >
> 
> 
> Maybe some mix of the memcpy for unaligned and odd length and faster
> (unrolled?) version for the case of aligned and exact multiple?
> Or just take code from FreeBSD?

I just took a look at the BSD code, and it starts with the same "if ptr is unaligned" check as my patch, and then does some manual loop unrolling, which we expect the compiler to do for us. Mattias has demonstrated that his solution has better performance, not only on modern x86 CPUs but also on an A72 CPU, so I prefer his solution. And the difference between using the "vmovdqa" and "vmovdqu" instructions here seems to be insignificant.



* Re: [PATCH v4] net: fix checksum with unaligned buffer
  2022-06-30 16:28                               ` Morten Brørup
@ 2022-07-07 15:21                                 ` Stanisław Kardach
  0 siblings, 0 replies; 74+ messages in thread
From: Stanisław Kardach @ 2022-07-07 15:21 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Mattias Rönnblom, Emil Berg, Bruce Richardson, dev,
	Stephen Hemminger, dpdk stable, bugzilla, Olivier Matz

On Thu, Jun 30, 2022 at 6:32 PM Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > Sent: Tuesday, 28 June 2022 08.28
> >
> > On 2022-06-27 22:21, Morten Brørup wrote:
> > >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > >> Sent: Monday, 27 June 2022 19.23
> > >>
> > >> On 2022-06-27 15:22, Morten Brørup wrote:
> > >>>> From: Emil Berg [mailto:emil.berg@ericsson.com]
> > >>>> Sent: Monday, 27 June 2022 14.51
> > >>>>
> > >>>>> From: Emil Berg
> > >>>>> Sent: den 27 juni 2022 14:46
> > >>>>>
> > >>>>>> From: Mattias Rönnblom <hofors@lysator.liu.se>
> > >>>>>> Sent: den 27 juni 2022 14:28
> > >>>>>>
> > >>>>>> On 2022-06-23 14:51, Morten Brørup wrote:
> > >>>>>>>> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > >>>>>>>> Sent: Thursday, 23 June 2022 14.39
> > >>>>>>>>
> > >>>>>>>> With this patch, the checksum can be calculated on an
> > unaligned
> > >>>> buffer.
> > >>>>>>>> I.e. the buf parameter is no longer required to be 16 bit
> > >>>> aligned.
> > >>>>>>>>
> > >>>>>>>> The checksum is still calculated using a 16 bit aligned
> > pointer,
> > >>>> so
> > >>>>>>>> the compiler can auto-vectorize the function's inner loop.
> > >>>>>>>>
> > >>>>>>>> When the buffer is unaligned, the first byte of the buffer is
> > >>>>>>>> handled separately. Furthermore, the calculated checksum of
> > the
> > >>>>>>>> buffer is byte shifted before being added to the initial
> > >>>> checksum,
> > >>>>>>>> to compensate for the checksum having been calculated on the
> > >>>> buffer
> > >>>>>>>> shifted by one byte.
> > >>>>>>>>
> > >>>>>>>> v4:
> > >>>>>>>> * Add copyright notice.
> > >>>>>>>> * Include stdbool.h (Emil Berg).
> > >>>>>>>> * Use RTE_PTR_ADD (Emil Berg).
> > >>>>>>>> * Fix one more typo in commit message. Is 'unligned' even a
> > >>>> word?
> > >>>>>>>> v3:
> > >>>>>>>> * Remove braces from single statement block.
> > >>>>>>>> * Fix typo in commit message.
> > >>>>>>>> v2:
> > >>>>>>>> * Do not assume that the buffer is part of an aligned packet
> > >>>> buffer.
> > >>>>>>>>
> > >>>>>>>> Bugzilla ID: 1035
> > >>>>>>>> Cc: stable@dpdk.org
> > >>>>>>>>
> > >>>>>>>> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > >
> > > [...]
> > >
> > >>>>>>
> > >>>>>> The compiler will be able to auto vectorize even unaligned
> > >>>> accesses,
> > >>>>>> just with different instructions. From what I can tell, there's
> > no
> > >>>>>> performance impact, at least not on the x86_64 systems I tried
> > on.
> > >>>>>>
> > >>>>>> I think you should remove the first special case conditional and
> > >>>> use
> > >>>>>> memcpy() instead of the cumbersome __may_alias__ construct to
> > >>>> retrieve
> > >>>>>> the data.
> > >>>>>>
> > >>>>>
> > >>>>> Here:
> > >>>>> https://www.agner.org/optimize/instruction_tables.pdf
> > >>>>> it lists the latency of vmovdqa (aligned) as 6 cycles and the
> > >> latency
> > >>>> for
> > >>>>> vmovdqu (unaligned) as 7 cycles. So I guess there can be some
> > >>>> difference.
> > >>>>> Although in practice I'm not sure what difference it makes. I've
> > >> not
> > >>>> seen any
> > >>>>> difference in runtime between the two versions.
> > >>>>>
> > >>>>
> > >>>> Correction to my comment:
> > >>>> Those stats are for some older CPU. For some newer CPUs such as
> > >> Tiger
> > >>>> Lake the stats seem to be the same regardless of aligned or
> > >> unaligned.
> > >>>>
> > >>>
> > >>> I agree that the memcpy method is more elegant and easy to read.
> > >>>
> > >>> However, we would need to performance test the modified checksum
> > >> function with a large number of CPUs to prove that we don't
> > introduce a
> > >> performance regression on any CPU architecture still supported by
> > DPDK.
> > >> And Emil already found a CPU where it costs 1 extra cycle per 16
> > bytes,
> > >> which adds up to a total of ca. 91 extra cycles on a 1460 byte TCP
> > >> packet.
> > >>>
> > >>
> > >> I think you've misunderstood what latency means in such tables. It's
> > a
> > >> data dependency thing, not a measure of throughput. The throughput
> > is
> > >> *much* higher. My guess would be two such instruction per clock.
> > >>
> > >> For your 1460 bytes example, my Zen3 AMD performs identically
> > with
> > >> both the current DPDK implementation, your patch, and a memcpy()-
> > ified
> > >> version of the current implementation. They all need ~130 clock
> > >> cycles/packet, with warm caches. IPC is 3 instructions per cycle,
> > but
> > >> obviously not all instructions are SIMD.
> > >
> > > You're right, I wasn't thinking deeper about it before extrapolating.
> > >
> > > Great to see some real numbers! I wish someone would do the same
> > testing on an old ARM CPU, so we could also see the other end of the
> > scale.
> > >
> >
> > I've run it on an ARM A72. For the aligned 1460 bytes case I got:
> > Current DPDK ~572 cc. Your patch: ~578 cc. Memcpy-fied: ~573 cc. They
> > performed about the same for all unaligned/aligned and sizes I tested.
> > This platform (or could be GCC version as well) doesn't suffer from the
> > unaligned performance degradation your patch showed on my AMD machine.
> >
> > >> The main issue with checksumming on the CPU is, in my experience,
> > not
> > >> that you don't have enough compute, but that you trash the caches.
> > >
> > > Agree. I have noticed that x86 has "non-temporal" instruction
> > variants to load/store data without trashing the cache entirely.
> > >
> > > A variant of the checksum function using such instructions might be
> > handy.
> > >
> >
> > Yes, although you may need to prefetch the payload for good
> > performance.
> >
> > > Variants of the memcpy function using such instructions might also be
> > handy for some purposes, e.g. copying the contents of packets, where
> > the original and/or copy will not be accessed shortly thereafter.
> > >
> >
> > Indeed and I think it's been discussed on the list. There's some work
> > to
> > get it right, since alignment requirement and the fact a different
> > memory model is used for those SIMD instructions causes trouble for a
> > generic implementation. (For x86_64.)
>
> I just posted an RFC [1] for such memcpy() and memset() functions,
> so let's see how it pans out.
>
> [1] http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87195@smartserver.smartshare.dk/T/#u
>
> >
> > >>> So I opted for a solution with zero changes to the inner loop, so
> > no
> > >> performance retesting is required (for the previously supported use
> > >> cases, where the buffer is aligned).
> > >>>
> > >>
> > >> You will see performance degradation with this solution as well,
> > under
> > >> certain conditions. For unaligned 100 bytes of data, the current
> > DPDK
> > >> implementation and the memcpy()-fied version needs ~21 cc/packet.
> > Your
> > >> patch needs 54 cc/packet.
> > >
> > > Yes, it's a tradeoff. I exclusively aimed at maintaining performance
> > for the case with aligned buffers (under all circumstances, with all
> > CPUs etc.), and ignored how it affects the performance for the case
> > with unaligned buffers.
> > >
> > > Unlike this patch, the memcpy() variant has no additional branches
> > for the unaligned case, so its performance should be generally
> > unaffected by the buffer being aligned or not. However, I don't have
> > sufficient in-depth CPU knowledge to say if this also applies to RISCV
> > and older ARM CPUs still supported by DPDK.
> > >
> >
> > I don't think avoiding RISCV non-catastrophic regressions trumps
> > improving performance on mainstream CPUs and avoiding code quality
> > regressions.
> +1
+1. In general, the RISC-V spec leaves unaligned load/store handling to
the implementation (it might fault, it might not). The U74 core that I
have at hand allows unaligned reads/writes. It's not a platform for
performance evaluation, though (time measurement causes a trap to
firmware), so I won't say anything on that.


--
Best Regards,
Stanisław Kardach


* [PATCH 1/2] app/test: add cksum performance test
  2022-06-27 20:21                           ` Morten Brørup
  2022-06-28  6:28                             ` Mattias Rönnblom
@ 2022-07-07 18:34                             ` Mattias Rönnblom
  2022-07-07 18:34                               ` [PATCH 2/2] net: have checksum routines accept unaligned data Mattias Rönnblom
  1 sibling, 1 reply; 74+ messages in thread
From: Mattias Rönnblom @ 2022-07-07 18:34 UTC (permalink / raw)
  To: olivier.matz
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	onar.olsen, Morten Brørup, Mattias Rönnblom

From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

Add performance test for the rte_raw_cksum() function, which delegates
the actual work to __rte_raw_cksum(), which in turn is used by other
functions in need of Internet checksum calculation.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 MAINTAINERS                |   1 +
 app/test/meson.build       |   1 +
 app/test/test_cksum_perf.c | 118 +++++++++++++++++++++++++++++++++++++
 3 files changed, 120 insertions(+)
 create mode 100644 app/test/test_cksum_perf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c923712946..2a4c99e05a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1414,6 +1414,7 @@ Network headers
 M: Olivier Matz <olivier.matz@6wind.com>
 F: lib/net/
 F: app/test/test_cksum.c
+F: app/test/test_cksum_perf.c
 
 Packet CRC
 M: Jasvinder Singh <jasvinder.singh@intel.com>
diff --git a/app/test/meson.build b/app/test/meson.build
index 431c5bd318..191db03d1d 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -18,6 +18,7 @@ test_sources = files(
         'test_bpf.c',
         'test_byteorder.c',
         'test_cksum.c',
+        'test_cksum_perf.c',
         'test_cmdline.c',
         'test_cmdline_cirbuf.c',
         'test_cmdline_etheraddr.c',
diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c
new file mode 100644
index 0000000000..d27e7f893a
--- /dev/null
+++ b/app/test/test_cksum_perf.c
@@ -0,0 +1,118 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Ericsson AB
+ */
+
+#include <stdio.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_ip.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define NUM_BLOCKS (10)
+#define ITERATIONS (1000000)
+
+static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 };
+
+static __rte_noinline uint16_t
+do_rte_raw_cksum(const void *buf, size_t len)
+{
+	return rte_raw_cksum(buf, len);
+}
+
+static void
+init_block(void *buf, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++)
+		((char *)buf)[i] = (uint8_t)rte_rand();
+}
+
+static int
+test_cksum_perf_size_alignment(size_t block_size, bool aligned)
+{
+	char *data[NUM_BLOCKS];
+	char *blocks[NUM_BLOCKS];
+	unsigned int i;
+	uint64_t start;
+	uint64_t end;
+	/* Floating point to handle low (pseudo-)TSC frequencies */
+	double block_latency;
+	double byte_latency;
+	volatile uint64_t sum = 0;
+
+	for (i = 0; i < NUM_BLOCKS; i++) {
+		data[i] = rte_malloc(NULL, block_size + 1, 0);
+
+		if (data[i] == NULL) {
+			printf("Failed to allocate memory for block\n");
+			return TEST_FAILED;
+		}
+
+		init_block(data[i], block_size + 1);
+
+		blocks[i] = aligned ? data[i] : data[i] + 1;
+	}
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++) {
+		unsigned int j;
+		for (j = 0; j < NUM_BLOCKS; j++)
+			sum += do_rte_raw_cksum(blocks[j], block_size);
+	}
+
+	end = rte_rdtsc();
+
+	block_latency = (end - start) / (double)(ITERATIONS * NUM_BLOCKS);
+	byte_latency = block_latency / block_size;
+
+	printf("%-9s %10zu %19.1f %16.2f\n", aligned ? "Aligned" : "Unaligned",
+	       block_size, block_latency, byte_latency);
+
+	for (i = 0; i < NUM_BLOCKS; i++)
+		rte_free(data[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cksum_perf_size(size_t block_size)
+{
+	int rc;
+
+	rc = test_cksum_perf_size_alignment(block_size, true);
+	if (rc != TEST_SUCCESS)
+		return rc;
+
+	rc = test_cksum_perf_size_alignment(block_size, false);
+
+	return rc;
+}
+
+static int
+test_cksum_perf(void)
+{
+	uint16_t i;
+
+	printf("### rte_raw_cksum() performance ###\n");
+	printf("Alignment  Block size    TSC cycles/block  TSC cycles/byte\n");
+
+	for (i = 0; i < RTE_DIM(data_sizes); i++) {
+		int rc;
+
+		rc = test_cksum_perf_size(data_sizes[i]);
+		if (rc != TEST_SUCCESS)
+			return rc;
+	}
+
+	return TEST_SUCCESS;
+}
+
+
+REGISTER_TEST_COMMAND(cksum_perf_autotest, test_cksum_perf);
+
-- 
2.25.1



* [PATCH 2/2] net: have checksum routines accept unaligned data
  2022-07-07 18:34                             ` [PATCH 1/2] app/test: add cksum performance test Mattias Rönnblom
@ 2022-07-07 18:34                               ` Mattias Rönnblom
  2022-07-07 21:44                                 ` Morten Brørup
  0 siblings, 1 reply; 74+ messages in thread
From: Mattias Rönnblom @ 2022-07-07 18:34 UTC (permalink / raw)
  To: olivier.matz
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	onar.olsen, Morten Brørup, Mattias Rönnblom

From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

__rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its
data through an uint16_t pointer, which allowed the compiler to assume
the data was 16-bit aligned. This in turn would, with certain
architectures and compiler flag combinations, result in code with SIMD
load or store instructions with restrictions on data alignment.

This patch keeps the old algorithm, but data is read using memcpy()
instead of direct pointer access, forcing the compiler to always
generate code that handles unaligned input. The __may_alias__ GCC
attribute is no longer needed.

The data on which the Internet checksum functions operate is almost
always 16-bit aligned, but there are exceptions. In particular, the
PDCP protocol header may (literally) have an odd size.

Performance impact seems to range from none to a very slight
regression.

Bugzilla ID: 1035
Cc: stable@dpdk.org

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/net/rte_ip.h | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
index b502481670..a9e6251f14 100644
--- a/lib/net/rte_ip.h
+++ b/lib/net/rte_ip.h
@@ -160,18 +160,23 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr *ipv4_hdr)
 static inline uint32_t
 __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
-	/* extend strict-aliasing rules */
-	typedef uint16_t __attribute__((__may_alias__)) u16_p;
-	const u16_p *u16_buf = (const u16_p *)buf;
-	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
+	const void *end;
 
-	for (; u16_buf != end; ++u16_buf)
-		sum += *u16_buf;
+	for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) * sizeof(uint16_t));
+	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
+		uint16_t v;
+
+		memcpy(&v, buf, sizeof(uint16_t));
+		sum += v;
+	}
 
 	/* if length is odd, keeping it byte order independent */
 	if (unlikely(len % 2)) {
+		uint8_t last;
 		uint16_t left = 0;
-		*(unsigned char *)&left = *(const unsigned char *)end;
+
+		memcpy(&last, end, 1);
+		*(unsigned char *)&left = last;
 		sum += left;
 	}
 
-- 
2.25.1
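
For readers without a DPDK tree at hand, the memcpy()-based read described in
the commit message above can be sketched as a stand-alone function. This is
only an illustration under stated assumptions: raw_cksum_sketch() and
fold_sketch() are hypothetical names, not DPDK API, and a byte pointer replaces
RTE_PTR_ADD() so that the sketch builds with nothing but the C standard library.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-alone helper (not the DPDK function): sums the
 * buffer as native-endian 16-bit words without ever dereferencing a
 * uint16_t pointer, so the buffer may start on any address.
 */
static uint32_t
raw_cksum_sketch(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;
	const unsigned char *end =
		p + (len / sizeof(uint16_t)) * sizeof(uint16_t);

	for (; p != end; p += sizeof(uint16_t)) {
		uint16_t v;

		/* memcpy() carries no alignment assumption about p */
		memcpy(&v, p, sizeof(v));
		sum += v;
	}

	/* odd length: the last byte fills the first byte of a zeroed word */
	if (len % 2) {
		uint16_t left = 0;

		memcpy(&left, end, 1);
		sum += left;
	}

	return sum;
}

/* Hypothetical helper: fold the 32-bit accumulator into 16 bits. */
static uint16_t
fold_sketch(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

Calling raw_cksum_sketch() on the same bytes placed at an even and at an odd
address should give the same sum, which is the property the patch is after.
Compilers commonly turn a fixed-size two-byte memcpy() into a single load, but
that is an optimizer detail and not something the sketch depends on.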



* RE: [PATCH 2/2] net: have checksum routines accept unaligned data
  2022-07-07 18:34                               ` [PATCH 2/2] net: have checksum routines accept unaligned data Mattias Rönnblom
@ 2022-07-07 21:44                                 ` Morten Brørup
  2022-07-08 12:43                                   ` Mattias Rönnblom
  0 siblings, 1 reply; 74+ messages in thread
From: Morten Brørup @ 2022-07-07 21:44 UTC (permalink / raw)
  To: Mattias Rönnblom, olivier.matz
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	onar.olsen, Mattias Rönnblom

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Thursday, 7 July 2022 20.35
> 
> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> 
> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its
> data through an uint16_t pointer, which allowed the compiler to assume
> the data was 16-bit aligned. This in turn would, with certain
> architectures and compiler flag combinations, result in code with SIMD
> load or store instructions with restrictions on data alignment.
> 
> This patch keeps the old algorithm, but data is read using memcpy()
> instead of direct pointer access, forcing the compiler to always
> generate code that handles unaligned input. The __may_alias__ GCC
> attribute is no longer needed.
> 
> The data on which the Internet checksum functions operate is almost
> always 16-bit aligned, but there are exceptions. In particular, the
> PDCP protocol header may (literally) have an odd size.
> 
> Performance impact seems to range from none to a very slight
> regression.
> 
> Bugzilla ID: 1035
> Cc: stable@dpdk.org
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  lib/net/rte_ip.h | 19 ++++++++++++-------
>  1 file changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> index b502481670..a9e6251f14 100644
> --- a/lib/net/rte_ip.h
> +++ b/lib/net/rte_ip.h
> @@ -160,18 +160,23 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr
> *ipv4_hdr)
>  static inline uint32_t
>  __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>  {
> -	/* extend strict-aliasing rules */
> -	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> -	const u16_p *u16_buf = (const u16_p *)buf;
> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> +	const void *end;

I would set "end" here instead, possibly making the pointer const too. And add spaces around '/'.
const void * const end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) * sizeof(uint16_t));

> 
> -	for (; u16_buf != end; ++u16_buf)
> -		sum += *u16_buf;
> +	for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) *
> sizeof(uint16_t));
> +	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
> +		uint16_t v;
> +
> +		memcpy(&v, buf, sizeof(uint16_t));
> +		sum += v;
> +	}
> 
>  	/* if length is odd, keeping it byte order independent */
>  	if (unlikely(len % 2)) {
> +		uint8_t last;
>  		uint16_t left = 0;
> -		*(unsigned char *)&left = *(const unsigned char *)end;
> +
> +		memcpy(&last, end, 1);
> +		*(unsigned char *)&left = last;

Couldn't you just memcpy(&left, end, 1), and omit the temporary variable "last"?

>  		sum += left;
>  	}
> 
> --
> 2.25.1
> 

With or without my suggested changes, it looks good.

Reviewed-by: Morten Brørup <mb@smartsharesystems.com>



* Re: [PATCH 2/2] net: have checksum routines accept unaligned data
  2022-07-07 21:44                                 ` Morten Brørup
@ 2022-07-08 12:43                                   ` Mattias Rönnblom
  2022-07-08 12:56                                     ` [PATCH v2 1/2] app/test: add cksum performance test Mattias Rönnblom
  2022-07-08 13:02                                     ` [PATCH 2/2] net: have checksum routines accept unaligned data Morten Brørup
  0 siblings, 2 replies; 74+ messages in thread
From: Mattias Rönnblom @ 2022-07-08 12:43 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, olivier.matz
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev, Onar Olsen

On 2022-07-07 23:44, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Thursday, 7 July 2022 20.35
>>
>> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>
>> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its
>> data through an uint16_t pointer, which allowed the compiler to assume
>> the data was 16-bit aligned. This in turn would, with certain
>> architectures and compiler flag combinations, result in code with SIMD
>> load or store instructions with restrictions on data alignment.
>>
>> This patch keeps the old algorithm, but data is read using memcpy()
>> instead of direct pointer access, forcing the compiler to always
>> generate code that handles unaligned input. The __may_alias__ GCC
>> attribute is no longer needed.
>>
>> The data on which the Internet checksum functions operate is almost
>> always 16-bit aligned, but there are exceptions. In particular, the
>> PDCP protocol header may (literally) have an odd size.
>>
>> Performance impact seems to range from none to a very slight
>> regression.
>>
>> Bugzilla ID: 1035
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   lib/net/rte_ip.h | 19 ++++++++++++-------
>>   1 file changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
>> index b502481670..a9e6251f14 100644
>> --- a/lib/net/rte_ip.h
>> +++ b/lib/net/rte_ip.h
>> @@ -160,18 +160,23 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr
>> *ipv4_hdr)
>>   static inline uint32_t
>>   __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>>   {
>> -	/* extend strict-aliasing rules */
>> -	typedef uint16_t __attribute__((__may_alias__)) u16_p;
>> -	const u16_p *u16_buf = (const u16_p *)buf;
>> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
>> +	const void *end;
> 
> I would set "end" here instead, possibly making the pointer const too. And add spaces around '/'.
> const void * const end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) * sizeof(uint16_t));
> 

I don't think that makes the code more readable.

>>
>> -	for (; u16_buf != end; ++u16_buf)
>> -		sum += *u16_buf;
>> +	for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) *
>> sizeof(uint16_t));
>> +	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
>> +		uint16_t v;
>> +
>> +		memcpy(&v, buf, sizeof(uint16_t));
>> +		sum += v;
>> +	}
>>
>>   	/* if length is odd, keeping it byte order independent */
>>   	if (unlikely(len % 2)) {
>> +		uint8_t last;
>>   		uint16_t left = 0;
>> -		*(unsigned char *)&left = *(const unsigned char *)end;
>> +
>> +		memcpy(&last, end, 1);
>> +		*(unsigned char *)&left = last;
> 
> Couldn't you just memcpy(&left, end, 1), and omit the temporary variable "last"?
> 

Good point.

I don't like how this code is clever vis-à-vis byte order, but then I 
also don't have a better suggestion.

>>   		sum += left;
>>   	}
>>
>> --
>> 2.25.1
>>
> 
> With or without my suggested changes, it looks good.
> 
> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
> 

Thanks!



* [PATCH v2 1/2] app/test: add cksum performance test
  2022-07-08 12:43                                   ` Mattias Rönnblom
@ 2022-07-08 12:56                                     ` Mattias Rönnblom
  2022-07-08 12:56                                       ` [PATCH v2 2/2] net: have checksum routines accept unaligned data Mattias Rönnblom
  2022-07-11  9:47                                       ` [PATCH v2 1/2] app/test: add cksum performance test Olivier Matz
  2022-07-08 13:02                                     ` [PATCH 2/2] net: have checksum routines accept unaligned data Morten Brørup
  1 sibling, 2 replies; 74+ messages in thread
From: Mattias Rönnblom @ 2022-07-08 12:56 UTC (permalink / raw)
  To: olivier.matz
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	onar.olsen, Morten Brørup, Mattias Rönnblom

Add performance test for the rte_raw_cksum() function, which delegates
the actual work to __rte_raw_cksum(), which in turn is used by other
functions in need of Internet checksum calculation.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

---

v2:
  * Added __rte_unused to unused volatile variable, to keep the Intel
    compiler happy.
---
 MAINTAINERS                |   1 +
 app/test/meson.build       |   1 +
 app/test/test_cksum_perf.c | 118 +++++++++++++++++++++++++++++++++++++
 3 files changed, 120 insertions(+)
 create mode 100644 app/test/test_cksum_perf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c923712946..2a4c99e05a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1414,6 +1414,7 @@ Network headers
 M: Olivier Matz <olivier.matz@6wind.com>
 F: lib/net/
 F: app/test/test_cksum.c
+F: app/test/test_cksum_perf.c
 
 Packet CRC
 M: Jasvinder Singh <jasvinder.singh@intel.com>
diff --git a/app/test/meson.build b/app/test/meson.build
index 431c5bd318..191db03d1d 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -18,6 +18,7 @@ test_sources = files(
         'test_bpf.c',
         'test_byteorder.c',
         'test_cksum.c',
+        'test_cksum_perf.c',
         'test_cmdline.c',
         'test_cmdline_cirbuf.c',
         'test_cmdline_etheraddr.c',
diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c
new file mode 100644
index 0000000000..bff73cb3bb
--- /dev/null
+++ b/app/test/test_cksum_perf.c
@@ -0,0 +1,118 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Ericsson AB
+ */
+
+#include <stdio.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_ip.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define NUM_BLOCKS (10)
+#define ITERATIONS (1000000)
+
+static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 };
+
+static __rte_noinline uint16_t
+do_rte_raw_cksum(const void *buf, size_t len)
+{
+	return rte_raw_cksum(buf, len);
+}
+
+static void
+init_block(void *buf, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++)
+		((char *)buf)[i] = (uint8_t)rte_rand();
+}
+
+static int
+test_cksum_perf_size_alignment(size_t block_size, bool aligned)
+{
+	char *data[NUM_BLOCKS];
+	char *blocks[NUM_BLOCKS];
+	unsigned int i;
+	uint64_t start;
+	uint64_t end;
+	/* Floating point to handle low (pseudo-)TSC frequencies */
+	double block_latency;
+	double byte_latency;
+	volatile __rte_unused uint64_t sum = 0;
+
+	for (i = 0; i < NUM_BLOCKS; i++) {
+		data[i] = rte_malloc(NULL, block_size + 1, 0);
+
+		if (data[i] == NULL) {
+			printf("Failed to allocate memory for block\n");
+			return TEST_FAILED;
+		}
+
+		init_block(data[i], block_size + 1);
+
+		blocks[i] = aligned ? data[i] : data[i] + 1;
+	}
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++) {
+		unsigned int j;
+		for (j = 0; j < NUM_BLOCKS; j++)
+			sum += do_rte_raw_cksum(blocks[j], block_size);
+	}
+
+	end = rte_rdtsc();
+
+	block_latency = (end - start) / (double)(ITERATIONS * NUM_BLOCKS);
+	byte_latency = block_latency / block_size;
+
+	printf("%-9s %10zu %19.1f %16.2f\n", aligned ? "Aligned" : "Unaligned",
+	       block_size, block_latency, byte_latency);
+
+	for (i = 0; i < NUM_BLOCKS; i++)
+		rte_free(data[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cksum_perf_size(size_t block_size)
+{
+	int rc;
+
+	rc = test_cksum_perf_size_alignment(block_size, true);
+	if (rc != TEST_SUCCESS)
+		return rc;
+
+	rc = test_cksum_perf_size_alignment(block_size, false);
+
+	return rc;
+}
+
+static int
+test_cksum_perf(void)
+{
+	uint16_t i;
+
+	printf("### rte_raw_cksum() performance ###\n");
+	printf("Alignment  Block size    TSC cycles/block  TSC cycles/byte\n");
+
+	for (i = 0; i < RTE_DIM(data_sizes); i++) {
+		int rc;
+
+		rc = test_cksum_perf_size(data_sizes[i]);
+		if (rc != TEST_SUCCESS)
+			return rc;
+	}
+
+	return TEST_SUCCESS;
+}
+
+
+REGISTER_TEST_COMMAND(cksum_perf_autotest, test_cksum_perf);
+
-- 
2.25.1



* [PATCH v2 2/2] net: have checksum routines accept unaligned data
  2022-07-08 12:56                                     ` [PATCH v2 1/2] app/test: add cksum performance test Mattias Rönnblom
@ 2022-07-08 12:56                                       ` Mattias Rönnblom
  2022-07-08 14:44                                         ` Ferruh Yigit
  2022-07-11  9:53                                         ` Olivier Matz
  2022-07-11  9:47                                       ` [PATCH v2 1/2] app/test: add cksum performance test Olivier Matz
  1 sibling, 2 replies; 74+ messages in thread
From: Mattias Rönnblom @ 2022-07-08 12:56 UTC (permalink / raw)
  To: olivier.matz
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	onar.olsen, Morten Brørup, Mattias Rönnblom

__rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its
data through an uint16_t pointer, which allowed the compiler to assume
the data was 16-bit aligned. This in turn would, with certain
architectures and compiler flag combinations, result in code with SIMD
load or store instructions with restrictions on data alignment.

This patch keeps the old algorithm, but data is read using memcpy()
instead of direct pointer access, forcing the compiler to always
generate code that handles unaligned input. The __may_alias__ GCC
attribute is no longer needed.

The data on which the Internet checksum functions operate is almost
always 16-bit aligned, but there are exceptions. In particular, the
PDCP protocol header may (literally) have an odd size.

Performance impact seems to range from none to a very slight
regression.

Bugzilla ID: 1035
Cc: stable@dpdk.org

---

v2:
  * Simplified the odd-length conditional (Morten Brørup).

Reviewed-by: Morten Brørup <mb@smartsharesystems.com>

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/net/rte_ip.h | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
index b502481670..a0334d931e 100644
--- a/lib/net/rte_ip.h
+++ b/lib/net/rte_ip.h
@@ -160,18 +160,21 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr *ipv4_hdr)
 static inline uint32_t
 __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
-	/* extend strict-aliasing rules */
-	typedef uint16_t __attribute__((__may_alias__)) u16_p;
-	const u16_p *u16_buf = (const u16_p *)buf;
-	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
+	const void *end;
 
-	for (; u16_buf != end; ++u16_buf)
-		sum += *u16_buf;
+	for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) * sizeof(uint16_t));
+	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
+		uint16_t v;
+
+		memcpy(&v, buf, sizeof(uint16_t));
+		sum += v;
+	}
 
 	/* if length is odd, keeping it byte order independent */
 	if (unlikely(len % 2)) {
 		uint16_t left = 0;
-		*(unsigned char *)&left = *(const unsigned char *)end;
+
+		memcpy(&left, end, 1);
 		sum += left;
 	}
 
-- 
2.25.1



* RE: [PATCH 2/2] net: have checksum routines accept unaligned data
  2022-07-08 12:43                                   ` Mattias Rönnblom
  2022-07-08 12:56                                     ` [PATCH v2 1/2] app/test: add cksum performance test Mattias Rönnblom
@ 2022-07-08 13:02                                     ` Morten Brørup
  2022-07-08 13:52                                       ` Mattias Rönnblom
  1 sibling, 1 reply; 74+ messages in thread
From: Morten Brørup @ 2022-07-08 13:02 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, olivier.matz
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev, Onar Olsen

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Friday, 8 July 2022 14.44
> 
> On 2022-07-07 23:44, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >> Sent: Thursday, 7 July 2022 20.35
> >>
> >> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>
> >> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed
> its
> >> data through an uint16_t pointer, which allowed the compiler to
> assume
> >> the data was 16-bit aligned. This in turn would, with certain
> >> architectures and compiler flag combinations, result in code with
> SIMD
> >> load or store instructions with restrictions on data alignment.
> >>
> >> This patch keeps the old algorithm, but data is read using memcpy()
> >> instead of direct pointer access, forcing the compiler to always
> >> generate code that handles unaligned input. The __may_alias__ GCC
> >> attribute is no longer needed.
> >>
> >> The data on which the Internet checksum functions operate is
> almost
> >> always 16-bit aligned, but there are exceptions. In particular, the
> >> PDCP protocol header may (literally) have an odd size.
> >>
> >> Performance impact seems to range from none to a very slight
> >> regression.
> >>
> >> Bugzilla ID: 1035
> >> Cc: stable@dpdk.org
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> ---
> >>   lib/net/rte_ip.h | 19 ++++++++++++-------
> >>   1 file changed, 12 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> >> index b502481670..a9e6251f14 100644
> >> --- a/lib/net/rte_ip.h
> >> +++ b/lib/net/rte_ip.h
> >> @@ -160,18 +160,23 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr
> >> *ipv4_hdr)
> >>   static inline uint32_t
> >>   __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> >>   {
> >> -	/* extend strict-aliasing rules */
> >> -	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> >> -	const u16_p *u16_buf = (const u16_p *)buf;
> >> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> >> +	const void *end;
> >
> > I would set "end" here instead, possibly making the pointer const
> too. And add spaces around '/'.
> > const void * const end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) *
> sizeof(uint16_t));
> >
> 
> I don't think that makes the code more readable.

It's only a matter of taste... Your code, your decision. :-)

I think the spaces are required by the coding standard; not sure, though.

> 
> >>
> >> -	for (; u16_buf != end; ++u16_buf)
> >> -		sum += *u16_buf;
> >> +	for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) *
> >> sizeof(uint16_t));
> >> +	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
> >> +		uint16_t v;
> >> +
> >> +		memcpy(&v, buf, sizeof(uint16_t));
> >> +		sum += v;
> >> +	}
> >>
> >>   	/* if length is odd, keeping it byte order independent */
> >>   	if (unlikely(len % 2)) {
> >> +		uint8_t last;
> >>   		uint16_t left = 0;
> >> -		*(unsigned char *)&left = *(const unsigned char *)end;
> >> +
> >> +		memcpy(&last, end, 1);
> >> +		*(unsigned char *)&left = last;
> >
> > Couldn't you just memcpy(&left, end, 1), and omit the temporary
> variable "last"?
> >
> 
> Good point.
> 
> I don't like how this code is clever vis-à-vis byte order, but then I
> also don't have a better suggestion.

The byte ordering cleverness has its roots in RFC 1071.

Stephen suggested using a union, although in a slightly different context. I'm not sure it will be more readable here, because it will require #ifdef to support byte ordering. Just thought I'd mention it, for your consideration.

Your patch v2 just reached my inbox, and it looks good. No further response to this email is expected.
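
As an aside on the byte ordering point above: RFC 1071 notes that the sum can
be computed in either byte order, as long as the result is written back in the
same order it was read. The sketch below tries to show this. All function
names are hypothetical helpers and none of them is DPDK API. It computes the
folded sum once with native-order loads (including the zero-padded trailing
byte, as in the patch) and once with words assembled explicitly in network
order, then compares the bytes each result would put on the wire.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fold a 32-bit accumulator into 16 bits. */
static uint16_t
fold16_sketch(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Native-order sum, reading words with memcpy() as in the patch. */
static uint16_t
cksum_native_sketch(const unsigned char *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		uint16_t v;

		memcpy(&v, p + i, sizeof(v));
		sum += v;
	}

	if (len % 2) {
		uint16_t left = 0;

		/* last byte lands in the first byte of the word */
		memcpy(&left, p + len - 1, 1);
		sum += left;
	}

	return fold16_sketch(sum);
}

/* Same sum with every word assembled explicitly in network order. */
static uint16_t
cksum_be_sketch(const unsigned char *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];

	if (len % 2)
		sum += (uint32_t)p[len - 1] << 8;

	return fold16_sketch(sum);
}

/* Returns 1 if both results have the same in-memory (wire) bytes. */
static int
same_wire_bytes_sketch(const unsigned char *p, size_t len)
{
	uint16_t native = cksum_native_sketch(p, len);
	uint16_t be = cksum_be_sketch(p, len);
	unsigned char a[2];
	unsigned char b[2];

	memcpy(a, &native, sizeof(a));	/* stored as the CPU lays it out */
	b[0] = (unsigned char)(be >> 8);	/* stored in network order */
	b[1] = (unsigned char)(be & 0xff);

	return memcmp(a, b, sizeof(a)) == 0;
}
```

The two results should agree byte for byte on both little- and big-endian
hosts, so no #ifdef is needed for the odd-length case. The sketch is limited
to short buffers, since the 32-bit accumulator is only folded once at the end.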



* Re: [PATCH 2/2] net: have checksum routines accept unaligned data
  2022-07-08 13:02                                     ` [PATCH 2/2] net: have checksum routines accept unaligned data Morten Brørup
@ 2022-07-08 13:52                                       ` Mattias Rönnblom
  2022-07-08 14:10                                         ` Bruce Richardson
  0 siblings, 1 reply; 74+ messages in thread
From: Mattias Rönnblom @ 2022-07-08 13:52 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, olivier.matz
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev, Onar Olsen

On 2022-07-08 15:02, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Friday, 8 July 2022 14.44
>>
>> On 2022-07-07 23:44, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>>> Sent: Thursday, 7 July 2022 20.35
>>>>
>>>> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>>
>>>> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed
>> its
>>>> data through an uint16_t pointer, which allowed the compiler to
>> assume
>>>> the data was 16-bit aligned. This in turn would, with certain
>>>> architectures and compiler flag combinations, result in code with
>> SIMD
>>>> load or store instructions with restrictions on data alignment.
>>>>
>>>> This patch keeps the old algorithm, but data is read using memcpy()
>>>> instead of direct pointer access, forcing the compiler to always
>>>> generate code that handles unaligned input. The __may_alias__ GCC
>>>> attribute is no longer needed.
>>>>
>>>> The data on which the Internet checksum functions operate is
>> almost
>>>> always 16-bit aligned, but there are exceptions. In particular, the
>>>> PDCP protocol header may (literally) have an odd size.
>>>>
>>>> Performance impact seems to range from none to a very slight
>>>> regression.
>>>>
>>>> Bugzilla ID: 1035
>>>> Cc: stable@dpdk.org
>>>>
>>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>>> ---
>>>>    lib/net/rte_ip.h | 19 ++++++++++++-------
>>>>    1 file changed, 12 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
>>>> index b502481670..a9e6251f14 100644
>>>> --- a/lib/net/rte_ip.h
>>>> +++ b/lib/net/rte_ip.h
>>>> @@ -160,18 +160,23 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr
>>>> *ipv4_hdr)
>>>>    static inline uint32_t
>>>>    __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>>>>    {
>>>> -	/* extend strict-aliasing rules */
>>>> -	typedef uint16_t __attribute__((__may_alias__)) u16_p;
>>>> -	const u16_p *u16_buf = (const u16_p *)buf;
>>>> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
>>>> +	const void *end;
>>>
>>> I would set "end" here instead, possibly making the pointer const
>> too. And add spaces around '/'.
>>> const void * const end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) *
>> sizeof(uint16_t));
>>>
>>
>> I don't think that makes the code more readable.
> 
> It's only a matter of taste... Your code, your decision. :-)
> 
> I think the spaces are required by the coding standard; not sure, though.
> 

If it isn't in the coding standard, it should be. But if you add spaces, 
you have to break the line, to fit into 80 characters. A net loss, IMO.

>>
>>>>
>>>> -	for (; u16_buf != end; ++u16_buf)
>>>> -		sum += *u16_buf;
>>>> +	for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) *
>>>> sizeof(uint16_t));
>>>> +	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
>>>> +		uint16_t v;
>>>> +
>>>> +		memcpy(&v, buf, sizeof(uint16_t));
>>>> +		sum += v;
>>>> +	}
>>>>
>>>>    	/* if length is odd, keeping it byte order independent */
>>>>    	if (unlikely(len % 2)) {
>>>> +		uint8_t last;
>>>>    		uint16_t left = 0;
>>>> -		*(unsigned char *)&left = *(const unsigned char *)end;
>>>> +
>>>> +		memcpy(&last, end, 1);
>>>> +		*(unsigned char *)&left = last;
>>>
>>> Couldn't you just memcpy(&left, end, 1), and omit the temporary
>> variable "last"?
>>>
>>
>> Good point.
>>
>> I don't like how this code is clever vis-à-vis byte order, but then I
>> also don't have a better suggestion.
> 
> The byte ordering cleverness has its roots in RFC 1071.
> 
> Stephen suggested using a union, although in a slightly different context. I'm not sure it will be more readable here, because it will require #ifdef to support byte ordering. Just thought I'd mention it, for your consideration.
> 
> Your patch v2 just reached my inbox, and it looks good. No further response to this email is expected.
> 



* Re: [PATCH 2/2] net: have checksum routines accept unaligned data
  2022-07-08 13:52                                       ` Mattias Rönnblom
@ 2022-07-08 14:10                                         ` Bruce Richardson
  2022-07-08 14:30                                           ` Morten Brørup
  0 siblings, 1 reply; 74+ messages in thread
From: Bruce Richardson @ 2022-07-08 14:10 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Morten Brørup, Mattias Rönnblom, olivier.matz,
	Emil Berg, stephen, stable, bugzilla, dev, Onar Olsen

On Fri, Jul 08, 2022 at 01:52:12PM +0000, Mattias Rönnblom wrote:
> On 2022-07-08 15:02, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Friday, 8 July 2022 14.44
> >>
> >> On 2022-07-07 23:44, Morten Brørup wrote:
> >>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> >>>> Sent: Thursday, 7 July 2022 20.35
> >>>>
> >>>> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>>>
> >>>> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed
> >> its
> >>>> data through an uint16_t pointer, which allowed the compiler to
> >> assume
> >>>> the data was 16-bit aligned. This in turn would, with certain
> >>>> architectures and compiler flag combinations, result in code with
> >> SIMD
> >>>> load or store instructions with restrictions on data alignment.
> >>>>
> >>>> This patch keeps the old algorithm, but data is read using memcpy()
> >>>> instead of direct pointer access, forcing the compiler to always
> >>>> generate code that handles unaligned input. The __may_alias__ GCC
> >>>> attribute is no longer needed.
> >>>>
> >>>> The data on which the Internet checksum functions operate is
> >> almost
> >>>> always 16-bit aligned, but there are exceptions. In particular, the
> >>>> PDCP protocol header may (literally) have an odd size.
> >>>>
> >>>> Performance impact seems to range from none to a very slight
> >>>> regression.
> >>>>
> >>>> Bugzilla ID: 1035
> >>>> Cc: stable@dpdk.org
> >>>>
> >>>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>>> ---
> >>>>    lib/net/rte_ip.h | 19 ++++++++++++-------
> >>>>    1 file changed, 12 insertions(+), 7 deletions(-)
> >>>>
> >>>> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> >>>> index b502481670..a9e6251f14 100644
> >>>> --- a/lib/net/rte_ip.h
> >>>> +++ b/lib/net/rte_ip.h
> >>>> @@ -160,18 +160,23 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr
> >>>> *ipv4_hdr)
> >>>>    static inline uint32_t
> >>>>    __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> >>>>    {
> >>>> -	/* extend strict-aliasing rules */
> >>>> -	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> >>>> -	const u16_p *u16_buf = (const u16_p *)buf;
> >>>> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> >>>> +	const void *end;
> >>>
> >>> I would set "end" here instead, possibly making the pointer const
> >> too. And add spaces around '/'.
> >>> const void * const end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) *
> >> sizeof(uint16_t));
> >>>
> >>
> >> I don't think that makes the code more readable.
> > 
> > It's only a matter of taste... Your code, your decision. :-)
> > 
> > I think the spaces are required by the coding standard; not sure, though.
> > 
> 
> If it isn't in the coding standard, it should be. But if you add spaces, 
> you have to break the line, to fit into 80 characters. A net loss, IMO.
> 

Just FYI, lines up to 100 chars are ok now. Automated checkpatch checks as
shown in patchwork only flag lines longer than that.

/Bruce


* RE: [PATCH 2/2] net: have checksum routines accept unaligned data
  2022-07-08 14:10                                         ` Bruce Richardson
@ 2022-07-08 14:30                                           ` Morten Brørup
  0 siblings, 0 replies; 74+ messages in thread
From: Morten Brørup @ 2022-07-08 14:30 UTC (permalink / raw)
  To: Bruce Richardson, Mattias Rönnblom
  Cc: Mattias Rönnblom, olivier.matz, Emil Berg, stephen, stable,
	bugzilla, dev, Onar Olsen

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Friday, 8 July 2022 16.10
> 
> On Fri, Jul 08, 2022 at 01:52:12PM +0000, Mattias Rönnblom wrote:
> > On 2022-07-08 15:02, Morten Brørup wrote:
> > >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > >> Sent: Friday, 8 July 2022 14.44
> > >>
> > >> On 2022-07-07 23:44, Morten Brørup wrote:
> > >>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > >>>> Sent: Thursday, 7 July 2022 20.35
> > >>>>

[...]

> > >>>>    static inline uint32_t
> > >>>>    __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
> > >>>>    {
> > >>>> -	/* extend strict-aliasing rules */
> > >>>> -	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> > >>>> -	const u16_p *u16_buf = (const u16_p *)buf;
> > >>>> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> > >>>> +	const void *end;
> > >>>
> > >>> I would set "end" here instead, possibly making the pointer const
> > >>> too. And add spaces around '/'.
> > >>> const void * const end = RTE_PTR_ADD(buf, (len / sizeof(uint16_t)) * sizeof(uint16_t));
> > >>>
> > >>
> > >> I don't think that makes the code more readable.
> > >
> > > It's only a matter of taste... Your code, your decision. :-)
> > >
> > > I think the spaces are required by the coding standard; not sure, though.
> > >
> >
> > If it isn't in the coding standard, it should be. But if you add spaces,
> > you have to break the line, to fit into 80 characters. A net loss, IMO.
> >
> 
> Just FYI, lines up to 100 chars are ok now. Automated checkpatch checks as
> shown in patchwork only flag lines longer than that.
> 
> /Bruce

The coding style [1] recommends a maximum of 80 characters, with an (easy to miss) note below it saying that up to 100 characters are acceptable. I think it should be simplified by replacing the 80-character recommendation with a 100-character limit, so the additional note can be removed. Why still recommend 80, when we really mean 100?

[1] https://doc.dpdk.org/guides/contributing/coding_style.html



* Re: [PATCH v2 2/2] net: have checksum routines accept unaligned data
  2022-07-08 12:56                                       ` [PATCH v2 2/2] net: have checksum routines accept unaligned data Mattias Rönnblom
@ 2022-07-08 14:44                                         ` Ferruh Yigit
  2022-07-11  9:53                                         ` Olivier Matz
  1 sibling, 0 replies; 74+ messages in thread
From: Ferruh Yigit @ 2022-07-08 14:44 UTC (permalink / raw)
  To: Mattias Rönnblom, olivier.matz
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	onar.olsen, Morten Brørup

On 7/8/2022 1:56 PM, Mattias Rönnblom wrote:
> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its
> data through an uint16_t pointer, which allowed the compiler to assume
> the data was 16-bit aligned. This in turn would, with certain
> architectures and compiler flag combinations, result in code with SIMD
> load or store instructions with restrictions on data alignment.
> 
> This patch keeps the old algorithm, but data is read using memcpy()
> instead of direct pointer access, forcing the compiler to always
> generate code that handles unaligned input. The __may_alias__ GCC
> attribute is no longer needed.
> 
> The data on which the Internet checksum functions operates are almost
> always 16-bit aligned, but there are exceptions. In particular, the
> PDCP protocol header may (literally) have an odd size.
> 
> Performance impact seems to range from none to a very slight
> regression.
> 
> Bugzilla ID: 1035
> Cc: stable@dpdk.org
> 
> ---
> 
> v2:
>    * Simplified the odd-length conditional (Morten Brørup).
> 
> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>   lib/net/rte_ip.h | 17 ++++++++++-------
>   1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> index b502481670..a0334d931e 100644
> --- a/lib/net/rte_ip.h
> +++ b/lib/net/rte_ip.h
> @@ -160,18 +160,21 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr *ipv4_hdr)
>   static inline uint32_t
>   __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>   {
> -	/* extend strict-aliasing rules */
> -	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> -	const u16_p *u16_buf = (const u16_p *)buf;
> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> +	const void *end;
>   
> -	for (; u16_buf != end; ++u16_buf)
> -		sum += *u16_buf;
> +	for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) * sizeof(uint16_t));
> +	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
> +		uint16_t v;
> +
> +		memcpy(&v, buf, sizeof(uint16_t));
> +		sum += v;
> +	}
>   
>   	/* if length is odd, keeping it byte order independent */
>   	if (unlikely(len % 2)) {
>   		uint16_t left = 0;
> -		*(unsigned char *)&left = *(const unsigned char *)end;
> +
> +		memcpy(&left, end, 1);
>   		sum += left;
>   	}
>   

Hi Mattias,

I got the following result [1] with the patches applied, on [2].
Can you shed some light on a few questions I have?
1) For 1500 bytes, why does 'Unaligned' access give better performance
than 'Aligned' access?
2) Why do 21/101 bytes take almost twice as many cycles as 20/100 bytes?
3) Why is the performance for 1501 bytes better than for 1500 bytes?


Btw, I don't see any noticeable performance difference with and
without the patch.

[1]
RTE>>cksum_perf_autotest
### rte_raw_cksum() performance ###
Alignment  Block size    TSC cycles/block  TSC cycles/byte
Aligned           20                25.1             1.25
Unaligned         20                25.1             1.25
Aligned           21                51.5             2.45
Unaligned         21                51.5             2.45
Aligned          100                28.2             0.28
Unaligned        100                28.2             0.28
Aligned          101                54.5             0.54
Unaligned        101                54.5             0.54
Aligned         1500               188.9             0.13
Unaligned       1500               138.7             0.09
Aligned         1501               114.1             0.08
Unaligned       1501               110.1             0.07
Test OK
RTE>>


[2]
AMD EPYC 7543P


* Re: [PATCH v2 1/2] app/test: add cksum performance test
  2022-07-08 12:56                                     ` [PATCH v2 1/2] app/test: add cksum performance test Mattias Rönnblom
  2022-07-08 12:56                                       ` [PATCH v2 2/2] net: have checksum routines accept unaligned data Mattias Rönnblom
@ 2022-07-11  9:47                                       ` Olivier Matz
  2022-07-11 10:42                                         ` Mattias Rönnblom
  1 sibling, 1 reply; 74+ messages in thread
From: Olivier Matz @ 2022-07-11  9:47 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	onar.olsen, Morten Brørup

Hi Mattias,

Please see a few comments below.

On Fri, Jul 08, 2022 at 02:56:07PM +0200, Mattias Rönnblom wrote:
> Add performance test for the rte_raw_cksum() function, which delegates
> the actual work to __rte_raw_cksum(), which in turn is used by other
> functions in need of Internet checksum calculation.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> 
> ---
> 
> v2:
>   * Added __rte_unused to unused volatile variable, to keep the Intel
>     compiler happy.
> ---
>  MAINTAINERS                |   1 +
>  app/test/meson.build       |   1 +
>  app/test/test_cksum_perf.c | 118 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 120 insertions(+)
>  create mode 100644 app/test/test_cksum_perf.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c923712946..2a4c99e05a 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1414,6 +1414,7 @@ Network headers
>  M: Olivier Matz <olivier.matz@6wind.com>
>  F: lib/net/
>  F: app/test/test_cksum.c
> +F: app/test/test_cksum_perf.c
>  
>  Packet CRC
>  M: Jasvinder Singh <jasvinder.singh@intel.com>
> diff --git a/app/test/meson.build b/app/test/meson.build
> index 431c5bd318..191db03d1d 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -18,6 +18,7 @@ test_sources = files(
>          'test_bpf.c',
>          'test_byteorder.c',
>          'test_cksum.c',
> +        'test_cksum_perf.c',
>          'test_cmdline.c',
>          'test_cmdline_cirbuf.c',
>          'test_cmdline_etheraddr.c',
> diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c
> new file mode 100644
> index 0000000000..bff73cb3bb
> --- /dev/null
> +++ b/app/test/test_cksum_perf.c
> @@ -0,0 +1,118 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Ericsson AB
> + */
> +
> +#include <stdio.h>
> +
> +#include <rte_common.h>
> +#include <rte_cycles.h>
> +#include <rte_ip.h>
> +#include <rte_malloc.h>
> +#include <rte_random.h>
> +
> +#include "test.h"
> +
> +#define NUM_BLOCKS (10)
> +#define ITERATIONS (1000000)

The parentheses can be safely removed.

> +
> +static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 };
> +
> +static __rte_noinline uint16_t
> +do_rte_raw_cksum(const void *buf, size_t len)
> +{
> +	return rte_raw_cksum(buf, len);
> +}

I don't understand the need to have this wrapper, especially marked
__rte_noinline. What is the objective?

Note that when I remove the __rte_noinline, the performance is better
for size 20 and 21.

> +
> +static void
> +init_block(void *buf, size_t len)

Can buf be a (char *) instead?
It would avoid a cast below.

> +{
> +	size_t i;
> +
> +	for (i = 0; i < len; i++)
> +		((char *)buf)[i] = (uint8_t)rte_rand();
> +}
> +
> +static int
> +test_cksum_perf_size_alignment(size_t block_size, bool aligned)
> +{
> +	char *data[NUM_BLOCKS];
> +	char *blocks[NUM_BLOCKS];
> +	unsigned int i;
> +	uint64_t start;
> +	uint64_t end;
> +	/* Floating point to handle low (pseudo-)TSC frequencies */
> +	double block_latency;
> +	double byte_latency;
> +	volatile __rte_unused uint64_t sum = 0;
> +
> +	for (i = 0; i < NUM_BLOCKS; i++) {
> +		data[i] = rte_malloc(NULL, block_size + 1, 0);
> +
> +		if (data[i] == NULL) {
> +			printf("Failed to allocate memory for block\n");
> +			return TEST_FAILED;
> +		}
> +
> +		init_block(data[i], block_size + 1);
> +
> +		blocks[i] = aligned ? data[i] : data[i] + 1;
> +	}
> +
> +	start = rte_rdtsc();
> +
> +	for (i = 0; i < ITERATIONS; i++) {
> +		unsigned int j;
> +		for (j = 0; j < NUM_BLOCKS; j++)
> +			sum += do_rte_raw_cksum(blocks[j], block_size);
> +	}
> +
> +	end = rte_rdtsc();
> +
> +	block_latency = (end - start) / (double)(ITERATIONS * NUM_BLOCKS);
> +	byte_latency = block_latency / block_size;
> +
> +	printf("%-9s %10zd %19.1f %16.2f\n", aligned ? "Aligned" : "Unaligned",
> +	       block_size, block_latency, byte_latency);

When I run the test on my dev machine, I get the following results,
which are quite reproducible:

Aligned           20       10.4      0.52     (range is 0.48 - 0.52)
Unaligned         20        7.9      0.39     (range is 0.39 - 0.40)
...

If I increase the number of iterations, the first results
change significantly:

Aligned           20        8.2      0.42     (range is 0.41 - 0.42)
Unaligned         20        8.0      0.40     (always this value)

To have more precise tests with small size, would it make sense to
target a test time instead of an iteration count? Something like
this:

	#define ITERATIONS 1000000
	uint64_t iterations = 0;

	...

	do {
		for (i = 0; i < ITERATIONS; i++) {
			unsigned int j;
			for (j = 0; j < NUM_BLOCKS; j++)
				sum += do_rte_raw_cksum(blocks[j], block_size);
		}
		iterations += ITERATIONS;
		end = rte_rdtsc();
	} while ((end - start) < rte_get_tsc_hz());

	block_latency = (end - start) / (double)(iterations * NUM_BLOCKS);


After this change, the aligned and unaligned cases have the same
performance on my machine.
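
For illustration, the same "run until a minimum time has elapsed" idea
can be sketched outside DPDK with only standard C. This is a minimal
sketch, not the test code: clock() stands in for rte_rdtsc(), and
workload_sketch(), sink_sketch and ticks_per_call_sketch() are
hypothetical names invented for this example.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical workload; volatile keeps the compiler from removing it. */
static volatile uint64_t sink_sketch;

static void
workload_sketch(void)
{
	sink_sketch += 1;
}

/*
 * Call fn() in batches until at least min_ticks of processor time
 * (as reported by clock()) have elapsed, then return the average
 * number of clock ticks per call. The total call count is stored
 * in *calls_out.
 */
static double
ticks_per_call_sketch(void (*fn)(void), unsigned int batch,
		      clock_t min_ticks, uint64_t *calls_out)
{
	const clock_t start = clock();
	clock_t end;
	uint64_t calls = 0;

	do {
		unsigned int i;

		for (i = 0; i < batch; i++)
			fn();
		calls += batch;
		end = clock();
		/* Stop early if clock() reports that it is unavailable. */
	} while (end != (clock_t)-1 && end - start < min_ticks);

	*calls_out = calls;

	return (double)(end - start) / (double)calls;
}
```

Because the iteration count is derived from the elapsed time, short
workloads get many more iterations than long ones, so the measurement
window is roughly the same for every block size.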


> +
> +	for (i = 0; i < NUM_BLOCKS; i++)
> +		rte_free(data[i]);
> +
> +	return TEST_SUCCESS;
> +}
> +
> +static int
> +test_cksum_perf_size(size_t block_size)
> +{
> +	int rc;
> +
> +	rc = test_cksum_perf_size_alignment(block_size, true);
> +	if (rc != TEST_SUCCESS)
> +		return rc;
> +
> +	rc = test_cksum_perf_size_alignment(block_size, false);
> +
> +	return rc;
> +}
> +
> +static int
> +test_cksum_perf(void)
> +{
> +	uint16_t i;
> +
> +	printf("### rte_raw_cksum() performance ###\n");
> +	printf("Alignment  Block size    TSC cycles/block  TSC cycles/byte\n");
> +
> +	for (i = 0; i < RTE_DIM(data_sizes); i++) {
> +		int rc;
> +
> +		rc = test_cksum_perf_size(data_sizes[i]);
> +		if (rc != TEST_SUCCESS)
> +			return rc;
> +	}
> +
> +	return TEST_SUCCESS;
> +}
> +
> +
> +REGISTER_TEST_COMMAND(cksum_perf_autotest, test_cksum_perf);
> +

The last empty line can be removed.

> -- 
> 2.25.1
> 


* Re: [PATCH v2 2/2] net: have checksum routines accept unaligned data
  2022-07-08 12:56                                       ` [PATCH v2 2/2] net: have checksum routines accept unaligned data Mattias Rönnblom
  2022-07-08 14:44                                         ` Ferruh Yigit
@ 2022-07-11  9:53                                         ` Olivier Matz
  2022-07-11 10:53                                           ` Mattias Rönnblom
  1 sibling, 1 reply; 74+ messages in thread
From: Olivier Matz @ 2022-07-11  9:53 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	onar.olsen, Morten Brørup

Hi,

On Fri, Jul 08, 2022 at 02:56:08PM +0200, Mattias Rönnblom wrote:
> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its
> data through an uint16_t pointer, which allowed the compiler to assume
> the data was 16-bit aligned. This in turn would, with certain
> architectures and compiler flag combinations, result in code with SIMD
> load or store instructions with restrictions on data alignment.
> 
> This patch keeps the old algorithm, but data is read using memcpy()
> instead of direct pointer access, forcing the compiler to always
> generate code that handles unaligned input. The __may_alias__ GCC
> attribute is no longer needed.
> 
> The data on which the Internet checksum functions operates are almost
> always 16-bit aligned, but there are exceptions. In particular, the
> PDCP protocol header may (literally) have an odd size.
> 
> Performance impact seems to range from none to a very slight
> regression.
> 
> Bugzilla ID: 1035
> Cc: stable@dpdk.org
> 
> ---

Using memcpy() looks to be a good solution to fix the issue, while avoiding a
branch and the __may_alias__.

I just have one minor comment below.
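
For readers without the DPDK tree at hand, the memcpy() technique can
be sketched as a stand-alone function. This is only an illustration
under my own assumptions, not the patch itself: raw_sum16_sketch() is
a hypothetical name, and it indexes bytes instead of using
RTE_PTR_ADD().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-alone helper; not the DPDK implementation. */
static uint32_t
raw_sum16_sketch(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;
	size_t i;

	/*
	 * Sum every complete 16-bit word. memcpy() carries no alignment
	 * assumption, so buf may start on an odd address.
	 */
	for (i = 0; i + sizeof(uint16_t) <= len; i += sizeof(uint16_t)) {
		uint16_t v;

		memcpy(&v, p + i, sizeof(v));
		sum += v;
	}

	/*
	 * Odd length: place the trailing byte in the first byte of a
	 * zeroed word, which keeps the result byte order independent.
	 */
	if (len % 2) {
		uint16_t left = 0;

		memcpy(&left, p + len - 1, 1);
		sum += left;
	}

	return sum;
}
```

Compilers typically turn such a fixed-size memcpy() into a plain load
on targets that allow unaligned access, which would explain why no
noticeable cost shows up in the measurements.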

> 
> v2:
>   * Simplified the odd-length conditional (Morten Brørup).
> 
> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
>  lib/net/rte_ip.h | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
> index b502481670..a0334d931e 100644
> --- a/lib/net/rte_ip.h
> +++ b/lib/net/rte_ip.h
> @@ -160,18 +160,21 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr *ipv4_hdr)
>  static inline uint32_t
>  __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>  {
> -	/* extend strict-aliasing rules */
> -	typedef uint16_t __attribute__((__may_alias__)) u16_p;
> -	const u16_p *u16_buf = (const u16_p *)buf;
> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
> +	const void *end;
>  
> -	for (; u16_buf != end; ++u16_buf)
> -		sum += *u16_buf;
> +	for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) * sizeof(uint16_t));

What do you think about this form:

	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));

This also has the nice property of settling the debate about the
spaces around the '/' :)
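
As a quick sanity check that the two forms agree, here is a small
stand-alone sketch. It assumes, as I understand the DPDK macro, that
flooring to a power-of-two alignment is done by masking off the low
bits; floor_by_division() and floor_by_mask() are hypothetical helpers
and do not use the DPDK headers.

```c
#include <stddef.h>
#include <stdint.h>

/* The form used in the v2 patch: divide, then multiply back. */
static size_t
floor_by_division(size_t len)
{
	return (len / sizeof(uint16_t)) * sizeof(uint16_t);
}

/*
 * Assumed equivalent of an align-floor for a power-of-two alignment:
 * clear the low bits of the length.
 */
static size_t
floor_by_mask(size_t len)
{
	return len & ~(sizeof(uint16_t) - 1);
}
```

Both helpers round the length down to the nearest multiple of two, so
either one yields the same "end" pointer for the word loop.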


> +	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
> +		uint16_t v;
> +
> +		memcpy(&v, buf, sizeof(uint16_t));
> +		sum += v;
> +	}
>  
>  	/* if length is odd, keeping it byte order independent */
>  	if (unlikely(len % 2)) {
>  		uint16_t left = 0;
> -		*(unsigned char *)&left = *(const unsigned char *)end;
> +
> +		memcpy(&left, end, 1);
>  		sum += left;
>  	}
>  
> -- 
> 2.25.1
> 


* Re: [PATCH v2 1/2] app/test: add cksum performance test
  2022-07-11  9:47                                       ` [PATCH v2 1/2] app/test: add cksum performance test Olivier Matz
@ 2022-07-11 10:42                                         ` Mattias Rönnblom
  2022-07-11 11:33                                           ` Olivier Matz
  0 siblings, 1 reply; 74+ messages in thread
From: Mattias Rönnblom @ 2022-07-11 10:42 UTC (permalink / raw)
  To: Olivier Matz
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	Onar Olsen, Morten Brørup

On 2022-07-11 11:47, Olivier Matz wrote:
> Hi Mattias,
> 
> Please see a few comments below.
> 
> On Fri, Jul 08, 2022 at 02:56:07PM +0200, Mattias Rönnblom wrote:
>> Add performance test for the rte_raw_cksum() function, which delegates
>> the actual work to __rte_raw_cksum(), which in turn is used by other
>> functions in need of Internet checksum calculation.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>>
>> ---
>>
>> v2:
>>    * Added __rte_unused to unused volatile variable, to keep the Intel
>>      compiler happy.
>> ---
>>   MAINTAINERS                |   1 +
>>   app/test/meson.build       |   1 +
>>   app/test/test_cksum_perf.c | 118 +++++++++++++++++++++++++++++++++++++
>>   3 files changed, 120 insertions(+)
>>   create mode 100644 app/test/test_cksum_perf.c
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index c923712946..2a4c99e05a 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1414,6 +1414,7 @@ Network headers
>>   M: Olivier Matz <olivier.matz@6wind.com>
>>   F: lib/net/
>>   F: app/test/test_cksum.c
>> +F: app/test/test_cksum_perf.c
>>   
>>   Packet CRC
>>   M: Jasvinder Singh <jasvinder.singh@intel.com>
>> diff --git a/app/test/meson.build b/app/test/meson.build
>> index 431c5bd318..191db03d1d 100644
>> --- a/app/test/meson.build
>> +++ b/app/test/meson.build
>> @@ -18,6 +18,7 @@ test_sources = files(
>>           'test_bpf.c',
>>           'test_byteorder.c',
>>           'test_cksum.c',
>> +        'test_cksum_perf.c',
>>           'test_cmdline.c',
>>           'test_cmdline_cirbuf.c',
>>           'test_cmdline_etheraddr.c',
>> diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c
>> new file mode 100644
>> index 0000000000..bff73cb3bb
>> --- /dev/null
>> +++ b/app/test/test_cksum_perf.c
>> @@ -0,0 +1,118 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2022 Ericsson AB
>> + */
>> +
>> +#include <stdio.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_cycles.h>
>> +#include <rte_ip.h>
>> +#include <rte_malloc.h>
>> +#include <rte_random.h>
>> +
>> +#include "test.h"
>> +
>> +#define NUM_BLOCKS (10)
>> +#define ITERATIONS (1000000)
> 
> The parentheses can be safely removed.
> 
>> +
>> +static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 };
>> +
>> +static __rte_noinline uint16_t
>> +do_rte_raw_cksum(const void *buf, size_t len)
>> +{
>> +	return rte_raw_cksum(buf, len);
>> +}
> 
> I don't understand the need to have this wrapper, especially marked
> __rte_noinline. What is the objective?
> 

The intention is to prevent the compiler from unrolling the loop and 
interleaving one cksum operation with the next buffer's, in a 
way that wouldn't be feasible in a real application.

It will result in an overestimation of the cost for small cksums, so 
it's still misleading, but in another direction. :)
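
A minimal stand-alone sketch of the wrapper pattern, for illustration
only. It assumes GCC or Clang, where the attribute is spelled
__attribute__((noinline)); NOINLINE_SKETCH, byte_sum() and
do_byte_sum() are hypothetical names, with byte_sum() standing in for
the inline checksum routine.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: GCC or Clang; other compilers spell this differently. */
#define NOINLINE_SKETCH __attribute__((noinline))

/* Stand-in for an inline routine the compiler would happily inline. */
static inline uint32_t
byte_sum(const unsigned char *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum += buf[i];

	return sum;
}

/*
 * The benchmark calls this wrapper instead of byte_sum(). Each call
 * stays a real function call, so work on one buffer cannot be merged
 * or interleaved with work on the next one.
 */
static NOINLINE_SKETCH uint32_t
do_byte_sum(const unsigned char *buf, size_t len)
{
	return byte_sum(buf, len);
}
```

The price is the call overhead on every iteration, which weighs most
heavily on the smallest block sizes.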

> Note that when I remove the __rte_noinline, the performance is better
> for size 20 and 21.
> 
>> +
>> +static void
>> +init_block(void *buf, size_t len)
> 
> Can buf be a (char *) instead?
> It would avoid a cast below.
> 

Yes.

>> +{
>> +	size_t i;
>> +
>> +	for (i = 0; i < len; i++)
>> +		((char *)buf)[i] = (uint8_t)rte_rand();
>> +}
>> +
>> +static int
>> +test_cksum_perf_size_alignment(size_t block_size, bool aligned)
>> +{
>> +	char *data[NUM_BLOCKS];
>> +	char *blocks[NUM_BLOCKS];
>> +	unsigned int i;
>> +	uint64_t start;
>> +	uint64_t end;
>> +	/* Floating point to handle low (pseudo-)TSC frequencies */
>> +	double block_latency;
>> +	double byte_latency;
>> +	volatile __rte_unused uint64_t sum = 0;
>> +
>> +	for (i = 0; i < NUM_BLOCKS; i++) {
>> +		data[i] = rte_malloc(NULL, block_size + 1, 0);
>> +
>> +		if (data[i] == NULL) {
>> +			printf("Failed to allocate memory for block\n");
>> +			return TEST_FAILED;
>> +		}
>> +
>> +		init_block(data[i], block_size + 1);
>> +
>> +		blocks[i] = aligned ? data[i] : data[i] + 1;
>> +	}
>> +
>> +	start = rte_rdtsc();
>> +
>> +	for (i = 0; i < ITERATIONS; i++) {
>> +		unsigned int j;
>> +		for (j = 0; j < NUM_BLOCKS; j++)
>> +			sum += do_rte_raw_cksum(blocks[j], block_size);
>> +	}
>> +
>> +	end = rte_rdtsc();
>> +
>> +	block_latency = (end - start) / (double)(ITERATIONS * NUM_BLOCKS);
>> +	byte_latency = block_latency / block_size;
>> +
>> +	printf("%-9s %10zd %19.1f %16.2f\n", aligned ? "Aligned" : "Unaligned",
>> +	       block_size, block_latency, byte_latency);
> 
> When I run the test on my dev machine, I get the following results,
> which are quite reproducible:
> 
> Aligned           20       10.4      0.52     (range is 0.48 - 0.52)
> Unaligned         20        7.9      0.39     (range is 0.39 - 0.40)
> ...
> 
> If I increase the number of iterations, the first results
> change significantly:
> 
> Aligned           20        8.2      0.42     (range is 0.41 - 0.42)
> Unaligned         20        8.0      0.40     (always this value)


I suspect you have frequency scaling enabled on your system. This is 
generally not advisable; you want some level of determinism when 
benchmarking, especially on short runs like this one is (and must be).

I thought about doing something about this, but it seemed like an issue 
that should be addressed on a framework level, rather than on a per-perf 
autotest level.

If you want your CPU core to scale up, you can just insert

rte_delay_us_block(100000);

before the actual test is run.

Should I add this? I *think* 100 ms should be enough, but maybe someone 
with more in-depth knowledge of the frequency governors can comment on this.
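
For illustration, the warm-up idea can be sketched without DPDK using
only the standard C clock() function. This is a rough sketch under my
own assumptions, not a proposal for the test itself:
busy_wait_ticks_sketch() is a hypothetical name, and clock() measures
processor time, which is what a spinning core accumulates.

```c
#include <time.h>

/*
 * Spin until at least the given number of clock() ticks of processor
 * time have passed, keeping the core busy so that a frequency
 * governor has a chance to raise the clock before measurements start.
 * Returns the number of ticks actually spent, or 0 if clock() is
 * unavailable.
 */
static clock_t
busy_wait_ticks_sketch(clock_t ticks)
{
	const clock_t start = clock();
	clock_t now = start;

	if (start == (clock_t)-1)
		return 0; /* processor time not available */

	while (now - start < ticks)
		now = clock();

	return now - start;
}
```

Calling busy_wait_ticks_sketch(CLOCKS_PER_SEC / 10) spins for roughly
100 ms of processor time before the timed loop would begin.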

> 
> To have more precise tests with small size, would it make sense to
> target a test time instead of an iteration count? Something like
> this:
> 

The time lost when running at a lower frequency (plus the hiccups when 
the frequency is changed) will be amortized as you add to the length of 
the test run, which will partly solve the problem. A better solution is 
to not start the test before the core runs at the max frequency.

Again, this is assuming DVFS is what you suffer from here. I guess in 
theory it could be TLB misses as well.

> 	#define ITERATIONS 1000000
> 	uint64_t iterations = 0;
> 
> 	...
> 
> 	do {
> 		for (i = 0; i < ITERATIONS; i++) {
> 			unsigned int j;
> 			for (j = 0; j < NUM_BLOCKS; j++)
> 				sum += do_rte_raw_cksum(blocks[j], block_size);
> 		}
> 		iterations += ITERATIONS;
> 		end = rte_rdtsc();
> 	} while ((end - start) < rte_get_tsc_hz());
> 
> 	block_latency = (end - start) / (double)(iterations * NUM_BLOCKS);
> 
> 
> After this change, the aligned and unaligned cases have the same
> performance on my machine.
> 
> 

RTE>>cksum_perf_autotest
### rte_raw_cksum() performance ###
Alignment  Block size    TSC cycles/block  TSC cycles/byte
Aligned           20                16.1             0.81
Unaligned         20                16.1             0.81

... with the 100 ms busy-wait delay (and frequency scaling enabled) on 
my AMD machine.

>> +
>> +	for (i = 0; i < NUM_BLOCKS; i++)
>> +		rte_free(data[i]);
>> +
>> +	return TEST_SUCCESS;
>> +}
>> +
>> +static int
>> +test_cksum_perf_size(size_t block_size)
>> +{
>> +	int rc;
>> +
>> +	rc = test_cksum_perf_size_alignment(block_size, true);
>> +	if (rc != TEST_SUCCESS)
>> +		return rc;
>> +
>> +	rc = test_cksum_perf_size_alignment(block_size, false);
>> +
>> +	return rc;
>> +}
>> +
>> +static int
>> +test_cksum_perf(void)
>> +{
>> +	uint16_t i;
>> +
>> +	printf("### rte_raw_cksum() performance ###\n");
>> +	printf("Alignment  Block size    TSC cycles/block  TSC cycles/byte\n");
>> +
>> +	for (i = 0; i < RTE_DIM(data_sizes); i++) {
>> +		int rc;
>> +
>> +		rc = test_cksum_perf_size(data_sizes[i]);
>> +		if (rc != TEST_SUCCESS)
>> +			return rc;
>> +	}
>> +
>> +	return TEST_SUCCESS;
>> +}
>> +
>> +
>> +REGISTER_TEST_COMMAND(cksum_perf_autotest, test_cksum_perf);
>> +
> 
> The last empty line can be removed.
> 

OK.

Thanks for the review. I will send a v3 as soon as we've settled the 
DVFS issue.

>> -- 
>> 2.25.1
>>



* Re: [PATCH v2 2/2] net: have checksum routines accept unaligned data
  2022-07-11  9:53                                         ` Olivier Matz
@ 2022-07-11 10:53                                           ` Mattias Rönnblom
  0 siblings, 0 replies; 74+ messages in thread
From: Mattias Rönnblom @ 2022-07-11 10:53 UTC (permalink / raw)
  To: Olivier Matz
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	Onar Olsen, Morten Brørup

On 2022-07-11 11:53, Olivier Matz wrote:
> Hi,
> 
> On Fri, Jul 08, 2022 at 02:56:08PM +0200, Mattias Rönnblom wrote:
>> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its
>> data through an uint16_t pointer, which allowed the compiler to assume
>> the data was 16-bit aligned. This in turn would, with certain
>> architectures and compiler flag combinations, result in code with SIMD
>> load or store instructions with restrictions on data alignment.
>>
>> This patch keeps the old algorithm, but data is read using memcpy()
>> instead of direct pointer access, forcing the compiler to always
>> generate code that handles unaligned input. The __may_alias__ GCC
>> attribute is no longer needed.
>>
>> The data on which the Internet checksum functions operates are almost
>> always 16-bit aligned, but there are exceptions. In particular, the
>> PDCP protocol header may (literally) have an odd size.
>>
>> Performance impact seems to range from none to a very slight
>> regression.
>>
>> Bugzilla ID: 1035
>> Cc: stable@dpdk.org
>>
>> ---
> 
> Using memcpy() looks to be a good solution to fix the issue, while avoiding a
> branch and the __may_alias__.
> 
> I just have one minor comment below.
> 
>>
>> v2:
>>    * Simplified the odd-length conditional (Morten Brørup).
>>
>> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>>   lib/net/rte_ip.h | 17 ++++++++++-------
>>   1 file changed, 10 insertions(+), 7 deletions(-)
>>
>> diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
>> index b502481670..a0334d931e 100644
>> --- a/lib/net/rte_ip.h
>> +++ b/lib/net/rte_ip.h
>> @@ -160,18 +160,21 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr *ipv4_hdr)
>>   static inline uint32_t
>>   __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>>   {
>> -	/* extend strict-aliasing rules */
>> -	typedef uint16_t __attribute__((__may_alias__)) u16_p;
>> -	const u16_p *u16_buf = (const u16_p *)buf;
>> -	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
>> +	const void *end;
>>   
>> -	for (; u16_buf != end; ++u16_buf)
>> -		sum += *u16_buf;
>> +	for (end = RTE_PTR_ADD(buf, (len/sizeof(uint16_t)) * sizeof(uint16_t));
> 
> What do you think about this form:
> 
> 	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
> 
> This also has the nice property of settling the debate about the
> spaces around the '/' :)
> 

Shorter, more readable. Looks good to me.

Thanks.

> 
>> +	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
>> +		uint16_t v;
>> +
>> +		memcpy(&v, buf, sizeof(uint16_t));
>> +		sum += v;
>> +	}
>>   
>>   	/* if length is odd, keeping it byte order independent */
>>   	if (unlikely(len % 2)) {
>>   		uint16_t left = 0;
>> -		*(unsigned char *)&left = *(const unsigned char *)end;
>> +
>> +		memcpy(&left, end, 1);
>>   		sum += left;
>>   	}
>>   
>> -- 
>> 2.25.1
>>



* Re: [PATCH v2 1/2] app/test: add cksum performance test
  2022-07-11 10:42                                         ` Mattias Rönnblom
@ 2022-07-11 11:33                                           ` Olivier Matz
  2022-07-11 12:11                                             ` [PATCH v3 " Mattias Rönnblom
  0 siblings, 1 reply; 74+ messages in thread
From: Olivier Matz @ 2022-07-11 11:33 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	Onar Olsen, Morten Brørup

On Mon, Jul 11, 2022 at 10:42:37AM +0000, Mattias Rönnblom wrote:
> On 2022-07-11 11:47, Olivier Matz wrote:
> > Hi Mattias,
> > 
> > Please see a few comments below.
> > 
> > On Fri, Jul 08, 2022 at 02:56:07PM +0200, Mattias Rönnblom wrote:
> >> Add performance test for the rte_raw_cksum() function, which delegates
> >> the actual work to __rte_raw_cksum(), which in turn is used by other
> >> functions in need of Internet checksum calculation.
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >>
> >> ---
> >>
> >> v2:
> >>    * Added __rte_unused to unused volatile variable, to keep the Intel
> >>      compiler happy.
> >> ---
> >>   MAINTAINERS                |   1 +
> >>   app/test/meson.build       |   1 +
> >>   app/test/test_cksum_perf.c | 118 +++++++++++++++++++++++++++++++++++++
> >>   3 files changed, 120 insertions(+)
> >>   create mode 100644 app/test/test_cksum_perf.c
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index c923712946..2a4c99e05a 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -1414,6 +1414,7 @@ Network headers
> >>   M: Olivier Matz <olivier.matz@6wind.com>
> >>   F: lib/net/
> >>   F: app/test/test_cksum.c
> >> +F: app/test/test_cksum_perf.c
> >>   
> >>   Packet CRC
> >>   M: Jasvinder Singh <jasvinder.singh@intel.com>
> >> diff --git a/app/test/meson.build b/app/test/meson.build
> >> index 431c5bd318..191db03d1d 100644
> >> --- a/app/test/meson.build
> >> +++ b/app/test/meson.build
> >> @@ -18,6 +18,7 @@ test_sources = files(
> >>           'test_bpf.c',
> >>           'test_byteorder.c',
> >>           'test_cksum.c',
> >> +        'test_cksum_perf.c',
> >>           'test_cmdline.c',
> >>           'test_cmdline_cirbuf.c',
> >>           'test_cmdline_etheraddr.c',
> >> diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c
> >> new file mode 100644
> >> index 0000000000..bff73cb3bb
> >> --- /dev/null
> >> +++ b/app/test/test_cksum_perf.c
> >> @@ -0,0 +1,118 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(c) 2022 Ericsson AB
> >> + */
> >> +
> >> +#include <stdio.h>
> >> +
> >> +#include <rte_common.h>
> >> +#include <rte_cycles.h>
> >> +#include <rte_ip.h>
> >> +#include <rte_malloc.h>
> >> +#include <rte_random.h>
> >> +
> >> +#include "test.h"
> >> +
> >> +#define NUM_BLOCKS (10)
> >> +#define ITERATIONS (1000000)
> > 
> > The parentheses can be safely removed.
> > 
> >> +
> >> +static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 };
> >> +
> >> +static __rte_noinline uint16_t
> >> +do_rte_raw_cksum(const void *buf, size_t len)
> >> +{
> >> +	return rte_raw_cksum(buf, len);
> >> +}
> > 
> > I don't understand the need to have this wrapper, especially marked
> > __rte_noinline. What is the objective?
> > 
> 
> The intention is to prevent the compiler from unrolling the loop and 
> interleaving one cksum operation with the next buffer's in a 
> way that wouldn't be feasible in a real application.
> 
> It will result in an overestimation of the cost for small cksums, so 
> it's still misleading, but in another direction. :)

OK, got it. I think it's fine the way you did it, then.
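
For readers following along, here is a minimal sketch of the wrapper
pattern discussed above. It is not the test code itself: toy_cksum() is a
hypothetical stand-in (a plain byte sum, not the Internet checksum) for the
inline function under test, and the GCC/Clang noinline attribute is spelled
out directly instead of using __rte_noinline. The point is only that the
call in the loop stays a real call, so the compiler cannot merge the work
on one block with the work on the next.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the inline function being benchmarked. */
static inline uint16_t
toy_cksum(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum += p[i];

	return (uint16_t)(sum & 0xffff);
}

/*
 * The wrapper is never inlined, so each call is compiled in isolation
 * and cannot be unrolled or interleaved with the neighbouring calls.
 */
static __attribute__((noinline)) uint16_t
do_toy_cksum(const void *buf, size_t len)
{
	return toy_cksum(buf, len);
}

/* Benchmark-style loop: one opaque call per block. */
static uint64_t
sum_blocks(char *const blocks[], size_t num_blocks, size_t block_size)
{
	uint64_t sum = 0;
	size_t j;

	for (j = 0; j < num_blocks; j++)
		sum += do_toy_cksum(blocks[j], block_size);

	return sum;
}
```

As noted in the reply above, the price is a function call per block, which
overestimates the cost for small sizes.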

> 
> > Note that when I remove the __rte_noinline, the performance is better
> > for size 20 and 21.
> > 
> >> +
> >> +static void
> >> +init_block(void *buf, size_t len)
> > 
> > Can buf be a (char *) instead?
> > It would avoid a cast below.
> > 
> 
> Yes.
> 
> >> +{
> >> +	size_t i;
> >> +
> >> +	for (i = 0; i < len; i++)
> >> +		((char *)buf)[i] = (uint8_t)rte_rand();
> >> +}
> >> +
> >> +static int
> >> +test_cksum_perf_size_alignment(size_t block_size, bool aligned)
> >> +{
> >> +	char *data[NUM_BLOCKS];
> >> +	char *blocks[NUM_BLOCKS];
> >> +	unsigned int i;
> >> +	uint64_t start;
> >> +	uint64_t end;
> >> +	/* Floating point to handle low (pseudo-)TSC frequencies */
> >> +	double block_latency;
> >> +	double byte_latency;
> >> +	volatile __rte_unused uint64_t sum = 0;
> >> +
> >> +	for (i = 0; i < NUM_BLOCKS; i++) {
> >> +		data[i] = rte_malloc(NULL, block_size + 1, 0);
> >> +
> >> +		if (data[i] == NULL) {
> >> +			printf("Failed to allocate memory for block\n");
> >> +			return TEST_FAILED;
> >> +		}
> >> +
> >> +		init_block(data[i], block_size + 1);
> >> +
> >> +		blocks[i] = aligned ? data[i] : data[i] + 1;
> >> +	}
> >> +
> >> +	start = rte_rdtsc();
> >> +
> >> +	for (i = 0; i < ITERATIONS; i++) {
> >> +		unsigned int j;
> >> +		for (j = 0; j < NUM_BLOCKS; j++)
> >> +			sum += do_rte_raw_cksum(blocks[j], block_size);
> >> +	}
> >> +
> >> +	end = rte_rdtsc();
> >> +
> >> +	block_latency = (end - start) / (double)(ITERATIONS * NUM_BLOCKS);
> >> +	byte_latency = block_latency / block_size;
> >> +
> >> +	printf("%-9s %10zd %19.1f %16.2f\n", aligned ? "Aligned" : "Unaligned",
> >> +	       block_size, block_latency, byte_latency);
> > 
> > When I run the test on my dev machine, I get the following results,
> > which are quite reproducible:
> > 
> > Aligned           20       10.4      0.52     (range is 0.48 - 0.52)
> > Unaligned         20        7.9      0.39     (range is 0.39 - 0.40)
> > ...
> > 
> > If I increase the number of iterations, the first results
> > change significantly:
> > 
> > Aligned           20        8.2      0.42     (range is 0.41 - 0.42)
> > Unaligned         20        8.0      0.40     (always this value)
> 
> 
> I suspect you have frequency scaling enabled on your system. This is 
> generally not advisable; you want some level of determinism when 
> benchmarking, especially on short runs like this one (which must be short).
> 
> I thought about doing something about this, but it seemed like an issue 
> that should be addressed on a framework level, rather than on a per-perf 
> autotest level.
> 
> If you want your CPU core to scale up, you can just insert
> 
> rte_delay_us_block(100000);
> 
> before the actual test is run.

Your hypothesis is correct. When I disable frequency scaling, the results
are now the same with 100K, 1M or 10M iterations.

However, adding an rte_delay_us_block() call does not really solve the issue,
probably because it calls rte_pause() in its loop.

> Should I add this? I *think* 100 ms should be enough, but maybe someone 
> with more in-depth knowledge of the frequency governors can comment on this.

Anyway, I don't think we need to add the blocking delay; we can
legitimately expect frequency scaling to be disabled when we run performance
tests.
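
For reference, a warm-up that keeps the core doing real arithmetic instead
of pausing could look roughly like the sketch below. This is an untested
assumption on my side, not something measured in this thread: whether a
given frequency governor reacts to it is platform dependent. The helpers
now_ns() and warm_up_ns() are hypothetical names, and the sketch uses POSIX
clock_gettime() rather than the TSC so that it is self-contained.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Monotonic wall-clock time in nanoseconds. */
static uint64_t
now_ns(void)
{
	struct timespec ts;
	int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(rc == 0);
	(void)rc;

	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Spin for at least duration_ns, doing integer work between clock reads
 * and never issuing a pause instruction. Returns the number of outer
 * loop rounds, which is only useful for sanity checking.
 */
static uint64_t
warm_up_ns(uint64_t duration_ns)
{
	volatile uint64_t sink = 0;
	uint64_t start = now_ns();
	uint64_t rounds = 0;

	while (now_ns() - start < duration_ns) {
		unsigned int k;

		for (k = 0; k < 1000; k++)
			sink += k;
		rounds++;
	}

	return rounds;
}
```

Calling warm_up_ns(100 * 1000 * 1000) before the timed loop would correspond
to the 100 ms delay proposed above.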

> 
> > 
> > To have more precise tests with small size, would it make sense to
> > target a test time instead of an iteration count? Something like
> > this:
> > 
> 
> The time lost when running at a lower frequency (plus the hiccups when 
> the frequency is changed) will be amortized as you add to the length of 
> the test run, which will partly solve the problem. A better solution is 
> to not start the test before the core runs at the max frequency.
> 
> Again, this is assuming DVFS is what you suffer from here. I guess in 
> theory it could be TLB misses as well.
> 
> > 	#define ITERATIONS 1000000
> > 	uint64_t iterations = 0;
> > 
> > 	...
> > 
> > 	do {
> > 		for (i = 0; i < ITERATIONS; i++) {
> > 			unsigned int j;
> > 			for (j = 0; j < NUM_BLOCKS; j++)
> > 				sum += do_rte_raw_cksum(blocks[j], block_size);
> > 		}
> > 		iterations += ITERATIONS;
> > 		end = rte_rdtsc();
> > 	} while ((end - start) < rte_get_tsc_hz());
> > 
> > 	block_latency = (end - start) / (double)(iterations * NUM_BLOCKS);
> > 
> > 
> > After this change, the aligned and unaligned cases have the same
> > performance on my machine.
> > 
> > 
> 
> RTE>>cksum_perf_autotest
> ### rte_raw_cksum() performance ###
> Alignment  Block size    TSC cycles/block  TSC cycles/byte
> Aligned           20                16.1             0.81
> Unaligned         20                16.1             0.81
> 
> ... with the 100 ms busy-wait delay (and frequency scaling enabled) on 
> my AMD machine.
> 
> >> +
> >> +	for (i = 0; i < NUM_BLOCKS; i++)
> >> +		rte_free(data[i]);
> >> +
> >> +	return TEST_SUCCESS;
> >> +}
> >> +
> >> +static int
> >> +test_cksum_perf_size(size_t block_size)
> >> +{
> >> +	int rc;
> >> +
> >> +	rc = test_cksum_perf_size_alignment(block_size, true);
> >> +	if (rc != TEST_SUCCESS)
> >> +		return rc;
> >> +
> >> +	rc = test_cksum_perf_size_alignment(block_size, false);
> >> +
> >> +	return rc;
> >> +}
> >> +
> >> +static int
> >> +test_cksum_perf(void)
> >> +{
> >> +	uint16_t i;
> >> +
> >> +	printf("### rte_raw_cksum() performance ###\n");
> >> +	printf("Alignment  Block size    TSC cycles/block  TSC cycles/byte\n");
> >> +
> >> +	for (i = 0; i < RTE_DIM(data_sizes); i++) {
> >> +		int rc;
> >> +
> >> +		rc = test_cksum_perf_size(data_sizes[i]);
> >> +		if (rc != TEST_SUCCESS)
> >> +			return rc;
> >> +	}
> >> +
> >> +	return TEST_SUCCESS;
> >> +}
> >> +
> >> +
> >> +REGISTER_TEST_COMMAND(cksum_perf_autotest, test_cksum_perf);
> >> +
> > 
> > The last empty line can be removed.
> > 
> 
> OK.
> 
> Thanks for the review. I will send a v3 as soon as we've settled the 
> DVFS issue.
> 
> >> -- 
> >> 2.25.1
> >>
> 

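
The time-targeted loop suggested in the message above (run fixed-size
batches until a minimum duration has passed, then divide by the number of
calls actually made) can be sketched without DPDK as follows. The names
time_targeted_latency(), work_fn and count_call() are hypothetical, and
clock_gettime() stands in for rte_rdtsc()/rte_get_tsc_hz(), so the result
is in nanoseconds per call instead of TSC cycles per block.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic wall-clock time in nanoseconds. */
static uint64_t
now_ns(void)
{
	struct timespec ts;
	int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(rc == 0);
	(void)rc;

	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

typedef void (*work_fn)(void *arg);

/*
 * Call fn(arg) in batches until at least min_duration_ns has elapsed.
 * The clock is read once per batch, so its cost is amortized over
 * 'batch' calls. Returns the mean latency per call in nanoseconds.
 */
static double
time_targeted_latency(work_fn fn, void *arg, uint64_t batch,
		      uint64_t min_duration_ns, uint64_t *total_calls)
{
	uint64_t calls = 0;
	uint64_t start = now_ns();
	uint64_t end;

	assert(batch > 0);

	do {
		uint64_t i;

		for (i = 0; i < batch; i++)
			fn(arg);
		calls += batch;
		end = now_ns();
	} while (end - start < min_duration_ns);

	if (total_calls != NULL)
		*total_calls = calls;

	return (double)(end - start) / (double)calls;
}

/* Trivial workload used only to exercise the loop. */
static void
count_call(void *arg)
{
	volatile uint64_t *counter = arg;

	(*counter)++;
}
```

A longer run amortizes the slow start but does not remove it, which is the
reservation raised in the reply quoted above.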

* [PATCH v3 1/2] app/test: add cksum performance test
  2022-07-11 11:33                                           ` Olivier Matz
@ 2022-07-11 12:11                                             ` Mattias Rönnblom
  2022-07-11 12:11                                               ` [PATCH v3 2/2] net: have checksum routines accept unaligned data Mattias Rönnblom
  2022-07-11 13:20                                               ` [PATCH v3 1/2] app/test: add cksum performance test Olivier Matz
  0 siblings, 2 replies; 74+ messages in thread
From: Mattias Rönnblom @ 2022-07-11 12:11 UTC (permalink / raw)
  To: olivier.matz
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	onar.olsen, Morten Brørup, Mattias Rönnblom

Add performance test for the rte_raw_cksum() function, which delegates
the actual work to __rte_raw_cksum(), which in turn is used by other
functions in need of Internet checksum calculation.

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

---

v3:
  * Changed init function buffer parameter type, to avoid cast.
  * Code formatting improved.
v2:
  * Added __rte_unused to unused volatile variable, to keep the Intel
    compiler happy.
---
 MAINTAINERS                |   1 +
 app/test/meson.build       |   1 +
 app/test/test_cksum_perf.c | 117 +++++++++++++++++++++++++++++++++++++
 3 files changed, 119 insertions(+)
 create mode 100644 app/test/test_cksum_perf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c923712946..2a4c99e05a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1414,6 +1414,7 @@ Network headers
 M: Olivier Matz <olivier.matz@6wind.com>
 F: lib/net/
 F: app/test/test_cksum.c
+F: app/test/test_cksum_perf.c
 
 Packet CRC
 M: Jasvinder Singh <jasvinder.singh@intel.com>
diff --git a/app/test/meson.build b/app/test/meson.build
index 431c5bd318..191db03d1d 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -18,6 +18,7 @@ test_sources = files(
         'test_bpf.c',
         'test_byteorder.c',
         'test_cksum.c',
+        'test_cksum_perf.c',
         'test_cmdline.c',
         'test_cmdline_cirbuf.c',
         'test_cmdline_etheraddr.c',
diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c
new file mode 100644
index 0000000000..1f296cae34
--- /dev/null
+++ b/app/test/test_cksum_perf.c
@@ -0,0 +1,117 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Ericsson AB
+ */
+
+#include <stdio.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_ip.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+#define NUM_BLOCKS 10
+#define ITERATIONS 1000000
+
+static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 };
+
+static __rte_noinline uint16_t
+do_rte_raw_cksum(const void *buf, size_t len)
+{
+	return rte_raw_cksum(buf, len);
+}
+
+static void
+init_block(char *buf, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++)
+		buf[i] = (char)rte_rand();
+}
+
+static int
+test_cksum_perf_size_alignment(size_t block_size, bool aligned)
+{
+	char *data[NUM_BLOCKS];
+	char *blocks[NUM_BLOCKS];
+	unsigned int i;
+	uint64_t start;
+	uint64_t end;
+	/* Floating point to handle low (pseudo-)TSC frequencies */
+	double block_latency;
+	double byte_latency;
+	volatile __rte_unused uint64_t sum = 0;
+
+	for (i = 0; i < NUM_BLOCKS; i++) {
+		data[i] = rte_malloc(NULL, block_size + 1, 0);
+
+		if (data[i] == NULL) {
+			printf("Failed to allocate memory for block\n");
+			return TEST_FAILED;
+		}
+
+		init_block(data[i], block_size + 1);
+
+		blocks[i] = aligned ? data[i] : data[i] + 1;
+	}
+
+	start = rte_rdtsc();
+
+	for (i = 0; i < ITERATIONS; i++) {
+		unsigned int j;
+		for (j = 0; j < NUM_BLOCKS; j++)
+			sum += do_rte_raw_cksum(blocks[j], block_size);
+	}
+
+	end = rte_rdtsc();
+
+	block_latency = (end - start) / (double)(ITERATIONS * NUM_BLOCKS);
+	byte_latency = block_latency / block_size;
+
+	printf("%-9s %10zd %19.1f %16.2f\n", aligned ? "Aligned" : "Unaligned",
+	       block_size, block_latency, byte_latency);
+
+	for (i = 0; i < NUM_BLOCKS; i++)
+		rte_free(data[i]);
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cksum_perf_size(size_t block_size)
+{
+	int rc;
+
+	rc = test_cksum_perf_size_alignment(block_size, true);
+	if (rc != TEST_SUCCESS)
+		return rc;
+
+	rc = test_cksum_perf_size_alignment(block_size, false);
+
+	return rc;
+}
+
+static int
+test_cksum_perf(void)
+{
+	uint16_t i;
+
+	printf("### rte_raw_cksum() performance ###\n");
+	printf("Alignment  Block size    TSC cycles/block  TSC cycles/byte\n");
+
+	for (i = 0; i < RTE_DIM(data_sizes); i++) {
+		int rc;
+
+		rc = test_cksum_perf_size(data_sizes[i]);
+		if (rc != TEST_SUCCESS)
+			return rc;
+	}
+
+	return TEST_SUCCESS;
+}
+
+
+REGISTER_TEST_COMMAND(cksum_perf_autotest, test_cksum_perf);
-- 
2.25.1



* [PATCH v3 2/2] net: have checksum routines accept unaligned data
  2022-07-11 12:11                                             ` [PATCH v3 " Mattias Rönnblom
@ 2022-07-11 12:11                                               ` Mattias Rönnblom
  2022-07-11 13:25                                                 ` Olivier Matz
  2022-07-11 13:20                                               ` [PATCH v3 1/2] app/test: add cksum performance test Olivier Matz
  1 sibling, 1 reply; 74+ messages in thread
From: Mattias Rönnblom @ 2022-07-11 12:11 UTC (permalink / raw)
  To: olivier.matz
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	onar.olsen, Morten Brørup, Mattias Rönnblom

__rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its
data through an uint16_t pointer, which allowed the compiler to assume
the data was 16-bit aligned. This in turn would, with certain
architectures and compiler flag combinations, result in code with SIMD
load or store instructions with restrictions on data alignment.

This patch keeps the old algorithm, but data is read using memcpy()
instead of direct pointer access, forcing the compiler to always
generate code that handles unaligned input. The __may_alias__ GCC
attribute is no longer needed.

The data on which the Internet checksum functions operates are almost
always 16-bit aligned, but there are exceptions. In particular, the
PDCP protocol header may (literally) have an odd size.

Performance impact seems to range from none to a very slight
regression.

Bugzilla ID: 1035
Cc: stable@dpdk.org

---

v3:
  * Use RTE_ALIGN_FLOOR() in the pointer arithmetic (Olivier Matz).
v2:
  * Simplified the odd-length conditional (Morten Brørup).

Reviewed-by: Morten Brørup <mb@smartsharesystems.com>

Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
 lib/net/rte_ip.h | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/lib/net/rte_ip.h b/lib/net/rte_ip.h
index b502481670..ecd250e9be 100644
--- a/lib/net/rte_ip.h
+++ b/lib/net/rte_ip.h
@@ -160,18 +160,21 @@ rte_ipv4_hdr_len(const struct rte_ipv4_hdr *ipv4_hdr)
 static inline uint32_t
 __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
-	/* extend strict-aliasing rules */
-	typedef uint16_t __attribute__((__may_alias__)) u16_p;
-	const u16_p *u16_buf = (const u16_p *)buf;
-	const u16_p *end = u16_buf + len / sizeof(*u16_buf);
+	const void *end;
 
-	for (; u16_buf != end; ++u16_buf)
-		sum += *u16_buf;
+	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
+	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
+		uint16_t v;
+
+		memcpy(&v, buf, sizeof(uint16_t));
+		sum += v;
+	}
 
 	/* if length is odd, keeping it byte order independent */
 	if (unlikely(len % 2)) {
 		uint16_t left = 0;
-		*(unsigned char *)&left = *(const unsigned char *)end;
+
+		memcpy(&left, end, 1);
 		sum += left;
 	}
 
-- 
2.25.1
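
For readers who want to try the approach outside DPDK, below is a
standalone sketch of the same idea: sum 16-bit words read with memcpy(),
then add a trailing odd byte. It is an illustration, not the patched
rte_ip.h code. The names raw_cksum_sketch() and fold_cksum_sketch() are
hypothetical, plain pointer arithmetic replaces RTE_PTR_ADD() and
RTE_ALIGN_FLOOR(), and the fold step is the conventional one's-complement
reduction, included here only so the result can be compared as 16 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Accumulate 16-bit words from buf into sum. The words are fetched with
 * memcpy(), so buf may start on any address; compilers typically turn
 * the fixed-size memcpy() into a plain unaligned-safe load.
 */
static uint32_t
raw_cksum_sketch(const void *buf, size_t len, uint32_t sum)
{
	const unsigned char *p = buf;
	/* Round len down to a multiple of two. */
	const unsigned char *end = p + (len & ~(size_t)1);

	for (; p != end; p += sizeof(uint16_t)) {
		uint16_t v;

		memcpy(&v, p, sizeof(v));
		sum += v;
	}

	/* Odd length: the last byte goes into the first byte of a zeroed
	 * word, which keeps the result byte order independent.
	 */
	if (len & 1) {
		uint16_t left = 0;

		memcpy(&left, end, 1);
		sum += left;
	}

	return sum;
}

/* Fold the 32-bit accumulator into 16 bits (end-around carry). */
static uint16_t
fold_cksum_sketch(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}
```

Feeding the same bytes once from an even and once from an odd address
should give identical sums, which is the property the patch is after.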



* Re: [PATCH v3 1/2] app/test: add cksum performance test
  2022-07-11 12:11                                             ` [PATCH v3 " Mattias Rönnblom
  2022-07-11 12:11                                               ` [PATCH v3 2/2] net: have checksum routines accept unaligned data Mattias Rönnblom
@ 2022-07-11 13:20                                               ` Olivier Matz
  1 sibling, 0 replies; 74+ messages in thread
From: Olivier Matz @ 2022-07-11 13:20 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	onar.olsen, Morten Brørup

On Mon, Jul 11, 2022 at 02:11:31PM +0200, Mattias Rönnblom wrote:
> Add performance test for the rte_raw_cksum() function, which delegates
> the actual work to __rte_raw_cksum(), which in turn is used by other
> functions in need of Internet checksum calculation.
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

Acked-by: Olivier Matz <olivier.matz@6wind.com>

Thank you!


* Re: [PATCH v3 2/2] net: have checksum routines accept unaligned data
  2022-07-11 12:11                                               ` [PATCH v3 2/2] net: have checksum routines accept unaligned data Mattias Rönnblom
@ 2022-07-11 13:25                                                 ` Olivier Matz
  2022-08-08  9:25                                                   ` Mattias Rönnblom
  2022-09-20 12:09                                                   ` Mattias Rönnblom
  0 siblings, 2 replies; 74+ messages in thread
From: Olivier Matz @ 2022-07-11 13:25 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	onar.olsen, Morten Brørup

On Mon, Jul 11, 2022 at 02:11:32PM +0200, Mattias Rönnblom wrote:
> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its
> data through an uint16_t pointer, which allowed the compiler to assume
> the data was 16-bit aligned. This in turn would, with certain
> architectures and compiler flag combinations, result in code with SIMD
> load or store instructions with restrictions on data alignment.
> 
> This patch keeps the old algorithm, but data is read using memcpy()
> instead of direct pointer access, forcing the compiler to always
> generate code that handles unaligned input. The __may_alias__ GCC
> attribute is no longer needed.
> 
> The data on which the Internet checksum functions operates are almost
> always 16-bit aligned, but there are exceptions. In particular, the
> PDCP protocol header may (literally) have an odd size.
> 
> Performance impact seems to range from none to a very slight
> regression.
> 
> Bugzilla ID: 1035
> Cc: stable@dpdk.org

Fixes: 6006818cfb26 ("net: new checksum functions")

> ---
> 
> v3:
>   * Use RTE_ALIGN_FLOOR() in the pointer arithmetic (Olivier Matz).
> v2:
>   * Simplified the odd-length conditional (Morten Brørup).
> 
> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
> 
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>

Acked-by: Olivier Matz <olivier.matz@6wind.com>

Thank you!


* Re: [PATCH v3 2/2] net: have checksum routines accept unaligned data
  2022-07-11 13:25                                                 ` Olivier Matz
@ 2022-08-08  9:25                                                   ` Mattias Rönnblom
  2022-09-20 12:09                                                   ` Mattias Rönnblom
  1 sibling, 0 replies; 74+ messages in thread
From: Mattias Rönnblom @ 2022-08-08  9:25 UTC (permalink / raw)
  To: Olivier Matz
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	Onar Olsen, Morten Brørup

On 2022-07-11 15:25, Olivier Matz wrote:
> On Mon, Jul 11, 2022 at 02:11:32PM +0200, Mattias Rönnblom wrote:
>> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its
>> data through an uint16_t pointer, which allowed the compiler to assume
>> the data was 16-bit aligned. This in turn would, with certain
>> architectures and compiler flag combinations, result in code with SIMD
>> load or store instructions with restrictions on data alignment.
>>
>> This patch keeps the old algorithm, but data is read using memcpy()
>> instead of direct pointer access, forcing the compiler to always
>> generate code that handles unaligned input. The __may_alias__ GCC
>> attribute is no longer needed.
>>
>> The data on which the Internet checksum functions operates are almost
>> always 16-bit aligned, but there are exceptions. In particular, the
>> PDCP protocol header may (literally) have an odd size.
>>
>> Performance impact seems to range from none to a very slight
>> regression.
>>
>> Bugzilla ID: 1035
>> Cc: stable@dpdk.org
> 
> Fixes: 6006818cfb26 ("net: new checksum functions")
> 
>> ---
>>
>> v3:
>>    * Use RTE_ALIGN_FLOOR() in the pointer arithmetic (Olivier Matz).
>> v2:
>>    * Simplified the odd-length conditional (Morten Brørup).
>>
>> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> 
> Acked-by: Olivier Matz <olivier.matz@6wind.com>
> 
> Thank you!

Will this be merged into 22.11? Into the stable branches?


* Re: [PATCH v3 2/2] net: have checksum routines accept unaligned data
  2022-07-11 13:25                                                 ` Olivier Matz
  2022-08-08  9:25                                                   ` Mattias Rönnblom
@ 2022-09-20 12:09                                                   ` Mattias Rönnblom
  2022-09-20 16:10                                                     ` Thomas Monjalon
  1 sibling, 1 reply; 74+ messages in thread
From: Mattias Rönnblom @ 2022-09-20 12:09 UTC (permalink / raw)
  To: Olivier Matz, David Marchand, Thomas Monjalon
  Cc: Emil Berg, bruce.richardson, stephen, stable, bugzilla, dev,
	Onar Olsen, Morten Brørup

On 2022-07-11 15:25, Olivier Matz wrote:
> On Mon, Jul 11, 2022 at 02:11:32PM +0200, Mattias Rönnblom wrote:
>> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its
>> data through an uint16_t pointer, which allowed the compiler to assume
>> the data was 16-bit aligned. This in turn would, with certain
>> architectures and compiler flag combinations, result in code with SIMD
>> load or store instructions with restrictions on data alignment.
>>
>> This patch keeps the old algorithm, but data is read using memcpy()
>> instead of direct pointer access, forcing the compiler to always
>> generate code that handles unaligned input. The __may_alias__ GCC
>> attribute is no longer needed.
>>
>> The data on which the Internet checksum functions operates are almost
>> always 16-bit aligned, but there are exceptions. In particular, the
>> PDCP protocol header may (literally) have an odd size.
>>
>> Performance impact seems to range from none to a very slight
>> regression.
>>
>> Bugzilla ID: 1035
>> Cc: stable@dpdk.org
> 
> Fixes: 6006818cfb26 ("net: new checksum functions")
> 
>> ---
>>
>> v3:
>>    * Use RTE_ALIGN_FLOOR() in the pointer arithmetic (Olivier Matz).
>> v2:
>>    * Simplified the odd-length conditional (Morten Brørup).
>>
>> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> 
> Acked-by: Olivier Matz <olivier.matz@6wind.com>
> 
> Thank you!

Are there any plans to merge this patchset?



* Re: [PATCH v3 2/2] net: have checksum routines accept unaligned data
  2022-09-20 12:09                                                   ` Mattias Rönnblom
@ 2022-09-20 16:10                                                     ` Thomas Monjalon
  0 siblings, 0 replies; 74+ messages in thread
From: Thomas Monjalon @ 2022-09-20 16:10 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Olivier Matz, David Marchand, dev, Emil Berg, bruce.richardson,
	stephen, stable, bugzilla, dev, Onar Olsen, Morten Brørup

20/09/2022 14:09, Mattias Rönnblom:
> On 2022-07-11 15:25, Olivier Matz wrote:
> > On Mon, Jul 11, 2022 at 02:11:32PM +0200, Mattias Rönnblom wrote:
> >> __rte_raw_cksum() (used by rte_raw_cksum() among others) accessed its
> >> data through an uint16_t pointer, which allowed the compiler to assume
> >> the data was 16-bit aligned. This in turn would, with certain
> >> architectures and compiler flag combinations, result in code with SIMD
> >> load or store instructions with restrictions on data alignment.
> >>
> >> This patch keeps the old algorithm, but data is read using memcpy()
> >> instead of direct pointer access, forcing the compiler to always
> >> generate code that handles unaligned input. The __may_alias__ GCC
> >> attribute is no longer needed.
> >>
> >> The data on which the Internet checksum functions operates are almost
> >> always 16-bit aligned, but there are exceptions. In particular, the
> >> PDCP protocol header may (literally) have an odd size.
> >>
> >> Performance impact seems to range from none to a very slight
> >> regression.
> >>
> >> Bugzilla ID: 1035
> >> Cc: stable@dpdk.org
> > 
> > Fixes: 6006818cfb26 ("net: new checksum functions")
> > 
> >> ---
> >>
> >> v3:
> >>    * Use RTE_ALIGN_FLOOR() in the pointer arithmetic (Olivier Matz).
> >> v2:
> >>    * Simplified the odd-length conditional (Morten Brørup).
> >>
> >> Reviewed-by: Morten Brørup <mb@smartsharesystems.com>
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> > 
> > Acked-by: Olivier Matz <olivier.matz@6wind.com>
> > 
> > Thank you!
> 
> Are there any plans to merge this patchset?

Applied, thanks.
Sorry for the delay.





* [Bug 1035] __rte_raw_cksum() crash with misaligned pointer
  2022-06-15  7:16 [Bug 1035] __rte_raw_cksum() crash with misaligned pointer bugzilla
  2022-06-15 14:40 ` Morten Brørup
@ 2022-10-10 10:40 ` bugzilla
  1 sibling, 0 replies; 74+ messages in thread
From: bugzilla @ 2022-10-10 10:40 UTC (permalink / raw)
  To: dev

https://bugs.dpdk.org/show_bug.cgi?id=1035

Thomas Monjalon (thomas@monjalon.net) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|UNCONFIRMED                 |RESOLVED

--- Comment #3 from Thomas Monjalon (thomas@monjalon.net) ---
Resolved in http://git.dpdk.org/dpdk/commit/?id=1c9a7fba5c

-- 
You are receiving this mail because:
You are the assignee for the bug.


Thread overview: 74+ messages
2022-06-15  7:16 [Bug 1035] __rte_raw_cksum() crash with misaligned pointer bugzilla
2022-06-15 14:40 ` Morten Brørup
2022-06-16  5:44   ` Emil Berg
2022-06-16  6:27     ` Morten Brørup
2022-06-16  6:32     ` Emil Berg
2022-06-16  6:44       ` Morten Brørup
2022-06-16 13:58         ` Mattias Rönnblom
2022-06-16 14:36           ` Morten Brørup
2022-06-17  7:32           ` Morten Brørup
2022-06-17  8:45             ` [PATCH] net: fix checksum with unaligned buffer Morten Brørup
2022-06-17  9:06               ` Morten Brørup
2022-06-17 12:17                 ` Emil Berg
2022-06-20 10:37                 ` Emil Berg
2022-06-20 10:57                   ` Morten Brørup
2022-06-21  7:16                     ` Emil Berg
2022-06-21  8:05                       ` Morten Brørup
2022-06-21  8:23                         ` Bruce Richardson
2022-06-21  9:35                           ` Morten Brørup
2022-06-22  6:26                             ` Emil Berg
2022-06-22  9:18                               ` Bruce Richardson
2022-06-22 11:26                                 ` Morten Brørup
2022-06-22 12:25                                   ` Emil Berg
2022-06-22 14:01                                     ` Morten Brørup
2022-06-22 14:03                                       ` Emil Berg
2022-06-23  5:21                                       ` Emil Berg
2022-06-23  7:01                                         ` Morten Brørup
2022-06-23 11:39                                           ` Emil Berg
2022-06-23 12:18                                             ` Morten Brørup
2022-06-22 13:44             ` [PATCH v2] " Morten Brørup
2022-06-22 13:54             ` [PATCH v3] " Morten Brørup
2022-06-23 12:39             ` [PATCH v4] " Morten Brørup
2022-06-23 12:51               ` Morten Brørup
2022-06-27  7:56                 ` Emil Berg
2022-06-27 10:54                   ` Morten Brørup
2022-06-27 12:28                 ` Mattias Rönnblom
2022-06-27 12:46                   ` Emil Berg
2022-06-27 12:50                     ` Emil Berg
2022-06-27 13:22                       ` Morten Brørup
2022-06-27 17:22                         ` Mattias Rönnblom
2022-06-27 20:21                           ` Morten Brørup
2022-06-28  6:28                             ` Mattias Rönnblom
2022-06-30 16:28                               ` Morten Brørup
2022-07-07 15:21                                 ` Stanisław Kardach
2022-07-07 18:34                             ` [PATCH 1/2] app/test: add cksum performance test Mattias Rönnblom
2022-07-07 18:34                               ` [PATCH 2/2] net: have checksum routines accept unaligned data Mattias Rönnblom
2022-07-07 21:44                                 ` Morten Brørup
2022-07-08 12:43                                   ` Mattias Rönnblom
2022-07-08 12:56                                     ` [PATCH v2 1/2] app/test: add cksum performance test Mattias Rönnblom
2022-07-08 12:56                                       ` [PATCH v2 2/2] net: have checksum routines accept unaligned data Mattias Rönnblom
2022-07-08 14:44                                         ` Ferruh Yigit
2022-07-11  9:53                                         ` Olivier Matz
2022-07-11 10:53                                           ` Mattias Rönnblom
2022-07-11  9:47                                       ` [PATCH v2 1/2] app/test: add cksum performance test Olivier Matz
2022-07-11 10:42                                         ` Mattias Rönnblom
2022-07-11 11:33                                           ` Olivier Matz
2022-07-11 12:11                                             ` [PATCH v3 " Mattias Rönnblom
2022-07-11 12:11                                               ` [PATCH v3 2/2] net: have checksum routines accept unaligned data Mattias Rönnblom
2022-07-11 13:25                                                 ` Olivier Matz
2022-08-08  9:25                                                   ` Mattias Rönnblom
2022-09-20 12:09                                                   ` Mattias Rönnblom
2022-09-20 16:10                                                     ` Thomas Monjalon
2022-07-11 13:20                                               ` [PATCH v3 1/2] app/test: add cksum performance test Olivier Matz
2022-07-08 13:02                                     ` [PATCH 2/2] net: have checksum routines accept unaligned data Morten Brørup
2022-07-08 13:52                                       ` Mattias Rönnblom
2022-07-08 14:10                                         ` Bruce Richardson
2022-07-08 14:30                                           ` Morten Brørup
2022-06-30 17:41               ` [PATCH v4] net: fix checksum with unaligned buffer Stephen Hemminger
2022-06-30 17:45               ` Stephen Hemminger
2022-07-01  4:11                 ` Emil Berg
2022-07-01 16:50                   ` Morten Brørup
2022-07-01 17:04                     ` Stephen Hemminger
2022-07-01 20:46                       ` Morten Brørup
2022-06-16 14:09       ` [Bug 1035] __rte_raw_cksum() crash with misaligned pointer Mattias Rönnblom
2022-10-10 10:40 ` bugzilla
