From: "Mattias Rönnblom" <mattias.ronnblom@ericsson.com>
To: "Morten Brørup" <mb@smartsharesystems.com>,
"dev@dpdk.org" <dev@dpdk.org>
Cc: Erik Gabriel Carrillo <erik.g.carrillo@intel.com>,
David Marchand <david.marchand@redhat.com>,
Maria Lingemark <maria.lingemark@ericsson.com>,
Stefan Sundkvist <stefan.sundkvist@ericsson.com>
Subject: Re: [RFC 0/2] Add high-performance timer facility
Date: Wed, 1 Mar 2023 15:50:01 +0000 [thread overview]
Message-ID: <ba94dd30-e9b8-c8dc-5db0-ec8c8458b52f@ericsson.com> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D8779D@smartserver.smartshare.dk>
On 2023-03-01 14:31, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 1 March 2023 12.18
>>
>> On 2023-02-28 17:01, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>> Sent: Tuesday, 28 February 2023 10.39
>>>
>>> I have been looking for a high performance timer library (for use in a fast
>> path TCP stack), and this looks very useful, Mattias.
>>>
>>> My initial feedback is based on quickly skimming the patch source code, and
>> reading this cover letter.
>>>
>>>>
>>>> This patchset is an attempt to introduce a high-performance, highly
>>>> scalable timer facility into DPDK.
>>>>
>>>> More specifically, the goals for the htimer library are:
>>>>
>>>> * Efficient handling of a handful up to hundreds of thousands of
>>>> concurrent timers.
>>>> * Reduced overhead of adding and canceling timers.
>>>> * Provide a service functionally equivalent to that of
>>>> <rte_timer.h>. API/ABI backward compatibility is secondary.
>>>>
>>>> In the author's opinion, there are two main shortcomings with the
>>>> current DPDK timer library (i.e., rte_timer.[ch]).
>>>>
>>>> One is the synchronization overhead, where heavy-weight full-barrier
>>>> type synchronization is used. rte_timer.c uses per-EAL/lcore skip
>>>> lists, but any thread may add or cancel (or otherwise access) timers
>>>> managed by another lcore (which thus reside in its timer skip list).
>>>>
>>>> The other is an algorithmic shortcoming, with rte_timer.c's reliance
>>>> on a skip list, which, seemingly, is less efficient than certain
>>>> alternatives.
>>>>
>>>> This patchset implements a hierarchical timer wheel (HWT, in
>>>
>>> Typo: HWT or HTW?
>>
>> Yes. I don't understand how I managed to make so many such HTW ->
>> HWT typos. At least I got the filenames (rte_htw.[ch]) correct.
>>
>>>
>>>> rte_htw.c), as per the Varghese and Lauck paper "Hashed and
>>>> Hierarchical Timing Wheels: Data Structures for the Efficient
>>>> Implementation of a Timer Facility". A HWT is a data structure
>>>> purposely designed for this task, and used by many operating system
>>>> kernel timer facilities.
>>>>
>>>> To further improve the solution described by Varghese and Lauck, a
>>>> bitset is placed in front of each of the timer wheels in the HWT,
>>>> reducing overhead of rte_htimer_mgr_manage() (i.e., progressing time
>>>> and expiry processing).
>>>>
>>>> Cycle-efficient scanning and manipulation of these bitsets are crucial
>>>> for the HWT's performance.
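To make the bitset idea above concrete, here is a minimal, self-contained sketch (not the actual rte_htw implementation; names and layout are illustrative): a single 64-slot wheel fronted by a one-word occupancy bitset, so advancing time scans set bits with a count-trailing-zeros instruction instead of visiting every slot.

```c
#include <stdint.h>

#define WHEEL_SLOTS 64

struct wheel {
	uint64_t occupied;        /* bit n set <=> slot n has pending timers */
	void *slots[WHEEL_SLOTS]; /* per-slot timer lists (opaque here) */
};

/* Return the index of the first occupied slot at or after 'from',
 * or -1 if no slot in [from, WHEEL_SLOTS) is occupied. */
static int
wheel_next_occupied(const struct wheel *w, unsigned int from)
{
	uint64_t mask = w->occupied & (~UINT64_C(0) << from);

	if (mask == 0)
		return -1;

	return __builtin_ctzll(mask); /* count trailing zero bits */
}
```

With this in place, an empty span of slots costs one AND and one branch to skip, which is where the cycle-efficiency of the bitset front comes from.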
>>>>
>>>> The htimer module keeps a per-lcore (or per-registered EAL thread) HWT
>>>> instance, much like rte_timer.c keeps a per-lcore skip list.
>>>>
>>>> To avoid expensive synchronization overhead for thread-local timer
>>>> management, the HWTs are accessed only from the "owning" thread. Any
>>>> interaction any other thread has with a particular lcore's timer
>>>> wheel goes over a set of DPDK rings. A side-effect of this design is
>>>> that all operations working toward a "remote" HWT must be
>>>> asynchronous.
>>>>
>>>> The <rte_htimer.h> API is available only to EAL threads and registered
>>>> non-EAL threads.
>>>>
>>>> The htimer API allows the application to supply the current time,
>>>> useful in case it already has retrieved this for other purposes,
>>>> saving the cost of a rdtsc instruction (or its equivalent).
>>>>
>>>> Relative htimers do not retrieve a new time, but reuse the current
>>>> time (as known via/at-the-time of the manage-call), again to shave off
>>>> some cycles of overhead.
>>>
>>> I have a comment to the two points above.
>>>
>>> I agree that the application should supply the current time.
>>>
>>> This should be the concept throughout the library. I don't understand why
>> TSC is used in the library at all?
>>>
>>> Please use a unit-less tick, and let the application decide what one tick
>> means.
>>>
>>
>> I suspect the design of rte_htimer_mgr.h (and rte_timer.h) makes more
>> sense if you think of the user of the API as not just a "monolithic"
>> application, but rather a set of different modules, developed by
>> different organizations, and reused across a set of applications. The
>> idea behind the API design is they should all be able to share one timer
>> service instance.
>>
>> The different parts of the application and any future DPDK platform
>> modules that use the htimer service need to agree on what a tick means in
>> terms of actual wall-time, if it's not mandated by the API.
>
> I see. Then those non-monolithic applications can agree that the unit of time is nanoseconds, or whatever makes sense for those applications. And then they can instantiate one shared HTW for that purpose.
>
<rte_htimer_mgr.h> contains nothing but shared HTWs.
> There is no need to impose such an API limit on other users of the library.
>
>>
>> There might be room for module-specific timer wheels as well, with
>> different resolution or other characteristics. The event timer adapter's
>> use of a timer wheel could be one example (although I'm not sure it is).
>
> We are not using the event device, and I have not looked into it, so I have no qualified comments to this.
>
>>
>> If timer-wheel-as-a-private-lego-piece is also a valid use case, then
>> one could consider making the <rte_htw.h> API public as well. That is what
>> I think you are asking for here: a generic timer wheel that doesn't know
>> anything about time sources, time source time -> tick conversion, or
>> time source time -> monotonic wall time conversion, and maybe is also
>> not bound to a particular thread.
>
> Yes, that is what I had been searching the Internet for.
>
> (I'm not sure what you mean by "not bound to a particular thread". Your per-thread design seems good to me.)
>
> I don't want more stuff in the EAL. What I want is high-performance DPDK libraries we can use in our applications.
>
>>
>> I picked TSC because it seemed like a good "universal time unit" for
>> DPDK. rdtsc (and its equivalents) is also very precise (especially on
>> x86) and cheap to retrieve (especially on ARM, from what I understand).
>
> The TSC does have excellent performance, but on all other parameters it is a horrible time keeper: The measurement unit depends on the underlying hardware, the TSC drifts depending on temperature, it cannot be PTP synchronized, the list is endless!
>
>>
>> That said, at the moment, I'm leaning toward nanoseconds (uint64_t
>> format) should be the default for timer expiration time instead of TSC.
>> TSC could still be an option for passing the current time, since TSC
>> will be a common time source, and it shaves off one conversion.
>
> There are many reasons why nanoseconds is a much better choice than TSC.
>
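Since the expiry unit and the current-time unit may then differ, a TSC reading would need one conversion on the way in. A rough sketch of how that conversion could look (illustrative only, not DPDK code; splitting into whole seconds and a remainder avoids the 64-bit overflow a naive tsc * NS_PER_SEC / hz hits after a few seconds of uptime on a GHz-range TSC):

```c
#include <stdint.h>

#define NS_PER_SEC UINT64_C(1000000000)

/* Convert a TSC value to nanoseconds, given the TSC frequency 'hz'.
 * The remainder term stays below hz, so rem * NS_PER_SEC fits in 64
 * bits for any realistic TSC frequency (up to ~18 GHz). */
static uint64_t
tsc_to_ns(uint64_t tsc, uint64_t hz)
{
	uint64_t secs = tsc / hz;
	uint64_t rem = tsc % hz;

	return secs * NS_PER_SEC + rem * NS_PER_SEC / hz;
}
```

Accepting TSC directly for the current time, as suggested, saves exactly this per-call arithmetic.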
>>
>>> A unit-less tick will also let the application instantiate a HTW with higher
>> resolution than the TSC. (E.g. think about oversampling in audio processing,
>> or Brezenham's line drawing algorithm for 2D visuals - oversampling can sound
>> and look better.)
>
> Some of the timing data in our application have a resolution orders of magnitude higher than one nanosecond. If we combined that with a HTW library with nanosecond resolution, we would need to keep these timer values in two locations: The original high-res timer in our data structure, and the shadow low-res (nanosecond) timer in the HTW.
>
There is no way you will need timers with anything approaching
pico-second-level precision. You will also run into a value range issue,
since you will wrap around a 64-bit integer in a matter of days.

The HTW only stores the timeout in ticks, not TSC, nanoseconds or
picoseconds. Generally, you don't want pico-second-level tick
granularity, since it increases the overhead of advancing the wheel(s).
The first (lowest-significance) few wheels will pretty much always be empty.
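The wrap-around claim is easy to check with back-of-the-envelope arithmetic:

```c
/* Days until a uint64_t tick counter wraps, for a given tick length
 * in seconds: 2^64 ticks * seconds-per-tick / seconds-per-day. */
static double
days_until_wrap(double tick_seconds)
{
	const double two_pow_64 = 18446744073709551616.0; /* 2^64 */

	return two_pow_64 * tick_seconds / 86400.0;
}
```

At picosecond resolution a 64-bit counter wraps after roughly 213 days; at nanosecond resolution, after roughly 584 years.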
> We might also need to frequently update the HTW timers to prevent drifting away from the high-res timers. E.g. 1.2 + 1.2 is still 2 when rounded, but 2 + 1.2 becomes 3, when it should have been 4 (3 * 1.2 = 3.6, rounded). This level of drifting would also make periodic timers in the HTW useless.
>
Useless, for a certain class of applications. What application would
that be?
> Please note: I haven't really considered merging the high-res timing in our application with this HTW, and I'm also not saying that PERIODIC timers in the HTW are required or even useful for our application. I'm only providing arguments for a unit-less time!
>
>>>
>>> For reference (supporting my suggestion), the dynamic timestamp field in the
>> rte_mbuf structure is also defined as being unit-less. (I think NVIDIA
>> implements it as nanoseconds, but that's an implementation specific choice.)
>>>
>>>>
>>>> A semantic improvement compared to the <rte_timer.h> API is that the
>>>> htimer library can give a definite answer on the question if the timer
>>>> expiry callback was called, after a timer has been canceled.
>>>>
>>>> Below is performance data from DPDK's 'app/test' micro benchmarks,
>>>> using 10k concurrent timers. The benchmarks (test_timer_perf.c and
>>>> test_htimer_mgr_perf.c) aren't identical in their structure, but the
>>>> numbers give some indication of the difference.
>>>>
>>>> Use case                   htimer   timer
>>>> --------------------------------------------
>>>> Add timer                      28     253
>>>> Cancel timer                   10     412
>>>> Async add (source lcore)       64
>>>> Async add (target lcore)       13
>>>>
>>>> (AMD 5900X CPU. Time in TSC.)
>>>>
>>>> Prototype integration of the htimer library into real, timer-heavy,
>>>> applications indicates that htimer may result in significant
>>>> application-level performance gains.
>>>>
>>>> The bitset implementation which the HWT implementation depends upon
>>>> seemed generic-enough and potentially useful outside the world of
>>>> HWTs, to justify being located in the EAL.
>>>>
>>>> This patchset is very much an RFC, and the author is yet to form an
>>>> opinion on many important issues.
>>>>
>>>> * If deemed a suitable replacement, should the htimer replace the
>>>> current DPDK timer library in some particular (ABI-breaking)
>>>> release, or should it live side-by-side with the then-legacy
>>>> <rte_timer.h> API? A lot of things in and outside DPDK depend on
>>>> <rte_timer.h>, so coexistence may be required to facilitate a smooth
>>>> transition.
>>>
>>> It's my immediate impression that they are totally different in both design
>> philosophy and API.
>>>
>>> Personal opinion: I would call it an entirely different library.
>>>
>>>>
>>>> * Should the htimer and htw-related files be colocated with rte_timer.c
>>>> in the timer library?
>>>
>>> Personal opinion: No. This is an entirely different library, and should live
>> for itself in a directory of its own.
>>>
>>>>
>>>> * Would it be useful for applications using asynchronous cancel to
>>>> have the option of having the timer callback run not only in case of
>>>> timer expiration, but also cancellation (on the target lcore)? The
>>>> timer cb signature would need to include an additional parameter in
>>>> that case.
>>>
>>> If one thread cancels something in another thread, some synchronization
>> between the threads is going to be required anyway. So we could rephrase your
>> question: Will the burden of the otherwise required synchronization between
>> the two threads be significantly reduced if the library has the ability to run
>> the callback on asynchronous cancel?
>>>
>>
>> Yes.
>>
>> Intuitively, it seems convenient that if you hand off a timer to a
>> different lcore, the timer callback will be called exactly once,
>> regardless if the timer was canceled or expired.
>>
>> But, as you indicate, you may still need synchronization to solve the
>> resource reclamation issue.
>>
>>> Is such a feature mostly "Must have" or "Nice to have"?
>>>
>>> More thoughts in this area...
>>>
>>> If adding an additional callback parameter, it could be an enum, so the
>> callback could be expanded to support "timeout (a.k.a. timer fired)", "cancel"
>> and more events we have not yet come up with, e.g. "early kick".
>>>
>>
>> Yes, or an int.
>>
>>> Here's an idea off the top of my head: An additional callback parameter has
>> a (small) performance cost incurred with every timer fired (which is a very
>> large multiplier). It might not be required. As an alternative to an "what
>> happened" parameter to the callback, the callback could investigate the state
>> of the object for which the timer fired, and draw its own conclusion on how to
>> proceed. Obviously, this also has a performance cost, but perhaps the callback
>> works on the object's state anyway, making this cost insignificant.
>>>
>>
>> It's not obvious to me that you, in the timer callback, can determine
>> what happened, if the same callback is called both in the cancel and the
>> expired case.
>>
>> The cost of an extra integer passed in a register (or of checking a flag,
>> deciding if the timer callback should be called at all at cancellation) is
>> not the concern for me; it's the extra bit of API complexity.
>
> Then introduce the library without this feature. More features can be added later.
>
> The library will be introduced as "experimental", so we are free to improve it and modify the ABI along the way.
>
>>
>>> Here's another alternative to adding a "what happened" parameter to the
>> callback:
>>>
>>> The rte_htimer could have one more callback pointer, which (if set) will be
>> called on cancellation of the timer.
>>>
>>
>> This will grow the timer struct with 16 bytes.
>
> If the rte_htimer struct stays within one cache line, it should be acceptable.
>
Timer structs are often embedded in other structures, and need not
themselves be cache line aligned (although the "parent" struct may need
to be, e.g., if it's dynamically allocated).

So smaller is better. Just consider if you want your attosecond-level
time stamp in a struct:

struct my_timer {
	uint64_t high_precision_time_high_bits;
	uint64_t high_precision_time_low_bits;
	struct rte_htimer timer;
};

...and you allocate those structs from a mempool. If rte_htimer is small
enough, it will fit on one cache line.
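The fits-in-a-cache-line property can be guarded at compile time. The struct below uses a made-up stand-in for rte_htimer (the real layout differs; this only illustrates the check):

```c
#include <stdint.h>

/* Stand-in for rte_htimer; fields are hypothetical. */
struct pseudo_htimer {
	uint64_t expiry_tick;
	void (*cb)(void *arg);
	void *cb_arg;
	uint32_t owner_lcore_id;
	uint32_t flags;
};

struct my_timer {
	uint64_t high_precision_time_high_bits;
	uint64_t high_precision_time_low_bits;
	struct pseudo_htimer timer;
};

/* Breaks the build if the embedding struct outgrows one cache line. */
_Static_assert(sizeof(struct my_timer) <= 64,
	       "my_timer no longer fits in one cache line");
```

An extra 16-byte callback pointer pair in the timer struct is exactly the kind of growth such an assert would catch.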
> On the other hand, this approach is less generic than passing an additional parameter. (E.g. add yet another callback pointer for "early kick"?)
>
> BTW, async cancel is a form of inter-thread communication. Does this library really need to provide any inter-thread communication mechanisms? Doesn't an inter-thread communication mechanism belong in a separate library?
>
Yes, <rte_htimer_mgr.h> needs this because:
1) Being able to schedule timers on a remote lcore is a useful feature
(especially since we don't have much else in terms of deferred work
mechanisms in DPDK).
2) htimer aspires to be a plug-in replacement for <rte_timer.h> (albeit
an ABI-breaking one).
The pure HTW is in rte_htw.[ch].
Plus, with the current design, async operations basically come for free
(if you don't use them), from a performance perspective. The extra
overhead boils down to occasionally polling an empty ring, which is an
inexpensive operation.
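The shape of that design can be modeled in a few lines. This is a simplified single-producer/single-consumer stand-in, not the actual implementation (which uses DPDK rings and the <rte_htimer_mgr.h> API): a remote thread enqueues an "add" request, and the owning thread drains its ring from within its own manage call.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8 /* must be a power of two */

struct add_request {
	uint64_t expiry_tick;
	void (*cb)(void *arg);
	void *cb_arg;
};

struct request_ring {
	struct add_request msgs[RING_SIZE];
	unsigned int head, tail; /* single producer, single consumer */
};

/* Called by the remote thread. */
static bool
ring_enqueue(struct request_ring *r, struct add_request req)
{
	if (r->head - r->tail == RING_SIZE)
		return false; /* ring full: async add fails */
	r->msgs[r->head++ % RING_SIZE] = req;
	return true;
}

/* Called by the owning thread, from its manage/process call. */
static bool
ring_dequeue(struct request_ring *r, struct add_request *req)
{
	if (r->head == r->tail)
		return false; /* nothing pending */
	*req = r->msgs[r->tail++ % RING_SIZE];
	return true;
}
```

When no remote operations are outstanding, the dequeue side is one load and one compare, which is the "polling an empty ring" cost mentioned above.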
>>
>>>>
>>>> * Should the rte_htimer be a nested struct, so the htw parts be separated
>>>> from the htimer parts?
>>>>
>>>> * <rte_htimer.h> is kept separate from <rte_htimer_mgr.h>, so that
>>>> <rte_htw.h> may avoid a dependency on <rte_htimer_mgr.h>. Should it
>>>> be so?
>>>>
>>>> * rte_htimer struct is only supposed to be used by the application to
>>>> give an indication of how much memory it needs to allocate, and
>>>> its members are not supposed to be directly accessed (w/ the possible
>>>> exception of the owner_lcore_id field). Should there be a dummy
>>>> struct, or a #define RTE_HTIMER_MEMSIZE or a rte_htimer_get_memsize()
>>>> function instead, serving the same purpose? Better encapsulation,
>>>> but more inconvenient for applications. Run-time dynamic sizing
>>>> would force application-level dynamic allocations.
>>>>
>>>> * Asynchronous cancellation is a little tricky to use for the
>>>> application (primarily due to timer memory reclamation/race
>>>> issues). Should this functionality be removed?
>>>>
>>>> * Should rte_htimer_mgr_init() also retrieve the current time? If so,
>>>> there should be a variant which allows the user to specify the
>>>> time (to match rte_htimer_mgr_manage_time()). One pitfall with the
>>>> current proposed API is an application calling rte_htimer_mgr_init()
>>>> and then immediately adding a timer with a relative timeout, in
>>>> which case the current absolute time used is 0, which might be a
>>>> surprise.
>>>>
>>>> * Should libdivide (optionally) be used to avoid the div in the TSC ->
>>>> tick conversion? (Doesn't improve performance on Zen 3, but may
>>>> do on other CPUs.) Consider <rte_reciprocal.h> as well.
>>>>
>>>> * Should the TSC-per-tick be rounded up to a power of 2, so shifts can be
>>>> used for conversion? Very minor performance gains to be found there,
>>>> at least on Zen 3 cores.
>>>>
>>>> * Should it be possible to supply the time in rte_htimer_mgr_add()
>>>> and/or rte_htimer_mgr_manage_time() functions as ticks, rather than
>>>> as TSC? Should it be possible to also use nanoseconds?
>>>> rte_htimer_mgr_manage_time() would need a flags parameter in that
>>>> case.
>>>
>>> Do not use TSC anywhere in this library. Let the application decide the
>> meaning of a tick.
>>>
>>>>
>>>> * Would the event timer adapter be best off using <rte_htw.h>
>>>> directly, or <rte_htimer.h>? In the latter case, there needs to be a
>>>> way to instantiate more HWTs (similar to the "alt" functions of
>>>> <rte_timer.h>)?
>>>>
>>>> * Should the PERIODICAL flag (and the complexity it brings) be
>>>> removed? And leave the application with only single-shot timers, and
>>>> the option to re-add them in the timer callback.
>>>
>>> First thought: Yes, keep it lean and remove the periodical stuff.
>>>
>>> Second thought: This needs a more detailed analysis.
>>>
>>> From one angle:
>>>
>>> How many PERIODICAL versus ONESHOT timers do we expect?
>>>
>>
>> I suspect you should be prepared for the ratio being anything.
>
> In theory, anything is possible. But I'm asking that we consider realistic use cases.
>
>>
>>> Intuitively, I would use this library for ONESHOT timers, and perhaps
>> implement my periodical timers by other means.
>>>
>>> If the PERIODICAL:ONESHOT ratio is low, we can probably live with the extra
>> cost of cancel+add for a few periodical timers.
>>>
>>> From another angle:
>>>
>>> What is the performance gain with the PERIODICAL flag?
>>>
>>
>> None, pretty much. It's just there for convenience.
>
> OK, then I suggest that you remove it, unless you get objections.
>
> The library can be expanded with useful features at any time later. Useless features are (nearly) impossible to remove, once they are in there - they are just "technical debt" with associated maintenance costs, added complexity weaving into other features, etc..
>
>>
>>> Without a periodical timer, cancel+add costs 10+28 cycles. How many cycles
>> would a "move" function, performing both cancel and add, use?
>>>
>>> And then compare that to the cost (in cycles) of repeating a timer with
>> PERIODICAL?
>>>
>>> Furthermore, not having the PERIODICAL flag probably improves the
>> performance for non-periodical timers. How many cycles could we gain here?
>>>
>>>
>>> Another, vaguely related, idea:
>>>
>>> The callback pointer might not need to be stored per rte_htimer, but could
>> instead be common for the rte_htw.
>>>
>>
>> Do you mean rte_htw, or rte_htimer_mgr?
>>
>> If you make one common callback, all the different parts of the
>> application need to be coordinated (in a big switch-statement, or
>> something of that sort), or have some convention for using an
>> application-specific wrapper structure (accessed via container_of()).
>>
>> This is a problem if the timer service API consumer is a set of largely
>> uncoordinated software modules.
>>
>> Btw, the eventdev API has the same issue, and the proposed event
>> dispatcher is one way to help facilitate application-internal decoupling.
>>
>> For a module-private rte_htw instance your suggestion may work, but not
>> for <rte_htimer_mgr.h>.
>
> I was speculating that a common callback pointer might provide a performance benefit for single-purpose HTW instances. (The same concept applies if there are multiple callbacks, e.g. a "Timer Fired", a "Timer Cancelled", and an "Early Kick" callback pointer - i.e. having the callback pointers per HTW instance, instead of per timer.)
>
>>
>>> When a timer fires, the callback probably needs to check/update the state of
>> the object for which the timer fired anyway, so why not just let the
>> application use that state to determine the appropriate action. This might
>> provide some performance benefit.
>>>
>>> It might complicate using one HTW for multiple different purposes, though.
>> Probably a useless idea, but I wanted to share the idea anyway. It might
>> trigger other, better ideas in the community.
>>>
>>>>
>>>> * Should the async result codes and the sync cancel error codes be merged
>>>> into one set of result codes?
>>>>
>>>> * Should the rte_htimer_mgr_async_add() have a flag which allow
>>>> buffering add request messages until rte_htimer_mgr_process() is
>>>> called? Or any manage function. Would reduce ring signaling overhead
>>>> (i.e., burst enqueue operations instead of single-element
>>>> enqueue). Could also be a rte_htimer_mgr_async_add_burst() function,
>>>> solving the same "problem" a different way. (The signature of such
>>>> a function would not be pretty.)
>>>>
>>>> * Does the functionality provided by the rte_htimer_mgr_process()
>>>> function match its use cases? Should there be a clearer
>>>> separation between expiry processing and asynchronous operation
>>>> processing?
>>>>
>>>> * Should the patchset be split into more commits? If so, how?
>>>>
>>>> Thanks to Erik Carrillo for his assistance.
>>>>
>>>> Mattias Rönnblom (2):
>>>> eal: add bitset type
>>>> eal: add high-performance timer facility
>