Restartable Sequences (aka RSEQ) are a low-level synchronization primitive for doing per-CPU work in user-space on Linux-based systems. The primitive introduces a way for a thread to detect whether it was interrupted (preempted, migrated to another CPU, etc) in the middle of executing some sequence of code with low overhead; combined with access to CPU-local data, this enables algorithms which scale well for many threads on many CPUs. The RSEQ API revolves around user-space indicating that some block of code (based on instruction address) should be treated specially during interruptions.
Internally used inside Google for a few years [0] as well as EfficiOS [1], the concept was accepted by upstream Linux in June 2018. This document covers the difficulties of correctly handling Restartable Sequences in DynamoRIO as well as inside an arbitrary DR client.
[0] Presentation from 2013 of Google's RSEQ: https://blog.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
[1] Presentation from 2016 of EfficiOS's RSEQ: https://blog.linuxplumbersconf.org/2016/ocw/system/presentations/3873/original/presentation-rseq-lpc2016.pdf
During execution, DynamoRIO rewrites the target application block by basic block, copying these fragments into a new memory space along with any instrumentation that clients add. Since the kernel identifies restartable sequences based on their instruction address, the direct application of the standard execution model of DR would be incorrect due to this changing of actual instruction addresses. In addition, the preemption detection is implemented by the kernel redirecting execution to some arbitrary application address, according to configuration from the application. Similar to signal handling, DR must handle this transparently in order to retain control of the application and execute it correctly.
Since Restartable Sequences are registered using a (relatively new) syscall, we could simply blocklist the syscall and claim that the kernel does not support the feature since most libraries will be written to be backwards-compatible for slightly older kernels. Barring that, DynamoRIO will have to support Restartable Sequences explicitly.
The documentation in the kernel does a good job explaining the feature. Here are some excerpts for quick reference in this document.
A summary of API structures, adapted from the source. See source for full documentation.
struct rseq_cs { __u32 version; enum rseq_cs_flags flags; void *start_ip; /* Offset from start_ip. */ intptr_t post_commit_offset; void *abort_ip; } struct rseq { __u32 cpu_id_start; __u32 cpu_id; struct rseq_cs *rseq_cs; /* Flags to disable aborts in specific cases */ __u32 flags; }
A description of the perceived usecase for an rseq code section:
init(rseq_cs) cpu = TLS->rseq::cpu_id_start [1] TLS->rseq::rseq_cs = rseq_cs [start_ip] ---------------------------- [2] if (cpu != TLS->rseq::cpu_id) goto abort_ip; [3] <last_instruction_in_cs> [post_commit_ip] ---------------------------- The address of jump target abort_ip must be outside the critical region, i.e.: [abort_ip] < [start_ip] || [abort_ip] >= [post_commit_ip]
last_instruction_in_cs
is a single instruction which is the “committing store”. After this instruction is executed, the restartable sequence is considered complete and will not be sent to the abort handler on interruption.abort_ip
if the thread is interrupted while it is between the start_ip
and post_commit_ip
(subject to values in rseq.flags
).There are two main challenges in handling Restartable Sequences in DynamoRIO: identifying the sequences (so that they can be handled specially) and actually running the sequences with the correct semantics.
For the upstream API, to initialize the use of restartable sequences on a specific thread, that thread allocates thread-local memory to contain its struct rseq
and calls a specialized rseq
syscall to register the TLS memory; each thread does this initialization at most once. Upon entry to some restartable sequence a thread simply updates its TLS->rseq::rseq_cs
to point to metadata about the sequence; there is no action the application must take to exit the sequence since the kernel can tell whether the thread left the region by comparing the thread's PC to the start_ip
and post_commit_ip
fields.
Unfortunately, this sort of passive API poses a problem for identifying restartable sequences in an arbitrary application: a sufficiently optimized application may store the address of that field in a register and use it indirectly in subsequent basic blocks. Tracking this correctly in general requires something along the lines of instrumenting memory stores, directly or via something like page protection.
Further, the rseq_cs
field points to some struct rseq_cs
which is itself not necessarily constant. One toolchain may implement restartable sequence support by reusing a single struct rseq_cs
, whereas another might generate a static list of all rseq regions at compile time and implement rseq entry as storing an immediate from some hard-coded offset from a segment register. A JIT-compiling system may generate these sequences dynamically.
Officially there is a user-space convention to include an __rseq_table
section in the binary. It contains static information about the restartable sequences in that binary. It is intended to be useful for debuggers to correctly and efficiently run programs which support rseq. This will be helpful for utilizing DynamoRIO on conforming applications, but it is not useful for an "adversarial" program which is written to sidestep dynamic instrumentation, debugging, etc.
The semantics of a restartable sequence is that it is a piece of code that is run atomically with respect to anything else running on that CPU, with the assumption that it may be aborted at any arbitrary time; otherwise it is a collection of arbitrary basic blocks that happen to be adjacent in the text layout. Since DynamoRIO needs to write its modified version of each application basic block to some other piece of memory, we would need to come up with our own emulation of a restartable sequence or somehow delegate to the kernel.
Emulating restartable sequences - that is, modifying the application code to execute with the same guarantees as a restartable sequence - has been attempted in the past and abandoned as too tricky or too slow to do correctly in the execution model of DynamoRIO (see Rejected Alternatives).
Delegating to the kernel to execute the sequence correctly is complicated by the arbitrary number of separate basic blocks that compose the sequence. It is further complicated by instrumentation added by DR clients. Writing instrumentation which is restartable (from any arbitrary address) is not at all easy. This is perhaps the biggest challenge: emulating restartable semantics for application code with interspersed instrumentation added.
The best solution we have come up with is to ''run the restartable sequence twice'': once with instrumentation but without completion, and again without instrumentation and with completion. We take advantage of the sequence being designed to handle being restarted. By separating instrumentation from restartable semantics we simplify the problem. This solution is not perfect and has some limitations, but it is better than any other solution we have found.
The approach works as follows:
Before we came up with the run twice approach we considered other solutions:
Not investigated much: probably much too slow to set/unset affinities all the time.
The idea is to serialize the sequence, which avoids races, and should work for sequences which do not read the cpu id twice.
Long story short: we tried it, but ran into myriad edge cases which made our various implementations unstable in the face of a large number of threads and unix signals.
Upon detection of a restartable sequence, DynamoRIO could conceivably write its generated fragments to a new location and arrange for all relevant basic blocks to be part of the same contiguous restartable sequence. If we change the model of basic block discovery to be a little more proactive this might be a viable solution, but it would require us to make the DR core and extension instrumentation arbitrarily abortable; then, all clients would need to emit abortable instrumentation in those cases (or explicitly do nothing). This would be a fairly large and difficult endeavor. Separating the instrumentation from the restart and running twice is much easier.
The following sections discuss design decisions for each aspect of implementing the run twice solution. There are many complexities in each area, and we had to limit support to subsets of possible behavior to make it work. This goes against our general philosophy: we would prefer to handle all possible application behavior. However, this is just not possible with rseq. We will only successfully execute a conforming application; a non-conforming or "adversarial" program will likely only work by disabling rseq (see the next section).
The implementation progress is tracked by i#2350.
Given the limitations we are placing on the application, we added a fallback option -disable_rseq
to try and support running applications that violate our requirements. This option returns -ENOSYS to any rseq system call. Applications should have alternate code paths, since not all kernels support rseq. Running with the alternate code path is better than not running at all.
This approach will not work when attaching to an already running application, of course.
We rely on the application convention of including an __rseq_cs
or __rseq_table
section in the binary to allow us to locate each rseq sequence. It contains static information about the restartable sequences in that binary. We require that all sequences are present in this section.
Unfortunately, this section is typically not in a loaded segment. That means that DR needs to go to disk in order to read the section, which is expensive. We avoid this for non-rseq-using applications by lazily initializing rseq support: we wait until we see an rseq system call, also checking for rseq on attach.
Perhaps the first alternative method of detecting rseq regions that comes to mind is write-protecting each thread’s TLS->rseq
struct. While the perceived downside of too much overhead is probably accurate, in fact this actually does not work at all, because the kernel updates the structure after a restart (it clears the pointer to an rseq_cs
struct) and it sends a SIGSEGV if it cannot write to that field. This write could happen at any time, in events not detectible from user mode such as thread preempts.
The instrumented execution is treated as regular DR control over application code, but with memory stores removed during mangling (i.e., post-instrumentation). Thus, a client will not have to take any special actions to handle rseq.
To simplify and reduce overhead in identifying the rseq region, DR forces the start of a sequence to be the start of a new block. We do not support the application entering the sequence from the middle, or executing the sequence as non-restartable code at any point: it must be a dedicated restartable sequence.
Eliding stores is complex for instructions with side effects, which we want to keep. Our initial implementation does not support stores with side effects and we simply remove the entire instruction.
Once the instrumented execution reaches the final block in the sequence, we want to execute the sequence again, for the second time. We need to execute it all in one shot, as one contiguous region, since we need to set it up for restartable semantics with the kernel. We thus make a local copy of the sequence, described further below.
An early version, operating on applications which only used rseq sequences as function bodies, had a simpler method of invoking the second execution: call the native sequence and assume it will simply return back. This has a number of drawbacks and assumptions and has been abandoned as we have moved to support more general rseq code. Until recently support was still in the code base under a temporary option -rseq_assume_call
, as a failsafe in case there were stability problems discovered with the new native execution implementation. We have since removed this code and option as we are happy with the new general scheme.
We need clients to restore the application state back into the machine state prior to the second execution. We accomplish this by ending the block after the committing store of the sequence, which is sufficient for today's drreg. i#3798 covers adding an interface for a more-guaranteed barrier.
We considered targeting the abort handler for the second execution, since there may be state to restore for the restart. However, while many abort handlers do restart the sequence, not all do: some simply abort. Furthermore, the abort handler could be arbitrarily complex, making it difficult to make a copy of it (and we cannot easily run the application's copy and regain control afterward). Thus, we target the start of the sequence.
Targeting the start may mean that application state changed by the first execution is incorrect and would normally be reset by the abort handler. We have elided all memory stores in the first execution, so this state is limited to register state. We checkpoint all general-purpose registers which are written anywhere in the sequence, storing them into special slots on entry and restoring them right before the second execution.
It is possible that the abort handler performs some other setup that is required for a restart. We do not support such corner cases.
We make a local copy of the rseq sequence right inside the sequence-ending block fragment, reached via fall-through from the committing store at the end of the first execution. The sequence is inserted as additional instructions and is then mangled normally (mangling changes are assumed to be restartable), but it is not passed to clients. The mangling is necessary for unreachable rip-relative references, stolen registers on AArchXX, segment references, etc.
The sequence copy can contain branches. Any exits are regular block exits, resulting in a block fragment with potentially many exits. This is already well-supported by DR. Intra-sequence branches are marked as meta to avoid mangling. We do not currently support indirect branches which target addresses inside the same sequence as that is non-trivial to handle.
To make the local copy an rseq region, we need to locate the application's rseq TLS. Unfortunately, the system call was poorly designed and will neither let us query the location nor replace it. Our solution is to assume that the loader's static TLS is used, so every thread has a consistent segment + offset address with the same offset. The offset is identified by watching the rseq system calls. For attach, it is identified by searching the possible static TLS offsets: there are not very many which have the required struct rseq
alignment. We identify the correct offset by comparing error codes from the rseq system call. The assumption of a constant offset is documented and verified on subsequent system calls.
A new rseq_cs
structure is allocated for each sequence-ending fragment. It is stored in a hashtable in the rseq module, to avoid the complexities and overhead of adding an additional fragment_t
or "subclass" field. A new flag is set to trigger calling into the rseq module on fragment deletion.
Alternatives that we considered here include:
rseq_cs
per thread and updating it when entering our sequence copy. This requires making the rseq-final fragment thread-private, since we need a single absolute address for the rseq_cs
.struct rseq
per rseq sequence and swapping to it upon entering our sequence copy. That requires 3 system calls, though: unregister the application's; register ours; re-register the application's. We could avoid the 3rd if we took over registration and only restored the application's on detach, but 2 system calls is still too expensive.rseq_cs
as data embedded in the code cache as part of the fragment. It needs to be 64-byte-aligned, so it would take up a lot of space. DR does not have ready-made support for this. This won't work with execute-only memory (i#3796). It also complicates decoding the fragment from the cache: we would probably have to store it at the bottom of the fragment, like we do with selfmod copies.rseq_cs
in thread-private heap pointed at with a field added to fragment_t
. This simplifies the lifetime issues, making it easy to free the rseq_cs
when the fragment is freed. It wastes space, though, with a pointer field for every fragment. We could "subclass" fragment_t
to make a special type, but that adds complexity with many accesses needing to check a flag.rseq_cs
structures handed out, stored in the rseq_areas
list. This requires hooks into fragment deletion. We decided it was simpler to just have a global hashtable once we have those hooks.To avoid crashing due to invalid rseq bounds after freeing the rseq_cs
structure, the rseq pointer is cleared explicitly on completion, and on midpoint exit by the fragment deletion hook along with a hook on the shared fragment flushtime update, to ensure all threads are covered. A shared fragment will not be deleted until all threads have gone through the flushtime hook.
The rseq_cs
's abort handler is a new exit added with the application's signature (obtained by decoding from a known abort handler) as data just before it, hidden in the operands of a nop instruction to avoid problems with decoding the fragment. A local jump skips over the data and exit.
We need three cache addresses to store in our rseq_cs
fields: the start, offset of the committing store's endpoint, and the abort handler. With the offset being used instead of the end address, extra arithmetic is needed, so we cannot easily rely on instr_t operands alone.
We considered several options for how to get cache addresses of the copied region:
rseq_cs
(see above) this problem goes away, but we rejected that solution.patch_list_t
support to emit_fragment? This would take some work, changing how we emit instructions for a fragment. (Today these patch lists are only supported on separately generated code such as context switch code.)We went with the final option: the rseq_cs
fields are filled in via a new post-emit control point, using information stored in labels during mangling. The pointer to the rseq_cs
is inserted with a dummy value and patched in this new control point using a new utility routine patch_mov_immed_ptrsz().
The patching has its own complexities. It's easiest to use insert_mov_immed_ptrsz() to point at therseq_cs
. But with it being heap-allocated, that means we have to allocate it during mangling. That complicates uses of mangle that do not emit a real fragment (translation via recreation, e.g.) or block building aborts, leading to:
bb->ilist
, ensuring we free rseq_cs
on all non-emit-fragment paths.Would it be simpler to emit a placeholder immediate and do the allocate + patch in the emit fragment hook? We'd have to insert a 2-instr immediate just in case we need it, and patch an n-byte nop over the 2nd instr? There's no support today to patch a ptrsz immed: we would have to add it. Xref the patch_list_t also mentioned above. Here we have the emitted instrlist_t, so we have instr boundaries.
The decision was to go with the patching strategy (adding patch_mov_immed_ptrsz()). But we still have a problem: for translation the ilist will be mangled but not patched and so we'll have an ilist with a 2nd mov-immed vs code cache with nops. Proposal: since we're putting the immed into a reg, for x86 it should always be a single instr, and for AArchXX instrs are the same size so nops shouldn't change anything (we use the right thumb or arm nops for 32-bit). (An alternative would be, instead of nop-ing, to just make it a mov-immed of zero to the upper bits.)
To test that our copied rseq sequence is properly registered, we need to execute a restart. That means forcing either a preempt, a migration, or a signal during the sequence. System calls are supposedly disallowed inside the sequence (there is an ifdef-ed check for them in the kernel), which makes it hard to force preemption or migration. The simplest solution is a synchronous signal triggered by an instruction in the sequence, something like ud2a
causing a SIGILL.
We want to test both running to completion and a restart, so I have our tests use a parameter specifying whether to raise the SIGILL. In the test, the code runs each rseq sequence twice, once with and once without the parameter set, to test both completion and restart.
Adding AArchxx support is future work: the patch_mov_immed_ptrsz(), the writes to the rseq struct in TLS, and the rseq tests are currently x86-only.
We do not yet perfectly handle a midpoint exit during the instrumentation execution: we need to invoke the native version as well, for sequences where state changes prior to the exit are examined post-exit. We may want to leverage the __rseq_exit_point_array
section for this.
We do not yet handle indirect branch exits out of a sequence.
We are currently ignoring the flags which indicate which triggers (out of preempt, signal, and migrate) should cause the abort handler to be called. We blindly run a second time even if the preempt and migrate bits are not set, which the application may not expect without a signal arriving or may expect to only happen in a fatal error condition.
There are numerous limitations in our current “run rseq twice” solution. As mentioned above, this goes against our general philosophy: we would prefer to handle all possible application behavior. However, this is just not possible with rseq. We will only successfully execute a conforming application; a non-conforming or "adversarial" program will likely only work by disabling rseq.
The biggest limitation is that we only support applications using toolchain support to allow us to statically identify all restartable sequences, and using static thread-local storage with consistent slots across threads.
We have a detailed list of other limitations in our documentation.
Parts of this page (such as some background and some implementation details) are based on other documents and research by Derek Bruening. Xref i#2350.
The discussion of challenges and proposed solutions are based on work and discussions with Derek Bruening, Hendrik Greving, and Kevin Chowski.
Other citations are included inline.