~mht/cmr

f119b37bfe6bbb4f8eb707b303167d7d938d3035 — Martin Hafskjold Thoresen 1 year, 8 months ago f7a9c8f
Rewrite of the README.

Should add perf plots right in here, and fill out some of the sections.
1 files changed, 164 insertions(+), 60 deletions(-)

M README.md
M README.md => README.md +164 -60
@@ 1,86 1,190 @@
# Memory Reclamation for Rust
# CMR - Concurrent Memory Reclamation

CMR is a memory management system for concurrent applications. The system is described in Martin
Hafskjold Thoresen's [master's thesis][thesis]. This file offers a quick outline of the reasoning
behind CMR as well as a rough sketch of how it works.
CMR is a memory reclamation scheme for concurrent systems in Rust. It was
developed for Martin Hafskjold Thoresen's master's thesis[^master] at the
Norwegian University of Science and Technology, in collaboration with Dan
Alistarh's group at IST Austria, and it draws significant inspiration from
Forkscan[^forkscan]. The goal of the project was to explore alternatives to the
popular Crossbeam project for concurrent memory management in Rust.


## Setup
## Overview

[Install Rust here](https://rustup.rs/).
CMR works by splitting the memory space of the program into two: managed
memory, and shared memory. Managed memory holds all objects whose lifetime
is managed by the Rust compiler, like `Box<T>`, `Vec<T>`, or even `Arc<T>`s.
This includes most of the program. Shared memory, on the other hand, is where
the objects that the Rust compiler cannot reason about live. This may include
objects like nodes in a concurrent queue. CMR tracks all objects that live in
shared memory, and is not concerned about the objects that live in managed
memory. CMR keeps track of all pointers to shared memory, and is therefore
able to reason about the lifetime of the objects in shared memory.

Clone the repo, enter it, and run `cargo build`, and you're done.
There are a number of advantages to not having to keep track of all program
memory: for instance, if the program consists of only a small part of shared
memory - and most programs do - then the overhead of CMR only depends on the
size of this memory, and not the size of the total program memory[^fork]. It
also encourages programmers to minimize the use of shared memory, since this
must be explicitly handled (one can draw similarities to `unsafe` code in
Rust).


## Reclaiming Memory

The basic algorithm is as follows:

```
$ git clone git@github.com:ist-daslab/rust-drop-box cmr
$ cd cmr
$ cargo build
reclaim:
  1. have all threads report their roots
  2. collect all roots
  3. find all allocations that are reachable from the roots
  4. free() the allocations that were not found.
```

`1`: this is implemented using POSIX signals: each spawned thread must register
a signal handler for `SIGUSR1`. In this signal handler we write out a pointer
to a thread-local vector containing pointers to all roots this thread
currently has.

`2`: having pointers to the vectors, the reclaiming thread can simply follow
them to obtain all of the root pointers to shared memory.

# Why
`3`: Instead of spending valuable time doing the reachability analysis while
all threads are waiting, we branch off a new process which does this for us.
Then we use `mmap`ed memory to communicate between the child process and a
background thread.  The background thread gets back all pointers that were no
longer reachable from any root, and frees this memory in `4`.
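
Steps `3` and `4` amount to a mark-and-sweep pass. Below is a minimal,
single-process sketch of that idea, assuming a toy heap in which allocations
are identified by integer ids and each allocation lists the ids it points to.
The names `Heap`, `reachable`, and `unreachable` are made up for illustration
and are not part of CMR; the `fork` and signal machinery is left out entirely.

```rust
use std::collections::{HashMap, HashSet};

/// Toy heap: allocation id -> ids of the allocations it points to.
/// (Hypothetical model; CMR tracks real addresses.)
type Heap = HashMap<usize, Vec<usize>>;

/// Step 3: collect every tracked allocation reachable from the roots.
fn reachable(heap: &Heap, roots: &[usize]) -> HashSet<usize> {
    let mut seen = HashSet::new();
    let mut stack: Vec<usize> = roots.to_vec();
    while let Some(id) = stack.pop() {
        // Skip ids that are not tracked, and ids we have already visited.
        if !heap.contains_key(&id) || !seen.insert(id) {
            continue;
        }
        stack.extend(heap[&id].iter().copied());
    }
    seen
}

/// Step 4: every tracked allocation that was not reached may be freed.
fn unreachable(heap: &Heap, roots: &[usize]) -> Vec<usize> {
    let live = reachable(heap, roots);
    let mut dead: Vec<usize> = heap
        .keys()
        .copied()
        .filter(|id| !live.contains(id))
        .collect();
    dead.sort_unstable();
    dead
}
```

Cycles are handled by the `seen` set, so a group of allocations that only
point to each other is still reported as unreachable.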

This project implements memory management for concurrent systems in Rust. This is not in and of
itself novel work; Crossbeam, a popular umbrella project for concurrency, has [crossbeam-epoch],
which uses the popular EBR scheme for memory reclamation. There are also implementations of Hazard
Pointers. CMR is meant to be a concept for an alternative to these schemes.
## Usage

CMR defines its own pointer types `NullablePtr<T>` and `Ptr<T>`. All accesses
to shared memory are done through a `Ptr<T>`. In order to obtain a
`NullablePtr<T>` we must load an `Atomic<T>` into a `Guard<T>`: the `Guard`
guarantees that the location of the loaded pointer is known to CMR. `Guard`s
are not constructed explicitly, but declared using the `guard!` macro. Thus, a
complete read from shared memory can look like this:

```rust
guard!(my_guard);
let nullable_ptr = cmr::guard(my_guard, &some_atomic);
if nullable_ptr.is_null() { return Err(...); }
let ptr = nullable_ptr.unwrap();
let value: &T = &*ptr;
```

The main issues with EBR, at least the way it is implemented in Crossbeam, are that the programmer
has to explicitly retire memory, and that operations using shared memory often include a fair
amount of `unsafe` code. In CMR, programmers do not have to retire (and currently, *cannot* retire)
memory. In addition, users of CMR rarely have to write `unsafe` code (with occasional exceptions
for initialization); see `data-structures` for examples.
The lifetime system of Rust guarantees that the reference `value` is only valid
as long as `ptr`, which in turn guarantees that `my_guard` is not changed, so
CMR knows that this is a reachable pointer. This makes all operations in this
example safe - no `unsafe` usage is required.

# How

CMR acts as a tracing garbage collector. Whenever a thread wants to read a shared pointer, the
thread registers the address in which that pointer is to be stored (which is usually on the stack).
This way CMR has access to all roots in the program at any time. This also allows us to guarantee
that pointers read are either `null` or valid. When it is time to reclaim memory, which all threads
attempt to do every once in a while, we `fork()` off a new process, and find all memory allocations
that are not reachable from any root in the program. This is possible to do safely, as all types
stored in a shared pointer must implement the `Trace` trait, which writes out other pointers to a
buffer.
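
To make the role of `Trace` concrete, here is a rough sketch of what such a
trait could look like. The trait name comes from the text above, but the method
name `trace`, its signature, and the toy `Node` type (which stores addresses as
plain `usize` values) are assumptions made for illustration; the actual
definition in CMR may differ.

```rust
/// Hypothetical stand-in for CMR's `Trace` trait: a value reports the
/// addresses of the shared allocations it points to.
trait Trace {
    fn trace(&self, out: &mut Vec<usize>);
}

/// Plain data holds no shared pointers, so it writes nothing.
impl Trace for u64 {
    fn trace(&self, _out: &mut Vec<usize>) {}
}

/// A toy list node; `next` is the address of the next node (0 means null).
struct Node {
    value: u64,
    next: usize,
}

impl Trace for Node {
    fn trace(&self, out: &mut Vec<usize>) {
        // Trace the payload first, then report our own outgoing pointer.
        self.value.trace(out);
        if self.next != 0 {
            out.push(self.next);
        }
    }
}
```

A reachability pass can then call `trace` on each allocation it visits and
continue from the addresses written to the buffer.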

As an example, we look at `MsQueue::push`:
### Example


As an example, this is from the `push` operation of the Michael-Scott queue:

```rust
pub fn push(&self, t: T) {
    guards!(_new_node, _tail, _next);
    let new_node = cmr::alloc(_new_node, Node::new(t));
    loop {
        let tail = cmr::guard(_tail, &self.tail).ptr().unwrap();
        let next_ptr = &tail.next;
        let ptr = cmr::guard(_next, next_ptr);
        if ptr::addr(ptr) != 0 {
            let _ = self.tail.cas(tail, ptr, SeqCst);
        } else if next_ptr.cas(ptr::null(), new_node, SeqCst).is_ok() {
            let _ = self.tail.cas(tail, new_node, SeqCst);
            break;
        }
    }
pub struct Node<T> {
    data: std::mem::ManuallyDrop<T>,
    next: cmr::Atomic<Node<T>>,
}

pub struct MsQueue<T> {
    head: cmr::SharedGuard<Node<T>>,
    tail: cmr::Atomic<Node<T>>,
}

impl<T> MsQueue<T> {
  pub fn push(&self, t: T) {
      guards!(_new_node, _tail, _next);
      let new_node = cmr::alloc(_new_node, Node::new(t));
      loop {
          let tail = cmr::guard(_tail, &self.tail).ptr().unwrap();
          let next_ptr = &tail.next;
          let ptr = cmr::guard(_next, next_ptr);
          if ptr::addr(ptr) != 0 {
              let _ = self.tail.cas(tail, ptr, SeqCst);
          } else if next_ptr.cas(ptr::null(), new_node, SeqCst).is_ok() {
              let _ = self.tail.cas(tail, new_node, SeqCst);
              break;
          }
      }
  }
}
```

`guards!` declares that we need three memory locations in which to store shared pointers. We
allocate a new node with `cmr::alloc`. When loading the tail, we know that it will not be `null`,
since the MsQueue is non-empty, so we may convert the `NullablePtr` to a `Ptr`. If the `next` ptr
is not `null` (its address is not `0`), we `cas` the `tail` of the queue to our new suspected tail.
If not, we `cas` the `tail` to the newly allocated node.  Note that despite reading shared memory,
this function does not contain any `unsafe` code.
Note that the `unwrap` call after `cmr::guard` will never fail, since the
MS-queue has non-emptiness as an invariant.


## Safety

Usage is guaranteed to be safe, due to the forced registering of the guards,
similarly to that of hazard pointers. A difference between the two is that
with hazard pointers we usually require a check after protecting a pointer to
see whether the data node is still in the data structure we're operating on;
this is not required in CMR, since memory cannot be freed in between reading
the pointer and registering it: `cmr::guard` makes sure of this.
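
For contrast, the hazard-pointer check mentioned above usually takes the form
of a publish-then-validate loop. The sketch below shows that pattern using only
the standard library; `protect` and its parameters are illustrative names, and
this is neither CMR code nor a complete hazard-pointer implementation.

```rust
use std::sync::atomic::{AtomicPtr, Ordering::SeqCst};

/// Hazard-pointer style protection: publish the pointer in our hazard slot,
/// then re-read the source to validate that it had not changed in between.
fn protect<T>(src: &AtomicPtr<T>, hazard: &AtomicPtr<T>) -> *mut T {
    let mut p = src.load(SeqCst);
    loop {
        hazard.store(p, SeqCst);
        let again = src.load(SeqCst);
        if again == p {
            // The pointer was still installed after we published it.
            return p;
        }
        // The source changed under us: retry with the new value.
        p = again;
    }
}
```

With CMR the retry loop is not needed on the user's side, since `cmr::guard`
performs the load and the registration together.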


## Complications

There are a number of technical complications with this scheme: for instance we
are required to `fork` before having the signaled threads return their
execution, since we cannot work with stale data. However, `fork` is not
"signal-safe", meaning it's not safe to call this within a signal handler. The
reason for this is that if a thread was in the middle of e.g. allocating memory
when it was signaled, then internal locks have been taken: these remain held
in the child process, which blocks all allocation in that process.
This was solved by having threads register when they are allocating, and simply
aborting the reclamation attempt should the lock be taken when the threads are
signaled. While this works, it causes significant overhead.
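
The "register when allocating" workaround can be pictured as a per-thread flag
that the reclaimer inspects before forking. This is only a sketch under that
assumption: `ThreadState`, `while_allocating`, and `may_fork` are hypothetical
names, and how CMR actually stores and checks this state is not shown here.

```rust
use std::sync::atomic::{AtomicBool, Ordering::SeqCst};

/// Per-thread state with a flag that is raised while the thread is inside
/// the allocator. (Hypothetical layout, for illustration only.)
struct ThreadState {
    allocating: AtomicBool,
}

impl ThreadState {
    /// Run `f` (the allocation) with the flag raised.
    fn while_allocating<R>(&self, f: impl FnOnce() -> R) -> R {
        self.allocating.store(true, SeqCst);
        let result = f();
        self.allocating.store(false, SeqCst);
        result
    }
}

/// Checked by the reclaimer once all threads have been signaled: forking is
/// only attempted if no thread was interrupted in the middle of an allocation.
fn may_fork(threads: &[ThreadState]) -> bool {
    threads.iter().all(|t| !t.allocating.load(SeqCst))
}
```

If `may_fork` returns `false`, the reclamation attempt is abandoned and
retried later, which is where the extra overhead comes from.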

We also require the thread local vector of roots to be readable and `clear`able
at any time, so we had to write our own vector, the `SignalVec`, which does
exactly this.



## Performance

The big question is how this performs.



## Project Structure

CMR itself is located in the `src` directory. Example data structures are in
`data-structures`, and benchmarks of these are in `benchmarks`. For comparison,
similar benchmarks for Crossbeam and mutexes from the standard library are in
`extern-benchmarks`.

Prior to Rust 1.32, programs usually had `jemalloc` bundled, which caused some
issues with `cmr`: since we are tracking the allocations, but not the types of
the allocation, we pass in wrong `Layout` arguments to the
`std::alloc::dealloc` function. `jemalloc` depended on this to be correct, and
caused a few issues, so we have a hack for it in `jemalloc-free-hack`, ensuring
that we call `jemallocator::ffi::free` instead of the variant taking the layout
as a parameter.
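
For reference, the contract in question is that `std::alloc::dealloc` must be
given the same `Layout` that was used for the allocation. The small sketch
below shows a correct round trip with the standard library API; `roundtrip` is
an illustrative name, and the snippet does not involve CMR or `jemalloc`.

```rust
use std::alloc::{alloc, dealloc, Layout};

/// Allocate a `u64`, write and read it, then free it with the same layout.
fn roundtrip() -> u64 {
    let layout = Layout::new::<u64>();
    unsafe {
        let p = alloc(layout) as *mut u64;
        assert!(!p.is_null(), "allocation failed");
        p.write(42);
        let value = p.read();
        // `dealloc` must receive the layout that `alloc` was called with.
        // A tracker that only knows the address cannot provide it.
        dealloc(p as *mut u8, layout);
        value
    }
}
```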


# Current Shortcomings
## Should *I* use CMR?

We do not run destructors when memory is freed. This turned out to be difficult, as we needed to
decide at runtime whether the type dropped implemented `Drop` or not. This might be possible with
some trick.
While I am happy with how CMR turned out, I don't think I can properly advocate
for using CMR in any setting but for experimenting with Rust. There are a few
reasons for this: (1) due to its coupling with the operating system, it is
really only usable on Linux, (2) its overhead is higher than that of epochs in
Crossbeam, and (3) the project is not actively maintained[^activity].

Most operations that CMR exposes use atomics; these always use the `SeqCst` ordering. This is
mainly done for simplicity, although correctness of the system as a whole may suffer if the user is
allowed to specify more relaxed orderings. Experimental benchmarks show that the performance of CMR
does not suffer from this, at least not on x86.


[crossbeam-epoch]: https://github.com/crossbeam-rs/crossbeam-epoch
[thesis]: https://github.com/martinhath/master-thesis/blob/master/thesis.pdf
[^fork]: Strictly speaking, this is not the case, since we use `fork()` to
  branch out a new process in which we run the reachability analysis. Due to
  the CoW semantics of memory pages as implemented by the operating system, the
  OS has to clone all pages that are changed while the child process still runs.
[^forkscan]: TODO: link here.
[^master]: TODO: link here.
[^activity]: I suppose this does not need to be a hard requirement for a
  library to be used, but it also means that feature requests, like Windows
  support, will probably not be implemented.