~mht/cmr

32abf2d5 — Martin Hafskjold Thoresen 6 months ago
Add license
f119b37b — Martin Hafskjold Thoresen 1 year, 7 months ago
Rewrite of the README.
f7a9c8fc — Martin Hafskjold Thoresen 1 year, 9 months ago
Remove wrong data from lngrad 80 raw

refs

master
browse  log 

clone

read-only
https://git.sr.ht/~mht/cmr
read/write
git@git.sr.ht:~mht/cmr

You can also use your local clone with git send-email.

#CMR - Concurrent Memory Reclamation

CMR is a memory reclamation scheme for concurrent systems in Rust. It was developed for Martin Hafskjold Thoresen's master thesis[^master] at the Norwegian University of Science and Technology, in collaboration with Dan Alistarh's group at IST Austria, and it draws significant inspiration from Forkscan[^forkscan]. The goal of the project was to explore alternatives to the popoular Crossbeam project for concurrent meomry management in Rust.

#Overview

CMR works by splitting the memory space of the program into two: managed memory, and shared memory. In managed memory lives all objects which lifetime is managed by the Rust compiler, like Box<T>, Vec<T>, or even Arc<T>s. This includes most of the program. Shared memory, on the other hand, is where the objects that the Rust compiler cannot reason about lives. This may include objects like nodes in a concurrent queue. CMR tracks all objects that live in shared memory, and is not concerned about the objects that live in managed memory. CMR keeps track of all pointers to shared memory, and is therefore able to reason about the lifetime of the objects in shared memory.

There are number of advantages of not having to keep track of all program memory: for instance, if the program consists of only a small part of shared memory - and most programs do - then the overhead of CMR only depends on the size of this memory, and not the size of the total program memory[^fork]. It also encourages programmers to minimize the use of shared memory, since this must be explicitly handled (one can draw similarities to unsafe code in Rust).

#Reclaiming Memory

The basic algorithm is as follows:

reclaim:
  1. have all threads report their roots
  2. collect all roots
  3. find all allocations that are reachable from the roots
  4. free() the allocations that were not found.

1: this implemented using POSIX signals: each thread spawned must register a signal handler for SIGUSR1. In this signal handler we write out a pointer to a thread local vector, containing pointers to all roots this thread currently has.

2: having pointers to the vectors the reclaming thread can simply do some pointer following to obtain all of the root pointers to shared memory.

3: Instead of spending valuable time doing the reachability analysis while all threads are waiting, we branch off a new process which does this for us. Then we use mmaped memory to communicate between the child process and a background thread. The background thread gets back all pointers that were no longer reachable from any root, and frees this memory in 4.

#Usage

CMR defined its own pointer types NullabelPtr<T> and Ptr<T>. All accesses to shared memory is done through a Ptr<T>. In order to obtain a NullablePtr<T> we must load an Atomic<T> into a Guard<T>: the Guard guarantees that the location of the loaded pointer is known to CMR. Guards are not constructed explicitly, but declared using the guard! macro. Thus, a complete read from shared memory can look like this:

guard!(my_guard);
let nullable_ptr = cmr::guard(&some_atomic, my_guard);
if nullable_ptr.is_null() { return Err(...); }
let ptr = nullable_ptr.unwrap();
let value: &T = &*ptr;

The lifetime system of Rust guarantees that the reference value is only valid as long as ptr, which in turn guarantees that my_guard is not changed, so CMR knows that this is a reachable pointer. This makes all operations in this example safe - no unsafe usage is required.

#Example

As an example, this is from the push operation of the Michael-Scott queue:

pub struct Node<T> {
    data: std::mem::ManuallyDrop<T>,
    next: cmr::Atomic<Node<T>>,
}

pub struct MsQueue<T> {
    head: cmr::SharedGuard<Node<T>>,
    tail: cmr::Atomic<Node<T>>,
}

impl<T> MsQueue<T> {
  pub fn push(&self, t: T) {
      guards!(_new_node, _tail, _next);
      let new_node = cmr::alloc(_new_node, Node::new(t));
      loop {
          let tail = cmr::guard(_tail, &self.tail).ptr().unwrap();
          let next_ptr = &tail.next;
          let ptr = cmr::guard(_next, next_ptr);
          if ptr::addr(ptr) != 0 {
              let _ = self.tail.cas(tail, ptr, SeqCst);
          } else if next_ptr.cas(ptr::null(), new_node, SeqCst).is_ok() {
              let _ = self.tail.cas(tail, new_node, SeqCst);
              break;
          }
      }
  }
}

Note that the unwrap call afte cmr::guard will never fail, since the MS-queue has non-emptyness as an invariant.

#Safety

Usage is guaranteed to be safe, due to the forced registering of the guards, similarily to that of hazard pointers. A difference between the two is that with hazard pointers we usually require a check after protecting a pointer to see whether the data node in still in the data structure we're operating on; this is not required in CMR, since memory cannot be freed in between reading the pointer and registering it: cmr::guard makes sure of this.

#Complications

There are a number of technical complications with this scheme: for instance we are required to fork before having the signaled threads return their execution, since we cannot work with stale data. However, fork is not "signal-safe", meaning it's not safe to call this within a signal handler. The reason for this is that if a thread was in the middle of eg. allocating memory when it was signaled, then internal locks have been taken: these will continue to be taken in the child process, which blocks all allocation in this process. This was solved by having threads register when they are allocating, and simply abort the reclamation attempt should the lock be taken when the threads are signaled. While working, this caused significant overhead.

We also require the thread local vector of roots to be readable and clearable at any time, so we had to write our own vector, the SignalVec, which does exactly this.

#Performance

The big question is how this performs.

#Project Structure

CMR itself is located in the src directory. Example data structures are in data-structures, and benchmarks of these are in benchmarks. For comparison, similar benchmarks for Crossbeam and mutexes from the standard library are in extern-benchmarks.

Prior to Rust 1.32, programs usually had jemalloc bundled, which caused some issues with cmr: since we are tracking the allocations, but not the types of the allocation, we pass in wrong Layout arguments to the std::alloc::dealloc function. jemalloc depended on this to be correct, and caused a few issues, so we have a hack for it in jemalloc-free-hack, ensuring that we call jemallocator::ffi::free instead of the variant taking the layout as a parameter.

#Should I use CMR?

While I am happy with how CMR turned out, I don't think I can properly avocate for using CMR in any setting but for experimenting with Rust. There are a few reasons for this: (1) due to it's coupling with the operating system, it is really only usable on Linux, (2) it's overhead is higher than that of epochs in Crossbeam, (3) the project is not actively maintained[^activity]

[^fork]: Strictly speaking, this is not the case, since we use fork() to branch out a new process in which we run the reachability. Due to CoW semantics of memory pages as implemented by the operating system, the OS have to clone all pages that are changed while the child process still runs. [^forkscan]: TODO: link here. [^master]: TODO: link here. [^activity]: I suppose this does not need to be a hard requiremeny for a library to be used, but it also means that feature requests, like Windows support, will probably not be implemented.