~mht/ads-book

3f6946d82b76eb85386823cc5319ff32abb39c68 — Martin Hafskjold Thoresen 5 years ago 56d3544
Write about concurrency
1 files changed, 246 insertions(+), 4 deletions(-)

M book.tex
M book.tex => book.tex +246 -4
@@ 20,6 20,9 @@
\usepackage{subcaption}
\captionsetup[subfigure]{subrefformat=simple,labelformat=simple}

\usepackage{listings}
\input{listingoptions.tex}

\usepackage{lipsum}

\title{Advanced Data Structures}


@@ 34,8 37,7 @@
\newcommand{\LCA}{$\textsc{LCA}$}
\newcommand{\LA}{$\textsc{LA}$}


\newenvironment{example}[0]{{\parindent=0em \textbf{Example:}}}{\vspace{1em}}
\newenvironment{example}[0]{{\vspace{1em}\parindent=0em \textbf{Example:}}}{\vspace{1em}}
\newenvironment{definition}[1]{{\parindent=0em \textbf{Definition} --- \emph{#1}:}}{\vspace{1em}}

\DeclareMathOperator*{\argmin}{arg\,min} % thin space, limits underneath in displays


@@ 1635,12 1637,252 @@ and its parent is in position $Select(\floor{i/2})$,
similar to in a binary heap.


\chapter{Concurrency}

\topics{Locks, Lock-free structures, lists, priority queues.}

In this chapter we will look at primitives and data strucutres that handles multiple threads acting on them at the same time.
Multithreaded programming is extremely error prone, and implementing the primitives and data strutcures even more so.
For this reason, we will in this chapter write out pseudocode, instead of an english description.

\section{Mutual Exclusion}
\label{sec:mutex}
Imagine we would like to find the sum of an array.
The sequential way to do this is simple. Let $s := 0$, and for each $i$ to up $n$, read $a_i$, add it to $s$, and write to $s$.
We note that in this simple example we \emph{load} and \emph{write} variables;
this is rarely though of in single threaded code, but it is paramount in multithreaded code.
We can try the same approach for the parallel version.
Assume we have $n$ threads, so that one threads job is to add $a_i$ to $s$ for its i.
Now, let $R_1(x)$ and $W_2(x)$ denote that thread 1 reads the variable $x$ and that thread 2 writes the variable $x$ to a shared storage respectively.
Note that we differentiate between \emph{shared} and \emph{local} storage for the threads.
Processor 1s instructions will be $R_1(a_i),\ R_1(r),\ W_1(r)$; note that we have omitted the addition, since it is local.
Processor 2s instruction is similar.
Now, we assume that the actual outcome of this computation is the same as some interleaving of the two threads instruction streams.
Since the interleaving is not defined, we must make sure that we obtain the correct answer no matter how we interleave the streams.
However, consider the following interleaving:
\[R_1(a_i),\ R_1(r),\ R_2(a_j),\ R_2(r),\ W_1(r),\ W_2(r)\]
Both threads read its $a$ and $r$, then to the adding, and then write out to $r$ again.
This is a problem, since $a_i$ will be lost in the sum, as the local $r$ thread 2, which is the value that
gets written out last and which contains $a_j$ does not contain $a_i$, since $r$ was read before thread 1
had written it out.
Note that this problem is just as real if we only have two threads, and each thread gets an interval of the array.

We would like the consistency of sequential execution without giving up the speed increases of parallel execution.
The simplest way to solve our problem above is by using a \emph{Mutex}.
A mutex is a synchronization primitive that has the following properties:
let a \emph{Critical Section} be the interval of the instruction stream which is surrounded by a mutex $lock$ and $unlock$;
a mutex ensures non-overlapping execution of critical sections, and in any non-trivial suffix of the
linearized instruction stream (that is, the interleaved stream),
some $lock$ or $unlock$ call to the mutex succedes.

\subsection{Petersons Mutex Algorithm}
We show an algorithm for making a mutex supporting two threads using only $Read$ and $Write$.
We have three shared variables: $x$, $he\_wants$, and $she\_wants$.
$x$ is the value we would like to use exclusively.
In addition, we have a C-style \texttt{enum}, $turn$, which is either $his$ or $hers$.

\begin{figure}[ht]
    \centering
    \begin{subfigure}{0.45\textwidth}
        \begin{lstlisting}[title={Alices code}]
she_wants = T
turn = his
while he_wants and turn=his
    spin
{
    access(x)
}
she_wants = F
        \end{lstlisting}
    \end{subfigure}
    \begin{subfigure}{0.45\textwidth}
        \begin{lstlisting}[title={Bobs code}]
he_wants = T
turn = hers
while she_wants and turn=hers
    spin
{
    access(x)
}
he_wants = F
        \end{lstlisting}
    \end{subfigure}
\end{figure}

In terms of $lock$ and $unlock$, the code before the curly brackets is $lock$, and the code after is $unlock$.
We show that this implements a mutex for two threads, by contradiction.
Assume that the two threads are both inside the critical section.
Without loss of generality, let Bob be the last writer of $turn$: $W_A(turn) \leq W_B(turn)$.
We can clearly see from the code that $W_A(she\_wants) \leq W_A(turn)$, which implies % and $W_B(he\_wants) \leq W_B(turn)$,
$W_A(she\_wants) \leq W_B(turn)$.
Since $W_A(she\_wants) \leq W_A(turn) \leq W_B(turn) \leq R_B(she\_wants)$, $B$ will read $she\_wants=T$.
In addition, since we let $B$ be the last writer of $turn$, we know that $turn=hers$,
which means that both conditions for the \textbf{while} loop is fulfilled. $B$ will spin!.
Now $B$ will spin until $A$ has gone out of her critical section, and writes $she\_wants=F$.
Then $B$ may enter his critical section, but now $A$ is already done!
No overlap of the critical sections, which is a contradiction.

\section{Lock-Free Algorithms}
What does it mean for an algorithm or data structure to be ``Lock-Free''?
Lock-free code contains special instructions, the most notable of which is the $Compare\text{-}and\text{-}Swap$, or $CAS$ for short.
$CAS$ takes three arguments, an address, an old value and a new value. It checks whether the value stored at the
address given is equal to the old value, and if it is, it writes the new value and returns $T$; if
the value is not equal it returns $F$.
This might not seem special at all, but the key is that it happens \emph{atomically}; that is, we do not have the
problem from Section~\ref{sec:mutex} where one thread may overwrite anothers data.

\begin{example}
    It is possible to make a lock using the $CAS$ instruction:
    Note that we do not need to $CAS$ the write at line 7, since the thread from the critical section
    is the only thread that will change the value of $lock$ when it is set to $F$.
    \begin{lstlisting}[title={CAS-Lock}]
lock = F
while !CAS(&lock, F, T)
    spin
{
    $\dots$
}
lock = F
    \end{lstlisting}
\end{example}

\begin{example}
    We make a map-fold style reducer using the $CAS$ instruction.
    $result$ is the shared variable for the accumulated result.
    We imagine that the threads all get a different slice of the array, which they loop over.
    \begin{lstlisting}[title={CAS-Lock}]
for f in array
    tmp = compute(f)
    do {
        old = Read(result)
        new = result + tmp
    } while !CAS(&result, old, new)
    \end{lstlisting}
\end{example}

We prefer lock-free to locking, because lock-free algorithms does not block threads:
we cannot rely that all threads get a somewhat balanced running time in an interval;
some threads may run for a while, and then stop for a while.
If such a thread has aquired a lock before pausing for a long time, all threads
which need that lock simply have to wait. One could imagine a scheme which allows the threads
that actually do work to somehow be prioritized over threads that are more idle.
In a sense this is what $CAS$ allows us to do: all threads may read at any time, and
if they are to write, they have to check that nothing has changed during the time from they
first read the value to now when they want to write back a new value.

We also note a property of the previous example: it will not block; that is,
no execution will cause $result$ to stay the same for a long time (given that \emph{some} thread gets execution time).
To see why, let us first assume that the domain of $compute$ is non-negative.
We note that the only reason that the $CAS$ should fail is if $result$ has been increased by some other thread.
So while it is possible that a given thread will loop indefinitely if $result$ always keep chaning, this also
means that some \emph{other} thread is doing great in computing its values.


\section{Treiber Stack}
We show a lock-free stack implementation know as a \emph{Treiber Stack}.
The idea, which is a popular one in lock free data structures, is to build the stack out of a linked.
The stack is just a pointer to the head node, which is the top of the stack. A node has a value and a pointer to the next node.
The code for $\textsc{Push}$ and $\textsc{Pop}$ is shown below.

\begin{figure}[ht]
    \centering
    \begin{subfigure}{0.45\textwidth}
        \begin{lstlisting}[title={Push(v)}]
n $= \textsc{Make-Node}(v)$
do {
    n.next = Read(head)
} while !CAS(head, n.next, n)

        \end{lstlisting}
    \end{subfigure}
    \begin{subfigure}{0.45\textwidth}
        \begin{lstlisting}[title={Pop()}]
curr = Read(head)
while curr {
    if CAS(head, curr, curr.next)
        break
    curr = Read(head)
}
return curr
        \end{lstlisting}
    \end{subfigure}
\end{figure}

The idea of $\textsc{Push}$ is to first make the node that is the new head.
Then we set the $next$ pointer to be the head of the stack, such that we only
need to update the stacks head pointer. If this update fails, we update the
$next$ pointer of our local node, and retry.
Intuetivaly we understand that this operation is safe, since the stack itself
is not changed until the $CAS$ is successful, and we are done.

In $\textsc{Pop}$ we read the stacks head pointer, and try to $CAS$ the head pointer
to the next node in the stack. If we fail, someone else have either inserted or removed an element from
the stack, so we have to read the head again.
If we succeed, our new node is not reachable for anyone else, so we do not need to do anything special;
we can simply return the node.


\section{Concurrent Lists}
Next, we want to make a concurrent list, without locks.
We consider an implementation similar to the Treiber Stack, where we make a node on insert, and $CAS$ it into place,
and similarily with deletions.

However, with a little bit of imagination, we run into trouble.
Consider a list with nodes $A, B, C, D$, where the first and last node are sentinel nodes, such that nodes with a value
never has a $null$ pointer, and that the lists head pointer is never a $null$ pointer.
Assume we want to insert the node $B'$ between $B$ and $C$.
We start out by setting $B'.next = C$.
But then another thread comes in and deletes the node $B$, by $CAS$ing $A.next = C$.
Then the first thread resumes its execution, and $CAS$es $B.next = B'$, which succedes.
The insert will look like it succedes, but the node that precedes the new node in the list is
in fact not reachable from the head node!

Another case: we have the same list.
Thread 1 wants to delete $B$, thread 2 wants to delete $C$.
Thread 2 starts out and finds the $B$ node, which it wants to $CAS$ the next pointer on.
But then, Thread 1 comes in and $CAS$es $A.next = C$, which succedes.
Thread 2 $CAS$es $B.next=D$, and both threads returnes sucecsfully.
When we are done, the list is $A, C, D$, so we have just removed a single element namely $B$,
even though both delete calls was ``succesfull''.

This illustrates why lock free programming is hard;
a seemingly simple task has a lot of pitfalls, and recognizing the pitfalls is not trivial.

What can we do to handle these issues?
If we allow \emph{some} locking, we can use the ``hand-over-hand'' approach with locks.
We note that inserting a note just requires locking the nodes around the place
where the new node should be.
If we want to insert between $C$ and $D$ we can lock $A$ and $B$, then release $A$ and lock $C$,
and then release $B$ and lock $D$.
Now we know that the $next$ pointer of $B$ will not change, and we can safely insert the new node.
Note that this also requires locking of delete, but in a slightly different manner:
on delete we need to lock the node we are deleting as well as its predecessor,
so that we do not risk deleting a node we are using for insertion.

$O(n)$ locks are problematic in terms of real world performance.
Is there some way to avoid this?
We can search through the list and find the nodes we need, lock them, and then
search \emph{again}, to ensure that they are still reachable from the head.
Now we are doing a double traversal of the list, but only require two locks.
Whether this is faster or slower depends on use case.

Another approach is to use the least significant bits of the pointers to store information;
these are usually 0 for alignment reasons.
We can then ``tag'' the $next$ pointer of the node we want to use, or delete,
in order for other $CAS$es to fail.












\chapter{Concurrency}

Locks, Lock-free structures, lists, priority queues.

\printbibliography%