7ae769f950ab41947cf13f9ebae994c723253199 — Martin Hafskjold Thoresen 5 years ago 046e255
Write the first part of Integer strucutres. Fusion trees are missing
1 files changed, 124 insertions(+), 1 deletions(-)

M book.tex
M book.tex => book.tex +124 -1
@@ 1269,11 1269,134 @@ which was the claim we needed to prove.

\chapter{Integer Structures}
\topics{van Emde Boas, x-fast trees, y-fast trees, fusion trees.}

van Emde Boas, x-fast trees, y-fast trees, fusion trees.
In this chapter we look at data structures operating on integers in the word machine.
By limiting ourselves to only act on integers we can achieve time bounds as a function of the word size.

The general problem we want to solve is \algo{Predecessor}: we maintain a structure containing $n$ words,
and would like to support insertion, deletion, and predecessor and/or successor.
The universe $U$ we get the integers from is asusmed  to be a power of two $2^w$; we call $w$ the word size.
We also assume that $n \leq \log w$, that is, that the number of elements in the structure is representable in a word.

In a comparison based model, the precesessor problem has a lower bound of $\Theta(\log n)$.
Table~\ref{tab:integer-structures} summarizes the structures we will look at, with time and space complexities.
Note that vEB with hashing or y-fast trees are faster than fusion trees if $w$ is small.

    \begin{tabular}{c c c}
        Name & Operation & Space\\\midrule
        van Emde Boas Trees & $O(\log w)$ & $\Theta(U)$\\
        vEB + Hashing & $O(\log w)$ whp. & $\Theta(n)$\\
        y-fast trees & $O(\log w)$ whp. & $\Theta(n)$\\
        Fusion trees & $O(\log_w n)$ whp. & $\Theta(n)$\\
    The integer structures we look at in this chapter.

\section{van Emde Boas Tree}%
The general idea of the van Emde Boas Tree is to try to obtain the time recurrence $T(n) = T(\sqrt(n)) + O(1)$.
We can do this by splitting the universe into $\sqrt{u}$ clusters, each of size $\sqrt{u}$, and recurse on the cluster.
For universes that are a power of 2, this means splitting the bits of the number if half.
Consider a word $a = \big<c, i\big>$;
we call the most significant bits $c$ and the least significant bits $i$.
This is simply to do in the word machine: we simply mask out the lower bits to get $i$, and we shift down and mask to get $c$.

The van Emde Boas tree is a recursive structure.
At each level we have four things:
$\sqrt{u}$ \emph{Clusters}, a vEB trees of size $\sqrt{u}$ which signals which of the clusters are empty;
one \emph{Summary}, which is also a vEB tree of size $\sqrt{u}$;
the minimum element in the tree; and the maximum element in the tree.
The minimum element is not stored elsewhere in the tree, but the maximum element is.

With this layout, we can see how we will do the recursion:
we index the cluster using $c$, the most significant bits, and then continue by using $i$ in the next query.
We will call the vEB tree $v$, such that $v.cluster[0]$ is the first cluster in the current tree.
We assume no key duplication.

If the queried element is smaller than $v.cluster[c].max$, we find the successor for $i$ in $v.cluster[c]$.
When returning we also have to ``rebuild'' our number, by appending $c$ in front.
If the element is larger than $v.max$, we have to find the next non-empty cluster, and select the minimum element in it, since
this will be our successor. Note that we know one such element will exist, since $x < v.max$.
To find this we do a sucessor query of $c$ in $v.summary$.
When rebuilding we now have to use $c'$, the successor of $c$ in the summary, instead of $c$.

In total, we get one recursive call in either case.

At first this might seem simple. We find the correct cluster, and recurse into it. However, we also need to update the summary, if
the cluter was empty.
This makes for \emph{two} recursive calls, which we cannot afford, since $T(u) = 2T(\sqrt{u}) + O(1)$ solves to $O(\log u)$, rather than $O(\log \log u)$ which is what we want ($w = \log u$).
However, we observe that if the cluster is empty, the insert call to it will be trivial since we only set $v.min$ which is not stored elsewhere in the tree.

Deleting an element is slightly more complex.
First we check if $x = v.min$.
If the tree only contains one element we remove $min$ and $max$, and return;
else we take the min from the first cluster, which we find by taking the min in the summary,
and set \emph{our} min to be this min.
Since min keys are only stored in one place, we continue with $x = v.min$;
that is, by overwriting $v.min$ we deleted the original queried key, but
we have to clean up by deleing the key we just copied.

Now we are either still looking for the original $x$, or we have changed $x$.
In any case, we recurse on cluster $c$, and delete $i$.
After deleting, we have to check if the cluster we recursed on got its $min$ removed,
because we then have to mark it in the summary.
Like with insert, we risk having two recursive calls here, but again one of these calls will be trivial;
this time it is the first call that is trivial, since if we removed the last element in the tree we just
set $v.max = v.min = None$, and did no further recursing.

\section{Saving Space}
The strucutre presented in Section~\ref{sec:vEB} takes $O(U)$ space, which is a lot.
We look at ways to reduce this.

The obvious way to shave off some space is to not store the clusters in an array, but in a hash table.
This way we only pay for the tables we actually use.
Somehow, this gives us a space bound of $O(n \log w)$, which can be improved to $O(n)$ using indirection.

\subsection{Tree View}
We can look at the van Emde Boas tree in a different view.
Consider a balanced binary tree of height $w$, where the leaf nodes are element markers (that is, the path from the root to a leaf is the number,
and the leaf is either 0 or 1, if the element is not or is in the tree).
The internal nodes are the \texttt{OR} of their children.
The upper half (in terms of height) of the tree can be considered at the Summary, and each subtree as one cluster.

In this structure, updates are $O(w)$. Queries can be done faster by noticing that a path from a leaf to the root is monotone:
it is a string of 0s followed by 1s (unless the tree is empty),
so we can binary search it to find the transition point.
Then, we can look at the other child of the node that has the first $1$ in the path,
and get the min or max, depending on if it is to the left or right.
If all subtrees store this, this is constant.
What if we are looking for successor, but get predecessor from this method?
If all 1 leaves store pointers to the next and previous 1 leaves, this can also be done in constant time.

\subsection{X-fast Trees}
We take inspiration from the Tree view in the previous section to build a X-fast tree,
where we store all of the 1s in the tree, as binary strings, in a hash table.
This lets us perform the binary search.
Updates are still $O(\log w)$, since we have to search through the path, which is $w$ long,
queries can use the same binary search trick, to get a time of $\Theta(\log w)$, and
space is $O(nw)$.

\subsection{Y-fast Trees}
We use indirection on the X-fast tree to get faster updates and smaller space.
Split the tree into two structures, an upper and lower structure.
The upper structure consists of $O(n/w)$ of the tree nodes, and the bottom ones are BBSTs of size $O(\log w)$.
Queries are $O(\log w)$ in both structures, but we make updates $O(\log w)$ amortized, since
for each $w$ update in the bottom stucture we only need one update in the top structure.
Now the space is only $O(n/w w + n) = O(n)$. Linear space!

\section{Fusion Trees}
\todo{write this :)}

\chapter{Succinct Structures}
Rank, Select