56d35440f352d840f8e14eb2b6c70f1ed4308e69 — Martin Hafskjold Thoresen 5 years ago 931388c
Write succinct
1 files changed, 87 insertions(+), 2 deletions(-)

M book.tex
M book.tex => book.tex +87 -2
@@ 1300,7 1300,7 @@ Note that vEB with hashing or y-fast trees are faster than fusion trees if $w$ i

\section{van Emde Boas Tree}%
The general idea of the van Emde Boas Tree is to try to obtain the time recurrence $T(n) = T(\sqrt(n)) + O(1)$.
The general idea of the van Emde Boas Tree is to try to obtain the time recurrence $T(n) = T(\sqrt{n}) + O(1)$.
We can do this by splitting the universe into $\sqrt{u}$ clusters, each of size $\sqrt{u}$, and recurse on the cluster.
For universes whose size is a power of 2, this means splitting the bits of the number in half.
Consider a word $a = \big<c, i\big>$;
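The split of a word into its cluster index and in-cluster index can be sketched as follows (a minimal sketch; the helper names `split` and `combine` and the Python rendering are my own, not from the book):

```python
def split(a, w):
    """Split a w-bit word a = <c, i> into the cluster index c (high
    half of the bits) and the index i within that cluster (low half)."""
    half = w // 2
    c = a >> half               # high bits: which of the sqrt(u) clusters
    i = a & ((1 << half) - 1)   # low bits: position inside the cluster
    return c, i

def combine(c, i, w):
    """Inverse of split: reassemble a = <c, i>."""
    half = w // 2
    return (c << half) | i
```

For example, with $w = 8$ the word `0b10110110` splits into cluster `0b1011` and in-cluster index `0b0110`.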

@@ 1551,7 1551,92 @@ At last, we combine the two indices (cluster index and internal index to the clu

\chapter{Succinct Structures}
Rank, Select
\topics{Rank, Select}

In this chapter we will look at ways to store data using as little extra space as possible.
These structures are often static, and the one we will look at, the bit vector, will indeed be static.

We divide the space hierarchy for data structures into three levels:
implicit data structures, which use the information-theoretical optimum space plus a constant number of bits;
succinct data structures, which use the optimum space plus $o(OPT)$ extra bits; and
compact data structures, which use $O(OPT)$ space.

\section{Bit Vector}
A Bit Vector is an array of bits.
We want to support the following operations:
$Rank(i) = $ the number of 1s in the interval $[0, i]$, and
$Select(i) = $ the position of the $i$th 1.
In a way, Select is the inverse operation of Rank.
We also want the queries to be fast while staying within the space bounds we have set.
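To pin down the semantics, here is a minimal $O(n)$-per-query reference implementation of the two operations (the function names are mine; the structures described below answer the same queries in constant time):

```python
def rank(bits, i):
    """Number of 1s in bits[0..i], inclusive."""
    return sum(bits[:i + 1])

def select(bits, i):
    """Position of the i-th 1 (1-indexed); None if there are fewer than i."""
    count = 0
    for pos, b in enumerate(bits):
        count += b
        if b and count == i:
            return pos
    return None
```

Note the inverse relationship: if `bits[i] == 1`, then `select(bits, rank(bits, i)) == i`.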

We would like to precompute queries on chunks of size $\frac{1}{2}\log n$.
Since this is a bit vector, the total number of different such chunks is $2^{\frac{1}{2} \log n} = \sqrt{n}$.
The number of different possible queries on such a chunk is $\frac{1}{2} \log n$,
and each answer requires $\log\log n$ bits to represent.
This means that a lookup table of all possibilities takes
$\sqrt{n}\cdot \frac{1}{2} \log n\cdot \log\log n = o(n)$ bits.
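The table can be sketched like this (a sketch only; `build_rank_table` and the choice to represent a chunk as an integer with bit 0 as its first position are my own):

```python
def build_rank_table(b):
    """For every possible b-bit chunk (2^b of them, b = 1/2 log n),
    precompute the answer to every in-chunk rank query."""
    table = [[0] * b for _ in range(1 << b)]
    for chunk in range(1 << b):
        ones = 0
        for i in range(b):
            ones += (chunk >> i) & 1
            table[chunk][i] = ones  # rank within the chunk up to position i
    return table
```

Any in-chunk rank query then becomes a single indexed lookup, `table[chunk][i]`.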

Next we divide the vector into chunks of size $\log^2 n$, and store the cumulative rank at the chunk boundaries.
Since there are $n / \log^2 n$ chunks and we need $\log n$ bits to store each rank, we use $n / \log n = o(n)$ extra bits.
We now need to handle the chunks of size $\log^2 n$, which are slightly too big for the lookup tables.
We use indirection a second time!

Split each chunk into subchunks of size $\frac{1}{2} \log n$.
This time, in between the subchunks we store the \emph{local} cumulative rank instead of the global one.
This is what distinguishes this approach from simply dividing the entire vector into chunks of size $\frac{1}{2} \log n$.
Now, since the size of a chunk is $\log^2 n$, the local rank needs only $\log \log n$ bits,
and there are $\frac{\log^2 n}{\frac{1}{2} \log n} = 2 \log n$ subchunks,
so we use $2 \log n \cdot \log\log n$ bits for each chunk.
Again, there are $\frac{n}{\log^2 n}$ chunks, so the total extra space we use is:

\[ \frac{n}{\log^2 n} \cdot 2 \log n \cdot \log \log n  = \frac{2n \cdot \log \log n}{\log n} = o(n)\]

On queries, we take the rank of the chunk plus the relative rank of the subchunk within the chunk,
and then the relative rank of the element within the subchunk, which is precomputed in the lookup table.
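The whole two-level scheme can be sketched as runnable code (a sketch under my own naming; the sizes are parameters here, whereas the analysis uses $\log^2 n$ and $\frac{1}{2}\log n$, and the final scan stands in for the table lookup):

```python
class RankBitVector:
    """Two-level rank structure: global ranks at chunk boundaries,
    chunk-local ranks at subchunk boundaries."""

    def __init__(self, bits, chunk=16, sub=4):
        assert chunk % sub == 0   # subchunks must tile the chunks
        self.bits = list(bits)
        self.chunk, self.sub = chunk, sub
        self.chunk_rank = []      # global cumulative rank at chunk starts
        self.sub_rank = []        # chunk-local rank at subchunk starts
        total = local = 0
        for i, b in enumerate(self.bits):
            if i % chunk == 0:
                self.chunk_rank.append(total)
                local = 0         # local counters reset at chunk boundaries
            if i % sub == 0:
                self.sub_rank.append(local)
            total += b
            local += b

    def rank(self, i):
        """Number of 1s in bits[0..i]: chunk rank + local subchunk rank
        + an in-subchunk scan (a table lookup in the real structure)."""
        start = i // self.sub * self.sub
        return (self.chunk_rank[i // self.chunk]
                + self.sub_rank[i // self.sub]
                + sum(self.bits[start:i + 1]))
```

Storing the \emph{local} rank is what keeps the second-level counters down to $\log\log n$ bits each.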

It is possible to get Rank in $O(n/\log^k n)$ bits for any $k=O(1)$.

For Select, we store an array of the indices of every $(\log n \cdot \log \log n)$-th 1-bit.
That is, instead of dividing the vector into chunks of equal size, we divide it into chunks containing the same number of 1s.
With this we can find out which chunk a query falls in, in constant time, but we still need to search inside the chunk.
In addition, the chunk sizes are now different.
Let $r$ be the size of a given chunk.
We consider different sizes of $r$.

If $r \geq {(\log n \log\log n)}^2$ we can afford to store all the answers directly:
there are $\log n\log\log n$ queries for the chunk (one for each 1), and each answer, an index, takes $\log n$ bits,
so the cost for \emph{one} such chunk is $\log^2 n \log\log n$ bits.
Since the size of such a chunk is at least ${(\log n \log\log n)}^2$, there cannot be more than
$\frac{n}{{(\log n \log\log n)}^2}$ of them, so the total size of this scheme over the entire bit vector is
$\log^2 n \log\log n \cdot \frac{n}{{(\log n \log\log n)}^2} = \frac{n}{\log\log n}$ bits.

If $r \leq {(\log n \log\log n)}^2$ we store the relative index of every ${(\log\log n)}^2$-th 1-bit within the chunk.
We get at most $\frac{n}{{(\log\log n)}^2}$ subchunks, and each entry needs only $\log\log n$ bits, since the index is relative,
which makes the total cost $\frac{n}{\log\log n}$ bits.
Again we consider different sizes $r$, now of the subchunks:
if $r \geq {(\log\log n)}^4$ we store all the answers as relative indices:
\[O(\frac{n}{{(\log\log n)}^4} \cdot {(\log\log n)}^2 \log\log n) = O(\frac{n}{\log\log n}) \text{ bits.}\]
If not, the subchunks are so small that we can use the lookup table.

We end up using $O(\frac{n}{\log\log n})$ bits, while still having constant query time.
As with Rank, it is possible to get Select in $O(n/\log^k n)$ bits for any $k=O(1)$.
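The first level of this Select structure can be sketched as follows (a sketch only; `build_samples` is my own name, the parameter $k$ stands in for $\log n \cdot \log\log n$, and the final scan stands in for the recursive subchunk structure):

```python
def build_samples(bits, k):
    """Positions of the k-th, 2k-th, 3k-th, ... 1-bits."""
    samples, ones = [], 0
    for pos, b in enumerate(bits):
        ones += b
        if b and ones % k == 0:
            samples.append(pos)
    return samples

def select(bits, samples, k, i):
    """Position of the i-th 1 (1-indexed): jump to the last sampled
    1-bit before it in O(1), then search within the chunk."""
    j = (i - 1) // k                       # full k-groups before the i-th 1
    ones = j * k
    pos = samples[j - 1] + 1 if j > 0 else 0
    while pos < len(bits):
        if bits[pos]:
            ones += 1
            if ones == i:
                return pos
        pos += 1
    return None
```

The jump via `samples` is what makes every chunk contain exactly $k$ ones, so only the in-chunk search remains.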

\section{Navigating Binary Trees}
We look at a succinct representation of a binary tree.
Consider the tree encoding obtained by traversing the tree in level order, where each node outputs one bit for each child:
a 1 if the child is present and a 0 if it is not\todo{figure}. This representation clearly uses $2n$ bits.
Now if we prepend a $1$ (for the root) to the encoding, we can navigate the tree using Rank and Select from the previous section,
using the encoding as a bit string.

For a node at position $i$ in the bit vector, its left and right children are at positions $2Rank(i)$ and $2Rank(i) + 1$,
and its parent is at position $Select(\floor{i/2})$,
similar to a binary heap.
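The encoding and the navigation formulas can be sketched together (a sketch under my assumptions: a level-order traversal, 1-indexed positions in the bit string as in the formulas above, and naive $O(n)$ Rank/Select for clarity):

```python
from collections import deque

class Node:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def encode(root):
    """Level-order encoding: a leading 1 for the root, then one bit
    per child slot of every present node."""
    bits = [1]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in (node.left, node.right):
            bits.append(1 if child else 0)
            if child:
                queue.append(child)
    return bits

def rank1(bits, i):                 # number of 1s in positions 1..i
    return sum(bits[:i])

def select1(bits, i):               # position of the i-th 1
    ones = 0
    for pos, b in enumerate(bits, start=1):
        ones += b
        if b and ones == i:
            return pos

def left(bits, i):   return 2 * rank1(bits, i)
def right(bits, i):  return 2 * rank1(bits, i) + 1
def parent(bits, i): return select1(bits, i // 2)
```

A node exists at position $i$ exactly when `bits[i - 1] == 1`, so a returned child position with a 0-bit means "no such child", just as in an array-backed binary heap.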