~ntietz/isabella-db

isabella-db/docs/0001-initial-design-and-plan.md (6.3 KiB)
Latest commit: 2d4efdcd, Nicholas Tietz-Sokolsky, "Load shards and serve query results from all shards.", 9 days ago

This is an initial design document. It is intentionally light on detail in places and is meant only as exploration for initial development; areas that are still in an exploratory phase contain omissions.

The main focus is ensuring that the overall direction is correct, as well as discovering things that will need to be learned or implemented.

# Context

IsabellaDB is a chess database. It's meant to be used for analyzing past games and patterns across them to uncover interesting facts or improve as a chess player.

Some things which IsabellaDB will support:

  • Search for games
    • Filter games on metadata (player, event, year, rating, etc.)
    • Filter games based on position matches
    • Filter games based on sequences of position matches
  • Display a game (show move sequence and allow going forward/back through it)
  • Opening tree
    • Explore from the beginning of a game
    • Find games which match a given position
    • See win/loss/draw percentages for a given position
    • Apply a game search filter to the above
  • (TODO) Opening repertoires
    • Discover a player's opening repertoire
    • Build an opening repertoire
  • (TODO) Ingest new game data
    • Update as new master game databases are published
    • Pull in data for us amateurs (for openings especially)

# Game Storage and Retrieval

The fundamental functionality of a chess database is to be able to store and retrieve games with various filters. Additional functionality, like analysis, is layered on top of this. As such, game storage and retrieval is the critical core of IsabellaDB.

We'll focus first on storage of games, then retrieval, then indexing.

## Storage

Each game has a set of metadata associated with it, as well as a list of moves.

The metadata contains:

  • Player names
  • Player ratings
  • Game result
  • Game date
  • Event information (event name, site name, round)

The game information contains:

  • Sequence of materialized positions representing the game
  • The algebraic notation representation of the game (for display)

There will be a lot of repetition in the metadata, so each distinct string will be stored only once, in a separate in-memory names store, and the game records will point into that store.

Each game will be assigned a unique id when it is ingested into the system, and these will be used as the lookup keys from the indexes.
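
The names store amounts to a string interner. A minimal sketch, where `NameStore`, `intern`, and `resolve` are hypothetical names rather than the final API:

```python
class NameStore:
    """Stores each distinct string once; games hold integer handles into it."""

    def __init__(self):
        self._ids = {}    # string -> handle
        self._names = []  # handle -> string

    def intern(self, name: str) -> int:
        # Return the existing handle for a repeated string,
        # or allocate a new one for a string we haven't seen.
        handle = self._ids.get(name)
        if handle is None:
            handle = len(self._names)
            self._ids[name] = handle
            self._names.append(name)
        return handle

    def resolve(self, handle: int) -> str:
        return self._names[handle]

names = NameStore()
a = names.intern("Magnus Carlsen")
b = names.intern("Magnus Carlsen")  # repeated metadata: same handle back
assert a == b
assert names.resolve(a) == "Magnus Carlsen"
```

Game IDs can be assigned the same way: a counter incremented per ingested game, with the resulting ID used everywhere the indexes need to refer to a game.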

The initial in-memory storage will be pretty basic and will look like:

```rust
type GameID = u64;

struct GameStore {
    games: HashMap<GameID, Game>,
    indexes: ...
}
```

## Retrieval

There are two types of filters that you can do: filtering based on metadata about a game, and filtering based on what happened within the game.

Retrieval will be pretty straightforward. Given a list of filters in order of use (the order will be determined by the query planner), we can retrieve the matching games with this pseudocode:

```python
# get the games that match the first filter
games = db.retrieve(filters[0])

# then narrow down with each remaining filter
for filter in filters[1:]:
    games = games.filter(filter)
```

To do the retrieval, the db.retrieve call will use the index that corresponds to that filter and retrieve the records that match it, either as a range or as an exact result list.
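
A minimal concrete version of the pseudocode above, under the assumption that a filter object knows how to probe its index (`index_scan`) and how to test a single game (`matches`); both method names are hypothetical:

```python
def retrieve(db, filters):
    """Run filters in planner order: first via its index, the rest in memory."""
    if not filters:
        return list(db.all_games())
    # The first (presumably most selective) filter drives an index scan...
    games = filters[0].index_scan(db)
    # ...and the remaining filters narrow that candidate set game by game.
    for f in filters[1:]:
        games = [g for g in games if f.matches(g)]
    return games
```

The key design point is that only the first filter touches an index; everything after it operates on an already-small candidate set, which is why the planner's ordering matters.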

## Indexing

To start, we'll have some fairly simple indexes. Over time, we'll build up more complicated indexes to handle more complicated search patterns.

The initial indexes will be:

  • Player name index: How to implement this is TBD and may start as a prefix tree, but ideally would be able to match on any substring within the metadata, not just based on prefixes. Inverted indexes also seem relevant.
  • Event name index: Same as player name index.
  • Rating index: a B-tree index over ratings.
  • Date index: a B-tree index over dates.
  • Position index: We'll have a reverse lookup from position to the games which contained that position. Lookups into this index can use hashes of the positions for quick initial lookups, but will have to do an exact comparison of the positions to avoid collisions.
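
The hash-then-verify lookup described for the position index can be sketched like this (Python's built-in `hash` over a FEN string stands in for a real position hash such as Zobrist hashing):

```python
from collections import defaultdict

class PositionIndex:
    def __init__(self):
        # hash of position -> list of (position, game_id) entries
        self._buckets = defaultdict(list)

    def insert(self, position: str, game_id: int):
        self._buckets[hash(position)].append((position, game_id))

    def lookup(self, position: str) -> list[int]:
        # Quick initial lookup by hash, then an exact comparison of the
        # positions drops any entries that only collided on the hash.
        return [gid for pos, gid in self._buckets[hash(position)]
                if pos == position]
```

Storing the full position alongside each game ID is what makes the collision check possible; a space-optimized version could instead re-check against the game's materialized positions in the game store.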

Indexing on game results would not add any significant benefit, since there are only three classes (win, loss, draw).

Eventually, we'll also have indexes based on features of positions, instead of exact position matches. This includes things like the presence (or absence) of pins, certain pieces being on the board or not, etc. Further development of indexes will be driven by need to support certain types of queries.

Out of scope for the initial design:

  • Persistence. We'll reconstruct these at runtime for now to keep things simple.
  • Updates. We'll want to insert new games into the index at runtime eventually, but for now we'll assume the dataset is static.

# Querying

A user should be able to submit a query for a set of games which match certain criteria. Here is an example of doing that with a few different filters on metadata:

```lisp
# Find all the games played by Magnus and Fabi after 2015
(search-games
    (match-metadata
        (either-name "Fabiano Caruana") # matches either player
        (either-name "Magnus Carlsen")  # matches either player
        (year > 2015)
    )
)
```

And here's another example based on positional matching:

```lisp
# Find games which match the provided position (5 ply of the Italian) with an
# extra filter on metadata
(search-games
    (match-position
        (fen "r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3")
    )
    (match-metadata
        (white-rating > 2300)
        (black-rating > 2300)
    )
)
```

The query language will be defined in more detail over time. This is just a sample to guide what capabilities it'll need to provide. This will be one of the later pieces developed.

When a query is received, it will be parsed (as an S-expression), converted to a sequence of operations that can be run to retrieve the data from the store, and optimized (mostly around ordering of application of filters, and use of indexes).
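
To make the parsing step concrete, here is a minimal S-expression reader in the spirit of the examples above: it strips `#` comments, tokenizes parentheses, quoted strings, and bare atoms, and returns nested lists. This is a sketch of the shape of the parser, not the final grammar.

```python
import re

# Tokens: quoted strings, single parens, or runs of non-delimiter characters.
_TOKEN = re.compile(r'"[^"]*"|[()]|[^\s()"]+')

def parse_sexpr(text: str):
    text = re.sub(r"#.*", "", text)  # drop comments to end of line
    tokens = _TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing ")"
            return items
        if token.startswith('"'):
            return token[1:-1]  # string literal, quotes stripped
        return token            # bare symbol, e.g. search-games or 2015

    return read()

query = parse_sexpr("""
(search-games
    (match-metadata
        (either-name "Magnus Carlsen")
        (year > 2015)))
""")
assert query[0] == "search-games"
assert query[1][1] == ["either-name", "Magnus Carlsen"]
```

The nested-list output is a natural input for the next stages: the operation sequence is a walk over the tree, and the optimizer is a reordering of its children.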

# Development Plan and Roadmap

The initial development for the first phase will focus on getting data ingested and basic indexes constructed. Querying will be in the second phase.

v0.1.0:

  • PGN parser to be able to read in the dataset I have available to me
  • Index on metadata (player names, event names, ratings, dates)
  • Index on positions
  • Search page with basic metadata filters

v0.2.0:

  • Query parser
  • Opening tree page
  • Search page to also include position matches