<!doctype html>
<html lang="en">
    <head>
	<meta charset="utf-8">
	<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
	<link rel="stylesheet" href="/static/normalize.css">
	<link rel="stylesheet" href="/static/fonts/texgyrepagella-regular/stylesheet.css">
	<link rel="stylesheet" href="/static/fonts/leaguegothic.css">
	<link rel="stylesheet" href="/static/highlight.css">
	<link rel="stylesheet" href="/static/styles.css">
	<title>hyper.dev - 2020/02/27 - copernic: github for data</title>
    </head>
    <body>
        <div id="background"></div>
	<div id="overlay">
	</div>
	<div id="root">
            <h1><a href="https://hyper.dev">hyper.dev</a> <small>[<a href="https://hyper.dev/feed.xml">feed</a>] [<a href="https://hyper.dev/Amirouche.BOUBEKKI.2020.pdf">resume</a>]</small></h1>
	    <div><h1>2020/02/27 - copernic: github for data</h1>
<p><strong>alpha</strong></p>
<h2>Abstract</h2>
<p>copernic is a web application that is (mostly) implemented in the Python
programming language.  It is backed by a database that is a
versioned triple store.  It is possible to do time traveling queries
at any point in history while still being efficient to query and
modify the latest version.  The versioned triple store is implemented
using a novel approach dubbed generic tuple store.  copernic's goal is
to <strong>demonstrate that versioned databases make it possible to implement
workflows that ease cooperation</strong>.</p>
<h2>Keywords</h2>
<ul>
<li>data management system</li>
<li>data science</li>
<li>knowledge base</li>
<li>open data</li>
<li>python programming language</li>
<li>quality assurance</li>
<li>reproducible science</li>
<li>version control system</li>
</ul>
<h2>Introduction</h2>
<p>Versioning in production systems is a trick everybody knows about,
whether it is through backups, logging systems or ad-hoc <a href="https://code.djangoproject.com/wiki/AuditTrail">audit
trails</a>.  It makes it possible to
inspect, debug and, in the worst cases, roll back to previous states. There
is no need to explain the great importance of versioning in software
management, as tools like git, mercurial, and fossil have shaped modern
computing.</p>
<p>Having the power of versioning opens the door to manifold applications.
For instance, it makes it possible to implement <strong>a mechanic similar to github's pull
requests and gitlab's merge requests in many products</strong>.  That very
mechanic makes explicit the actual human workflow in enterprise
settings, in particular when a person validates a change made by
another person: peer review.</p>
<p>The <em>versioned triple store</em> makes the implementation of such mechanics
more systematic and less error prone, as the implementation can be
shared across various tools and organisations.</p>
<p>copernic takes the path of versioning data and applies the
<strong>change-request mechanic to collaborate around the making of a
knowledge base</strong>, similar in spirit to
<a href="https://wikidata.org/">WikiData</a> and inspired by existing data
management systems like CKAN.</p>
<p>The use of a version control system to store <a href="https://en.wikipedia.org/wiki/Open_data">open
data</a> is a good thing as it
draws a clear path for reproducible science.  But no existing tool meets all the
expectations. <strong>copernic aims to replace the use of git and make
practical cooperation around the creation, publication, storage,
re-use and maintenance of knowledge bases that are possibly bigger
than memory.</strong> Resource Description Framework (RDF) offers a good
canvas for cooperation around open data, but until now there has been no solution that
is good enough according to <a href="https://core.ac.uk/download/pdf/76527782.pdf">Collaborative Open Data versioning: a
pragmatic approach using Linked Data, by Canova <em>et
al.</em></a>.</p>
<p>copernic uses a novel approach to store triples in an <a href="https://en.wikipedia.org/wiki/Ordered_Key-Value_Store">ordered
key-value
store</a>. It uses the
<a href="https://www.foundationdb.org/">FoundationDB database storage engine</a>
to deliver a pragmatic, versatile, ACID-compliant versioned triple store
where people can cooperate around the <strong>making of knowledge</strong>.
copernic only stores changes between versions.  It also keeps a snapshot
of the latest version.  copernic is the future.</p>
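<p>To make the idea of triples in an ordered key-value store more concrete, here is a minimal sketch. It is not copernic's code and it does not use the FoundationDB API: <code>TinyOrderedStore</code>, <code>add_triple</code> and <code>where</code> are hypothetical names, the store is an in-memory sorted list, and the three index orders are an assumption chosen for illustration. Each triple is written once per index, so a pattern whose leading items are known becomes a prefix scan.</p>

```python
import bisect


class TinyOrderedStore:
    """In-memory stand-in for an ordered key-value store (hypothetical)."""

    def __init__(self):
        self._keys = []  # sorted list of tuple keys; values are not needed here

    def add(self, key):
        index = bisect.bisect_left(self._keys, key)
        if index == len(self._keys) or self._keys[index] != key:
            self._keys.insert(index, key)

    def scan(self, prefix):
        """Yield every key that starts with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if key[:len(prefix)] != prefix:
                break
            yield key


# Assumed index orders: positions of (subject, predicate, object) in the key.
INDICES = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}


def add_triple(store, triple):
    """Write the triple once per index, permuted to match the index order."""
    for name, order in INDICES.items():
        store.add((name,) + tuple(triple[i] for i in order))


def where(store, name, *bound):
    """Prefix-scan the index called name and yield triples in original order."""
    order = INDICES[name]
    for key in store.scan((name,) + bound):
        triple = [None] * 3
        for position, source in enumerate(order):
            triple[source] = key[1 + position]
        yield tuple(triple)
```

<p>In this sketch all items are strings so that keys compare cleanly; a real store would need an order-preserving encoding for mixed types.</p>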
<h2>Implementation</h2>
<p>This project is designed to allow cooperation in the large, thanks to the
support of a change-request mechanic similar to github's pull requests or
gitlab's merge requests, on any structured data.  That includes
relational data, graph-like data and tabular data. It is not exactly
git-for-structured-data (unlike the previous iteration).</p>
<p>It is meant to scale both in terms of data size and number of
contributions.</p>
<p>Like wikipedia or wikidata, there is a single version of truth. In
other words, every user gets the same data. There are no per-user data
repositories (as of yet).  I dropped the git-like
Directed-Acyclic-Graph history branches because they would be less
scalable.</p>
<p>The drawback of the current approach is that it is not possible to
"fork" a branch and to have different versions of the same database
that go in different directions, e.g. elaborate some theories in
different branches and then merge only the good one. Barring some bug
in the code, it is possible to query a given stash of changes, but the
stash only represents a single commit in git parlance: it has no
history.</p>
<p>In other words, change-requests work in a way that is similar to git
stash.  That is, the diff (additions and deletions of triples) is
stored in the 5-tuple store like the rest of the data. Until the change is
applied by a super user, the data part of the change-request is
invisible outside the given change-request, like a stash in git.</p>
<p>A 5-tuple store was said to be overkill, but as of today I do not know which
queries I will want to execute against the history, so I
index-all-the-things, what blazegraph code calls "perfect
indices". The five-tuple is described in
<a href="https://github.com/amirouche/copernic/blob/master/copernic/vnstore.py#L59">vnstore</a>.
It ultimately contains the original added or removed triple, plus a
boolean denoting whether it is an addition or a deletion (a deletion is
what is called a tombstone in postgresql mvcc), and the changeid that is a
unique identifier for the group of additions and deletions, similar to a
git commit hash.</p>
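<p>A minimal sketch of such a five-tuple, under stated assumptions: the field names <code>alive</code> and <code>changeid</code>, and the helpers <code>make_change</code> and <code>diff_of</code>, are hypothetical and only illustrate the description above; the linked vnstore module is the reference for the real layout.</p>

```python
import uuid
from collections import namedtuple

# Assumed field names for this sketch: a triple, a flag that is True for an
# addition and False for a deletion, and the identifier of the change.
Row = namedtuple("Row", ["uid", "key", "value", "alive", "changeid"])


def make_change(additions=(), deletions=()):
    """Group additions and deletions of triples under one fresh changeid."""
    changeid = uuid.uuid4().hex
    rows = [Row(*triple, True, changeid) for triple in additions]
    rows += [Row(*triple, False, changeid) for triple in deletions]
    return changeid, rows


def diff_of(rows, changeid):
    """Return the (added, removed) sets of triples recorded under changeid."""
    added = {r[:3] for r in rows if r.changeid == changeid and r.alive}
    removed = {r[:3] for r in rows if r.changeid == changeid and not r.alive}
    return added, removed
```

<p>Because the changeid is part of every row, the diff of one change-request can be read back on its own, which is what makes the stash-like behaviour possible.</p>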
<p>Once a super user applies the change, which merely swaps a single
<code>None</code> value for a timestamp, the history is properly serialized,
yielding a single-branch history.</p>
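<p>The apply step can be sketched as follows. This is a toy model, not copernic's vnstore: <code>ChangeLog</code>, <code>stash</code>, <code>apply</code>, <code>visible</code> and the <code>significance</code> mapping are hypothetical names, and a logical counter stands in for the timestamp so that the example is deterministic.</p>

```python
import itertools


class ChangeLog:
    """Toy single-branch history with pending change-requests."""

    def __init__(self):
        self.rows = []          # 5-tuples: (uid, key, value, alive, changeid)
        self.significance = {}  # changeid -> None while pending, else timestamp
        self._clock = itertools.count(1)  # logical clock standing in for time

    def stash(self, changeid, additions=(), deletions=()):
        """Record a pending change-request; it stays invisible until applied."""
        self.significance[changeid] = None
        self.rows += [(*t, True, changeid) for t in additions]
        self.rows += [(*t, False, changeid) for t in deletions]

    def apply(self, changeid):
        """Swap the single None for a timestamp: the change joins the history."""
        self.significance[changeid] = next(self._clock)

    def visible(self):
        """Replay applied changes in timestamp order and return live triples."""
        applied = [r for r in self.rows if self.significance[r[4]] is not None]
        applied.sort(key=lambda r: self.significance[r[4]])
        triples = set()
        for uid, key, value, alive, _ in applied:
            if alive:
                triples.add((uid, key, value))
            else:
                triples.discard((uid, key, value))
        return triples
```

<p>Note that applying a change touches one value only; the rows of the change-request were already written when it was stashed.</p>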
<p>In fact, a change request can be bigger than available
memory.  Unlike in the previous iteration, changes (or commits) do not map
one-to-one to database transactions. Hence, there might be integrity
bugs.</p>
<p>It is possible to do time traveling queries, like freebase did. This is
not exposed in the web user interface.</p>
<p>There is always an up-to-date image of the latest data, to speed up
queries.  But the history only stores the differences between
successive versions.</p>
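<p>The combination of a latest image and a history of differences can be sketched like this. The class <code>VersionedTriples</code> and its methods <code>commit</code> and <code>as_of</code> are hypothetical, timestamps are plain integers, and commits are assumed to arrive in increasing timestamp order. Queries against the latest version read the image directly; a time traveling query replays the differences up to the requested point.</p>

```python
class VersionedTriples:
    """Toy store: an image of the latest version plus a history of diffs."""

    def __init__(self):
        self.latest = set()  # up-to-date image, used by ordinary queries
        self.history = []    # (timestamp, added, removed), oldest first

    def commit(self, timestamp, added=(), removed=()):
        """Store only the effective difference and refresh the latest image."""
        added = set(added) - self.latest      # ignore triples already present
        removed = set(removed) & self.latest  # ignore triples already absent
        self.latest = (self.latest | added) - removed
        self.history.append((timestamp, frozenset(added), frozenset(removed)))

    def as_of(self, timestamp):
        """Rebuild the set of triples as it was at the given timestamp."""
        triples = set()
        for when, added, removed in self.history:
            if when > timestamp:
                break
            triples = (triples | added) - removed
        return triples
```

<p>Replaying from the start costs time proportional to the length of the history; that is acceptable for a sketch, but it is the kind of query that indices over the history are meant to speed up.</p>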
<p>The main differences I see with existing RDF databases are:</p>
<ul>
<li>it is versioned in a single branch history, with stashes of changes.</li>
<li>it scales horizontally, thanks to foundationdb.</li>
<li>it does not support SPARQL, as of yet.</li>
<li>it does not support reasoning of any sort.</li>
</ul>
<p>The only way to add or remove triples as of today is via the user
interface, through the "make change-request" button. In particular, the import
link leads to a page without explanation, with a box to input a file
that is expected to contain JSON lines.  That is, things like:</p>
<div class="highlight"><pre><span></span><span class="p">[</span><span class="s2">&quot;w3c&quot;</span><span class="p">,</span> <span class="s2">&quot;is-a&quot;</span><span class="p">,</span> <span class="s2">&quot;Web standards organization&quot;</span><span class="p">]</span>
<span class="p">[</span><span class="s2">&quot;truth&quot;</span><span class="p">,</span> <span class="s2">&quot;is&quot;</span><span class="p">,</span> <span class="mi">42</span><span class="p">]</span>
</pre></div>
<p>Where <code>subject</code>, <code>predicate</code> and <code>object</code> can be any simple JSON data
type.  Also, I call the columns respectively <code>uid</code>, <code>key</code>, and
<code>value</code>.  That is easier for my mind. The code will try to guess the type
of the object: variable, uuid, number, boolean, with a fallback to string.</p>
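<p>A minimal sketch of both steps, reading JSON lines and guessing a type from text. It is not copernic's code: <code>parse_json_lines</code>, <code>guess</code> and <code>Variable</code> are hypothetical names, the trailing <code>?</code> that marks a variable is an assumption made for this sketch, and so is the choice to reject <code>null</code>.</p>

```python
import json
import uuid
from collections import namedtuple

Variable = namedtuple("Variable", ["name"])
SIMPLE = (str, int, float, bool)  # assumption: null is not accepted


def parse_json_lines(text):
    """Yield (uid, key, value) tuples, one per non-empty line of JSON."""
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        item = json.loads(line)
        if not (isinstance(item, list) and len(item) == 3):
            raise ValueError(f"line {number}: expected a list of three items")
        if not all(isinstance(x, SIMPLE) for x in item):
            raise ValueError(f"line {number}: only simple types are supported")
        yield tuple(item)


def guess(text):
    """Try variable, uuid, number, boolean, in that order; else keep the string."""
    if text.endswith("?") and len(text) > 1:  # assumed variable syntax
        return Variable(text[:-1])
    try:
        return uuid.UUID(text)
    except ValueError:
        pass
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    if text in ("true", "false"):
        return text == "true"
    return text
```

<p>The order of the attempts matters: a string such as <code>42</code> is a valid number and a valid string, so the more specific types are tried first and string is the fallback.</p>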
<p>The code is at: <a href="https://github.com/amirouche/copernic">https://github.com/amirouche/copernic</a></p>
<p>A demo is available at: <a href="http://copernic.space/">http://copernic.space/</a></p>
<p>The license is AGPLv3+.</p>
<p>Enjoy!</p>
</div>
	</div>
        <div id="footer">
            <p>
                As always, if you like this article, want to share
                feedback, or want to tell me what I got wrong, please <a href="mailto:amirouche@hyper.dev">get
                in touch</a>.
            </p>
            <p>You might want to subscribe to the blog <a href="https://hyper.dev/feed.xml">feed</a>!</p>
            <p>Amirouche ~ zig</p>
        </div>
    </body>
</html>