json2sql

json2sql converts JSON objects to an indexed sqlite3 schema with appropriate types and a fast import. It's useful anywhere you have newline-separated JSON data.

One great use case is extracting insights from log data, such as "the average number of requests per user over a 2-hour period where we know customers were experiencing issues and things weren't being inserted into the database." For instance:

$ cat logs.json
{"user_id":1, "request": "GET /", "created_at": "2019-01-01T00:00:00Z"}
{"user_id":1, "request": "POST /login", "created_at": "2019-01-02T00:00:00Z"}
{"user_id":2, "request": "GET /dashboard"}

$ json2sql -f logs.json > logs.sql
$ sqlite3 db < logs.sql
$ sqlite3 db
sqlite> SELECT COUNT(*) FROM data
        WHERE created_at >= DATE('2019-01-02')
        GROUP BY user_id;
1
sqlite> .quit
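
The "average requests per user" question from above is only one nested aggregate away. As a sketch, assuming the data table that json2sql generates (shown below) and relying on the lexicographic ordering of ISO-8601 timestamps (the 2-hour window here is made up for illustration):

$ sqlite3 db
sqlite> SELECT AVG(n) FROM (
          SELECT COUNT(*) AS n FROM data
          WHERE created_at >= '2019-01-01T00:00:00Z'
            AND created_at <  '2019-01-01T02:00:00Z'
          GROUP BY user_id
        );
1.0
sqlite> .quit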

$ cat logs.sql
DROP TABLE IF EXISTS data;
CREATE TABLE data (
	'user_id' INTEGER,
	'request' TEXT,
	'created_at' TEXT
);
CREATE INDEX 'user_id_idx'    ON data ('user_id');
CREATE INDEX 'request_idx'    ON data ('request');
CREATE INDEX 'created_at_idx' ON data ('created_at');

INSERT INTO data ('user_id', 'request', 'created_at') VALUES
	(1, "GET /", "2019-01-01T00:00:00Z"),
	(1, "POST /login", "2019-01-02T00:00:00Z"),
	(2, "GET /dashboard", NULL);

This is possible with heavier tools such as Elasticsearch, with its own query language, but it should not be necessary to deploy dedicated Elasticsearch servers, learn a new DSL, and pay a company a fee for proprietary software to do this simple and routine task. It just shouldn't.

You can also do this with tools like jq, but the syntax for these complex aggregations is... complex, whereas everybody knows SQL. It's also slow for follow-up queries: nothing stays in memory between runs, so jq has to re-parse the JSON file every time, while the sqlite database is parsed once and can be queried as often as you like.
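
For comparison, the jq version of even the simple "average requests per user" aggregation looks something like this (a sketch; note the -s flag, which slurps the whole file into memory on every run):

$ jq -s 'group_by(.user_id) | map(length) | add / length' logs.json
1.5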

As for us, we like SQL and simple things.

Behavior

  • json2sql will deduce your column data types automatically. If two different data types are used for the same key, then by default json2sql will throw an error (see the example after this list).
  • json2sql will add indexes to every column, mirroring Elasticsearch's behavior and making subsequent queries very fast. Since this database will probably not be dynamically updated with new data, this is fine.
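
For example, given a file like the following (mixed.json is a made-up name), json2sql would exit with an error rather than guess, since user_id is an INTEGER on the first line but TEXT on the second:

$ cat mixed.json
{"user_id": 1}
{"user_id": "one"}
$ json2sql -f mixed.json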

Future Improvements

  • Reducing reflection and allocations, as well as adding concurrency, would probably make it a lot faster. Benchmarks would help.
  • Other SQL dialects, e.g. Postgres or MySQL, could be output with some work.
  • Adding an option to pass in a schema from a previous run would make things much faster on subsequent runs, as we'd only need to iterate through the JSON file once.