Planning ahead for how we can keep up with an increasing scale.
The primary near-to-mid-term bottlenecks will be:
Anything not mentioned here is thought to be operating well within performance thresholds and scaling at a negligible pace.
The throughput of our GraphQL backends is almost entirely constrained by the max SQL connections. Nearly all blocking requests are spending their time waiting for a connection to free up from the pool. CPU and RAM usage are negligible.
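As a rough way to reason about that ceiling: if every request holds one pooled connection while it does its database work, the sustainable request rate is bounded by the pool size divided by the average hold time (Little's law). The sketch below is illustrative only. The function name and the numbers are made up for the example and are not measurements from our backends.

```python
def max_throughput(pool_size: int, avg_hold_seconds: float) -> float:
    """Upper bound on requests/second when each request holds one
    pooled SQL connection for avg_hold_seconds on average."""
    if pool_size <= 0 or avg_hold_seconds <= 0:
        raise ValueError("pool_size and avg_hold_seconds must be positive")
    return pool_size / avg_hold_seconds


# Hypothetical numbers: 16 connections, each held for 250 ms per request.
print(max_throughput(16, 0.25))
```

Under this model, raising the connection limit or shortening the time each request holds a connection are the only two levers, which matches CPU and RAM being idle.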
All authentication is routed through meta.sr.ht for token revocation checks, which uses Redis as the source of truth. This may become a bottleneck in the future.
The future Python backend design is going to be pretty thin and mostly constrained by (1) simultaneous connections and (2) GraphQL throughput.
We'll know more about how to address this after we decide if we're keeping Python around in the first place.
Our internet link is fairly cheap bargain shit. This is easy to fix but going to be expensive. Defer until we need it; the pricing adjustment for the beta should take this into consideration.
Storage utilization is fine, and easily tuned if necessary. The larger problem is that borg triggers lots of CPU consumption on the hosts which are being backed up. Manageable now but a good candidate for future research.
We're already designed with load balancing in mind. Balancing HTTP requests across any number of web servers ought to be trivial. However, horizontal scaling of web appliances is an expensive optimization, and for the most part this is being considered with a low number of nodes (i.e. 3) for the purposes of availability more so than scaling. We should look into other scaling options before reaching for web load balancing.
Storage is not really an issue, and load avg is consistently <1 even during usage spikes. The main constraint is RAM; right now we're on 64GiB and using about half of it.
We can tackle availability and load balancing in one fell swoop. When we need to scale up more, we should provision two additional PostgreSQL servers to serve as read-only hot standbys. We can use pgbouncer to direct writable transactions to the master and load balance read-only transactions between all of the nodes. If we need to scale writes up, we can take the read-only load entirely off of the master server and spin up a third standby. The GraphQL backends are already transaction-oriented and use a read-only transaction when appropriate, so this would be fairly easy.
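The routing policy described above can be sketched in a few lines. This is an application-side illustration of the policy only, not pgbouncer configuration. The class name, the host names, and the `reads_on_primary` switch are all hypothetical.

```python
from itertools import cycle
from typing import Iterable


class TransactionRouter:
    """Send writable transactions to the primary and round-robin
    read-only transactions across the eligible nodes."""

    def __init__(self, primary: str, standbys: Iterable[str],
                 reads_on_primary: bool = True) -> None:
        self.primary = primary
        readers = list(standbys)
        # With no standbys the primary has to serve reads as well.
        if reads_on_primary or not readers:
            readers = [primary] + readers
        self._readers = cycle(readers)

    def route(self, read_only: bool) -> str:
        """Return the host that should run this transaction."""
        return next(self._readers) if read_only else self.primary


# Hypothetical host names. First stage: reads are spread over all three nodes.
router = TransactionRouter("pg-primary", ["pg-standby-1", "pg-standby-2"])
print(router.route(read_only=False))
print([router.route(read_only=True) for _ in range(3)])
```

Setting `reads_on_primary=False` and adding a third standby models the later stage where the read-only load is taken off the primary entirely.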
If we need to scale writes horizontally, sharding should be in the cards. I don't expect us to need that for a long time.
Note: right now we have one hot standby but it serves as a failover and off-site backup, and is not typically load-bearing. Latency issues to the backup datacenter would likely make bringing it into normal service a non-starter.
RepoSpanner may help with git storage distribution and availability. A bespoke solution would probably also be pretty straightforward.
Mercurial has really bad performance. The per-user load of hg.sr.ht is about 10x the per-user load of git.sr.ht, but it sees about 1/10th the usage, so it balances out more or less. I would like to see some upstream improvements from the Mercurial team to make hosting hg.sr.ht less expensive.
Generating clonebundles is a unique concern which requires lots of CPU usage periodically.
Storage utilization is growing at a manageable pace, about 0.1%-0.2%/week.
Watch this chart
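To put the weekly growth figure in perspective, here is a small sketch that converts it into a doubling time. It assumes the quoted percentage is relative growth that compounds week over week. If the figure is instead percentage points of total capacity, the growth is linear and this estimate does not apply.

```python
import math


def weeks_to_double(weekly_growth: float) -> float:
    """Weeks until utilization doubles, assuming compounding weekly growth.

    weekly_growth is a fraction, e.g. 0.001 for 0.1% per week.
    """
    if weekly_growth <= 0:
        raise ValueError("weekly_growth must be positive")
    return math.log(2) / math.log(1 + weekly_growth)


# The two ends of the quoted range: 0.1% and 0.2% per week.
for rate in (0.001, 0.002):
    weeks = weeks_to_double(rate)
    print(f"{rate:.1%}/week: {weeks:.0f} weeks (~{weeks / 52:.1f} years)")
```

Under that assumption, doubling is several years out at either end of the range, which is consistent with calling the pace manageable.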