~sircmpwn/sr.ht-docs

598c64ac1df4d3c8a02768419a544096db1865b4 — Drew DeVault 14 days ago 72fe32c
Updates to PSQL planning document
1 file changed, 129 insertions(+), 32 deletions(-)

M ops/robust-psql.md
M ops/robust-psql.md => ops/robust-psql.md +129 -32
@@ 7,22 7,20 @@ availability, performance, and data integrity.

1. Real-time off-site backup of the database
1. Ability to access the database as it appeared in the past
1. Ability to failover to the standby in the event of an outage on the
   primary, with minimal application downtime
1. Ability to failover to the standby during planned maintenance of the
   primary, with minimal application downtime

**Prerequisites**

- ✓ PostgreSQL 12.x on primary
- ✓ PostgreSQL 12.x on standby

**Nice-to-have**

- ✗ Alpine 3.12 on primary
- ✓ Alpine 3.12 on standby

If we have to restart the primary PostgreSQL daemon for any reason, we should
take the opportunity to upgrade the host to Alpine 3.12.


@@ 33,28 31,28 @@ First we'll have to prepare a [warm standby][warm standby] (which can be
promoted to primary at short notice, but does not accept queries), then we can
reconfigure it as a [hot standby][hot standby] (which will accept queries).

1. Set up a user on the primary for the standby to connect as for streaming
   replication. It must have the REPLICATION and LOGIN privileges and
   [pg_hba.conf][hba changes] must be updated.
1. Take a [base backup][base backup] of the primary and ship it to the
   standby. Be advised: we have `full_page_writes` disabled as a ZFS
   optimization; this may cause performance degradation during the basebackup
   process. `pg_basebackup -PzFt -D basebackup`. Place `basebackup` in
   `/var/lib/postgresql`.
1. Nuke the postgres data directory's contents on the standby, su to postgres,
   then run these commands: `tar -C ~/12/data -xf ~/basebackup/base.tar.gz`,
   `tar -C ~/12/data/pg_wal -xf ~/basebackup/pg_wal.tar.gz`,
   `touch ~/12/data/standby.signal`
1. Fill out `primary_conninfo` on the standby with the connection details for
   the account created for replication. `host=... port=5432 user=...  password=...`
1. Configure a replication slot on the primary for the standby:
    
        SELECT * FROM pg_create_physical_replication_slot('konpaku');

    Then set `primary_slot_name` to `konpaku` on the standby.
1. Set `hot_standby` to "on" on the standby.
1. Start the database, monitor logs for restore completion.
1. Sanity check the standby database.
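Condensed into a rough transcript, the steps above look like this; the `repl` role name, the placeholder password, and the scp step are illustrative assumptions, not from the original:

```
-- on the primary, as a superuser (then add a matching line to pg_hba.conf):
CREATE ROLE repl WITH REPLICATION LOGIN PASSWORD '...';

# on the primary: take the base backup and ship it to the standby
pg_basebackup -PzFt -D basebackup
scp -r basebackup standby:/var/lib/postgresql/

# on the standby, as postgres, with the data directory emptied:
tar -C ~/12/data -xf ~/basebackup/base.tar.gz
tar -C ~/12/data/pg_wal -xf ~/basebackup/pg_wal.tar.gz
touch ~/12/data/standby.signal
```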

Note: some of this procedure is adapted from the
[restore the base backup][restore base backup] docs, but modified cause those


@@ 66,26 64,117 @@ docs are shit.
[restore base backup]: https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-PITR-RECOVERY
[hba changes]: https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-AUTHENTICATION

# Failover procedure (primary & secondary role swap)

✓ Process tested

# Deploying pgbouncer

pgbouncer will allow us to quickly change which database server the fleet is
using at any given moment. It does not handle load balancing directly; we
would need a separate mechanism for that if desirable (the docs suggest DNS
round robin and haproxy).

A configuration similar to the following will be deployed fleet-wide:

```
[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 16432
pool_mode = transaction

auth_type = trust
auth_file = /etc/pgbouncer/userlist.txt

max_client_conn = 100
default_pool_size = 20

[databases]
xxx.sr.ht=host=remilia.sr.ht user=xxx.sr.ht password=...
```

config.ini files will point at localhost through pgbouncer rather than directly at
the database servers. The initd scripts will need to be updated to `use
pgbouncer` so that services are started up correctly.
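For example, for a hypothetical service (the actual service names, hosts, and connection strings differ):

```
# before: connect directly to the database server
connection-string=postgresql://xxx.sr.ht@remilia.sr.ht/xxx.sr.ht

# after: connect through the local pgbouncer
connection-string=postgresql://xxx.sr.ht@127.0.0.1:16432/xxx.sr.ht
```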

✓ Process tested

# Application-level updates

For web applications, we should abort requests which would write to the database
and instead show a page explaining the circumstances. The user can attempt a
refresh later to re-attempt the same action.

For the SSH pipelines:

- Repository auto-creation attempts should abort and tell the user
- Repository mtime updates should be silently dropped
- If unknown SSH keys can be fetched from meta.sr.ht, use them, but don't store them
- Webhook delivery attempts should be dropped and a warning printed

For builds.sr.ht, if any database update is met with a read-only error,
retry with an exponential backoff and block until it succeeds. If it exceeds a
deadline, then abort the build.
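A sketch of that retry loop in shell terms; `run_update` stands in for the actual database write, and the names are illustrative:

```shell
# Retry run_update with exponential backoff until it succeeds or the
# deadline (in seconds, default 60) passes; a non-zero return aborts the build.
retry_with_deadline() {
    deadline=$(( $(date +%s) + ${1:-60} ))
    delay=1
    until run_update; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "deadline exceeded, aborting build" >&2
            return 1
        fi
        sleep "$delay"
        delay=$(( delay * 2 ))
    done
}
```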

For the webhook processing celery workers, it should be acceptable for them to
simply fail. Webhooks are only generated when the database is updated, so if
the database can't be updated, ¯\\\_(ツ)\_/¯!

There may be some room for future improvements in all of these areas, but this
is sufficient for the MVP.

✓ core.sr.ht patch to (somewhat) gracefully handle a read-only database

✗ read-only support for celery jobs

✗ read-only support for git.sr.ht, hg.sr.ht SSH pipeline

✗ read-only support for ongoing builds.sr.ht jobs

# Failover procedure (read-only mode)

*Ideally, this is to be announced in advance, as it will cause a partial outage
across all services.*

First, prepare all services for failover by updating their pgbouncer
configurations to point to the standby. Do not reload pgbouncer until all
service configs have been updated; then reload them all at once with a script.

TODO: Write that script
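A minimal sketch of what that script might look like; the staging convention, config paths, and reload-via-SIGHUP mechanism are assumptions to be validated:

```shell
# Stage the new config on every host first, then reload every pgbouncer in a
# tight second pass so the fleet switches databases (nearly) all at once.
# pgbouncer re-reads its configuration on SIGHUP.
switch_fleet() {
    for host in "$@"; do
        ssh "$host" 'mv /etc/pgbouncer/pgbouncer.ini.staged /etc/pgbouncer/pgbouncer.ini' || return 1
    done
    for host in "$@"; do
        ssh "$host" 'pkill -HUP pgbouncer' || return 1
    done
}
```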

✓ Process tested

# Standby promotion procedure (read/write mode)

1. Prerequisite: complete the read-only failover procedure
1. Shut off the primary
1. Run `pg_ctl promote` on the standby
1. Convert the old primary to a standby
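In command terms, the promotion and demotion might look like the following; `pg_rewind` is one option for the demotion and has prerequisites (e.g. `wal_log_hints` enabled or data checksums) that would need checking first:

```
# on the standby, as postgres:
pg_ctl promote -D ~/12/data

# on the old primary, once shut down: rewind it against the new primary,
# then mark it as a standby and point primary_conninfo at the new primary
pg_rewind -D ~/12/data --source-server='host=... user=... password=...'
touch ~/12/data/standby.signal
```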

✗ TODO: Test this process

# Database upgrade procedure

TBD

✗ TODO: Test this process

## Minor version upgrades

TBD

✗ TODO: Test this process

## Major version upgrades

TBD

✗ TODO: Test this process

# Accessing past data

TBD

✗ TODO: Test this process

# Changes to make on the next primary reboot

- Set `wal_level` to replica; we are not going to use logical replication. TODO:


@@ 95,10 184,6 @@ TBD

# Open questions

- What is the plan of execution for replacing the server hardware?
- How often are WALs recycled, how far into the past can we peek?
- Can we/should we detect failure conditions and automate the failover


@@ 107,13 192,13 @@ TBD
- How can we maintain an ongoing high level of confidence in our system?

It may be possible to *always* service SELECT statements from either the primary
or standby. If we eventually require load balancing, these queries could be
eligible. However, queries (or transactions?) which mutate the database may only
be executed on the primary. Therefore:

- If we require more capacity for writes, is vertical scaling our only option?
- In the event of a primary outage, are we able to transparently direct reads
  to the standby?
- Can application-level changes be implemented which would allow for reduced
  service availability (read-only) in the event of a primary outage?
- If applications cannot write to the database, but can write to other data


@@ 123,22 208,22 @@ be executed on the primary. Therefore:

# Mock contingency scenarios

We need to set up a test network with a primary and standby and induce failure
scenarios to learn how our applications behave under those conditions and
practice the failure resolution steps. Such a test network is already partially
provisioned on Drew's home network.

This will require at least the following:

- A couple of sr.ht applications (meta.sr.ht and git.sr.ht would be good)
- A primary and standby PostgreSQL server

We should play out the following scenarios:

- Promoting the standby to the primary
- Shut down the primary or standby without pre-preparation
- Accessing the database state at an arbitrary point in the past
- Bringing the standby back up to date after a primary outage
- Major and minor version upgrades (pg 11 to 12)
- Host OS upgrades (Alpine 3.11 to 3.12 or edge)
- Cause a ZFS resliver on the primary and monitor database performance: does


@@ 147,3 232,15 @@ pgbouncer or pgpool. Then we should play out the following scenarios:

At each stage of these processes, observe application behavior and record these
observations.

# Verification process

The whole setup needs to be performed on a secondary deployment and each service
carefully tested under normal and failure mode conditions. Of particular
interest:

- Do git.sr.ht's more sophisticated pg commands from the SSH pipeline work
  correctly?
- Do lists.sr.ht's prepared statements get along well with pgbouncer?
- Do all services work reasonably well in a read-only mode?
- Do all services work correctly with pgbouncer's transaction mode?