134328b0ed45995ec33928d0f6738bcdd071fe0a — Drew DeVault 12 days ago 6355d0a
Add robust PSQL planning document
2 files changed, 144 insertions(+), 0 deletions(-)

M ops/index.md
A ops/robust-psql.md
M ops/index.md => ops/index.md +1 -0
@@ 17,6 17,7 @@ Additional resources:
- [Monitoring & alarms](/ops/monitoring.md)
- [Network topology](/ops/topology.md)
- [Provisioning & allocation](/ops/provisioning.md)
- [PostgreSQL robustness planning](/ops/robust-psql.md)

# Operational Resources

A ops/robust-psql.md => ops/robust-psql.md +143 -0
@@ 0,0 1,143 @@
PostgreSQL is a critical part of the SourceHut infrastructure in terms of
availability, performance, and data integrity.

# Streaming replication


1. Real-time off-site backup of the database
1. Ability to access the database as it appeared in the past
1. Ability to failover to the secondary in the event of an outage on the
   primary, with minimal application downtime
1. Ability to failover to the secondary during planned maintenance of the
   primary, with minimal application downtime


✓ PostgreSQL 12.x on primary

✓ PostgreSQL 12.x on secondary


✗ Alpine 3.12 on primary

✓ Alpine 3.12 on secondary

If we have to restart the primary PostgreSQL daemon for any reason, we should
take the opportunity to upgrade the host to Alpine 3.12.

## Preparing the hot standby

First we'll have to prepare a [warm standby][warm standby] (which can be
promoted to primary at short notice, but does not accept queries), then we can
reconfigure it as a [hot standby][hot standby] (which will accept queries).

1. Set up a user on the primary for the secondary to connect as for streaming
   replication. It must have the REPLICATION and LOGIN privileges and
   [pg_hba.conf][hba changes] must be updated.
1. Take a [base backup][base backup] of the primary and ship it to the
   secondary. Be advised: we have full_page_writes disabled as a ZFS
   optimization; this may cause performance degredation during the basebackup
1. Fill out `primary_conninfo` on the secondary with the connection details for
   the account created for replication.
1. Configure a replication slot on the primary for the secondary:
        SELECT * FROM pg_create_physical_replication_slot('konpaku');

    Then set `primary_slot_name` to `konpaku` on the secondary.
1. Set `hot_standby` to true on the secondary.
1. [Restore the base backup][restore base backup] on the secondary. It will
   automatically sync with the primary when complete; monitor the logs. TODO:
   What should our `restore_command` be? Want to figure it out in advance.
1. Sanity check the secondary database.
1. Roll back the `restore_command` change so that we aren't surprised by it
   later. Drop the base backup files as well.

[warm standby]: https://www.postgresql.org/docs/current/warm-standby.html
[hot standby]: https://www.postgresql.org/docs/current/hot-standby.html
[base backup]: https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-BASE-BACKUP
[restore base backup]: https://www.postgresql.org/docs/current/continuous-archiving.html#BACKUP-PITR-RECOVERY
[hba changes]: https://www.postgresql.org/docs/current/warm-standby.html#STREAMING-REPLICATION-AUTHENTICATION

# Failover procedure (primary & secondary role swap)


# Database upgrade procedure


## Minor version upgrades


## Major version upgrades


# Accessing past data


# Changes to make on the next primary reboot

- Set `wal_level` to replica; we are not going to use logical replication. TODO:
  Will this create WAL files which are incompatible with one another?
- Drop the `archive_command` and set `archive_mode` to off; we're going to use
  streaming replication instead of log shipping.

# Open questions

- Would pgbouncer/pgpool be valuable here? It could be used to quickly redirect
  traffic from some or all nodes to one server or another. If so, we should
  probably put it on a dedicated host so that it's robust to host-level failures
  in the primary or secondary servers.
- What is the plan of execution for replacing the server hardware?
- How often are WALs recycled, how far into the past can we peek?
- Can we/should we detect failure conditions and automate the failover
- What kinds of monitoring and alarms can we/should we set up?
- How can we maintain an ongoing high level of confidence in our system?

It may be possible to *always* service SELECT statements from either the primary
or secondary. If we eventually require load balancing, these queries could be
eligible. However, queries (or transactions?) which mutate the database may only
be executed on the primary. Therefore:

- If we require more capacity for writes, is vertical scaling our only option?
- In the event of a primary outage, are we able to transparently direct reads
  to the secondary?
- Can application-level changes be implemented which would allow for reduced
  service availability (read-only) in the event of a primary outage?
- If applications cannot write to the database, but can write to other data
  stores (filesystem, redis, etc), can inconsistent states be produced?
- Can we simulate typical and excessive levels of application traffic against
  the database for mock scenarios?

# Mock contingency scenarios

We need to set up a test network with a primary and secondary and induce failure
scenarios to learn how our applications behave under those conditions and
practice the failure resolution steps.

This will require at least the following:

- A couple of sr.ht applications (meta.sr.ht and git.sr.ht would be good)
- A primary and secondary PostgreSQL server

At least three separate VMs are required here, four if we set up a standalone
pgbouncer or pgpool. Then we should play out the following scenarios:

- Promoting the secondary to the primary
- Shut down the primary or secondary without pre-preparation
- Accessing the database state at an arbitrary point in the past
- Major and minor version upgrades (pg 11 to 12)
- Host OS upgrades (Alpine 3.11 to 3.12 or edge)
- Cause a ZFS resliver on the primary and monitor database performance: does
  this trigger a "soft outage" where performance is awful but the database is
  ostensibly online?

At each stage of these processes, observe application behavior and record these