~sircmpwn/metrics.sr.ht

Alert on increase in unconfirmed registrations

There is always some fallout, but over the past two weeks the ratio was
never above 0.0001. It did go up to 0.0004 when there was an issue with
email delivery, so 0.0002 seems to be a decent value to trigger an
investigation.
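
The threshold reasoning above can be sketched as a simple check. This is only an illustration: the function name is hypothetical, and the actual rule is a Prometheus expression that is not shown in this log.

```python
# Hypothetical helper; only the three ratio values come from the commit message.
ALERT_THRESHOLD = 0.0002  # value chosen in the commit message


def should_investigate(ratio: float, threshold: float = ALERT_THRESHOLD) -> bool:
    """Return True when the unconfirmed-registration ratio exceeds the threshold."""
    return ratio > threshold


# Normal fallout over the past two weeks stayed at or below 0.0001: no alert.
normal = should_investigate(0.0001)
# The email delivery incident pushed the ratio to 0.0004: alert.
incident = should_investigate(0.0004)
```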
node_rules: take all CPU modes into account

Currently, the "high CPU usage" alert only looks at time spent in user
mode. This can hide issues where a significant amount of time is spent in
kernel mode ("system"), iowait, or similar.

To take all activity into account, invert the query to assert that the
CPU always spends at least 20% of capacity idling. Using 20% instead of
25% here to try to make this stay somewhat equivalent. Previously, 75%
user plus an assumed 5% system overhead was fine, so it should be again.
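
A minimal sketch of why the inverted check catches more, assuming per-mode CPU time is available as a plain mapping. The function names and the sample numbers are made up for illustration; the real rule is a Prometheus query.

```python
def high_cpu_user_only(mode_seconds, limit=0.75):
    """Old behaviour: fire only when user-mode time exceeds the limit."""
    total = sum(mode_seconds.values())
    return total > 0 and mode_seconds.get("user", 0.0) / total > limit


def high_cpu_all_modes(mode_seconds, min_idle=0.20):
    """New behaviour: fire when less than min_idle of capacity is idle."""
    total = sum(mode_seconds.values())
    return total > 0 and mode_seconds.get("idle", 0.0) / total < min_idle


# Illustrative sample: the machine is mostly stuck in iowait, not user mode.
io_bound = {"user": 40.0, "system": 10.0, "iowait": 40.0, "idle": 10.0}
# Illustrative sample: moderate user load with plenty of idle time.
healthy = {"user": 50.0, "system": 5.0, "iowait": 0.0, "idle": 45.0}
```

With the `io_bound` sample, the user-only check stays silent at 40% user time, while the idle-based check fires because only 10% of capacity is idle.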
Loosen up backup rules

The git.sr.ht backups tend to take a pretty long time these days and we
get some false positives on this.

Might tune this figure back down a bit if/when we switch to bupstash.
chat: add alarm for synIRC
build_rules.yml: correct name of builds submitted metric
Add alerts for high worker utilization

Additionally, update the metric used for the high number of builds timing
out and double the limit for the high number of build submissions, since the
high worker utilization alarm should cover most of the cases that the
submission alarm was meant to handle.
chat: add rules for /media/soju-logs

Sigh, really thought we had this already, but apparently not…
fix incorrect expression for "Instance rebooted"

node_boot_time_seconds is not "seconds since boot", it is "unix time of
boot". Therefore, the current unix time minus the boot unix time is the
actual number of seconds since boot.
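
The arithmetic, sketched under the assumption that the boot timestamp is available as a plain number. The helper names and the 300-second window are illustrative and are not taken from the rule file.

```python
import time


def seconds_since_boot(boot_time_unix, now_unix=None):
    """Uptime is the current unix time minus the unix time of boot."""
    if now_unix is None:
        now_unix = time.time()
    return now_unix - boot_time_unix


def recently_rebooted(boot_time_unix, now_unix=None, window_seconds=300.0):
    """True if the instance booted within the last window_seconds (assumed value)."""
    return seconds_since_boot(boot_time_unix, now_unix) < window_seconds


# Illustrative timestamps: the instance booted two minutes before "now".
boot = 1_700_000_000
uptime = seconds_since_boot(boot, now_unix=boot + 120)
```

Comparing the raw boot timestamp against a small window would never match, since it is a large unix time rather than an uptime.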
.build.yml: upgrade to 3.17

metrics was bumped
Add postgres_rules.yml
Update libera chat alarm
chat: bump Rizon alert to 40

The hard limit is now 50. Set the alert to 40 so that we can contact
Rizon support in time when we're getting close.
backup_rules.yml: bump to 72 hours

borg is super slow and only getting slower as our dataset grows. The
long-term solution is to switch to bupstash, but for now this should
reduce the noise.
Add chat.sr.ht rules

Set up alerts monitoring the number of connections to some
well-known IRC networks.
Fix build queue length alert

Accidentally left in an old in-development metric name I used.
build.yml: upgrade to Alpine 3.15
Fix High number of 500 errors alert to work instance-wide

This was originally intended to look at the instance-wide stats,
but I accidentally copied the wrong query from my experiments.
Filter out low traffic routes from high number of errors alert

Set the cutoff to at least 1 request per minute over the past hour.
Currently around 40 routes reach this rate, which is about 10% of all
routes.
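
The cutoff can be sketched as follows, assuming per-route request counts for the past hour are available as a mapping. The function name and the example routes are hypothetical; the real filter lives in the Prometheus expression.

```python
def routes_above_cutoff(requests_last_hour, min_per_minute=1.0):
    """Keep routes averaging at least min_per_minute requests over the past hour."""
    min_requests = min_per_minute * 60  # 1 request/minute is 60 requests/hour
    return {
        route
        for route, count in requests_last_hour.items()
        if count >= min_requests
    }


# Hypothetical routes and counts, for illustration only.
sample = {"/": 5400, "/login": 60, "/rarely-used": 3}
kept = routes_above_cutoff(sample)
```

Only the routes in `kept` would then be considered by the error alert, so a single failure on a rarely used route no longer dominates the error rate.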
Remove builds short-circuit for patches

I can't get it right, and I'd rather have builds deploy than have
patches succeed builds
Bring back service alarms

This brings back an improved version of the high error count alarm that
was removed for being too noisy, which was mostly caused by the fact
that python services didn't report consistent metrics without prometheus
multiprocessing mode, which has now been implemented. An alert for
webhook queues is also added.