289c99f654da91a03bead2083dec1666c76a6617 — Taavi Väänänen 10 months ago e71928a
add beta cluster blog post
1 files changed, 152 insertions(+), 0 deletions(-)

A content/posts/deployment-prep-needs-a-replacement.md
A content/posts/deployment-prep-needs-a-replacement.md => content/posts/deployment-prep-needs-a-replacement.md +152 -0
@@ 0,0 1,152 @@
title: "Wikimedia needs to re-think MediaWiki staging environments"
date: 2022-11-18
tags: [wikimedia]

Wikimedia's [Beta Cluster] (aka `deployment-prep`) needs to be replaced
with something completely different.

[Beta Cluster]: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep

The [Beta Cluster Wikitech page] describes the project's ambitions like this:

[Beta Cluster Wikitech page]: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep

> The **Beta Cluster** aims to provide a staging area that closely
> resembles the Wikimedia production environment. It runs MediaWiki
> and extensions from their master branch, allowing developers and
> power users to test new code before it goes live on Wikimedia
> websites.

This was written in [early 2013], nearly a decade ago. Back then, the
Wikimedia technical community and the WMF were much smaller. The Beta
Cluster was one of the first projects on Wikimedia Labs (which is these
days known as Wikimedia Cloud Services).

[early 2013]: https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Deployment-prep/Documentation&diff=64898&oldid=63311&diffmode=source

The Beta Cluster has from the very beginning attempted to re-use the
same [Puppet] code used in production, with the intention being that
[Beta could be used by community members to test changes]. This hasn't
always been easy as a large part of the code was not designed to run
outside production; there was even an [attempt in 2015] to build a
"stabler Beta Cluster" with an explicit goal of having Puppet do all of
the provisioning.

[Puppet]: https://wikitech.wikimedia.org/wiki/Puppet
[beta could be used by community members to test changes]: https://diff.wikimedia.org/2011/09/19/ever-wondered-how-the-wikimedia-servers-are-configured/
[attempt in 2015]: https://phabricator.wikimedia.org/T88702

To summarize: The original intention of the Beta Cluster was to allow
testing changes to _both_ MediaWiki and the underlying infrastructure.

## The infrastructure is developed elsewhere

As far as I can tell, the Beta Cluster was never maintained by the same
people taking care of the equivalent production infrastructure. The
people maintaining production infrastructure (originally called
TechOps, these days known as the [SRE team]) have different needs than
MediaWiki developers and testers do.

[SRE team]: https://wikitech.wikimedia.org/wiki/SRE

The nature of the Beta Cluster made it very inflexible for the
infrastructure people: for example, it was hard to test multiple
changes to the same component at the same time, and you needed to be
very careful not to break the cluster entirely, because that would be
disruptive to the MediaWiki developers.

Over time, the SRE team developed other systems for testing the
infrastructure. Today the main way to test infrastructure changes
in a production-like environment is [Pontoon]. Pontoon's primary aim is
to simplify starting disposable 'stacks' that are largely independent
of each other and much closer to the actual production environment
than standard Cloud VPS instances are. Cloud VPS itself has also [moved
from its original use case] of being a development environment for
services that either already ran in production or were planned to.

[Pontoon]: https://wikitech.wikimedia.org/wiki/Puppet/Pontoon
[moved from its original use case]: https://phabricator.wikimedia.org/T285539

A staging cluster that's trying to emulate production as closely as
possible should be maintained by the same people maintaining
production. Otherwise it's going to be impossible to keep up with
all the changes, and with code that for whatever reason can't easily
be used outside the environment it was originally written for.

Rather unsurprisingly, this kind of environment hasn't been very
stable. Even worse, since the people responding to outages are usually
not familiar with the system, most fixes end up being hacks that
decrease the long-term reliability of the entire platform.

## There is a demand for a better MediaWiki testing environment

Beta Cluster outages get noticed very quickly, which suggests that
people rely on the Beta Cluster working at least somewhat. However, not
everyone needs it for the same reason. Common reasons seem to include:

* Demoing new features to other people
* Testing features that need restricted rights in production
* Testing changes in a more production-like environment, for example:
  * a wiki running the same software versions as the production cluster
  * a multi-wiki setup (a wiki farm) using [CentralAuth]
  * a wiki running a similar configuration compared to production
  * a wiki that uses [Swift] for media storage
  * a wiki with a working VisualEditor
  * a wiki that's integrated with [Wikidata]
  * a wiki with a statsd stack
  * a wiki using [CirrusSearch]
  * a wiki with a proper job queue setup
  * ... and the list goes on. You get the point.

[CentralAuth]: https://www.mediawiki.org/wiki/Extension:CentralAuth
[Swift]: https://wikitech.wikimedia.org/wiki/Swift
[Wikidata]: https://www.wikidata.org/wiki/Wikidata:Main_Page
[CirrusSearch]: https://www.mediawiki.org/wiki/Extension:CirrusSearch

[Some features are hard to configure.] Others need specialized
dependencies. Either way, considering that the Beta Cluster is
([at least currently]) only for code already merged to the master
branch, we should instead focus on making it easier to run those
features locally, or on making the relevant interfaces safer, so that
we can be more confident that, for example, code storing files on the
local disk also works properly with a Swift backend.

[Some features are hard to configure.]: https://bash.toolforge.org/quip/AWZYhzBIfM03vZ1oSYM9
[at least currently]: https://phabricator.wikimedia.org/T278666
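One way to read "making the relevant interfaces safer" is a narrow storage contract that every backend must satisfy, so that code exercised against local disk gives real confidence about Swift. The sketch below is illustrative only (it is not MediaWiki's actual FileBackend API; the class and method names are invented for this example):

```python
# Illustrative sketch, not MediaWiki's real FileBackend API: a minimal
# storage contract that a local-disk backend and a Swift-like backend
# would both implement, so the same contract test runs against either.
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class FileStore(ABC):
    """Minimal storage contract: store and retrieve blobs by key."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class LocalDiskStore(FileStore):
    """Backend used in local development: plain files on disk."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def check_store_contract(store: FileStore) -> None:
    """Contract test any backend (disk, Swift, ...) must pass."""
    store.put("thumb/a.png", b"\x89PNG")
    assert store.get("thumb/a.png") == b"\x89PNG"


with tempfile.TemporaryDirectory() as tmp:
    check_store_contract(LocalDiskStore(Path(tmp)))
```

A hypothetical `SwiftStore` implementing the same two methods would be run through `check_store_contract` in CI, which is what lets developers test locally on disk without a full production-like cluster.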

## Going forward

The current model for Beta Cluster maintenance has been unsustainable
for years, and it shows. The current model doesn't work unless it's
maintained by the SRE team directly, which is not optimal for the SRE
team. Therefore I think it's reasonable to conclude that we need to
replace the current Beta Cluster with different solutions that are
more sustainable to maintain and that solve the same problems more
efficiently.

What might that solution look like? Honestly, I'm not completely sure.
What I do know is that **we need to drop the requirement to be as close
as possible to production**, and instead focus on what we need in order
to work on MediaWiki as efficiently as possible.

There are a couple of promising projects I'd like to showcase:
* [mwcli] is a command-line tool that supports running MediaWiki and
  other services in Docker.
* [Patch demo] can be used to spin up a MediaWiki instance running a
  particular patch from Gerrit.

[mwcli]: https://gitlab.wikimedia.org/repos/releng/cli
[Patch demo]: https://patchdemo.wmflabs.org/

## Acknowledgements

Thanks to [Tyler Cipriani] for providing me access to the 2018 Beta
Cluster Survey, which provided helpful insights on how people use the
Beta Cluster.

[Tyler Cipriani]: https://www.mediawiki.org/wiki/User:TCipriani_(WMF)