@@ 0,0 1,152 @@
+---
+title: "Wikimedia needs to re-think MediaWiki staging environments"
+date: 2022-11-18
+tags: [wikimedia]
+---
+
+Wikimedia's [Beta Cluster] (aka `deployment-prep`) needs to be replaced
+with something completely different.
+
+[Beta Cluster]: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep
+
+The [Beta Cluster Wikitech page] describes the project's ambitions like
+this:
+
+[Beta Cluster Wikitech page]: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep
+
+> The **Beta Cluster** aims to provide a staging area that closely
+> resembles the Wikimedia production environment. It runs MediaWiki
+> and extensions from their master branch, allowing developers and
+> power users to test new code before it goes live on Wikimedia
+> websites.
+
+This was written in [early 2013], nearly a decade ago. Back then, the
+Wikimedia technical community and the WMF were much smaller. The Beta
+Cluster was one of the first projects on Wikimedia Labs (known these
+days as Wikimedia Cloud Services).
+
+[early 2013]: https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Deployment-prep/Documentation&diff=64898&oldid=63311&diffmode=source
+
+The Beta Cluster has from the very beginning attempted to re-use the
+same [Puppet] code used in production, with the intention being that
+[Beta could be used by community members to test changes]. This hasn't
+always been easy, as a large part of the code was not designed to run
+outside production; there was even an [attempt in 2015] to build a
+"stabler Beta Cluster" with an explicit goal of having Puppet do all of
+the provisioning.
+
+[Puppet]: https://wikitech.wikimedia.org/wiki/Puppet
+[beta could be used by community members to test changes]: https://diff.wikimedia.org/2011/09/19/ever-wondered-how-the-wikimedia-servers-are-configured/
+[attempt in 2015]: https://phabricator.wikimedia.org/T88702
+
+To summarize: The original intention of the Beta Cluster was to allow
+testing changes to _both_ MediaWiki and the underlying infrastructure.
+
+## The infrastructure is developed elsewhere
+
+As far as I can tell, the Beta Cluster was never maintained by the same
+people taking care of the equivalent production infrastructure. The
+people maintaining production infrastructure (originally called
+TechOps, these days known as the [SRE team]) have different needs than
+MediaWiki developers and testers do.
+
+[SRE team]: https://wikitech.wikimedia.org/wiki/SRE
+
+The nature of the Beta Cluster made it very inflexible for the
+infrastructure people: for example, it was hard to test multiple changes
+for the same component at the same time, and you needed to be very
+careful not to break the cluster entirely because that would be
+disruptive to the MediaWiki developers.
+
+Over time, the SRE team developed other systems for testing the
+infrastructure. Today the main way used to test infrastructure changes
+in a production-like environment is [Pontoon]. Pontoon's primary aim is
+to simplify starting disposable 'stacks' that are largely independent
+of each other and are much closer to the actual production environment
+than standard Cloud VPS instances are. Cloud VPS itself has also [moved from
+its original use case] of being a development environment for services
+that were either already running in production or planned to run there.
+
+[Pontoon]: https://wikitech.wikimedia.org/wiki/Puppet/Pontoon
+[moved from its original use case]: https://phabricator.wikimedia.org/T285539
+
+A staging cluster that's trying to emulate production as closely as
+possible should be maintained by the same people maintaining
+production. Otherwise it's going to be impossible to keep up with
+all the changes, and with code that for whatever reason can't be easily
+used outside the environment it was originally written for.
+
+Rather unsurprisingly, this kind of environment hasn't been very stable.
+Even worse, since the people responding to outages are usually not
+familiar with the system, most fixes end up being hacks that decrease
+the long-term reliability of the entire platform.
+
+## There is a demand for a better MediaWiki testing environment
+
+Beta Cluster outages get noticed very quickly, which suggests that
+people rely on the Beta Cluster working at least somewhat. However, not
+everyone needs it for the same reason. Common reasons seem to include:
+
+* Demoing new features to other people
+* Testing features that need restricted rights in production
+* Testing changes in a more 'production-like environment', for example
+  on
+  * a wiki running the same software versions as the production cluster
+    does
+  * a multi-wiki setup (a wiki farm) using [CentralAuth]
+  * a wiki running a configuration similar to production
+  * a wiki that uses [Swift] for media storage
+  * a wiki with a working VisualEditor
+  * a wiki that's integrated with [Wikidata]
+  * a wiki with a statsd stack
+  * a wiki using [CirrusSearch]
+  * a wiki with a proper job queue setup
+  * ... and the list goes on. You get the point.
+
+[CentralAuth]: https://www.mediawiki.org/wiki/Extension:CentralAuth
+[Swift]: https://wikitech.wikimedia.org/wiki/Swift
+[Wikidata]: https://www.wikidata.org/wiki/Wikidata:Main_Page
+[CirrusSearch]: https://www.mediawiki.org/wiki/Extension:CirrusSearch
+
+[Some features are hard to configure.] Others need specialized
+dependencies. Either way, considering the Beta Cluster is ([at least
+currently]) only for code already merged to the master branch, we
+should instead focus on making it easier to run those features locally,
+or on making the relevant interfaces safer so we can be more confident
+that, for example, code storing files on the local disk also works
+properly with a Swift backend.
+
+[Some features are hard to configure.]: https://bash.toolforge.org/quip/AWZYhzBIfM03vZ1oSYM9
+[at least currently]: https://phabricator.wikimedia.org/T278666
+
+## Going forward
+
+The current model for Beta Cluster maintenance has been unsustainable
+for years, and it shows. It doesn't work unless the cluster is
+maintained by the SRE team directly, which is not optimal for the SRE
+team. Therefore I think it's reasonable to conclude that we need to
+replace the current Beta Cluster with a different solution (well,
+solutions): something more sustainable to maintain that solves the
+same problems more efficiently.
+
+What might that solution look like? Honestly, I'm not completely sure.
+What I do know is that **we need to drop the requirement to be as close
+as possible to production**, and instead focus on what we need in
+order to work on MediaWiki as efficiently as possible.
+
+There are a couple of promising projects I'd like to showcase:
+* [mwcli] is a command-line tool that can manage a MediaWiki development
+  environment and its supporting services running in Docker.
+* [Patch demo] can be used to spin up a MediaWiki instance running a
+ particular patch from Gerrit.
+
+[mwcli]: https://gitlab.wikimedia.org/repos/releng/cli
+[Patch demo]: https://patchdemo.wmflabs.org/
+
+## Acknowledgements
+
+Thanks to [Tyler Cipriani] for providing me access to the 2018 Beta
+Cluster Survey, which provided helpful insights on how people use the
+Beta Cluster.
+
+[Tyler Cipriani]: https://www.mediawiki.org/wiki/User:TCipriani_(WMF)