The woes of the testbot

For those not familiar with me, a little research should make it clear that I am the person behind the testbot deployed in 2008 that has revolutionized Drupal core development, stability, etc. and that has been running tens of thousands of assertions with each patch submitted against core and many contributed modules for 6 years.

My intimate involvement with the testbot came to a rather abrupt and unintended end several years ago due to a number of factors (which only a select few members of this community are clearly aware). After several potholes, detours, and bumps in the road, it became clear to me the impossibility of maintaining and enhancing the testbot under the policies and constraints imposed upon me.

Five years ago we finished writing an entirely new testing system, designed to overcome the technical obstacles of the current testbot and to introduce new features that would enable an enormous improvement in resource utilization that could then be used for new and more frequent QA.

Five years ago we submitted a proposal to the Drupal Association and key members of the community for taking the testbot to the next level, built atop the new testing system. This proposal was ignored by the Association and never evaluated by the community. The latter is quite puzzling to me given:

  • the importance of the testbot
  • the pride this open source community has in openly evaluating and debating literally everything (a healthy sentiment especially in the software development world)
  • I had already freely dedicated years of my life to the project.

The remainder of this read will:

  • list some of the items included in our proposal that were dismissed with prejudice five years ago, but since have been adopted and implemented
  • compare the technical merits of the new system (ReviewDriven) with the current testbot and a recent proposal regarding "modernizing" the testbot
  • provide an indication of where the community will be in five years if it does nothing or attempts to implement the recent proposal.

This read will not cover the rude and in some cases seemingly unethical behavior that led to the original proposal being overlooked. Nor will this cover the roller coaster of events that led up to the proposal. The intent is to focus on a technical comparison and to draw attention to the obvious disparity between the systems.

About Face

Things mentioned in our proposal that have subsequently been adopted include:

  • paying for development primarily benefiting drupal.org instead of clinging to the obvious falacy of "open source it and they will come"
  • paying for machine time (for workers) as EC2 is regularly utilized
  • utilizing proprietary SaaS solutions (Mollom on groups.drupal.org)
  • automatically spinning up more servers to handle load (e.g. during code sprints) which has been included in the "modernize" proposal

Comparison

The following is a rough, high-level comparison of the three systems that makes clear the superior choice. Obviously, this comparison does not cover everything.

Baseline Backwards modernization True step forward
System Current qa.drupal.org "Modernize" Proposal ReviewDriven
Status It's been running for over 6 years Does not exist Existed 5 years ago at ReviewDriven.com
Complexity Custom PHP code and Drupal Does not make use of contrib code Mish mash of languages and environments: ruby, python, bash, java, php, several custom config formats, etc.

Will butcher a variety of systems from their intended purpose and attempt to have them all communicate

Adds a number of extra levels of communication and points of failure
Minimal custom PHP code and Drupal

Uses commonly understood contrib code like Views
Maintainability Learning curve but all PHP Languages and tools not common to Drupal site building or maintenance

Vast array of systems to learn and the unique ways in which they are hacked
Less code to maintain and all familiar to Drupal contributors
Speed Known; gets slower as test suite grows due to serial execution Still serial execution and probably slower than current as each separate system will add additional communication delay An order of magnitude faster thanks to concurrent execution

Limited by the slowest test case

*See below
Extensibility (Plugins) Moderately easy, does not utilize contrib code so requires knowledge of current system Several components, one on each system used

New plugins will have to be able to pass data or tweak any of the layers involved which means writing a plugin may involve a variety of languages and systems and thus include a much wider breadth of required knowledge
Much easier as it heavily uses commons systems like Views

Plugin development is almost entirely common to Drupal development:
define storage: Fields
define display: Views
define execution: CTools function on worker

And all PHP
Security Runs as same user as web process Many more surfaces for attack and that require proper configuration Daemon to monitor and shutdown job process, lends itself to Docker style with added security
3rd party integration Basic RSS feeds and restricted XML-RPC client API Unknown Full Services module integration for public, versioned, read API and write for authorized clients
Stability When not disturbed, has run well for years, primary causes of instability include ill-advised changes to the code base

Temporary and environment reset problems easily solved by using Docker containers with current code base
Unknown but multiple systems imply more points of failure Same number of components as current system

Services versioning which allows components to be updated independently

Far less code as majority depends on very common and heavily used Drupal modules which are stable

2-part daemon (master can react to misbehaving jobs)

Docker image could be added with minimal effort as system (which predates Docker) is designed with same goals as Docker
Resource utilization Entire test suite runs on single box and cannot utilize multiple machines for single patch Multiple servers with unshared memory resources due to variety of language environments

Same serial execution of test cases per patch which does not optimally utilize resources
An order of magnitude better due to concurrent execution across multiple machines

Completely dynamic hardware; takes full advantage of available machines.

*See below
Human interaction Manually spin up boxes; reduce load by turning on additional machines Intended to include automatic EC2 spin up, but does not yet exist; more points of failure due to multiple systems Additional resources are automatically turned on and utilized
Test itself Tests could be run on development setup, but not within the production testbot Unknown Yes, due to change in worker design.

A testbot inside a testbot! Recursion!
API Does the trick, but custom XML-RPC methods Unknown Highly flexible input configuration is similar to other systems built later like travis-ci

All entity edits are done using Services module which follows best practices
3rd party code Able to test security.drupal.org patches on public instance Unknown, but not a stated goal Supports importing VCS credentials which allows testing of private code bases and thus supports the business aspect to provide as a service and to be self sustaining

Results and configuration permissioned per user to allow for drupal.org results to be public on the same instance as private results
Implemented plugins Simpletest, coder None exist Simpletest, coder, code coverage, patch conflict detection, reroll of patch, backport patch to previous branch
Interface Well known; designed to deal with display of several 100K distinct test results; lacks revision history; display uses combination of custom code and Views Unknown as being built from scratch and not begun

Jenkins can not support this interface (in Jenkins terminology multiple 100K jobs) so will have to be written from scratch (as proposal confirms and was reason for avoiding Jenkins in past)

Jenkins was designed for small instances within businesses or projects, not a large central interface like qa.drupal.org
Hierarchical results navigation from project, branch, issue, patch

Context around failed assertion (like diff -u)

Minimizes clutter, focuses on results of greatest interest (e.g. failed assertions); entirely built using Views so highly customizable

Simplified to help highlight pertinent information (even icons to quickly extract status)

Capable of displaying partial results as they are concurrently streamed in from the various workers

Speed and Resource Utilization

Arguably one of the most important advantages of the ReviewDriven system is concurrency. Interestingly, after seeing inside Google I can say this approach is far more similar to the system Google has in place than Jenkins or anything else.

Systems like Jenkins and especially travis-ci, which for the purpose of being generic and simpler, do not attempt to understand the workload being performed. For example Travis simply asks for commands to execute inside a VM and presents the output log as the result. Contrast that with the Drupal testbot which knows the tests being run and what they are being run against. Why is this useful? Concurrency.

Instead of running all the test cases for a single patch on one machine, the test cases for a patch may be split out into separate chunks. Each chunk is processed on a different machine and the results are returned to the system. Because the system understands the results it can reassemble the chunked results in a useful way. Instead of an endlessly growing wait time as more tests are added and instead of having nine machines sitting idle while one machine runs the entire test suite all ten can be used on every patch. The wait time effectively becomes the time required to run the slowest test case. Instead of waiting 45 minutes one would only wait perhaps 1 minute. The difference becomes more exaggerated over time as more tests are added.

In addition to the enormous improvement in turnaround time which enables the development workflow to process much faster you can now find new ways to use those machine resources. Like testing contrib projects against core commits, or compatibility tests between contrib modules, or retesting all patches on commit to related project, or checking what other patches a patch will break (to name a few). Can you even imagine? A Drupal sprint where the queue builds up an order of magnitude more slowly and runs through the queue 40x faster?

Now imagine having additional resources automatically started when the need arises. No need to imagine...it works (and did so 5 years ago). Dynamic spinning up of EC2 resources which could obviously be applied to other services that provide an API.

This single advantage and the world of possibility it makes available should be enough to justify the system, but there are plenty more items to consider which were all implemented and will not be present in the proposed initiative solution.

Five Years Later

Five years after the original proposal, Drupal is left with a testbot that has languished and received no feature development. Contrast that with Drupal having continued to lead the way in automated testing with a system that shares many of the successful facets of travis-ci (which was developed later) and is superior in other aspects.

As was evident five years ago the testbot cannot be supported in the way much of Drupal development is funded since the testbot is not a site building component placed in a production site. This fact drove the development of a business model that could support the testbot and has proven to be accurate since the current efforts continue to be plagued by under-resourcing. One could argue the situation is even more dire since Drupal got a "freebie" so to speak with me donating nearly full-time for a couple of years versus the two spare time contributors that exist now.

On top of lack of resources the current initiative, whose stated goal is to "modernize" the testbot, is needlessly recreating the entire system instead of just adding Docker to the existing system. None of the other components being used can be described as "modern" since most pre-date the current system. Overall, this appears to be nothing more than code churn.

Assuming the code churn is completed some time far in the future; a migration plan is created, developed, and performed; and everything goes swimmingly, Drupal will have exactly what it has now. Perhaps some of the plugins already built in the ReviewDriven system will be ported and provide a few small improvements, but nothing overarching or worth the decade it took to get there. In fact the system will needlessly require a much rarer skill set, far more interactions between disparate components, and complexity to be understood just to be maintained.

Contrast that with an existing system that can run the entire test suite against a patch across a multitude of machines, seamlessly stitch the results together, and post back the result in under a minute. Contrast that with having that system in place five years ago. Contrast that with the whole slew of improvements that could have also been completed in the four years hence by a passionate, full-time team. Contrast that with at the very least deploying that system today. Does this not bother anyone else?

Contrast that with Drupal being the envy of the open source world, having deployed a solution superior to travis-ci and years earlier.

Please post feedback on drupal.org issue.