• Handling multicolumn and aggregated data using Drupal 7 fields

    The following technique and many of the tools were developed/improved by ReviewDriven.

    Problems

    The field API in Drupal 7 combined with Views is a very powerful combination. Yet there are certain data structures that are difficult or inefficient to work with using the two tools. One such structure is a table of data with multiple columns, which can be stored using the field API but gives rise to a number of issues.

    • Field data is loaded every time the entity is loaded, which can be catastrophic with large datasets.
      • For example, when editing a node with a large dataset attached, a field element is generated for each value, which quickly eats up memory and processing power and can easily cause white screens.
      • When viewing a node, the data is loaded even if the fields are hidden, which is again a tax on memory.
    • The relationship between columns is not exported to Views, since Views has no way to know about it, which limits the ways values may be displayed.
    • Aggregating columns in Views cannot be done across entities.
    • Views loads field data through the field API, which is inefficient for large datasets but ensures that display formatters and the like are respected.

    Solutions

    Thankfully, there are a number of tools that can be utilized to solve each of these problems and in combination provide a very powerful way to handle tables of data and aggregation across entities.

    Field suppress

    From the Field suppress project page:

    Suppress field data from being loaded during entity_load(). Since field data will not be loaded it will not be displayed nor editable through the interface. This can be handy if you are using an alternate means to display or edit data and/or if you have a large amount of data in fields which will cause the node edit (or similar) interface to use a huge amount of memory and take a very long time to build.

    Field suppress solves the first problem, but data will then need to be edited directly through code and displayed manually (e.g. with Views).
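
    As a rough sketch, and assuming Field suppress is enabled for a hypothetical field_my_data field, writing the data programmatically could look like the following (the node ID and values are made up; since suppressed values are not loaded, the complete set of values is written back explicitly).

    // Sketch only: the node ID, field name, and values are hypothetical.
    // With Field suppress enabled the existing values are not loaded, so the
    // complete set of values is written back explicitly.
    $node = node_load(123);
    $dataset = array(70, 69, 68);
    foreach ($dataset as $delta => $value) {
      $node->field_my_data[LANGUAGE_NONE][$delta]['value'] = $value;
    }
    // Write the field tables without a full node_save() and clear the field
    // cache entry for this node.
    field_attach_presave('node', $node);
    field_attach_update('node', $node);
    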

    Field group views

    The next problem requires an interface/API for defining relationships between fields (or columns). Relationships can be defined using Field group views, a plugin for Field group (which provides both a UI and exportables for defining groups of fields). A display group can be defined and its plugin set to Views, which will then export the proper relationships to Views and generate a stub view that can be customized. The view will automatically replace the fields when the entity is displayed. Field group views requires Views field, which actually solves the last two problems.

    Views field

    Views field exposes the field tables and revision field tables to Views as base tables. Using the field base tables means that field data can be loaded directly instead of going through the field API. Loading data directly is much more efficient (especially for large datasets) and allows for aggregation across multiple entities, but loses the formatting capabilities of the field API (support for formatting could possibly be added to Views field). Formatting can also be added to the exposed field base table through hook_views_data_alter() in the following manner.

    /**
     * Implements hook_views_data_alter().
     */
    function conduit_views_data_alter(&$data) {
      // May not always be '_value' depending on field type.
      $data['field_data_MY_FIELD']['MY_FIELD_value']['field']['handler'] = 'views_handler_HANDLER';
    }
    

    Please note, there are two bugs that cause annoyances due to changes in the Views API, but they do not prevent Views field from working. Feel free to submit patches.

    Example

    I have used these tools in combination on a number of projects with quite satisfying results, but I will attempt to provide a few generic examples to give a clearer picture of how they can be used.

    Tallying summary results

    If you have a collection of entities and you want to group the overall field data and perform SQL operations like COUNT() or SUM(), Views field makes it easy. The core poll module could be rewritten using this technique (in one of many possible ways), so we will use poll as an example.

    Say you have a node type for “Foo” poll entries that looks something like the following.

    poll_foo_entry

    • poll_foo_value: customizable field capable of storing the value for a poll, in this case let's go with a text field

    Results can be calculated using a view with the base table poll_foo_value and aggregation enabled. The poll_foo_value column can be used as the GROUP BY column, and additional columns can be added to determine the COUNT(). You could then display the results using a chart plugin for Views.
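
    For reference, the query such a view builds is roughly equivalent to the following sketch (my guess at the equivalent DBTNG query, assuming the default field SQL storage table field_data_poll_foo_value and its poll_foo_value_value column).

    // Sketch only: assumes the default field_sql_storage table and column
    // names for a text field named poll_foo_value on the poll_foo_entry type.
    $query = db_select('field_data_poll_foo_value', 'v');
    $query->addField('v', 'poll_foo_value_value', 'answer');
    $query->addExpression('COUNT(v.entity_id)', 'votes');
    $query->condition('v.entity_type', 'node')
      ->condition('v.bundle', 'poll_foo_entry')
      ->groupBy('v.poll_foo_value_value');
    // Result: each distinct answer mapped to the number of entries for it.
    $results = $query->execute()->fetchAllKeyed();
    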

    I used this technique to create http://survey.reviewdriven.com/results. The tallied results are displayed on the left.

    Table of data

    Another powerful use case is storing and displaying a table of data. Let's use a simple example of storing temperature data over time on nodes (possibly one node per region). A possible node structure is as follows.

    temperature_history

    • title: region or some such
    • date: multivalue date field
    • temperature: multivalue temperature field

    The date and temperature fields can then be related using Field group views by placing them in the same group, and displayed using a view. The fields are related based on each field's delta. In other words, the data is stored using the field API in the following manner.

    date[0] = 2011-09-01
    date[1] = 2011-09-02
    date[2] = 2011-09-03
    date[3] = 2011-09-04
    date[4] = 2011-09-06
    
    temperature[0] = 70
    temperature[1] = 69
    temperature[2] = 68
    temperature[3] = 67
    temperature[4] = 66
    

    The number in brackets is the delta, which allows the tables to be joined intelligently to produce a table like the following.

    Date        Temperature
    2011-09-01  70
    2011-09-02  69
    2011-09-03  68
    2011-09-04  67
    2011-09-06  66
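
    For the curious, the delta join behind that table is roughly equivalent to the following sketch (assuming the default field SQL storage table names for fields named date and temperature; the actual value column names depend on the field types used).

    // Sketch only: assumes default field_sql_storage table names for fields
    // named date and temperature; value column names vary by field type.
    $query = db_select('field_data_date', 'd');
    $query->join('field_data_temperature', 't',
      'd.entity_type = t.entity_type AND d.entity_id = t.entity_id AND d.delta = t.delta');
    $query->addField('d', 'date_value', 'date');
    $query->addField('t', 'temperature_value', 'temperature');
    $query->condition('d.entity_type', 'node')
      ->condition('d.bundle', 'temperature_history')
      ->orderBy('d.delta');
    $rows = $query->execute()->fetchAll();
    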

    Improvements

    There are still some areas for improvement, but one only has so much time.

    Per-Bundle Storage

    Storing related field data in a single table allows for the removal of overlapping columns and removes the need to join multiple tables. A field group could then be placed in a separate bundle or otherwise exposed to PBS and stored together in a single table. The database schema would then be much more manageable and easier to query manually.

    Please keep in mind that PBS is not currently functional.

    Editing

    Being able to edit the dataset using a view would also be a major plus. There are a number of modules that provide functionality of this sort, but not in such a flexible manner. Something like Editview would complete the Field group views functionality. Large datasets could then be paginated for editing to prevent overload.

    Exciting usage

    Given that Views is pluggable, virtually any data structure could be displayed using this technique. The advantage over writing a one-off field is that the structure is easy to extend and modify, and requires little to no code to create. Simply export the field and field group definitions.

    Fields such as the Name field could be turned into a collection of fields displayed using a view and editable (assuming that gets implemented) using a similar structure.

    The possibilities opened up by this technique are quite exciting and I look forward to seeing what people come up with.

  • Managing upstream integrations and releases

    The AWS SDK for PHP Drupal integration project, which I maintain, provides releases that correspond to each release made by Amazon. Having releases that correspond to an upstream source is something you see in Linux packaging systems, and it offers some interesting bonuses. In the Linux distribution model individuals do not monitor upstream for changes, and packages are rebuilt and have their versions incremented regardless of whether the packaging scripts themselves changed. Given that Drupal distributions, and even individual site builds, mimic Linux distributions in many ways, I think the reasons behind making corresponding releases are interesting and worth consideration.

    To clarify what releases corresponding to upstream releases means consider the following from the AWS SDK for PHP project page.

    Since Drupal.org does not allow for three-part version numbers this project follows the AWS SDK for PHP 1.x development line and the release version mapping is as follows. The mapping shows the Drupal release to the Amazon release. Please ensure you are using a Drupal module and Amazon SDK pair that match the version number mapping. For example, use the 4.1 Drupal module with the 1.4.1 Amazon SDK.
    • 4.x -> 1.4.x
    • 3.x -> 1.3.x
    • 2.x -> 1.2.x

    Making a release whenever upstream makes a release means that some releases may end up with little or no change to the actual Drupal module. The AWS SDK module provides a Drush Make file that is updated each release to point to the matching upstream release, but that is the only change for some of the releases.

    Making arbitrary releases may sound foolish at first, but consider some of the advantages.

    • Drupal site maintainers and distribution developers can receive update notifications from the standard Drupal core update system instead of having to monitor upstream for releases. Making it easier to watch for updates encourages keeping up to date with upstream changes, which has its own benefits.
    • Any incompatibilities with a specific version of a third-party system or upstream library do not need to be tracked, since the integration project is always used with a specific upstream release.
    • New features or configuration options from an upstream source may be added to the integration module immediately, without worrying about incompatibility with older releases of the upstream source. That is, if a configuration variable changes name or a new option is added, it can be exposed through a Drupal UI without worrying about whether people are using the appropriate upstream version (since the two are always intended to be used as a matching pair).
    • When building a site or distribution using Drush Make, specific versions of upstream libraries can easily be included using the corresponding Drupal project with a proper make file. If one wants AWS SDK 2.6, simply add `projects[awssdk] = 2.6` instead of having to include the Drupal project and override a generic third-party make script.
    • Handling of updates, when necessary, is much simpler since you always know the library version that was used with the previous release.

    The general pattern that seems to emerge is that upstream sources integrate better into the Drupal workflow, tools, and infrastructure when matching releases are made.

    Few, if any, other third-party or library integrations on drupal.org seem to follow this workflow. Making corresponding releases may not be appropriate in all situations, but I think it warrants consideration, and I am interested to hear thoughts on the subject.

  • Comments on blog were broken

    The reCAPTCHA decided to stop working on my blog. Robert Douglass was kind enough to notify me of the problem. I fixed the problem, but suspect I may have missed some good feedback on my most recent post, Reflections on Drupal Quality Assurance. If you had intended to comment, please revisit the post.

  • Reflections on Drupal Quality Assurance

    We recently launched ReviewDriven, a distributed quality assurance platform, which is the culmination of months of work and of knowledge gained working with Drupal and its community over the past 3+ years. I look forward to feedback from the community regarding ReviewDriven, and to being able to fund further development of the service while at the same time improving the Drupal quality assurance ecosystem.

    Since ReviewDriven is a major event in my life and Drupal career it caused me to reflect on how I got to this point. From my humble beginnings in the Google Highly Open Participation Contest (GHOP), where my testing roots were planted, I received encouragement and mentoring which inspired me to continue working with open source. Previously, I had been interested in contributing to open source, but had never found a place to plug in. Since my start with Drupal I have contributed in a variety of ways to openSUSE, the Linux kernel, and KDE.

    After my initial introduction to Drupal, I was received with enthusiasm and the community helped me get to Drupalcon Boston 2008. I took part in the GHOP and SimpleTest presentations. During the coding sprint after the conference, I was approached by Kieran Lal who offered to help me get to a testing sprint in Paris. Again the Drupal community along with some help from Google made it possible for me to attend. Not only was it my first time out of the United States, but I got to spend a few days working closely with some of Drupal’s best. During the sprint Drupal took a major step towards realizing automated testing with the introduction of SimpleTest (or rather a fork) into Drupal core.

    In the months after the sprint I pushed hard to maintain, add to, and improve the tests in core. At the time patches were committed without much thought given to the tests so keeping the tests passing was a full time job. After discussions with Kieran I ended up taking over the testing.drupal.org (now qa.drupal.org) effort. After a radical redesign and plenty of work we managed to deploy testing.drupal.org and enable integration with the issue queue once we finally got all the tests passing. With the integration also came the adoption of the current “tests always pass” ideology and requirement to include tests with patches which has revolutionized Drupal development. The system even caught some interesting drupal.org bugs.

    Again thanks to community support I was able to attend Drupalcon DC and give a presentation with Kieran on the testing saga. The conference was a lot of fun in general and gave me a chance to meet all the people who I had been working with fairly regularly. Later that year I was accepted to Google Summer of Code (GSoC) for the second time and I worked part-time for Acquia as an intern over the same summer. After an exciting summer, with help from the community, I attended Drupalcon Paris 2009 where I gave another presentation on SimpleTest and the automated testing system with Kieran. After a productive Drupalcon we deployed the second version of the automated testing system and continued to improve the system.

    Before the end of the year I was hired full-time by Examiner to lead their quality assurance effort. The opportunity provided me with first-hand experience of how quality assurance can fit into an enterprise Drupal development workflow. Additionally, I was able to spend a portion of my time improving the automated testing system so that we could enable partial testing of contributed projects. Examiner required a slightly different approach to testing, which formed the basis for SimpleTest 7.x-2.x. Examiner sponsored the development team to attend Drupalcon San Francisco 2010, during which I gave a talk on Quality Assurance in Drupal 8 at the Core developer summit, gave a SimpleTest presentation, and held a productive BoF on testing.

    In addition to the specifics mentioned I have been blessed to work with and learn from many skilled Drupal developers, and to contribute to Drupal core and contrib which has further refined my skills. My Drupal career has been a great learning experience in addition to being fun and exciting. I look forward to continued involvement with and support from the Drupal community!

  • Drupal 8 thoughts: configuration management and improved installation

    I have been doing a lot of work related to easing the process of building a site from scratch on an individual machine and thus dabbling with configuration management and related topics. Since configuration management is one of the Drupal 8 key initiatives I figured I would share some thoughts I had on the outer fringes of the topic with more to come in the future.

    Installation

    One area of Drupal that has always seemed a bit odd to me is the installation process/system. The process can play a key part in configuration management across machines, yet it made it impossible to implement a basic environment system during installation without hacking core. To remedy this issue I believe the installation system can be much improved, simplified, and made much more consistent with the rest of Drupal, which will in turn make it easy to implement my environment system in contrib or core.

    The installation system in Drupal 7 was refactored/rewritten quite a bit and is thus much improved, but I think the direction of the installation system could be changed to make it much better. Currently the installer attempts to fake systems in core, like the cache, and actually duplicates a lot of code found elsewhere for module management and the like. Why not simply package a minimal database dump, similar to how the update tests use a Drupal 6 dump written in DBTNG, that can be installed to create an extremely minimal Drupal installation? At that point the installer can act like any other module and provide forms to complete the process. All modules beyond the required ones can be installed through the standard process invoked from the modules page, but in an automated fashion driven by the installer.

    The reasons this approach is beneficial are: 1) install.inc and install.core.inc could be virtually removed, 2) profiles would no longer be “hackish” during the install phase (solving the current issues around dependency resolution and the like not being the same for modules and profiles), 3) hooks like hook_system_info_alter() would work properly for profiles and modules during the early installation phases, and 4) the race conditions caused by maintaining two sets of the same code would be removed.

    Environments

    In addition, my environment.module would work without hacking core. This concept is nothing new in the development world, but is something I feel would be a great candidate for Drupal 8.

    /**
     * Implements hook_system_info_alter().
     */
    function environment_system_info_alter(&$info, $file, $type) {
      if (!empty($info['dependencies'])) {
        $environment = environment_get();
        // Merge the dependencies listed for the current environment into the
        // main dependency list.
        if (!empty($info['dependencies'][$environment])) {
          $info['dependencies'] = array_merge($info['dependencies'], $info['dependencies'][$environment]);
        }
        // Remove the environment-keyed sub-arrays so that only a flat list of
        // dependencies remains.
        foreach ($info['dependencies'] as $key => $dependency) {
          if (!is_numeric($key)) {
            unset($info['dependencies'][$key]);
          }
        }
      }
    }
    
    /**
     * Get the current environment.
     *
     * @return
     *   The current environment: production, staging, or development.
     */
    function environment_get() {
      return variable_get('environment', 'production');
    }
    

    The above code allows for the environment to be configured in the settings.php file for a site.

    $conf['environment'] = 'development';
    

    During installation a profile (or module) can perform different tasks or run conditional code based on the environment. For example, generated users can have simple passwords on a development machine and complex ones on production or staging machines.

    function my_profile_install() {
      if (environment_get() == 'development') {
        // Do cool stuff that only devs get to see.
      }
    }
    

    Another cool feature, which I have a use case for on a site I am working on and which seems generally useful, is enabling modules based on the environment. Instead of having to do that in hook_install() or similar, it makes sense to have a way to specify it in a .info file. The above code allows for the following.

    name = My profile
    description = ....
    version = 0.1
    core = 7.x
    
    dependencies[] = block
    dependencies[] = dblog
    
    dependencies[development][] = views_ui
    dependencies[development][] = field_ui
    dependencies[development][] = devel
    
    dependencies[production][] = integration_with_third_party
    dependencies[staging][] = integration_with_third_party
    

    Having this type of functionality in core would hopefully encourage better development practices, and it seems like a great feature to have. I have a number of scripts that, in combination with drush make and the above environment utility, allow me to build out a fully functional site on a new box with a single command. I plan to clean up the scripts, document them, and provide them in a follow-up post. As always I would love to hear your thoughts on this subject.