Handling multicolumn and aggregated data using Drupal 7 fields

The following technique and many of the tools were developed/improved by ReviewDriven.

Problems

The field API in Drupal 7 combined with Views is very powerful combination. Yet there are certain data structures that are difficult or inefficient to work with using the two tools. One such structure is tables of data with multiple columns which can be stored using the fields API, but have a number of issues that arise.

  • Field data is loaded every time the entity is loaded which can be catastrophic with large datasets.
    • For example, when editing a node with a large dataset attached each value has a corresponding field element generated which eats up memory and processing power quickly and can easily cause white screens.
    • When viewing a node even if the fields are hidden the data is loaded and again is a tax on memory.
  • Relationship between columns is not exported to Views, since it has no way to know, which limits the way values may be displayed.
  • Aggregating columns in Views cannot be done across entities.
  • Views loads field data through the field API which is inefficient for large datasets, but ensures that display formatters and such are respected.

Solutions

Thankfully, there are a number of tools that can be utilized to solve each of these problems and in combination provide a very powerful way to handle tables of data and aggregation across entities.

Field suppress

From the Field suppress project page.

Suppress field data from being loaded during entity_load(). Since field data will not be loaded it will not be displayed nor editable through the interface. This can be handy if you are using an alternate means to display or edit data and/or if you have a large amount of data in fields which will cause the node edit (or similar) interface to use a huge amount of memory and take a very long time to build.

Field suppress solves the first problem, but data will then need to be edited directly through code and manually displayed (ex. Views).

Field group views

The next problem requires an interface/API for defining relationships between fields (or columns). Relationships can be defined using Field group views which is a plugin for Field group which provides both a UI and exportables for defining groups of fields. A display group can be defined and the plugin set to Views which will then export the proper relationships to Views and generate a stub view which can then be customized. The view will automatically replace the fields when displaying the entity. A requirement of Field group views is Views field which actually solves the last two problems.

Views field

Views field exposes the field tables and revision field tables to Views as base tables. Using the field base tables means that field data can be loaded directly instead of going through the field API. Loading data directly is much more efficient (especially for large datasets) and allows for aggregation across multiple entities, but losses the formatting capabilities of the fields API (could possibly add support for formatting to Views field). Formatting can also be added to the exposed field base table through hook_views_data_alter() in the following manor.

<?php
/**
 * Implements hook_views_data_alter().
 */
function conduit_views_data_alter(&$data) {
  
// May not always be '_value' depending on field type.
  
$data['field_data_MY_FIELD']['MY_FIELD_value']['field']['handler'] = 'views_handler_HANDLER';
}
?>

Please note, there are two bugs that cause annoyances due to changes in the Views API, but do no prevent Views field from working. Feel free to submit patches.

Example

I have utilized these tools in combination on a number of projects with quite satisfying results, but I will attempt to provide a few generic examples to provide a clearer picture of how these tools can be used.

Tallying summary results

If you have a collection of entities and you want to be able to group the overall field data and perform SQL operations like COUNT() or SUM() Views field makes it easy. The core poll module could be rewritten using this technique so we will use poll as an example (could be done many ways).

Say you have a node type for "Foo" poll entries that looks something like the following.

poll_foo_entry

  • poll_foo_value: customizable field capable of storing the value for a poll, in this case lets go with a text field

Results can be calculated using a view with the base table poll_foo_value and aggregation enabled. The poll_foo_value column can be used as the group by column and additional columns can be added for determining the COUNT(). You could even then display the results using a chart plugin for views.

I used this technique to create http://survey.reviewdriven.com/results. The tallied results are display on the left.

Table of data

Another powerful usecase is storing and displaying a table of data. Lets use a simple example of storing temperature data over time on nodes (possibly a node per region). A possible node structure is as follows.

temperature_history

  • title: region or some such
  • date: mulivalue date field
  • temperature: mulivalue temperature field

The date and temperature fields can then be related using Field group views by placing them in the same group and displayed using a view. The fields are related based on each fields delta. In other words the data is stored using the field API in the following manor.

date[0] = 2011-09-01
date[1] = 2011-09-02
date[2] = 2011-09-03
date[3] = 2011-09-04
date[4] = 2011-09-06
 
temperature[0] = 70
temperature[1] = 69
temperature[2] = 68
temperature[3] = 67
temperature[4] = 66

The number in brackets being the delta which allows the tables to be joined intelligently and produce a table like the following.

Date Temperature
2011-09-01 70
2011-09-02 69
2011-09-03 68
2011-09-04 67
2011-09-06 66

Improvements

There are still some areas for improvement, but one only has so much time.

Per-Bundle Storage

Storing related field data in a single table allows for the removal of overlapping columns and removes the need to join multiple tables. A field group could then be placed in a separate bundle or otherwise exposed to PBS and stored together in a single table. The database scheme would then be much more manageable and easier to query manually.

Please keep in mind that PBS is not currently functional.

Editing

Being able to edit the dataset using a view would also be a major plus. There a number of modules that provide functionality of this sort, but not in such a flexible manor. Something like Editview would complete Field group views functionality. Large datasets could then be paginated for editing to prevent overload.

Exciting usage

Given that Views is plugable virtually an data structure could be displayed using this technique. The advantage over writing a one off field is that the structure is then easy to extend and modify, and requires little to no code to create. Simply export the fields, and field group definitions.

Fields such as the Name field could be turned into a collection of fields displayed using a view and editable (assuming that gets implemented) using a similar structure.

The possibilities opened up by this technique are quite exciting and I look forward to seeing what people come up with.

Comments

Great overview!
Now that Fields has covered 'every field as a table' it's time to regain some of the virtues of relational databases and normalization of data, and re-introduce the D6-capabilities: Every non-translatable, single-value field should belong to another field/entity. Per bundle storage is a great initiative to add a UI for this.

You should perhaps write on the per bundle storage part that this module is currently 100% unusable.

That is a good point, I'll edit the original post to include it.

Nice article.

Not sure what to say about the memory increase in D7. It is so easy and relatively cheap today to get a larger VM for your site. The memory increase is unfortunate but the remedy is not exactly hard.

Views loads field data through the field API which is inefficient for large datasets, but ensures that display formatters and such are respected.

I guess you are wanting a return to D6 where Views actually performed a JOIN to get all the fields in the View. The D7 way is superior IMO since Views just has to get the IDs and then load each entity via node_load_multiple(), user_load_multiple(), etc. Those new functions are optimized for loading field data for all the IDs in one fell swoop. The improvement is that the original query has far fewer joins.

I'll also say that ad hoc querying of field data is easier in D6 versus D7 since the schema does not change on your when someone changes from single value to multiple or unshared field to shared field.

The memory requirements well exceed 256MB in my case which I have my development box set to. One of my particular datasets can be up to thousands of rows with ~5 columns. Obviously, this is probably not the most common usecase. The inefficiency comes in that every time I load the node all that data is queried and stored in arrays regardless of whether I am using it (field suppress solves that, but in the process kills Views ability to display data). Views seems to do a much more efficient job with memory when it directly queries and displays all the data. My guess would be that it loops over the query data instead of core loading it all into memory first. Editing requires even more memory as it generates field items for all thousand+ rows at once (pagination is a must).

So I wasn't trying to say that Views query methods in D7 are inefficient, but instead that the end result of the whole process (ie. it sits in memory, probably even multiple copies during processing).

Thanks for pointing that out though as it is a very useful distinction.

What is considered a 'large' dataset?

I currently manage a complex drupal app that has about a dozen nodereference related content types with some having as many as a couple dozen fields (I was careful to set this up with d6 such that all the fields except the multi-value ones are stored with the node to avoid lots of joins).

Needless to say, I'm now afraid of the d6 -> d7 upgrade I'm facing down the road.

If the dataset is small enough to be loaded into memory in arrays all at once you are fine. It will end up using more memory then looping over the query result directly, but you should be fine. Some of my datasets had thousands of rows with ~5 columns or so which complete destroys memory.

Not sure where the line is drawn (I would guess you are fine), but as Moshe Weitzman pointed out you can fairly easily get a more powerful server if necessary.