A myriad of features may prompt the need to aggregate your data, like showing an average score based on multiple values, or even simply showing the amount of entries that abide to a certain condition. Usually this is a trivial query, but this is often untrue when dealing with a huge dataset.

What’s the problem with a large dataset?

In moderately sized datasets, you could just construct a query using MySQL’s aggregate functions, like:

SELECT COUNT(*)
FROM products
WHERE
    category = 'accessories' AND
    color = 'green'

Obviously, the above query would return all entries that, for columns category and color, have the values ‘accessories’ and ‘green’. It would do so by looping all entries in this table, comparing the values for category and color to those respective values.

In a table with only a few hundred entries, this is easy. You can imagine that once the amount of entries in a table grows to millions, looping all entries is not a terribly bright idea, as it will take MySQL quite some time and effort to calculate that result. Once once the rate of requests outgrows the rate MySQL is able to respond to those requests, you’re done.

One solution could be to add accurate indexes. MySQL can then promptly return the requested amount without having to put much effort into it, since those indexes would save it from having to loop all entries. If the conditions grow complex, however, you may find it gets increasingly less sane, if at all possible, to attempt to fix the problem by applying an index to the conditional columns.

Rollup tables

Rollup tables are tables where totals for (combinations of) conditions are saved. A “summary table”, that holds pre-aggregated values for all conditions you may need to fetch totals for.

If your data table has millions of entries, you can imagine it makes more sense directing your queries against a very small rollup table, than at the huge primary dataset.

An example rollup table could look like this:

category	color	total
shirts	red	23578347
shirts	green	14364323
shirts	blue	46723343
accessories	red	3452465
accessories	green	867665
accessories	blue	7609852
pants	red	56878766
pants	green	87067876
pants	blue	759457363

Assuming you accurately keep these totals in sync with the atomic source data in products, you’d be much better of executing this query:

SELECT total
FROM products_rollup
WHERE
    category = 'accessories' AND
    color = 'green'

Keeping rollup data in sync

You’ll always want the rollup table’s totals to accurately reflect the data in the source data’s table, and keeping them in sync is definitely the biggest challenge. For every new insert, update or delete in the primary table, an update to several rows in the rollup table may be needed. To achieve this, you’ll need to add some additional logic to your application.

Full recalc

In it’s most basic form, you could recalculate all rollup values immediately after updating data in the source table. What you’d do is query the source table (in our example, products) and GROUP BY the rollup columns. MySQL will loop all source records and respond with up-to-date rollup values, which you can immediately write to your rollup table:

REPLACE INTO products_rollup
SELECT category, color, COUNT(*)
FROM products
GROUP BY category, color

This is the easiest approach to keep your data in sync where, no matter what you change to your source table’s data, all rollup data will accurately be updated.

The downside, however, is that you still perform this rather expensive query, that loops all entries in the source table. All reads will target the rollup table, but now every write will result in this hefty query. If you’re in a write-heavy environment, this too may eventually lead to trouble.

Per-entry update

Letting MySQL recalculate all rollup values is needlessly intensive. If we know exactly what changes, we can simply update only those specific rollup values.

Imagine that we add a new red accessory to our products table: we don’t need to recalculate all values, we can just increase the rollup value for accessories/red with 1. All other rollup values can remain unchanged.

Although it’s slightly more work to implement than the full recalc, this too can be done generic. For every single update to the source table, the relevant changes to the rollup table can be deduced. All we need to do is perform 1 read query to the source table before updating the data, and 1 after. Then we know the original and updated values, and know exactly which rollup values to update.

For starters, we would query the source table this entry’s columns relevant to the rollup table: category and color:

SELECT category, color
FROM products
WHERE id = 1

This could, for example, return:

category	color
shirts	green

After this, we update an entry, e.g.:

UPDATE products
SET
    category = 'shirts',
    color = 'blue'
WHERE id = 1

And now, again, we issue another read-query for this entry, to identify which to the rollup table relevant values have changed:

SELECT category, color
FROM products
WHERE id = 1

This would now return:

category	color
shirts	blue

Now we know exactly what rollups should change. The entry was originally shirts/green. It is not anymore, so we should deduct this one entry from that total for shirts/green. Since the entry is now shirts/blue, we should increase that rollup:

UPDATE products_rollup
SET value = value - 1
WHERE category = 'shirts' AND color = 'green'

UPDATE products_rollup
SET value = value + 1
WHERE category = 'shirts' AND color = 'blue'

You’ll probably have noticed that this approach results in a couple of additional queries, none of which is expensive to execute, though. The 2 read queries can efficiently use the (primary key) index on the source table’s id column, and both update queries are better than replacing all values in the rollup table.

Matthias Mullie

Using rollup tables to aggregate data in large datasets

What’s the problem with a large dataset?

Rollup tables

Keeping rollup data in sync

Full recalc

Per-entry update

Projects

Scrapbook: cache environment

Post to email

Minify

Top Posts

Measuring software complexity

Geographic searches within a certain distance

Regular expressions for pros