Patrick Galbraith (capttofu) wrote,
  • Location: Sharon, NH
  • Mood: busy
  • Music: Norah Jones, B.B. King

Hot cache data, sharding

In the last several months at Grazr, we've been wrestling with a large database (running on MySQL) of feeds and feed items. The schema is essentially a feeds table with child tables items, items_text (the text content), and enclosures. This database gives users the means to merge feeds into a Stream: an aggregate feed containing the items of whatever feeds are in the merge list. It works great; the only problem is the volume of data, because the more data there is, the slower the query that produces the merge becomes. We want this merge to run on the fly, and if it's too slow, the user experience is unacceptable.
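For concreteness, here's a stripped-down sketch of the schema and the merge query. Column names and types are simplified for this post, not our actual DDL:

CREATE TABLE feeds (
  feed_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  url     VARCHAR(255) NOT NULL,
  title   VARCHAR(255)
) ENGINE=InnoDB;

CREATE TABLE items (
  item_id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  feed_id      INT UNSIGNED NOT NULL,
  title        VARCHAR(255),
  published_at DATETIME,
  INDEX (feed_id, published_at)  -- the merge scans by feed and date
) ENGINE=InnoDB;

CREATE TABLE items_text (
  item_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,  -- one-to-one with items
  body    TEXT
) ENGINE=InnoDB;

CREATE TABLE enclosures (
  enclosure_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  item_id      BIGINT UNSIGNED NOT NULL,
  url          VARCHAR(255),
  INDEX (item_id)
) ENGINE=InnoDB;

-- the merge for a Stream of feeds 3, 17, and 42: newest items first
SELECT i.item_id, i.feed_id, i.title, i.published_at
FROM items i
WHERE i.feed_id IN (3, 17, 42)
ORDER BY i.published_at DESC
LIMIT 50;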

So now I'm in the process of implementing a "hot cache" of feeds with an LRU (Least Recently Used) eviction policy. The idea is that the cache provides a much smaller data set to run the merge query against. We need to be able to store far more data than we currently do while still allowing fast merges.
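The catalog and eviction side is nothing exotic; roughly something like this, where items_cache stands in for the hot cache table and the names are again simplified:

CREATE TABLE cache_catalog (
  feed_id   INT UNSIGNED NOT NULL PRIMARY KEY,
  status    ENUM('pending','loaded') NOT NULL DEFAULT 'pending',
  last_used DATETIME NOT NULL,
  INDEX (last_used)
) ENGINE=InnoDB;

-- the LRU part: a maintenance pass evicts the least recently used
-- feeds when the cache grows too big
DELETE ic
FROM items_cache ic
JOIN (SELECT feed_id
      FROM cache_catalog
      ORDER BY last_used
      LIMIT 10) AS lru ON ic.feed_id = lru.feed_id;

DELETE FROM cache_catalog ORDER BY last_used LIMIT 10;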

The "hot cache" will basically work like this: If a merge request is made, it first checks a catalog table which indicates whether the merge's feeds are contained in the cache. If they aren't, an entry is made in this catalog table, triggering a UDF which calls a daemon which in turn calls a perl script to copy all the necessary data-- the items for the feeds that are part of the merge. Since the request for the merge must occur quickly as possible, the process cannot wait for the triggered script to complete, so the merge is performed against the full data set. The script that copies data to the cache, once completed, marks the entry in the catalog as being completed, and next time the merge request comes through, it uses the hot cache table instead.

The consistency of the cache is maintained by triggers on the source tables, which ensure that when rows in the original data set are inserted, updated, or deleted, the same changes are made in the cache.
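Each source table gets INSERT, UPDATE, and DELETE triggers; the insert case looks something like this sketch:

DELIMITER //
CREATE TRIGGER items_cache_sync
AFTER INSERT ON items
FOR EACH ROW
BEGIN
  -- only mirror the row if the feed is actually cached
  IF EXISTS (SELECT 1 FROM cache_catalog
             WHERE feed_id = NEW.feed_id
               AND status = 'loaded') THEN
    INSERT INTO items_cache (item_id, feed_id, title, published_at)
    VALUES (NEW.item_id, NEW.feed_id, NEW.title, NEW.published_at);
  END IF;
END//
DELIMITER ;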

Currently, there is only one hot cache table. To allow for growth, we'll eventually need multiple cache tables, and this is the tricky part. I would like to keep using triggers, but I'm still trying to figure out how to use dynamic SQL in a trigger. I would also want a column in the catalog table that says which cache a feed's data can be found in; for a trigger to update the caches, the table name has to be a variable inside the trigger. The only way I've found to do this is to build the statement with CONCAT and run it as a prepared statement.
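Something like this is what I'm after, with cache_name being the new catalog column (this would replace the single-cache trigger above):

DELIMITER //
CREATE TRIGGER items_cache_sync_multi
AFTER INSERT ON items
FOR EACH ROW
BEGIN
  DECLARE cache_table VARCHAR(64);

  -- which cache table holds this feed?
  SELECT cache_name INTO cache_table
  FROM cache_catalog
  WHERE feed_id = NEW.feed_id;

  -- build the INSERT against that table and run it
  SET @query = CONCAT('INSERT INTO ', cache_table,
                      ' (item_id, feed_id) VALUES (',
                      NEW.item_id, ', ', NEW.feed_id, ')');
  PREPARE stmt FROM @query;
  EXECUTE stmt;
  DEALLOCATE PREPARE stmt;
END//
DELIMITER ;

But unfortunately, trying to create that trigger, one gets: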


ERROR 1336 (0A000): Dynamic SQL is not allowed in stored function or trigger


Argh.

So I can always fall back on using a triggered UDF to call the perl script, the same mechanism that does the initial cache load.
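That is: have the trigger shell out via a UDF and let the perl script, which can interpolate table names to its heart's content, do the cache writes. A rough sketch, with sys_exec() standing in for whatever UDF does the signalling (ours actually talks to a daemon rather than exec'ing directly, and the script path here is made up):

DELIMITER //
CREATE TRIGGER catalog_notify
AFTER INSERT ON cache_catalog
FOR EACH ROW
BEGIN
  -- fire and forget; the merge query never waits on this
  SET @rc = sys_exec(CONCAT('/usr/local/bin/cache_feed.pl ',
                            NEW.feed_id, ' &'));
END//
DELIMITER ;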

Why am I posting this? I'm just musing, and wondering what other developers reading this have done to store a lot of data while still keeping it fast to query. How to shard. I have several ideas, and each needs to be tested.
Tags: lots of data, mysql, schema design, sharding