Patrick Galbraith (capttofu) wrote,
  • Mood: cheerful
  • Music: The Vault

My long-promised Sphinx post (part 1)

I gave a webinar several weeks ago about Grazr's infrastructure and the lessons we learned about scaling and over-building. One thing that drew a lot of interest was how we used Sphinx not only to improve our search, but to relieve the database of FULLTEXT indices, which were a performance issue for us.

The purpose of these posts is to give you an idea of how Sphinx works, what limitations it has, and how you can use it both for search functionality and for freeing your database from having to use FULLTEXT. I think Sphinx is a great project, and I want to share information that can help promote its use.

There is a lot of information on Sphinx to share, so I'm going to split this across multiple posts. I'm already working on a book, and the Sphinx material could be a small book of its own!

Grazr overview, from FULLTEXT to Sphinx

Grazr is a company that derives its searchable content from feeds, storing the items of those feeds in a schema that is essentially a parent "feeds" table with a 1:N child table of feed items. The item tables are 1:1, divided between metadata ("items") and feed content, text/blob ("items_text"). There is also an additional 1:N child table of items, "enclosures", for any enclosures in an item, of which there can be more than one. Grazr is a site that requires being able to create an aggregate feed of items from multiple feeds of the user's choice, known as a "merge" or "stream", as well as being able to apply search terms to the query that builds this aggregate feed.

To accomplish this, we created a FULLTEXT index on the items_text table's content column, and we also moved two other columns we wanted to be able to search on from the main items table into items_text, along with sorting on a date column.

This worked OK at first, but as the data grew, the items_text table became extremely slow, which we assumed was just the table itself. We tried numerous optimizations that we thought would speed up inserts and updates. Some of them were:

* A smaller subset of items/items_text for the most common feeds

* Changing single inserts to bulk inserts. This required not relying on auto_increment, pre-assigning primary key values to items being inserted using a sequence table.

* Getting rid of updates: instead of updating the items that had changed, we deleted all of the affected items at once and then reinserted them. The effect was that we did everything in bulk, either delete or insert.

* Making sure we used every possible index and using the most optimal query possible.

* Trying every trick in the book to get the item date sort to avoid a filesort. This proved impossible because the query used a join; we tried everything to get around this, to no avail.
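
As a hedged sketch of the sequence-table trick mentioned in the bulk-insert point above (the table and column names here are hypothetical, not Grazr's actual schema), pre-assigning primary keys in MySQL can look like this:

```sql
-- Hypothetical sequence table holding the highest id handed out so far
CREATE TABLE items_seq (
  id BIGINT UNSIGNED NOT NULL
);
INSERT INTO items_seq VALUES (0);

-- Atomically reserve a block of 3 ids; LAST_INSERT_ID(expr) makes the
-- new value retrievable on this connection without a race
UPDATE items_seq SET id = LAST_INSERT_ID(id + 3);
SELECT LAST_INSERT_ID();  -- upper end of the reserved block

-- The application assigns ids 1..3 from the block and does one bulk
-- insert, with no reliance on auto_increment
INSERT INTO items (id, feed_id, title) VALUES
  (1, 10, 'first'), (2, 10, 'second'), (3, 11, 'third');
```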

In April, we attended the MySQL Users Conference, where we heard more about Sphinx from Andrew Aksyonoff. It sounded intriguing in that it integrated with MySQL and was very fast at searching and sorting. I also liked the idea that it had a storage engine.

A month or two after the conference, and after more optimization tweaks that gave us some improvement, we embarked on another project that was very search-specific, and we realized we needed a better search engine for it. We were looking at Lucene and were going to set up Solr, when one day I decided to see how to set up Sphinx. It was much less work to get working than setting up Tomcat et al. Plus, we're a Perl shop, and aren't Java experts. Some initial testing with Sphinx, using the Perl module Sphinx::Search, was very encouraging.

In the past three months, our new project, Vibemetrix, which is essentially a blog search engine, as well as Grazr, has moved to Sphinx.

For us, switching from FULLTEXT to Sphinx gave us the biggest performance improvement yet! It allowed us to:

* Switch to InnoDB
* Use Sphinx for sorting, getting rid of the ORDER BY

The merge query I used to perform, which used the top-level feeds table's primary key for the join, could now be replaced: I simply search Sphinx on this key, get back the items table's primary keys, and query on those, without an ORDER BY. This made the query that performed the "merge" 10x faster. The query that used FULLTEXT could be removed entirely. Replication lags also became far less frequent: prior to removing FULLTEXT, a 300-second lag would take 15+ minutes to catch up; afterward, a 300-second lag would catch up in a minute!

Sphinx simply provides the primary keys for whatever items we need, which we in turn fetch from MySQL, making for a fast, primary-key query to get the data for search results.
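
For example (the document ids here are made up), a search that returns three matching ids from Sphinx becomes a simple primary-key lookup in MySQL:

```sql
-- Suppose Sphinx matched document ids 101, 205 and 309 for a search;
-- fetching the actual rows is a fast primary-key query, with no
-- FULLTEXT and no ORDER BY (Sphinx already sorted the ids)
SELECT id, title, author
FROM items
WHERE id IN (101, 205, 309);
```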

Some things about Sphinx:

* Very fast
* SQL-like syntax, similar in ways to the FULLTEXT syntax
* The fulltext index is not updatable, but attributes are. Fear not - there is a wonderful mechanism for making it essentially "updatable"; more details to follow.
* Supports distributed/clustering of indices served through "distributed" indices, across multiple servers
* Very fast index building
* Integrates nicely with MySQL
* Very helpful team that maintains Sphinx, even at crazy hours in the night!

There are more features that I'm probably leaving out here, but these are the ones I'm knowledgeable about. I still have much to learn about Sphinx.

Sphinx Concepts

A Sphinx fulltext index is not updatable, but its attributes are. You can, for instance, update any attribute (think of them as identifier keys for what you index) to a new value. Sphinx also supports a "distributed" index. A distributed index is just a set of outgoing links that the master (aggregator) searchd will follow -- one-way references from the distributed index to physical indexes on other nodes -- acting as a single index while under the hood being an aggregate of multiple indices.

This image gives a simple picture of a distributed index that covers four separate indexes, two each on two separate servers.

[image: distributed index]

This gives you the ability to shard your indexes the way you shard your data in the database. You can have multiple parts that are served by one virtual index.

For update functionality, you can have a main, large index that isn't rebuilt as often, and a smaller delta index that you rebuild regularly, since Sphinx is very fast at building indices. At certain intervals, you merge the delta back into the main index and build the delta from a new starting point, always keeping the delta small.

To be able to have this all work properly, you would have a "counter" database table.

This "counter" table keeps track of the maximum id, or whatever kind of "line" you draw that marks the start and stop of the two indices. If using an id, the maximum value of the primary key of the table the main index was built from, at the time it was last built, is the "end" point of the main index and the beginning point of the delta index. When the delta index is built, its data query selects all the data greater than that maximum id, which keeps the delta index small. When the indices are merged, the main index's counter is reset to the maximum id at the time of the merge, so the delta index now starts from a greater point.

It's also helpful, though not needed, to keep track of the maximum id of your delta index, which too is the maximum id of the data source when the delta index was last built. This gives you the ability to see how current your indices are, and have some "inventory" of what records in the database the index covers in terms of range.
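
Assuming the counter table described above (its schema is shown later in this post) and the "indexname:host" naming used there, the bookkeeping around a rebuild and merge might be sketched like this; the merge itself is done with Sphinx's indexer, driven by an external script:

```sql
-- Each time the delta index is rebuilt, record how far it reached
UPDATE sph_counter
SET max_id = (SELECT MAX(id) FROM items)
WHERE index_name = 'items_delta_idx:host';

-- After the delta is merged into the main index, advance the main
-- index's end point to the delta's, so the next delta rebuild starts
-- from this higher id
UPDATE sph_counter AS main
JOIN (SELECT max_id FROM sph_counter
      WHERE index_name = 'items_delta_idx:host') AS delta
SET main.max_id = delta.max_id
WHERE main.index_name = 'items_idx:host';
```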

Sphinx components

Sphinx comprises several components:

* searchd -- a server that processes queries either locally or remotely

* indexer -- the program that builds indices

* search -- a simple program that searches against an index directly; it does not work with distributed indexes, since searchd is what makes those work.

* API -- sphinx has an API that allows for various methods to search Sphinx

* Sphinx Storage Engine -- a MySQL storage engine that allows tables to be created that are essentially virtual client tables to searchd, allowing Sphinx query results to be processed as table row result sets -- think of joining a "real" table to a Sphinx table, without having to implement "glue" application code or a separate database query for the Sphinx results. Another benefit of the Sphinx storage engine is that an application written in a language that can't talk to Sphinx, but can talk to MySQL, will be able to use Sphinx.
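
As a sketch of what using the storage engine looks like (the index name and port match the configuration later in this post; the column layout -- id, weight, query -- is what SphinxSE requires):

```sql
-- Virtual client table: SphinxSE sends the 'query' column's value to
-- searchd and fills in matching document ids and weights
CREATE TABLE items_search (
  id     BIGINT UNSIGNED NOT NULL,
  weight INT NOT NULL,
  query  VARCHAR(3072) NOT NULL,
  INDEX(query)
) ENGINE=SPHINX CONNECTION="sphinx://localhost:3312/items_idx";

-- Join Sphinx results straight to the real table in a single query
SELECT i.id, i.title
FROM items_search AS s
JOIN items AS i ON i.id = s.id
WHERE s.query = 'some search terms;mode=any';
```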


Sphinx has a configuration file, sphinx.conf, that controls how Sphinx works. It is organized into several main sections, which have an inheritance model. These sections are:

* data source definitions -- this is a data source, either database or xml,
that Sphinx uses to obtain data in order to build indices
* index definitions -- defines index values such as the data source and
path of an index
* indexer -- defines settings for the indexer
* searchd -- defines settings for the searchd daemon

Data Source Definitions

A source definition defines the source Sphinx uses to build its indices. It can be a database or an XML pipe.

The best way to explain the source definition part of sphinx.conf is with an example, annotated with comments explaining each part.

source items
  # data source type
  # known types are 'mysql', 'pgsql', 'xmlpipe', 'xmlpipe2'
  type      = mysql

  # these are db connection parameters
  sql_host  = localhost
  sql_user  = username
  sql_pass  = password
  sql_db    = feeddb
  sql_port  = 3306

  # range queries are to reduce the size of the result set
  # being returned as you are indexing. This can be particularly
  # helpful with MyISAM to avoid long locks
  # this particular query obtains the $start and $end values
  # that are used in the main data query, stepped by the next parameter,
  # sql_range_step
  sql_query_range  = SELECT MIN(id),MAX(id) FROM v_items

  # the size of each step through the result set. So, if the complete
  # number of items is 1M, then there will be 10 data-obtaining queries
  # until all items have been indexed
  sql_range_step                = 100000

  # a query to run prior to the main data query, to set the
  # counter table position. In this case, the position in the counter
  # table is set to the max id of the items table, for the index named
  # 'items_idx:host'. This is a way to have the counter table provide
  # accounting not only for the local index, but for indices on
  # other servers running Sphinx.
  sql_query_pre = UPDATE sph_counter SET max_id = (SELECT MAX(id) FROM items) WHERE index_name = 'items_idx:host'

  # main data query. These are the columns that get indexed. The first
  # column must be the primary key or a unique key
  sql_query = SELECT id, item_date, title, content, author, concat('feed_id:',feed_id) AS feed_id FROM v_items WHERE id >= $start AND id <= $end

  # declares item_date as a timestamp attribute, which lets Sphinx
  # sort and group by date
  sql_attr_timestamp      = item_date

  # other types
  # sql_attr_float - float, double
  # sql_attr_str2ordinal - string in from source, 32-bit uint number out
  # sql_attr_multi - list of 32-bit uints
  # sql_attr_bigint - in version 0.9.9

  # This is the query used for debugging with the 'search' utility,
  # which uses it to fetch the matching rows from the database
  sql_query_info = SELECT id, concat('feed_id:',feed_id) as feed_id, link,
  title, content, author FROM v_items WHERE id=$id;

  # there are other options that aren't listed here for brevity.

# "items_delta" inherits all settings from its parent, "items",
# unless specified/overridden here.
# The format is "child : parent"
source items_delta : items
  # only two parameters are overridden here. First, sql_query_pre, to
  # update the counter table for the delta index entry, and second the
  # range query, which obtains the range of ids greater than the max
  # item id of the main index
  sql_query_pre = UPDATE sph_counter SET max_id = (SELECT MAX(id) FROM items) WHERE index_name = 'items_delta_idx:host'
  sql_query_range = SELECT MIN(id),MAX(id) FROM v_items WHERE id > (SELECT max_id FROM sph_counter WHERE index_name = 'items_idx:host')

Index definitions

The index definitions define indices Sphinx's indexer builds. These too have an inheritance model.

index items_idx
  source       = items   # the "items" source defined above

  # path of the index file
  path         = /usr/local/sphinx/var/data/items_idx

  # yes, unlike FULLTEXT's default minimum word length of 4, you can
  # index one-character words!
  min_word_len    = 1

  # we used utf8
  charset_type    = utf-8

  # strip out HTML as the index is built
  html_strip      = 1

# this child index, we only override two parameters
index items_stemmed_idx : items_idx
  path        = /usr/local/sphinx/var/data/items_stemmed_idx
  # this creates an index with stemming
  morphology  = stem_en


# now for the delta index
index items_delta_idx : items_idx
  # we override the data source to get the proper range of data
  source  = items_delta
  path    = /usr/local/sphinx/var/data/items_delta_idx

This is a distributed index, allowing you to specify local and remote agents:

index items_dist : items_idx
  # required
  type      = distributed

  # All indices specified as "local" are searched serially, but in
  # parallel with remote agents.
  # "local" means one more index to search serially within this single
  # master process. The next line, commented out:
  # local   = items_idx,items_delta_idx
  # is almost identical to
  agent   = localhost:3312:items_idx,items_delta_idx
  # except that the "agent = localhost..." option forks an
  # additional child

  # "agent" means +1 remote agent to contact wherever it is
  # and gives better performance for searches in parallel,
  # in this example, the index list is grouped together because
  # items_delta_idx is a small index compared to items_idx, and there
  # wouldn't be much benefit to its having its own agent
  agent   = hostB:3312:items_idx,items_delta_idx

  # NOTES from Andrew:
  # The master process sends out all queries, then does all the local
  # stuff if any, then waits for all the remote replies.
  # Hence, 1 fork + serial local search VS 2 forks + just wait


This is not an index setup used at Grazr, but it is shown here to give an example of how you can have multiple index parts. Each of these indexes would have a different data source, which in turn could return a different part of the total data -- split however you like, by ranges or by different tables (though this requires unique IDs across all data sources).

index items_multiparts : items_idx
  # Unlike the delta index setup: if you have multiple large
  # indexes, you would want to list them on separate lines.
  agent   = hostC:3312:items_part1
  agent   = hostC:3312:items_part2
  agent   = hostC:3312:items_part3
  agent   = hostD:3312:items_part4
  agent   = hostD:3312:items_part5
  agent   = hostE:3312:items_part6

The other two sections, for the indexer and searchd, are pretty self-explanatory and aren't listed here. The sample sphinx.conf that comes with the distribution is a good starting point. The point here was mainly to show the data source definitions and the index definitions, as well as a distributed index.

Counter/Position table

The counter table is the other bit worth explaining. It provides a way to record the maximum primary key value of your data per index_name; I ended up using "indexname:host" as the index name in this table. You could simply use counter_id, but I prefer having a canonical name to refer to.

CREATE TABLE sph_counter (
  counter_id int(11) NOT NULL,
  max_id bigint(20) unsigned NOT NULL,
  index_name varchar(32) NOT NULL default '',
  last_updated timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  PRIMARY KEY (counter_id),
  KEY index_name (index_name)
);

The data for a working setup would look like:

mysql> SELECT max_id, index_name, last_updated FROM sph_counter;
+-----------+-----------------------+---------------------+
| max_id    | index_name            | last_updated        |
+-----------+-----------------------+---------------------+
| 135414507 | items_idx:hostA       | 2008-09-25 03:00:01 |
| 135565199 | items_delta_idx:hostA | 2008-09-25 13:05:01 |
| 135301579 | items_idx:hostB       | 2008-09-24 23:07:35 |
| 135563903 | items_delta_idx:hostB | 2008-09-25 13:02:25 |
+-----------+-----------------------+---------------------+

This is a simplified view of the indices being tracked, but you'll see that each main index has the lower id -- the maximum id as of the last time it was merged. For hostA, that was at 3 AM. This means that hostA's items_delta_idx covers ids 135414507 through 135565199.

For hostB, the last merge was at 23:07 the previous day. The items_delta_idx on hostB covers ids 135301579 through 135563903.

We use a Perl script to ensure the counter table is updated when the merge occurs.

This part 1 of my multi-part series on Sphinx should have given you a good overview of what Sphinx is, as well as of the configuration file, sphinx.conf, to further drill home how Sphinx works. The next parts will cover how to install Sphinx, how to use Sphinx with the Sakila test database, an overview of Sphinx query syntax, and then some code examples for searching Sphinx.

I hope this information is useful to you ;)
Tags: fulltext indexes, mysql, search functionality, sphinx