Archive for March, 2010

MySQL Indexing Speed

My friend Jay and I have been discussing the design of his web site, slamgreetings.com. During some of the discussions on the database design I found out that he’s using character strings as the index when relating one table to another. I’ve always been told that character strings don’t index nearly as fast as integer indexes. But I don’t know what that difference is. So I thought I’d run some tests and see what the differences are. Here are the results…

First, the database. You need to know what I’m looking at, so here’s what the table looks like:

CREATE TABLE indextest (
  id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  addr VARCHAR(64) NOT NULL,
  intindex BIGINT(20) UNSIGNED NOT NULL,
  chrindex CHAR(32) NOT NULL,

  PRIMARY KEY (id),
  UNIQUE addr (addr),
  UNIQUE intindex (intindex),
  UNIQUE chrindex (chrindex)
);

I then created about 200,000 records, filling in the addr column with random data that looks a little bit like an email address — 10 to 20 random letters and numbers, followed by an ‘@’ character, followed by 10 to 15 random letters and numbers and then a ‘.com’. The chrindex column is the MD5() value of the addr column and the intindex is set to the value of the id column — but more on that later.

Now that the contents of the database are set up, it’s time to run a few queries. I ran two queries on the table, using each of the different indexes. The queries look like this:

select sum(t1.id+t2.id+length(t2.addr)) FROM indextest t1 LEFT JOIN indextest t2 ON t1.intindex=t2.intindex;

select sum(t1.id+t2.id+length(t2.addr)) FROM indextest t1 LEFT JOIN indextest t2 ON t1.chrindex=t2.chrindex;

The idea with the queries is to force traversing the entire table and using the indexes to relate back to the same record, thus fully exercising the indexes so that they can be measured. And with 200,000 records, there is enough data in the tables to grind the hard drive for a while.

Execution times of the queries are as follows:

Engine = MyISAM Engine = InnoDB
int char int char
3.36 41.19 3.45 39.16
3.39 40.13 3.44 40.13
3.44 40.56 3.39 38.72
3.42 39.97 3.50 39.47
3.69 39.09 3.42 38.08
3.46 40.19 3.44 39.11

The times shown indicate the amount of time each of the queries take. I ran each query five times and then changed the table’s engine type and ran them again. I wanted to see if there were any improvements between the MyISAM and InnoDB table types. The numbers indicate a very slight improvement when using the Engine=InnoDB specification on the table.

So it’s not surprising to see that the integer based indexes are much faster than their character based counterparts. What did surprise me is that the difference in speed is almost a factor of ten, especially considering the sizes of the indexes are not that much different — 20 bytes for the integer based index and 32 bytes for the character based index.

The bigger surprise is when I changed the values in the integer index. (I said there’d be more on the integer indexes later.) I performed an update on the table, as follows:

update indextest set intindex=conv(substr(md5(addr),16,16), 16, 10);

This sets the value of the integer index to be the low-order 16 bytes of the character index, which was the MD5() value of the addr column. Fortunately, when I did this, I didn't run into any collisions since I specified the indexes are unique.

Running the same queries as before, but with different values in the intindex column, produced very different results. The query took 6 min 48.95 sec to execute. Why?

The first explanation I came come up with is the size of the data being indexed. With the first set of queries, only the low order 4 bytes of the intindex column were used, so the database didn't have to do a full 20 byte compare for each and every test. The second test, with the larger numbers used in the intindex column, forced the database to do very long compares. Then again, all the index values are integers, so the tests should be comparing the entire integer, not just the low order 4 bytes.

I think a better explanation is the ordering of the data via the indexes. In the first set of tests, the order of the records in the table was roughly equivalent to the ordering of the records via the index. So traversing the records was pretty much just a matter of reading straight through the data file. The second set of tests performed the searches in a very different index order, probably forcing a very disrupted read through the data file -- as well as traversing the indexes. Having to skip around within the file 200,000 times could very well explain the difference in speed. I suppose I could recreate the table using the intindex ordering and run the tests again. But I got my main answer: integer indexes are much faster.

I ran these tests on a WinXP machine, so the OS is running at 32 bits. There may be some differences using a 64 bit OS and a 64 bit install of MySQL.

performance, SQL

No Comments