Patrick Galbraith ([info]capttofu) wrote,
@ 2008-06-02 13:01:00
Previous Entry  Add to memories!  Tell a Friend  Next Entry
Deleting mismatches - which is more optimal?
I have tables that should be 1:1 relations but occasionally, they become mismatched, one table having more records than the other. I have been wondering which is the most optimal way to delete things like this.

I have a delete using a join (item_id is primary key):

mysql> delete from items_keys using items_keys left join items using (item_id) where items.item_id is NULL;

vs. using a subquery:

mysql> delete from items_keys where item_id not in (select item_id from items);

Sometimes, the 2nd seems faster, but using a join seems more orthodox to me. I'm curious what people think about this.

If I make these into select queries (instead of delete) and run explain, I get:

mysql> explain select count(*) from items_keys left join items using (item_id) where items.item_id is NULL;+----+-------------+------------+--------+---------------+---------+---------+--------------------------+-------+--------------------------------------+
| id | select_type | table      | type   | possible_keys | key     | key_len | ref                      | rows  | Extra                                |
+----+-------------+------------+--------+---------------+---------+---------+--------------------------+-------+--------------------------------------+
|  1 | SIMPLE      | items_keys | index  | NULL          | PRIMARY | 8       | NULL                     | 73915 | Using index                          | 
|  1 | SIMPLE      | items      | eq_ref | PRIMARY       | PRIMARY | 8       | grazr.items_keys.item_id |     1 | Using where; Using index; Not exists | 
+----+-------------+------------+--------+---------------+---------+---------+--------------------------+-------+--------------------------------------+
2 rows in set (0.03 sec)

mysql> explain select count(*) from items_keys where item_id not in (select item_id from items);
+----+--------------------+------------+-----------------+---------------+---------+---------+------+-------+--------------------------+
| id | select_type        | table      | type            | possible_keys | key     | key_len | ref  | rows  | Extra                    |
+----+--------------------+------------+-----------------+---------------+---------+---------+------+-------+--------------------------+
|  1 | PRIMARY            | items_keys | index           | NULL          | PRIMARY | 8       | NULL | 73918 | Using where; Using index | 
|  2 | DEPENDENT SUBQUERY | items      | unique_subquery | PRIMARY       | PRIMARY | 8       | func |     1 | Using index              | 
+----+--------------------+------------+-----------------+---------------+---------+---------+------+-------+--------------------------+
2 rows in set (0.01 sec)

'rows' is about the same. Index usage appears similar. Though using a subquery seems like it could be more expensive. Maybe this is a simple question I'm asking.


(15 comments) - (Post a new comment)

Not Exists
[info]kjordan2001
2008-06-02 06:21 pm UTC (link)
You can also try:
explain select count(*) from items_keys where not exists (select items.item_id from items WHERE items.item_id=items_keys.item_id);

Although from the looks of the explain, that might end up similar to the left join anyways, although I've always found left joins to be slow on large tables. And I don't think explain will reveal any magic about which is faster since the big one that explain will show anyways is the primary table.

(Reply to this) (Thread)

Re: Not Exists
[info]capttofu
2008-06-03 12:50 am UTC (link)
Thanks for the tip! I always like new ways of doing things.

(Reply to this) (Parent)(Thread)

(Reply from suspended user)

(Reply from suspended user)
What does "Seems" mean?
[info]billkarwin.openid.org
2008-06-02 06:57 pm UTC (link)
Educated guesses have their place, but empirical evidence should have greater weight. You've already used EXPLAIN to see what optimization plan the two queries have, and you've shown that they are equivalent, given your data and metadata in this case.

(Reply to this) (Thread)

Re: What does "Seems" mean?
[info]capttofu
2008-06-03 12:56 am UTC (link)
Reality and empirical evidence don't always correspond. "Seems" is just exactly what I meant it to be - of all the times when I've run each of these in different environments, diff the 2nd one "seems" faster and executes faster. That in itself isn't evidence because of things like query cache, file system caching, etc. It's just like developing something, you test the hell out of it, get numbers that show you what you should use, then when you make it live on a web site it performs completely different than what your tests showed.

(Reply to this) (Parent)(Thread)

(Reply from suspended user)

(Reply from suspended user)
This this really a 1:1 relationship?
(Anonymous)
2008-06-02 08:28 pm UTC (link)
If this should be a 1:1 relationship why not combine the two tables into one? Then, if the two sets of data elements get out of sync, it is a simple table scan to delete the rows where one set of elements exists for an item but not the other.

My concern is you are effectively joining all the rows from one table against all the rows in the other table. This will work for small data volumes, but for larger data volumes it won't scale. If the data size remains small, perhaps performance isn't a problem, but there appears to be a reason you are concerned about performance.

(Reply to this)

What is causing the multile rows
(Anonymous)
2008-06-03 12:49 am UTC (link)
Hi Patrick,

"I have tables that should be 1:1 relations but occasionally, they become mismatched, one table having more records than the other."

mmm, your query removes those rows for which there is *no corresponding row*.y Your description is broader than that. Can yu clarify your meaning? TY.

I think you should try and look for what is causing the unexpexted rowa...that seems like the obvious approach...

"Sometimes, the 2nd seems faster, but using a join seems more orthodox to me. I'm curious what people think about this. "

Maybe you don't agree, but my answer is: if this were a SELECT, then yes, I agree...if it is a DELETE or even UPDATE, then...Hell no..

(this is just my syntactical tic of me - ignore please if you don't like it.)

kind regards,
roland

(Reply to this) (Thread)

Re: What is causing the multile rows
[info]capttofu
2008-06-03 01:46 am UTC (link)
Because I'm trying as much as possible to limit the main table to only the columns (fast meta-data lookup) that I need in a big query that I run against it. Also, I have another table that too is a 1:1 table that contains blobs that I also keep separate - something you definitely want to do.

I tend to agree with what you say about joins used in delete.

(Reply to this) (Parent)(Thread)

Re: What is causing the multile rows
[info]awfief
2008-06-03 03:00 am UTC (link)
The subquery is the more mathematical way of doing it -- and by mathematical I mean "pure logic" -- ie, in 1*(2+3) we want (2+3) evaluated first. But since that's not how subqueries work in MySQL, your hunch of the JOIN being better is spot on.

What worries me is you seem to be saying "every time I run this...." and I'm wondering why you need to run this more than once....after the first time, I'd put a UNIQUE constraint on something to ensure the 1:1 relationship....

(Reply to this) (Parent)(Thread)

Re: What is causing the multile rows
[info]capttofu
2008-06-03 04:10 pm UTC (link)
It is 1:1 - what I mean by not being 1:1 is that during initial insertion I insert into the lookup table first, then the blob table, multiple records with a bulk insert. Once in a great while, one of those queries fails, hence all the records fail to be inserted into the other table. Also, I pre-generate the primary key with a sequence table and build the bulk insert (not relying on auto-increment), so I certainly always have 1:1 enforced, just that sometimes there are "holes", so to clean it up so that there aren't any records that aren't matched in the other table, I'll perform the query to delete those records without a corresponding record in the other table.

(Reply to this) (Parent)(Thread)

Re: What is causing the multile rows
[info]awfief
2008-06-03 07:34 pm UTC (link)
So the point is, how can you ensure referential integrity? Can you make a TRIGGER or use a transaction?

(Reply to this) (Parent)(Thread)

Re: What is causing the multile rows
[info]capttofu
2008-06-03 09:02 pm UTC (link)
Transaction. I need to wrap these multiple bulk-inserts into a transaction, which will solve this. I originally had this coded for MyISAM, but changed the tables to InnoDB with the intent that I would make those inserts transactional - now is the time to make good on that plan.

Thanks for reminding me ;)

Question for you: Have you had any issues with huge bulk inserts and transactions - particularly with transaction logs (sizes)?

(Reply to this) (Parent)


(15 comments) - (Post a new comment)

Create an Account
Forgot your login or password?
Login w/ OpenID
English • Español • Deutsch • Русский…