Expecto Sanitas

This will be a somewhat shorter post than my previous rants, and hopefully set me down a path mixing relatively infrequent essay-ish posts with more frequent, but tighter focused post. These will generally be about something that has interested me or annoyed me enough to remark about it.

In this blog post, I'll be discussing SQL joins. Specifically, the habit of using a cross join and a where clause instead of an inner join and on clause, such as:

SELECT col1, col2 FROM table1, table2 WHERE table1.foo = table2.bar

instead of

SELECT col1, col2 FROM table1 JOIN table2 ON table1.foo = table2.bar

To make this more concrete, here's an example in the form of an online store.

Say we have two tables in our database, called users and transactions. The user table has a user's name, address, and a numerical id for that user; the transaction table has a numerical id for the transaction, the amount for the transaction, and the id of the user associated with the transaction. Let's say I want to get all the transaction ids associated with users in the zipcode 12345. I can write this query in at least two different ways:

SELECT transaction.id FROM transaction JOIN user ON transaction.user = user.id WHERE user.zipcode = 12345

SELECT transaction.id FROM transaction, user WHERE transaction.id = user.id AND user.zipcode = 12345

So far, so good. Both of these queries will return the same results, and within at least PostgreSQL 8.3.3 and MySQL 5.0.51, both have the exact same EXPLAIN plans.

Spacial Locality of Table References

Let's introduce two more tables: product, and product_transaction. The former table indicates products we have in stock; the latter table indicates which products were associated with a particular transaction. Now I want the number of each product a user in the zipcode mentioned above bought. Again, this can be done one of two different ways:

SELECT
user.id, product.sku, count(product.sku)
FROM user
JOIN transaction ON user.id = transaction.user
JOIN product_transaction ON transaction.id = product_transaction.transaction
JOIN product ON product.id = product_transaction.product
WHERE
user.zipcode = 12345
GROUP BY user.id, product.sku

SELECT
user.id, product.sku, count(product.sku)
FROM
user, transaction, product_transaction, product
WHERE
user.id = transaction.id AND
transaction.id = product_transaction.transaction AND
product_transaction.product = product.id AND
user.zipcode = 12345
GROUP BY user.id, product.sku

Both queries are still manageable, but there's a pathology that is starting to develop in latter query syntax; the conditions by which tables are joined are becoming more spatially segregated from the references to those tables. The associations are not made immediately obvious except by the names of the tables themselves. So lets remove the names:

SELECT
table1.id, table2.id, count(table2.id)
FROM
table1
JOIN table4 ON table1.id = table4.t1_id
JOIN table3 ON table4.id = table3.t3_id
JOIN table2 ON table3.t2_id = table2.id
WHERE
table1.zc = 12345

SELECT
table1.id, table2.id, count(table2.id)
FROM
table1, table2, table3, table4
WHERE
table1.id = table4.t1_id AND
table4.id = table3.t3_id AND
table3.t2_id = table2.id AND
table1.zc = 12345
GROUP BY user.id, product.sku

Which of these queries makes more immediately obvious the relationships between tables? If I were to ask you how table4 is related to table2, which query would you rather have to reference?

The former query specifies the join relationship as soon as each table is declared, allowing the user some indication on structure even if the tables are named poorly.

The latter separates the relationship out, sending the coder hunting through conditions in the WHERE clause to find the table name they are after. There is also no guarantee that the conditions in the WHERE clause will be in the same order as the table references in the FROM clause. This problem only becomes further exacerbated by additional table references and join conditions.

To slightly modify the words of Harold Abelson and Gerald Sussman, SQL queries must be written for people to read, and only incidentally for RBDMSs to execute. The latter set of queries break that philosophy by making the developer have to work hard to understand things, with no gain in query performance. The only benefit is less typing now, due to trading typing 'JOIN' and 'ON' for 'AND' and commas. Less typing is a weak reason and generally not accepted anywhere else as an excuse for making things cryptic, so it makes no more sense that it would acceptable here.

Connecting with your OUTER self

Seeing the results of our query above, the marketing department decides they're doing a great job, and decide to start packing up for an early departure to the local pub. Catching wind of this, you start feeling envious because you have a list of bugs to fix so long it causes your browser to start paging to disk. So, you decide to show them just how many people their advertising campaign isn't working on. If you're using the former query above, you're in good luck. The necessary changes are highlighted in bold:

SELECT
user.id product.sku, count(transaction.id)
FROM user LEFT OUTER JOIN transaction ON user.id = transaction.user
WHERE
user.zipcode = 12345
GROUP BY user.id
HAVING count(transaction.id) = 0

As for the latter, you have more work ahead of you:

SELECT
 user.id, transaction.id, count(transaction.id)
FROM
 user, transaction
 user LEFT OUTER JOIN transaction ON user.id = transaction.user

WHERE
 user.id = transaction.id AND
 user.zipcode = 12345
GROUP BY user.id
HAVING count(transaction.id) = 0

The use of an LEFT OUTER JOIN is necessary, as the use of a CROSS JOIN will not produce a table containing user rows with no associated transaction rows; a Cartesian product of anything against the empty set is the empty set. So you're stuck doing heavier surgery to the latter query, but at least you got to give your hands a break when you were first typing it. Those small victories can add up.

Performance Anxiety

The other question which came to mind while thinking about this was if doing the joins one way or the other affects performance in any way across most modern day databases. As I mentioned above, both MySQL and PostgreSQL will construct same query plan for either syntax, and I would be surprised if the parsing cost was of any significance either way. I dug into the documentation for both MSSQL and Oracle, and found that they, too, will construct exactly the same query plan either way. So from a performance perspective, the difference between these two is non-existent.

Conclusion

There is no difference between implicit WHERE-clause constrained cross joins, and their ON-clause constrained inner join cousins regarding performance. The only real trade-off is between greater readability and faster coding time, and as several authors far smarter and experienced that myself have stated, trading off readability for efficiency is almost always not worth the added pain later on. So, the next time you catch yourself joining tables with commas, and packing everything into your WHERE clause, I ask that you take a moment to consider readability and whether or not using explicit ON-clauses in an inner join is the better way to go.

Expecto Sanitas

Sunday, August 3, 2008

Implicit Cross Joins Considered Harmful

Spacial Locality of Table References

Connecting with your OUTER self

Performance Anxiety

Conclusion

Blog Archive

Contributors

Analytics