Performance,  SQL party

Merge join Vs Hash join Vs Nested loop join

Advertisements

 

This month’s TSQL Tuesday party is being hosted by Stuart R Ainsworth (Blog| Twitter). I am very glad to write my first blog post as t-SQL Tuesday post on my newly designed website.

SQL server has three types of internal joins. I know most of folks never heard this join type because it’s not logical join and it’s not often used in their codes.

Then, when it will be used?

Well the answer is “it depends”.

This means it depends upon the record sets and indexes. The query optimizer will be smart and always try to pick up the most optimal physical joins. As we know SQL optimizer creates a plan cost based and depends upon the query cost it will choose the best join.

How the query optimizer will choose the join type internally?

Well, there is some algorithm has written internally for the query optimizer to choose the join type.

Let’s go for some practical examples and will finally summarize it.
First I will give some basic idea how the join will work and when/How the optimizer will decide to use anyone of the internal join (Physical join).

• It depends upon the table size
• It depends upon the index on the join column
• It depends upon the Sorted order on the join column

Test:

The test has done by following configuration.

RAM: 4 GB
Server : SQL server 2008 (RTM)

 

create table tableA (id int identity ,name varchar(50))
declare @i int
set @i=0
while (@i<100)
begin
insert into tableA (name)
select name from master.dbo.spt_values
set @i=@i+1
end
--select COUNT(*) from dbo.tableA --250600
go
create table tableB (id int identity ,name varchar(50))
declare @i int
set @i=0
while (@i<100)
begin
insert into tableB (name)
select name from master.dbo.spt_values
set @i=@i+1
end
-- select COUNT(*) from dbo.tableB --250600
select * from dbo.tableA A join tableB B
on (a.id=b.id)

 

Test1: Without Index

Let’s create a clustered index

create unique clustered index cx_tableA on tableA (id)
create unique clustered index cx_tableB on tableB (id)

Test1: With Index

If either of the table has indexed then it goes hash join. I haven’t shown this picture here. You can drop either of the table indexes and test it.

Test2: Without Index

Let’s create a medium table

 

create table tableC (id int identity,name varchar(50))
insert into tableC (name)
select name from master.dbo.spt_values
-- select COUNT(*) from dbo.tableC --2506
create table tableD (id int identity,name varchar(50))
insert into tableD (name)
select name from master.dbo.spt_values
select * from dbo.tableC C join tableD D
on (C.id=D.id)
-- select COUNT(*) from dbo.tableD --2506

 

Test2: With Index

Let’s create a clustered index

create unique clustered index cx_tableC on tableC (id)
create unique clustered index cx_tableD on tableD (id)

If either of the table has indexed then it goes merge join. I haven’t shown this picture here. You can drop either of the table indexes and test it.

Test3: Without Index

 

create table tableE (id int identity,name varchar(50))
insert into tableE (name)
select top 10 name from master.dbo.spt_values
-- select COUNT(*) from dbo.tableE --10
create table tableF (id int identity,name varchar(50))
insert into tableF (name)
select top 10 name from master.dbo.spt_values
-- select COUNT(*) from dbo.tableF --10

 

Let’s create a clustered index

 

create unique clustered index cx_tableE on tableE (id)
create unique clustered index cx_tableF on tableF (id)

Test3: With Index

If either of the table has indexed then it goes Nested loop join. I haven’t shown this picture here. You can drop either of the table indexes and test it.

You can also join tables vice versa like big table Vs Medium table Vs small table

select * from dbo.tableA A join tableC C
on (a.id=C.id)

select * from dbo.tableA A join tableE E
on (a.id=E.id)

select * from dbo.tableC C join tableE E
on (C.id=E.id)

 

 

In this case if all the table has indexed then it goes Nested loop join. If they don’t then hash join. If either of the table has indexed then it goes Nested loop join. I haven’t shown this picture here.

Still you can force optimizer to use any one of the internal joins, but it’s not good practice. The query optimizer is smart it will dynamically choose the best one.

Here just I used the merge hint so the optimizer goes to merge join instead of a hash join (Test1 without an index)

 

select * from dbo.tableA A join tableB B
on (A.id=B.id)option (merge join)

select * from dbo.tableA A inner merge join tableB B
on (A.id=B.id)

 

 

 

Table 1: Test uses a unique clustered index

From the table diagram:

  • If both the tables have NO index then the query optimizer will choose “Hash joins” internally.
  • If both the tables have indexes then the query optimizer will choose “Merge (For big tables) /Nested loop (For small tables)” internally.
  • If either of the tables have indexes then the query optimizer will choose “Merge (For medium tables) /Hash (For big tables) /Nested loop (For small & big Vs small tables)” internally.

Table 1: Test using clustered index

(create clustered index cx_tableA on tableA (id))
Table size With index (Both) Without Index(Both) Either of table has index
Big (Both) HASH HASH HASH
Medium (Both) HASH HASH HASH
Small (Both) NESTED LOOP NESTED LOOP HASH
Big Vs Small(medium) HASH HASH HASH

 

 

From the table diagram:

This test has done using without a unique clustered index. See if the index created without unique keyword then there is no guarantee SQL will not know its unique (data) so it will create 4-byte integer GUID unique-identifier by default.

Look the diagram there is no MERGE join if the clustered index created without unique.

Thanks @Dave for your email 🙂 . Now the second diagram added.

 

Conclusion:

Merge Join

Merge join is possible for the tables have an index on the join column. The index either clustered or covering non-clustered index. It’s best join for this circumstance. Why it’s best join? Because it needs an index for both the tables. So it’s already sorted and easily match and return the data.

Hash Join

Hash joins is possible for tables with no index (or) either of the big tables has indexed. It’s best join for this circumstance. Why it’s best join? Because it’s worked great for big tables with no index and run the query parallel (more than one processor) and give the best performance. Most of folk says its heavy lifter join.

Nested loop Join

Nested loop join is possible for small tables with index (or) either of the big tables have indexed. It’s best join for this circumstance. Why it’s best join? Because it works great for small tables like, compares each row from one table to each row from the other table ‘looping’.

For more: Read Gail Shaw’s blog. It has Craig Freedman’s link too. Why I gave Gail’s link instead of directly give Craig‘s link. It’s worth reading and most of my readers are from India and they have to read all her posts.

I hope now you can understand how the query optimizer will choose the most optimal join types.

I’m currently working as a SQL server DBA in one of the top MNC. I’m passionate about SQL Server And I’m specialized in Administration and Performance tuning. I’m an active member of SQL server Central and MSDN forum. I also write articles in SQL server Central. For more Click here

36 Comments

Leave a Reply

Your email address will not be published. Required fields are marked *

+ 38 = 40