Monday 30 September 2013

Peculiar Index Selections

A friend came to me with an interesting “problem” recently.  He was looking at a query plan and based on the indexes, and the DML query he was running, the index selected appeared to be a very odd choice in that it didn’t include any of the columns used in the query.  Here’s a brief rundown of what he sent:

Given these two tables:

TABLE Import (~300000 records)
(ImportID BIGINT, OrganisationCode NVARCHAR(50))
INDEX (ImportID, OrganisationCode)

TABLE Organsation (~2000 Records)
(OrganisationCode NVARCHAR(50), OrganisationType INT, label NVARCHAR(500))
PRIMARY KEY CLUSTERED OrganisationCode
INDEX (OrganisationCode) INCLUDE (label)
INDEX (OrganisationType)

And this query:
SELECT i.importID, i.organisationCode
FROM Import i
LEFT OUTER JOIN Organisation o ON i.organisationCode = o.organisationCode
WHERE i.importID = 5052
AND o.organisationCode IS NULL

importID = 5052 limits import table to ~6000 records

You'd expect it to use the clustered index on organisation, yes? Failing that the index on organistionCode including label? Nope it does a scan on the organisationType index, no predicate and an output of organisationCode. The only conceivable reason for this that I can come up with is that it's the smallest object to scan and has the OrganisationCode contained implicitly, seems slightly mad though.


He asked if I’d seen this behaviour before.  Well, yes.  I think in truth we all have.  In my early DBA days I would have thought very little of it – I know better than the optimizer, stupid thing, I’ll add a hint to force the correct index.  Naivety, arrogance, call it what you will.  It turns out the optimizer is far more clever that it first seems…
Let’s create the tables:

IF OBJECT_ID('dbo.Import') IS NOT NULL
   
DROP TABLE dbo.Import;
GO

CREATE TABLE dbo.Import -- (~300000 records)
(
    
ImportID bigint NOT NULL
   
,OrganisationCode nvarchar(50) NOT NULL
);
GO

CREATE NONCLUSTERED INDEX IX_Import_ImportID ON dbo.Import (ImportID, OrganisationCode);
GO

IF OBJECT_ID('dbo.Organisation') IS NOT NULL
   
DROP TABLE dbo.Organisation;
GO

CREATE TABLE dbo.Organisation --(~2000 Records)
(
    
OrganisationCode nvarchar(50) NOT NULL
   
,OrganisationType int NOT NULL
   
,label nvarchar(500) NULL
   
,CONSTRAINT PK_Organisation PRIMARY KEY CLUSTERED (OrganisationCode)
);
GO

CREATE NONCLUSTERED INDEX IX_Organisation_OrganisationCode ON dbo.Organisation (OrganisationCode) INCLUDE (label);
GO

CREATE NONCLUSTERED INDEX IX_Organisation_OrganisationType ON dbo.Organisation (OrganisationType);
GO



Now let’s put some data in there. We’ll try and average each importID to around 6000 rows, and fill up the label field for each Organisation.



DECLARE @NoOfRows INT = 4000; -- we only need 2000 records but we create more to allow for duplicates when generating the data

-- cross join allows up to 2047*2047 rows = 4,190,209.
WITH tally AS (
   
SELECT TOP (@NoOfRows)
        
ROW_NUMBER() OVER(ORDER BY a.type) AS N
   
FROM master.dbo.spt_values a
   
CROSS JOIN master.dbo.spt_values b
   
WHERE a.type = 'P'
     
AND b.type = 'P'
)
, insertdata AS (
   
SELECT
        
LEFT(CONVERT(nchar(36),NEWID()),FLOOR(RAND(CONVERT(varbinary(max),NEWID()))*36)+1) AS OrganisationCode
       
,FLOOR(RAND(CONVERT(varbinary(max),NEWID()))*256) AS OrganisationType
       
,REPLICATE(N'X',500) AS Label
   
FROM tally
)
, sequenced AS (
   
SELECT
        
*
       
,ROW_NUMBER() OVER(PARTITION BY OrganisationCode ORDER BY OrganisationCode) AS Sequence
   
FROM insertdata
)
INSERT INTO dbo.Organisation
   
(OrganisationCode, OrganisationType, label)
SELECT TOP 2000
    
OrganisationCode, OrganisationType, label
FROM sequenced
WHERE Sequence = 1;
GO


DECLARE @NoOfRows INT = 300000;

-- cross join allows up to 2047*2047 rows = 4,190,209.
WITH tally AS (
   
SELECT TOP (@NoOfRows)
        
ROW_NUMBER() OVER(ORDER BY a.type) AS N
   
FROM master.dbo.spt_values a
   
CROSS JOIN master.dbo.spt_values b
   
WHERE a.type = 'P'
     
AND b.type = 'P'
)
, insertdata AS (
   
SELECT
        
N%50 AS ImportID
       
,LEFT(CONVERT(nchar(36),NEWID()),FLOOR(RAND(CONVERT(varbinary(max),NEWID()))*36)+1) AS OrganisationCode
   
FROM tally
)
INSERT INTO dbo.Import
   
(ImportID, OrganisationCode)
SELECT
    
ImportID, OrganisationCode
FROM insertdata;
GO




Finally, we assess the DML query:



SELECT
    
i.ImportID
   
,i.OrganisationCode
FROM dbo.Import AS i
LEFT OUTER JOIN dbo.Organisation AS o ON i.OrganisationCode = o.OrganisationCode
WHERE i.ImportID = 47
 
AND o.OrganisationCode IS NULL;


 


NOTE – I’m not using importID 5052 as per the original example, as it doesn’t exist in my sample data.  It’s irrelevant for the purpose of this article.

I get the following query plan:

image

Ta da!  The optimizer has chosen to scan an index where the index key does not appear in the query at all.  Why?  Well the answer is essentially what my friend suggested, in that this index is the smallest, but let’s explore.

Without tying to explaining the fundamentals of indexes (Kimberly L. Tripp does an awesome job at this: http://www.sqlskills.com/blogs/kimberly/category/indexes/), there is a b-tree of multiple levels, ultimately leading to the leaf pages which contain either a pointer to the clustered index (for a nonclustered index) or the data pages themselves (for a clustered index).  I ran the following query to find the index details:


select
    
object_name(ps.object_id) table_name
   
,i.name index_name
   
,ps.index_type_desc
   
,ps.index_depth
   
,ps.index_level
   
,ps.page_count
   
,ps.record_count
   
--,ps.*
from sys.dm_db_index_physical_stats(db_id(), null, null, null, 'DETAILED') ps
inner join sys.indexes i
on  ps.object_id = i.object_id
and ps.index_id = i.index_id
where ps.object_id = object_id('dbo.Organisation')
order by ps.object_id, ps.index_id, ps.index_level





And here are the results:

image

Immediately you can see that the chosen index, IX_Organisation_OrganisationType, has only 2 levels where the others have 3.  There are 17 pages in total (16 leaf pages and 1 for the b-tree), where the others have 290 each.

Yes, the index is smaller than the other two, but that doesn’t explain why the optimizer chose it when ‘OrganisationType’ doesn’t appear in the query!

Well actually it does.  Earlier I mentioned that the leaf pages of a nonclustered index contain pointers to the clustered index, i.e. the clustering key.  In our case the clustering key is OrganisationCode.  If you look back to our DML query you’ll see that the only field we need from the dbo.Organisation table is OrganisationCode.  The optimizer is clever enough to know that all 3 indexes are ‘covering’ indexes for this query, and it selects the smallest for performing the scan. 

What can we take away from this?  Sure, there’s more at play in terms of the cost-based estimation and join selection (for example, recreate the tables but only put 20 characters in each label field, then look at the query plan), but what we do know is to never underestimate the optimizer.  If an index choice seems odd, find out why it was chosen -  don’t just assume you know better.

No comments:

Post a Comment