Introduction

I bought a book from Crossword; he packed the book and added two bookmarks into my pack. A thought came to my mind. Why do we need this bookmark? I can easily memorize the page number and next time resume from the same page when I resume reading, or read them all over to reach to the point where I stopped reading. But not all have a blessed memory; moreover, there are better things to remember, my grand pa would rather bookmark and rely on it to help him resume reading. It’s a kind of simple index, isn’t it?

This article focuses on how MS SQL Server uses indexes to read and write data. Logically, data is stored in record sets in a table. We have fields identifying the type of data contained in each of the record sets. Tables are a collection of record sets which are either in the form of unorganized heaps or organized clustered index. By default, tables are stored in the form of heaps where the next inserted record is simply added in the next available space on the table page. Data is arranged by SQL Server in the form of extents and pages. Each extent is of size 64 KB having 8 pages of 8KB sizes. An extent may have data of multiple or same table, but each page holds data from a single table only.

So resuming with the discussion, each inserted record by default is added to the next available row into the data page. And there is no attempt to keep the data organized or sorted by default. This option seems excellent for adding data to the table, but does not provide an optimum solution when there is an attempt to retrieve the data. Suppose in a library the books are organized simply in the order they are received. There is no sort upon author, genre, or title. There would be no problem in adding or storing the books into our library. But how about getting a book for issue? Bizarre, isn’t it? This would lead to a full scan of all the books to get the required book. A tough time is guaranteed.

This is exactly how SQL Server works too. Here’s when the index chips in.

Note: All code has been tested on MS SQL Server 2008 R2.

Table Indexes

A SQL Server table by default stores data as heaps. A heap is a table that does not have any clustered index defined on it. A table stored as a heap has no enforced physical order, but a clustered index does. Data is inserted into the heap table as described in the library example.

Heaps work very well for storing data, and are very efficient in handling new records, but they are not so great when it comes to finding specific data in a table. This is where indexes come in. SQL Server supports two basic types of indexes: clustered and non-clustered. It also supports XML indexes, which is not discussed in this article; XML indexes are quite different from the regular relational indexes that will be used to locate the majority of data in database tables.

The key difference between clustered and non-clustered indexes is the leaf level of the index. In non-clustered indexes, the leaf level contains pointers to the data. In a clustered index, the leaf level of the index is the actual data.

Clustered Index

As I mentioned before, a table with a clustered index is not stored as a heap. Heaps and clustered indexes are thus mutually exclusive. A clustered index is a collection of organized table rows.

The telephone directory is a perfect example of a clustered index. All the rows of the telephone directory are clustered on the combination of last name and first name. When scanning through the pages looking for a phone number, you are scanning both the index and the data. When the indexed value is found, so is the rest of the pertinent data; i.e., when you look for Amit Singh’s phone number, Amit Singh (name) acts as the clustered index, and when you find the name, you find the data, i.e., his phone number.

This is also true of SQL Server clustered indexes. Clustered indexes can be created to sort data by a particular attribute, or column of the row. Going back to the library example, libraries organize books in a clustered index based on genre and/or topic, and then break that organization down further by author. When clustered indexes are created on columns that have duplicate values, SQL Server generates an internal number to uniquely identify duplicate clustered index keys. The non-leaf level of the clustered index when using the phone book analogy can be thought of as the names at the top of the page. The leaf level of a clustered index is the actual data row, not just a pointer to the data.

We will follow this up closely with an example:

USE TestDB
GO

CREATE TABLE Sales(
 ID INT IDENTITY(1,1)
,ProductCode VARCHAR(20)
,Price FLOAT(53)
,DateTransaction DATETIME);

I have created a table Sales, and then created a Stored Procedure to insert 2,00,000 records into the Sales table. This sizable chunk of data will help us to notice the differences very clearly.

CREATE PROCEDURE InsertIntoSales
AS 
SET NOCOUNT ON
BEGIN
DECLARE @PC VARCHAR(20)='A12CB'
DECLARE @Price INT = 50
DECLARE @COUNT INT = 0
      WHILE @COUNT<200000
      BEGIN
      SET @PC=@PC+CAST(@COUNT AS VARCHAR(20))
      SET @Price=@Price+@COUNT
      INSERT INTO Sales VALUES (@PC,@Price,GETDATE())
      SET @PC='A12CB'
      SET @Price=50
      SET @COUNT+=1
      END
END

EXEC InsertIntoSales

Now we have created the table and inserted 2,00,000 records into it, but there is no index defined on any column.

Pressing CTRL+M will “Include the Actual Execution Plan” in the results. Let's run the below query.

SET STATISTICS IO ON
SELECT * FROM Sales WHERE ID=189923

ID          ProductCode          Price                  DateTransaction
----------- -------------------- ---------------------- -----------------------
189923      A12CB189922          189972                 2011-03-21 12:07:48.310

(1 row(s) affected)

Table 'Sales'. Scan count 1, logical reads 1129, physical reads 0, 
  read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


(1 row(s) affected)

The Execution plan tab on the results shows that the record has been retrieved on a table scan and the logical reads are 1129.

Now let’s build a clustered index on the ID column of the Sales table.

CREATE CLUSTERED INDEX CL_ID ON SALES(ID);

Let us press CTRL+M and rerun the same query:

SET STATISTICS IO ON
SELECT * FROM Sales WHERE ID=189923

ID          ProductCode          Price                  DateTransaction
----------- -------------------- ---------------------- -----------------------
189923      A12CB189922          189972                 2011-03-21 12:07:48.310

(1 row(s) affected)

Table 'Sales'. Scan count 1, logical reads 3, physical reads 0, 
  read-ahead reads 0, lob logical reads 0, 
  lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

The Execution plan tab on the results shows that the record has been retrieved on Index seek and the logical reads are 3. After the Clustered index creation, SQL Server has been able to reduce the logical reads dramatically and the query has been optimized. Clearly, the index knows where to look for the record.

Non-Clustered Index

Non-clustered indexes are more like the indexes in the back of a book. When the indexed value is found, so is a pointer that tells the location of the actual data. Non-clustered indexes can be built on a heap or a clustered index. The leaf level of a non-clustered index contains the indexed column (or columns) and a pointer to the actual data to which the indexed value refers. When the non-clustered index is built on a heap, the pointer is a physical location of the data. When it is built on a clustered index, the pointer is the clustered index key value.

Getting back to sales, if there is a query like:

SET STATISTICS IO ON
SELECT * FROM Sales WHERE ProductCode like 'A12CB908%' order by Price
Press Control+M and execute the query
There are arround 111 records falling retrived 
-----------------------------------------------------
(111 row(s) affected)

Table 'Sales'. Scan count 1, logical reads 1130, physical reads 0, 
  read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

We find that the query first used the clustered index to get 111 records and then used a sort operation; the logical reads are as high as 1130. There is also a missing index suggestion.

Let’s consider SQL Server’s advice and create a non-clustered index on the Price column.

CREATE NONCLUSTERED INDEX NONCI_PC ON SALES(ProductCode);
Press Control+M and rerun the same query.
SET STATISTICS IO ON
SELECT * FROM Sales WHERE ProductCode like 'A12CB908%' order by Price
-------------------------------
(111 row(s) affected)

Table 'Sales'. Scan count 1, logical reads 351, physical reads 0, 
  read-ahead reads 7, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


(1 row(s) affected)

The logical reads have been minimized and the revised execution plan is as in the figure. This was the example where a non-clustered index was built on a table having a clustered index.

Non-Clustered Indexes on Heaps

When a non-clustered index is built on a table organized as a heap, the indexed column or columns are sorted along with a pointer to the physical location of the data.

For example, let’s go back to the library example. If the physical location of every book that came into this unorganized library were recorded in an index as it was placed on the shelf, that index could be referenced to find the location of a book instead of scanning all the shelves. The downside of this technique is that similar records (or in the library example, similar books) could be located in completely different places. For example, searching for books on SQL Server could return several books; each one located in opposite ends of the library. Retrieving the books may take more effort than would be required if all the SQL Server books were clustered together.

Whether to create a clustered index or leave the records in a heap is a design decision that is typically driven by how the data is accessed. When data from a table is primarily accessed by a predictable attribute or column, then it may be useful to cluster the rows of the table on that specific column.

However, if the column is based on a large data type, creating a clustered index on it will be costly as far as storage and index maintenance.

In a simple one-column index built on a heap table, the index itself is a great deal like a two-column table. The first column records the indexed value and the second column records the physical location of the row in which the indexed value can be found. The physical location is essentially an identifier that specifies the Extent ID, Page ID, and Row ID of the indexed value on the page.

With respect to the sales example, let’s delete the clustered index CL_ID created on the ID column and re-evaluate.

DROP INDEX Sales.CL_ID;

SET STATISTICS IO ON
SELECT * FROM Sales WHERE ProductCode like 'A12CB908%' order by Price
------------------------------------
(111 row(s) affected)

Table 'Sales'. Scan count 1, logical reads 114, physical reads 0, 
   read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

The logical reads have been further optimized and the execution plan also has been revised.

In this case, the query uses the non-clustered index to be run on the heap.

Non-Clustered Indexes on Clustered Indexes

When a non-clustered index is built on a clustered index, the pointer value in the index is a representation of the clustered index key value. For example, in the telephone directory example, the white pages of the phone book are just like a clustered index in SQL Server. I live in a small town and my phone book contains an interesting additional index just after the white pages. I call them the “slightly off-white pages.” These off-white pages contain every published phone number in town listed in sorted order, along with the last name and first name of the phone number’s holder. This is a perfect example of a non-clustered index built on a clustered index. The phone number can be used to discover the last name–first name combination, and then the last name–first name combination can be used to find the address, if it is listed.

We have seen the example of non-clustered index on a clustered index with respect to the sales table.

As far as the query is concerned, we have been able to optimize the table by having a non-clustered index on the heap; it’s the best and the most optimized results with least logical reads.

Summary

Building indexes are totally based on the criteria of querying. There is no hard and fast rule on the number of non-clustered indexes that can be created on a table. The columns on which DMLs are executed frequently qualify for indexing. SQL Server allows at most one clustered index in any version. As far as non-clustered indexes are concerned, SQL Server 2005 allows 249 of them to be created while 2008 allows 999 non-clustered indexes.

To add to that, it is not always the case that the query can be optimized further by simply creating an index on a column. And as there are no free services, indexes charge considerable fees. Every time there is a DML (insert/update/delete) fired on an indexed table, SQL Server updates the indexes to be able to identify the record. Hence if there are more indexes, it’s liable that the DMLs will take a longer time to execute. Hence simply creating a large number of indexes doesn’t serve the purpose. That is why it’s advised to drop all indexes before a BCP or BULK INSERTs, and rebuild them upon completion of the activity.

Also, indexes lead to defragmentation of the tables and they charge the costly time of DBAs for their maintenance. If used judiciously, they enhance the performance of queries many folds. The next article discusses maintenance and defragmentation activities for indexes.

推荐.NET配套的通用数据层ORM框架:CYQ.Data 通用数据层框架
新浪微博粉丝精灵,刷粉丝、刷评论、刷转发、企业商家微博营销必备工具"